Wednesday, July 27, 2011

The robots text file, or How to get your site properly spidered, crawled, indexed by bots


Maybe you've noticed references to the robots.txt file in your site logs, or seen errors in your logs because the file is missing, or read some of the articles on the death of robots.txt debating whether you should bother with it at all. Or maybe you've never heard of a robots.txt file, but all that talk of spiders, robots and crawlers has you intrigued. This article will hopefully make some sense of all of the above.

There are many people out there who vehemently declare the robots.txt file useless: obsolete, a thing of the past, plain dead. I disagree. The robots.txt file is probably not among the top ten ways to promote your affiliate site and get rich within 24 hours or less, but it still plays an important role in the long run.

First, the robots.txt file is still a very important factor in promoting and maintaining a website, and I'll show you why. Second, the robots.txt file gives you a simple means of protecting your privacy and/or intellectual property. I'll show you how.

Let's try to make sense of some of the jargon.

What is the robots.txt file?

The robots.txt file is a plain text file (or an ASCII file, as some like to say) containing a very simple set of instructions for web robots. It tells a robot which pages we want it to scan (crawl, spider, index: all these terms refer to the same thing in this context) and which pages we want kept out of the search engines.

What is a web robot?

A robot is a computer program that automatically reads web pages and follows every link it finds. The purpose of a robot is to gather information. Some of the most famous robots mentioned in this article work for the search engines, indexing all the information available on the web.

The first robot was developed at MIT and launched in 1993. It was called the World Wide Web Wanderer, and its original purpose was purely scientific: its mission was to measure the growth of the web. The index built from the experiment's results proved to be a wonderful tool and effectively became the first search engine. Many of the things we now consider indispensable online tools were born as side effects of scientific experiments like this one.

What is a search engine?

Very generally speaking, a search engine is a program that searches through a database. In the popular sense, as used on the web, a "search engine" is a search form through which a user can query a collection of web pages gathered by a robot.

What are spiders and crawlers?

Spiders and crawlers are simply robots; the names just sound cooler in the press and in underground geek circles.

What are the most popular robots? Is there a list?

Some of the most famous robots are Google's Googlebot, MSN's MSNBot, Ask Jeeves' Teoma, and Yahoo!'s Slurp (fun name). One of the most popular places to look up a list of active robots is maintained at http://www.robots.org.

So where does the robots.txt file fit into all this? Why do I need one?

A good reason to use a robots.txt file is the simple fact that many search engines, including Google, publicly suggest the use of this tool. Why is it such a big deal that Google teaches people about robots.txt? Well, because nowadays search engines are not a playground for scientists and computer geeks anymore, but huge commercial enterprises. Google is among the most secretive search engines out there. Very little is known to the public about how it works: how it indexes, how it crawls, how it creates its rankings, and so on. In fact, if you do careful research in specialized forums, or wherever these issues are discussed, nobody really agrees on whether Google puts more weight on this or that element when creating its rankings. And when people can't agree on something as measurable as a ranking algorithm, it means two things: that Google constantly changes its methods, and that it doesn't make them very clear or very public. There is only one thing I believe is clear. If Google suggests you use a robots.txt file ("Make use of the robots.txt file on your web server" - Google Technical Guidelines), then do it. It might help your ranking, and it definitely can't hurt.

There are other reasons to use robots.txt as well. If you use your error logs to tweak and maintain an error-free website, you'll notice that most errors refer to someone or something not finding the robots.txt file. All you have to do is create a basic blank page (using Notepad on Windows, or any simple text editor on Linux or Mac), name it robots.txt, and upload it to the root of your server (that's where your home page lives).
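If you prefer to script the step above, a minimal sketch in Python is shown below. It writes an "allow everything" robots.txt locally; the upload to your server's document root is host-specific (FTP, scp, a hosting control panel) and is not shown here.

```python
# A minimal sketch: write an "allow everything" robots.txt locally.
# Uploading it to the server's document root is host-specific and not shown.
ROBOTS_TXT = "User-agent: *\nDisallow:\n"

with open("robots.txt", "w") as f:
    f.write(ROBOTS_TXT)
```

With this file in place at the root of your site, requests for /robots.txt stop showing up as errors in your logs.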

On a different note, nowadays all the search engine robots check for the robots.txt file as soon as they arrive at your site. There are unconfirmed rumors that some robots may even get "annoyed" and leave if they don't find one. Not sure how true that is, but hey, why not stay on the safe side?

So even if you don't intend to block anything, or just don't want to bother with robots.txt at all, having a blank one is still a good idea, since it effectively acts as an invitation into your site.

But I want my site indexed! Why stop the bots?

Some robots are well designed, professionally operated, cause no harm, and provide a valuable service to humankind (don't we all "love Google"). Other robots are written by amateurs (remember, a robot is just a program), and badly written robots can cause network congestion, security problems, and so on. The bottom line here is that robots are devised and operated by humans and are thus prone to human error. Consequently, robots are not inherently bad, nor inherently brilliant, and they deserve careful attention. This is another case where the robots.txt file comes in handy: robot control.

Now, I'm sure your main goal in life as a webmaster or site owner is to get onto the first page of Google. Why, then, would you ever want to block robots?

Here are some scenarios:

1. Unfinished Site

You are still building your site, or parts of it, and don't want the unfinished pages to appear in search engines. It is said that some search engines even penalize sites whose pages have been "under construction" for a long time.

2. Security

Always block your cgi-bin directory from robots. In most cases, cgi-bin contains your applications and the configuration files for those applications (which may actually hold sensitive information), and so on. Even if you don't currently use any CGI scripts or programs, block it anyway; better safe than sorry.

3. Privacy

You may have some directories on your site where you keep things you don't want the entire galaxy to see, such as pictures of a friend who forgot to get dressed, etc.

4. Doorway Pages

Apart from their illegal use to spam every search engine on the Internet into better rankings, doorway pages actually have a perfectly moral use. These are similar pages, each optimized for a particular search engine. In this case you must make sure that each robot has access only to the page optimized for it. This is extremely important in order to avoid being penalized for spamming a search engine with a series of very similar pages.

5. Bad Bot, Bad Bot, Whatcha Gonna Do...

You may want to exclude robots whose known purpose is harvesting e-mail addresses, or any other robots whose activities don't agree with your view of the world.

6. Your Site Is Getting Overwhelmed

In rare cases, a robot goes through your site too fast, eating up your bandwidth and slowing down your server. This is called "rapid fire", and you'll recognize it when you read your access log files. A medium-performance server shouldn't really slow down. You may have problems, however, if you have a low-performance setup, such as running the server on your home PC or Mac, if you run poor server software, or if you serve heavy scripts or huge documents. In those cases you'll see dropped connections, severe slowdowns, and in extreme cases a complete system crash. If this happens, read your logs, try to get the IP address or the name of the robot, consult the list of active robots, and try to identify and block it.
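The log-reading step above can be sketched in a few lines of Python. The sample lines below are hypothetical and use the common Apache "combined" log format, where the user-agent string is the last quoted field; real logs vary by server, so treat this as a starting point only.

```python
from collections import Counter

# Hypothetical lines in Apache "combined" log format; real logs will differ.
SAMPLE_LOG = [
    '66.249.0.1 - - [27/Jul/2011:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.0.1 - - [27/Jul/2011:10:00:05 +0000] "GET /a.htm HTTP/1.1" 200 714 "-" "Googlebot/2.1"',
    '10.0.0.9 - - [27/Jul/2011:10:00:02 +0000] "GET /b.htm HTTP/1.1" 200 305 "-" "RapidBot/0.1"',
    '10.0.0.9 - - [27/Jul/2011:10:00:02 +0000] "GET /c.htm HTTP/1.1" 200 980 "-" "RapidBot/0.1"',
    '10.0.0.9 - - [27/Jul/2011:10:00:02 +0000] "GET /d.htm HTTP/1.1" 200 122 "-" "RapidBot/0.1"',
]

def hits_per_agent(lines):
    """Count requests per user-agent; the agent is the third quoted field."""
    counts = Counter()
    for line in lines:
        parts = line.split('"')
        if len(parts) >= 7:  # a well-formed combined-format line has 6 quotes
            counts[parts[5]] += 1
    return counts

# The busiest agents float to the top; a burst of hits from a single agent
# in a short time window is the "rapid fire" red flag described above.
print(hits_per_agent(SAMPLE_LOG).most_common())
```

Once a misbehaving agent is identified this way, it can be named in robots.txt, or blocked outright at the server level if it ignores the file.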

So what goes into the robots.txt file in the first place?

Each entry in the robots.txt file consists of just two lines: the User-agent line, which names the robot you're giving orders to (or uses the '*' wildcard to mean "all"), and the Disallow line, which tells the robot which places it must not touch. This two-line entry is repeated for each file or directory you don't want indexed, and for each robot you want to exclude. If you leave the Disallow line blank, you are disallowing nothing; in other words, you are telling the robots to index your entire site. Some examples and a couple of scenarios should make things clear:

A. Exclude a file from Google's main robot (Googlebot):

User-agent: Googlebot

Disallow: / private / privatefile.htm

B. Exclude a section of the site from all robots:

User-agent: *

Disallow: /underconstruction/

Note that the directory name is enclosed between two slashes. Although you are probably used to seeing URLs and links to folders that don't end with a slash, note that a web server always needs to see the trailing slash. Even when you link to a site without the trailing slash, once the link is clicked the web server has to perform an extra step before serving the page, adding the slash itself through what is called a redirect. Do everyone a favor and always use the trailing slash.

C. Do nothing (empty robots.txt):

User-agent: *

Disallow:

Note that by an "empty robots.txt" we do not mean a completely empty file, but one containing the two lines above.

D. Keep all robots out of your site:

User-agent: *

Disallow: /

Note that the single slash stands for "root", the main entrance to your site, so this blocks the whole site.

E. Don't let Google index any of your images (Google uses Googlebot-Image for images):

User-agent: Googlebot-Image

Disallow: /

F. Don't let Google index some of your images:

User-agent: Googlebot-Image

Disallow: / images_main /

Disallow: / images_girlfriend /

Disallow: / downloaded_pix /

Note the use of multiple Disallow lines. This is allowed, no pun intended.

G. Serve one doorway page to Google and a different one to Lycos (Lycos' robot is called T-Rex) - don't play with this unless you are 100% sure you know what you are doing:

User-agent: T-Rex

Disallow: / index1.htm

User-agent: Googlebot

Disallow: / index2.htm

H. Let only Googlebot in:

User-agent: Googlebot

Disallow:

User-agent: *

Disallow: /

Note that the commands are read sequentially. The example above reads, in plain English: let Googlebot in, then stop everyone else.
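You can test how such rules are interpreted without deploying anything, using Python's standard urllib.robotparser module. The sketch below feeds it the "Googlebot only" rules from example H; the agent name "SomeOtherBot" is just a made-up stand-in for any non-Google robot.

```python
import urllib.robotparser

# The "Googlebot only" rules from example H above, one directive per line.
RULES = [
    "User-agent: Googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(RULES)

# A well-behaved robot asks "may I fetch this URL?" before crawling it.
print(rp.can_fetch("Googlebot", "/index.htm"))     # Googlebot is allowed in
print(rp.can_fetch("SomeOtherBot", "/index.htm"))  # everyone else is kept out
```

This is a handy way to sanity-check a robots.txt draft before uploading it.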

When the file gets really big, or when you feel like writing notes to yourself or to potential viewers (remember, a robots.txt file is public; everyone can see it), you can add comments by preceding them with the # symbol. Although according to the standard you may put a comment on the same line as a command, I always start every command and every comment on a new line; this way a robot can never mistake a comment for part of a command because of some formatting quirk. Examples:

The following is correct according to the standard, but not recommended (a badly written robot might read the whole line, including the text "# We decided to stop all robots...", and fail to match the "disallow all" command):

User-agent: * Disallow: / # We decided to stop all robots, but we were foolish enough to launch into a lengthy comment on the same line, which may render our robots.txt unusable to some robots

Here is the format I recommend:

# We have decided to stop all robots, and we made sure

# that our comments will not be truncated

# in the process

User-agent: *

Disallow: /
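As a quick sanity check of the recommended format, the sketch below runs it through Python's standard urllib.robotparser module, which strips # comments while parsing, so the commented file behaves exactly like the comment-free version (the agent name "AnyBot" is just a placeholder).

```python
import urllib.robotparser

# The recommended commented format from above, one line per list item.
COMMENTED = [
    "# We have decided to stop all robots, and we made sure",
    "# that our comments will not be truncated",
    "# in the process",
    "User-agent: *",
    "Disallow: /",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(COMMENTED)

# The comments are ignored and the "disallow all" rule still applies.
print(rp.can_fetch("AnyBot", "/page.htm"))  # the site is blocked for all robots
```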

Although in theory every robot should comply with the standard introduced in 1994 and extended in 1996, each robot behaves a little differently. You should consult the documentation provided by the owners of these robots; you may be surprised to discover a wealth of useful facts and techniques. For example, from the Google site we learn that Googlebot ignores any URL containing "&id=".

Here are some sites to check:

Google: http://www.google.com/bot.html
Yahoo: http://help.yahoo.com/help/us/ysearch/slurp/
MSN: http://search.msn.com/docs/siteowner.aspx
A database of active robots is kept at [http://www.robotstxt.org/wc/active/html/contact.html]
A robots.txt validation tool - invaluable for catching typos that could completely change the way search engines see your site - is available at: [http://searchengineworld.com/cgi-bin/robotcheck.cgi]

There are also some extensions to the standard. For example, some robots allow wildcards in the Disallow line, and some even accept additional commands. My advice is to stay away from anything outside the standard, and you won't be unpleasantly surprised.

A final word of caution:

In this article I have shown how things should work in an ideal world. Earlier on I mentioned that there are good robots and bad robots. Let's pause for a moment and look at things from a malicious person's perspective. Is there anything to stop someone from writing a robot program that ignores the robots.txt file and deliberately seeks out the very pages you marked as "off limits"? Absolutely nothing. The whole standard works on the honor system and is built on the idea that everyone should work to make the Internet a better place. So robots.txt should not be relied upon for real security or privacy. Use strong passwords where needed.

Finally, remember that robots can be your best friends when it comes to indexing. Although you should build your site for human visitors, not robots, do not underestimate the power of these mindless crawlers: make sure the pages you want indexed are reachable through clear, regular hyperlinks, without roadblocks the robots can't pass (robots cannot follow Flash navigation systems, for example). To keep your site in tip-top shape, keep clean logs, protect your applications, scripts, and private data, always use a robots.txt file, and make sure you read your logs to monitor robot activity.





