How does a company like Google index the internet? It uses small programs called "bots" that sniff their way around the web by following links from page to page. Like a young, adventurous pet, a bot can occasionally dig a little too deep and expose more web pages than the site owner intended. For instance, it might expose a special admin web page or a temporary directory not intended for public viewing. In the mid-1990s, Martijn Koster invented the concept of a robots.txt file. This simple text file is placed at the root of a website and provides special instructions for bots. An example is provided below:
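As a minimal sketch, the snippet below defines a hypothetical robots.txt (the `/admin/` and `/tmp/` paths, the `ExampleBot` name, and the example.com URLs are assumptions made for illustration) and uses Python's standard `urllib.robotparser` module to show how a well-behaved bot might interpret it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant bot asks before fetching each URL.
print(parser.can_fetch("ExampleBot", "https://www.example.com/admin/login.html"))  # False
print(parser.can_fetch("ExampleBot", "https://www.example.com/index.html"))        # True
```

The `User-agent: *` line applies the rules to every bot, and each `Disallow` line names a path prefix that bots are asked not to crawl. Compliance is voluntary: the file is a request, not an access control.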
The concept behind this file is very simple: provide a list of folders and files that bots should not crawl. (For more information about how to build a robots file, visit robotstxt.org.) Over time, the major search engines were updated to abide by these rules. This was an excellent, simple way to keep unwanted results out of search engines, and it did not require any programming or complex setup to implement.
Unfortunately, over the past year Google decided to partially ignore the robots.txt file in an attempt to increase the effectiveness of its results. The following is a direct quote from Google's webmaster documentation about how it interprets a robots.txt file:
"Blocking Google from crawling a page is likely to decrease that page's ranking or cause it to drop out altogether over time. It may also reduce the amount of detail provided to users in the text below the search result. ... However, robots.txt Disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant."
This minor footnote means web pages previously excluded from Google's search results may start showing up again. In that case, the search result will display the title, the URL, and the message "A description for this result is not available because of the site's robots.txt - learn more".
If Google isn't completely respecting the robots.txt file anymore, what other options are available? Acceptable alternatives include a robots meta tag or an X-Robots-Tag HTTP header. Both options accept the same directives and take the basic concept of a robots.txt file to the next level. The following is an example of a robots meta tag:
<meta name="robots" content="noindex, nofollow" />
A complete list of options is available on Google's webmaster site. It includes specialized directives such as "noimageindex", which prevents the indexing of images on a web page, and "notranslate", which tells Google not to offer a translation of the page in its search results. Although this provides more flexibility for web administrators, it requires more time and effort: the tags or headers need to be added by a programmer or through a custom configuration on the web server. Once they are in place and Google has recrawled the page, it will drop the offending web page from its search results completely. Note that Google can only see these directives if it is allowed to crawl the page, so the page must not also be blocked in robots.txt.
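For the HTTP header route, here is one possible sketch in Python. It assumes the site is served by a WSGI application; `add_robots_header` and `hello_app` are hypothetical names invented for this illustration, and only the `X-Robots-Tag` header itself comes from Google's documentation. The wrapper adds the header to every response, so in practice you would apply it only to the pages you want kept out of the index.

```python
def add_robots_header(app, directives="noindex, nofollow"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", directives))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def hello_app(environ, start_response):
    # Hypothetical application standing in for a real site.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"private page\n"]


application = add_robots_header(hello_app)

# Call the wrapped app directly and capture what it would send to a client.
captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(headers)


body = b"".join(application({}, fake_start_response))
print(captured["headers"]["X-Robots-Tag"])  # noindex, nofollow
```

To try it locally, `application` could be passed to `wsgiref.simple_server.make_server` from the standard library. Many web servers can also add the same header through their own configuration, which avoids touching application code at all.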