A recent post on Google's official blog offers in-depth information about the Robots Exclusion Protocol. It is the second in a series of posts by Google on the subject. The first post covered controlling how search engines access and index your website, explaining the robots.txt file and how the Google search engine crawls and indexes pages.
The Google blog says, “The key is a simple file called robots.txt that has been an industry standard for many years. It lets a site owner control how search engines access their web site. With robots.txt you can control access at multiple levels — the entire site, through individual directories, pages of a specific type, down to individual pages.”
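To make those levels concrete, here is a minimal robots.txt sketch. The crawler name and paths are placeholders chosen for illustration, not examples taken from Google's post, and wildcard support (the * and $ patterns) depends on the crawler:

    # Rules for Google's crawler
    User-agent: Googlebot
    Disallow: /private/          # an entire directory
    Disallow: /*.pdf$            # pages of a specific type
    Disallow: /drafts/old.html   # an individual page

    # Block the whole site for all other crawlers
    User-agent: *
    Disallow: /

A crawler reads the group that matches its own user-agent name, so in this sketch Googlebot would follow only the first group and ignore the catch-all rule at the bottom.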
The second post explains the mechanisms Google supports for controlling how a site is accessed and indexed, starting with how Googlebot can be prevented from following links. “Usually when the Googlebot finds a page, it reads all the links on that page and then fetches those pages and indexes them. This is the basic process by which Googlebot "crawls" the web. This is useful as it allows Google to include all the pages on your site, as long as they are linked together.” It goes on to explain that adding a nofollow robots meta tag to a page tells Googlebot not to follow any of the links it finds on that page.
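As a short sketch (not copied from Google's post), the tag sits in the page's head section:

    <!-- Tell crawlers that honor the robots meta tag not to follow links on this page -->
    <meta name="robots" content="nofollow">

    <!-- Or address Googlebot specifically -->
    <meta name="googlebot" content="nofollow">

The page itself can still be indexed; the directive only stops the crawler from following the links it finds there.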
Further on, Google explains how to control caching and snippets: “Usually you want Google to display both the snippet and the cached link. However, there are some cases where you might want to disable one or both of these. For example, say you were a newspaper publisher, and you have a page whose content changes several times a day. It may take longer than a day for us to reindex a page, so users may have access to a cached copy of the page that is not the same as the one currently on your site. In this case, you probably don’t want the cached link appearing in our results.”
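These controls are again robots meta tags. A brief illustration, using the noarchive and nosnippet values that Google has documented for this purpose (the googlebot variant limits a directive to Google's crawler only):

    <!-- Keep the cached link out of search results -->
    <meta name="robots" content="noarchive">

    <!-- Keep the snippet out of search results -->
    <meta name="robots" content="nosnippet">

    <!-- Apply a directive to Googlebot only -->
    <meta name="googlebot" content="noarchive">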
If you found the above information useful, do read the complete post on the Robots Exclusion Protocol.
FYI: Google's blog will soon publish the third and final post in this series. It will address some common exclusion issues faced by many webmasters and show how to solve them using the Robots Exclusion Protocol.