Both the Yahoo! Search Blog and the Official Google Blog recently posted on indexing websites and on how to use some of the freely available tools to keep pages out of the search engines' indexes.
Google
Controlling how search engines access and index your website
I’m often asked about how Google and search engines work. One key question is: how does Google know what parts of a website the site owner wants to have show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages should appear in search results.
The key is a simple file called robots.txt that has been an industry standard for many years. It lets a site owner control how search engines access their web site. With robots.txt you can control access at multiple levels: the entire site, through individual directories, pages of a specific type, down to individual pages. Effective use of robots.txt gives you a lot of control over how your site is searched, but it's not always obvious how to achieve exactly what you want. This is the first of a series of posts on how to use robots.txt to control access to your content.
This was followed up with a second post:
The Robots Exclusion Protocol
This is the second in a short series of posts about the Robots Exclusion Protocol, the standard for controlling how web pages on your site are indexed. This post provides more details and examples of mechanisms to control access and indexing of your website by Google.
In the first post in this series, I introduced robots.txt and robots META tags, giving an overview of when to use them. In this post, I’ll look at some examples of the power of the protocol. These examples illustrate the detailed and fine-grain control online publishers have over how their websites are indexed.
Yahoo!
Keeping Ad Tracking and Dead URLs out of Yahoo! Search
We’re often asked how Yahoo! Search determines which pages get indexed and which pages are left un-crawled. First and foremost, we honor the industry-standard robots.txt file format, which gives Webmasters several layers of control over which sites, pages and specific URLs should be indexed. Lately we’ve heard from a number of Webmasters asking how best to prevent ad tracking URLs and dead URLs from getting indexed, so we thought we’d respond via this post.
Using robots.txt and robots META tags to control how search engine crawlers index your site is very important to SEO. Making sure the right pages are indexed by search engines is the first step toward visibility to users, and it should always be a priority in your optimization efforts.
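To make this concrete, here is a minimal sketch of how a set of robots.txt rules can be checked before publishing them. It uses Python's standard-library `urllib.robotparser`. The rules, the `/private/` and `/ads/track` paths, the `example.com` URLs, and the `is_allowed` helper are all assumptions made up for illustration; they are not taken from the Google or Yahoo! posts.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block every crawler from a private
# directory and from ad-tracking URLs, and leave the rest crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /ads/track
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/report.html",
        "http://example.com/ads/track?id=123",
    ):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Keep in mind that the standard-library parser does simple prefix matching on paths, so it may not reproduce every extension an individual search engine supports. For page-level control, a robots META tag such as `<meta name="robots" content="noindex">` in the page's head serves a similar purpose.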