Friday, March 25, 2005

 

the robots.txt is not out to get you

A few weeks ago I experienced the joy that comes with trying to remove a sensitive file from Google's cache. In the process I had to do some research on robots.txt and how meta tags can be used to exclude content from search engines. I also came across some good reference docs on Yahoo and Google to help webmasters build more search-friendly sites.
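For reference, the meta-tag approach mentioned above looks like this (a minimal sketch; the exact directives you need vary by search engine):

```html
<!-- Placed in the <head> of the page you want kept out of search results. -->
<!-- Unlike robots.txt, this tells the engine not to index a page it CAN fetch. -->
<meta name="robots" content="noindex, nofollow">
```

Since a crawler has to fetch the page to see this tag, it's the better tool when you want a page removed from results rather than merely left uncrawled.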

So what is robots.txt anyway?

Well, to put it simply, it is a plain-text file you place in your site's root directory which tells robots, spiders, and webcrawlers which parts of your site they should NOT crawl. But what do you put in the file?

User-agent: Googlebot
Disallow: /dottactics.htm


The above rule would exclude any crawler identifying itself as 'Googlebot' from crawling the file '/dottactics.htm'. Note that the path in a Disallow line must begin with a slash; it is matched against the start of the requested URL.

Here is another example:

User-agent: *
Disallow: /dottactics/


This uses the wildcard character (*) in the User-agent line to apply the rule to ALL crawlers, excluding them from the /dottactics/ directory.

There is no technical mechanism that prevents spiders and bots from crawling disallowed files or directories; compliance is entirely voluntary. Still, virtually all of the major search engines honor robots.txt. Want more information? This tutorial and dedicated forum should get you where you need to go.
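If you want to check how a well-behaved crawler would interpret your rules, you don't have to eyeball them. As an illustration (not part of the original post's toolchain), Python's standard-library robots.txt parser applies the wildcard example above like so:

```python
from urllib import robotparser

# The same rules as the example above, fed to the parser directly
# (normally it would fetch them from http://yoursite/robots.txt).
rules = """User-agent: *
Disallow: /dottactics/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The /dottactics/ directory is off-limits to every user agent...
print(rp.can_fetch("googlebot", "http://example.com/dottactics/secret.htm"))  # False

# ...but everything else is fair game.
print(rp.can_fetch("googlebot", "http://example.com/index.htm"))  # True
```

The hostname and file names here are placeholders; swap in your own site to test real rules.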
