Friday, March 25, 2005

 

the robots.txt is not out to get you

A few weeks ago I experienced the joy that comes with trying to remove a sensitive file from Google's cache. In the process I had to do some research on robots.txt and how meta tags can be used to exclude content from search engines. I also came across some good reference docs on Yahoo and Google to help webmasters build more search-friendly sites.
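For reference, the meta-tag approach mentioned above looks like this (a minimal sketch; the exact directives you need vary by search engine):

```html
<!-- Placed in the <head> of the page you want kept out of search results. -->
<!-- Unlike robots.txt, this tells the engine not to index a page it CAN fetch. -->
<meta name="robots" content="noindex, nofollow">
```

Since a crawler has to fetch the page to see this tag, it's the better tool when you want a page removed from results rather than merely left uncrawled.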

So what is robots.txt anyway?

Well, to put it simply, it is a plain-text file you place in your site's root directory which tells robots, spiders, and webcrawlers which parts of your site they should NOT crawl. But what do you put in the file?

User-agent: Googlebot
Disallow: /dottactics.htm


The above rule would exclude any crawler identifying itself as 'Googlebot' from crawling the file '/dottactics.htm'. Note that the path in a Disallow line must begin with a slash; it is matched against the start of the requested URL.

Here is another example:

User-agent: *
Disallow: /dottactics/


This uses the wildcard character (*) in the User-agent line to apply the rule to ALL crawlers, excluding them from the /dottactics/ directory.

There is no technical mechanism that prevents spiders and bots from crawling disallowed files or directories; compliance is entirely voluntary. Still, virtually all of the major search engines honor robots.txt. Want more information? This tutorial and dedicated forum should get you where you need to go.
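If you want to check how a well-behaved crawler would interpret your rules, you don't have to eyeball them. As an illustration (not part of the original post's toolchain), Python's standard-library robots.txt parser applies the wildcard example above like so:

```python
from urllib import robotparser

# The same rules as the example above, fed to the parser directly
# (normally it would fetch them from http://yoursite/robots.txt).
rules = """User-agent: *
Disallow: /dottactics/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# The /dottactics/ directory is off-limits to every user agent...
print(rp.can_fetch("googlebot", "http://example.com/dottactics/secret.htm"))  # False

# ...but everything else is fair game.
print(rp.can_fetch("googlebot", "http://example.com/index.htm"))  # True
```

The hostname and file names here are placeholders; swap in your own site to test real rules.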
