I’m Sorry, Googlebot.

by Scott Jangro on 28 September 2006

I’ve been cursing you, Googlebot.

For years, I’ve damned you for your bad behavior. You have been brazenly grabbing pages that I’ve excluded in my robots.txt file. In the process, you have consumed thousands upon thousands of local tracking scripts, login pages, add-to-profile quick links, search links … all stuff I don’t want crawled or indexed. And quite frankly, neither do you. I’ve tried to save you from yourself. Really, I have tried.

Yesterday I discovered that, all this time, it’s me.

As it turns out, this robots.txt exclusion does absolutely nothing:


User-agent: *
Disallow: foo.html

Why? A rule is matched against the beginning of the URL path, not anywhere within it. I think I knew this, but my oversight was that every URL path starts with a slash (/), so a rule without one never matches anything.

The correct syntax to block access to foo.html is this:


User-agent: *
Disallow: /foo.html

The leading slash matters. Who knew?
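
For the curious, here is a minimal Python sketch of that prefix matching as I understand it. The function name is_blocked and the example.com URL are made up for illustration, and the sketch deliberately ignores user-agent groups, wildcards and everything else a real crawler does. It is a model of the behavior, not Googlebot's actual code.

```python
from urllib.parse import urlparse


def is_blocked(disallow_rule: str, url: str) -> bool:
    """Simplified model: a rule blocks a URL if the URL's path starts with it."""
    # Hypothetical helper for illustration only.
    path = urlparse(url).path or "/"  # URL paths always begin with "/"
    if not disallow_rule:
        return False  # an empty Disallow value blocks nothing
    return path.startswith(disallow_rule)


url = "http://example.com/foo.html"
print(is_blocked("foo.html", url))   # False: "/foo.html" does not start with "foo.html"
print(is_blocked("/foo.html", url))  # True: the rule is a prefix of the path
```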

I figured this out while experimenting with Google’s Webmaster Tools. Among other things, it includes a nice robots.txt testing tool where you can try out changes to your robots.txt. Given one or more URLs, it’ll tell you whether Googlebot will crawl the pages or not.

Here’s what I got when I tested the first example.
[Screenshot: googlebot-allowed.jpg]
Damn you Googlebot!

Here’s what I got when I added the slash, like in the second example:
[Screenshot: googlebot-allowed2.jpg]

Oops.

I was lulled into this paranoid stupor because I figured that Google just wanted to see what I had in these pages. Matt Cutts wrote recently about how to herd the Googlebot, and his scenarios seemed to validate my twisted assumptions.

To block a specific page, he suggests using robots exclusion meta tags. But that requires Googlebot to actually fetch the page in order to see the tag, which defeats the purpose. He made no mention of using robots.txt to block even the initial access to a specific file. What?

Clearly he was not providing instruction on ALL ways to manipulate the Googlebot. So, as we all know in our hearts, you can block a specific file with robots.txt.

Just don’t forget the leading slash.
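
If you want to check your own file for the same mistake, something like the following Python sketch might help. The function name find_slashless_rules is hypothetical, and the check is only a rough heuristic: it flags every non-empty Disallow value that doesn't begin with a slash, which may be too strict for crawlers that accept other pattern styles.

```python
def find_slashless_rules(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line number, line) for Disallow rules lacking a leading slash."""
    suspects = []
    for lineno, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue  # not a Disallow line
        value = value.strip()
        # An empty value means "allow everything", so it is left alone.
        if value and not value.startswith("/"):
            suspects.append((lineno, raw.strip()))
    return suspects


sample = "User-agent: *\nDisallow: foo.html\nDisallow: /bar.html\n"
for lineno, rule in find_slashless_rules(sample):
    print(f"line {lineno}: {rule}")  # line 2: Disallow: foo.html
```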

I hang my head in shame.
