Is Google Misreading Robots.txt?

In the past few days, we noticed a precipitous drop in our Google traffic on one well-indexed site.

It had been puzzling us for a few days, until we noticed in Google Webmaster Tools that our robots.txt file was somehow instructing Google to disallow everything, from the root directory down:


robots.txt URL       http://www.example.com/robots.txt
Last downloaded     December 7, 2006 5:28:15 AM PST
Status     200 (Success)   [?]
Home page access     This file is blocking access to http://www.example.com/
(Note: the above is a copy/paste from the Google interface; the URLs have been changed.)

WHAT? There have been no changes to our robots.txt file in months. Google has been coming by daily, grabbing thousands of pages. The site has been well indexed, showing homepage and sub-page results up until earlier this week. Now our robots.txt file is blocking access to the home page?

So what's going on?

First, it's important to understand one robots.txt rule: if you put more than one user-agent line in a row, they're cumulative, like this:

user-agent: googlebot
user-agent: msnbot
allow: /

That tells both googlebot and msnbot that everything is allowed. That case is clearly outlined in the Robots RFC, the "A Method for Web Robots Control" document.
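If you want to check that grouping rule yourself, Python's standard-library parser, urllib.robotparser, treats consecutive user-agent lines as one record too. A quick sketch (this only shows how that parser reads the file, not how Google does):

import urllib.robotparser

rules = """\
user-agent: googlebot
user-agent: msnbot
allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Both agents fall under the same record, so both are allowed.
print(parser.can_fetch("googlebot", "http://www.example.com/page.html"))  # True
print(parser.can_fetch("msnbot", "http://www.example.com/page.html"))     # True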

So what happened to us? Here's a minimal version of what was wrong with our robots.txt file.

user-agent: *
crawl-delay: 10

user-agent: badbot
disallow: /

What would you expect this to do? Apply the crawl-delay directive (which MSN and Yahoo follow) to all user agents, and then block access for 'badbot'.
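For what it's worth, Python's urllib.robotparser reads the file exactly that way (the crawl_delay() call needs a reasonably recent Python 3). Again, this is only that parser's interpretation, not Google's:

import urllib.robotparser

rules = """\
user-agent: *
crawl-delay: 10

user-agent: badbot
disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.crawl_delay("googlebot"))                                    # 10
print(parser.can_fetch("googlebot", "http://www.example.com/page.html"))  # True
print(parser.can_fetch("badbot", "http://www.example.com/page.html"))     # False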

Based on my testing in Google Webmaster Tools, here's how Googlebot parses that same robots.txt file. It ignores the crawl-delay directive (the webmaster tools tell you this outright). That's fine, but it apparently treats that line as if it weren't there at all, and then compresses the blank lines around it.

To Google, the above very benign robots.txt file is actually the same as this ultra-restrictive robots.txt:

user-agent: *
user-agent: badbot
disallow: /

All robots, get lost. Disaster.
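To make the difference concrete, here's a toy Python sketch (emphatically not Google's actual code) that contrasts a spec-style parser, where a blank line always ends a record, with the merging behavior described above, where the unrecognized crawl-delay line is dropped and the now-empty record runs into the next one:

ROBOTS_TXT = """\
user-agent: *
crawl-delay: 10

user-agent: badbot
disallow: /
"""

KNOWN_RULES = {"allow", "disallow"}

def parse(text, blank_always_ends_record):
    """Group lines into records of (set of agents, list of (rule, path))."""
    records, agents, rules = [], set(), []

    def close():
        nonlocal agents, rules
        if agents:
            records.append((agents, rules))
        agents, rules = set(), []

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Spec-style: a blank line always ends the record.
            # Merging: it only ends a record that already holds rules.
            if blank_always_ends_record or rules:
                close()
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:                  # a fresh record is starting
                close()
            agents.add(value.lower())
        elif field in KNOWN_RULES:
            rules.append((field, value))
        # any other directive (crawl-delay, sitemap, ...) is simply dropped
    close()
    return records

def can_fetch(records, agent, path="/"):
    """Use the first record naming the agent, else the '*' record."""
    match = (next((r for r in records if agent in r[0]), None)
             or next((r for r in records if "*" in r[0]), None))
    if match is None:
        return True
    for rule, rule_path in match[1]:
        if path.startswith(rule_path):
            return rule == "allow"
    return True

for spec_style in (True, False):
    records = parse(ROBOTS_TXT, blank_always_ends_record=spec_style)
    label = "spec-style" if spec_style else "merging"
    print(label, "parser: googlebot may fetch / :", can_fetch(records, "googlebot"))

# spec-style parser: googlebot may fetch / : True
# merging parser: googlebot may fetch / : False

The spec-style mode leaves googlebot free to crawl; the merging mode blocks it, which is exactly the disaster above.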

Again, we've had this robots.txt file in place for quite a while now. Our index problems started this week. Did Google introduce a change to their robots.txt parsing?

Our fault? I don't think so. If Google is following the RFC (they link to pages that reference it when suggesting reading on robots.txt), its grammar (given in Backus-Naur Form, or BNF) allows for "extensions" in addition to allow and disallow. So even if Google ignores a directive in a user-agent block, they probably shouldn't treat it as a blank line and compress that block into the next one! If they're not following that RFC, they need to document what their own parsing rules are. Google has had problems blocking sites incorrectly before.

Anyway, we've fixed the robots.txt file. Google Webmaster Tools likes it again. I hope that was the problem and we get indexed again.

Do yourself a favor: check your robots.txt file, and run it through the Google Webmaster Tools robots.txt analysis to see if you've got any surprises lurking.
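If you'd rather script that check, here's a rough one using Python's urllib.robotparser again (it reflects that parser's reading of your file, not necessarily Google's, and example.com is just a placeholder for your own domain):

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()

for agent in ("googlebot", "msnbot", "*"):
    print(agent, "can fetch the home page:",
          parser.can_fetch(agent, "http://www.example.com/"))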

Posted on Friday, December 8, 2006 at 10:43:03 AM in Search Engines

By Scott Jangro

Scott Jangro is a co-founder of Shareist. He's an entrepreneur, an old-school affiliate marketer, a web developer, a dad, a cyclist, and a golfer.
