Monitoring Robots.txt: Committing to Disallow

By Ryan Siddle


What is robots.txt?

A file webmasters create to instruct web crawlers which URIs they may or may not access. If no robots.txt file is found, crawlers are free to request every URI; note, too, that not all web crawlers obey robots.txt rules.


Handle with care

Adding just three letters to your robots.txt file, turning Allow into Disallow, can wipe your entire website from search engine results. The first rule set below permits crawling of every URI; the second blocks all of them.

  • User-Agent: *
  • Allow: /

  • User-Agent: *
  • Disallow: /
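The effect of the two rule sets above can be sketched with Python's standard-library robots.txt parser; the example URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

def can_fetch(rules: str, url: str) -> bool:
    """Return True if a crawler matching User-Agent '*' may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", url)

allow_all = "User-Agent: *\nAllow: /"
disallow_all = "User-Agent: *\nDisallow: /"

print(can_fetch(allow_all, "https://example.com/docs/"))     # True
print(can_fetch(disallow_all, "https://example.com/docs/"))  # False
```

One character sequence, `Dis`, flips every URI on the site from crawlable to blocked.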

A Lucky Escape

During August 2016, we worked on a Request For Proposal that required OCaml research. Typing 'ocaml.org package' in the address bar yielded no relevant results.

Instead, the top result was the homepage, accompanied by the dreaded message:

"A description for this result is not available because of this site's robots.txt – learn more."

It seemed strange that an open-source project would want to exclude its entire site and documentation from web crawlers. In OCaml.org's case, the robots.txt file had never existed before; as of 29 July 2016, it suddenly appeared, containing the Disallow: / exclusion rule.

As recently as 28 July 2016, requesting the file had returned a 404 error.

As a new OCaml user, I decided to look at the OCaml.org GitHub repository. Not knowing the reasoning behind their robots.txt generation, I posted an issue on GitHub.

The response was:

"Yeah, that file should never have made it to the production server (it was destined to copies of the website used for testing)."

The commit that caused this shows a robots.txt file included in the source; the Makefile that deploys the site is meant to remove the robots.txt file.

Unfortunately, we see scenarios like this happen regularly.

Automated testing

Automated testing may help prevent common mistakes, but it cannot capture every edge case: with endless possible URIs, there is no silver bullet.

Next best solution

We opted to monitor the robots.txt changes and report on the differences between each snapshot.
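Snapshot diffing of this kind can be sketched with the standard library: keep the previously fetched robots.txt body, compare it to the latest fetch, and report a unified diff when anything changed. The fetching and alerting layers are omitted here.

```python
import difflib

def robots_diff(previous: str, current: str) -> str:
    """Return a unified diff between two robots.txt snapshots ('' if identical)."""
    lines = difflib.unified_diff(
        previous.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile="robots.txt (previous)",
        tofile="robots.txt (current)",
    )
    return "".join(lines)

old = "User-Agent: *\nAllow: /\n"
new = "User-Agent: *\nDisallow: /\n"

# A non-empty diff is the signal to alert.
print(robots_diff(old, new))
```

In a scenario like OCaml.org's, the alert would contain the lines `-Allow: /` and `+Disallow: /`, making the cause of a sudden de-indexing obvious at a glance.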

[Image: robots.txt diff check. Source: https://merj.com]

Is checking every hour excessive? Sure it is! At the same time, unknowingly blocking an entire website, or essential parts of it, could cost millions in lost revenue.

It has become an essential part of our toolchain, running seamlessly in the background, and hopefully we will never see an unexpected alert come through. We also run a request check against the same endpoint from our deployment mechanisms.

What else can you monitor?

  • Content Delivery Networks - Google recommends that you do not block assets (images, cascading stylesheets and JavaScript).
  • Competitors - Know when your opponents are making changes to their website URI architecture.

We don't want any business to fail because of such a trivial mistake, so we are providing free robots.txt checking for up to 20 (sub)domains per user.

TL;DR

  • Every domain and subdomain should have a robots.txt file. It's a common URI that Google crawls first.
  • Always have a robots.txt file per environment and keep them separate.
  • Change Production environments with caution.
  • Automated testing cannot capture every edge case.
  • Use our free robots.txt checker and sleep easy.
