
Monitoring Robots.txt: Committing to Disallow

Development News

16 Aug 2017
3 minutes
By Ryan Siddle

What is robots.txt?

A robots.txt file is created by webmasters to instruct web crawlers which URIs they may or may not access. If no robots.txt file is found, crawlers assume that all URIs may be crawled; note, however, that not all web crawlers obey robots.txt rules.

Learn more about robots.txt here.
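
To make this concrete, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a URI, using Python's standard urllib.robotparser module. The domain, user agent and paths are placeholders, not a real site.

    from urllib import robotparser

    # Point the parser at a site's robots.txt (example.com is a placeholder).
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse; a missing file is treated as "allow everything"

    # A well-behaved crawler asks before requesting each URI.
    for path in ("/", "/docs/", "/private/report.pdf"):
        verdict = "allowed" if rp.can_fetch("ExampleCrawler", path) else "disallowed"
        print(path, verdict)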

Handle with care

Adding three letters to your robots.txt file could wipe your entire website from search engine results. A permissive file looks like this:

    User-Agent: *
    Allow: /

Prefix that second line with "Dis" and every URI on the site is blocked:

    User-Agent: *
    Disallow: /
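
The effect of those three letters is easy to demonstrate with the same standard-library parser; the user agent and path below are made up for illustration.

    from urllib import robotparser

    def allowed(rules: str, path: str = "/docs/") -> bool:
        # Parse an in-memory robots.txt and ask whether the path may be fetched.
        rp = robotparser.RobotFileParser()
        rp.parse(rules.splitlines())
        return rp.can_fetch("ExampleCrawler", path)

    print(allowed("User-Agent: *\nAllow: /"))     # True: everything is crawlable
    print(allowed("User-Agent: *\nDisallow: /"))  # False: the whole site is blocked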

A Lucky Escape

In August 2016, we were working on a Request for Proposal that required some OCaml research. Typing 'ocaml.org package' into the address bar yielded no relevant results.

Instead, the top result was the homepage, accompanied by the dreaded message:

"A description for this result is not available because of this site's robots.txt – learn more."

It seemed strange that an open-source project would want to exclude its entire documentation and site from web crawlers. In OCaml.org's case, the robots.txt file had never existed before. However, as of 29 July 2016, it suddenly appeared, complete with the Disallow: / exclusion rule.

Source: https://web.archive.org/web/20160729171320/https://ocaml.org/robots.txt

On 28 July 2016, the same URL returned a 404 error.

Source: https://web.archive.org/web/20160728145602/https://ocaml.org/robots.txt

As a new OCaml user, I decided to look at the OCaml.org GitHub repository. Not knowing how their robots.txt was generated or the reasoning behind it, I posted an issue on GitHub.

The response was:

"Yeah, that file should never have made it to the production server (it was destined to copies of the website used for testing)."

The commit that caused this shows a robots.txt file included in the source. The Makefile that deploys the site removes the robots.txt file.

Unfortunately, we see scenarios like this happen regularly.

Automated testing

Automated testing may help prevent common mistakes, but it cannot capture every edge case; with the endless possibilities of URIs, there is no silver bullet.
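
As an illustration only, a check along these lines could run in a CI pipeline before release; the URLs and per-environment expectations below are assumptions, not our actual test suite.

    import urllib.request

    # Hypothetical environments and whether their robots.txt should block everything.
    EXPECTATIONS = {
        "https://www.example.com/robots.txt": False,     # production must not block the site
        "https://staging.example.com/robots.txt": True,  # a testing copy should stay hidden
    }

    def blocks_everything(body: str) -> bool:
        # Crude check for a blanket "Disallow: /" rule.
        lines = [line.split("#")[0].strip().lower() for line in body.splitlines()]
        return "disallow: /" in lines

    def test_robots_txt_per_environment():
        for url, should_block in EXPECTATIONS.items():
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
            assert blocks_everything(body) == should_block, f"Unexpected rules at {url}"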

Next best solution

We opted to monitor robots.txt for changes and report the differences between each snapshot.

Source: https://merj.com

Is checking every hour excessive? Sure it is! At the same time, unknowingly blocking an entire website, or essential parts of it, could cost millions in lost revenue.
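
In spirit, the monitor boils down to something like the sketch below: fetch the file, compare it with the previous snapshot, and report any difference. The URL, snapshot path and print-based alert are placeholders; a scheduler such as cron could run it hourly.

    import difflib
    import pathlib
    import urllib.request

    ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder site
    SNAPSHOT = pathlib.Path("robots_snapshot.txt")     # last known copy

    def fetch() -> str:
        with urllib.request.urlopen(ROBOTS_URL, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def check() -> None:
        current = fetch()
        previous = SNAPSHOT.read_text() if SNAPSHOT.exists() else ""
        if current != previous:
            diff = "\n".join(difflib.unified_diff(
                previous.splitlines(), current.splitlines(),
                fromfile="previous", tofile="current", lineterm=""))
            print(f"robots.txt changed at {ROBOTS_URL}:\n{diff}")  # stand-in for a real alert
            SNAPSHOT.write_text(current)

    if __name__ == "__main__":
        check()  # e.g. scheduled hourly via cron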

It has become an essential part of our toolchain that runs seamlessly in the background, and hopefully we will never see an unexpected alert come through. We also run a 'request check' against an endpoint as part of our deployment mechanisms.
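
A deployment-time request check can be as simple as the following sketch, which is an assumption about the general shape rather than our actual endpoint: fetch the freshly deployed robots.txt and return a non-zero exit code so the deploy script can abort.

    import sys
    import urllib.request

    ROBOTS_URL = "https://www.example.com/robots.txt"  # placeholder deployment target

    def request_check() -> int:
        # Return 0 on success or 1 on failure so a deploy script can abort.
        try:
            with urllib.request.urlopen(ROBOTS_URL, timeout=10) as resp:
                body = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            print(f"robots.txt request failed: {exc}")
            return 1
        lines = (line.split("#")[0].strip().lower() for line in body.splitlines())
        if "disallow: /" in lines:
            print("robots.txt blocks the entire site - aborting deploy")
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(request_check())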

What else can you monitor?

  • Content Delivery Networks - Google recommends not blocking assets (images, cascading style sheets and JavaScript).
  • Competitors - Know when your opponents are making changes to their website URI architecture.

We don't want any business to fail because of such a trivial mistake, so we are providing free robots.txt checking for up to 20 (sub)domains per user.

TL;DR

  • Every domain and subdomain should have a robots.txt file. It's a common URI that Google crawls first.
  • Always have a robots.txt file per environment and keep them separate.
  • Change production environments with caution.
  • Automated testing cannot capture every edge case.
  • Use our free robots.txt checker and sleep easy.
