What is robots.txt?
A file webmasters create to instruct web crawlers which URIs they may or may not access. If no instructions (no robots.txt file) are found, all URIs are assumed to be crawlable — though not all web crawlers obey robots.txt rules in the first place.
Handle with care
Adding three letters to your robots.txt file could wipe your entire website from search engine rankings:
- User-Agent: * followed by Allow: / permits crawlers to access every URI.
- User-Agent: * followed by Disallow: / blocks the entire site.
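The difference those three letters make can be demonstrated with Python's standard-library robots.txt parser. A minimal sketch (the example path is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# An allow-all policy: crawlers may fetch any URI.
allow_all = RobotFileParser()
allow_all.parse(["User-Agent: *", "Allow: /"])

# Prepend "Dis" and the same file blocks the entire site.
disallow_all = RobotFileParser()
disallow_all.parse(["User-Agent: *", "Disallow: /"])

print(allow_all.can_fetch("*", "/docs/index.html"))     # True
print(disallow_all.can_fetch("*", "/docs/index.html"))  # False
```

This is exactly the check a well-behaved crawler performs before requesting a URI.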
A Lucky Escape
During August 2016, we worked on a Request For Proposal that required OCaml research. Typing 'ocaml.org package' in the address bar yielded no relevant results.
The top result was instead the homepage, showing the dreaded:
"A description for this result is not available because of this site's robots.txt – learn more."
It seemed strange that an open-source project would want to exclude its entire documentation and site from web crawlers. In OCaml.org's case, the robots.txt file had never existed before. As of 29 July 2016, however, it suddenly appeared, containing the Disallow: / exclusion rule.
On 28 July 2016, requesting the file had still returned a 404 error. Not knowing the reasoning behind their robots.txt generation, I posted an issue on the OCaml.org GitHub repository.
The response was:
"Yeah, that file should never have made it to the production server (it was destined to copies of the website used for testing)."
The commit that caused this shows a robots.txt file included in the source; the Makefile that deploys the site is supposed to remove that file before it reaches production.
Unfortunately, we see scenarios like this happen regularly.
Automated testing may help prevent common mistakes, but it cannot capture every edge case: with the endless possibilities of URIs, there is no silver bullet.
Next best solution
We opted to monitor robots.txt files and report the differences between each hourly snapshot.
Is checking every hour excessive? Sure it is! At the same time, unknowingly blocking an entire website, or essential parts of it, could cost millions in lost revenue.
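The comparison step of that monitoring can be sketched with Python's difflib; the fetching and hourly scheduling are assumed to happen elsewhere, and the snapshots below are illustrative:

```python
import difflib

def robots_diff(previous: str, current: str) -> str:
    """Return a unified diff between two robots.txt snapshots,
    or an empty string when nothing has changed."""
    if previous == current:
        return ""
    return "\n".join(difflib.unified_diff(
        previous.splitlines(), current.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))

old_snapshot = "User-Agent: *\nAllow: /"
new_snapshot = "User-Agent: *\nDisallow: /"
print(robots_diff(old_snapshot, new_snapshot))
```

A non-empty result is what would trigger an alert; an empty string means the file is unchanged and nothing needs reporting.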
It has become an essential part of our toolchain, running seamlessly in the background, and hopefully we will never see an unexpected alert come through. We also run a 'request check' against an endpoint as part of our deployment mechanisms.
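A deployment-time safety gate of this kind can be approximated as follows; the blanket-disallow heuristic is an assumption for illustration, and fetching the file from the live site is left out:

```python
def robots_is_safe(robots_txt: str) -> bool:
    """Heuristic deploy gate: reject a robots.txt that blocks
    the entire site with a blanket 'Disallow: /' rule."""
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip() == "/":
            return False
    return True

# The rule that briefly hid OCaml.org from search engines:
print(robots_is_safe("User-Agent: *\nDisallow: /"))      # False
# A scoped exclusion is fine:
print(robots_is_safe("User-Agent: *\nDisallow: /tmp/"))  # True
```

Failing the deploy when this check returns False would have caught the OCaml.org incident before the file ever reached production.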
What else can you monitor?
- Competitors - Know when your competitors are making changes to their website URI architecture.
We don't want any business to fail because of such a trivial mistake, so we provide free robots.txt checking for up to 20 (sub)domains per user.
- Every domain and subdomain should have a robots.txt file. It's a common URI that Google crawls first.
- Always have a robots.txt file per environment and keep them separate.
- Change Production environments with caution.
- Automated testing cannot capture edge cases.
- Use our free robots.txt checker and sleep easy.