
Solved: Inconsistencies in Google Search Console robots.txt Tester


05 Oct 2017
By Ryan Siddle

Screaming Frog customers have been noticing inconsistencies between the company’s own robots.txt tester tool and the one in Google Search Console.

This was originally noticed back in 2016 by Giuseppe Pastore. Liam Sharp, a developer behind Screaming Frog SEO Spider, shared his thoughts on the Google Search Console (GSC) tester inconsistencies. After a series of experiments, he concluded that the robots.txt tester is, in fact, not reliable.

He believed the cause is that the tester follows a different set of rules from Googlebot, which accepts the UTF-8 form of a URL as well as its upper- and lowercase percentage-escaped forms in a robots.txt file.

We decided to look into this further by adding a disallow rule to our merj.com/robots.txt file for the percentage-escaped version of the UTF-8 path /able_✓, as described in RFC 3986.

 User-Agent: *
 Disallow: /able_%e2%9c%93
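To illustrate why the UTF-8 form and the two percentage-escaped spellings of this path should count as the same thing, here is a minimal sketch that reduces all three to one canonical string. The helper name normalize_path is our own invention for this example, and the normalization it applies (escape non-ASCII bytes, upper-case existing escapes) is an assumption about what "equivalent" means here, not Googlebot's actual matching code.

```ruby
# Hypothetical helper for illustration only; not Googlebot's implementation.
def normalize_path(path)
  # Percentage-escape every non-ASCII byte of the UTF-8 string.
  encoded = path.bytes.map { |b| b < 0x80 ? b.chr : format("%%%02X", b) }.join
  # Upper-case the hex digits of every escape so %e2 and %E2 compare equal.
  encoded.gsub(/%\h{2}/) { |esc| esc.upcase }
end

forms = ["/able_✓", "/able_%e2%9c%93", "/able_%E2%9C%93"]
puts forms.map { |f| normalize_path(f) }.uniq.inspect
# prints ["/able_%E2%9C%93"]
```

A matcher that normalizes both the rule and the tested path this way would treat all three spellings as one path, which is the behaviour described for Googlebot above.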

To start with, we tested the raw UTF-8 path, which the tester reported as “ALLOWED”.

Search Console UTF8 Allowed

With that test failing, we needed to percentage-escape the URI path, so we opened Ruby’s IRB interpreter and loaded the cgi library:

require 'cgi'

The next step was to use the CGI.escape method on the string to get the percentage-escaped version:

CGI.escape("/able_✓")
=> "/able_%e2%9c%93"

Search Console RFC Number Allowed
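One caveat if you reproduce this step yourself: as far as we are aware, Ruby's CGI.escape also escapes the forward slash (as %2F) and emits uppercase hex digits, so escaping a whole path in one call will not give you a string in the same shape as the rule above. A minimal sketch of one way around that escapes each path segment separately. The helper names escape_path and lowercase_escapes are hypothetical and exist only for this example.

```ruby
require 'cgi'

# Hypothetical helper: escape each segment, keep the "/" separators intact.
# Note: CGI.escape turns a space into "+", which is form encoding rather
# than path encoding, so paths containing spaces need extra care.
def escape_path(path)
  path.split('/', -1).map { |segment| CGI.escape(segment) }.join('/')
end

# Hypothetical helper: lower-case only the escapes, not the rest of the path.
def lowercase_escapes(path)
  path.gsub(/%\h{2}/) { |esc| esc.downcase }
end

puts CGI.escape("/able_✓")                     # %2Fable_%E2%9C%93
puts escape_path("/able_✓")                    # /able_%E2%9C%93
puts lowercase_escapes(escape_path("/able_✓")) # /able_%e2%9c%93
```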

Unfortunately, the tester still reported the path as “ALLOWED”. In my own experience as a developer, I’ve often seen URIs with double percentage-encoding.

The next step was to try to escape the already escaped version (double encoding):

CGI.escape("/able_%e2%9c%93")
=> "/able_%25e2%259c%2593"

Search Console RFC Number Blocked

Success! It shows as “BLOCKED”.

Google Search Console appears to escape the already percentage-escaped string a second time. This is because the tester expects UTF-8 as input rather than percentage-escaped paths.
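The three results so far can be reproduced with a very small toy model. To be clear, this is only a guess that happens to fit what we observed: it assumes the tester escapes every "%" in the robots.txt rule and then does a literal prefix match against whatever is typed into the test box. The method name tester_blocks? is hypothetical, and nothing here is taken from Google's code.

```ruby
# Toy model only: fits the observations in this post, NOT Google's logic.
# Assumption: "%" in the rule is escaped to "%25" before a literal prefix match.
def tester_blocks?(rule, tested_path)
  escaped_rule = rule.gsub('%', '%25')
  tested_path.start_with?(escaped_rule)
end

rule = "/able_%e2%9c%93"
puts tester_blocks?(rule, "/able_✓")               # false (ALLOWED)
puts tester_blocks?(rule, "/able_%e2%9c%93")       # false (ALLOWED)
puts tester_blocks?(rule, "/able_%25e2%259c%2593") # true  (BLOCKED)
```

Under the same assumption, a rule written in plain UTF-8 contains no "%" to escape, so testing the UTF-8 path against it would match directly.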

To show that this works in reverse, use the CGI.unescape method to unescape the URI path.

CGI.unescape("/able_%25e2%259c%2593")
=> "/able_%e2%9c%93"

CGI.unescape("/able_%e2%9c%93")
=> "/able_✓"
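The same round trip can be used to check how many layers of escaping a path carries before you paste it into a tester. This is a minimal sketch: the helper name encoding_depth and the limit of five passes are arbitrary choices made for this example.

```ruby
require 'cgi'

# Hypothetical helper: unescape repeatedly until the string stops changing.
# Returns [number_of_layers_removed, fully_unescaped_path].
# Note: CGI.unescape also turns "+" into a space.
def encoding_depth(path, limit: 5)
  depth = 0
  current = path
  while depth < limit
    decoded = CGI.unescape(current)
    break if decoded == current
    current = decoded
    depth += 1
  end
  [depth, current]
end

p encoding_depth("/able_%25e2%259c%2593") # [2, "/able_✓"]
p encoding_depth("/able_%e2%9c%93")       # [1, "/able_✓"]
p encoding_depth("/able_✓")               # [0, "/able_✓"]
```

A depth of 2 or more suggests the path has been double-encoded somewhere along the way.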

All together it looks like this:

Search Console CGI Terminal

It’s worth noting that if you decide to batch escape your entire URL, including protocol, then it’ll turn out like this:

CGI.escape("https://merj.com/able_%e2%9c%93")
=> "https%3A%2F%2Fmerj.com%2Fable_%25e2%259c%2593"
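If you only want the path escaped and the protocol and host left alone, one option is to split the URL first. The sketch below uses Ruby's standard uri library for that. The helper name escape_url_path is hypothetical, and the code assumes the input URL is already ASCII, since URI.parse rejects raw non-ASCII characters.

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: escape only the path of a URL, segment by segment.
# Assumes an ASCII-only input URL; URI.parse raises on raw UTF-8 characters.
def escape_url_path(url)
  uri = URI.parse(url)
  uri.path = uri.path.split('/', -1).map { |segment| CGI.escape(segment) }.join('/')
  uri.to_s
end

puts escape_url_path("https://merj.com/able_%e2%9c%93")
# https://merj.com/able_%25e2%259c%2593
```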

I’ve created a simple script that lets you batch escape or unescape your URLs. It is available as a GitHub Gist.
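The gist itself is not reproduced in this text. As a rough stand-in, and explicitly not the original script, a batch helper along these lines would do the same job. The method name batch_convert and the file name urls.txt in the usage comment are hypothetical.

```ruby
require 'cgi'

# Hypothetical stand-in for a batch script; not the author's gist.
# Takes any enumerable of lines and a mode (:escape or :unescape).
def batch_convert(lines, mode)
  lines.map(&:chomp).reject(&:empty?).map do |line|
    mode == :unescape ? CGI.unescape(line) : CGI.escape(line)
  end
end

# Possible usage, assuming a file with one path per line:
#   puts batch_convert(File.readlines("urls.txt"), :unescape)
puts batch_convert(["/able_%25e2%259c%2593\n", "/able_%e2%9c%93\n"], :unescape)
# /able_%e2%9c%93
# /able_✓
```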

When testing your URLs, always use the UTF-8 form and let Google handle the escaping and unescaping.
