Screaming Frog customers have been noticing inconsistencies between the company’s own robots.txt tester tool and the one in Google Search Console.
This was first noticed in 2016 by Giuseppe Pastore. Liam Sharp, a developer behind the Screaming Frog SEO Spider, shared his thoughts on the Google Search Console (GSC) tester inconsistencies. After a series of experiments, he concluded that the robots.txt tester is, in fact, not reliable.
He believes the reason is that the tester follows a different set of rules from Googlebot, which accepts UTF-8 as well as uppercase and lowercase percent-escaped forms of URLs in a robots.txt file.
User-Agent: *
Disallow: /able_%e2%9c%93
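To illustrate what treating those three forms as equivalent means, here is a minimal Ruby sketch. The helper name normalize_path is hypothetical, and this is not Googlebot's actual matching logic; it is only one way to bring the UTF-8, lowercase and uppercase percent-escaped spellings of a path into a single comparable form.

```ruby
# Hypothetical helper (not Google's implementation): bring a path to one
# canonical spelling so that equivalent forms compare equal.
def normalize_path(path)
  # Percent-encode every non-ASCII byte as uppercase %XX; keep ASCII bytes as-is.
  encoded = path.bytes.map { |b| b < 0x80 ? b.chr : format('%%%02X', b) }.join
  # Uppercase the hex digits of any escapes that were already present.
  encoded.gsub(/%\h\h/) { |escape| escape.upcase }
end

forms = ["/able_✓", "/able_%e2%9c%93", "/able_%E2%9C%93"]
puts forms.map { |form| normalize_path(form) }.uniq.inspect
# ["/able_%E2%9C%93"]
```

All three spellings collapse to the same string, which is the behaviour the tester would need in order to agree with Googlebot.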
To start with, we tried testing the UTF-8 form of the path, which showed as “ALLOWED”.
With that test failing, we needed to percent-escape the URI path, so we used Ruby’s IRB interpreter and loaded the CGI library.
The next step was to use the CGI.escape method on the string to get the percent-encoded version:
CGI.escape("/able_✓") => "/able_%e2%9c%93"
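One detail worth knowing when reproducing this step: in stock Ruby, CGI.escape also encodes the forward slash as %2F and emits uppercase hex digits (uppercase and lowercase hex are equivalent in percent-encoding). A minimal sketch, assuming a hypothetical helper escape_path, that escapes each segment separately and keeps the slashes:

```ruby
require 'cgi'

# Hypothetical helper: percent-encode each path segment but keep the "/" separators.
def escape_path(path)
  # The -1 limit keeps trailing empty segments, so a trailing slash survives.
  path.split('/', -1).map { |segment| CGI.escape(segment) }.join('/')
end

puts CGI.escape("/able_✓")   # %2Fable_%E2%9C%93  (the slash is escaped too)
puts escape_path("/able_✓")  # /able_%E2%9C%93
```

Note that CGI.escape follows form-encoding rules and turns spaces into "+", so this sketch is only suitable for paths without spaces.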
Unfortunately, there was still no luck at this point: the tester still showed “ALLOWED”. From my own experience as a developer, I’ve often seen URIs with double percent-encoding.
The next step was to try to escape the already escaped version (double encoding):
CGI.escape("/able_%e2%9c%93") => "/able_%25e2%259c%2593"
Success! It shows as “BLOCKED”.
Google Search Console is escaping the already percent-escaped string. This is because the tester expects UTF-8 as input rather than percent-escaped paths.
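Double encoding like this is easy to introduce by accident, because escaping an already escaped string simply rewrites each % as %25. A minimal sketch of one way to guard against it, assuming a hypothetical helper escape_once that decodes before encoding (so it is only suitable for input where a literal % or + is not expected):

```ruby
require 'cgi'

# Hypothetical helper: decode first, then encode, so that input which is
# already percent-escaped does not gain a second layer of escaping.
def escape_once(segment)
  CGI.escape(CGI.unescape(segment))
end

puts CGI.escape("able_%e2%9c%93")  # able_%25e2%259c%2593  (double encoded)
puts escape_once("able_%e2%9c%93") # able_%E2%9C%93
puts escape_once("able_✓")         # able_%E2%9C%93
```

Both the raw UTF-8 segment and its escaped form end up with exactly one layer of encoding.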
To show that this works in reverse, use the CGI.unescape method to unescape the URI path.
CGI.unescape("/able_%25e2%259c%2593") => "/able_%e2%9c%93"
CGI.unescape("/able_%e2%9c%93") => "/able_✓"
All together it looks like this:
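A minimal Ruby sketch of the full sequence, run on the path segment without its leading slash (an assumption made here because stock CGI.escape would also encode the slash as %2F). The variable names are illustrative only, and stock Ruby prints the hex digits in uppercase:

```ruby
require 'cgi'

segment = "able_✓"             # UTF-8 path segment, without the leading slash
once    = CGI.escape(segment)  # one layer of percent-encoding
twice   = CGI.escape(once)     # second layer: every "%" becomes "%25"

puts once                            # able_%E2%9C%93
puts twice                           # able_%25E2%259C%2593
puts CGI.unescape(twice) == once     # true
puts CGI.unescape(once) == segment   # true
```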
It’s worth noting that if you decide to batch escape your entire URL, including protocol, then it’ll turn out like this:
CGI.escape("https://merj.com/able_%e2%9c%93") => "https%3A%2F%2Fmerj.com%2Fable_%25e2%259c%2593"
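If the goal is to escape only the path and leave the protocol and host readable, the URL can be split first. A minimal sketch, assuming a hypothetical helper escape_url_path and URLs without a query string or fragment (those would need separate handling):

```ruby
require 'cgi'

# Hypothetical helper: escape only the path of a URL, leaving scheme and host intact.
# Assumes the URL has no query string or fragment.
def escape_url_path(url)
  match = url.match(%r{\A(?<origin>[a-z][a-z0-9+.\-]*://[^/]*)(?<path>/.*)?\z}i)
  return url unless match # not an absolute URL: return it unchanged

  # Escape each segment separately so the "/" separators are preserved.
  escaped = match[:path].to_s.split('/', -1).map { |s| CGI.escape(s) }.join('/')
  match[:origin] + escaped
end

puts escape_url_path("https://merj.com/able_%e2%9c%93")
# https://merj.com/able_%25e2%259c%2593
```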
I’ve created a simple script that will allow you to batch escape or unescape your URLs. You can grab the script from the GIST below:
When testing your URLs, always use UTF-8; Google will handle the escaping and unescaping.