Solved: Inconsistencies in Google Search Console robots.txt Tester

By Ryan Siddle

Screaming Frog customers have been noticing inconsistencies between the company’s own robots.txt tester tool and the one in Google Search Console.

This was originally noticed back in 2016 by Giuseppe Pastore. Liam Sharp, a developer behind Screaming Frog SEO Spider, shared his thoughts on the Google Search Console (GSC) tester inconsistencies. After a series of experiments, he concluded that the robots.txt tester is, in fact, not reliable.

He believed the reason is that the tester follows a different set of rules from Googlebot, which accepts UTF-8 as well as upper- and lowercase percentage-escaped forms of URLs in a robots.txt file.

We decided to look into this further by adding a disallow rule to our robots.txt file for the percentage-escaped version of the UTF-8 path /able_✓, as described in RFC 3986.

    User-Agent: *
    Disallow: /able_%e2%9c%93
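To make it clearer where the escaped rule comes from, here is a minimal Ruby sketch that percent-escapes each UTF-8 byte of the check mark, in the style RFC 3986 describes. The helper name `percent_escape_bytes` is our own illustration, not part of any library.

```ruby
# Percent-escape every byte of a string as lowercase %xx triplets.
# Hypothetical helper, for illustration only.
def percent_escape_bytes(str)
  str.encode("UTF-8").bytes.map { |byte| format("%%%02x", byte) }.join
end

check_mark = "\u2713"                       # the check mark used in the test path
escaped    = percent_escape_bytes(check_mark) # three UTF-8 bytes: e2 9c 93
rule_path  = "/able_" + escaped

puts rule_path # the path used in the Disallow rule
```

The check mark occupies three bytes in UTF-8, which is why the rule contains three `%xx` triplets.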

To start with, we tested the UTF-8 path, which showed as “ALLOWED”.

Search Console UTF8 Allowed

With that test failing, we needed to percentage-escape the URI path, so we used Ruby’s IRB interpreter and loaded the cgi module:

    require 'cgi'

The next step was to call the CGI.escape method on the string to get the percentage-escaped version:

    => "/able_%e2%9c%93"
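For readers who want to reproduce this step, the following is a minimal sketch rather than the exact IRB session. It assumes you escape each path segment separately, because `CGI.escape` applied to a whole path also converts "/" to "%2F". `CGI.escape` emits uppercase hex digits, so the sketch lowercases only the `%XX` triplets to match the rule as written in the robots.txt file.

```ruby
require 'cgi'

path = "/able_\u2713" # the UTF-8 test path

# Escape each segment on its own so the "/" separators survive.
escaped = path.split("/", -1).map { |segment| CGI.escape(segment) }.join("/")

# Lowercase only the %XX triplets, leaving the rest of the path untouched.
lowercased = escaped.gsub(/%[0-9A-F]{2}/) { |triplet| triplet.downcase }

puts escaped
puts lowercased
```

Upper- and lowercase triplets are equivalent under RFC 3986, so either form describes the same path.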

Search Console RFC Number Allowed

Unfortunately, this made no difference: the result was still “ALLOWED”. From my own experience as a developer, I’ve often seen URIs with double percentage-encoding.

The next step was to try to escape the already escaped version (double encoding):

    => "/able_%25e2%259c%2593"
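Double encoding simply means escaping a string that is already escaped, so every "%" becomes "%25". A minimal sketch, assuming the segment is escaped without its leading slash:

```ruby
require 'cgi'

once  = "able_%e2%9c%93"  # segment that is already percentage-escaped
twice = CGI.escape(once)  # each "%" is itself escaped to "%25"

double_encoded_path = "/" + twice
puts double_encoded_path
```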

Search Console RFC Number Blocked

Success! It shows as “BLOCKED”.

Google Search Console is trying to escape the already percentage-escaped string. This is because Google expects UTF-8 as input rather than percentage-escaped paths.

To show that this works in reverse, use the CGI.unescape method to unescape the URI path:

    => "/able_%e2%9c%93"

    => "/able_✓"
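The reverse direction can be sketched the same way. Each call to `CGI.unescape` removes one layer of escaping, so two calls take the double-encoded path back to the UTF-8 original:

```ruby
require 'cgi'

double = "/able_%25e2%259c%2593"

single = CGI.unescape(double) # "%25" becomes "%"
plain  = CGI.unescape(single) # "%e2%9c%93" becomes the check mark

puts single
puts plain
```

Note that `CGI.unescape` also turns "+" into a space, which matters if your paths contain literal plus signs.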

All together it looks like this:

Search Console CGI Terminal

It’s worth noting that if you decide to batch escape your entire URL, including the protocol, then it’ll turn out like this:

    => ""
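As a rough illustration with a hypothetical URL (example.com is our placeholder, not the URL from the original test), escaping the whole string also escapes the ":" and "/" characters of the protocol and host. One way around this, sketched below, is to split off the scheme and host and escape only the path segments.

```ruby
require 'cgi'

url = "https://example.com/able_\u2713" # hypothetical URL for illustration

# Escaping everything mangles the protocol and the slashes.
whole = CGI.escape(url)

# Separate "scheme://host" from the path, then escape path segments only.
match  = url.match(%r{\A([a-z][a-z0-9+.-]*://[^/]+)(/.*)?\z}i)
origin = match[1]
path   = match[2].to_s
path_only = origin + path.split("/", -1).map { |seg| CGI.escape(seg) }.join("/")

puts whole
puts path_only
```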

I’ve created a simple script that lets you batch escape or unescape your URLs. You can grab the script from the Gist below:
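The Gist itself is not reproduced in this text. As a stand-in, here is a minimal sketch of what such a batch helper might look like. It is not the author's script: the names `convert_path` and `batch_convert` are hypothetical, and it assumes the URLs have no query strings, since `CGI.escape` would also escape "?" and "=".

```ruby
require 'cgi'

# Escape or unescape only the path of a URL; scheme and host are left alone.
def convert_path(url, mode)
  origin, path = url.match(%r{\A([a-z][a-z0-9+.-]*://[^/]+)?(.*)\z}i).captures
  converted = path.split("/", -1).map do |segment|
    mode == :escape ? CGI.escape(segment) : CGI.unescape(segment)
  end.join("/")
  origin.to_s + converted
end

# Apply the conversion to a list of URLs or paths, trimming whitespace first.
def batch_convert(urls, mode)
  urls.map { |url| convert_path(url.strip, mode) }
end

escaped_urls   = batch_convert(["https://example.com/able_\u2713"], :escape)
unescaped_urls = batch_convert(["/able_%e2%9c%93\n"], :unescape)

puts escaped_urls
puts unescaped_urls
```

Passing `:escape` or `:unescape` as the mode keeps both directions in one function, so the same list can be round-tripped to check the results.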

When testing your URLs, always enter them as UTF-8; Google will handle the escaping and unescaping.