Solved: Inconsistencies in Google Search Console robots.txt Tester

Screaming Frog customers have been noticing inconsistencies between the company’s own robots.txt tester tool and the one in Google Search Console.

This was originally noticed back in 2016 by Giuseppe Pastore. Liam Sharp, a developer behind the Screaming Frog SEO Spider, shared his thoughts on the Google Search Console (GSC) tester inconsistencies. After a series of experiments, he concluded that the robots.txt tester is, in fact, not reliable.

He believed the reason is that the tester follows a different set of rules from Googlebot, which accepts UTF-8 as well as upper- and lowercase percentage-escaped forms of URLs in a robots.txt file.

We decided to look into this further by setting up a disallow rule in our merj.com/robots.txt file for the percentage-escaped version of the UTF-8 path /able_✓, as described in RFC 3986.

User-Agent: *
Disallow: /able_%e2%9c%93

To start with, we tested the UTF-8 form of the path, which showed as “ALLOWED”.

Example screenshot of the UTF-8 Encoding being "Allowed"

With that test failing, we needed to percentage-escape the URI path, so we used Ruby’s IRB interpreter and loaded the cgi library:

require 'cgi'

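The next step was to use the CGI.escape method on the string to get the percentage-escaped version:

CGI.escape("/able_✓")
=> "%2Fable_%E2%9C%93"

CGI.escape also escapes the slash (as %2F) and emits uppercase hex digits. Restoring the leading slash gives the path /able_%E2%9C%93, which under RFC 3986 is equivalent to the lowercase /able_%e2%9c%93 used in our rule.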

CGI.escape example of the Robots.txt tester

Unfortunately, this made no difference: the path still showed as “ALLOWED”. From my own experience as a developer, I’ve often seen URIs with double percentage-encoding.

The next step was to escape the already escaped version (double encoding):

CGI.escape("/able_%e2%9c%93")
=> "%2Fable_%25e2%259c%2593"

With the leading slash restored (CGI.escape encodes "/" as %2F), the path is /able_%25e2%259c%2593.

Example of a blocked resource with double percentage-encoding.

Success! It shows as “BLOCKED”.

The Google Search Console tester is escaping the already percentage-escaped string a second time, because it expects UTF-8 as input rather than percentage-escaped paths.
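One way to picture this is a toy model of the tester. The sketch below is not Google's implementation: `toy_tester` and `escape_segments` are hypothetical names, and the behaviour (re-escape the rule text, then prefix-match the path exactly as typed) is only an assumption, chosen because it is one reading that reproduces the three results reported above.

```ruby
require 'cgi'

# Escape each path segment but keep "/" as the separator.
def escape_segments(path)
  path.split('/', -1).map { |segment| CGI.escape(segment) }.join('/')
end

# Hypothetical tester: treats the rule as raw text and escapes it again,
# then checks whether the typed path starts with the re-escaped rule.
def toy_tester(rule, typed_path)
  escaped_rule = escape_segments(rule).downcase
  typed_path.downcase.start_with?(escaped_rule) ? 'BLOCKED' : 'ALLOWED'
end

RULE = '/able_%e2%9c%93'

puts toy_tester(RULE, '/able_✓')                # ALLOWED
puts toy_tester(RULE, '/able_%e2%9c%93')        # ALLOWED
puts toy_tester(RULE, '/able_%25e2%259c%2593')  # BLOCKED
```

If the real tester behaved differently, the model would need adjusting; it is only meant to make the double-escaping idea concrete.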

To show that this works in reverse, use the CGI.unescape method to unescape the URI path:

CGI.unescape("/able_%25e2%259c%2593")
=> "/able_%e2%9c%93"

CGI.unescape("/able_%e2%9c%93")
=> "/able_✓"
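The same idea can be used to check how many layers of escaping a path carries. This is a minimal sketch: `encoding_depth` and `MAX_PASSES` are hypothetical names introduced here, and it assumes the path contains no literal "+" or "%" that should survive, since CGI.unescape would rewrite those.

```ruby
require 'cgi'

# Arbitrary safety limit for this sketch, so a bad input cannot loop forever.
MAX_PASSES = 5

# Unescape repeatedly until the string stops changing.
# Returns the number of passes that changed the string and the final result.
def encoding_depth(path)
  depth = 0
  current = path
  while depth < MAX_PASSES
    decoded = CGI.unescape(current)
    break if decoded == current

    current = decoded
    depth += 1
  end
  [depth, current]
end

p encoding_depth('/able_%25e2%259c%2593')  # => [2, "/able_✓"]
p encoding_depth('/able_%e2%9c%93')        # => [1, "/able_✓"]
```

A depth of 2 or more suggests the path has been escaped more than once somewhere along the way.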


It’s worth noting that if you batch escape an entire URL, including the protocol, it will turn out like this:

CGI.escape("https://merj.com/able_%e2%9c%93")
=> "https%3A%2F%2Fmerj.com%2Fable_%25e2%259c%2593"
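To avoid mangling the protocol and host, one option is to escape only the path, segment by segment. The helper below is a sketch, not part of the original script: `escape_url_path` is a hypothetical name, and it assumes a simple "scheme://host/path" URL with no query string or fragment. Note also that CGI.escape turns spaces into "+", which is form encoding rather than path encoding.

```ruby
require 'cgi'

# Escape only the path of an absolute URL, leaving the scheme, the host
# and the "/" separators untouched. URLs that do not look like
# "scheme://host/path" are returned unchanged.
def escape_url_path(url)
  match = url.match(%r{\A(\w+://[^/]+)(/.*)?\z})
  return url unless match

  origin = match[1]
  path = match[2].to_s
  escaped = path.split('/', -1).map { |segment| CGI.escape(segment) }.join('/')
  origin + escaped
end

puts escape_url_path('https://merj.com/able_✓')
# https://merj.com/able_%E2%9C%93
```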

I’ve created a simple script that lets you batch escape or unescape your URLs. You can grab the script from the Gist below:

#!/usr/bin/env ruby
# Batch escape or unescape a file of URLs, one URL per line.

require 'cgi'

CGI_METHODS = %w[escape unescape].freeze
USAGE = "usage: #{File.basename(__FILE__)} input output escape|unescape"

input, output, type = ARGV

# Require an existing input file, an output path and a supported type.
unless input && output && File.exist?(input) && CGI_METHODS.include?(type)
  warn USAGE
  exit(1)
end

File.open(output, 'w') do |file|
  File.foreach(input) do |line|
    # Strip the line ending first so it is not escaped as %0A.
    file.puts(CGI.public_send(type, line.chomp))
  end
end

When testing your URLs in the robots.txt tester, always use UTF-8 and let Google handle the escaping and unescaping.
