Introduction

We recently noticed a post on X by @pandraus regarding Reddit’s robots.txt file. On 25 June 2024, u/traceroo announced that automated agents accessing Reddit must comply with its terms and policies.

“In the next few weeks, we’ll be updating our robots.txt instructions to be as clear as possible: if you are using an automated agent to access Reddit, you need to abide by our terms and policies, and you need to talk to us. We believe in the open internet, but we do not believe in the misuse of public content.”

u/traceroo

The robots.txt file is crucial for managing how web crawlers interact with a site. A directive disallowing all access, as seen in Reddit’s latest update, can lead to the entire site being deindexed, posing significant risks to its search engine presence and overall accessibility. Let’s explore the details and implications of this change.

Analysis of the New Robots.txt

The new robots.txt file is quite…blunt:

# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
# policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy

User-agent: *
Disallow: /

This directive blocks all crawlers from accessing any part of Reddit. Normally this would be a critical issue, since it could lead to the entire domain being deindexed. There is precedent: OCaml once inadvertently blocked crawling of its entire website this way.
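
To make the effect concrete, here is a minimal sketch, using Python’s standard library, of how a standards-compliant crawler interprets those two lines (the bot names are illustrative):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Feed the directives directly rather than fetching https://www.reddit.com/robots.txt
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every path is off-limits to every user-agent, including the homepage.
print(rp.can_fetch("Googlebot", "https://www.reddit.com/"))           # False
print(rp.can_fetch("MyResearchBot", "https://www.reddit.com/r/all"))  # False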

Real-World Implications

Reddit’s move raises several questions. For one, Reddit clearly still wants search engines and archivers to index its content, especially given recent agreements with partners like Google.

“There are folks like the Internet Archive, who we’ve talked to already, who will continue to be allowed to crawl Reddit.”

u/traceroo

This leads us to question whether search engines like Google might make exceptions for Reddit. However, this seems unlikely. It’s more plausible that Reddit might be serving different robots.txt files to different user-agents.
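
Serving per-agent files is technically straightforward. As a purely hypothetical sketch (Flask, the policies, and the agent allowlist below are our assumptions, not Reddit’s actual stack), a site could do something like this:

from flask import Flask, Response, request

app = Flask(__name__)

# Illustrative policies: a restrictive default, plus a permissive file for
# recognized partners. The tokens below are examples, not Reddit's real list.
RESTRICTIVE = "User-agent: *\nDisallow: /\n"
PERMISSIVE = "User-agent: *\nAllow: /\n"
PARTNER_TOKENS = ("Googlebot", "archive.org_bot")

@app.route("/robots.txt")
def robots():
    ua = request.headers.get("User-Agent", "")
    body = PERMISSIVE if any(token in ua for token in PARTNER_TOKENS) else RESTRICTIVE
    return Response(body, mimetype="text/plain")

Matching the User-Agent header alone is trivially spoofable, which is exactly why the IP verification described below matters.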

Testing and Findings

To investigate, we ran a test using Google’s own tools. Normally a user-agent switcher would do, but Reddit blocks clients that merely pretend to be Googlebot: major search engines publish their crawlers’ IP address ranges, so sites can verify that a request claiming to be Google actually originates from Google’s network.
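
For reference, here is a rough sketch of the verification Google itself documents: reverse-resolve the client IP, check that the hostname falls under googlebot.com or google.com, then forward-resolve that hostname back to the same IP. The sample addresses are illustrative only.

import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward confirmation
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_verified_googlebot("66.249.66.1"))  # an address in Google's published crawler range
print(is_verified_googlebot("203.0.113.7"))  # TEST-NET example address: False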

We used Google’s Rich Results Test to retrieve the raw content of Reddit’s robots.txt as Googlebot sees it. The result confirmed that Reddit serves Google an entirely different robots.txt file, a practice commonly known as cloaking.
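
The comparison itself is easy to reproduce from an ordinary connection, at least for non-Google identities; a minimal sketch (the user-agent strings are arbitrary examples):

import requests

URL = "https://www.reddit.com/robots.txt"
AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64)",
    "generic-bot": "MyResearchBot/1.0",
}

# Fetch the file under each identity and print a snippet for side-by-side comparison.
for name, ua in AGENTS.items():
    resp = requests.get(URL, headers={"User-Agent": ua}, timeout=10)
    print(f"--- {name}: HTTP {resp.status_code}, {len(resp.text)} bytes ---")
    print(resp.text[:200])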

Conclusion

Reddit’s updated robots.txt file appears to block all crawlers, but our tests show this is more a public stance than a practical restriction. Robots.txt files are generally not meant for consumption by the average user (as opposed to developers, SEO practitioners, and the like), so serving a different file to verified “good bot” search engines for content discovery is not generally perceived as deceptive.

For those interested, the full robots.txt rules can be found here.