Introduction

AdSense is Google’s advertising platform through which publishers are paid to place advertisements on their webpages. While performing a search engine optimisation test that involved tracking the crawling and rendering of a React application, we discovered an anomaly in the client’s server log data which required further investigation.

We found that part of the AdSense technology stack is not working as expected: it builds an incomplete understanding of webpage content, which in turn impacts the programmatic matching between ads and website content.

Disclosure Ethics and Communication

We uphold a strong ethical framework when it comes to disclosing discovered anomalies or bugs, especially those associated with pivotal platforms like Google AdSense. We strictly follow responsible disclosure guidelines, informing relevant parties well in advance of any public announcements.

This issue does not fall under the Google Bug Hunter Program, as it pertains primarily to a product operational anomaly rather than a security vulnerability. Although it was not reported through that program, we raised the bug with Google representatives who liaise with the internal Google teams, and we observed a standard 90-day grace period before considering public disclosure.

We have extended these standard disclosure timelines to facilitate a resolution, although the issue remains unresolved.

Bug Disclosure Timeline

Date | Subject | Action
June 1st, 2023 | Merj | Discovered the bug.
June 8th, 2023 | Merj | We sent an email to Gary Illyes, a member of the Google Search Team, describing the bug and its impact.
June 27th, 2023 | Google | Gary Illyes responded, stating that he had consulted with the rendering team and would notify the administrators responsible for the Mediapartners-Google crawlers. As the owners of the Mediapartners-Google crawlers are not part of the Search team, Search team members have no influence over them.
September 15th, 2023 | Merj | We sent a follow-up email asking for any updates regarding the bug.
October 24th, 2023 | Merj & Google | We had an in-person conversation with Gary Illyes about the bug at the Google Search Central Live Zurich event.
April 23rd, 2024 | Merj | Public disclosure of the issue.

Google AdSense and Google Ads

Google AdSense is an advertising program run by Google. It allows website owners (publishers) to monetise their content by displaying targeted advertisements. These ads are generated by Google and can be customised to match the website’s content. Publishers earn revenue when visitors click on or view these advertisements.

Google AdSense offers multiple formats of ads

Source: https://adsense.google.com/start/resources/best-format-your-site-for-adsense/

Google AdSense offers publishers a variety of ad units to display on their websites, such as display, in-feed, and in-article ads.

Google Ads is a platform that enables businesses (advertisers) to create and manage online advertisements, targeting specific audiences based on keywords, demographics, and interests. These advertisements can appear on various Google services, such as search results, YouTube, and partner websites.

Advertisers that use Google Ads can place their ads on websites that participate in the AdSense program. This symbiotic relationship enables businesses to reach a wider audience through targeted advertising, while website owners can generate revenue by hosting relevant advertisements on their platforms.

Google AdSense targeting

Google AdSense works by matching ads to your site based on your content and visitors; webpages with incorrect, partially rendered, or blank content therefore impact the matching of ads and webpages. To analyse webpage content, Google AdSense employs a specific User-Agent known as ‘Mediapartners-Google’.

Google AdSense employs various methods for delivering targeted ads. Contextual Targeting uses factors such as keyword analysis, word frequency, font size, and the overall link structure of the web to ascertain a webpage’s content accurately and match ads accordingly.

However, without access to a page’s full content, any targeting based on page content cannot be accurate.

Impact of the Google AdSense Rendering Bug

Impact on Google AdSense

When websites block AdSense infrastructure from accessing their content, the precision of ad targeting can be considerably affected, potentially resulting in diminished clicks and, consequently, reduced revenue. This has an impact on both publishers and advertisers.

Misunderstanding the content on the page could result in more severe consequences: if the publisher sends irrelevant traffic to advertisers, the AdSense platform may limit or disable ad serving.

Impact on Server Access Logs Analysis

The misattribution of User-Agents in server access logs can lead to incorrect assumptions about search engines’ crawling and rendering of webpages.

Additionally, it can result in inaccurate conclusions about the sources of crawling traffic and the effectiveness of strategies or updates made on the website, potentially leading to misguided decision-making.

Technical Analysis of the Bug

The TL;DR

  • The use of Mediapartners-Google and Googlebot for different parts of the AdSense crawling and rendering process creates a conflict of robots.txt rules that is not immediately obvious (see the sketch after this list).
  • The initial request to download the webpage’s HTML source code uses the “Mediapartners-Google” User-Agent.
  • The Google Web Rendering Service (WRS) then processes and renders the page to generate the final HTML. During this phase, supplementary rendering resources are requested using the “Googlebot” User-Agent. If a necessary resource cannot be downloaded because a robots.txt rule is blocking the “Googlebot” User-Agent, the webpage may be partially rendered or completely blank.
  • Not being able to get the correct content of the webpage can affect AdSense content understanding and ad targeting, consequently affecting publisher revenues.
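
To make the conflict concrete, the following minimal sketch (our illustration, not Google’s code) uses Python’s built-in robots.txt parser with hypothetical rules and URLs to show the same site being treated differently by the two User-Agents.

PYTHON
# A minimal sketch (hypothetical robots.txt rules and URLs) showing how the
# initial page fetch can succeed while a render-time resource fetch is blocked.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """
User-agent: Googlebot
Disallow: /api/

User-agent: Mediapartners-Google
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://domain.com/reviews/139593"          # hypothetical webpage
resource = "https://domain.com/api/reviews/139593"  # hypothetical render-time API call

# The initial HTML fetch is made as Mediapartners-Google: allowed.
print(parser.can_fetch("Mediapartners-Google", page))  # True

# During rendering, the WRS requests supporting resources as Googlebot: blocked,
# so the rendered HTML may be incomplete or blank.
print(parser.can_fetch("Googlebot", resource))  # False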

The Details

Robots.txt effect on crawling and rendering

Every time a web browser requests a website, it sends an HTTP header called the “User-Agent”. The User-Agent value contains information about the web browser name, operating system, and device type. The User-Agent header is present in both webpage and page resource requests.

Search engine crawlers use their own custom User-Agent when fetching webpages and page resources. Before downloading a URL, search engines check whether they are allowed to fetch it by parsing the site’s robots.txt.

Without debating whether and how robots.txt should be used to block crawlers, below is a simplified step-by-step pipeline of the effect robots.txt has on a search engine’s crawling and rendering process (a code sketch follows the failure cases below):

  • Step 1: Checking robots.txt before fetching the webpage
  • Step 2: Fetching the webpage
  • Step 3: Parsing the HTML to get the webpage resources
  • Step 4: Checking robots.txt for each webpage resource
  • Step 5: Downloading webpage resources
  • Step 6: Start rendering the webpage
  • Step 7: Checking robots.txt for additional page resources needed to complete the rendering
  • Step 8: Downloading additional webpage resources
  • Step 9: Complete the webpage rendering
Before each fetch, the crawler has to check whether the resource can be downloaded, respecting robots.txt rules.

If Step 1 fails:

  • the crawler is not allowed to download the webpage HTML source code.
  • subsequent steps are ignored.

If Step 1 is completed but one of the other steps fails:

  • the rendering of the webpage may not be correct due to missing resources.
  • the final webpage’s rendered HTML may be missing some information or be completely blank.
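
To tie the steps together, here is a highly simplified, hypothetical sketch of the gating role robots.txt plays in that pipeline. It is not a real renderer: resource extraction is approximated with a regular expression, same-origin handling is omitted, “rendering” is reduced to reporting which resources could or could not be fetched, and the URL is a placeholder.

PYTHON
# A simplified, hypothetical sketch of the crawl/render pipeline described above.
import re
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "Googlebot"

def load_robots(page_url):
    # Fetch and parse the site's robots.txt before any other request.
    origin = "{0.scheme}://{0.netloc}".format(urlparse(page_url))
    parser = RobotFileParser(origin + "/robots.txt")
    parser.read()
    return parser

def fetch(url):
    with urlopen(Request(url, headers={"User-Agent": USER_AGENT})) as response:
        return response.read().decode("utf-8", errors="replace")

def crawl_and_render(page_url):
    robots = load_robots(page_url)
    if not robots.can_fetch(USER_AGENT, page_url):              # Step 1
        print("Webpage blocked by robots.txt; all later steps are skipped.")
        return
    html = fetch(page_url)                                       # Step 2
    resources = [urljoin(page_url, src)                          # Step 3 (crude)
                 for src in re.findall(r'(?:src|href)="([^"]+)"', html)]
    for resource in resources:                                   # Steps 4-5
        if robots.can_fetch(USER_AGENT, resource):
            fetch(resource)
        else:
            # Steps 6-9 continue without this resource, so the rendered HTML
            # may be incomplete or blank.
            print("Blocked resource, rendering may be incomplete:", resource)

crawl_and_render("https://example.com/")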

Our Investigative Process

With ongoing efforts to bring our Search Engine Web Rendering Monitoring solution into production, we have been closely monitoring the number of webpages being crawled and the time delta within which those webpages are rendered. Working with server logs that contain terabytes of data, we utilise a custom in-house enrichment and query engine (similar to Splunk) that enables us to drill into the data with complex logic.

Validating the Data Source

The server access logs started showing anomalies over a six-week period, with fetches of page resources whose referrer pointed to webpages that are normally blocked for Googlebot. First, we needed to check the data pipelines and data integrity. This involved reviewing any code changes and container failures that may have created unexpected edge cases, both at our source and further upstream. We are often second consumers of server logs because of Personally Identifiable Information (PII) and Payment Card Industry Data Security Standard (PCI-DSS) requirements. Examples of transformations include the following (a hypothetical sketch follows the list):

  • Redacting sensitive URLs such as logged-in areas.
  • IP address restriction. Often the IP address is redacted, so Google crawler verification needs to be done further upstream by IP range checks.
  • Scrubbing emails, names, and addresses.
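
As an illustration only (the patterns and paths below are hypothetical, not our clients’ production pipelines), such transformations often amount to a simple pass over each log line:

PYTHON
# A hypothetical sketch of upstream log redaction; patterns and paths are
# illustrative only.
import re

SENSITIVE_PATHS = ("/account", "/checkout")        # e.g. logged-in areas
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(\d{1,3})(\.\d{1,3}){3}\b")

def redact(line):
    line = EMAIL.sub("[email-redacted]", line)      # scrub email addresses
    line = IPV4.sub(r"\1.x.x.x", line)              # keep only the first octet
    for path in SENSITIVE_PATHS:
        line = line.replace(path, "[redacted-path]")
    return line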

The Traffic Engineering and Edge teams managing the upstream ingress point (for instance, a CDN like Cloudflare, Akamai, or Fastly) confirmed that no changes had been made. We reprocessed our data, which yielded the same anomalies.

Reproduction and Isolation of the Issue

Once the data source has been validated, the next step is to reproduce and isolate the anomaly to confirm its existence and understand its behaviour. Here’s how to replicate the issue:

  1. Identify Target Webpages: Start by identifying webpages that are accessible to the “Mediapartners-Google” User-Agent, but blocked for the “Googlebot” User-Agent. This can be determined by looking for “Disallow” directives in the website’s robots.txt file.
# Googlebot
user-agent: Googlebot
disallow: /reviews

# Mediapartners-Google
user-agent: Mediapartners-Google
allow: /
  2. Utilise the Referer HTTP Request Header: Tracing the webpage resources through the Referer HTTP header reveals the webpage from which a particular resource has been requested.
APACHE
66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] 808840 "POST /graphql-enpoint HTTP/1.1" 200 56 "https://domain.com/reviews/139593" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

NGINX
66.249.64.4 - - [28/Jul/2023:04:17:10 +0000] "POST /graphql-enpoint HTTP/1.1" 200 56 "https://domain.com/reviews/139593" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"
  3. Use a Robots.txt Parser: Use a reliable robots.txt parser to verify the accessibility of the webpage origin address for different User-Agents. We recommend using the official Google open-source C++ version available on GitHub. If using another parser, refer to the official Google documentation and the specification to check for accurate parsing.
  4. Verify User-Agent Attribution: By combining the Referer HTTP request header and the robots.txt parser, check whether the resource requests during rendering are correctly attributed to the “Googlebot” User-Agent or if they originate from a different User-Agent, specifically “Mediapartners-Google”.

Note: For webpages accessible by both “Mediapartners-Google” and “Googlebot,” the above Server Access Logs approach to detect incorrect User-Agent attribution may not be effective. In such specific cases, more advanced solutions, such as our Search Engine Rendering Monitoring tool, are required.
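
The following is a minimal sketch of steps 2 to 4, assuming combined-format access logs (as in the Apache and NGINX samples above) and hypothetical file and domain names. Note that Python’s built-in parser applies generic (*) groups to Mediapartners-Google, unlike the real AdSense crawler, so Google’s reference C++ parser is preferable for anything beyond a quick check.

PYTHON
# A minimal sketch of the detection approach above: flag "Googlebot" requests
# whose referring page is blocked for Googlebot but allowed for
# Mediapartners-Google. File names and the domain are assumptions.
import re
from urllib.robotparser import RobotFileParser

# Referrer and User-Agent are the last two quoted fields in combined log format.
LOG_LINE = re.compile(r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"\s*$')

robots = RobotFileParser("https://domain.com/robots.txt")
robots.read()

total_googlebot = suspect = 0
with open("access.log", encoding="utf-8") as log:
    for line in log:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match["user_agent"]:
            continue
        total_googlebot += 1
        referer = match["referer"]
        if (referer.startswith("http")
                and not robots.can_fetch("Googlebot", referer)
                and robots.can_fetch("Mediapartners-Google", referer)):
            suspect += 1
            print("Possible misattributed AdSense fetch:", line.strip())

if total_googlebot:
    print(f"{suspect / total_googlebot:.1%} of 'Googlebot' requests look misattributed")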

Impact Analysis

Number of Websites using Google AdSense potentially impacted 

To assess the potential implications of the issue on actual websites, we acquired the list of US and UK websites utilising Google AdSense from BuiltWith.com and developed a tool to identify the possible impact of the issue on these websites.

The robots.txt files of most websites we analysed are small and contain rules only for the global User-Agent (*), which the AdSense crawler ignores. As the AdSense crawler only respects rules set specifically for Mediapartners-Google, this significantly increases the number of websites that may be affected.

We did this using the following simplified logic, which approximates how many websites may be impacted:

Flowchart used to determine whether a domain is impacted.
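
In code, the core of that logic looks roughly like the sketch below. It approximates our tool rather than reproducing it: a domain is treated as potentially impacted when its robots.txt contains at least one Disallow path that applies to Googlebot (via a Googlebot or generic * group) but not to Mediapartners-Google, and Allow overrides, wildcards, and unreachable robots.txt files are ignored.

PYTHON
# An approximate, simplified version of the classification logic in the
# flowchart above; Allow overrides and other edge cases are ignored.
from collections import defaultdict

def disallow_rules(robots_txt):
    """Map each lower-cased user-agent token to its set of Disallow paths."""
    groups, agents, in_rules = defaultdict(set), [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                       # a new group starts
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("disallow", "allow"):
            in_rules = True
            if field == "disallow" and value:  # an empty Disallow blocks nothing
                for agent in agents:
                    groups[agent].add(value)
    return groups

def potentially_impacted(robots_txt):
    groups = disallow_rules(robots_txt)
    blocked_for_googlebot = groups["googlebot"] | groups["*"]
    # Per the behaviour described above, Mediapartners-Google only honours its
    # own group and ignores the generic (*) rules.
    return bool(blocked_for_googlebot - groups["mediapartners-google"])

print(potentially_impacted("User-agent: *\nDisallow: /reviews"))  # True
print(potentially_impacted("User-agent: *\nDisallow:"))           # False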

Upon executing the tool on the BuiltWith list, which covers around 7 million US websites and 2 million UK websites, we determined that around 5.3 million websites may potentially be impacted by this issue.

UK websites

Status | Number
Websites from the BuiltWith list | 1,946,633
Testable websites | 974,536
Potentially impacted websites | 938,413
Non-impacted websites | 36,123

US websites

Status | Number
Websites from the BuiltWith list | 6,827,954
Testable websites | 4,540,894
Potentially impacted websites | 4,363,028
Non-impacted websites | 177,866

On analysing the robots.txt files, we can see that most of the response bodies are relatively small.

UK websites robots.txt bytes (compressed)

Percentile | Bytes
25th | 137
50th (Median) | 137
75th | 137
95th | 213
99th | 748

US websites robots.txt bytes (compressed)

Percentile | Bytes
25th | 137
50th (Median) | 137
75th | 137
95th | 575
99th | 1,145

It is difficult, within the scope of this article, to provide an exact prediction of the number of websites currently impacted. While 5.5 million websites may be affected by the issue, they would only experience a negative impact if they exhibit certain specific characteristics, such as serving primary content via JavaScript and blocking a portion of requests using specific robots.txt rules.

Our analysis provides a broad overview of potential impacts without hands-on verification. To identify if a site is affected, a more complex assessment would be necessary, involving the comparison of a site’s initial and rendered HTML. This requires a level of testing that goes beyond our current scope, emulating search engine behaviours to extract and analyse a page’s primary content.

The web is inherently broken, and simple methods, like checking the <main> HTML tag, fall short due to the web’s inconsistency and the varying adherence to best practices among servers and websites. Other approaches, such as comparing initial and rendered HTML sizes or word count differences, are imprecise and unreliable, potentially leading to the publication of incorrect data.

Given the complexity of automating the test, we have opted to describe a straightforward method for self-diagnosing the issue in the FAQ section. This approach allows users to assess their websites independently.

Google AdSense impact 

The ideal test to assess the impact on Google AdSense in this scenario would be to quantify the number of websites affected by the issue that display inappropriate ads, yet this is unfeasible.

Google AdSense utilises a variety of ad-matching techniques that go well beyond contextual targeting. This comprehensive approach offers a broad spectrum of ad targeting possibilities, ranging from matches based on content to ads chosen by advertisers for specific placements and those tailored to user interests.

While publishers can customise the types of ad categories permitted on their site, they have limited influence over the exact ads that are shown. Moreover, the presence of ads that seem to not align with the site content could be attributed to advertisers who have set overly broad or generic targeting criteria rather than an issue with the ad targeting system itself.

Due to this complexity, it’s not possible to determine whether a website is displaying incorrect ads based solely on the issue we discovered.

As an alternative method to estimate whether websites affected by the issue might see an impact on revenue, publishers can use the revenue calculator to get an idea of how much they should earn with AdSense.

The AdSense revenue calculator quantifies your potential annual revenue

In the calculator, you can select region, category, and monthly page views to get an estimate. The calculator itself emphasises that the estimate should only be used as a reference and that numbers may vary, but it could be useful to have an idea of the missing revenue if the numbers differ significantly from what publishers can see in the AdSense dashboard.

Google Ads impact

Google Ads is not directly affected by the issue. We have examined the Google Ads crawler’s requests, and for the tested websites, it is sending the correct User-Agent for all fetches. Nonetheless, advertisers may observe an impact of this issue on the quality of traffic, click-through rate (CTR), and, indirectly, on revenue.

Server Access Logs impact

Access logs are not commonly used by publishers or advertisers, yet they might be used by others for analysis or to establish a business case for technical modifications.

Using the methodology described in the ‘Reproduction and Isolation of the Issue’ section, we examined the access logs of multiple websites for different clients. Our findings revealed that, depending on the scale of the website, the percentage of misattributed ‘Mediapartners-Google’ fetches using the ‘Googlebot’ User-Agent can range from 20% to 70% of the total ‘Googlebot’ requests.

This substantial discrepancy can significantly distort any analysis based on the access logs.

Solutions and Recommendations

Best Practices for AdSense

While Google has confirmed it is a bug, they have not yet fixed it. Businesses can work around the issue by ensuring essential assets that are used to render a webpage, such as API endpoints, scripts, stylesheets, and images, are not blocked by robots.txt for either the “Mediapartners-Google” or the “Googlebot” User-Agent.
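
For example, on a site where review pages are kept out of Google Search but rendering depends on an API under those pages and on static assets (the paths below are hypothetical), rules along these lines keep the render-critical resources fetchable for both crawlers:

# Googlebot: review pages stay blocked, but render-critical resources remain fetchable
user-agent: Googlebot
disallow: /reviews
allow: /reviews/api/
allow: /assets/

# Mediapartners-Google
user-agent: Mediapartners-Google
allow: /

Because the longer allow rule takes precedence over the shorter disallow, Googlebot can still fetch the API responses needed during rendering, even though the review pages themselves remain blocked.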

Best Practices for Server Access Logs Analysis

To effectively understand the impact of issues within server access logs, it is crucial to employ a systematic approach to log analysis. The method outlined in the “Reproduction and Isolation of the Issue” section provides a simple way to filter the access logs, removing requests whose referring pages Googlebot cannot crawl. It is worth remembering that this approach offers only a partial view of the problem, covering only those pages blocked for Googlebot but not for Mediapartners-Google.

It is recommended that you use more advanced filtering techniques to fully understand the issue’s impact. For a detailed and comprehensive analysis of your server access logs, we encourage you to get in touch with us.

FAQ

What is the Google AdSense rendering bug?

The Google AdSense rendering bug is a technical issue in which Google AdSense may fail to render publishers’ webpages correctly, leaving it with an incomplete understanding of the content used for ad matching.

This problem arises from discrepancies in how pages are rendered when different robots.txt rules are applied to Googlebot and the AdSense bot (“Mediapartners-Google”). If these bots are treated differently by your site’s robots.txt, it can lead to incomplete rendering and poorly matched ads.

What steps can I take to diagnose the AdSense rendering issue on my site easily?

To diagnose the issue, review your robots.txt, checking for any directives that might block “Googlebot” from accessing certain URL paths on your site that are not similarly restricted for the AdSense bot (“Mediapartners-Google”).

If your website is using Client Side Rendering and/or the main content of the webpages is generated dynamically at rendering time using additional JavaScript requests, it’s crucial to ensure that both “Googlebot” and “Mediapartners-Google” have equal access to these JavaScript resources and the resultant content paths.

Discrepancies in access permissions between these bots can lead to issues and prevent proper rendering.

Are there any quick fixes or workarounds for the rendering bug?

A quick fix to address the rendering bug involves aligning the access rules for both “Googlebot” and the Google AdSense bot (“Mediapartners-Google”) in your robots.txt file.

Ensuring both bots have the same level of access to your site’s content can mitigate rendering issues. This approach helps ensure that even if requests are misattributed in server access logs, page rendering works as expected.

Are my Server Access Logs affected?

Server Access Logs play a crucial role in diagnosing and understanding how web crawlers and bots interact with your website. These logs contain detailed records of every request made to your server, including those by Googlebot and the AdSense bot (“Mediapartners-Google”).

Even if your website is not affected by the rendering bug, the logs may contain misattributed requests. The consequence of this misattribution would be an inaccurate count of Googlebot requests: you would see more requests than there actually are. In your analysis, the number of Googlebot requests would be the sum of actual Googlebot requests plus the misattributed Google AdSense requests that use Googlebot as the User-Agent.

Can I use IP ranges to filter the Server Access Logs?

Google’s documentation details the IP ranges for verifying Googlebot and other Google crawlers, organising these ranges into multiple files.

This categorisation seemingly simplifies filtering for our use case: Googlebot IPs are classified as “Common Crawlers”, while Google AdSense IPs are deemed “Special Case Crawlers”. Initially, one might expect to filter Googlebot requests using the googlebot.json IP ranges and exclude those listed in special-crawlers.json.
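
As a sketch of that approach (the file location and JSON structure are taken from Google’s crawler-verification documentation at the time of writing and may change), checking an IP against the published Googlebot ranges looks like this:

PYTHON
# A sketch of the IP-range check discussed above. The file location and format
# follow Google's crawler-verification documentation and may change.
import ipaddress
import json
from urllib.request import urlopen

GOOGLEBOT_RANGES = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def googlebot_networks():
    with urlopen(GOOGLEBOT_RANGES) as response:
        data = json.load(response)
    return [ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
            for p in data["prefixes"]]

def is_googlebot_ip(ip, networks):
    address = ipaddress.ip_address(ip)
    return any(address in net for net in networks if net.version == address.version)

networks = googlebot_networks()
print(is_googlebot_ip("66.249.64.4", networks))  # the IP from the log samples above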

However, the situation is more complex. The misattributed requests actually originate from genuine Googlebot IP addresses. It appears that the Google AdSense bot uses Googlebot’s infrastructure to crawl resources rather than just misattributing the User-Agent string.

How can I fix the Server Access Logs for my analysis?

The most straightforward approach to verifying and filtering Server Access Logs is examining the request referrer URLs. Specifically, for requests identified with a Googlebot User-Agent, the presence of a referrer page that is blocked to Googlebot but accessible to the Google AdSense bot (‘Mediapartners-Google’) could indicate incorrect attribution.

This technique, however, is limited in its applicability. It does not yield reliable insights for paths that are accessible to both Googlebot and the Google AdSense crawlers, as these scenarios do not facilitate clear differentiation based on robots.txt rules. To have a comprehensive filtering method, more advanced solutions, such as our Search Engine Rendering Monitoring tool, are required.

We would like to thank Aleyda Solis (LinkedIn, X/Twitter), Barry Adams (LinkedIn, X/Twitter), and Jes Scholz (LinkedIn, X/Twitter) for their thorough peer review of this article. Their experience and insightful suggestions have enhanced the depth and clarity of our analysis, allowing us to highlight key aspects and decisions made during the writing process for a more coherent and impactful delivery.