Introduction

Search engines have large but finite resources. Some websites, particularly ecommerce platforms, contain hundreds of millions of webpages. Our job is to make sure search engines discover new webpages, and significant modifications to existing ones, as fast as possible. Part of this challenge is understanding what search engines should not discover or download. We have covered in the past how to Protect Sensitive Information from Search Engines for security reasons, but we have not yet explored webpage resources with the aim of optimising compute power, in particular resources that do not alter the webpage content or structure.

As part of a content time-to-discover initiative focused on enhancing crawling and rendering efficiency for one of our clients, we identified a substantial volume of search engine requests directed at internal analytics and user-tracking endpoints.

The primary objective was to enhance efficiency by reallocating compute resources and time so search engines could crawl and discover new HTML webpages, eliminating unnecessary bot requests, particularly those associated with internal user analytics. The ultimate goal was greater visibility for revenue-generating pages.

After a discovery, prototype, and monitoring phase, we determined that the project delivered notable, beneficial results. A phased rollout across the remaining domains is underway and is planned for completion by year-end, with an expected reduction of 30 billion resource requests across all client market domains and a 22% increase in HTML webpage fetches.

Noteworthy technical findings:

  • Rendering time improvement: Assuming an average network request time of 200 milliseconds and roughly 80 million daily fetches of those endpoints, we project an estimated daily saving of 4,440 hours of network wait time across all of the concurrently active machines that search engines use to render pages and send requests (see the calculation after this list).
  • Robots.txt faster alternative: Further research on rendering optimisation revealed that, in certain scenarios, using a JavaScript function to block non-cacheable resource requests is more customisable and 100-200 times faster than robots.txt rules.
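For reference, the rendering-time saving above is derived as follows:

80,000,000 requests per day * 200 ms = 16,000,000 seconds ≈ 4,440 hours of network wait time per day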

How Search Engines render webpages at scale

Google introduced JavaScript rendering in 2014. Crawling and rendering at scale is a complex task that requires a substantial amount of computation and time. Recently, Google stated that optimisations related to Core Web Vitals metrics had saved over 10,000 years of compute time, both for users and for Google’s servers. The following information is based on official documentation, supplemented by insights gathered from our research.

Googlebot takes a URL from the crawl queue, crawls it, then passes it into the processing stage. The processing stage extracts links that go back on the crawl queue and queues the page for rendering. The page goes from the render queue to the renderer, which passes the rendered HTML back to processing, which indexes the content and extracts links to put them into the crawl queue.

Source: Understand the JavaScript SEO basics

Google’s Web Rendering Service (WRS) is a key element within Google’s infrastructure dedicated to rendering web pages, mainly for indexing purposes. It operates similarly to a web browser, utilising automated browsers to process HTML, CSS, and JavaScript of webpages. This enables Google to accurately access and index content, including dynamically generated content through JavaScript. This is crucial because traditional web crawlers might struggle to interpret dynamically generated content without proper rendering capabilities.

During the rendering process, the automated browser closely mimics the requests made by a standard web browser. However, it handles GET and POST requests differently.

  • HTTP GET requests: To optimise rendering, these requests might undergo caching with a Time-To-Live (TTL) determined by internal heuristics, rather than following the HTTP Cache-Control header.
  • HTTP POST requests: Unlike GET requests, POST requests can’t be cached and are dispatched with each rendering operation.

Hence, while the WRS handles content rendering triggered by both GET and POST requests, POST requests are notably less efficient due to their inability to be cached. Understanding this distinction is crucial in identifying the starting point for optimisation efforts within the rendering process.

Server Access Logs Analysis

To obtain a comprehensive understanding of the resource requests made during rendering, we relied on the server access logs as our primary source of information. This allowed us to gather thorough data about the requests sent while web pages were being rendered by a search engine.

Referer HTTP Request Header

Server access logs are full of insights for DevOps, infosec, and SEO teams. We filter the access logs through a validation service to ensure the traffic genuinely comes from Google. In constructing a webpage resource map, our focus turned towards an often overlooked HTTP header: the Referer HTTP Request Header.

This header indicates the webpage that initiated a resource request. When a browser accesses a webpage, requests for items like CSS, JavaScript, or APIs include a reference (the Referer HTTP Request Header) to the page of origin.

Significantly, major search engines also send the Referer HTTP Request Header when fetching resources during webpage rendering, which gives us valuable insight into which webpage triggered each resource request.
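As a brief aside on the validation step mentioned above: Google recommends confirming Googlebot traffic with a reverse DNS lookup followed by a forward lookup. Below is a minimal sketch of that check, assuming Node.js; the function name is illustrative.

const { reverse, lookup } = require('node:dns/promises');

// Returns true if the IP reverse-resolves to a Google hostname and that
// hostname resolves back to the same IP (forward-confirming reverse DNS).
async function isVerifiedGooglebot(ip) {
  try {
    const hostnames = await reverse(ip);
    const host = hostnames.find((h) => /\.(googlebot|google)\.com$/.test(h));
    if (!host) return false;
    const addresses = await lookup(host, { all: true });
    return addresses.some((a) => a.address === ip);
  } catch {
    return false; // no PTR record or lookup failure: treat as unverified
  }
}

// Example: isVerifiedGooglebot('66.249.64.4').then(console.log);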

Here is an example of Googlebot entries using the default Nginx Access log format:

'$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"'
66.249.64.4 - - [28/Jul/2023:04:10:34 +0000] "GET /product-xyz.html HTTP/1.1" 200 201 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

66.249.64.4 - - [28/Jul/2023:04:12:10 +0000] "POST /analytics HTTP/1.1" 200 56 "https://domain.com/product-xyz.html" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

66.249.64.4 - - [28/Jul/2023:04:15:10 +0000] "GET /style.css HTTP/1.1" 200 104 "https://domain.com/product-xyz.html" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

66.249.64.4 - - [28/Jul/2023:04:15:12 +0000] "GET /main.js HTTP/1.1" 200 312 "https://domain.com/product-xyz.html" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

66.249.64.4 - - [28/Jul/2023:04:18:30 +0000] "GET /category-abc.html HTTP/1.1" 200 201 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

66.249.64.4 - - [28/Jul/2023:04:18:35 +0000] "POST /analytics HTTP/1.1" 200 58 "https://domain.com/category-abc.html" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36"

Breaking down the first log entry:

  • 66.249.64.4 ($remote_addr): The IP address of the client making the request.
  • - - ($remote_user): Placeholders for the remote user identity and user authentication, which are not filled in for this request.
  • [28/Jul/2023:04:10:34 +0000] ($time_local): Timestamp indicating the date and time of the request in UTC.
  • "GET /product-xyz.html HTTP/1.1" ($request): The request method (GET), the URL path (/product-xyz.html), and the protocol (HTTP/1.1).
  • 200 ($status): The status code returned by the server, indicating a successful request.
  • 201 ($body_bytes_sent): The size of the response body sent to the client, in bytes (201 bytes in this case).
  • "-" ($http_referer): Indicates that no Referer information was provided with this request.
  • "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.1.2 Safari/537.36" ($http_user_agent): The User-Agent header, specifying that Googlebot made this request.

In summary, this log entry shows that Googlebot requested the “/product-xyz.html” page using the GET method. The request received a successful response (status code 200) with a response body of 201 bytes.

The second log entry reveals that a request was made by Googlebot to the “/analytics” endpoint, coming from a page named “product-xyz.html” on the domain “domain.com”. This shows how the Referer HTTP Request Header helps connect these log entries, where the first log entry is an HTML webpage request, and the second log entry involves a resource related to the previously requested webpage.

Webpage Resource Map

We can build the resource map by linking each webpage to its subresources. From the log entries above, we note the following requests for https://domain.com/product-xyz.html:

  • GET requests: /style.css, /main.js
  • POST requests: /analytics

For the webpage https://domain.com/category-abc.html, on the other hand, we have:

  • POST requests: /analytics

To create the resource map, assume subsequent resource requests after a webpage request are linked to that webpage. If the page is requested again, the following resources are associated with the second occurrence of the webpage.

Considering the following requests:

  1. GET /page-1.html
  2. GET /style.css (resource of /page-1.html)
  3. GET /main.js (resource of /page-1.html)
  4. POST /analytics (resource of /page-1.html)
  5. POST /graphql (resource of /page-1.html)
  6. POST /analytics (resource of /page-1.html)
  7. GET /page-2.html
  8. POST /analytics (resource of /page-2.html)
  9. POST /graphql (resource of /page-2.html)
  10. GET /page-1.html
  11. POST /analytics (resource of /page-1.html)

From the provided requests, the initial occurrence of /page-1.html connects to /style.css, /main.js, /analytics (x2), and /graphql resources. The subsequent /page-1.html request links solely to the /analytics resource.

A script can be crafted to parse the access logs and compute the average number of requests for each resource per webpage occurrence (a minimal sketch is included at the end of this section). The final output would look like this:

URL: "/page-1.html"
  - Resources:
    - URL: /style.css, 	Method: "GET",	Hits: 0.5
    - URL: /main.js, 	Method: "GET",	Hits: 0.5
    - URL: /analytics, 	Method: "POST",	Hits: 1.5
    - URL: /graphql, 	Method: "POST",	Hits: 0.5

URL: "/page-2.html"
  - Resources:
    - URL: /analytics, 	Method: "POST",	Hits: 1
    - URL: /graphql, 	Method: "POST",	Hits: 1

Moreover, it’s possible to aggregate by resource, providing a list of the most frequently requested resources:

Resource: "/analytics"
  - Method: POST
  - Hits: 4
  - Pages:
    - URL: /page-1.html
    - URL: /page-2.html

Resource: "/graphql"
  - Method: POST
  - Hits: 2
  - Pages:
    - URL: /page-1.html
    - URL: /page-2.html

Resource: "/style.css"
  - Method: GET
  - Hits: 1
  - Pages:
    - URL: /page-1.html

Resource: "/main.js"
  - Method: GET
  - Hits: 1
  - Pages:
    - URL: /page-1.html

To enhance the logic, consider normalising resource request URLs that point to the same URL paths but include appended query parameters (e.g., /analytics?track=1). Additionally, broaden the understanding of the website by categorising webpages into sections and page types for better aggregation.
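To make this concrete, here is a minimal sketch of such a parsing script, assuming Node.js, the Nginx combined log format shown earlier, HTML pages identified by their .html extension, and pre-validated Googlebot traffic; the access.log filename is illustrative.

const fs = require('node:fs');

// Combined log format: ip - user [time] "method url protocol" status bytes "referer" "user-agent"
const LINE_RE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]+" (\d{3}) \d+ "([^"]*)" "([^"]*)"$/;

const stripQuery = (url) => url.split('?')[0]; // normalise /analytics?track=1 -> /analytics

const pageOccurrences = new Map(); // page path -> number of times the page was fetched
const resourcesByPage = new Map(); // page path -> Map("METHOD path" -> raw hit count)

for (const line of fs.readFileSync('access.log', 'utf8').split('\n')) {
  const match = LINE_RE.exec(line.trim());
  if (!match) continue;
  const [, , , method, rawUrl, , referer, userAgent] = match;
  if (!/googlebot/i.test(userAgent)) continue; // keep only (already validated) Googlebot entries

  const url = stripQuery(rawUrl);
  if (url.endsWith('.html')) {
    pageOccurrences.set(url, (pageOccurrences.get(url) || 0) + 1);
  } else if (referer && referer !== '-') {
    const page = stripQuery(new URL(referer).pathname);
    const perPage = resourcesByPage.get(page) || new Map();
    const key = `${method} ${url}`;
    perPage.set(key, (perPage.get(key) || 0) + 1);
    resourcesByPage.set(page, perPage);
  }
}

// Print the average resource hits per page occurrence, as in the example output above.
for (const [page, resources] of resourcesByPage) {
  const occurrences = pageOccurrences.get(page) || 1;
  console.log(`URL: "${page}"`);
  console.log('  - Resources:');
  for (const [key, hits] of resources) {
    const [method, url] = key.split(' ');
    console.log(`    - URL: ${url}, Method: "${method}", Hits: ${hits / occurrences}`);
  }
}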

Robots.txt approach

After mapping the resources of webpages, the focus shifted to analysing fetched resources during each rendering. Specifically, attention was given to recurring POST requests, pinpointing endpoints exclusively used for user analytics and deemed unnecessary for search engines. To restrict access to these URLs, the decision was made to employ robots.txt.

It’s crucial to highlight that certain web crawlers, such as those used by Google Ads and Google AdSense, ignore the universal user agent (*) group outlined in robots.txt. To effectively restrict access to the designated URLs for these crawlers, explicit rules must be defined.
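For illustration, assuming the /analytics endpoint from the earlier examples, the rules might look like this (which crawlers must be named explicitly depends on the bots you target):

User-agent: *
Disallow: /analytics

# Crawlers that ignore the wildcard group must be addressed explicitly
User-agent: AdsBot-Google
User-agent: Mediapartners-Google
Disallow: /analytics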

Uncovering limits and Inefficiencies of the robots.txt approach

In our project retrospective, while reviewing successes, areas for improvement, and ideas applicable to future client work, two potential areas for enhancement emerged:

  • Accuracy of Resource Mapping: Acknowledging the necessity for more precise webpage resource mapping based on server access logs.
  • Robots.txt Limitations: Noting limitations or challenges associated with using robots.txt to restrict access, prompting exploration of potential alternative approaches.

Accuracy of a Webpage Resource Map

In earlier sections, we covered how to construct a webpage resource map and its application in projects. While the server access logs approach yields promising results, it also has limitations:

  • Limited Visibility: Access to logs for third-party services is generally not possible unless the third-party service is owned by the company.
  • Multiple Renderings: A webpage might render multiple times without direct individual requests each time.
  • Caching of GET Requests: GET requests can be cached, affecting the visibility of resource accesses.

To address these observability constraints, we developed a plugin for our Web Rendering Monitoring tool, enabling a comprehensive map of the resources requested during search engine rendering. For further information about our Google Web Rendering Monitoring tool, schedule a call.

Robots.txt limitations

In this section, we explore the limitations of rule selection within robots.txt directives. Additionally, we estimate the latency search engines incur when checking URLs against robots.txt rules while crawling and rendering pages.

Rule selection limitation

Robots.txt rules exhibit a certain level of rigidity. This lack of flexibility means that while it is possible to specify which URLs to block or allow, it is not possible to selectively apply these rules with nuanced control. As a result, a URL might be either blocked or allowed without the capacity to create exceptions or conditionally permit access in specific cases.

Consider scenarios where a single endpoint serves multiple functions, such as providing analytics data alongside A/B testing experiments. With robots.txt, the only options are to allow or disallow access to the entire endpoint. For instance, there may be a need to grant access exclusively for the A/B testing experiments while blocking the analytics component, but robots.txt cannot make such fine-grained distinctions.

Latency estimation

During the crawling or rendering of pages, search engines undergo a process to determine if a URL is fetchable, involving internal calls and the evaluation of robots.txt rules (which might be cached).

In the theoretical rendering infrastructure of a search engine like Google’s WRS, the process occurs within a distributed system consisting of specialised microservices, each assigned specific tasks. It’s important to note that this description is hypothetical, relying on external observations and assumptions, as the exact structure of Google’s WRS is undisclosed and proprietary.

The sequence of events during page rendering may follow this outline:

  • Browser Rendering: The initiation of the page rendering process by the browser instance.
  • Service Wrapper: An internal mediator service intercepts browser-initiated network requests.
  • Robots.txt Check: The service wrapper forwards the request to the Robots.txt service, examining compliance with directives in the website’s robots.txt file.
  • Robots.txt Disallow: In case of a request being blocked based on robots.txt directives, the service wrapper communicates an error response, denying access to the resource, back to the browser instance.
  • Cache Service Check: If instead, the request aligns with robots.txt directives, the service wrapper checks the resource’s availability and eligibility for caching through a dedicated cache service. If ineligible, the service wrapper proceeds accordingly.
  • Crawler Resource Queue: Resources lacking a cached version or requiring re-crawling are queued within a crawler resource queue by the service wrapper for further processing.
  • Response: The crawler retrieves the resource, and the data is subsequently dispatched back to the browser instance.

Figure: step-by-step process flowchart of the hypothetical flow above, depicting Google WRS browser instance rendering, service wrapper interception of network requests, robots.txt compliance check, handling of disallowed requests, cache service evaluation, resource queueing, crawler retrieval, and dispatching of data back to the browser instance.

If we accept the presented logic as reasonably feasible, a hypothesis emerges: to verify if a request is allowed or not, adhering to the robots.txt directives, a minimum of two internal requests within the same data centre might be necessary.

The process could involve:

  • First Internal Request: The browser or rendering instance triggers a request for a particular resource, forwarding it to the internal service wrapper.
  • Second Internal Request: Another internal service, potentially a specialised microservice specifically handling robots.txt, is then engaged to examine the robots.txt directives in relation to the URL of the requested resource.

The round-trip time (RTT) within the same data centre, i.e. the time for a request and its response to traverse the network, was previously estimated at around 500 µs based on the well-known Latency Comparison Numbers. However, recent research [1] [2] suggests a revised estimate of 50-100 µs per round trip, indicating a faster network response time.

In this calculation we can ignore the browser instance’s request handling and the robots.txt service’s own processing, which operate on much smaller (nanosecond) scales. Rounding off and accounting for two data-centre round trips, we can estimate that Google WRS needs approximately 100-200 µs to verify each webpage resource request during page rendering.

With the provided data, accounting for two round trips and assuming 40 million daily fetches blocked by robots.txt during webpage rendering for a sizable corporate site, Google’s infrastructure would collectively spend approximately the following time waiting each year:

40,000,000 requests per day * 100 µs = 4,000 seconds per day * 365 days ≈ 16.9 days per year

40,000,000 requests per day * 200 µs = 8,000 seconds per day * 365 days ≈ 33.8 days per year

Discovering New Approaches

In this section, we explore alternative approaches designed to address the limitations of traditional robots.txt directives. All of them can only be implemented where there is internal advocacy emphasising the advantages and benefits of Search and Technical SEO across the company.

Server-side approach

The primary solution that comes to mind is a server-side implementation. Various methods exist to differentiate between user and search engine requests [1][2][3], enabling the delivery of tailored content without altering the actual webpage content.

This approach closely resembles dynamic rendering, allowing websites to offer different content to search engines, especially beneficial when JavaScript-generated content isn’t accessible to them.
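As a minimal sketch of this idea, assuming a Node.js/Express server, an illustrative bot regex, and a hypothetical /assets/analytics.js snippet (a production setup would reuse whichever bot-detection method the site already trusts):

const express = require('express');
const app = express();

// Illustrative list of rendering bots; not exhaustive.
const BOT_RE = /googlebot|bingbot|yandex|baiduspider|applebot/i;

app.use((req, res) => {
  const isSearchEngine = BOT_RE.test(req.get('user-agent') || '');
  // Same page content for everyone; only the analytics snippet is dropped for rendering bots.
  const analyticsTag = isSearchEngine ? '' : '<script src="/assets/analytics.js" async></script>';
  res.send(`<!doctype html>
<html>
  <head><title>Product XYZ</title>${analyticsTag}</head>
  <body><!-- page content --></body>
</html>`);
});

app.listen(3000);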

While this strategy optimises content for search engine rendering, it brings potential concerns. Depending on the website infrastructure, selectively rendered webpages can make it harder to maintain consistent functionality across user experiences. Moreover, cache management can become problematic because two distinct versions of each webpage are generated.

Edge workers

Edge workers are another option for this task. They can dynamically modify content, presenting versions tailored for search engines while excluding elements like analytics calls.
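For example, here is a minimal sketch assuming a Cloudflare Worker sits in front of the origin; the bot regex and the script selector are illustrative.

export default {
  async fetch(request) {
    const response = await fetch(request);

    const isSearchEngine = /googlebot|bingbot/i.test(request.headers.get('user-agent') || '');
    const contentType = response.headers.get('content-type') || '';
    // Only rewrite HTML served to rendering bots; everything else passes through untouched.
    if (!isSearchEngine || !contentType.includes('text/html')) return response;

    // Strip the analytics script tag from the HTML before it reaches the bot.
    return new HTMLRewriter()
      .on('script[src*="analytics"]', { element(el) { el.remove(); } })
      .transform(response);
  },
};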

However, using edge workers for dynamic content alterations poses challenges. Content Delivery Networks (“CDNs”) act as the main entry point for traffic, so any modifications through edge workers directly affect the content delivered to all users.

Managing these edge workers within the CDN infrastructure and maintaining dynamic content adjustments adds complexity. This poses a potential risk, as misconfiguration or errors in content adaptation can impact the entire user base.

JavaScript function approach

Choosing JavaScript to block requests instead of relying solely on robots.txt might seem counterintuitive initially, but it brings enhanced flexibility and precision.

JavaScript allows dynamic and conditional request blocking, offering detailed control over which requests to block based on specific criteria or user interactions. It enables a single endpoint to serve dual purposes by dynamically allowing or inhibiting requests irrelevant to search engines.

Integrating this feature into the JavaScript library responsible for request creation gives developers control. This promotes efficient management and swift adaptation to evolving requirements or scenarios, moving beyond rigid directives like robots.txt.

Web Performance Impact Estimation

Implementing server-side or edge worker approaches may smoothly integrate into user experiences, but adding JavaScript to a webpage demands careful consideration.

Introducing functions within your JavaScript library means this code isn’t only executed by Search Engines but also by regular users. Therefore, evaluating the implications on performance and user experience becomes crucial when incorporating such JavaScript functions. While using a Real User Monitoring (“RUM”) tool remains the best way to understand the impact on user experience, making estimations is a valuable initial step. It offers a useful starting point to grasp potential effects before conducting thorough testing with RUM tools.

There are various ways to implement this JavaScript function. For simplicity, and because several people may need to update the list of bots over time, in this article we choose a more readable regex implementation:

// This function returns true if the regex matches the userAgent.
// Can be used to check if a request should be sent or not.
// Regex should include only bots that render web pages.

function isBot(userAgent) {
  const re = /googlebot|adsbot-google|mediapartners-google|bingbot|adidxbot|bingpreview|microsoftpreview|yandex|baiduspider|applebot|yeti|sogou|ia_archiver|ahrefs|botify|contentking|deepcrawl|fandango|jetoctopus|oncrawl|rytebot|screaming frog|semrush|claritybot|sitebulb/i;
  return re.test(userAgent);
}

// Constant shared across the JS library (note: it cannot reuse the function's name)
const isSearchBot = isBot(window.navigator.userAgent);

if (!isSearchBot) {
  // insert scripts here
}

In our tests, evaluating the regex takes on the order of 250 ns and the boolean IF statement on the order of 100 ns, so the time needed for this task might appear trivial. However, within a browser handling numerous concurrent operations, real-world performance may be slightly worse.
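These figures can be sanity-checked with a rough micro-benchmark in the browser console; results vary by machine and browser, so treat them as orders of magnitude only.

const ua = navigator.userAgent;
const runs = 1_000_000;
const start = performance.now();
for (let i = 0; i < runs; i++) isBot(ua);
const elapsedMs = performance.now() - start;
console.log(`~${((elapsedMs / runs) * 1e6).toFixed(0)} ns per isBot() call`);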

Taking a more conservative stance, an estimation of 1µs seems reasonable.

Comparing this to the robots.txt evaluation estimated earlier (in the range of 100-200 µs), the JavaScript function appears to be approximately 100-200 times faster.

With these figures, we anticipate a faster rendering process for search engines with minimal impact on user rendering. However, it’s advisable to verify this estimation with RUM data for a more comprehensive understanding.

NextJS’s approach to avoiding prefetching links

An elegant demonstration of using a JavaScript function to halt unnecessary requests is showcased in the canary branch of the NextJS framework pull request 40435.

Through the <Link> component, NextJS extends the functionality of the HTML <a> element, enabling developers to incorporate prefetching and client-side navigation between routes. By default, links are prefetched in the background as they enter the viewport, whether initially visible or revealed on scroll.

However, these prefetches might not serve the purpose of a search engine, which renders a single page without navigating through links. This can force the search engine to download data that isn’t useful for the current rendering.

In this scenario, using robots.txt to block these URLs isn’t feasible since these prefetched links are crucial parts of the website architecture. Blocking them via robots.txt would exclude them from crawling, which isn’t a viable solution.

To address this, the NextJS team introduced a function to prevent link prefetching when a search engine or tool renders a webpage. Integrated within a widely used framework’s standard library, this function seamlessly removes numerous requests without requiring developers to find alternative solutions.
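As a simplified illustration of the idea (not the actual NextJS implementation, which lives inside the framework’s <Link> internals), prefetching can simply be skipped when a known rendering bot is detected, reusing the isBot() helper shown earlier; the maybePrefetch name is illustrative.

function maybePrefetch(href) {
  // Bots render a single page and never navigate, so prefetching only wastes their fetches.
  if (isBot(window.navigator.userAgent)) return;

  const link = document.createElement('link');
  link.rel = 'prefetch';
  link.href = href;
  document.head.appendChild(link);
}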

Conclusion

After thoroughly exploring methodologies to optimise webpage resource requests and enhance crawling efficiency, the expected reduction of approximately 30 billion requests annually demonstrates the significant impact of thoughtful optimisation in web crawling and rendering.

By delving into server access logs and exploring alternative approaches like JavaScript functions in lieu of robots.txt, the research uncovered various ways to enhance search engine rendering efficiency. Rather than a singular solution, the integration of diverse methodologies, alongside a profound understanding of search engine behavior, presents a comprehensive approach to effective resource and function management across websites.

In conclusion, this article underscores the importance of continual exploration and adaptation in the realm of web crawling and rendering. For businesses who want to improve their search engine optimisation crawling and rendering capabilities, please schedule a call with us.