Web Crawling Research and Development

Most of our research and development never gets to see the light of day. We help our enterprise clients innovate in competitive spaces by taking on challenges. Some research projects run for days, while others can take months. We are making more of our internal research publicly available. We are looking to support marketing technology and demand generation teams who lean into challenges. Get in touch with Ryan Siddle (Email, Twitter or LinkedIn).

Abstract

Web crawling tools aim to replicate search engines’ crawling and rendering behaviours by implementing and using web rendering systems. This offers insights into what search engines might see when they are crawling and rendering web pages.

While there is no defined standard for an automated rendering process, popular search engines (e.g. Google, Bing, Yandex) render pages in isolated rendering sessions. This way, they avoid having the rendering of one web page affect the functionality or the content of another. Isolated rendering sessions should have isolated storage and prevent cross-tab communication.

Web crawling tools which do not have isolated rendering sessions might render some web page elements inaccurately, which has three main implications:

  • Lack of data integrity
  • The rendered pages are not an accurate representation of what search engines will render and use
  • Developers may waste time (and money) investigating issues which are not present

This research evaluates 14 web crawling tools’ rendering session isolation capabilities across a series of six tests to identify any potential issues and highlight improvement opportunities.

Session Isolation Primer

As new web rendering patterns gained traction, the web moved from static HTML pages to more complex ways of rendering content. With the widespread use of rendering patterns such as Client-Side Rendering and Progressive Hydration, search engines were effectively forced to start rendering web pages in order to retrieve almost as much content as users get in their browsers.

To do so, they developed their own web rendering systems: software capable of rendering a large number of web pages using automated browsers. Trying to keep up with the evolution of the web and to mimic search engines’ capabilities, many web crawling tools also started to build rendering systems.

Rendering is hard. There is no industry standard for rendering pages, which means that not even leading search engines such as Google, with its Web Rendering Service, are doing it in the “correct” way. Each rendering system is built to serve specific use cases, which results in inevitable tradeoffs.

It’s worth noting that in web rendering systems there are many non-JavaScript factors that can influence the rendering of a page, such as network errors, timeouts, robots.txt rules, HTTP/HTTPS mixed content, caching mismatch errors, and CORS errors. Reducing everything to “Rendering = JS” is therefore a significant mistake.

Research Context

At Merj we’ve been happy users of many web crawling tools. At the same time, for more specific or complex needs, we have been building our own web crawling systems for use cases such as custom data sources in complex data pipelines for enterprise companies.

The starting point of this research was a recent project that required us to provide assurances to a legal and compliance team about the data quality and integrity of a data source (rendered pages). These were to be ingested into a machine learning model.

In addition to other checks present in our data integrity validation process, we tested the output of multiple web crawling tools. We found some unexpected values which varied across tools. This research is the result of the analysis carried out to understand the reasons for the differences between various web crawlers’ outputs.

What is Session Isolation?

When rendering a page in an isolated rendering session, the page must not be able to access any data from previous rendering sessions, nor be influenced by the rendering of other pages. A web crawling tool with session isolation issues might create additional HTML content or new – i.e. not present or wanted – dynamic links. This additional content and these links won’t be present in search engines’ rendering output, which creates a risky situation when analysing the outcome of a crawl/render process.

This is similar to the concept of “stateless” as used for web crawlers, where all fetches are completed without reusing cookies and without keeping any page-specific data in memory.

This issue is difficult to identify, and it’s worth mentioning that it is not limited to web crawlers. All systems that use browser-based functionality might be affected, such as dynamic rendering services, web performance analysis tools, and CI/CD pipeline tests.

There are some cases where you do need to keep data for specific tests, for example testing repeated views of a page to understand the web performance of returning visitors, but that option should be explicit and intentional, not the side effect of a hidden problem.

Session Isolation in the Wild

To better understand the possible implications of a session isolation issue, we need to look at websites that offer personalisation features based on navigation history.

A clear example can be found on the Ikea.com website. After visiting a few product pages, in addition to the “Related products”, “You might like”, and “Goes well with” sections, you can see an extra “Your recently viewed products” box at the bottom of a product page.

"Your recently viewed products" feature from the Ikea.com website
“Your recently viewed products” feature from the Ikea.com website

This additional “box” is not present when you first visit these websites with a “clean” browser, but if you keep navigating, product pages are then populated with your product view history. We can find similar features on the asos.com and adidas.com websites:

"Recently Viewed" feature from the Asos.com website
“Recently Viewed” feature from the Asos.com website
"Recently Viewed Items" feature from the Adidas.com website
“Recently Viewed Items” feature from the Adidas.com website

For all three examples above, the “Recently Viewed” feature is implemented by saving the recently viewed products in the browser’s storage. Similar features are present on a huge number of websites across the web.
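
As a simplified illustration of how such a feature could work, the sketch below stores viewed products in LocalStorage and only renders the box when history already exists. The storage key, selector, and URL pattern are hypothetical, not taken from any of the sites above:

// Hypothetical "Recently Viewed" implementation using LocalStorage.
const STORAGE_KEY = 'recentlyViewedProducts';
const currentProductId = document.querySelector('[data-product-id]')?.dataset.productId;

// Load the history written by previously viewed product pages.
const viewed = JSON.parse(localStorage.getItem(STORAGE_KEY) || '[]');

// Render the "Recently viewed" box only if there is existing history.
if (viewed.length > 0) {
  const links = viewed
    .map((id) => `<li><a href="/products/${id}">Product ${id}</a></li>`)
    .join('');
  document.body.insertAdjacentHTML('beforeend', `<ul class="recently-viewed">${links}</ul>`);
}

// Add the current product to the history for the next page view.
if (currentProductId && !viewed.includes(currentProductId)) {
  localStorage.setItem(STORAGE_KEY, JSON.stringify([currentProductId, ...viewed].slice(0, 10)));
}

A crawler rendering many of these pages in the same, non-isolated session would accumulate history and inject extra links into every subsequent page.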

Using the Adidas “Recently Viewed Items” feature as an example, a web crawling tool affected by the issue – i.e. without an isolated rendering session – might render product pages that link to other pages simply because those pages were crawled earlier and stored in memory. This produces a considerable percentage of “ghost links” on the pages, visible only to that specific web crawler.

Rendering process without session isolation

Search engines and web crawlers that implement session isolation correctly won’t have this additional content on the page, so the final results will be different.

Rendering process with session isolation

Looking at the results of both processes, we can clearly see the differences in the final rendered HTML:

Results of rendering without session isolation
Results of rendering with session isolation

Also, depending on the crawling/rendering order, a web crawling tool with session isolation issues might create arbitrary HTML content or links. In the example below, it’s clear that starting the crawling/rendering process from PAGE 1 produces a different result than starting from PAGE 3.

Crawling/Rendering process starting from PAGE 1
Crawling/Rendering process starting from PAGE 3

Solving Session Isolation

When you render a sequence of pages from the same domain in a browser, all of those pages can access the same storage and can even communicate with each other. It’s worth remembering that closing the browser doesn’t delete this data: when you open it again, all of the information is still there.

If you’re building a web rendering system, this is a problem.

Partial and Incorrect Solutions

There are many partial or incorrect ways of tackling this issue for web crawling purposes. Some of them are:

  1. Clearing cookies after the rendering of a page. The problem here is that cookies are not the only Web API that can store data.
  2. Opening and closing the browser for each page you want to render, manually deleting the folders where the browser stores data. This option is not efficient at all.
  3. Using an incognito profile hides some possible pitfalls as well. Even though incognito profile data is stored in RAM and should be discarded when you close the window, pages rendered within the same incognito profile might share storage, and cross-tab communication is possible. This option would only solve our problem if, again, we don’t render pages in parallel and we start/stop the browser for each page.

The Optimal Solution

Without changing the browser’s source code, session isolation can be solved on a headless Chromium browser by using Browser Contexts. Introduced at BlinkOn 6, a Browser Context is an efficient way to achieve correct session isolation. Every Browser Context session runs in a separate renderer process, isolating storage (cookies, cache, local storage, etc.) and preventing cross-tab communication.

Rendering a single page per Browser Context, closing it when the rendering finishes, and then opening a new Browser Context for the next page guarantees isolated rendering sessions without the need to restart the browser each time.

Image from BlinkOn 6: “Headless Chrome”

Using this solution has a minimal effect on a web crawler’s performance, and the slight degradation can easily be offset by many other methods of improving performance without affecting the validity of the output. In most real-world cases, the majority of web crawling tool users would not trade the data integrity lost to session isolation issues for an overall performance difference of a few seconds.

Browser Context

In the testing section, you’ll find a few references to Browsing Contexts. It is important to differentiate between Browser Context and Browsing Context.

  • A Browser Context, in the case of headless Chromium, corresponds to a special user profile (similar to the Incognito Profile).
  • A Browsing Context is an environment in which a browser displays a document to users. This can be a tab or a window but also part of a page such as an Iframe. Each Browsing Context has its specific origin.

Thanks to Sami Kyostila for the clarifications about the difference between Browser Context and Browsing Context.

Opening and using a new Browser Context is actually quite simple. This is the example used in the official Puppeteer documentation:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  // Create a new incognito browser context.
  const context = await browser.createIncognitoBrowserContext();

  // Create a new page in a pristine context.
  const page = await context.newPage();

  // Do stuff
  await page.goto('https://example.com');
})();
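
Building on that example, a minimal sketch of an isolated rendering loop could look like the following. This is not the implementation of any particular vendor; the URL list, wait condition, and logging are illustrative assumptions:

const puppeteer = require('puppeteer');

(async () => {
  const urls = ['https://example.com/page-1', 'https://example.com/page-2'];
  const browser = await puppeteer.launch();

  for (const url of urls) {
    // A fresh context has no cookies, storage, or workers from previous renders.
    const context = await browser.createIncognitoBrowserContext();
    const page = await context.newPage();

    await page.goto(url, { waitUntil: 'networkidle0' });
    const html = await page.content();
    console.log(url, html.length); // hand the rendered HTML to the crawl pipeline here

    // Closing the context discards all of its storage; the browser stays open for the next URL.
    await context.close();
  }

  await browser.close();
})();

Because only the Browser Context is recreated for each page, the cost of launching Chromium is paid once per crawl rather than once per URL.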

Additional documentation and examples on the use of Browser Context can be found in the official Puppeteer documentation.

Methodology

We set up a testing environment with 1,000 pages that try to communicate with each other using shared storage and cross-tab communication. Rendering 1,000 pages increases the chances of two or more pages being rendered in parallel on the same machine or by the same browser. Using fewer pages may cause false negatives if the tested web rendering system uses a high number of machines in parallel.

These pages were tested against six tests which can either result in a pass or fail, split across two categories:

1) Storage Isolation
2) Cross Tab Communication

Test code

If you want to check if a tool not present in the list is affected by session isolation issues or if you want to verify one of the tools we tested, you can replicate the testing environment using the following code.

Download the Docker Crawl Testing Environment

GitHub link: https://github.com/merj/test-crawl-session-isolation

Test Descriptions and Results

The results in the tables below refer to our initial tests, the majority of which were completed between July and August 2022. In the meantime, some of the web crawler vendors have fixed the problem. For the current status, refer to the “Web crawlers’ status” appendix.

It’s worth mentioning that Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them.

From the date of the last test, we provided the platforms with a 30-day grace period to validate or fix any session isolation issues before publishing this article.

The tables below show each test we performed along with the result of each web crawler (listed A-Z).

Storage Isolation

  1. Cookie
  2. IndexedDB
  3. LocalStorage
  4. SessionStorage

Cross-tab communication

  1. Broadcast Channel
  2. Shared Worker

Storage Isolation

Storage isolation tests focus on Web APIs that save or access data in the browser’s storage. The goal of each test is to check whether a page can access data saved by previous or parallel page renderings.

1. Cookie

Cookies need no introduction. The Cookie interface lets you read and write small pieces of information in the browser storage. Cookies are mainly used for session management, personalisation, and tracking.

Reference: https://developer.mozilla.org/en-US/docs/Web/API/Document/cookie

Test explanation: When the rendering starts, the page creates and saves a cookie, then checks whether there are cookies saved by other pages.

Fail criterion: If there are cookies other than the ones created for the rendered page, the test fails.
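
For illustration, the cookie check on each test page could be as simple as the sketch below. This is a simplified outline, not the exact code from the test repository, and the naming scheme is hypothetical:

// Save a cookie that identifies this page.
const pageId = document.title; // e.g. "page-42"
document.cookie = `page_${pageId}=visited; path=/`;

// Read back all cookies; anything not created by this page indicates a leak.
const foreignCookies = document.cookie
  .split(';')
  .map((c) => c.trim())
  .filter((c) => c && !c.startsWith(`page_${pageId}=`));

if (foreignCookies.length > 0) {
  // Expose the leaked values in the rendered HTML so they show up in the crawl output.
  document.body.insertAdjacentHTML('beforeend', `<p>FAILED: ${foreignCookies.join(', ')}</p>`);
}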

Web Crawler | Cookie
Ahrefs | FAILED
Botify | PASSED
ContentKing | PASSED
FandangoSEO | FAILED
JetOctopus | PASSED
Lumar (formerly Deepcrawl) | PASSED
Netpeak Spider | PASSED
OnCrawl | PASSED
Ryte | PASSED
Screaming Frog | PASSED
SEO PowerSuite WebSite Auditor | PASSED
SEOClarity | FAILED
Sistrix | PASSED
Sitebulb | FAILED

2. IndexedDB

IndexedDB is a transactional database system that lets you store and retrieve objects from Browser memory.

Reference: https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API

Test explanation: When the rendering starts, the page creates or connects to an IndexedDB database. It then creates and saves a record in the database, and finally reads whether there are records saved by other pages.

Fail criterion: If there are records other than the ones created for the rendered page, the test fails.
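
A simplified sketch of how such a check could be written with the IndexedDB API is shown below (the database, store, and record names are hypothetical, not the actual test code):

const pageId = document.title;
const request = indexedDB.open('isolation-test', 1);

request.onupgradeneeded = () => {
  request.result.createObjectStore('visits', { keyPath: 'id' });
};

request.onsuccess = () => {
  const db = request.result;
  const store = db.transaction('visits', 'readwrite').objectStore('visits');

  // Save a record for this page, then read back every record in the store.
  store.put({ id: pageId, ts: Date.now() });
  const readAll = store.getAll();

  readAll.onsuccess = () => {
    // Any record not created by this page means storage leaked between renders.
    const foreign = readAll.result.filter((record) => record.id !== pageId);
    if (foreign.length > 0) {
      document.body.insertAdjacentHTML(
        'beforeend',
        `<p>FAILED: ${foreign.map((record) => record.id).join(', ')}</p>`
      );
    }
  };
};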

Web Crawler | IndexedDB
Ahrefs | FAILED
Botify | PASSED
ContentKing | FAILED
FandangoSEO | FAILED
JetOctopus | FAILED
Lumar (formerly Deepcrawl) | PASSED
Netpeak Spider | FAILED
OnCrawl | PASSED
Ryte | FAILED
Screaming Frog | FAILED *
SEO PowerSuite WebSite Auditor | FAILED
SEOClarity | PASSED
Sistrix | PASSED
Sitebulb | FAILED

*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them

3. LocalStorage

LocalStorage is a mechanism that uses the Web Storage API by which browsers can store key/value pairs. Data persists when the browser is closed and reopened.

Reference: https://developer.mozilla.org/en-US/docs/Web/API/Window/localStorage

Test explanation: When the rendering starts, the page creates and saves a data item in Local Storage, then checks whether there are data items saved by other pages.

Fail criterion: If there are data items other than the ones created for the rendered page, the test fails.
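
A simplified sketch of this check using the Web Storage API is shown below (the key naming scheme is hypothetical, not the actual test code):

const pageId = document.title;
localStorage.setItem(`page_${pageId}`, String(Date.now()));

// Collect any keys written by other pages.
const foreignKeys = Object.keys(localStorage).filter(
  (key) => key.startsWith('page_') && key !== `page_${pageId}`
);

if (foreignKeys.length > 0) {
  document.body.insertAdjacentHTML('beforeend', `<p>FAILED: ${foreignKeys.join(', ')}</p>`);
}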

Web Crawler | LocalStorage
Ahrefs | FAILED
Botify | PASSED
ContentKing | FAILED
FandangoSEO | FAILED
JetOctopus | FAILED
Lumar (formerly Deepcrawl) | PASSED
Netpeak Spider | FAILED
OnCrawl | PASSED
Ryte | FAILED
Screaming Frog | FAILED *
SEO PowerSuite WebSite Auditor | FAILED
SEOClarity | FAILED
Sistrix | PASSED
Sitebulb | FAILED

*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them

4. SessionStorage

SessionStorage is a mechanism that uses the Web Storage API by which browsers can store key/value pairs. Data lasts as long as the tab or the browser is open, and survives page reloads and restores.

Reference: https://developer.mozilla.org/en-US/docs/Web/API/Window/sessionStorage

Test explanation: When the rendering starts, the page creates and saves a data item in Session Storage, then checks whether there are data items saved by other pages.

Fail criterion: If there are data items other than the ones created for the rendered page, the test fails.
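
The sketch is essentially the same as the LocalStorage one, only targeting sessionStorage, which is scoped to a single tab (again, the key naming scheme is a hypothetical illustration):

const pageId = document.title;
sessionStorage.setItem(`page_${pageId}`, String(Date.now()));

// Any other page's key in sessionStorage means the tab was reused between renders.
const foreignKeys = Object.keys(sessionStorage).filter(
  (key) => key.startsWith('page_') && key !== `page_${pageId}`
);

if (foreignKeys.length > 0) {
  document.body.insertAdjacentHTML('beforeend', `<p>FAILED: ${foreignKeys.join(', ')}</p>`);
}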

Web Crawler | SessionStorage
Ahrefs | PASSED
Botify | PASSED
ContentKing | PASSED
FandangoSEO | FAILED
JetOctopus | PASSED
Lumar (formerly Deepcrawl) | PASSED
Netpeak Spider | FAILED
OnCrawl | PASSED
Ryte | PASSED
Screaming Frog | PASSED
SEO PowerSuite WebSite Auditor | FAILED
SEOClarity | PASSED
Sistrix | PASSED
Sitebulb | PASSED

Cross-tab communication

Cross-tab communication tests focus on Web APIs that send or receive data. The goal of each test is to find out whether, during rendering, a page can receive messages from other pages rendered in parallel.

5. Broadcast Channel

The Broadcast Channel API allows communication between Browsing Contexts such as windows, tabs, frames, iframes, and workers of the same origin.

A client can join a channel by specifying a name: if the channel exists, the client joins it; if it doesn’t exist, a new one is created. On this channel, a client can send messages and receive all messages sent by the other clients connected to the channel.

Reference: https://developer.mozilla.org/en-US/docs/Web/API/Broadcast_Channel_API

Test explanation: When the rendering starts, the page connects to the channel and then starts sending its page title as a message to the channel. If other connected pages are sending messages through the channel, the page receives and saves them.

Fail criterion: If the rendered page gets even a single message from the Broadcast Channel sent by other pages, the test fails.
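
A simplified sketch of this check is shown below (the channel name and markup are hypothetical; note that a BroadcastChannel object never receives the messages it posted itself, so anything received must come from another page):

const pageId = document.title;
const channel = new BroadcastChannel('isolation-test');

// Record every message sent by other pages rendered in parallel.
channel.onmessage = (event) => {
  document.body.insertAdjacentHTML('beforeend', `<p>FAILED: ${event.data}</p>`);
};

// Periodically announce this page on the channel.
setInterval(() => channel.postMessage(pageId), 500);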

Web Crawler | Broadcast Channel
Ahrefs | PASSED
Botify | PASSED
ContentKing | PASSED
FandangoSEO | PASSED
JetOctopus | PASSED
Lumar (formerly Deepcrawl) | PASSED
Netpeak Spider | FAILED
OnCrawl | PASSED
Ryte | PASSED
Screaming Frog | PASSED
SEO PowerSuite WebSite Auditor | PASSED
SEOClarity | PASSED
Sistrix | PASSED
Sitebulb | FAILED

6. Shared Worker

A Shared Worker is a worker that allows communication between Browsing Contexts such as windows, tabs, frames, iframes, and workers on the same origin.

Reference: https://developer.mozilla.org/en-US/docs/Web/API/SharedWorker

Test explanation: When the rendering starts, the page connects to the Shared Worker, then starts sending messages to the worker and listens for messages from other pages sent through the worker.

Fail criterion: If the rendered page gets even a single message from the Shared Worker sent by other pages, the test fails.
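
A simplified sketch of this check is shown below. The worker script name and its relay logic are hypothetical illustrations of the idea, not the actual test code:

// worker.js (hypothetical): relays every message to all connected pages.
//
//   const ports = [];
//   onconnect = (event) => {
//     const port = event.ports[0];
//     ports.push(port);
//     port.onmessage = (msg) => ports.forEach((p) => p.postMessage(msg.data));
//   };

const pageId = document.title;
const worker = new SharedWorker('worker.js');

worker.port.onmessage = (event) => {
  // A message originating from another page means the worker is shared across renders.
  if (event.data !== pageId) {
    document.body.insertAdjacentHTML('beforeend', `<p>FAILED: ${event.data}</p>`);
  }
};

setInterval(() => worker.port.postMessage(pageId), 500);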

Web Crawler | Shared Worker
Ahrefs | PASSED
Botify | PASSED
ContentKing | PASSED
FandangoSEO | PASSED
JetOctopus | PASSED
Lumar (formerly Deepcrawl) | PASSED
Netpeak Spider | FAILED
OnCrawl | PASSED
Ryte | PASSED
Screaming Frog | PASSED
SEO PowerSuite WebSite Auditor | PASSED
SEOClarity | PASSED
Sistrix | PASSED
Sitebulb | FAILED

Conclusion

Looking at the table of results below, we can clearly see the session isolation issue is quite common: 71% of the tools failed at least one of the tests.

While it is difficult to predict what is causing the storage isolation failures, a possible cause of failing the cross-tab communication tests (Broadcast Channel and Shared Worker) is that the same browser instance is used to render pages in parallel across multiple windows and/or tabs.

Regarding the SessionStorage test, given the peculiar properties of SessionStorage, the web crawling tools that fail it are probably not closing the tab after use but reusing the same tab to render a second page after the first rendering completes, and so on.

Without knowing the actual technical implementations, web crawlers might pass all the tests included in this research by manually clearing every individual storage mechanism at the end of each page rendering session, but this approach is not a secure or viable way to guarantee data integrity.

Our research focuses on a limited number of Web APIs and browser interfaces. It’s worth mentioning that these aren’t the only ones that might have access to browser memory/cache, and keeping up with the development of all new standards and web features is a complex and time-consuming process.

Web Crawler | Cookie | IndexedDB | LocalStorage | SessionStorage | Broadcast Channel | Shared Worker
Ahrefs | FAILED | FAILED | FAILED | PASSED | PASSED | PASSED
Botify | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED
ContentKing | PASSED | FAILED | FAILED | PASSED | PASSED | PASSED
FandangoSEO | FAILED | FAILED | FAILED | FAILED | PASSED | PASSED
JetOctopus | PASSED | FAILED | FAILED | PASSED | PASSED | PASSED
Lumar (formerly Deepcrawl) | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED
Netpeak Spider | PASSED | FAILED | FAILED | FAILED | FAILED | FAILED
OnCrawl | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED
Ryte | PASSED | FAILED | FAILED | PASSED | PASSED | PASSED
Screaming Frog | PASSED | FAILED * | FAILED * | PASSED | PASSED | PASSED
SEO PowerSuite WebSite Auditor | PASSED | FAILED | FAILED | FAILED | PASSED | PASSED
SEOClarity | FAILED | PASSED | FAILED | PASSED | PASSED | PASSED
Sistrix | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED
Sitebulb | FAILED | FAILED | FAILED | PASSED | FAILED | FAILED

*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them

Appendices

Web crawlers’ status

Not all web crawlers have been able to fix the session isolation issues yet; some are still investigating. For those who have fixed them, the Docker crawl testing framework supported the remediation and will be included in their future release checks. Some web crawler vendors involved us throughout the entire remediation process.

Below you can find a table with the current status of the web crawlers. We’ll update the table whenever we have news from the vendors.

Last update: 2022/11/15

Web Crawler | Status
Ahrefs | Fixed – 15 Nov 2022
Botify | Passed all tests
ContentKing | Fixed – 27 Oct 2022
FandangoSEO | Looking into this
JetOctopus | Looking into this
Lumar (formerly Deepcrawl) | Passed all tests
Netpeak Spider | Looking into this
OnCrawl | Passed all tests
Ryte | Fixed – 10 Oct 2022
Screaming Frog | Fixed – 17 Aug 2022 *
SEO PowerSuite WebSite Auditor | Looking into this
SEOClarity | Looking into this
Sistrix | Passed all tests
Sitebulb | Looking into this

*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them

If you are a web crawling provider, consider using the crawl testing framework to ensure that you are isolating sessions correctly.

Additional information about the research

  • The majority of tests were completed during July and August 2022;
  • We used the standard configuration for all tools, enabling “JS Rendering” where it was not the default, and disabling the “keep storage” option where possible (this is an advanced option and should be disabled by default);
  • Merj is a channel partner of Botify. Botify has not been involved in designing or conducting these tests;
  • All featured vendors were sent the results after testing was completed and were offered a 30-day grace period to validate or fix any session isolation issues before publishing this article.