Abstract
Web crawling tools aim to replicate search engines’ crawling and rendering behaviours by implementing and using web rendering systems. This offers insights into what search engines might see when they are crawling and rendering web pages.
While there is no defined standard for an automated rendering process, popular search engines (e.g. Google, Bing, Yandex) render pages in isolated rendering sessions. This way, they avoid having the rendering of one web page affect the functionality or the content of another. Isolated rendering sessions should have isolated storage and prevent cross-tab communication.
Web crawling tools which do not have isolated rendering sessions might render some web page elements inaccurately, which has three main implications:
- Lack of data integrity
- The rendered pages are not an accurate representation of what search engines will render and use
- Developers may waste time (and money) investigating issues which are not present
This research evaluates 14 web crawling tools’ rendering session isolation capabilities across a series of six tests to identify any potential issues and highlight improvement opportunities.
Session Isolation Primer
As new web rendering patterns gained traction, the web moved from static HTML pages to more complex ways of rendering content. With the widespread adoption of rendering patterns such as Client-Side Rendering and Progressive Hydration, search engines were effectively forced to start rendering web pages in order to retrieve almost as much content as users get in their browsers.
To do so, they developed their own web rendering systems: software capable of rendering large numbers of web pages using automated browsers. Trying to keep up with the evolution of the web and mimic search engines’ capabilities, many web crawling tools also started to build rendering systems.
Rendering is hard. There is no industry standard for rendering pages, which means that not even leading search engines such as Google, with its Web Rendering Service, do it in the “correct” way. Each rendering system is built to serve specific use cases, which results in inevitable tradeoffs.
It’s worth noting that in web rendering systems there are many non-JavaScript factors that can influence the rendering of a page, such as network errors, timeouts, robots.txt rules, HTTP/HTTPS mixed content, caching mismatch errors, and CORS errors. Reducing everything to “rendering = JS” is therefore a significant mistake.
Research Context
At Merj we’ve been happy users of many web crawling tools. At the same time, for more specific or complex needs, we have been building our own web crawling systems for use cases such as custom data sources in complex data pipelines for enterprise companies.
The starting point of this research was a recent project that required us to provide assurances to a legal and compliance team about the data quality and integrity of a data source (rendered pages). These were to be ingested into a machine learning model.
In addition to other checks present in our data integrity validation process, we tested the output of multiple web crawling tools. We found some unexpected values which varied across tools. This research is the result of the analysis carried out to understand the reasons for the differences between various web crawlers’ outputs.
What is Session Isolation?
While rendering a page in an isolated rendering session, the page must not be able to access any data from previous rendering sessions or be influenced by other pages’ renderings. A web crawling tool with session isolation issues might produce additional HTML content or new dynamic links, i.e. links that are neither present nor wanted. That additional content and those links won’t be present in search engines’ rendering output, which creates a risky situation when analysing the outcome of a crawl/render process.
This is similar to the concept of “stateless” as used for web crawlers, where all fetches are completed without reusing cookies and without retaining any data in memory between fetches.
This issue is difficult to identify, and it is worth mentioning that it is not limited to web crawlers. All systems that use browser-based functionality might be affected, such as dynamic rendering services, web performance analysis tools, and CI/CD pipeline tests.
There are some cases where you need to keep data for specific tests, for example testing repeat views of a page to understand the web performance of returning visitors, but that option should be explicit and deliberate, not the side effect of a hidden problem.
Session Isolation in the Wild
To better understand the possible implications a session isolation issue may cause, we need to analyse websites that have custom personalisation features based on the navigation history.
A clear example can be found on the Ikea.com website. After visiting a few product pages, in addition to “Related products”, “You might like”, and “Goes well with”, you can see at the bottom of a product page an extra “Your recently viewed products” box.
This additional “box” is not present when you first visit these websites with a “clean” browser, but if you keep navigating them, product pages are then populated with your product view history. We can find similar features on the asos.com and adidas.com websites:
For all three previous examples, the “Recently Viewed” feature is implemented by saving the recently viewed products in the browser storage. Similar features are present on a huge number of websites across the internet.
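As a rough illustration, a feature like this is typically implemented with a few lines of client-side JavaScript. The sketch below is hypothetical (the storage key and markup are not taken from any of the sites above): it reads the product history left in LocalStorage by previously visited pages, renders it, and then appends the current product for future page views.

```js
// Hypothetical "recently viewed" widget (illustrative only, not from any specific site).
const STORAGE_KEY = 'recentlyViewed';

// Read the history saved by previously visited product pages (if any).
const viewed = JSON.parse(localStorage.getItem(STORAGE_KEY) || '[]');

// Render a link for each previously viewed product.
const container = document.querySelector('#recently-viewed');
if (container) {
  for (const product of viewed) {
    const link = document.createElement('a');
    link.href = product.url;
    link.textContent = product.name;
    container.appendChild(link);
  }
}

// Add the current product to the history for future page views (keep the last 10).
viewed.push({ url: location.pathname, name: document.title });
localStorage.setItem(STORAGE_KEY, JSON.stringify(viewed.slice(-10)));
```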
If we use the Adidas “Recently Viewed Items” feature as an example, a web crawling tool affected by the issue, i.e. one without an isolated rendering session, might link product pages to other pages simply because those pages were crawled earlier and their data stored in memory. This produces a considerable percentage of “ghost links” on the pages, visible only to that specific web crawler.
Search engines and web crawlers that implement correct session isolation won’t see this additional content on a page, so their final results will differ from those of a non-isolated tool.
Looking at the results of both processes, we can clearly see the differences in the final rendered HTML:
Also, depending on the crawling/rendering order, a web crawling tool with session isolation issues might create arbitrary HTML content or links. In the example below it’s clear that starting the crawling/rendering process from PAGE 1 is producing a different result than starting from PAGE 3.
Solving Session Isolation
When you render a sequence of pages in a browser, all pages from the same domain can access the same storage and even communicate with each other. It’s worth remembering that closing the browser doesn’t delete this data: when you open it again, all the information is still there.
If you’re building a web rendering system, this is a problem.
Partial and Incorrect Solutions
There are many partial and incorrect ways of tackling this issue for web crawling purposes, some of them are:
- Clearing cookies after the rendering of a page. The problem here is that cookies are not the only Web API that can store data (see the sketch after this list).
- Opening and closing the browser for each page you want to render, manually deleting the folders where the browser stores data. This option is not efficient at all.
- Using an incognito profile hides some possible pitfalls as well. Even though an incognito profile’s data is stored in RAM and should be discarded when you close the window, pages rendered within the same incognito profile might share storage, and cross-tab communication is still possible. This option would solve our problem only if, again, we don’t render pages in parallel and we start/stop the browser for each page.
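As an illustration of the first pitfall, the Puppeteer sketch below (using example.com as a stand-in origin) clears cookies between two page loads, yet a value written to LocalStorage by the first page is still visible to the second:

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  // First "rendering session": write to both cookies and LocalStorage.
  const pageA = await browser.newPage();
  await pageA.goto('https://example.com');
  await pageA.evaluate(() => {
    document.cookie = 'visited=1';
    localStorage.setItem('visited', '1');
  });
  // Clearing cookies only...
  await pageA.deleteCookie(...(await pageA.cookies()));
  await pageA.close();

  // Second "rendering session" on the same origin: LocalStorage leaked through.
  const pageB = await browser.newPage();
  await pageB.goto('https://example.com');
  console.log(await pageB.evaluate(() => localStorage.getItem('visited'))); // "1"

  await browser.close();
})();
```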
The optimal solution
Without changing the source code, session isolation can be solved on a headless Chromium browser by using Browser Contexts. Introduced at BlinkOn 6, Browser Context is an efficient way to have correct session isolation. Every Browser Context session runs in a separate renderer process, isolating the storage (cookies, cache, local storage, etc.) and preventing cross-tab communication.
Rendering a single page per Browser Context, closing it at the end of the rendering, and then opening a new Browser Context for the next page will guarantee isolated rendering sessions without the need to restart the browser every time.
Using this solution has a minimal effect on a web crawler’s performance. The slight degradation can easily be offset by many other performance optimisations that don’t affect the validity of the output. In most real-world cases, the majority of web crawling tool users would not trade the data integrity lost to session isolation issues for an overall performance gain of a few seconds.
Browser Context
In the testing section, you’ll find a few references to Browsing Context. It is important to differentiate between *Browser Context* and *Browsing Context*.
- A Browser Context, in the case of headless Chromium, corresponds to a special user profile (similar to the Incognito Profile).
- A Browsing Context is an environment in which a browser displays a document to users. This can be a tab or a window but also part of a page such as an Iframe. Each Browsing Context has its specific origin.
Thanks to Sami Kyostila for the clarifications about the difference between Browser Context and Browsing Context.
Opening and using a new Browser Context is quite simple. The snippet below is adapted from the example in the official Puppeteer documentation:
```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  // Create a new incognito browser context.
  const context = await browser.createIncognitoBrowserContext();
  // Create a new page in a pristine context.
  const page = await context.newPage();
  // Do stuff
  await page.goto('https://example.com');
  // Dispose of the context (and its storage) when done.
  await context.close();
  await browser.close();
})();
```
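Putting this together with the per-page approach described earlier, a minimal sketch (the URLs are placeholders) might look like this: each page gets its own incognito Browser Context, which is closed as soon as the page has been rendered, while the browser process itself stays up.

```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const urls = ['https://example.com/page-1', 'https://example.com/page-2'];

  for (const url of urls) {
    // A fresh context per page: isolated cookies, cache, storage, and no cross-tab talk.
    const context = await browser.createIncognitoBrowserContext();
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    const html = await page.content();
    console.log(url, html.length); // ...store or analyse the rendered HTML here...

    // Closing the context discards everything the page stored.
    await context.close();
  }

  await browser.close();
})();
```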
Additional documentation and examples on the use of Browser Context can be found here:
- https://chromedevtools.github.io/devtools-protocol/tot/Target/#method-createBrowserContext
- https://pptr.dev/next/api/puppeteer.browser.createincognitobrowsercontext
- https://playwright.dev/docs/api/class-browsercontext
Methodology
We set up a testing environment with 1,000 pages that try to communicate with each other using shared storage and cross-tab communication. Rendering 1,000 pages will increase the chances of having two or more pages rendered at the same time in parallel on the same machine or by the same browser. Using fewer pages may cause false negatives if the tested web rendering system uses a high number of machines in parallel.
These pages were run through six tests, each resulting in a pass or a fail, split across two categories:
1) Storage Isolation
2) Cross Tab Communication
Test code
If you want to check if a tool not present in the list is affected by session isolation issues or if you want to verify one of the tools we tested, you can replicate the testing environment using the following code.
GitHub link: https://github.com/merj/test-crawl-session-isolation
Test Descriptions and Results
The results in the tables below refer to our initial tests, with the majority of them completed between July and August 2022. In the meantime, some of the web crawler vendors fixed the problem. For the current status refer to the “Web crawlers’ status” appendices.
It’s worth mentioning that Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them.
From the date of the last test, we provided the platforms with a 30-day grace period to validate or fix any session isolation issues before publishing this article.
The tables below show each test we performed along with the result of each web crawler (listed A-Z).
Storage Isolation
- Cookie
- IndexedDB
- LocalStorage
- SessionStorage
Cross-tab communication
- Broadcast Channel
- Shared Worker
Storage Isolation
Storage isolation tests are focused on Web APIs that save or access data from the browser’s memory. The goal of each test is to find race conditions in accessing data saved from previous or parallel page renderings.
1. Cookie
Cookies need no introduction. The Document.cookie interface lets you read and write small pieces of information in the browser’s storage. Cookies are mainly used for session management, personalisation, and tracking.
Reference: https://developer.mozilla.org/en-US/docs/Web/API/Document/cookie
Test explanation: When the rendering starts, the page creates and saves a cookie, then checks whether there are cookies saved by other pages.
Fail criterion: If there are cookies other than the ones created for the rendered page, the test fails.
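The in-page logic can be approximated with a few lines of JavaScript. This is a simplified sketch (key names and the way results are reported are illustrative; the actual test pages are in the GitHub repository linked above):

```js
// Simplified sketch of the cookie test running inside each test page.
const pageId = encodeURIComponent(location.pathname); // unique per test page

// Any cookie that doesn't belong to this page leaked in from another rendering.
const leaked = document.cookie
  .split('; ')
  .filter((entry) => entry && !entry.startsWith(`page_${pageId}=`));

// Save this page's own cookie for later renderings to detect.
document.cookie = `page_${pageId}=1; path=/`;

// Expose the result in the DOM so it appears in the rendered HTML.
document.body.dataset.cookieTest = leaked.length === 0 ? 'PASSED' : 'FAILED';
```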
Web Crawler | Cookie |
---|---|
Ahrefs | FAILED |
Botify | PASSED |
ContentKing | PASSED |
FandangoSEO | FAILED |
JetOctopus | PASSED |
Lumar (formerly Deepcrawl) | PASSED |
Netpeak Spider | PASSED |
OnCrawl | PASSED |
Ryte | PASSED |
Screaming Frog | PASSED |
SEO PowerSuite WebSite Auditor | PASSED |
SEOClarity | FAILED |
Sistrix | PASSED |
Sitebulb | FAILED |
2. IndexedDB
IndexedDB is a transactional database system that lets you store and retrieve objects in the browser’s storage.
Reference: https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API
Test explanation: When the rendering starts, the page creates or connects to an IndexedDB database. It then creates and saves a record in the database and checks whether there are records saved by other pages.
Fail criterion: If there are records other than the ones created for the rendered page, the test fails.
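A simplified sketch of the in-page logic (database, store, and field names are illustrative; the actual test pages are in the linked repository):

```js
// Simplified sketch of the IndexedDB test running inside each test page.
const pageId = location.pathname;
const request = indexedDB.open('session-isolation-test', 1);

request.onupgradeneeded = () => {
  request.result.createObjectStore('visits', { keyPath: 'pageId' });
};

request.onsuccess = () => {
  const tx = request.result.transaction('visits', 'readwrite');
  const store = tx.objectStore('visits');

  // Record this page's visit, then look for records left behind by other pages.
  store.put({ pageId, when: Date.now() });
  const readAll = store.getAll();
  readAll.onsuccess = () => {
    const others = readAll.result.filter((record) => record.pageId !== pageId);
    document.body.dataset.indexeddbTest = others.length === 0 ? 'PASSED' : 'FAILED';
  };
};
```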
Web Crawler | IndexedDB |
---|---|
Ahrefs | FAILED |
Botify | PASSED |
ContentKing | FAILED |
FandangoSEO | FAILED |
JetOctopus | FAILED |
Lumar (formerly Deepcrawl) | PASSED |
Netpeak Spider | FAILED |
OnCrawl | PASSED |
Ryte | FAILED |
Screaming Frog | FAILED * |
SEO PowerSuite WebSite Auditor | FAILED |
SEOClarity | PASSED |
Sistrix | PASSED |
Sitebulb | FAILED |
*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them
3. LocalStorage
LocalStorage is a mechanism that uses the Web Storage API by which browsers can store key/value pairs. Data persists when the browser is closed and reopened.
Reference: https://developer.mozilla.org/en-US/docs/Web/API/Window/localStorage
Test explanation: When the rendering starts, the page creates and saves a data item in Local Storage, then checks whether there are data items saved by other pages.
Fail criterion: If there are data items other than the ones created for the rendered page, the test fails.
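A simplified sketch of the in-page logic (key names and reporting are illustrative):

```js
// Simplified sketch of the LocalStorage test running inside each test page.
const pageId = location.pathname;

// Any key not written by this page must have leaked from another rendering.
const leakedKeys = Object.keys(localStorage).filter((key) => key !== `visited_${pageId}`);

// Save this page's own item for later renderings to detect.
localStorage.setItem(`visited_${pageId}`, String(Date.now()));

// Expose the result in the DOM so it appears in the rendered HTML.
document.body.dataset.localStorageTest = leakedKeys.length === 0 ? 'PASSED' : 'FAILED';
```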
Web Crawler | LocalStorage |
---|---|
Ahrefs | FAILED |
Botify | PASSED |
ContentKing | FAILED |
FandangoSEO | FAILED |
JetOctopus | FAILED |
Lumar (formerly Deepcrawl) | PASSED |
Netpeak Spider | FAILED |
OnCrawl | PASSED |
Ryte | FAILED |
Screaming Frog | FAILED * |
SEO PowerSuite WebSite Auditor | FAILED |
SEOClarity | FAILED |
Sistrix | PASSED |
Sitebulb | FAILED |
*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them
4. SessionStorage
SessionStorage is a mechanism that uses the Web Storage API by which browsers can store key/value pairs. Data lasts as long as the tab is open and survives page reloads and restores.
Reference: https://developer.mozilla.org/en-US/docs/Web/API/Window/sessionStorage
Test explanation: When the rendering starts, the page creates and saves a data item in Session Storage, then checks whether there are data items saved by other pages.
Fail criterion: If there are data items other than the ones created for the rendered page, the test fails.
Web Crawler | SessionStorage |
---|---|
Ahrefs | PASSED |
Botify | PASSED |
ContentKing | PASSED |
FandangoSEO | FAILED |
JetOctopus | PASSED |
Lumar (formerly Deepcrawl) | PASSED |
Netpeak Spider | FAILED |
OnCrawl | PASSED |
Ryte | PASSED |
Screaming Frog | PASSED |
SEO PowerSuite WebSite Auditor | FAILED |
SEOClarity | PASSED |
Sistrix | PASSED |
Sitebulb | PASSED |
Cross-tab communication
Cross-tab communication tests are focused on Web APIs that send or receive data. The goal of each test is to find out whether, during rendering, a page can receive messages from other pages rendered in parallel.
5. Broadcast Channel
The Broadcast Channel API allows communication between Browsing Contexts such as windows, tabs, frames, iframes, and workers of the same origin.
A client can join a channel by specifying its name: if the channel exists, the client joins it; if it doesn’t, a new one is created. On this channel, a client can send messages and receive all messages sent by the other clients connected to the channel.
Reference: https://developer.mozilla.org/en-US/docs/Web/API/Broadcast_Channel_API
Test explanation: When the rendering starts, the page connects to the channel and then starts sending its page title as a message to the channel. If other connected pages are sending messages through the channel, the page receives and saves them.
Fail criterion: If the rendered page gets even a single message from the Broadcast Channel sent by other pages, the test fails.
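A simplified sketch of the in-page logic (the channel name and reporting are illustrative):

```js
// Simplified sketch of the Broadcast Channel test running inside each test page.
const channel = new BroadcastChannel('session-isolation-test');

// If nothing arrives while the page is being rendered, the test passes.
document.body.dataset.broadcastTest = 'PASSED';

// Any message arriving here was sent by another page rendered at the same time.
channel.onmessage = (event) => {
  document.body.dataset.broadcastTest = 'FAILED';
  console.warn('Received from another page:', event.data);
};

// Keep announcing this page's title on the channel for other renderings to pick up.
setInterval(() => channel.postMessage(document.title), 500);
```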
Web Crawler | Broadcast Channel |
---|---|
Ahrefs | PASSED |
Botify | PASSED |
ContentKing | PASSED |
FandangoSEO | PASSED |
JetOctopus | PASSED |
Lumar (formerly Deepcrawl) | PASSED |
Netpeak Spider | FAILED |
OnCrawl | PASSED |
Ryte | PASSED |
Screaming Frog | PASSED |
SEO PowerSuite WebSite Auditor | PASSED |
SEOClarity | PASSED |
Sistrix | PASSED |
Sitebulb | FAILED |
6. Shared Worker
The Shared Worker is a Worker that allows communication between Browsing Contexts such as windows, tabs, frames, iframes, and workers on the same origin.
Reference: https://developer.mozilla.org/en-US/docs/Web/API/SharedWorker
Test explanation: When the rendering starts, the page connects to the Shared Worker, starts sending messages to the Worker, and listens for messages from other pages relayed through the Worker.
Fail criterion: If the rendered page gets even a single message from the Shared Worker sent by other pages, the test fails.
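A simplified sketch of the in-page logic, together with a hypothetical relay worker script (the file name and relay behaviour are illustrative):

```js
// Simplified sketch of the Shared Worker test running inside each test page.
// The worker script path is hypothetical.
const worker = new SharedWorker('/relay-worker.js');

document.body.dataset.sharedWorkerTest = 'PASSED';

// Any message relayed back here originated from another page rendered at the same time.
worker.port.onmessage = () => {
  document.body.dataset.sharedWorkerTest = 'FAILED';
};

// Keep announcing this page's title; the worker relays it to every other connected page.
setInterval(() => worker.port.postMessage(document.title), 500);

// --- relay-worker.js (hypothetical) ---
// const ports = [];
// onconnect = (event) => {
//   const port = event.ports[0];
//   ports.push(port);
//   port.onmessage = (msg) => {
//     ports.filter((p) => p !== port).forEach((p) => p.postMessage(msg.data));
//   };
// };
```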
Web Crawler | Shared Worker |
---|---|
Ahrefs | PASSED |
Botify | PASSED |
ContentKing | PASSED |
FandangoSEO | PASSED |
JetOctopus | PASSED |
Lumar (formerly Deepcrawl) | PASSED |
Netpeak Spider | FAILED |
OnCrawl | PASSED |
Ryte | PASSED |
Screaming Frog | PASSED |
SEO PowerSuite WebSite Auditor | PASSED |
SEOClarity | PASSED |
Sistrix | PASSED |
Sitebulb | FAILED |
Conclusion
Looking at the table of results below, we can clearly see the session isolation issue is quite common: 71% of the tools failed at least one of the tests.
While it is difficult to predict what is causing the storage isolation failures, a likely cause of failing the cross-tab communication tests (Broadcast Channel and Shared Worker) is using the same browser instance to render pages in parallel across multiple windows and/or tabs.
Regarding the SessionStorage test, given the peculiar properties of SessionStorage, the web crawling tools that fail it are probably not closing the tab after use: after completing the first rendering, they reuse the same tab to render a second page, and so on.
Without knowing the actual technical implementations, web crawlers might pass all the tests in this research by manually clearing every single storage at the end of every page rendering session, but this approach is not a robust or viable way to guarantee data integrity.
Our research focuses on a limited number of Web APIs and browser interfaces. It’s worth mentioning that these aren’t the only ones that might have access to browser memory/cache, and keeping up with the development of all new standards and web features is a complex and time-consuming process.
Web Crawler | Cookie | IndexedDB | LocalStorage | SessionStorage | Broadcast Channel | Shared Worker |
---|---|---|---|---|---|---|
Ahrefs | FAILED | FAILED | FAILED | PASSED | PASSED | PASSED |
Botify | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
ContentKing | PASSED | FAILED | FAILED | PASSED | PASSED | PASSED |
FandangoSEO | FAILED | FAILED | FAILED | FAILED | PASSED | PASSED |
JetOctopus | PASSED | FAILED | FAILED | PASSED | PASSED | PASSED |
Lumar (formerly Deepcrawl) | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
Netpeak Spider | PASSED | FAILED | FAILED | FAILED | FAILED | FAILED |
OnCrawl | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
Ryte | PASSED | FAILED | FAILED | PASSED | PASSED | PASSED |
Screaming Frog | PASSED | FAILED * | FAILED * | PASSED | PASSED | PASSED |
SEO PowerSuite WebSite Auditor | PASSED | FAILED | FAILED | FAILED | PASSED | PASSED |
SEOClarity | FAILED | PASSED | FAILED | PASSED | PASSED | PASSED |
Sistrix | PASSED | PASSED | PASSED | PASSED | PASSED | PASSED |
Sitebulb | FAILED | FAILED | FAILED | PASSED | FAILED | FAILED |
*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them
Appendices
Web crawlers’ status
Not all web crawler vendors have been able to fix the session isolation issues yet; some are still investigating. The Docker-based crawl testing framework supported those who have fixed session isolation, and it will be included in their future release checks. Some vendors involved us throughout the entire remediation process.
Below you can find a table with the current status of the web crawlers. We’ll update the table every time we have news from the vendors.
Last update: 2022/11/15
Web Crawler | Status |
---|---|
Ahrefs | Fixed – 15 Nov 2022 |
Botify | Passed all tests |
ContentKing | Fixed – 27 Oct 2022 |
FandangoSEO | Looking into this |
JetOctopus | Looking into this |
Lumar (formerly Deepcrawl) | Passed all tests |
Netpeak Spider | Looking into this |
OnCrawl | Passed all tests |
Ryte | Fixed – 10 Oct 2022 |
Screaming Frog | Fixed – 17 Aug 2022 * |
SEO PowerSuite WebSite Auditor | Looking into this |
SEOClarity | Looking into this |
Sistrix | Passed all tests |
Sitebulb | Looking into this |
*Screaming Frog fixed the problem autonomously in August after our tests but before we reached out to them
If you are a web crawling provider, consider using the crawl testing framework to ensure that you are isolating sessions correctly.
Additional information about the research
- The majority of tests were completed during July and August 2022;
- We used the standard configuration for all tools, enabling “JS Rendering” where it was not the default and disabling the “keep storage” option where possible (this is an advanced option and should be disabled by default);
- Merj is a channel partner of Botify. Botify was not involved in designing or conducting these tests.
- All featured vendors were sent the results after testing was completed and were offered a 30-day grace period to validate or fix any session isolation issues before publishing this article.