The agent-readable web: serving the right representation to the right crawler

Ryan Siddle

Managing Director

Will Nye

SEO & AEO Director

May 29, 2026 · 17 min read

AI Search

For decades, the web was optimised around the browser.

Search crawlers, scrapers, command-line tools, and HTTP libraries have always consumed web content too. But the browser shaped the default representation.

Agents change that. For years, the browser-first model shaped how we built pages: navigation, layout, JavaScript, CSS, analytics, accessibility metadata, structured data, and increasingly complex component trees.

Agents do not experience pages the way people do. They may not need a header, footer, cookie banner, carousel, responsive grid, or client-side bundle. They may not execute JavaScript. They may not benefit from a deeply nested DOM.

What they need is a high-fidelity representation of the content.

That is why Markdown has become interesting. Not because Markdown is magic, and not because HTML is broken. Markdown is useful because it can preserve the structure agents need (headings, paragraphs, links, tables, lists, code blocks, dates, image context, and canonical URLs) while removing much of the noise created for browsers.

The future is not HTML versus Markdown.

It is one canonical resource with multiple representations: HTML for browsers, Markdown for agents that ask for it, and simplified HTML for crawlers that still prefer the DOM.

The cleanest long-term model is HTTP content negotiation.

If agents, search engines, and LLM crawlers want a Markdown representation, they should be able to ask for it using the Accept header. Getting there will not be a clean path, though.

Who this is for (and what to take away)

This is written for product managers, engineers, and SEO/AEO consultants who are working out how to serve the same underlying content to both humans (browsers) and machines (AI crawlers and agentic retrieval) efficiently.

What you should get from this:

A clear model: one canonical URL, multiple representations
The trade-offs between .md endpoints, rel="alternate", llms.txt, DOM flattening, and Accept content negotiation
The implementation gotchas that cause regressions (especially caching + bot verification)
A simple way to judge whether the “agent-readable” output is actually higher-fidelity

If you only do one thing: treat “agent-readable” as a representation problem, not “a new website problem”. Keep one canonical URL, then offer better representations (Markdown, simplified HTML) via explicit mechanisms such as Accept content negotiation and rel="alternate".

HTML is the browser representation

HTML is still the right format for the web.

It gives browsers a rich representation of a document. It supports links, forms, media, scripts, accessibility semantics, structured data, and interactive applications. For humans, HTML is the correct representation.

But for many agentic workflows, HTML carries extra weight.

A modern page may include the main article or documentation content, but it also includes navigation, tracking scripts, related links, client-side state, style hooks, layout wrappers, hidden elements, embedded JSON, and framework-generated markup.

A browser can turn that into an experience. A retrieval system has to work around it.

That creates three costs.

Cost	What happens	Why it matters
Network cost	Agents download more bytes than they need.	More egress, slower retrieval, and more wasted work.
Parsing cost	Retrieval systems have to extract the main content from HTML markup designed for browser rendering.	Extraction quality depends on the parser.
Fidelity cost	Content can be reordered, omitted, duplicated, or surrounded by unrelated modules.	The model receives a lower-quality representation of the page.

Anyone who has tried to extract useful text from arbitrary HTML has seen this.

A simple text dump can collapse spacing. A naïve tree walk can pick up navigation, hidden content, or footer links. A stricter parser may drop useful image context such as src, alt, captions, or ARIA labels.

The problem is not that HTML is wrong. The problem is that HTML is a browser representation, and agents increasingly need a content representation.

So what? If you want agents to retrieve answers from your content reliably, assume raw HTML will impose extraction cost and occasional fidelity loss. Either provide a cleaner representation (Markdown / simplified HTML) or design your HTML so the “main content” is unambiguous and complete.

Markdown is useful when it improves fidelity

The point is not to send agents less content, but to send them a better representation of the content that matters.

Content fidelity is the degree to which a representation preserves the meaning, structure, provenance, and intent of a page while removing irrelevant noise.

A full HTML page may be complete, but low fidelity for retrieval. The main answer can be surrounded by navigation, related links, promotions, scripts, layout wrappers, and unrelated page modules.

A Markdown representation may be shorter but higher fidelity if it preserves the title, date, headings, body copy, code blocks, tables, links, image context, and canonical URL.

Markdown is useful when it improves fidelity, not merely when it makes the response smaller.

That distinction matters. A Markdown file that strips the publication date from a press release is worse than the HTML page. A Markdown export that drops code fences from a framework guide is worse than the HTML page. A Markdown version of a product page that removes image alt text, availability, regional context, or policy links may be too lossy to trust.

The best agent representation is the least noisy representation that preserves the facts.

So what? Treat Markdown as a content contract. If your Markdown representation drops publication dates, code fences, tables, or image context, you haven’t created an “agent-readable” version. You’ve created a lossy export that will degrade citations and answers.

The current patterns are useful, but incomplete

There are several ways to expose agent-readable content.

They are not mutually exclusive. They solve different problems at different levels of maturity.

Pattern	What it helps with	Limitation
Copy as Markdown	Helps humans move clean context into AI tools.	Manual. Not a discovery or retrieval mechanism.
`.md` endpoints	Gives agents and developers direct Markdown access.	Creates duplicate URLs and citation questions.
`rel="alternate"`	Lets HTML pages advertise Markdown equivalents.	Discovery only. The client still has to fetch the alternate version.
`llms.txt` / `sitemap.md`	Gives agents a curated map of important content.	Can go stale if not generated automatically.
DOM flattening	Reduces HTML noise for crawlers that still consume HTML.	Still requires cache-safe representation switching.
Bot-aware rendering	Serves different representations to verified crawlers.	Risky without strong verification, logging, and parity checks.
Forced format rendering	Overrides the response format for a recognised crawler.	The most aggressive option. A detection mistake can cause ranking regressions.
`Accept: text/markdown`	Serves Markdown from the canonical URL.	Requires clients to send the right header.

The point is not to pick one pattern and treat it as the answer; it is to avoid creating a parallel web.

.md endpoints and Markdown discovery files are useful today. But the better end state is one canonical URL that can serve the right representation to the right consumer.

So what? Pick a primary mechanism (usually Accept negotiation), then add bridges (like .md endpoints and rel="alternate") where adoption is low. Design everything so citations and analytics still point back to the canonical HTML URL.

A safe rollout plan (minimal → scalable)

If you want a path that is shippable in weeks (not quarters) and minimises SEO risk, sequence it.

Phase 1: Start with the highest-leverage page types

Pick 1–2 page types where fidelity matters and templates are consistent (typically: documentation, support articles, changelogs).
Ship .md endpoints for those pages only.
Require basic parity: title, headings, tables, code blocks, and canonical URL preserved.

Phase 2: Make the relationship explicit

Add rel="alternate" type="text/markdown" from the HTML page to the Markdown representation.
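Make the canonical relationship explicit by format: a Link header with rel="canonical" on the Markdown response, and a <link rel="canonical"> element in the HTML head.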
Confirm analytics and citations still resolve to the canonical HTML URL.

Phase 3: Move to negotiated Markdown on the canonical URL

Implement Accept: text/markdown negotiation on the canonical URL.
Add Vary: Accept and ensure caches/CDNs vary correctly (no cache mixing across representations).
Keep .md endpoints as a bridge while client adoption is inconsistent.

Phase 4: Consider crawler-specific representations (only if needed)

Add simplified HTML (DOM flattening) where it improves extraction without changing meaning.
Only consider bot-aware or forced rendering when bot verification is strong and rollback is trivial.

`.md` endpoints are a useful bridge

Dedicated Markdown URLs are easy to understand.

https://merj.com/blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler
https://merj.com/blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler.md

They are also easy to use.

curl -i https://merj.com/blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler.md

On Merj articles, the .md extension is the forced Markdown route. That matters because command-line clients such as curl commonly send Accept: */* by default; the explicit .md URL still returns the Markdown representation.

For documentation, API references, changelogs, support content, and framework guides, this can be powerful. Coding agents, terminal tools, internal support systems, and developer workflows can fetch the Markdown version directly. No special headers are required. No browser rendering is required. No HTML extraction is required.

But .md endpoints create a second URL for the same underlying resource.

A site with 100,000 HTML pages may now have 100,000 additional Markdown URLs. Those URLs can appear in logs, analytics, search systems, and AI citations. If a customer clicks an AI citation and lands on raw Markdown, that may be a poor experience. If search engines crawl both versions, the site has to make the canonical relationship explicit.

For separate Markdown URLs, teams should return the correct text/markdown content type:

Content-Type: text/markdown; charset=utf-8

They should also consider an HTTP canonical header pointing back to the human-facing URL:

Link: <https://merj.com/blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler>; rel="canonical"

This is not just a technical nicety. It is a risk-control mechanism. If Google, Bing, or another search system discovers the Markdown URL, the canonical signal helps identify the HTML page as the source of truth for indexing, ranking, analytics, and user journeys.
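As a minimal sketch, the two headers above could be produced by one small helper. The function name `markdownAlternateHeaders` is hypothetical, and the sketch assumes the Markdown URL is simply the canonical URL with `.md` appended and no query string, as in the examples in this article.

```typescript
// Hypothetical helper: builds response headers for a dedicated ".md" URL.
// Assumption: the Markdown URL is the canonical URL plus a ".md" suffix,
// with no query string or fragment.
function markdownAlternateHeaders(markdownUrl: string): Record<string, string> {
  if (!markdownUrl.endsWith(".md")) {
    throw new Error(`Not a Markdown URL: ${markdownUrl}`);
  }

  // Strip the suffix to recover the human-facing canonical URL.
  const canonicalUrl = markdownUrl.slice(0, -".md".length);

  return {
    "Content-Type": "text/markdown; charset=utf-8",
    // HTTP canonical signal pointing back at the HTML page.
    Link: `<${canonicalUrl}>; rel="canonical"`,
  };
}

console.log(markdownAlternateHeaders("https://example.com/docs/functions.md"));
```

Deriving the canonical URL from the requested URL, rather than storing it separately, keeps the two from drifting apart, though a real site may prefer to read it from the content model.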

The HTML page should also advertise the alternate representation in the page head. This is useful because many crawlers parse HTML links more reliably than HTTP Link headers:

<link rel="canonical" href="https://merj.com/blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler" />
<link
  rel="alternate"
  type="text/markdown"
  href="https://merj.com/blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler.md"
/>

Use .md endpoints when direct access matters.

But treat them as alternate representations, not a second website.

So what? If you ship .md endpoints, you need an explicit plan for (1) canonicalisation, (2) what gets crawled and indexed, and (3) preventing “raw Markdown” citations becoming a bad end-user experience.

Content negotiation is the cleaner model

The cleaner model is HTTP content negotiation.

The URL identifies the resource. The Accept header selects the representation.

GET /blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler HTTP/1.1
Host: merj.com
Accept: text/markdown

The same Merj article can be tested directly with:

curl -i -H "Accept: text/markdown" https://merj.com/blog/the-agent-readable-web-serving-the-right-representation-to-the-right-crawler

This is not hypothetical. In our own logs, we have seen Claude Code request pages with Accept: text/markdown, text/html, */*. That tells us coding agents are already signalling that Markdown is useful. For developer documentation in particular, a Markdown-facing representation should be treated as part of the documentation interface, because coding agents can explicitly request it when they need high-fidelity source material.

The server can then respond with Markdown:

HTTP/2 200
Content-Type: text/markdown; charset=utf-8
Vary: Accept

This is the model we should want search engines and LLM crawlers to adopt.

It keeps one canonical URL. It avoids citations pointing users to raw .md pages. It keeps analytics and governance centred on one resource. It works with existing HTTP semantics. It allows the same content to have multiple representations without duplicating the URL space.

It also gives clients a clean way to express intent: a browser can ask for HTML, a coding agent can ask for Markdown, and a crawler that still prefers the DOM can ask for HTML. The server can then return the best available representation.

But servers need to implement this correctly.

This substring match is error prone:

if (request.headers.get("accept")?.includes("text/markdown")) {
  return markdown();
}

The Accept header can include multiple media types, explicit weights, wildcards, and exclusions. A server needs to choose the best representation from the media types it actually supports, using the client’s stated preferences.

For example:

Accept: text/markdown, text/html;q=0.9

In this case, text/markdown has an implicit weight of q=1.0 because no q value is provided, while text/html has an explicit weight of q=0.9. The client is saying: “send Markdown if you can, but HTML is an acceptable fallback.”

The nuance matters. If a request sends Accept: text/markdown, text/html, */* without explicit q values, those options are equally weighted by default. A server should not assume Markdown is preferred just because it appears first.
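To make the weighting concrete, here is a minimal TypeScript sketch that splits an Accept header into media ranges and weights. The `parseAccept` name and the `MediaRange` shape are hypothetical, and the sketch deliberately ignores parameters other than q, so it illustrates the rule rather than standing in for a real parser.

```typescript
// Illustrative only: a real Accept parser also handles specificity,
// extension parameters, and malformed input more carefully.
interface MediaRange {
  type: string;
  q: number;
}

function parseAccept(header: string): MediaRange[] {
  return header
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const segments = part.split(";").map((s) => s.trim());
      const type = (segments[0] ?? "").toLowerCase();

      // A missing q parameter means an implicit weight of 1.
      const qParam = segments.slice(1).find((p) => p.toLowerCase().startsWith("q="));
      const parsed = qParam ? Number.parseFloat(qParam.slice(2)) : 1;

      return { type, q: Number.isNaN(parsed) ? 1 : parsed };
    });
}

// Markdown preferred, HTML as a weighted fallback.
console.log(parseAccept("text/markdown, text/html;q=0.9"));

// No q values: all three ranges carry the same weight of 1.
console.log(parseAccept("text/markdown, text/html, */*"));
```

For the second header, every range comes back with q equal to 1, which is why position in the list is not, on its own, a statement of preference.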

Use a real Accept parser that respects media ranges, weighting, wildcards, ordering, and exclusions.

A simplified implementation might look like this, using the Negotiator library to handle the weighting rules:

import Negotiator from "negotiator";

// HTML is listed first so that a tie between equally acceptable types
// (for example Accept: */* or no Accept header) resolves to HTML.
const supportedTypes = ["text/html", "text/markdown"];

export async function GET(request: Request) {
  // Negotiator reads the Accept header from a Node-style headers object.
  const negotiator = new Negotiator({
    headers: Object.fromEntries(request.headers),
  });

  // mediaType() returns undefined when nothing matches; fall back to HTML.
  const mediaType = negotiator.mediaType(supportedTypes) ?? "text/html";

  if (mediaType === "text/markdown") {
    // renderPageAsMarkdown is an application-specific helper, not shown here.
    const markdown = await renderPageAsMarkdown();

    return new Response(markdown, {
      headers: {
        "Content-Type": "text/markdown; charset=utf-8",
        "Vary": "Accept",
      },
    });
  }

  // renderPageAsHtml is an application-specific helper, not shown here.
  const html = await renderPageAsHtml();

  return new Response(html, {
    headers: {
      "Content-Type": "text/html; charset=utf-8",
      "Vary": "Accept",
    },
  });
}

Vary: Accept matters because the selected representation depends on the request’s Accept header. Without it, caches can mix HTML and Markdown responses for the same URL.
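One practical wrinkle, offered as an assumption about typical cache behaviour rather than a claim about any specific CDN: a cache that varies on the raw Accept string may store a separate entry for every distinct header value. A possible mitigation is to normalise the header into a small set of variants before it reaches the cache key. The sketch below uses hypothetical names (`cacheVariant`, `cacheKey`) and a simplified rule in which Markdown is chosen only when it is listed explicitly and weighted at least as high as HTML. Whatever rule you choose, the cache and the origin need to apply the same one.

```typescript
type Variant = "html" | "markdown";

// Weight of an exactly named media range, or undefined if it is not listed.
function explicitWeight(accept: string, mediaType: string): number | undefined {
  for (const part of accept.split(",")) {
    const segments = part.split(";").map((s) => s.trim().toLowerCase());
    if (segments[0] !== mediaType) continue;

    const qParam = segments.slice(1).find((p) => p.startsWith("q="));
    const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
    return Number.isNaN(q) ? 1 : q;
  }
  return undefined;
}

// Collapses any Accept header into one of two cacheable variants.
function cacheVariant(accept: string | null): Variant {
  if (!accept) return "html";

  const markdown = explicitWeight(accept, "text/markdown");
  if (markdown === undefined || markdown <= 0) return "html";

  // HTML may be acceptable explicitly or through a wildcard range.
  const html =
    explicitWeight(accept, "text/html") ??
    explicitWeight(accept, "text/*") ??
    explicitWeight(accept, "*/*") ??
    0;

  // Assumed policy: ties go to Markdown when it was asked for by name.
  return markdown >= html ? "markdown" : "html";
}

function cacheKey(url: string, accept: string | null): string {
  return `${cacheVariant(accept)}:${url}`;
}

console.log(cacheKey("/docs/functions", "text/markdown, text/html, */*"));
console.log(cacheKey("/docs/functions", "*/*"));
```

With only two variants per URL, the cache hit rate stays predictable and the "sometimes Markdown, sometimes HTML" failure becomes easier to reproduce and test.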

This is why negotiated Markdown is a better long-term model than a parallel Markdown URL set.

The challenge is client adoption. If agents do not send Accept: text/markdown, servers cannot negotiate Markdown. That is why .md endpoints are useful today and content negotiation is the better destination.

So what? For engineers: this is an HTTP + caching project as much as it is a content project. If you don’t implement negotiation and caching correctly, you’ll create hard-to-debug “sometimes I get Markdown, sometimes I get HTML” failures.

DOM flattening is crawler-specific rendering

Not every crawler will ask for Markdown.

Some agents and search systems already have HTML extraction pipelines. For them, the useful representation may not be Markdown. It may be simpler HTML.

This is close to what many companies already do with dynamic rendering and server-side rendering to improve crawlability for Google. SSR makes JavaScript-heavy pages easier for crawlers to render, index, and understand.

DOM flattening is the more extreme version of that idea. The goal is not just to pre-render the same page. The goal is to emit a crawler-focused representation of the article or documentation page.

Instead of returning the full application shell, the server can emit a flatter document:

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8" />
  <title>Functions</title>
  <link rel="canonical" href="https://example.com/docs/functions" />
</head>
<body>
  <main>
    <article>
      <h1>Functions</h1>
      <p>Functions allow you to run server-side code...</p>

      <h2>Runtimes</h2>
      <p>...</p>

      <h2>Examples</h2>
      <pre><code>...</code></pre>
    </article>
  </main>
</body>
</html>

The output can drop the header, footer, mega menu, unrelated promos, decorative wrappers, and script payload.

DOM flattening can reduce egress, reduce parser complexity, and preserve familiar HTML semantics. It is useful for crawlers that still consume HTML but do not need the full browser experience.

This is a form of personalisation, but the personalised layer is the representation, not the underlying claims.

The same canonical content can produce multiple outputs: full HTML for browsers, Markdown for agents that ask for it, simplified HTML for crawlers that still consume the DOM, and potentially other structured formats for more specialised consumers.

That distinction matters. Serving a crawler a cleaner representation of the same content is different from serving a crawler different content.

DOM flattening therefore needs the same discipline as Markdown negotiation: semantic equivalence, cache isolation, canonical URLs, logging, and parity checks.

The best implementation is not to scrape your own rendered page after the fact. It is to emit simplified HTML from the same source of truth as the full page.
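A minimal sketch of that idea, assuming a deliberately small content model: the `Article` shape and the `renderFlatHtml` and `escapeHtml` names are hypothetical, and a real model would also carry code blocks, images, and dates.

```typescript
// Hypothetical content model: the same record would feed the full page.
interface Article {
  title: string;
  canonicalUrl: string;
  sections: { heading: string; paragraphs: string[] }[];
}

// Escapes text so content cannot break out of the surrounding markup.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Emits a flat document: no header, footer, navigation, or scripts.
function renderFlatHtml(article: Article): string {
  const body = article.sections
    .map((section) =>
      [
        `<h2>${escapeHtml(section.heading)}</h2>`,
        ...section.paragraphs.map((p) => `<p>${escapeHtml(p)}</p>`),
      ].join("\n"),
    )
    .join("\n");

  return [
    "<!doctype html>",
    '<html lang="en">',
    "<head>",
    '<meta charset="utf-8" />',
    `<title>${escapeHtml(article.title)}</title>`,
    `<link rel="canonical" href="${escapeHtml(article.canonicalUrl)}" />`,
    "</head>",
    "<body><main><article>",
    `<h1>${escapeHtml(article.title)}</h1>`,
    body,
    "</article></main></body>",
    "</html>",
  ].join("\n");
}
```

Because the renderer never sees the application shell, there is nothing to strip afterwards, and the canonical link is emitted from the same record that drives the full page.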

Forced format rendering is the aggressive version

There is a more aggressive version of representation personalisation: forcing a crawler into a specific format.

For example, a site might recognise that OAI-SearchBot is requesting a page and decide to return Markdown or simplified HTML regardless of what the crawler originally asked for.

That can sometimes work. It may even be useful when a crawler is known to parse a particular representation better than the default page. But it is the highest-risk version of this strategy because the server is no longer responding to explicit client preference. It is overriding the representation based on bot identity.

Risk note: forced rendering only works if your bot verification is excellent. A simple user-agent check is not enough. If a crawler is misidentified, or if a traditional search crawler (e.g. Googlebot) is routed into a representation that deviates from expected rendering, you can create indexing issues or ranking regressions. This is the sharp edge of crawler personalisation.

Content negotiation says: the client asked for Markdown, so the server returned Markdown.

Forced rendering says: the server recognised the client and chose the format on its behalf.

That distinction matters. Forced rendering can be a useful experimental path, but it should sit behind strict bot verification, cache isolation, monitoring, and rollback controls. It should also be tested separately from normal search rendering so that experiments for LLM crawlers do not accidentally change the experience for Google, Bing, or other traditional search systems.

In our testing, this has been the most aggressive approach. It can produce a cleaner representation for the target crawler, but the downside risk is much higher than .md endpoints, rel="alternate", or Accept negotiation.

Bot-aware rendering needs a high trust bar

Bot-aware rendering means varying the response based on the verified crawler or agent making the request.

Human browser        -> full HTML
Googlebot            -> canonical HTML
OAI-SearchBot        -> Markdown or simplified HTML
GPTBot               -> Markdown or simplified HTML
ClaudeBot            -> Markdown or simplified HTML
PerplexityBot        -> Markdown or simplified HTML

If you are dealing with specific crawlers, use the official guidance for OpenAI crawlers and user agents, Anthropic crawler controls, and Perplexity bots.

The motivation is reasonable. “Bot” is no longer one audience.

A training crawler may need a clean, stable representation of public documentation. A search retrieval crawler may need freshness, canonical URLs, and citation-friendly content. A user-triggered agent may need the exact page the user requested with JavaScript actions. Traditional search crawlers may need the canonical HTML page to preserve normal search behaviour.

But bot-aware rendering has risk.

If you serve materially different content to crawlers and humans, you can create trust, compliance, and SEO problems. Serving an agent a cleaner representation of the same page is a representation strategy. Serving an agent claims, links, or sections that humans cannot see is a different thing.

The trust boundary is semantic equivalence.

Do not personalise based only on user-agent strings. If you cannot verify the crawler, do not serve crawler-specific content.
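One way to go beyond the user-agent string, where a crawler operator publishes its IP ranges, is to require that both the user-agent and the source address match. The sketch below is an assumption-laden illustration: `ExampleBot` and its range are placeholders (the range comes from the reserved documentation block 192.0.2.0/24), the function names are hypothetical, and only IPv4 is handled. Not every operator publishes ranges, so check each operator's own guidance.

```typescript
// Converts a dotted IPv4 address to a number, or null if it is invalid.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;

  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when the address falls inside the CIDR block (IPv4 only).
function inCidr(ip: string, cidr: string): boolean {
  const [range, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const rangeInt = ipv4ToInt(range ?? "");

  if (ipInt === null || rangeInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;

  // Compare the network portion by dividing out the host portion.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(rangeInt / blockSize);
}

// PLACEHOLDER data: replace with ranges the crawler operator publishes.
const PUBLISHED_RANGES: Record<string, string[]> = {
  ExampleBot: ["192.0.2.0/24"],
};

// Returns the crawler name only when user-agent AND source IP both match.
function verifiedCrawler(userAgent: string, ip: string): string | null {
  for (const [name, ranges] of Object.entries(PUBLISHED_RANGES)) {
    if (userAgent.includes(name) && ranges.some((r) => inCidr(ip, r))) {
      return name;
    }
  }
  return null;
}
```

A request that claims to be a crawler but fails the address check returns null here, which should route it to the default representation rather than a crawler-specific one.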

Agent-readable content should come from the content model

The hard part is not converting rendered HTML to Markdown.

The hard part is defining what each component means.

A code example is a good place to draw the line between interface and content.

In the browser, the code component may include tabs, a filename, a copy button, syntax highlighting, line numbers, and layout wrappers. Those details help the page experience, but most of them are not part of the content contract.

For an agent-readable representation, the useful output is simpler. Preserve the explanation, filename, language, and code. Drop the UI chrome.

Example output:

The following middleware configuration matches requests for documentation routes.

File: `middleware.ts`
Language: TypeScript

Code:
export const config = {
  matcher: "/docs/:path*",
};

That is much more useful than exporting only the code block or dumping the rendered HTML for the component. The agent gets the purpose of the example, where the code belongs, which language it uses, and the code itself.
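As a rough sketch of how such output might be produced from a content model rather than from the rendered DOM, the renderer below takes a hypothetical `CodeExample` record and emits the explanation, filename, language, and code. The field names and the `renderCodeExample` function are assumptions for illustration, and the sketch wraps the code in a Markdown fence instead of a "Code:" label.

```typescript
// Hypothetical content-model shape for a code example component.
interface CodeExample {
  explanation: string;
  filename: string;
  language: string;
  code: string;
}

// Renders the content contract only: no tabs, copy button, or line numbers.
function renderCodeExample(example: CodeExample): string {
  // Built at runtime so this sketch can itself sit inside a fenced block.
  const fence = "`".repeat(3);

  return [
    example.explanation,
    "",
    `File: \`${example.filename}\``,
    `Language: ${example.language}`,
    "",
    `${fence}${example.language.toLowerCase()}`,
    example.code.trimEnd(),
    fence,
  ].join("\n");
}

console.log(
  renderCodeExample({
    explanation: "The following middleware configuration matches requests for documentation routes.",
    filename: "middleware.ts",
    language: "TypeScript",
    code: 'export const config = {\n  matcher: "/docs/:path*",\n};\n',
  }),
);
```

The browser component and this renderer read the same record, so a change to the example updates both representations at once.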

An image component should preserve the image URL, alt text, caption, and useful surrounding context.

![Diagram showing the request lifecycle](https://example.com/images/request-lifecycle.png)

Caption: A request moves through routing, middleware, compute, and response caching.

A product component might emit the product name, canonical URL, key attributes, availability, image URL, alt text, and policy links.

A legal page might need exact wording, version history, jurisdiction, and effective date.

Different page types need different content contracts.

Page type	Good agent representation
Documentation	Full body, headings, code blocks, examples, version metadata.
API reference	Endpoints, parameters, responses, auth model, examples.
Product page	Name, description, specs, image URL, `alt`, availability, policies, canonical URL.
Blog post	Title, author, date, body, citations, canonical URL.
Press release	Main announcement, dates, named entities, boilerplate, media contact.
Legal page	Exact text, effective date, jurisdiction, version history.
Support article	Problem, symptoms, affected products, resolution, last updated.

Do not scrape your own DOM and hope the result is good enough.

Emit the representation from the content model. That is how you preserve fidelity across HTML, Markdown, simplified HTML, structured data, XML, JSON, and future agent formats.

This also opens up a useful experiment space. Markdown may be the most readable default for agents, but XML and JSON could work better for some page types.

A single article could expose API-like representations: one Markdown version for reading, one JSON version for structured extraction, and one XML version that behaves more like an article-level feed.

The question is not which format wins everywhere. The question is which representation gives the crawler or agent the highest-fidelity version of the content for the task.

Practical recommendations

For most teams, the path should be layered.

Start with the content types where cleaner representations are most likely to improve retrieval: documentation, API references, changelogs, support articles, and other structured content. Developer documentation should be the priority when coding agents are already requesting Markdown through content negotiation.

A sensible rollout looks like this:

Ship .md endpoints where direct access matters today.
Add rel="alternate" type="text/markdown" from the HTML page when a Markdown version exists.
Build toward Accept: text/markdown on the canonical URL, with Vary: Accept handled correctly.
Use simplified HTML as a fallback for crawlers that still consume the DOM.
Only use bot-aware or forced rendering when crawler identity is verified, caches are isolated, and rollback is straightforward.

The question is not “Did we ship Markdown?”

The question is “Did the agent receive a higher-fidelity representation of the same content?”

How to measure whether this actually worked

Do not measure Markdown by file size or crawlability alone.

Measure whether the alternate representation gives agents a higher-fidelity version of the same resource.

The simplest test is to compare three representations of the same page:

Representation	What to test	Failure mode
Full HTML	Can the agent extract the main content without noise?	Navigation, scripts, related links, or hidden content pollute the answer.
Markdown	Does it preserve the facts, structure, links, dates, code, and image context?	The output is cleaner but lossy.
Simplified HTML	Does it improve extraction for crawlers that still prefer the DOM?	The representation diverges from the canonical page.

1. Test representation fidelity

For a sample of important pages, confirm that the alternate representation preserves:

Title, author, publication date, and canonical URL
Headings, lists, tables, and code blocks
Key internal and external links
Image URLs, alt text, captions, and surrounding context
Product, legal, pricing, regional, or availability constraints where relevant

The goal is not to make the representation shorter, but to remove noise without losing meaning.
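A parity check along these lines can be automated. The sketch below is a coarse, assumption-based example: the `PageFacts` shape and `findParityGaps` function are hypothetical, the facts are expected to come from the content model, and simple substring checks stand in for proper Markdown parsing.

```typescript
// Hypothetical record of facts the Markdown representation must preserve.
interface PageFacts {
  title: string;
  publishedDate: string;
  canonicalUrl: string;
  codeBlockCount: number;
  links: string[];
}

// Returns the names of facts missing from the Markdown output.
function findParityGaps(facts: PageFacts, markdown: string): string[] {
  const gaps: string[] = [];

  if (!markdown.includes(facts.title)) gaps.push("title");
  if (!markdown.includes(facts.publishedDate)) gaps.push("publishedDate");
  if (!markdown.includes(facts.canonicalUrl)) gaps.push("canonicalUrl");

  // Count fence lines; each code block needs an opening and a closing fence.
  const fence = "`".repeat(3);
  const fenceLines = markdown
    .split("\n")
    .filter((line) => line.trimStart().startsWith(fence)).length;
  if (Math.floor(fenceLines / 2) < facts.codeBlockCount) gaps.push("codeBlocks");

  for (const link of facts.links) {
    if (!markdown.includes(link)) gaps.push(`link:${link}`);
  }

  return gaps;
}
```

An empty result does not prove fidelity, but a non-empty one is a cheap signal that the export is lossy and should block a release.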

2. Test retrieval behaviour

Run the same prompt set against HTML, Markdown, and simplified HTML.

Compare:

Whether answers cite the canonical URL
Whether snippets preserve qualifiers, dates, and constraints
Whether the agent uses the main content rather than navigation or related modules
Whether updates are reflected at the expected cadence
Whether the answer changes materially between representations

If Markdown produces shorter but less accurate answers, it failed.

3. Test operational safety

Log enough to detect representation mistakes:

Verified crawler identity, where available
User-agent
Accept header
Response Content-Type
Canonical URL emitted
Cache key / Vary behaviour
Whether the response came from cache or origin

The main thing to catch is cache mixing: a browser receiving Markdown, or an agent receiving full browser HTML when it asked for Markdown.
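If those fields are logged, a coarse triage check can flag likely cache mixing. The `ResponseLog` shape and `isRepresentationMismatch` function below are hypothetical, and the substring checks are a log heuristic only, not a substitute for proper negotiation. Dedicated `.md` URLs are skipped because they are expected to return Markdown regardless of the Accept header.

```typescript
// Hypothetical log record built from the fields listed above.
interface ResponseLog {
  url: string;
  accept: string;
  contentType: string;
  fromCache: boolean;
}

// Flags responses whose media type contradicts what the client asked for.
function isRepresentationMismatch(entry: ResponseLog): boolean {
  // Forced Markdown routes legitimately ignore the Accept header.
  if (/\.md(?:[?#]|$)/.test(entry.url)) return false;

  const accept = entry.accept.toLowerCase();
  const served = (entry.contentType.toLowerCase().split(";")[0] ?? "").trim();
  const acceptsMarkdown = accept.includes("text/markdown");
  const acceptsHtml = accept.includes("text/html") || accept.includes("*/*");

  // Markdown served to a client that never mentioned it (e.g. a browser).
  if (served === "text/markdown" && !acceptsMarkdown) return true;

  // HTML served to a client that asked for Markdown and nothing else.
  if (served === "text/html" && acceptsMarkdown && !acceptsHtml) return true;

  return false;
}
```

Filtering flagged entries by `fromCache` helps separate a cache-key problem from an origin negotiation bug.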

Decision rule

A representation is worth shipping only if it:

Preserves the same claims as the canonical page
Improves retrieval quality or extraction reliability
Keeps citations pointed at the canonical URL
Does not introduce cache, indexing, or bot-verification risk

Implementation guardrails

Before rolling this out, make sure five things are true:

The alternate representation is generated from the same content model as the canonical page.
The canonical URL is preserved across HTML, Markdown, and simplified HTML.
Caches vary correctly by representation, especially when using Accept.
Bot-specific rendering only runs for verified crawlers.
Monitoring can detect citation drift, cache mixing, and representation mismatch.

If any of these are missing, the risk is not just a bad Markdown file; it is serving the wrong content to the wrong consumer.

The future is not a duplicate web

The web does not need to become Markdown.

Browsers still need HTML. Search engines still need canonical pages. Humans still need design, interactivity, performance, accessibility, and trust.

But agents need cleaner representations of the same underlying content.

Markdown endpoints are useful today. rel="alternate" helps with discovery. llms.txt gives agents a curated map. DOM flattening is a pragmatic fallback. Bot-aware rendering and forced format rendering are powerful but risky.

The cleanest long-term model is content negotiation: one canonical resource that can return the right representation for the right consumer. Browsers still get HTML, agents can ask for Markdown, and crawlers can receive simplified HTML where it genuinely improves extraction.

That is the agent-readable web: not a duplicate web, but a better set of representations for the same source of truth.
