
Sitemap

Does the site publish a valid sitemap (directly at `/sitemap.xml`, via `robots.txt`, or as a sitemap index), and is it well-formed?

What it is

A sitemap is an XML file that lists the URLs a site wants crawlers to know about. The canonical protocol is defined at sitemaps.org/protocol. It can be a flat `<urlset>` or a `<sitemapindex>` pointing to multiple sitemap files. For large sites (more than 50,000 URLs or 50 MB per file), the index form is required.
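For reference, a minimal sitemap index looks like the following (the child file names are hypothetical placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-04-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```

Each `<loc>` here points at a child sitemap, which is itself a flat `<urlset>`.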

Why it matters

  • Agents use sitemaps to build their initial URL plan without crawling the whole site.
  • They also reveal content that isn't linked from the homepage — changelogs, dated content, deep pages.
  • For llms.txt authors, the sitemap is the source of truth they build from.

How we test it

  1. GET /robots.txt (already fetched for check 01) → collect every Sitemap: line.
  2. If none found, try well-known locations in order:
    • /sitemap.xml
    • /sitemap_index.xml
    • /sitemap.xml.gz (decompressed before parsing)
  3. Fetch the first reachable sitemap with a 2 MB body cap.
  4. If the root is a <sitemapindex>, optionally follow the first child entry (max 1 level, just to confirm it's also valid).
  5. Parse with fast-xml-parser — lenient mode on, preserve attributes.
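A dependency-free sketch of steps 1, 2, and 5 above (the real check parses with fast-xml-parser; the regex root-element check here is a simplified stand-in, and all function names are illustrative):

```typescript
// Step 1: collect every `Sitemap:` line from a robots.txt body (case-insensitive).
export function sitemapUrlsFromRobots(robotsTxt: string): string[] {
  return robotsTxt
    .split('\n')
    .map(line => /^sitemap:\s*(\S+)/i.exec(line.trim()))
    .filter((m): m is RegExpExecArray => m !== null)
    .map(m => m[1]);
}

// Step 2: well-known fallback locations, tried in order when robots.txt lists none.
export function wellKnownCandidates(origin: string): string[] {
  return ['/sitemap.xml', '/sitemap_index.xml', '/sitemap.xml.gz'].map(p => origin + p);
}

// Step 5 (simplified): does the document root look like a urlset or a sitemapindex?
export function rootElement(xml: string): 'urlset' | 'sitemapindex' | null {
  const m = /<\s*(urlset|sitemapindex)[\s>]/.exec(xml);
  return m ? (m[1] as 'urlset' | 'sitemapindex') : null;
}
```

Fetching, the 2 MB body cap, and gzip handling are omitted here to keep the sketch focused on discovery and classification.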

Pass Warn Fail Matrix

| Condition | Status | Score |
| --- | --- | --- |
| Sitemap found (robots.txt or well-known), well-formed XML, ≥1 URL | pass | 1.0 |
| Sitemap found but ≥50% of entries missing the required `<loc>` field | warn | 0.5 |
| Sitemap found but returns non-XML (e.g. an HTML error page) | fail | 0.0 |
| Sitemap index points only to broken children | warn | 0.4 |
| No sitemap anywhere | fail | 0.0 |

Sub Tests

| id | Weight within check | Pass when |
| --- | --- | --- |
| exists | 0.4 | Reachable 200 response at any discovered sitemap URL |
| well-formed | 0.3 | Parses as XML with a `<urlset>` or `<sitemapindex>` root |
| valid-entries | 0.3 | ≥90% of entries have a valid absolute `<loc>` |
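The valid-entries sub-test can be sketched as follows (the 90% threshold and absolute-URL requirement are as described above; extracting `<loc>` values with a regex is a simplification of the real parser):

```typescript
// Fraction of <loc> entries that are valid absolute http(s) URLs.
export function validEntryRatio(xml: string): number {
  const locs = [...xml.matchAll(/<loc>\s*([^<]*?)\s*<\/loc>/g)].map(m => m[1]);
  if (locs.length === 0) return 0;
  const valid = locs.filter(loc => {
    try {
      const u = new URL(loc); // absolute URLs parse; relative ones throw
      return u.protocol === 'http:' || u.protocol === 'https:';
    } catch {
      return false;
    }
  }).length;
  return valid / locs.length;
}

// Pass when at least 90% of entries carry a valid absolute <loc>.
export const passesValidEntries = (xml: string) => validEntryRatio(xml) >= 0.9;
```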

Remediation Prompt

Please add or fix my site's XML sitemap:

1. Generate a sitemap listing every public URL. If the site has more than 50,000 URLs or the file would exceed 50 MB, split into multiple sitemaps and use a `<sitemapindex>`.
2. Host it at /sitemap.xml. Use absolute HTTPS URLs in every `<loc>`.
3. Add `Sitemap: https://<my-site>/sitemap.xml` to /robots.txt.
4. Include `<lastmod>` dates in ISO-8601 where known — they help agents prioritise fresh content.
5. Keep the sitemap under 50 MB uncompressed; gzip at /sitemap.xml.gz if needed.
6. Ensure the XML declaration (`<?xml version="1.0" encoding="UTF-8"?>`) is present.
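The robots.txt addition from step 3 looks like this in a minimal file (example.com is a placeholder for your host):

```
# robots.txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```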

Example urlset entry:
    <url>
      <loc>https://example.com/docs/getting-started</loc>
      <lastmod>2026-04-15</lastmod>
    </url>

Implementation Examples

Next.js App Router dynamic sitemap

// src/app/sitemap.ts
import type { MetadataRoute } from 'next';
import { getAllRoutes } from '@/lib/routes';
 
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const routes = await getAllRoutes();
  return routes.map(r => ({ url: `https://example.com${r.path}`, lastModified: r.updatedAt }));
}

Static (any framework)

Pre-generate public/sitemap.xml at build time.
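A minimal build-time generator, assuming a Node build step; the route list and example.com origin are hypothetical stand-ins for your own data:

```typescript
import { mkdirSync, writeFileSync } from 'node:fs';

// Hypothetical route list — replace with your site's real public URLs.
const routes = [
  { path: '/', lastmod: '2026-04-15' },
  { path: '/docs/getting-started', lastmod: '2026-04-15' },
];

// Render a spec-shaped <urlset> with absolute URLs and optional <lastmod>.
export function buildSitemapXml(
  origin: string,
  entries: { path: string; lastmod?: string }[],
): string {
  const body = entries
    .map(
      e =>
        `  <url>\n    <loc>${origin}${e.path}</loc>\n` +
        (e.lastmod ? `    <lastmod>${e.lastmod}</lastmod>\n` : '') +
        `  </url>`,
    )
    .join('\n');
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${body}\n</urlset>\n`
  );
}

// Run as part of the build so the file ships with the static assets.
mkdirSync('public', { recursive: true });
writeFileSync('public/sitemap.xml', buildSitemapXml('https://example.com', routes));
```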

Cloudflare Workers

Emit XML directly with application/xml content-type.
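A sketch of a Worker-style handler; the `PAGES` list is a hypothetical stand-in for your routes, and error handling is kept to the minimum:

```typescript
// Hypothetical route list for illustration.
const PAGES = ['/', '/docs/getting-started'];

export const worker = {
  async fetch(request: Request): Promise<Response> {
    const { origin, pathname } = new URL(request.url);
    if (pathname !== '/sitemap.xml') {
      return new Response('Not found', { status: 404 });
    }
    const body =
      `<?xml version="1.0" encoding="UTF-8"?>\n` +
      `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
      PAGES.map(p => `  <url><loc>${origin}${p}</loc></url>`).join('\n') +
      `\n</urlset>`;
    // application/xml keeps crawlers from treating the sitemap as an HTML page.
    return new Response(body, {
      headers: { 'content-type': 'application/xml; charset=utf-8' },
    });
  },
};

export default worker;
```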

Common Mistakes

  • Relative URLs inside `<loc>` — the spec requires absolute URLs.
  • Serving the sitemap as text/html (rewrite rules or an SPA fallback catching it).
  • Including authenticated/gated pages that return 401/403 — these should be excluded.
  • A sitemap index that points to itself.
  • Exceeding 50,000 entries in a single file.
  • Using `<lastmod>` with a bare date when change times matter — a full ISO-8601 timestamp with an explicit timezone is safer.

Test Fixtures

  • pass-urlset.xml
  • pass-sitemapindex.xml + pass-child-urlset.xml
  • warn-missing-loc.xml
  • fail-html-error.json
  • fail-404.json
  • fail-not-xml.json (YAML masquerading as XML)