What it is
A sitemap is an XML file that lists the URLs a site wants crawlers to know about. The canonical protocol is sitemaps.org/protocol. It can be a flat `<urlset>` or a `<sitemapindex>` pointing to multiple sitemap files. For large sites (more than 50,000 URLs or 50 MB uncompressed per file), the index form is required.
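For illustration, a minimal index file might look like the fragment below (filenames and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-1.xml</loc>
    <lastmod>2026-04-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>
```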
Why it matters
- Agents use sitemaps to build their initial URL plan without crawling the whole site.
- They also reveal content that isn't linked from the homepage — changelogs, dated content, deep pages.
- For `llms.txt` authors, the sitemap is the source of truth they build from.
How we test it
- GET `/robots.txt` (already fetched for check 01) → collect every `Sitemap:` line.
- If none found, try well-known locations in order: `/sitemap.xml`, `/sitemap_index.xml`, `/sitemap.xml.gz` (decompress if gzipped).
- Fetch the first reachable sitemap with a 2 MB body cap.
- If the root is a `<sitemapindex>`, optionally follow the first child entry (max 1 level, just to confirm it's also valid).
- Parse with `fast-xml-parser`, lenient mode on, preserving attributes.
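The discovery steps above can be sketched as follows; the helper name and return shape are illustrative, not the tool's actual API:

```typescript
// Well-known fallback locations, tried in order when robots.txt
// declares no Sitemap: lines.
const WELL_KNOWN = ['/sitemap.xml', '/sitemap_index.xml', '/sitemap.xml.gz'];

// Collect candidate sitemap URLs: declared ones first, fallbacks otherwise.
function discoverSitemapUrls(robotsTxt: string, origin: string): string[] {
  const declared = robotsTxt
    .split('\n')
    .map(line => line.trim())
    .filter(line => /^sitemap:/i.test(line))       // "Sitemap:" is case-insensitive
    .map(line => line.slice('sitemap:'.length).trim())
    .filter(url => url.length > 0);
  if (declared.length > 0) return declared;
  return WELL_KNOWN.map(path => origin + path);
}
```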
Pass Warn Fail Matrix
| Condition | Status | Score |
|---|---|---|
| Sitemap found (robots.txt or well-known), well-formed XML, ≥1 URL | pass | 1.0 |
| Sitemap found but ≥50% of entries missing the required `<loc>` field | warn | 0.5 |
| Sitemap found but returns non-XML (e.g. an HTML error page) | fail | 0.0 |
| Sitemap index points only to broken children | warn | 0.4 |
| No sitemap anywhere | fail | 0.0 |
Sub Tests
| id | Weight within check | Pass when |
|---|---|---|
| exists | 0.4 | Reachable 200 response at any discovered sitemap URL |
| well-formed | 0.3 | Parses as XML with a `<urlset>` or `<sitemapindex>` root |
| valid-entries | 0.3 | ≥90% of entries have a valid absolute `<loc>` |
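The check score could be combined from these sub-tests along these lines; types and function name are hypothetical, not the tool's actual API:

```typescript
// One result per sub-test row in the table above.
type SubTestResult = { id: string; weight: number; passed: boolean };

// Weighted sum: a passing sub-test contributes its full weight, a failing one zero.
function checkScore(results: SubTestResult[]): number {
  return results.reduce((sum, r) => sum + (r.passed ? r.weight : 0), 0);
}
```

For example, if `exists` and `well-formed` pass but `valid-entries` fails, the check scores 0.7.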
Remediation Prompt
Please add or fix my site's XML sitemap:
1. Generate a sitemap listing every public URL. If the site has more than 50,000 URLs or the file would exceed 50 MB, split it into multiple sitemaps and use a `<sitemapindex>`.
2. Host it at `/sitemap.xml`. Use absolute HTTPS URLs in every `<loc>`.
3. Add `Sitemap: https://<my-site>/sitemap.xml` to `/robots.txt`.
4. Include `<lastmod>` dates in ISO 8601 where known; they help agents prioritise fresh content.
5. Keep the sitemap under 50 MB uncompressed; gzip at `/sitemap.xml.gz` if needed.
6. Ensure the XML declaration (`<?xml version="1.0" encoding="UTF-8"?>`) is present.
Example urlset entry:

```xml
<url>
  <loc>https://example.com/docs/getting-started</loc>
  <lastmod>2026-04-15</lastmod>
</url>
```
Implementation Examples
Next.js App Router dynamic sitemap
```typescript
// src/app/sitemap.ts
import type { MetadataRoute } from 'next';
import { getAllRoutes } from '@/lib/routes';

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const routes = await getAllRoutes();
  return routes.map(r => ({
    url: `https://example.com${r.path}`,
    lastModified: r.updatedAt,
  }));
}
```

Static (any framework)

Pre-generate `public/sitemap.xml` at build time.
Cloudflare Workers
Emit XML directly with application/xml content-type.
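A minimal sketch of such a Worker, assuming a hard-coded URL list (a real deployment would derive it from routes or a data source):

```typescript
// Build a urlset document from a list of absolute URLs.
function buildSitemapXml(urls: string[]): string {
  const entries = urls.map(u => `  <url><loc>${u}</loc></url>`).join('\n');
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    entries +
    '\n</urlset>\n'
  );
}

export default {
  async fetch(): Promise<Response> {
    const urls = ['https://example.com/', 'https://example.com/docs'];
    // Serve with an XML content-type so checkers don't see text/html.
    return new Response(buildSitemapXml(urls), {
      headers: { 'Content-Type': 'application/xml' },
    });
  },
};
```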
Common Mistakes
- Relative URLs inside `<loc>` (the spec requires absolute URLs).
- Serving the sitemap as `text/html` (rewrite rules or SPA fallback catching it).
- Including authenticated/gated pages that return 401/403; these should be excluded.
- Sitemap index that points to itself.
- Exceeding 50,000 entries in a single file.
- Using `<lastmod>` with only a date while the server serves files in UTC (minor, but consistent ISO 8601 with a timezone is safer).
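To guard against the first mistake, a validator might test each `<loc>` along these lines (a sketch using the WHATWG `URL` constructor, which throws on relative input when no base is given):

```typescript
// True only for absolute http(s) URLs; relative paths and other
// schemes (mailto:, ftp:) are rejected.
function isAbsoluteHttpUrl(loc: string): boolean {
  try {
    const u = new URL(loc);
    return u.protocol === 'http:' || u.protocol === 'https:';
  } catch {
    return false; // relative URL: new URL() throws without a base
  }
}
```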
Test Fixtures
- `pass-urlset.xml`
- `pass-sitemapindex.xml` + `pass-child-urlset.xml`
- `warn-missing-loc.xml`
- `fail-html-error.json`
- `fail-404.json`
- `fail-not-xml.json` (YAML masquerading as XML)