
Sitemap

Does the site publish a valid sitemap (directly at `/sitemap.xml`, via `robots.txt`, or as a sitemap index), and is it well-formed?

What it is

A sitemap is an XML file that lists the URLs a site wants crawlers to know about. The canonical protocol is defined at sitemaps.org/protocol. It can be a flat `<urlset>` or a `<sitemapindex>` pointing to multiple sitemap files. For large sites (more than 50,000 URLs or 50 MB per file), the index form is required.
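For reference, a minimal sitemap index looks like the following (the child file names are hypothetical placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-04-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>
```

Each `<loc>` here points at a child sitemap, which is itself a flat `<urlset>`.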

Why it matters

  • Agents use sitemaps to build their initial URL plan without crawling the whole site.
  • They also reveal content that isn't linked from the homepage — changelogs, dated content, deep pages.
  • For llms.txt authors, the sitemap is the source of truth they build from.

How we test it

  1. GET /robots.txt (already fetched for check 01) → collect every Sitemap: line.
  2. If none found, try well-known locations in order:
    • /sitemap.xml
    • /sitemap_index.xml
    • /sitemap.xml.gz (decompressed before parsing)
  3. Fetch the first reachable sitemap with a 2 MB body cap.
  4. If the root is a <sitemapindex>, optionally follow the first child entry (max 1 level, just to confirm it's also valid).
  5. Parse with fast-xml-parser — lenient mode on, preserve attributes.
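A dependency-free sketch of steps 1, 2, and 5 above (the real check parses with fast-xml-parser; the regex root-element check here is a simplified stand-in, and all function names are illustrative):

```typescript
// Step 1: collect every `Sitemap:` line from a robots.txt body (case-insensitive).
export function sitemapUrlsFromRobots(robotsTxt: string): string[] {
  return robotsTxt
    .split('\n')
    .map(line => /^sitemap:\s*(\S+)/i.exec(line.trim()))
    .filter((m): m is RegExpExecArray => m !== null)
    .map(m => m[1]);
}

// Step 2: well-known fallback locations, tried in order when robots.txt lists none.
export function wellKnownCandidates(origin: string): string[] {
  return ['/sitemap.xml', '/sitemap_index.xml', '/sitemap.xml.gz'].map(p => origin + p);
}

// Step 5 (simplified): does the document root look like a urlset or a sitemapindex?
export function rootElement(xml: string): 'urlset' | 'sitemapindex' | null {
  const m = /<\s*(urlset|sitemapindex)[\s>]/.exec(xml);
  return m ? (m[1] as 'urlset' | 'sitemapindex') : null;
}
```

Fetching, the 2 MB body cap, and gzip handling are omitted here to keep the sketch focused on discovery and classification.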

Pass Warn Fail Matrix

| Condition | Status | Score |
| --- | --- | --- |
| Sitemap found (robots.txt or well-known), well-formed XML, ≥1 URL | pass | 1.0 |
| Sitemap found but ≥50% of entries missing the required `<loc>` field | warn | 0.5 |
| Sitemap found but returns non-XML (e.g. an HTML error page) | fail | 0.0 |
| Sitemap index points only to broken children | warn | 0.4 |
| No sitemap anywhere | fail | 0.0 |

Sub Tests

| id | Weight within check | Pass when |
| --- | --- | --- |
| exists | 0.4 | Reachable 200 response at any discovered sitemap URL |
| well-formed | 0.3 | Parses as XML with a `<urlset>` or `<sitemapindex>` root |
| valid-entries | 0.3 | ≥90% of entries have a valid absolute `<loc>` |
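The valid-entries sub-test can be sketched as follows (the 90% threshold and absolute-URL requirement are as described above; extracting `<loc>` values with a regex is a simplification of the real parser):

```typescript
// Fraction of <loc> entries that are valid absolute http(s) URLs.
export function validEntryRatio(xml: string): number {
  const locs = [...xml.matchAll(/<loc>\s*([^<]*?)\s*<\/loc>/g)].map(m => m[1]);
  if (locs.length === 0) return 0;
  const valid = locs.filter(loc => {
    try {
      const u = new URL(loc); // absolute URLs parse; relative ones throw
      return u.protocol === 'http:' || u.protocol === 'https:';
    } catch {
      return false;
    }
  }).length;
  return valid / locs.length;
}

// Pass when at least 90% of entries carry a valid absolute <loc>.
export const passesValidEntries = (xml: string) => validEntryRatio(xml) >= 0.9;
```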

Remediation Prompt

Please add or fix my site's XML sitemap:

1. Generate a sitemap listing every public URL. If the site has more than 50,000 URLs or the file would exceed 50 MB, split into multiple sitemaps and use a `<sitemapindex>`.
2. Host it at /sitemap.xml. Use absolute HTTPS URLs in every `<loc>`.
3. Add `Sitemap: https://<my-site>/sitemap.xml` to /robots.txt.
4. Include `<lastmod>` dates in ISO-8601 where known — they help agents prioritise fresh content.
5. Keep the sitemap under 50 MB uncompressed; gzip at /sitemap.xml.gz if needed.
6. Ensure the XML declaration (`<?xml version="1.0" encoding="UTF-8"?>`) is present.
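The robots.txt addition from step 3 looks like this in a minimal file (example.com is a placeholder for your host):

```
# robots.txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```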

Example urlset entry:
    <url>
      <loc>https://example.com/docs/getting-started</loc>
      <lastmod>2026-04-15</lastmod>
    </url>

Implementation Examples

Next.js App Router dynamic sitemap

// src/app/sitemap.ts
import type { MetadataRoute } from 'next';
import { getAllRoutes } from '@/lib/routes';
 
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const routes = await getAllRoutes();
  return routes.map(r => ({ url: `https://example.com${r.path}`, lastModified: r.updatedAt }));
}

Static (any framework)

Pre-generate public/sitemap.xml at build time.
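A minimal build-time generator, assuming a Node build step; the route list and example.com origin are hypothetical stand-ins for your own data:

```typescript
import { mkdirSync, writeFileSync } from 'node:fs';

// Hypothetical route list — replace with your site's real public URLs.
const routes = [
  { path: '/', lastmod: '2026-04-15' },
  { path: '/docs/getting-started', lastmod: '2026-04-15' },
];

// Render a spec-shaped <urlset> with absolute URLs and optional <lastmod>.
export function buildSitemapXml(
  origin: string,
  entries: { path: string; lastmod?: string }[],
): string {
  const body = entries
    .map(
      e =>
        `  <url>\n    <loc>${origin}${e.path}</loc>\n` +
        (e.lastmod ? `    <lastmod>${e.lastmod}</lastmod>\n` : '') +
        `  </url>`,
    )
    .join('\n');
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${body}\n</urlset>\n`
  );
}

// Run as part of the build so the file ships with the static assets.
mkdirSync('public', { recursive: true });
writeFileSync('public/sitemap.xml', buildSitemapXml('https://example.com', routes));
```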

Cloudflare Workers

Emit XML directly with application/xml content-type.
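A sketch of a Worker-style handler; the `PAGES` list is a hypothetical stand-in for your routes, and error handling is kept to the minimum:

```typescript
// Hypothetical route list for illustration.
const PAGES = ['/', '/docs/getting-started'];

export const worker = {
  async fetch(request: Request): Promise<Response> {
    const { origin, pathname } = new URL(request.url);
    if (pathname !== '/sitemap.xml') {
      return new Response('Not found', { status: 404 });
    }
    const body =
      `<?xml version="1.0" encoding="UTF-8"?>\n` +
      `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
      PAGES.map(p => `  <url><loc>${origin}${p}</loc></url>`).join('\n') +
      `\n</urlset>`;
    // application/xml keeps crawlers from treating the sitemap as an HTML page.
    return new Response(body, {
      headers: { 'content-type': 'application/xml; charset=utf-8' },
    });
  },
};

export default worker;
```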

Common Mistakes

  • Relative URLs inside `<loc>` — the spec requires absolute URLs.
  • Serving the sitemap as text/html (rewrite rules or an SPA fallback catching it).
  • Including authenticated/gated pages that return 401/403 — these should be excluded.
  • A sitemap index that points to itself.
  • Exceeding 50,000 entries in a single file.
  • Using `<lastmod>` with a bare date when change times matter — a full ISO-8601 timestamp with an explicit timezone is safer.

Test Fixtures

  • pass-urlset.xml
  • pass-sitemapindex.xml + pass-child-urlset.xml
  • warn-missing-loc.xml
  • fail-html-error.json
  • fail-404.json
  • fail-not-xml.json (YAML masquerading as XML)