aide
discoverability

robots.txt

Is there a valid `/robots.txt`, does it point to at least one sitemap, and does it address AI bots (either by `User-Agent` or by `Content-Signal`)?

What it is

robots.txt is an informal protocol, formalised as RFC 9309, that tells crawlers what they may access. For agent readiness it plays three roles:

  1. It's the first file any agent fetches.
  2. It's where sitemaps are advertised (Sitemap: …).
  3. It's where new AI-specific directives live — per-bot rules and Content-Signals.

Why it matters

  • 78% of top sites have a robots.txt, but most are written for search engines from a decade ago.
  • Without explicit AI-bot stanzas, well-behaved LLM crawlers (GPTBot, ClaudeBot, Google-Extended, PerplexityBot, etc.) default to general User-agent: * rules — which often disallow them by accident.
  • Without a Sitemap: line, agents fall back to crawling — slower, noisier, worse for both sides.
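
The accidental-block case in the second bullet usually looks like a legacy catch-all such as the one below; any AI bot without a stanza of its own falls through to the `*` group and is shut out entirely (a hypothetical example, not a recommended config):

```
User-agent: *
Disallow: /
```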

How we test it

| Step | Method | URL | Accept | Notes |
|------|--------|-----|--------|-------|
| 1 | GET | `https://<host>/robots.txt` | `*/*` | Follow up to 5 redirects (same-origin preferred) |
  • Max body size: 64 KB (robots files should be tiny; truncation → warn).
  • Timeout: 5 s.

The body is parsed with a lenient parser, loosely following RFC 9309: lines are grouped by `User-agent`, and `Allow`, `Disallow`, `Crawl-delay`, `Sitemap`, and `Content-Signal` directives are collected.
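A minimal sketch of that lenient parsing pass in TypeScript; the type names and function signature are illustrative, not the checker's actual API:

```typescript
type RobotsGroup = { userAgents: string[]; rules: { directive: string; value: string }[] };
type ParsedRobots = { groups: RobotsGroup[]; sitemaps: string[]; contentSignals: string[] };

function parseRobots(body: string): ParsedRobots {
  const groups: RobotsGroup[] = [];
  const sitemaps: string[] = [];
  const contentSignals: string[] = [];
  let current: RobotsGroup | null = null;

  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // strip comments (lenient)
    if (!line) continue;
    const idx = line.indexOf(':');
    if (idx < 0) continue; // lenient: silently skip malformed lines
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (key === 'user-agent') {
      // Consecutive User-agent lines share one group, per RFC 9309.
      if (current && current.rules.length === 0) {
        current.userAgents.push(value);
      } else {
        current = { userAgents: [value], rules: [] };
        groups.push(current);
      }
    } else if (key === 'sitemap') {
      sitemaps.push(value); // Sitemap is global, not scoped to a group
    } else if (key === 'content-signal') {
      contentSignals.push(value);
      if (current) current.rules.push({ directive: key, value });
    } else if (['allow', 'disallow', 'crawl-delay'].includes(key)) {
      if (current) current.rules.push({ directive: key, value });
    }
  }
  return { groups, sitemaps, contentSignals };
}
```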

Pass Warn Fail Matrix

| Condition | Status | Score |
|-----------|--------|-------|
| File exists, points to ≥1 sitemap, addresses AI bots (rule or Content-Signal) | pass | 1.0 |
| File exists, points to ≥1 sitemap, no AI-specific rules | warn | 0.7 |
| File exists, no sitemap, has AI-specific rules | warn | 0.7 |
| File exists, no sitemap, no AI rules | warn | 0.4 |
| 4xx (no robots.txt) | fail | 0.0 |
| 5xx, timeout, or non-text response | fail | 0.0 |
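The "file exists" rows of the matrix reduce to a small pure function. This sketch covers only those rows (the 4xx/5xx/timeout rows fail before parsing begins), and the function name is assumed:

```typescript
type CheckStatus = { status: 'pass' | 'warn' | 'fail'; score: number };

// Maps the parsed robots.txt facts onto the pass/warn/fail matrix above.
function scoreRobots(exists: boolean, hasSitemap: boolean, addressesAiBots: boolean): CheckStatus {
  if (!exists) return { status: 'fail', score: 0.0 };
  if (hasSitemap && addressesAiBots) return { status: 'pass', score: 1.0 };
  if (hasSitemap || addressesAiBots) return { status: 'warn', score: 0.7 };
  return { status: 'warn', score: 0.4 };
}
```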

Sub-Tests for UI and Partial Credit

| id | Weight within check | Pass when |
|----|---------------------|-----------|
| exists | 0.4 | Status 200, content-type is `text/plain*` (or none with plain body), body non-empty |
| has-sitemap | 0.3 | At least one `Sitemap:` line with a valid absolute URL |
| addresses-ai-bots | 0.3 | At least one of: (a) a group targeting a known AI bot user-agent; (b) a `Content-Signal` directive |

Known AI bot user-agents (case-insensitive, substring match): GPTBot, ClaudeBot, Claude-Web, Claude-User, CCBot, PerplexityBot, Google-Extended, Applebot-Extended, anthropic-ai, cohere-ai, ChatGPT-User, ChatGPT, FacebookBot, Meta-ExternalAgent, Bytespider, YouBot, Diffbot, ImagesiftBot, Amazonbot, DuckAssistBot, aideBot.
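The substring match itself is simple. A sketch with an abbreviated token list (the full list is above; the function name is illustrative):

```typescript
// Abbreviated; the production list contains all bots named above.
const AI_BOT_TOKENS = ['gptbot', 'claudebot', 'perplexitybot', 'google-extended', 'ccbot'];

// Case-insensitive substring match against a group's User-agent value.
function targetsAiBot(userAgent: string): boolean {
  const ua = userAgent.toLowerCase();
  return AI_BOT_TOKENS.some((token) => ua.includes(token));
}
```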

Remediation Prompt

I'm updating my site's /robots.txt so AI agents can discover content correctly. Please:

1. Create or update /robots.txt at the root of the site.
2. Add at least one Sitemap: directive pointing to my sitemap(s).
3. Add explicit stanzas for common AI bots with conservative, informed choices:
   - GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, CCBot, ChatGPT-User
4. Add a Content-Signal directive under User-agent: * declaring:
   - search=yes
   - ai-input=yes       (or 'no' if I don't want to be used as context for live inference)
   - ai-train=no        (or 'yes' if I'm comfortable being training data)

Use this template, keeping my existing rules for traditional crawlers intact:

    User-agent: *
    Allow: /
    Content-Signal: search=yes, ai-input=yes, ai-train=no

    User-agent: GPTBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    Sitemap: https://<my-site>/sitemap.xml

Serve the file with Content-Type: text/plain; charset=utf-8. Make sure it is reachable at /robots.txt (not redirected).

Implementation Examples

Next.js App Router (static)

public/robots.txt:

User-agent: *
Allow: /
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml

Next.js App Router (dynamic)

src/app/robots.ts:

import type { MetadataRoute } from 'next';
 
export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/' },
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'ClaudeBot', allow: '/' },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  };
}

Note: Next's typed Robots object doesn't have a first-class Content-Signal field yet. Emit raw text via a route handler if you need it:

// src/app/robots.txt/route.ts
export function GET() {
  const body = `User-agent: *\nAllow: /\nContent-Signal: search=yes, ai-input=yes, ai-train=no\n\nSitemap: https://example.com/sitemap.xml\n`;
  return new Response(body, { headers: { 'content-type': 'text/plain; charset=utf-8' } });
}

Express

app.get('/robots.txt', (_, res) => {
  res.type('text/plain').send(`User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n`);
});

Cloudflare Workers

if (url.pathname === '/robots.txt') {
  const txt = 'User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n';
  return new Response(txt, { headers: { 'content-type': 'text/plain; charset=utf-8' } });
}

nginx (static)

location = /robots.txt {
  root /var/www/public;
  # .txt already maps to text/plain via mime.types; charset appends utf-8.
  # (add_header Content-Type would add a duplicate header, not replace it.)
  charset utf-8;
}

Common Mistakes

  • Serving robots.txt with `content-type: text/html` (the default on some CDNs and SPA hosts)
  • A catch-all User-agent: * / Disallow: / that accidentally blocks every AI bot
  • Sitemap URL is relative (Sitemap: /sitemap.xml) — spec requires absolute
  • File on a subdomain only (www.site.com/robots.txt) while site.com has none
  • Content-Signal: AI-Train=No — case matters for values in some parsers; use lowercase
  • Disallow: without a path — that means "allow everything", but tools often read it as a typo
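
The relative-sitemap mistake is easy to lint for: `new URL` without a base throws on relative values. A sketch (the function name is assumed):

```typescript
// True only for absolute http(s) URLs, as the sitemap spec requires.
function isAbsoluteSitemapUrl(value: string): boolean {
  try {
    const u = new URL(value); // throws TypeError on relative URLs
    return u.protocol === 'http:' || u.protocol === 'https:';
  } catch {
    return false;
  }
}
```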

Test Fixtures Required

tests/fixtures/robots-txt/:

  • pass-full.json — exists, sitemap, GPTBot rule, Content-Signal
  • pass-content-signal-only.json — exists, sitemap, no bot rule but has Content-Signal
  • warn-no-ai.json — exists, sitemap, no AI signals
  • warn-no-sitemap.json — exists, AI rules, no sitemap
  • fail-404.json — 404
  • fail-html.json — 200 but content-type is text/html (the origin served the SPA)
  • fail-oversize.json — 200 but 100KB body (truncation)
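
The fixture schema isn't defined here; one plausible shape for `pass-full.json`, assuming each fixture captures a canned HTTP response plus the expected verdict:

```json
{
  "status": 200,
  "headers": { "content-type": "text/plain; charset=utf-8" },
  "body": "User-agent: GPTBot\nAllow: /\nContent-Signal: search=yes, ai-input=yes, ai-train=no\nSitemap: https://example.com/sitemap.xml\n",
  "expect": { "status": "pass", "score": 1.0 }
}
```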