What it is
robots.txt is an informal protocol, formalised as RFC 9309, that tells crawlers what they may access. For agent readiness it plays three roles:
- It's typically the first file any agent fetches.
- It's where sitemaps are advertised (`Sitemap: …`).
- It's where new AI-specific directives live: per-bot rules and Content-Signals.
Why it matters
- 78% of top sites have a `robots.txt`, but most are written for search engines from a decade ago.
- Without explicit AI-bot stanzas, well-behaved LLM crawlers (GPTBot, ClaudeBot, Google-Extended, PerplexityBot, etc.) fall back to the general `User-agent: *` rules, which often disallow them by accident.
- Without a `Sitemap:` line, agents fall back to crawling: slower, noisier, and worse for both sides.
How we test it
| Step | Method | URL | Accept | Notes |
|---|---|---|---|---|
| 1 | GET | `https://<host>/robots.txt` | `*/*` | Follow up to 5 redirects (same-origin preferred) |
- Max body size: 64 KB (robots files should be tiny; truncation → warn).
- Timeout: 5 s.
- Parsed with a lenient parser (loosely following RFC 9309: group by `User-agent`; collect `Allow`, `Disallow`, `Crawl-delay`, `Sitemap`, `Content-Signal`).
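The lenient parse described above can be sketched roughly as follows. This is illustrative only, not the checker's actual code; `parseRobots` and the shapes are our names.

```typescript
// Minimal lenient robots.txt parser: groups rules by User-agent and collects
// Sitemap / Content-Signal lines globally, silently skipping unreadable lines.
type RobotsGroup = { userAgents: string[]; rules: { field: string; value: string }[] };

function parseRobots(body: string) {
  const groups: RobotsGroup[] = [];
  const sitemaps: string[] = [];
  const contentSignals: string[] = [];
  let current: RobotsGroup | null = null;

  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, '').trim(); // strip comments
    const m = line.match(/^([A-Za-z-]+)\s*:\s*(.*)$/);
    if (!m) continue; // lenient: ignore junk lines instead of failing
    const field = m[1].toLowerCase();
    const value = m[2].trim();

    if (field === 'sitemap') { sitemaps.push(value); continue; } // global per RFC 9309
    if (field === 'content-signal') { contentSignals.push(value); continue; } // simplification: collected globally
    if (field === 'user-agent') {
      // consecutive User-agent lines share a single group
      if (current && current.rules.length === 0) current.userAgents.push(value);
      else { current = { userAgents: [value], rules: [] }; groups.push(current); }
      continue;
    }
    if (current) current.rules.push({ field, value }); // allow / disallow / crawl-delay
  }
  return { groups, sitemaps, contentSignals };
}
```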
Pass/Warn/Fail Matrix
| Condition | Status | Score |
|---|---|---|
| File exists, points to ≥1 sitemap, addresses AI bots (rule or Content-Signal) | pass | 1.0 |
| File exists, points to ≥1 sitemap, no AI-specific rules | warn | 0.7 |
| File exists, no sitemap, has AI-specific rules | warn | 0.7 |
| File exists, no sitemap, no AI rules | warn | 0.4 |
| 4xx (no robots.txt) | fail | 0.0 |
| 5xx, timeout, or non-text response | fail | 0.0 |
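The matrix reduces to a small scoring function; a sketch under the assumption that fetch outcome and the two sub-test booleans are computed separately (names are ours):

```typescript
// Maps the pass/warn/fail matrix above to a status and score.
type FetchOutcome = 'ok' | 'client-error' | 'server-error-or-timeout';

function scoreRobots(outcome: FetchOutcome, hasSitemap: boolean, addressesAiBots: boolean) {
  if (outcome !== 'ok') return { status: 'fail' as const, score: 0.0 };
  if (hasSitemap && addressesAiBots) return { status: 'pass' as const, score: 1.0 };
  if (hasSitemap || addressesAiBots) return { status: 'warn' as const, score: 0.7 };
  return { status: 'warn' as const, score: 0.4 }; // file exists but offers agents nothing
}
```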
Sub-Tests for UI and Partial Credit
| id | Weight within check | Pass when |
|---|---|---|
| `exists` | 0.4 | Status 200, content-type is `text/plain*` (or none with a plain body), body non-empty |
| `has-sitemap` | 0.3 | At least one `Sitemap:` line with a valid absolute URL |
| `addresses-ai-bots` | 0.3 | At least one of: (a) a group targeting a known AI bot user-agent; (b) a `Content-Signal` directive |
Known AI bot user-agents (case-insensitive, substring match): GPTBot, ClaudeBot, Claude-Web, Claude-User, CCBot, PerplexityBot, Google-Extended, Applebot-Extended, anthropic-ai, cohere-ai, ChatGPT-User, ChatGPT, FacebookBot, Meta-ExternalAgent, Bytespider, YouBot, Diffbot, ImagesiftBot, Amazonbot, DuckAssistBot, aideBot.
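The case-insensitive substring match can be implemented directly; a sketch (`targetsAiBot` is our name):

```typescript
// Known AI crawler user-agent tokens, matched case-insensitively as substrings.
const AI_BOTS = [
  'GPTBot', 'ClaudeBot', 'Claude-Web', 'Claude-User', 'CCBot', 'PerplexityBot',
  'Google-Extended', 'Applebot-Extended', 'anthropic-ai', 'cohere-ai',
  'ChatGPT-User', 'ChatGPT', 'FacebookBot', 'Meta-ExternalAgent', 'Bytespider',
  'YouBot', 'Diffbot', 'ImagesiftBot', 'Amazonbot', 'DuckAssistBot', 'aideBot',
];

function targetsAiBot(userAgentValue: string): boolean {
  const ua = userAgentValue.toLowerCase();
  return AI_BOTS.some((bot) => ua.includes(bot.toLowerCase()));
}
```

Substring matching keeps the check robust against version suffixes (e.g. `GPTBot/1.0`), at the cost of occasional false positives.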
Remediation Prompt
I'm updating my site's /robots.txt so AI agents can discover content correctly. Please:
1. Create or update /robots.txt at the root of the site.
2. Add at least one Sitemap: directive pointing to my sitemap(s).
3. Add explicit stanzas for common AI bots with conservative, informed choices:
- GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, CCBot, ChatGPT-User
4. Add a Content-Signal directive under User-agent: * declaring:
- search=yes
- ai-input=yes (or 'no' if I don't want to be used as context for live inference)
- ai-train=no (or 'yes' if I'm comfortable being training data)
Use this template, keeping my existing rules for traditional crawlers intact:

```
User-agent: *
Allow: /
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://<my-site>/sitemap.xml
```
Serve the file with Content-Type: text/plain; charset=utf-8. Make sure it is reachable at /robots.txt (not redirected).
Implementation Examples
Next.js App Router (static)
public/robots.txt:

```
User-agent: *
Allow: /
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /

Sitemap: https://example.com/sitemap.xml
```
Next.js App Router (dynamic)
src/app/robots.ts:

```typescript
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      { userAgent: '*', allow: '/' },
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'ClaudeBot', allow: '/' },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  };
}
```

Note: Next's typed `Robots` object doesn't have a first-class `Content-Signal` field yet. Emit raw text via a route handler if you need it:

```typescript
// src/app/robots.txt/route.ts
export function GET() {
  const body = `User-agent: *\nAllow: /\nContent-Signal: search=yes, ai-input=yes, ai-train=no\n\nSitemap: https://example.com/sitemap.xml\n`;
  return new Response(body, { headers: { 'content-type': 'text/plain; charset=utf-8' } });
}
```

Express
```javascript
app.get('/robots.txt', (_, res) => {
  res.type('text/plain').send(`User-agent: *\nAllow: /\nSitemap: https://example.com/sitemap.xml\n`);
});
```

Cloudflare Workers
```javascript
// Inside the Worker's fetch handler, where `url` is new URL(request.url)
// and `txt` holds the robots.txt body:
if (url.pathname === '/robots.txt') {
  return new Response(txt, { headers: { 'content-type': 'text/plain; charset=utf-8' } });
}
```

nginx (static)
```nginx
location = /robots.txt {
    root /var/www/public;
    # .txt already maps to text/plain via mime.types; `charset` appends
    # "; charset=utf-8" (add_header would emit a duplicate Content-Type).
    charset utf-8;
}
```
Common Mistakes
- Serving `robots.txt` with `content-type: text/html` (a common CDN default)
- A catch-all `User-agent: *` / `Disallow: /` that accidentally blocks every AI bot
- A relative sitemap URL (`Sitemap: /sitemap.xml`); the spec requires an absolute URL
- File on a subdomain only (`www.site.com/robots.txt`) while `site.com` has none
- `Content-Signal: AI-Train=No`; case matters for values in some parsers, so use lowercase
- `Disallow:` without a path, which means "allow everything" but tools often read it as a typo
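Several of these mistakes are mechanically detectable. A rough sketch (`lintRobots` is our name, not part of the checker):

```typescript
// Flags a few of the mechanically detectable mistakes listed above.
function lintRobots(body: string, contentType: string): string[] {
  const warnings: string[] = [];
  if (!/^text\/plain\b/i.test(contentType)) {
    warnings.push(`unexpected content-type: ${contentType}`);
  }
  for (const line of body.split(/\r?\n/).map((l) => l.trim())) {
    if (/^sitemap\s*:\s*\//i.test(line)) {
      warnings.push(`relative sitemap URL: ${line}`); // spec requires absolute
    }
    if (/^disallow\s*:\s*$/i.test(line)) {
      warnings.push('empty Disallow (allows everything; possibly a typo)');
    }
    if (/^content-signal\s*:/i.test(line) && /[A-Z]/.test(line.slice(line.indexOf(':') + 1))) {
      warnings.push('Content-Signal values should be lowercase');
    }
  }
  return warnings;
}
```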
Test Fixtures Required
tests/fixtures/robots-txt/:
- `pass-full.json`: exists, sitemap, GPTBot rule, Content-Signal
- `pass-content-signal-only.json`: exists, sitemap, no bot rule but has Content-Signal
- `warn-no-ai.json`: exists, sitemap, no AI signals
- `warn-no-sitemap.json`: exists, AI rules, no sitemap
- `fail-404.json`: 404
- `fail-html.json`: 200 but content-type is text/html (the origin served the SPA)
- `fail-oversize.json`: 200 but 100 KB body (truncation)
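The fixture format isn't specified in this section; one plausible shape, purely as an assumption, is a canned HTTP response plus the expected verdict:

```typescript
// Hypothetical fixture shape (an assumption, not the project's actual schema):
// a recorded HTTP exchange plus the verdict the checker should produce.
interface RobotsFixture {
  response: { status: number; contentType: string; body: string };
  expected: { status: 'pass' | 'warn' | 'fail'; score: number };
}

const fail404: RobotsFixture = {
  response: { status: 404, contentType: 'text/html', body: 'Not Found' },
  expected: { status: 'fail', score: 0.0 },
};
```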