AI Bot Rules in robots.txt

Does `robots.txt` have at least one stanza explicitly addressing a known AI bot (allow or disallow), showing the owner has made a considered decision?

What it is

Traditional robots.txt rules were written for search crawlers (Googlebot, Bingbot, etc.). AI agents introduced a new class of user-agents whose behaviour (training, grounding, live assistance) is qualitatively different from search. This check asks: has the site owner made any explicit choice about AI bots? The answer being "yes, I blocked all of them" is fine; the answer being "no, I never thought about it" is not.

Why it matters

  • A default `User-agent: *` rule often allows everything unintentionally.
  • AI vendors publish their bot user-agent names precisely so owners can opt in or out.
  • Sites with no AI-specific stanzas are effectively consenting to whatever vendors decide to do.

How we test it

Re-use the `robots.txt` body fetched by check 01 rather than fetching it again, then parse it into user-agent groups and look for a group that names a recognised AI bot.
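A minimal sketch of that parsing step, assuming nothing about the scanner's real internals (`parseGroups` and the shapes below are illustrative, not the actual API):

```typescript
// Hypothetical sketch of the group-parsing step; names are illustrative.
interface Group {
  agents: string[]; // user-agent tokens this group targets
  rules: { directive: "allow" | "disallow"; path: string }[];
}

function parseGroups(body: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const m = line.match(/^(user-agent|allow|disallow)\s*:\s*(.*)$/i);
    if (m === null) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group (per RFC 9309)
      if (!lastWasAgent || current === null) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value);
      lastWasAgent = true;
    } else if (current !== null) {
      current.rules.push({ directive: field as "allow" | "disallow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}
```

For example, a body containing a `GPTBot` stanza followed by a `*` stanza parses into two groups, each with its own rule list.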

Pass / Warn / Fail Matrix

| Condition | Status | Score |
| --- | --- | --- |
| ≥1 group targets a recognised AI bot UA with a meaningful `Allow:` or `Disallow:` | pass | 1.0 |
| Site blanket-blocks all bots (`User-agent: *` + `Disallow: /`), which counts as explicit | pass | 1.0 |
| Only generic `*` rules, no AI-bot-specific stanza | fail | 0.0 |
| No `robots.txt` at all | fail | 0.0 |

Sub Tests

| id | Weight | Pass when |
| --- | --- | --- |
| `has-ai-bot-stanza` | 1.0 | ≥1 stanza for a known AI bot (list in check 01) |
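The matrix and sub-test could reduce to a function along these lines; this is a sketch, so `scoreAiBotRules`, the shapes, and the abbreviated bot list are all assumptions rather than the scanner's real code:

```typescript
// Hypothetical sketch of the pass/fail matrix; names are illustrative.
type Rule = { directive: "allow" | "disallow"; path: string };
type Group = { agents: string[]; rules: Rule[] };

// Excerpt of the curated bot list (the full list lives in ai-bots.ts).
const AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"];

function scoreAiBotRules(groups: Group[] | null): { status: "pass" | "fail"; score: number } {
  if (groups === null) return { status: "fail", score: 0 }; // no robots.txt at all
  const known = AI_BOTS.map((b) => b.toLowerCase());
  for (const g of groups) {
    // >=1 group naming a recognised AI bot with a meaningful rule: pass
    if (g.agents.some((a) => known.includes(a.toLowerCase())) && g.rules.length > 0) {
      return { status: "pass", score: 1.0 };
    }
    // Blanket block (User-agent: * / Disallow: /) counts as an explicit decision
    if (g.agents.includes("*") && g.rules.some((r) => r.directive === "disallow" && r.path === "/")) {
      return { status: "pass", score: 1.0 };
    }
  }
  return { status: "fail", score: 0 }; // only generic rules, or none at all
}
```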

Recognised AI Bots (curated list, maintained in code)

aideBot
Amazonbot
anthropic-ai
Applebot-Extended
Bytespider
CCBot
ChatGPT-User
ChatGPT
ClaudeBot
Claude-Web
Claude-User
cohere-ai
DuckAssistBot
Diffbot
FacebookBot
Google-Extended
GPTBot
ImagesiftBot
Meta-ExternalAgent
OAI-SearchBot
PerplexityBot
Timpibot
YouBot

Maintain the list in `src/scanner/checks/data/ai-bots.ts`. Source: darkvisitors.com. Update quarterly.
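A plausible shape for that data module, built from the list above; the export names and helper are assumptions, not the scanner's actual API:

```typescript
// src/scanner/checks/data/ai-bots.ts (sketch; export names are assumptions)
// Source: darkvisitors.com. Review quarterly.
const AI_BOT_USER_AGENTS = [
  "aideBot", "Amazonbot", "anthropic-ai", "Applebot-Extended",
  "Bytespider", "CCBot", "ChatGPT-User", "ChatGPT",
  "ClaudeBot", "Claude-Web", "Claude-User", "cohere-ai",
  "DuckAssistBot", "Diffbot", "FacebookBot", "Google-Extended",
  "GPTBot", "ImagesiftBot", "Meta-ExternalAgent", "OAI-SearchBot",
  "PerplexityBot", "Timpibot", "YouBot",
] as const;

// User-agent tokens are matched case-insensitively, as in RFC 9309.
function isKnownAiBot(token: string): boolean {
  const t = token.toLowerCase();
  return AI_BOT_USER_AGENTS.some((bot) => bot.toLowerCase() === t);
}

export { AI_BOT_USER_AGENTS, isKnownAiBot };
```

Keeping the list as a single exported constant makes the quarterly update a one-file change.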

Remediation Prompt

Please update my /robots.txt to include explicit rules for AI bots. I want the file to make a considered statement about what AI training / grounding / assistance is allowed on my content.

Add at least these stanzas (keeping my existing rules for traditional crawlers intact):

    User-agent: GPTBot
    Allow: /           # or Disallow: / if I want to block OpenAI

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Allow: /

    User-agent: Applebot-Extended
    Allow: /

    User-agent: CCBot
    Disallow: /        # CCBot is Common Crawl's crawler; its archives are widely used for LLM training, so opt-out is common

Choose Allow or Disallow per bot according to my policy. Prefer explicit choices over blanket rules.
Also consider adding a Content-Signal directive (see check 07 — content-signals) for finer-grained control.

Implementation Examples

Same file as check 01; this check asks for specific stanzas inside.

Common Mistakes

  • Relying on a generic `User-agent: *` group alone: it does not demonstrate a considered AI decision, and this check only treats a blanket `Disallow: /` as explicit.
  • Conflicting rules such as `Allow: /` plus `Disallow: /private`: under RFC 9309, precedence goes to the rule with the most specific (longest) matching path, not to whichever rule comes first in the file.
  • Typos in UA names (`GTPBot`, `Cloudbot`, `PerplexityAI`): none of these match a real bot token.
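RFC 9309 resolves Allow/Disallow conflicts by the longest matching path, with Allow winning ties. That can be sketched as a tiny matcher (names illustrative; wildcard `*` and end-anchor `$` handling omitted for brevity):

```typescript
// Sketch of RFC 9309 rule precedence: the most specific (longest) matching
// path wins; on a tie, Allow wins. File order is irrelevant.
type Rule = { directive: "allow" | "disallow"; path: string };

function isAllowed(rules: Rule[], urlPath: string): boolean {
  let best: Rule | undefined;
  for (const r of rules) {
    if (r.path === "" || !urlPath.startsWith(r.path)) continue;
    if (
      best === undefined ||
      r.path.length > best.path.length ||
      (r.path.length === best.path.length && r.directive === "allow")
    ) {
      best = r;
    }
  }
  return best === undefined || best.directive === "allow"; // no match: allowed
}
```

With `Allow: /` and `Disallow: /private`, the URL `/private/notes.html` is blocked in either file order, because `/private` is the longer match.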

Test Fixtures

  • pass-gptbot-allow.txt
  • pass-blanket-block.txt
  • fail-no-ai-stanza.txt
  • fail-typo-ua.txt (the stanza targets `GTPBot`, a typo)