AI Bot Rules in robots.txt

Does `robots.txt` have at least one stanza explicitly addressing a known AI bot (allow or disallow), showing the owner has made a considered decision?

What it is

Traditional robots.txt rules were written for search crawlers (Googlebot, Bingbot, etc.). AI agents introduced a new class of user-agents whose behaviour (training, grounding, live assistance) is qualitatively different from search. This check asks: has the site owner made any explicit choice about AI bots? The answer being "yes, I blocked all of them" is fine; the answer being "no, I never thought about it" is not.

Why it matters

  • A default `User-agent: *` rule often allows everything unintentionally.
  • AI vendors publish their bot user-agent names precisely so owners can opt in or out.
  • Sites with no AI-specific stanzas are effectively consenting to whatever vendors decide to do.

How we test it

Re-use the `robots.txt` body fetched by check 01 rather than fetching it again, then parse it into user-agent groups and look for a group that names a recognised AI bot.
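A minimal sketch of that parsing step, assuming nothing about the scanner's real internals (`parseGroups` and the shapes below are illustrative, not the actual API):

```typescript
// Hypothetical sketch of the group-parsing step; names are illustrative.
interface Group {
  agents: string[]; // user-agent tokens this group targets
  rules: { directive: "allow" | "disallow"; path: string }[];
}

function parseGroups(body: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const raw of body.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    const m = line.match(/^(user-agent|allow|disallow)\s*:\s*(.*)$/i);
    if (m === null) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group (per RFC 9309)
      if (!lastWasAgent || current === null) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value);
      lastWasAgent = true;
    } else if (current !== null) {
      current.rules.push({ directive: field as "allow" | "disallow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}
```

For example, a body containing a `GPTBot` stanza followed by a `*` stanza parses into two groups, each with its own rule list.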

Pass / Warn / Fail Matrix

| Condition | Status | Score |
| --- | --- | --- |
| ≥1 group targets a recognised AI bot UA with a meaningful `Allow:` or `Disallow:` | pass | 1.0 |
| Site blanket-blocks all bots (`User-agent: *` + `Disallow: /`), which counts as explicit | pass | 1.0 |
| Only generic `*` rules, no AI-bot-specific stanza | fail | 0.0 |
| No `robots.txt` at all | fail | 0.0 |

Sub Tests

| id | Weight | Pass when |
| --- | --- | --- |
| `has-ai-bot-stanza` | 1.0 | ≥1 stanza for a known AI bot (list in check 01) |
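The matrix and sub-test could reduce to a function along these lines; this is a sketch, so `scoreAiBotRules`, the shapes, and the abbreviated bot list are all assumptions rather than the scanner's real code:

```typescript
// Hypothetical sketch of the pass/fail matrix; names are illustrative.
type Rule = { directive: "allow" | "disallow"; path: string };
type Group = { agents: string[]; rules: Rule[] };

// Excerpt of the curated bot list (the full list lives in ai-bots.ts).
const AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"];

function scoreAiBotRules(groups: Group[] | null): { status: "pass" | "fail"; score: number } {
  if (groups === null) return { status: "fail", score: 0 }; // no robots.txt at all
  const known = AI_BOTS.map((b) => b.toLowerCase());
  for (const g of groups) {
    // >=1 group naming a recognised AI bot with a meaningful rule: pass
    if (g.agents.some((a) => known.includes(a.toLowerCase())) && g.rules.length > 0) {
      return { status: "pass", score: 1.0 };
    }
    // Blanket block (User-agent: * / Disallow: /) counts as an explicit decision
    if (g.agents.includes("*") && g.rules.some((r) => r.directive === "disallow" && r.path === "/")) {
      return { status: "pass", score: 1.0 };
    }
  }
  return { status: "fail", score: 0 }; // only generic rules, or none at all
}
```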

Recognised AI Bots (curated list, maintained in code)

aideBot
Amazonbot
anthropic-ai
Applebot-Extended
Bytespider
CCBot
ChatGPT-User
ChatGPT
ClaudeBot
Claude-Web
Claude-User
cohere-ai
DuckAssistBot
Diffbot
FacebookBot
Google-Extended
GPTBot
ImagesiftBot
Meta-ExternalAgent
OAI-SearchBot
PerplexityBot
Timpibot
YouBot

Maintain the list in `src/scanner/checks/data/ai-bots.ts`. Source: darkvisitors.com. Update quarterly.
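A plausible shape for that data module, built from the list above; the export names and helper are assumptions, not the scanner's actual API:

```typescript
// src/scanner/checks/data/ai-bots.ts (sketch; export names are assumptions)
// Source: darkvisitors.com. Review quarterly.
const AI_BOT_USER_AGENTS = [
  "aideBot", "Amazonbot", "anthropic-ai", "Applebot-Extended",
  "Bytespider", "CCBot", "ChatGPT-User", "ChatGPT",
  "ClaudeBot", "Claude-Web", "Claude-User", "cohere-ai",
  "DuckAssistBot", "Diffbot", "FacebookBot", "Google-Extended",
  "GPTBot", "ImagesiftBot", "Meta-ExternalAgent", "OAI-SearchBot",
  "PerplexityBot", "Timpibot", "YouBot",
] as const;

// User-agent tokens are matched case-insensitively, as in RFC 9309.
function isKnownAiBot(token: string): boolean {
  const t = token.toLowerCase();
  return AI_BOT_USER_AGENTS.some((bot) => bot.toLowerCase() === t);
}

export { AI_BOT_USER_AGENTS, isKnownAiBot };
```

Keeping the list as a single exported constant makes the quarterly update a one-file change.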

Remediation Prompt

Please update my /robots.txt to include explicit rules for AI bots. I want the file to make a considered statement about what AI training / grounding / assistance is allowed on my content.

Add at least these stanzas (keeping my existing rules for traditional crawlers intact):

    User-agent: GPTBot
    Allow: /           # or Disallow: / if I want to block OpenAI

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Allow: /

    User-agent: Applebot-Extended
    Allow: /

    User-agent: CCBot
    Disallow: /        # CCBot is Common Crawl's crawler; its archives are widely used for LLM training, so opt-out is common

Choose Allow or Disallow per bot according to my policy. Prefer explicit choices over blanket rules.
Also consider adding a Content-Signal directive (see check 07 — content-signals) for finer-grained control.

Implementation Examples

Same file as check 01; this check asks for specific stanzas inside.

Common Mistakes

  • Relying on a generic `User-agent: *` group alone: it does not demonstrate a considered AI decision, and this check only treats a blanket `Disallow: /` as explicit.
  • Conflicting rules such as `Allow: /` plus `Disallow: /private`: under RFC 9309, precedence goes to the rule with the most specific (longest) matching path, not to whichever rule comes first in the file.
  • Typos in UA names (`GTPBot`, `Cloudbot`, `PerplexityAI`): none of these match a real bot token.
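RFC 9309 resolves Allow/Disallow conflicts by the longest matching path, with Allow winning ties. That can be sketched as a tiny matcher (names illustrative; wildcard `*` and end-anchor `$` handling omitted for brevity):

```typescript
// Sketch of RFC 9309 rule precedence: the most specific (longest) matching
// path wins; on a tie, Allow wins. File order is irrelevant.
type Rule = { directive: "allow" | "disallow"; path: string };

function isAllowed(rules: Rule[], urlPath: string): boolean {
  let best: Rule | undefined;
  for (const r of rules) {
    if (r.path === "" || !urlPath.startsWith(r.path)) continue;
    if (
      best === undefined ||
      r.path.length > best.path.length ||
      (r.path.length === best.path.length && r.directive === "allow")
    ) {
      best = r;
    }
  }
  return best === undefined || best.directive === "allow"; // no match: allowed
}
```

With `Allow: /` and `Disallow: /private`, the URL `/private/notes.html` is blocked in either file order, because `/private` is the longer match.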

Test Fixtures

  • pass-gptbot-allow.txt
  • pass-blanket-block.txt
  • fail-no-ai-stanza.txt
  • fail-typo-ua.txt (the stanza targets `GTPBot`, a typo)