robots.txt for AI: Which Bots to Allow and Why
Your robots.txt file was designed for search engine crawlers. But in 2026, a growing fleet of AI-specific bots is crawling the web — training models, powering AI search engines, and running autonomous agents. Each has different purposes, different behaviors, and different implications for your site.
This guide covers every major AI crawler, what it does, whether you should allow it, and how to configure your robots.txt for maximum AI visibility without sacrificing control.
The AI Bot Landscape
| Bot | Company | Purpose | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | Training data + ChatGPT browsing | Allow |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | Allow |
| ClaudeBot | Anthropic | Training data for Claude models | Allow |
| anthropic-ai | Anthropic | Claude's web browsing feature | Allow |
| GoogleOther | Google | AI/ML training (separate from search) | Allow |
| Google-Extended | Google | Gemini training data | Allow |
| PerplexityBot | Perplexity | AI search engine indexing | Allow |
| Applebot-Extended | Apple | Apple Intelligence training | Allow |
| Bytespider | ByteDance | TikTok/Douyin AI training | Optional |
| CCBot | Common Crawl | Open dataset used by many AI labs | Optional |
| cohere-ai | Cohere | Enterprise AI model training | Optional |
| Amazonbot | Amazon | Alexa AI answers | Allow |
| Meta-ExternalAgent | Meta | AI training for Llama models | Optional |
Why Allow AI Bots?
Blocking AI crawlers feels like protecting your content, but it comes with real costs:
- Lost AI search visibility — Perplexity, ChatGPT, and Google's AI Overviews cite sources. If they can't crawl you, they can't cite you. Your competitors get the traffic instead.
- Agent incompatibility — AI agents that browse on behalf of users check robots.txt. If you block them, users can't interact with your site through their preferred AI assistant.
- Model training benefits — When AI models are trained on your content, they learn about your brand, products, and expertise. This means they're more likely to recommend you in conversations.
- No real content protection — Blocking crawlers doesn't prevent content copying. Determined scrapers ignore robots.txt. Real protection comes from legal frameworks, not text files.
When to Block AI Bots
There are legitimate reasons to block specific AI crawlers:
- Paywalled content — If your business model relies on paid access, you may want to block training-only bots while allowing browsing bots (which respect access controls).
- Sensitive content — Legal, medical, or financial content that could be taken out of context when reproduced by AI.
- Bandwidth concerns — High-volume crawlers like CCBot and Bytespider can consume significant bandwidth. Rate limiting or blocking may be necessary for smaller sites.
- Licensing restrictions — Content licensed from third parties may prohibit AI training use.
Key distinction: Training bots (GPTBot, ClaudeBot, Google-Extended) collect data for model training. Browsing bots (ChatGPT-User, anthropic-ai) access pages in real-time for user queries. You can allow one while blocking the other.
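As a sketch of that split, a site that wants to stay reachable through AI assistants but opt out of model training could pair the two rule sets like this (bot tokens as listed in the table above; adjust to taste):

```
# Allow real-time browsing agents
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

The reverse split (allow training, block browsing) is rarely useful, since browsing bots are the ones that put your pages in front of users.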
The Recommended robots.txt Configuration
For most websites that want maximum AI visibility:
```
# Standard search engines
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

# AI Browsing Agents — allow real-time access
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# AI Training Crawlers — allow for model training
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Amazonbot
Allow: /

# Optional — allow or block based on your preference
# User-agent: Bytespider
# Disallow: /

# User-agent: CCBot
# Disallow: /

# User-agent: Meta-ExternalAgent
# Disallow: /

Sitemap: https://yoursite.com/sitemap.xml
```

For Paywalled or Premium Content Sites
If you have a mix of free and premium content:
```
# Allow browsing agents (they respect auth)
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# Allow training bots on free content only
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /members/

User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /members/

Sitemap: https://yoursite.com/sitemap.xml
```

Common Mistakes
- Blanket blocking with `User-agent: *` plus `Disallow: /` — this blocks ALL bots, including search engines. Never do this unless you genuinely want zero indexing.
- Forgetting the sitemap reference — AI crawlers use your sitemap to discover pages efficiently. Always include `Sitemap: https://yoursite.com/sitemap.xml`.
- Not testing your robots.txt — syntax errors can inadvertently block or allow the wrong bots. Use Google's robots.txt tester or AgentReady to verify.
- Using meta robots instead of robots.txt — `<meta name="robots" content="noai">` is not a standard directive. Some AI companies have proposed it, but adherence is inconsistent. Stick to robots.txt for reliable control.
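One quick way to test before deploying is Python's built-in `urllib.robotparser`, run against your rules locally. The rules and paths below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything open, except GPTBot
# is kept out of /premium/.
rules = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /premium/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot may fetch the blog but not premium pages.
print(parser.can_fetch("GPTBot", "/blog/post"))             # True
print(parser.can_fetch("GPTBot", "/premium/guide"))         # False
# Bots without their own group fall back to the * rules.
print(parser.can_fetch("PerplexityBot", "/premium/guide"))  # True
```

This catches structural mistakes (a `Disallow` that landed under the wrong `User-agent` group, for example) before a crawler ever sees them.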
How AgentReady Checks Your robots.txt
When you scan a site with AgentReady, the Crawlability & Robots category checks:
- Whether `robots.txt` exists and is accessible
- Whether major AI bots are explicitly allowed or blocked
- Whether a sitemap is referenced
- Whether any meta robots tags conflict with robots.txt rules
Sites that explicitly allow AI crawlers score higher than those that rely on default behavior, because explicit rules demonstrate intentional AI strategy.
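AgentReady's internal checks aren't published, but the explicit-allow/block part can be approximated with a short script. Everything here — the bot list, the `classify_bots` helper, and the status labels — is an illustrative sketch, not AgentReady's actual implementation:

```python
# Illustrative subset of the bot tokens from the table above.
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

def classify_bots(robots_txt: str) -> dict:
    """Classify each bot as 'allowed', 'blocked', 'partial', or 'default'.

    A deliberately simplified robots.txt reader: it tracks which
    user-agent group it is inside and records the last rule that names
    each bot explicitly. Real parsers handle many more edge cases.
    """
    status = {bot: "default" for bot in AI_BOTS}
    agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # rules ended the previous group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            for bot in AI_BOTS:
                if bot.lower() in agents:
                    if field == "allow":
                        status[bot] = "allowed"
                    else:
                        status[bot] = "blocked" if value == "/" else "partial"
    return status

sample = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Disallow: /
"""
print(classify_bots(sample))
```

Bots that never appear in your file come back as `"default"` — exactly the implicit behavior that explicit rules are meant to replace.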
Is your robots.txt AI-ready?
Scan your site to see which AI bots can access your content and which are blocked.
Scan Your Site Free