robots.txt for AI: Which Bots to Allow and Why
Your robots.txt file was designed for search engine crawlers. But in 2026, a growing fleet of AI-specific bots is crawling the web — training models, powering AI search engines, and running autonomous agents. Each has different purposes, different behaviors, and different implications for your site.
This guide covers every major AI crawler, what it does, whether you should allow it, and how to configure your robots.txt for maximum AI visibility without sacrificing control.
The AI Bot Landscape
| Bot | Company | Purpose | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | Training data + ChatGPT browsing | Allow |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | Allow |
| ClaudeBot | Anthropic | Training data for Claude models | Allow |
| anthropic-ai | Anthropic | Claude's web browsing feature | Allow |
| GoogleOther | Google | AI/ML training (separate from search) | Allow |
| Google-Extended | Google | Gemini training data | Allow |
| PerplexityBot | Perplexity | AI search engine indexing | Allow |
| Applebot-Extended | Apple | Apple Intelligence training | Allow |
| Bytespider | ByteDance | TikTok/Douyin AI training | Optional |
| CCBot | Common Crawl | Open dataset used by many AI labs | Optional |
| cohere-ai | Cohere | Enterprise AI model training | Optional |
| Amazonbot | Amazon | Alexa AI answers | Allow |
| Meta-ExternalAgent | Meta | AI training for Llama models | Optional |
Why Allow AI Bots?
Blocking AI crawlers feels like protecting your content, but it comes with real costs:
- Lost AI search visibility — Perplexity, ChatGPT, and Google's AI Overviews cite sources. If they can't crawl you, they can't cite you. Your competitors get the traffic instead.
- Agent incompatibility — AI agents that browse on behalf of users check robots.txt. If you block them, users can't interact with your site through their preferred AI assistant.
- Model training benefits — When AI models are trained on your content, they learn about your brand, products, and expertise. This means they're more likely to recommend you in conversations.
- No real content protection — Blocking crawlers doesn't prevent content copying. Determined scrapers ignore robots.txt. Real protection comes from legal frameworks, not text files.
When to Block AI Bots
There are legitimate reasons to block specific AI crawlers:
- Paywalled content — If your business model relies on paid access, you may want to block training-only bots while allowing browsing bots (which respect access controls).
- Sensitive content — Legal, medical, or financial content that could be taken out of context when reproduced by AI.
- Bandwidth concerns — High-volume crawlers like CCBot and Bytespider can consume significant bandwidth. Rate limiting or blocking may be necessary for smaller sites.
- Licensing restrictions — Content licensed from third parties may prohibit AI training use.
Key distinction: Training bots (GPTBot, ClaudeBot, Google-Extended) collect data for model training. Browsing bots (ChatGPT-User, anthropic-ai) access pages in real-time for user queries. You can allow one while blocking the other.
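As a sketch of that split, a site that wants to stay reachable through AI assistants but opt out of model training could pair the two rule sets like this (bot tokens as listed in the table above; adjust to taste):

```
# Allow real-time browsing agents
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# Block training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

The reverse split (allow training, block browsing) is rarely useful, since browsing bots are the ones that put your pages in front of users.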
The Recommended robots.txt Configuration
For most websites that want maximum AI visibility:
```
# Standard search engines
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

# AI Browsing Agents — allow real-time access
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# AI Training Crawlers — allow for model training
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Amazonbot
Allow: /

# Optional — allow or block based on your preference
# User-agent: Bytespider
# Disallow: /

# User-agent: CCBot
# Disallow: /

# User-agent: Meta-ExternalAgent
# Disallow: /

Sitemap: https://yoursite.com/sitemap.xml
```

For Paywalled or Premium Content Sites
If you have a mix of free and premium content:
```
# Allow browsing agents (they respect auth)
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# Allow training bots on free content only
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /members/

User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /members/

Sitemap: https://yoursite.com/sitemap.xml
```

Common Mistakes
- Blanket blocking with `User-agent: *` plus `Disallow: /` — this blocks ALL bots, including search engines. Never do this unless you genuinely want zero indexing.
- Forgetting the sitemap reference — AI crawlers use your sitemap to discover pages efficiently. Always include `Sitemap: https://yoursite.com/sitemap.xml`.
- Not testing your robots.txt — syntax errors can inadvertently block or allow the wrong bots. Use Google's robots.txt tester or AgentReady to verify.
- Using meta robots instead of robots.txt — `<meta name="robots" content="noai">` is not a standard directive. Some AI companies have proposed it, but adherence is inconsistent. Stick to robots.txt for reliable control.
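One quick way to test before deploying is Python's built-in `urllib.robotparser`, run against your rules locally. The rules and paths below are illustrative:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything open, except GPTBot
# is kept out of /premium/.
rules = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /premium/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# GPTBot may fetch the blog but not premium pages.
print(parser.can_fetch("GPTBot", "/blog/post"))             # True
print(parser.can_fetch("GPTBot", "/premium/guide"))         # False
# Bots without their own group fall back to the * rules.
print(parser.can_fetch("PerplexityBot", "/premium/guide"))  # True
```

This catches structural mistakes (a `Disallow` that landed under the wrong `User-agent` group, for example) before a crawler ever sees them.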
How AgentReady Checks Your robots.txt
When you scan a site with AgentReady, the Crawlability & Robots category checks:
- Whether `robots.txt` exists and is accessible
- Whether major AI bots are explicitly allowed or blocked
- Whether a sitemap is referenced
- Whether any meta robots tags conflict with robots.txt rules
Sites that explicitly allow AI crawlers score higher than those that rely on default behavior, because explicit rules demonstrate intentional AI strategy.
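AgentReady's internal checks aren't published, but the explicit-allow/block part can be approximated with a short script. Everything here — the bot list, the `classify_bots` helper, and the status labels — is an illustrative sketch, not AgentReady's actual implementation:

```python
# Illustrative subset of the bot tokens from the table above.
AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

def classify_bots(robots_txt: str) -> dict:
    """Classify each bot as 'allowed', 'blocked', 'partial', or 'default'.

    A deliberately simplified robots.txt reader: it tracks which
    user-agent group it is inside and records the last rule that names
    each bot explicitly. Real parsers handle many more edge cases.
    """
    status = {bot: "default" for bot in AI_BOTS}
    agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # rules ended the previous group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            for bot in AI_BOTS:
                if bot.lower() in agents:
                    if field == "allow":
                        status[bot] = "allowed"
                    else:
                        status[bot] = "blocked" if value == "/" else "partial"
    return status

sample = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Disallow: /
"""
print(classify_bots(sample))
```

Bots that never appear in your file come back as `"default"` — exactly the implicit behavior that explicit rules are meant to replace.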
Is your robots.txt AI-ready?
Scan your site to see which AI bots can access your content and which are blocked.
Scan Your Site Free