April 8, 2026

robots.txt for AI: Which Bots to Allow and Why

Your robots.txt file was designed for search engine crawlers. But in 2026, a growing fleet of AI-specific bots is crawling the web — training models, powering AI search engines, and running autonomous agents. Each has different purposes, different behaviors, and different implications for your site.

This guide covers every major AI crawler, what it does, whether you should allow it, and how to configure your robots.txt for maximum AI visibility without sacrificing control.

The AI Bot Landscape

| Bot | Company | Purpose | Recommendation |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data + ChatGPT browsing | Allow |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | Allow |
| ClaudeBot | Anthropic | Training data for Claude models | Allow |
| anthropic-ai | Anthropic | Claude's web browsing feature | Allow |
| GoogleOther | Google | AI/ML training (separate from search) | Allow |
| Google-Extended | Google | Gemini training data | Allow |
| PerplexityBot | Perplexity | AI search engine indexing | Allow |
| Applebot-Extended | Apple | Apple Intelligence training | Allow |
| Bytespider | ByteDance | TikTok/Douyin AI training | Optional |
| CCBot | Common Crawl | Open dataset used by many AI labs | Optional |
| cohere-ai | Cohere | Enterprise AI model training | Optional |
| Amazonbot | Amazon | Alexa AI answers | Allow |
| Meta-ExternalAgent | Meta | AI training for Llama models | Optional |

Why Allow AI Bots?

Blocking AI crawlers feels like protecting your content, but it comes with real costs: your pages can't be cited in AI search results, won't appear in chatbot answers, and gradually disappear from the channels where users increasingly discover content.

When to Block AI Bots

There are legitimate reasons to block specific AI crawlers: protecting paywalled or proprietary content, limiting server load from aggressive crawlers, or withholding content from model training while still permitting real-time browsing.

Key distinction: Training bots (GPTBot, ClaudeBot, Google-Extended) collect data for model training. Browsing bots (ChatGPT-User, anthropic-ai) access pages in real-time for user queries. You can allow one while blocking the other.
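You can verify this kind of selective policy before deploying it. The sketch below uses Python's standard-library `urllib.robotparser` to test a minimal ruleset that permits a browsing bot while blocking a training bot (the domain and path are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Allow the real-time browsing agent, block the training crawler.
robots_txt = """\
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Browsing bot may fetch the page; training bot may not.
print(rp.can_fetch("ChatGPT-User", "https://example.com/blog/post"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))        # False
```

Because `can_fetch` matches rules the same way compliant crawlers do, this is a cheap way to catch a rule that blocks more (or less) than you intended.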

The Recommended robots.txt Configuration

For most websites that want maximum AI visibility:

```
# Standard search engines
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

# AI Browsing Agents — allow real-time access
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# AI Training Crawlers — allow for model training
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Amazonbot
Allow: /

# Optional — allow or block based on your preference
# User-agent: Bytespider
# Disallow: /

# User-agent: CCBot
# Disallow: /

# User-agent: Meta-ExternalAgent
# Disallow: /

Sitemap: https://yoursite.com/sitemap.xml
```

For Paywalled or Premium Content Sites

If you have a mix of free and premium content:

```
# Allow browsing agents (they respect auth)
User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

# Allow training bots on free content only
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /members/

User-agent: ClaudeBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /members/

Sitemap: https://yoursite.com/sitemap.xml
```
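Path-scoped rules are easy to get subtly wrong, so it's worth testing them the same way a compliant crawler would read them. This sketch checks the training-bot rules above with Python's `urllib.robotparser` (paths and domain are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Training bot: free sections allowed, premium sections blocked.
rules = """\
User-agent: GPTBot
Allow: /blog/
Allow: /docs/
Disallow: /premium/
Disallow: /members/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

for path in ("/blog/post-1", "/docs/setup", "/premium/report"):
    print(path, rp.can_fetch("GPTBot", "https://yoursite.com" + path))
```

Running this should show the `/blog/` and `/docs/` URLs allowed and the `/premium/` URL blocked; if it doesn't, fix the rules before they ship.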

Common Mistakes

A few configurations routinely undermine AI visibility:

  1. A blanket `User-agent: *` with `Disallow: /`, which blocks search engines along with AI bots
  2. Relying on default behavior instead of naming AI crawlers explicitly
  3. Omitting the Sitemap directive, leaving crawlers without a map of your URLs
  4. Meta robots tags (such as noindex) that contradict your robots.txt rules

How AgentReady Checks Your robots.txt

When you scan a site with AgentReady, the Crawlability & Robots category checks:

  1. Whether robots.txt exists and is accessible
  2. Whether major AI bots are explicitly allowed or blocked
  3. Whether a sitemap is referenced
  4. Whether any meta robots tags conflict with robots.txt rules

Sites that explicitly allow AI crawlers score higher than those that rely on default behavior, because explicit rules demonstrate intentional AI strategy.

Is your robots.txt AI-ready?

Scan your site to see which AI bots can access your content and which are blocked.

Scan Your Site Free