2026-04-05 · Tags: robots.txt, AI crawlers, SEO

How to Configure robots.txt for AI Crawlers

A practical guide to configuring robots.txt for AI crawlers. Learn which bots to allow, which to block, and how to use Content-signal.

Why robots.txt matters for AI crawlers

A few years ago, robots.txt was mostly about search engines. Today it handles a much wider set of bots — AI assistants that answer user questions, training crawlers building foundation models, and commercial scrapers that monetize your content.

The stakes are different for each category. Blocking Googlebot hurts your SEO. Blocking ClaudeBot means your site won't show up when someone asks an AI assistant about your topic. Leaving GPTBot unconfigured means your content may be used to train models without any compensation.

Getting this right takes a few minutes, not hours.


Bot categories

Before writing rules, understand what you're configuring for:

AI assistants (should allow)

These bots crawl on behalf of end users who are asking questions right now. Allowing them increases your reach.

Bot            Company     Purpose
ClaudeBot      Anthropic   Powers Claude's web browsing
ChatGPT-User   OpenAI      Fetches pages on behalf of ChatGPT users
PerplexityBot  Perplexity  Powers Perplexity answers

AI training crawlers (should block or charge)

These bots collect content to train future models. Allowing them freely means your work contributes to a commercial product with no return.

Bot                 Company
GPTBot              OpenAI
Google-Extended     Google
CCBot               Common Crawl
Meta-ExternalAgent  Meta
Bytespider          ByteDance
anthropic-ai        Anthropic

Note: robots.txt is advisory, not enforced. Some crawlers have been reported to ignore it — Bytespider most often in published reports. Even with Disallow: /, a non-compliant bot may still crawl; consider blocking it at the server or network level if this is a concern.
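Because robots.txt is purely advisory, the enforcement layer lives in your server. A minimal sketch as WSGI middleware — the bot list and the 403 body are illustrative assumptions, not prescribed by any spec:

```python
# Server-side blocking for crawlers that ignore robots.txt.
# BLOCKED_AGENTS is an illustrative list; adjust to your own policy.
BLOCKED_AGENTS = ("Bytespider", "CCBot")

def block_crawlers(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(bot.lower() in ua for bot in BLOCKED_AGENTS):
            # Refuse the request before it reaches the application.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

The same idea translates directly to an nginx `if ($http_user_agent ...)` rule or a CDN firewall rule; user-agent matching is trivially spoofable, so IP-range filtering is the stricter variant.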

Search crawlers (should allow)

Googlebot, Bingbot, DuckDuckBot — these drive organic traffic. Block them only if you have a specific reason (e.g., staging environment).


Three recommended configurations

1. Full open — maximum discovery

Suitable for: open-source projects, personal blogs, documentation sites that want maximum exposure.

User-agent: *
Allow: /

This is the implicit default, but declaring it explicitly signals intent.


2. Balanced — assistants yes, training no

The most common choice for content sites. AI assistants can surface your content in answers; training crawlers are blocked.

# Search engines — full access
User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
Allow: /
Disallow: /api/
Disallow: /dashboard/

# AI assistants — allow for discovery
User-agent: ClaudeBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
Allow: /
Disallow: /api/
Disallow: /dashboard/

# AI training crawlers — block
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: Bytespider
User-agent: anthropic-ai
Disallow: /

# Default
User-agent: *
Allow: /
Disallow: /api/
Disallow: /dashboard/

3. Full block — maximum control

Suitable for: private tools, apps that require login, sites where discovery has no value.

User-agent: *
Disallow: /

Note that this also blocks search crawlers, so your pages will fall out of organic search (blocked URLs may still appear in results as bare links, since robots.txt controls crawling, not indexing). If you want to keep SEO but block AI, use the balanced config and add search bots explicitly.


Using Content-signal

Content-signal is an emerging directive (not yet part of the robots.txt standard) that allows you to declare your content licensing preferences in a machine-readable way. Some AI systems read it as a signal for how to treat your content.

Add it to your robots.txt alongside standard rules:

# Standard rules
User-agent: GPTBot
Disallow: /

# Content licensing signal
Content-signal: ai-train=no, search=yes

Supported values:

Key       Values    Meaning
ai-train  yes / no  Whether content may be used for training
search    yes / no  Whether content may be indexed for search
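Because Content-signal is not yet part of the standard, off-the-shelf robots.txt parsers ignore it. If you want to read it yourself, a minimal Python sketch (the function name is my own):

```python
def parse_content_signal(robots_txt: str) -> dict[str, str]:
    """Extract key=value pairs from the first Content-signal line, if any."""
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-signal":
            pairs = (p.partition("=") for p in value.split(","))
            return {k.strip().lower(): v.strip().lower() for k, _, v in pairs}
    return {}  # no Content-signal line present
```

For example, `parse_content_signal("Content-signal: ai-train=no, search=yes")` yields `{'ai-train': 'no', 'search': 'yes'}`.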

Tools like paylog.dev/score detect Content-signal and grant a score bonus for ai-train=no — it shows crawlers (and humans auditing your site) that you've made an intentional choice.


Common mistakes

Empty Disallow is not "block all"

User-agent: GPTBot
Disallow:

An empty Disallow: means block nothing — this is equivalent to full access. To block all paths, write:

User-agent: GPTBot
Disallow: /
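You can confirm the difference with Python's standard urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# Empty Disallow: nothing is blocked.
empty = RobotFileParser()
empty.parse(["User-agent: GPTBot", "Disallow:"])

# Disallow: / blocks every path.
full = RobotFileParser()
full.parse(["User-agent: GPTBot", "Disallow: /"])

print(empty.can_fetch("GPTBot", "https://example.com/post"))  # True
print(full.can_fetch("GPTBot", "https://example.com/post"))   # False
```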

Forgetting the wildcard block

If you configure specific bots without a User-agent: * fallback, any bot not listed will default to full access. Always include a catch-all rule.

Grouping user-agents incorrectly

Multiple User-agent: lines must appear consecutively before any Allow: or Disallow: lines in the block. Mixing them will cause parsers to split them into separate groups.

# Correct
User-agent: ClaudeBot
User-agent: ChatGPT-User
Allow: /

# Incorrect — ChatGPT-User starts a new block
User-agent: ClaudeBot
Allow: /
User-agent: ChatGPT-User
Allow: /

Verify your configuration

Use paylog.dev/score to audit your site. Paste your domain and get a per-bot breakdown showing whether each crawler is explicitly allowed, blocked, or unconfigured — plus a suggested robots.txt you can copy directly.
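For a local check, Python's standard urllib.robotparser can evaluate your rules per bot. A sketch against an abbreviated version of the balanced config above:

```python
from urllib.robotparser import RobotFileParser

# Abbreviated balanced policy: assistants allowed, training blocked.
ROBOTS = """\
User-agent: ClaudeBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

for bot in ("ClaudeBot", "GPTBot", "Googlebot"):
    # Googlebot has no explicit group, so it falls through to User-agent: *
    print(bot, rp.can_fetch(bot, "https://example.com/post"))
# ClaudeBot True, GPTBot False, Googlebot True
```

One caveat: urllib.robotparser applies rules in file order (first match wins) rather than the longest-match precedence of RFC 9309 that modern crawlers use, so its answers can differ from a real crawler's for groups that mix Allow and Disallow.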