Robots.txt Guide: Control Crawl Budget and Block AI Scrapers

Googlebot is crawling 10,000 pages on your site, but your analytics show that 95% of organic traffic goes to just 500 of them. The other 9,500 are admin panels, staging URLs, parameterized search results, and old redirects that shouldn't be indexed. Every wasted fetch is Google spending budget on noise instead of your actual content — and crawl budget is finite, especially for sites under 100K pageviews/month.

Here's how to use robots.txt to direct that budget where it matters.

The Basics: Syntax That Actually Works

The robots.txt file lives at yourdomain.com/robots.txt. It's a plain text file with a specific syntax:

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

User-agent: * targets all crawlers. Replace * with a specific bot name (e.g., Googlebot) to apply rules only to that crawler.

Disallow: /path/ tells crawlers not to fetch any URL starting with /path/. Matching is by prefix, so the trailing slash matters — Disallow: /admin blocks /admin, /admin/, and anything else beginning with /admin, including /administrator. Disallow: /admin/ blocks everything under the /admin/ directory but leaves /administrator alone.
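
The prefix behavior is easy to verify with Python's standard-library urllib.robotparser, which implements the classic prefix-matching rules (a quick sketch):

```python
import urllib.robotparser

# "Disallow: /admin" is a pure prefix match, so it also catches /administrator.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /admin"])
print(rp.can_fetch("*", "/administrator"))  # False

# "Disallow: /admin/" only matches URLs under the /admin/ directory.
rp2 = urllib.robotparser.RobotFileParser()
rp2.parse(["User-agent: *", "Disallow: /admin/"])
print(rp2.can_fetch("*", "/administrator"))  # True
```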

Allow: / is only necessary when you have a broad Disallow and need to carve out exceptions. If you just Disallow: /admin/, you don't need Allow: / — the root is already allowed by default.

Sitemap: tells crawlers where to find your XML sitemap. Google reads this and uses it to discover and prioritize pages. Always include it.
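
To sanity-check a file like the one above before deploying it, you can replay it through urllib.robotparser from Python's standard library (a sketch — yourdomain.com is a placeholder):

```python
import urllib.robotparser

ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Blocked directory vs. ordinary content URL:
print(rp.can_fetch("*", "https://yourdomain.com/admin/users"))  # False
print(rp.can_fetch("*", "https://yourdomain.com/blog/post"))    # True
```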

What to Block

A practical robots.txt for most web applications:

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Disallow: /preview/
Disallow: /wp-admin/
Disallow: /?s=
Disallow: /search?
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /login/
Disallow: /register/
Disallow: /thank-you/
Disallow: /404/

Sitemap: https://yourdomain.com/sitemap.xml

Search and filter URLs (?s=, /search?) generate infinite URL variations from a single crawl path — each query string parameter creates a new URL that Google may crawl and index as a near-duplicate page. Block them unless search result pages are valuable content for your site.

Transactional pages (cart, checkout, login, thank-you) have no organic search value. Blocking them saves crawl budget and keeps your index clean.

Parameter-heavy URLs — if your site generates URLs like /products?color=red&size=M&sort=price, those create thousands of near-duplicate pages. Either block the pattern with Disallow: /products? or use canonical tags on parameterized pages.
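
As a smoke test for the query-string rules, here's a sketch with urllib.robotparser. One caveat: robotparser is slightly stricter than Google with a bare trailing "?" (and it doesn't support Google's * and $ wildcards), so confirm Google-specific behavior with Google's own tools.

```python
import urllib.robotparser

ROBOTS = """\
User-agent: *
Disallow: /?s=
Disallow: /search?
Disallow: /products?
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Parameterized variants are all caught...
print(rp.can_fetch("*", "/?s=shoes"))                              # False
print(rp.can_fetch("*", "/products?color=red&size=M&sort=price"))  # False
# ...while ordinary content URLs stay crawlable.
print(rp.can_fetch("*", "/blog/choosing-a-size"))                  # True
```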

Blocking Specific Crawlers

Not all bots follow robots.txt, but well-behaved ones do. Specify rules for individual crawlers with named User-agent blocks:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

# Standard rules for everyone else
User-agent: *
Disallow: /admin/
Disallow: /api/
Sitemap: https://yourdomain.com/sitemap.xml

GPTBot is OpenAI's web crawler. Blocking it prevents your content from being used to train future GPT models. CCBot is Common Crawl's bot, whose datasets are widely used for AI training. Google-Extended controls whether your content is used to train Google's AI products (Bard/Gemini) separately from Google Search indexing.

Important: Blocking these AI crawlers does not affect Google Search indexing. Googlebot (the search crawler) and Google-Extended (the AI training crawler) are separate bots with separate User-agent strings.
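
You can confirm the split behaves as intended by replaying the file against different User-agent strings (a sketch with Python's urllib.robotparser):

```python
import urllib.robotparser

ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("GPTBot", "/blog/post"))     # False: AI crawler shut out
print(rp.can_fetch("Googlebot", "/blog/post"))  # True: search indexing unaffected
print(rp.can_fetch("Googlebot", "/admin/x"))    # False: generic rules still apply
```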

The Crawl-Delay Directive

User-agent: *
Crawl-delay: 10

This asks crawlers to wait 10 seconds between requests. Google ignores Crawl-delay for Googlebot — control Googlebot's crawl rate via Google Search Console instead. Bing and many other crawlers do honor it.

Use Crawl-delay when you want to reduce server load from aggressive third-party crawlers, not for Google. A sky-high crawl delay aimed at Googlebot won't slow it down — the directive is simply ignored.
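
Well-behaved crawlers parse the directive themselves and pause between fetches; Python's urllib.robotparser exposes it via crawl_delay (a sketch):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 10"])

delay = rp.crawl_delay("*")
print(delay)  # 10 — a polite crawler would time.sleep(delay) between requests
```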

What Robots.txt Doesn't Do

It doesn't make pages private. A page blocked by robots.txt can still appear in search results if other pages link to it — Google will show the URL but can't describe the content. Use noindex meta tags or password protection for truly private pages.

It doesn't apply retroactively. If a page was already indexed before you blocked it, Google will continue to show it in results until it next crawls it and sees the block. This can take weeks.

It doesn't prevent all bots. Malicious scrapers and badly-behaved bots ignore robots.txt entirely. Robots.txt is a standard for cooperating bots — compliance is voluntary.

Verify Your File Works

After deploying, check two things:

  1. Visit yourdomain.com/robots.txt directly — verify the file is served correctly and hasn't been cached with old content.
  2. Use the Robots.txt Tester to check specific URLs. Paste your robots.txt content and a URL you intend to block — confirm it shows as "blocked" before assuming your rules are working.

Google Search Console also has a robots.txt report under Settings, showing the version of the file Google last fetched and any parse errors. Check your top 10 most important pages to confirm they're not accidentally blocked.
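
That second check can be scripted into a deploy step. A minimal sketch (the helper name and page list are illustrative; in practice you'd fetch the robots.txt text from your live site first):

```python
import urllib.robotparser

def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the subset of `urls` that `robots_txt` blocks for `agent`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if not rp.can_fetch(agent, u)]

# robots_txt would normally come from https://yourdomain.com/robots.txt
robots_txt = "User-agent: *\nDisallow: /admin/\n"
top_pages = ["/", "/pricing", "/blog/best-post", "/admin/"]

surprises = blocked_urls(robots_txt, top_pages)
print(surprises)  # ['/admin/'] — anything else in this list is an accidental block
```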

Crawl Budget: Why It Matters and When It Doesn't

Crawl budget is most constrained for sites that Googlebot doesn't prioritize — roughly, sites under 500K indexed pages and under 100K monthly organic clicks get a limited crawl allocation, so wasted fetches genuinely cost you. At the other extreme, for small sites (under 500 pages) crawl budget is effectively unlimited — Googlebot will crawl everything. Robots.txt optimization for crawl budget matters most for:

  • E-commerce sites with thousands of product variant URLs (faceted navigation)
  • Sites with large pagination sequences
  • Sites that use URL parameters to track sessions or filter content
  • Legacy sites with thousands of old redirect chains or soft 404s

If your site has fewer than 2,000 pages and doesn't generate parameterized URL variants, crawl budget optimization won't move your rankings. Focus on content quality instead.

The Sitemap Directive

The Sitemap: directive in robots.txt is one of the most overlooked features. It tells all crawlers (not just Google) where your XML sitemap is:

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-news.xml

You can specify multiple sitemap entries. This is the primary method for notifying crawlers about your sitemap outside of Google Search Console — Bing, DuckDuckGo, and other search engines discover your sitemap through this directive.

Include the full absolute URL with protocol. Some sites accidentally use relative paths (Sitemap: /sitemap.xml), which some crawlers handle gracefully but others don't. Use the complete https:// URL.
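
Crawlers written in Python can read the directive with urllib.robotparser's site_maps() method (available since Python 3.8; a sketch):

```python
import urllib.robotparser

ROBOTS = """\
User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-news.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Both absolute sitemap URLs are picked up, in file order.
print(rp.site_maps())
# ['https://yourdomain.com/sitemap.xml', 'https://yourdomain.com/sitemap-news.xml']
```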

robots.txt vs. noindex Meta Tag

These two mechanisms do different things and are often confused:

Disallow in robots.txt: Prevents the crawler from fetching the page. The page may still appear in search results if other pages link to it (Google can infer a title from anchor text alone). The page's outbound links are not followed if the page is never fetched.

<meta name="robots" content="noindex">: Requires the crawler to fetch the page to read the tag. Prevents the page from appearing in search results, but doesn't prevent crawling. Outbound links on the page are followed and contribute to PageRank.

The correct tool for completely hiding a page from search results is the noindex meta tag, not a robots.txt Disallow — Disallow only prevents crawling and doesn't guarantee the page won't appear. And never combine the two on the same URL: if robots.txt blocks crawling, Google never fetches the page and so never sees the noindex tag.

Use the Robots.txt Generator to build the file from a UI, then review before deploying.

Robots.txt Generator

Build a robots.txt file visually — set allow/disallow rules, crawl delay, and sitemap URL.

Try this tool →