Key Takeaways
- Robots.txt controls crawling, not indexing — disallowed pages can still be indexed without their content.
- A single misplaced slash in robots.txt can wipe a site from search results overnight.
- Disallowing CSS or JavaScript files prevents Google from rendering pages correctly and hurts rankings.
- Crawl budget is rarely a real issue for sites under 10,000 URLs.
- Always test changes with a robots.txt testing tool before deploying.
What Robots.txt Does and Does Not Do
Robots.txt is a plain text file at the root of your domain that tells crawlers which parts of your site they may or may not access. The format dates back to 1994 and is one of the oldest standards still in active use on the web.
The most important thing to understand: robots.txt controls crawling, not indexing. A page disallowed in robots.txt can still appear in search results if Google discovers it through external links. The page will be listed without a description, sometimes with the message "No information is available for this page". That is the opposite of what most site owners expect.
What It Is For
Robots.txt is for telling well-behaved crawlers where not to waste time. Admin pages, search result pages, internal endpoints, staging URLs, and infinite parameter combinations are all reasonable candidates for disallow rules. The goal is crawl efficiency, not content suppression.
What It Is Not For
Do not use robots.txt to hide private information. Disallow rules are publicly visible at yourdomain.com/robots.txt. Anyone curious about your hidden directories can read them in seconds. Use authentication or noindex for true protection.
Per Google's robots.txt documentation, the file is treated as a hint that compliant crawlers respect. Malicious bots ignore it entirely.
Syntax Basics and File Placement
The robots.txt syntax is simple but unforgiving. A typo can lock crawlers out or let them in where they should not go.
File Location
The file must live at the root of the domain: yourdomain.com/robots.txt. Subdirectories like yourdomain.com/blog/robots.txt are ignored. Each subdomain needs its own file: blog.yourdomain.com requires its own robots.txt separate from the main domain.
Basic Directives
The two core directives are User-agent (which bot the rule applies to) and Disallow (what they cannot crawl). Allow rules override broader disallows. The asterisk wildcard means all bots.
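A minimal sketch of those directives working together (the /private/ directory and file name are placeholders, not paths from any particular platform):

```
# Rules below apply to all crawlers
User-agent: *
# Keep crawlers out of the /private/ directory...
Disallow: /private/
# ...except for this one file, which the Allow rule re-opens
Allow: /private/annual-report.html
```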
Order and Specificity
Google applies the most specific rule that matches a URL. A specific Allow can override a broader Disallow. Other crawlers like Bing implement specificity slightly differently, so always test with multiple bots if you have international audiences.
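For example, a more specific Allow can carve an exception out of a broader Disallow: here /downloads/press/ stays crawlable even though the rest of /downloads/ is blocked (directory names are illustrative):

```
User-agent: *
Disallow: /downloads/
# The longer, more specific rule wins for Googlebot
Allow: /downloads/press/
```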
Comments
Lines starting with a hash symbol are comments. Use them generously. A robots.txt file with no comments is impossible to maintain six months later.
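A one-line comment recording why each rule exists is usually enough:

```
User-agent: *
# Internal search creates infinite low-value URLs, so keep crawlers out
Disallow: /search/
```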
For more on how crawling fits into the broader SEO picture, see our Google Search Console guide.
Common Patterns Every Site Needs
Most sites need only a handful of rules. These are the patterns that show up in nearly every well-configured robots.txt file.
Allow Everything (Default)
If you do not need to restrict crawlers at all, an empty Disallow tells all bots they can crawl everything. This is the safest default for small sites with no admin or parameter URLs to worry about.
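That default is only two lines; an empty Disallow value blocks nothing:

```
User-agent: *
Disallow:
```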
Block Admin Areas
WordPress installs typically disallow /wp-admin/ but allow /wp-admin/admin-ajax.php so AJAX-driven content can still load. Shopify, Magento, and most other platforms have their own equivalent admin paths.
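On WordPress, those rules typically look like this:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```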
Block Search Results
Internal search result pages create infinite low-value URLs. Disallow them with a pattern like /search/ or /?s= to prevent crawl waste.
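For a site with /search/ result pages and WordPress-style ?s= queries, a minimal sketch:

```
User-agent: *
# Internal search result pages
Disallow: /search/
# WordPress-style search queries such as /?s=term
Disallow: /?s=
```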
Block Tracking Parameters
If your site generates URLs with tracking parameters that you do not want crawled, disallow them with wildcard patterns. Be careful not to disallow useful parameters by accident.
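A hedged sketch, assuming utm_ campaign parameters and the fbclid click identifier are the ones polluting your crawl (swap in whatever your analytics setup actually appends):

```
User-agent: *
# Matches utm_source anywhere in the URL, not only as the first query parameter
Disallow: /*utm_source=
Disallow: /*fbclid=
```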
| Pattern | Purpose |
|---|---|
| /wp-admin/ | WordPress admin area |
| /search/ | Internal site search |
| /cart/ | E-commerce cart pages |
| /checkout/ | E-commerce checkout |
Always include a Sitemap directive at the bottom of your robots.txt pointing to your XML sitemap. See our XML sitemap best practices guide for details.
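Putting the common pieces together, a small WordPress site's file might read (the domain is a placeholder):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
```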
Robots.txt and Crawl Budget
Crawl budget is the number of pages Googlebot will crawl on your site within a given period. For most sites, it is not a constraint. For very large sites, it can be the difference between fresh and stale rankings.
Who Actually Has a Crawl Budget Problem
Per Google's own statements, crawl budget concerns generally apply to sites with more than 10,000 URLs that update frequently. If you run a 200-page brochure site, crawl budget is irrelevant. If you run a 2-million-product e-commerce site, it matters every day.
How Robots.txt Helps
By disallowing low-value URL patterns (filtered category pages, tracking parameter combinations, paginated archives beyond a certain depth), you free Googlebot to spend its crawl on URLs that actually matter. The savings can be measured in the Crawl Stats report inside Search Console.
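As a sketch, a catalogue plagued by faceted navigation might block its filter and sort parameters (the parameter names are illustrative; audit your own URLs before copying anything):

```
User-agent: *
# Faceted filters and sort orders create near-duplicate category URLs
Disallow: /*color=
Disallow: /*sort=
Disallow: /*price_range=
```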
How Robots.txt Hurts
Disallowing too aggressively prevents Google from understanding your site structure. Disallowing CSS or JavaScript files breaks rendering, which Google has explicitly said hurts rankings. The right balance is to block low-value URLs while leaving everything Google needs to render and understand pages fully accessible.
For sites hitting performance bottlenecks, our Core Web Vitals guide covers the rendering side of the equation.
Dangerous Mistakes That Tank Sites
A bad robots.txt file can deindex an entire website overnight. These are the mistakes that have actually happened to large sites we have audited.
Disallow All
A single line disallowing the root path blocks all crawling. This typically happens when a staging environment robots.txt accidentally ships to production. The fix is instant once spotted, but recovery in search results can take weeks.
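The damaging file is only two directives long, which is exactly why it slips through review:

```
# Typical staging configuration that must never reach production
User-agent: *
Disallow: /
```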
Blocking CSS or JavaScript
Older guidance said to block resource files for crawl efficiency. Modern Google needs CSS and JavaScript to render pages and assess mobile-friendliness. Blocking them now causes ranking drops that are hard to diagnose because the pages still appear indexed.
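An example of the outdated pattern to look for and remove (the asset paths are illustrative):

```
# Do not do this: Google needs CSS and JavaScript to render your pages
User-agent: *
Disallow: /assets/css/
Disallow: /assets/js/
Disallow: /*.js$
```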
Blocking the Sitemap URL
Some sites accidentally disallow /sitemap.xml or the directory containing it. Google can still discover the sitemap through Search Console submission, but other crawlers will not.
Blocking via Disallow Instead of Noindex
Pages you want removed from the index should use a noindex meta tag, not a robots.txt disallow. Disallow prevents Google from crawling the page to see the noindex, which means the page can stay indexed indefinitely.
Conflicting Allow and Disallow
Order matters less than specificity: when an Allow and a Disallow both match the same URL, the more specific rule wins. Test conflicting rules in a robots.txt tester before deploying. The bounce and engagement issues caused by aggressive blocking are catastrophic; Sentinel's Bounce Rate Bot helps surface affected pages quickly.
Testing and Validation
Never deploy robots.txt changes without testing them first. The cost of getting it wrong is too high.
Search Console Robots.txt Report
Google retired the legacy robots.txt tester that let you paste a file and test specific URLs against it for various Googlebot user agents. Its replacement, the robots.txt report inside Search Console, shows every robots.txt file Google has found for your properties, when each was last fetched, and any parsing warnings or errors. For URL-level testing before changes go live, run your rules through Google's open-source robots.txt parser or a third-party tester. Together, the report and a URL tester are the most reliable way to catch mistakes before they reach production.
Live Verification
After deploying, fetch yourdomain.com/robots.txt directly in a browser. Confirm the content matches what you expected. Then test a handful of URLs in the URL Inspection tool to confirm they crawl as intended.
Monitoring Over Time
Set up alerting for changes to your robots.txt file. A surprising number of incidents come from a deploy script accidentally overwriting it. Continuous monitoring catches the regression in minutes instead of weeks.
- Test every change with a robots.txt tester before deploying
- Verify the live file after every deploy
- Monitor for unexpected changes with a third-party uptime tool
- Keep a version history of robots.txt revisions
- Document every Disallow rule with a comment explaining why
For broader audit cadence, see our technical SEO audit checklist.
When to Use Other Tools Instead
Robots.txt is the wrong tool for many jobs SEOs try to use it for. Knowing the alternatives prevents serious mistakes.
Use Noindex for Removing Pages
If you want a page removed from the index, add a noindex meta tag and let Google crawl it to see the tag. Disallowing the page prevents the crawl, so the noindex never takes effect.
Use Authentication for Privacy
Robots.txt is publicly readable. Anything sensitive must be protected by login, IP allowlisting, or HTTP authentication, not by robots.txt rules.
Use Canonical for Duplicates
For consolidating signals across duplicate URLs, canonical tags are the right tool. See our canonical tags guide for the full pattern.
Use 410 or 404 for Permanent Removal
Pages you want gone forever should return a 410 Gone or 404 Not Found status code. This is the strongest signal you can send.
Use Search Console Removal Tool for Urgency
For immediate temporary removal, the Removals tool inside Search Console hides URLs for six months. Combine with noindex for permanent removal.
The pattern is clear: robots.txt is for crawl control, not the catch-all content management tool many treat it as.
Advanced Patterns and Wildcards
For larger sites, robots.txt supports wildcards and pattern matching that enable more precise control.
Wildcard Asterisk
The asterisk matches any sequence of characters. Useful for blocking parameter patterns or any path containing a specific segment. Be careful with broad wildcards — they can match more URLs than you intend.
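Two hedged examples of asterisk patterns (the print segment and session parameter are illustrative):

```
User-agent: *
# Any URL with a print-view segment nested under another directory
Disallow: /*/print/
# Any URL carrying a session ID anywhere in its query string
Disallow: /*sessionid=
```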
End-of-URL Dollar Sign
The dollar sign anchors a pattern to the end of the URL. This is the only way to block a specific file extension cleanly.
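For example, to block PDF files and nothing else:

```
User-agent: *
# Matches /guide.pdf but not /guide.pdf?download=1, because $ anchors the end of the URL
Disallow: /*.pdf$
```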
Crawl-Delay
The Crawl-delay directive asks crawlers to wait a set number of seconds between requests. Google ignores it entirely; Bing respects it. Search Console no longer offers a crawl rate setting, so if you need Googlebot to slow down, temporarily return 503 or 429 responses during overload or reduce the number of crawlable URLs.
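A sketch for slowing Bingbot to one request every ten seconds:

```
# Googlebot ignores Crawl-delay entirely; Bingbot honours it
User-agent: bingbot
Crawl-delay: 10
```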
Multiple User-Agent Blocks
You can declare different rules for different bots by stacking User-agent blocks. This is useful when you want to block aggressive AI scrapers while keeping search engines welcome. Per Search Engine Journal coverage, AI bot management has become a major use case for robots.txt in 2026.
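A hedged example that shuts out two well-known AI training crawlers while leaving everything open for other bots (check each vendor's documented user-agent string before relying on this):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# All other crawlers, including search engines, may crawl everything
User-agent: *
Disallow:
```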
| Pattern Element | Meaning |
|---|---|
| * | Any sequence of characters |
| $ | End of URL |
| /path/ | Specific path prefix |
Combine these primitives carefully and always test the resulting rules. For sites tracking the engagement impact of crawl changes, Sentinel's Dwell Time Bot provides the visibility you need.
Frequently Asked Questions
Does disallowing a page in robots.txt remove it from Google's index?
No. Robots.txt prevents crawling, but disallowed pages can still appear in search results if Google discovers them through external links.

Where does the robots.txt file need to live?
At the root of your domain, accessible at yourdomain.com/robots.txt. Each subdomain needs its own separate file.

Is it safe to block CSS and JavaScript files?
No. Modern Google needs CSS and JavaScript to render pages and assess mobile-friendliness. Blocking these resources hurts rankings.

What is the fastest way to remove a page from search results?
Use the Removals tool in Search Console for fast temporary removal, then add a noindex tag for permanent removal.

Can robots.txt block AI crawlers?
Yes, you can declare specific user-agent rules for AI bots like GPTBot, ClaudeBot, and others. Compliant scrapers will respect them.