Key Takeaways
- Robots.txt controls crawling, not indexing — disallowed pages can still be indexed without their content.
- A single misplaced slash in robots.txt can wipe a site from search results overnight.
- Disallowing CSS or JavaScript files prevents Google from rendering pages correctly and hurts rankings.
- Crawl budget is rarely a real issue for sites under 10,000 URLs.
- Always test changes with a robots.txt testing tool before deploying.
What Robots.txt Does and Does Not Do
Robots.txt is a plain text file at the root of your domain that tells crawlers which parts of your site they may or may not access. The format dates back to 1994 and is one of the oldest standards still in active use on the web.
The most important thing to understand: robots.txt controls crawling, not indexing. A page disallowed in robots.txt can still appear in search results if Google discovers it through external links. The page will be listed without a description, sometimes with the message "No information is available for this page". That is the opposite of what most site owners expect.
What It Is For
Robots.txt is for telling well-behaved crawlers where not to waste time. Admin pages, search result pages, internal endpoints, staging URLs, and infinite parameter combinations are all reasonable candidates for disallow rules. The goal is crawl efficiency, not content suppression.
What It Is Not For
Do not use robots.txt to hide private information. Disallow rules are publicly visible at yourdomain.com/robots.txt. Anyone curious about your hidden directories can read them in seconds. Use authentication or noindex for true protection.
Per Google's robots.txt documentation, the file is treated as a hint that compliant crawlers respect. Malicious bots ignore it entirely.
Syntax Basics and File Placement
The robots.txt syntax is simple but unforgiving. A typo can lock crawlers out or let them in where they should not go.
File Location
The file must live at the root of the domain: yourdomain.com/robots.txt. Subdirectories like yourdomain.com/blog/robots.txt are ignored. Each subdomain needs its own file: blog.yourdomain.com requires its own robots.txt separate from the main domain.
Basic Directives
The two core directives are User-agent (which bot the rule applies to) and Disallow (what they cannot crawl). Allow rules override broader disallows. The asterisk wildcard means all bots.
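A minimal sketch of those directives working together (the /private/ directory and file name are placeholders, not paths from any particular platform):

```
# Rules below apply to all crawlers
User-agent: *
# Keep crawlers out of the /private/ directory...
Disallow: /private/
# ...except for this one file, which the Allow rule re-opens
Allow: /private/annual-report.html
```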
Order and Specificity
Google applies the most specific rule that matches a URL. A specific Allow can override a broader Disallow. Other crawlers like Bing implement specificity slightly differently, so always test with multiple bots if you have international audiences.
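For example, a more specific Allow can carve an exception out of a broader Disallow: here /downloads/press/ stays crawlable even though the rest of /downloads/ is blocked (directory names are illustrative):

```
User-agent: *
Disallow: /downloads/
# The longer, more specific rule wins for Googlebot
Allow: /downloads/press/
```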
Comments
Lines starting with a hash symbol are comments. Use them generously. A robots.txt file with no comments is impossible to maintain six months later.
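A one-line comment recording why each rule exists is usually enough:

```
User-agent: *
# Internal search creates infinite low-value URLs, so keep crawlers out
Disallow: /search/
```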
For more on how crawling fits into the broader SEO picture, see our Google Search Console guide.
Common Patterns Every Site Needs
Most sites need only a handful of rules. These are the patterns that show up in nearly every well-configured robots.txt file.
Allow Everything (Default)
If you do not need to restrict crawlers at all, an empty Disallow tells all bots they can crawl everything. This is the safest default for small sites with no admin or parameter URLs to worry about.
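That default is only two lines; an empty Disallow value blocks nothing:

```
User-agent: *
Disallow:
```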
Block Admin Areas
WordPress installs typically disallow /wp-admin/ but allow /wp-admin/admin-ajax.php so AJAX-driven content can still load. Shopify, Magento, and most other platforms have their own equivalent admin paths.
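On WordPress, those rules typically look like this:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```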
Block Search Results
Internal search result pages create infinite low-value URLs. Disallow them with a pattern like /search/ or /?s= to prevent crawl waste.
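For a site with /search/ result pages and WordPress-style ?s= queries, a minimal sketch:

```
User-agent: *
# Internal search result pages
Disallow: /search/
# WordPress-style search queries such as /?s=term
Disallow: /?s=
```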
Block Tracking Parameters
If your site generates URLs with tracking parameters that you do not want crawled, disallow them with wildcard patterns. Be careful not to disallow useful parameters by accident.
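A hedged sketch, assuming utm_ campaign parameters and the fbclid click identifier are the ones polluting your crawl (swap in whatever your analytics setup actually appends):

```
User-agent: *
# Matches utm_source anywhere in the URL, not only as the first query parameter
Disallow: /*utm_source=
Disallow: /*fbclid=
```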
| Pattern | Purpose |
|---|---|
| /wp-admin/ | WordPress admin area |
| /search/ | Internal site search |
| /cart/ | E-commerce cart pages |
| /checkout/ | E-commerce checkout |
Always include a Sitemap directive at the bottom of your robots.txt pointing to your XML sitemap. See our XML sitemap best practices guide for details.
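Putting the common pieces together, a small WordPress site's file might read (the domain is a placeholder):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
```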
Robots.txt and Crawl Budget
Crawl budget is the number of pages Googlebot will crawl on your site within a given period. For most sites, it is not a constraint. For very large sites, it can be the difference between fresh and stale rankings.
Who Actually Has a Crawl Budget Problem
Per Google's own statements, crawl budget concerns generally apply to sites with more than 10,000 URLs that update frequently. If you run a 200-page brochure site, crawl budget is irrelevant. If you run a 2-million-product e-commerce site, it matters every day.
How Robots.txt Helps
By disallowing low-value URL patterns (filtered category pages, tracking parameter combinations, paginated archives beyond a certain depth), you free Googlebot to spend its crawl on URLs that actually matter. The savings can be measured in the Crawl Stats report inside Search Console.
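As a sketch, a catalogue plagued by faceted navigation might block its filter and sort parameters (the parameter names are illustrative; audit your own URLs before copying anything):

```
User-agent: *
# Faceted filters and sort orders create near-duplicate category URLs
Disallow: /*color=
Disallow: /*sort=
Disallow: /*price_range=
```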
How Robots.txt Hurts
Disallowing too aggressively prevents Google from understanding your site structure. Disallowing CSS or JavaScript files breaks rendering, which Google has explicitly said hurts rankings. The right balance is to block low-value URLs while leaving everything Google needs to render and understand pages fully accessible.
For sites hitting performance bottlenecks, our Core Web Vitals guide covers the rendering side of the equation.
Dangerous Mistakes That Tank Sites
A bad robots.txt file can deindex an entire website overnight. These are the mistakes that have actually happened to large sites we have audited.
Disallow All
A single line disallowing the root path blocks all crawling. This typically happens when a staging environment robots.txt accidentally ships to production. The fix is instant once spotted, but recovery in search results can take weeks.
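The damaging file is only two directives long, which is exactly why it slips through review:

```
# Typical staging configuration that must never reach production
User-agent: *
Disallow: /
```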
Blocking CSS or JavaScript
Older guidance said to block resource files for crawl efficiency. Modern Google needs CSS and JavaScript to render pages and assess mobile-friendliness. Blocking them now causes ranking drops that are hard to diagnose because the pages still appear indexed.
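An example of the outdated pattern to look for and remove (the asset paths are illustrative):

```
# Do not do this: Google needs CSS and JavaScript to render your pages
User-agent: *
Disallow: /assets/css/
Disallow: /assets/js/
Disallow: /*.js$
```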
Blocking the Sitemap URL
Some sites accidentally disallow /sitemap.xml or the directory containing it. Google can still discover the sitemap through Search Console submission, but other crawlers will not.
Blocking via Disallow Instead of Noindex
Pages you want removed from the index should use a noindex meta tag, not a robots.txt disallow. Disallow prevents Google from crawling the page to see the noindex, which means the page can stay indexed indefinitely.
Conflicting Allow and Disallow
Order matters less than specificity: when an Allow and a Disallow both match the same URL, the more specific rule wins. Test conflicting rules in a robots.txt tester before deploying. The bounce and engagement issues caused by aggressive blocking are catastrophic; Sentinel's Bounce Rate Bot helps surface affected pages quickly.
Testing and Validation
Never deploy robots.txt changes without testing them first. The cost of getting it wrong is too high.
Search Console Robots.txt Report
Google retired the legacy robots.txt tester that let you paste a file and test specific URLs against it for various Googlebot user agents. Its replacement, the robots.txt report inside Search Console, shows every robots.txt file Google has found for your properties, when each was last fetched, and any parsing warnings or errors. For URL-level testing before changes go live, run your rules through Google's open-source robots.txt parser or a third-party tester. Together, the report and a URL tester are the most reliable way to catch mistakes before they reach production.
Live Verification
After deploying, fetch yourdomain.com/robots.txt directly in a browser. Confirm the content matches what you expected. Then test a handful of URLs in the URL Inspection tool to confirm they crawl as intended.
Monitoring Over Time
Set up alerting for changes to your robots.txt file. A surprising number of incidents come from a deploy script accidentally overwriting it. Continuous monitoring catches the regression in minutes instead of weeks.
- Test every change with a robots.txt tester before deploying
- Verify the live file after every deploy
- Monitor for unexpected changes with a third-party uptime tool
- Keep a version history of robots.txt revisions
- Document every Disallow rule with a comment explaining why
For broader audit cadence, see our technical SEO audit checklist.
When to Use Other Tools Instead
Robots.txt is the wrong tool for many jobs SEOs try to use it for. Knowing the alternatives prevents serious mistakes.
Use Noindex for Removing Pages
If you want a page removed from the index, add a noindex meta tag and let Google crawl it to see the tag. Disallowing the page prevents the crawl, so the noindex never takes effect.
Use Authentication for Privacy
Robots.txt is publicly readable. Anything sensitive must be protected by login, IP allowlisting, or HTTP authentication, not by robots.txt rules.
Use Canonical for Duplicates
For consolidating signals across duplicate URLs, canonical tags are the right tool. See our canonical tags guide for the full pattern.
Use 410 or 404 for Permanent Removal
Pages you want gone forever should return a 410 Gone or 404 Not Found status code. This is the strongest signal you can send.
Use Search Console Removal Tool for Urgency
For immediate temporary removal, the Removals tool inside Search Console hides URLs for six months. Combine with noindex for permanent removal.
The pattern is clear: robots.txt is for crawl control, not the catch-all content management tool many treat it as.
Advanced Patterns and Wildcards
For larger sites, robots.txt supports wildcards and pattern matching that enable more precise control.
Wildcard Asterisk
The asterisk matches any sequence of characters. Useful for blocking parameter patterns or any path containing a specific segment. Be careful with broad wildcards — they can match more URLs than you intend.
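Two hedged examples of asterisk patterns (the print segment and session parameter are illustrative):

```
User-agent: *
# Any URL with a print-view segment nested under another directory
Disallow: /*/print/
# Any URL carrying a session ID anywhere in its query string
Disallow: /*sessionid=
```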
End-of-URL Dollar Sign
The dollar sign anchors a pattern to the end of the URL. This is the only way to block a specific file extension cleanly.
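For example, to block PDF files and nothing else:

```
User-agent: *
# Matches /guide.pdf but not /guide.pdf?download=1, because $ anchors the end of the URL
Disallow: /*.pdf$
```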
Crawl-Delay
The Crawl-delay directive asks crawlers to wait a set number of seconds between requests. Google ignores it entirely; Bing respects it. Search Console no longer offers a crawl rate setting, so if you need Googlebot to slow down, temporarily return 503 or 429 responses during overload or reduce the number of crawlable URLs.
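A sketch for slowing Bingbot to one request every ten seconds:

```
# Googlebot ignores Crawl-delay entirely; Bingbot honours it
User-agent: bingbot
Crawl-delay: 10
```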
Multiple User-Agent Blocks
You can declare different rules for different bots by stacking User-agent blocks. This is useful when you want to block aggressive AI scrapers while keeping search engines welcome. Per Search Engine Journal coverage, AI bot management has become a major use case for robots.txt in 2026.
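A hedged example that shuts out two well-known AI training crawlers while leaving everything open for other bots (check each vendor's documented user-agent string before relying on this):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# All other crawlers, including search engines, may crawl everything
User-agent: *
Disallow:
```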
| Pattern Element | Meaning |
|---|---|
| * | Any sequence of characters |
| $ | End of URL |
| /path/ | Specific path prefix |
Combine these primitives carefully and always test the resulting rules. For sites tracking the engagement impact of crawl changes, Sentinel's Dwell Time Bot provides the visibility you need.
Frequently Asked Questions
Does disallowing a page in robots.txt remove it from Google's index?
No. Robots.txt prevents crawling, but disallowed pages can still appear in search results if Google discovers them through external links.

Where does the robots.txt file need to live?
At the root of your domain, accessible at yourdomain.com/robots.txt. Each subdomain needs its own separate file.

Is it safe to block CSS and JavaScript files?
No. Modern Google needs CSS and JavaScript to render pages and assess mobile-friendliness. Blocking these resources hurts rankings.

What is the fastest way to remove a page from search results?
Use the Removals tool in Search Console for fast temporary removal, then add a noindex tag for permanent removal.

Can robots.txt block AI crawlers?
Yes, you can declare specific user-agent rules for AI bots like GPTBot, ClaudeBot, and others. Compliant scrapers will respect them.