PublicSoftTools
Tools · 16 min read · PublicSoftTools Team · May 2026

robots.txt Generator — Build Your robots.txt File Fast

The free robots.txt Generator lets you build a complete robots.txt file in your browser — add user-agent rules, specify disallowed and allowed paths, and include your sitemap URL. Copy the output and deploy it to your site's root in seconds. No signup required.

What Is a robots.txt File?

A robots.txt file is a plain-text document placed at the root of your website that tells web crawlers which pages or directories they may or may not access. It follows the Robots Exclusion Protocol (REP), originally proposed by Martijn Koster in 1994 and standardized by the IETF as RFC 9309 in 2022. Compliance is voluntary: major search engines and reputable bots respect the file, but any program can ignore it.

When a crawler visits your site, its first request is typically for https://www.yoursite.com/robots.txt. If the file exists, the crawler reads the rules before fetching any other page. If the file does not exist, the crawler assumes it has permission to access everything.
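As a rough illustration of that decision, the sketch below maps the HTTP status of the robots.txt request to a crawl policy. The function name and the policy labels are made up for this example. The 5xx branch is not covered in the text above; it follows RFC 9309, which asks crawlers to assume a complete disallow while the file is unreachable.

```python
def crawl_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a crawl policy (hypothetical labels)."""
    if 200 <= status_code < 300:
        return "parse-rules"    # the file exists: read it and obey its rules
    if 400 <= status_code < 500:
        return "allow-all"      # no usable file (e.g. 404): nothing is restricted
    if 500 <= status_code < 600:
        return "disallow-all"   # server error: RFC 9309 says to assume everything is blocked
    return "redirect-or-retry"  # 3xx and anything else is left to the caller


print(crawl_policy(404))
```

A real crawler would also cache the result and follow redirects; both are left out to keep the sketch short.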

How to Use the robots.txt Generator

  1. Open the robots.txt Generator
  2. Enter your sitemap URL (e.g. https://www.yoursite.com/sitemap.xml)
  3. Click Add Rule to create a user-agent block. Choose * for all bots or a specific bot name
  4. Enter paths to disallow or allow. Use the quick-path buttons to insert /admin/, /api/, etc.
  5. Click Copy, then paste the output into a file named robots.txt at your site root
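For readers who prefer to script the same steps, here is a minimal sketch of how such a file could be assembled in Python. It is not the generator's actual code; build_group and build_robots_txt are hypothetical helpers, and the paths are placeholders.

```python
from typing import Iterable, Optional


def build_group(user_agent: str,
                disallow: Iterable[str] = (),
                allow: Iterable[str] = ()) -> str:
    """Render one user-agent group as text."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallow]
    lines += [f"Allow: {path}" for path in allow]
    if len(lines) == 1:
        lines.append("Disallow:")  # no rules given: an empty Disallow permits everything
    return "\n".join(lines)


def build_robots_txt(groups: Iterable[str], sitemap: Optional[str] = None) -> str:
    """Join groups with blank lines and append an optional Sitemap line."""
    parts = list(groups)
    if sitemap:
        parts.append(f"Sitemap: {sitemap}")
    return "\n\n".join(parts) + "\n"


print(build_robots_txt(
    [build_group("*", disallow=["/admin/"], allow=["/admin/help"]),
     build_group("GPTBot", disallow=["/"])],
    sitemap="https://www.yoursite.com/sitemap.xml",
))
```

Save the printed text as robots.txt at the site root, exactly as in step 5.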

robots.txt Syntax

A robots.txt file consists of one or more groups. Each group starts with one or more User-agent: lines followed by Disallow: and Allow: directives. Some crawlers also honor a Crawl-delay: directive, but it is not part of RFC 9309 and Googlebot ignores it.

# Option A: allow all crawlers everywhere (effectively empty rules)
User-agent: *
Disallow:

# Option B: block specific admin paths from all crawlers
# (pick A or B; crawlers merge repeated groups for the same user-agent)
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /private/public-page

# Block AI training bots completely
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

Sitemap: https://www.yoursite.com/sitemap.xml
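To make the group structure concrete, the following sketch parses robots.txt text into per-agent rule lists. It is a simplified reading of the format, not a complete RFC 9309 parser, and parse_robots_txt is a hypothetical name. It shows two details that are easy to miss: consecutive User-agent: lines share one group, and Sitemap: lines belong to no group.

```python
from collections import defaultdict


def parse_robots_txt(text: str):
    """Return (rules, sitemaps); rules maps a lower-cased user-agent to (directive, path) pairs."""
    rules = defaultdict(list)
    sitemaps = []
    current_agents = []  # user-agents named by the group being read
    seen_rule = False    # becomes True once the current group has a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if ":" not in line:
            continue                         # blank or unrecognized line
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:                    # a User-agent line after rules starts a new group
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
        elif key in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                rules[agent].append((key, value))
        elif key == "sitemap":
            sitemaps.append(value)
    return dict(rules), sitemaps


sample = """User-agent: *
Disallow: /admin/

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

Sitemap: https://www.yoursite.com/sitemap.xml
"""
rules, sitemaps = parse_robots_txt(sample)
print(rules)
print(sitemaps)
```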

Known Web Crawlers

User-agent         | Crawler           | Company      | Purpose
Googlebot          | Google web crawl  | Google       | Search index
Bingbot            | Bing web crawl    | Microsoft    | Search index
DuckDuckBot        | DDG web crawl     | DuckDuckGo   | Search index
Yandex             | Yandex crawl      | Yandex       | Search index
Applebot           | Apple search      | Apple        | Spotlight, Siri
GPTBot             | OpenAI crawler    | OpenAI       | AI training data
ClaudeBot          | Anthropic crawler | Anthropic    | AI training data
CCBot              | Common Crawl      | Common Crawl | Open web dataset
Meta-ExternalAgent | Meta crawler      | Meta         | AI training data
AhrefsBot          | Ahrefs crawler    | Ahrefs       | SEO backlink index

Disallow vs Allow — Priority Rules

When Disallow and Allow rules both match the same URL, the most specific rule wins, where specificity means the longest matching path. If two matching rules are equally long, Allow wins; RFC 9309 recommends this tie-break and Google implements it.

User-agent: Googlebot
Disallow: /docs/
Allow: /docs/public/

# Result:
# /docs/         → blocked
# /docs/private/ → blocked (covered by /docs/)
# /docs/public/  → allowed (more specific rule wins)

An empty Disallow: line means “allow everything” for that user-agent — it is equivalent to no restrictions at all. This is the correct way to explicitly permit all access.
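The precedence logic is short enough to sketch in Python. This version assumes plain path prefixes without wildcards, and is_allowed is a hypothetical helper rather than any crawler's real code.

```python
def is_allowed(rules, path):
    """rules: list of ("allow" | "disallow", prefix). Longest matching prefix wins; Allow wins ties."""
    best_len, verdict = -1, True  # with no matching rule the URL is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue              # an empty "Disallow:" restricts nothing
        allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, verdict = len(prefix), allow
    return verdict


rules = [("disallow", "/docs/"), ("allow", "/docs/public/")]
for path in ("/docs/", "/docs/private/", "/docs/public/"):
    print(path, is_allowed(rules, path))
```

Because the comparison uses prefix length rather than position in the file, the order in which rules are written does not change the outcome.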

Wildcards in Path Rules

Disallow: /*.json$        # Block all URLs ending in .json
Disallow: /search?*       # Block /search URLs that carry a query string
Disallow: /api/v*         # Block all versioned API endpoints
Allow: /api/v2/public/    # But allow this specific public API path

* matches any sequence of characters within the path. $ anchors the pattern to the end of the URL. These are the only two wildcard characters in the Robots Exclusion Protocol.
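One way to reason about these two characters is to translate a rule into a regular expression. The sketch below does that under the assumption that rules match from the start of the path; compile_rule and rule_matches are hypothetical names.

```python
import re


def compile_rule(pattern: str):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]  # strip the end-of-URL anchor before escaping
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    """True if the pattern matches the path, starting from its first character."""
    return compile_rule(pattern).match(path) is not None


print(rule_matches("/*.json$", "/data/feed.json"))      # True
print(rule_matches("/*.json$", "/data/feed.json?x=1"))  # False
print(rule_matches("/api/v*", "/api/v2/users"))         # True
```

Since rules are prefix matches, a pattern such as /api/v matches the same URLs as /api/v*; the trailing * only makes the intent explicit.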

Crawl Budget and When It Matters

Crawl budget is the number of URLs Googlebot will crawl on your site within a given time window. For most sites under 10,000 pages, crawl budget is not a concern — Google will crawl all your pages. For large sites (100,000+ pages), crawl budget directly affects how quickly new or updated content appears in search results.

Block low-value URLs, such as internal search results and endlessly filterable listing pages, to preserve crawl budget for content that matters.

Blocking AI Training Bots

Several AI companies crawl the web to build training datasets. The operators of the bots listed below state that their crawlers honor robots.txt. To block them without affecting search engine crawling:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

Each named user-agent block applies only to that crawler. Googlebot and Bingbot are unaffected because they are not listed in these blocks. The generator includes all major AI bots in its user-agent selector.
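The selection step can be sketched as follows, assuming groups have already been parsed into a dictionary keyed by user-agent token. select_rules is a hypothetical helper, and the matching is a simplified exact, case-insensitive comparison.

```python
def select_rules(groups, crawler_token):
    """Pick the rule list for a crawler: its own group if present, else "*", else no rules."""
    by_token = {token.lower(): rules for token, rules in groups.items()}
    return by_token.get(crawler_token.lower(), by_token.get("*", []))


groups = {
    "*": [("disallow", "/admin/")],
    "GPTBot": [("disallow", "/")],
}
print(select_rules(groups, "gptbot"))     # the GPTBot group
print(select_rules(groups, "Googlebot"))  # falls back to the * group
```

Note that a crawler with its own group uses only that group: rules under User-agent: * are not added to it, so repeat any shared Disallow lines inside the named group if they should still apply.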

robots.txt vs Meta Robots vs X-Robots-Tag

Method                                 | Scope                        | Can block crawling?                  | Can block indexing?
robots.txt                             | Per-path, site-wide          | Yes                                  | No (URL may still appear in results)
<meta name="robots" content="noindex"> | Per-page (in HTML)           | No (page is crawled to read the tag) | Yes
X-Robots-Tag: noindex (HTTP header)    | Per-resource (any file type) | No                                   | Yes (works for PDFs, images, etc.)
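For the header-based method, a small WSGI middleware is one possible way to attach X-Robots-Tag to non-HTML files. This is a sketch under assumptions: the suffix list, add_noindex_header, and demo_app are all invented for the example, and most sites would set the header in their web server or CDN configuration instead.

```python
NOINDEX_SUFFIXES = (".pdf", ".docx")  # hypothetical choice of file types to keep out of the index


def add_noindex_header(app):
    """WSGI middleware: append X-Robots-Tag: noindex to responses for matching paths."""
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            if path.lower().endswith(NOINDEX_SUFFIXES):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)
    return wrapped


def demo_app(environ, start_response):
    """Stand-in application that pretends to serve a PDF."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF"]


application = add_noindex_header(demo_app)
```

The wrapped application can be served by any WSGI server. The file must stay crawlable, because a crawler that is blocked by robots.txt never receives the header.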

Common Mistakes to Avoid

Blocking resources Google needs to render pages

If you block /assets/ or /static/, Googlebot may not be able to load your JavaScript or CSS files. It will see a broken version of your page and may rank it lower. Only block paths you genuinely want invisible to crawlers — verify regularly with Google Search Console's URL Inspection tool.

Thinking robots.txt provides security

robots.txt is a courtesy document, not a security mechanism. Any bot can ignore it. Protect sensitive content with authentication, firewalls, or server-side access control. Disallowing /secret-data/ keeps well-behaved crawlers away, but does nothing to stop a determined attacker or an ill-behaved scraper.

Using robots.txt to de-index pages

Disallowing a URL prevents crawling but does not remove it from Google's index. Google can still show the URL in results if other pages link to it, even if it cannot crawl the page content. To de-index a page, leave it crawlable and add a noindex meta tag to the HTML or send the X-Robots-Tag: noindex HTTP response header; a crawler that is blocked by robots.txt never sees either signal. For immediate but temporary removal, use Google Search Console's URL removal tool.

Wrong file location

robots.txt must be at exactly https://www.example.com/robots.txt. Placing it at any sub-path (/blog/robots.txt, /en/robots.txt) is invalid and will be ignored by all crawlers. For subdomain sites, each subdomain needs its own robots.txt: https://shop.example.com/robots.txt is separate from the main domain.
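The location rule can be expressed as a small helper that derives the only valid robots.txt URL for any page. It is a minimal sketch using Python's standard urllib.parse; robots_txt_url is a hypothetical name.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme, host and port)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_txt_url("https://www.example.com/blog/post?id=7"))
print(robots_txt_url("https://shop.example.com/en/cart"))
```

The path and query string are discarded, which is why a file at /blog/robots.txt is never consulted, while each subdomain resolves to its own file.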

Build Your robots.txt Now

User-agent rules, path directives, sitemap URL, AI bot blocking — all in your browser. No signup.
