
robots.txt Generator — Build Your robots.txt File Fast

The free robots.txt Generator lets you build a complete robots.txt file in your browser — add user-agent rules, specify disallowed and allowed paths, and include your sitemap URL. Copy the output and deploy it to your site's root in seconds. No signup required.

What Is a robots.txt File?

A robots.txt file is a plain-text document placed at the root of your website that tells web crawlers which pages or directories they may or may not access. It follows the Robots Exclusion Protocol, a convention formalized in 2022 as RFC 9309 that all major search engines — Google, Bing, Yandex, and others — respect voluntarily.

When a crawler visits your site, its first request is typically for https://www.yoursite.com/robots.txt. If the file exists, the crawler reads the rules before fetching any other page. If the file does not exist, the crawler assumes it has permission to access everything.
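
If you want to test how a rule set treats a specific URL before deploying it, Python's standard-library urllib.robotparser applies the basic User-agent/Disallow/Allow matching that a well-behaved crawler uses (it does not implement Google's * and $ wildcard extensions). A minimal sketch, using the placeholder domain from above:

import urllib.robotparser

# Placeholder domain; point this at your own site's robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.yoursite.com/robots.txt")
rp.read()

# True if the named user-agent may fetch the URL under the parsed rules.
print(rp.can_fetch("Googlebot", "https://www.yoursite.com/admin/login"))
print(rp.can_fetch("*", "https://www.yoursite.com/blog/post"))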

How to Use the robots.txt Generator

  1. Open the robots.txt Generator tool.
  2. Enter your sitemap URL in the sitemap field (e.g. https://www.yoursite.com/sitemap.xml).
  3. Click Add Rule to create a user-agent block. Choose * for all bots or a specific bot name.
  4. Enter the paths to disallow or allow. Use the quick-path buttons to insert common paths like /admin/ or /api/.
  5. Click Copy and paste the output into a file named robots.txt at your site's root.

robots.txt Syntax Explained

A robots.txt file is made up of one or more groups. Each group starts with one or more User-agent: lines followed by Disallow: and/or Allow: directives.

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /private/public-page

Sitemap: https://www.yoursite.com/sitemap.xml

User-agent

Specifies which crawler the rules apply to. User-agent: * applies to all crawlers. A crawler that finds a group naming it specifically follows that group and ignores the wildcard:

User-agent    Crawler                Company
Googlebot     Google web crawl       Google
Bingbot       Bing web crawl         Microsoft
GPTBot        ChatGPT training       OpenAI
ClaudeBot     Claude training data   Anthropic
CCBot         Common Crawl           Common Crawl
Applebot      Apple search           Apple
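
For example, under the hypothetical rules below, Googlebot follows only the group that names it and ignores the wildcard group, so it may crawl /drafts/ while every other crawler is kept out:

User-agent: *
Disallow: /drafts/

User-agent: Googlebot
Disallow: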

Disallow and Allow

Disallow: /path/ tells the crawler to skip that path and all its sub-pages. Allow: /path/file.html overrides a broader Disallow for a specific sub-resource. An empty Disallow: line with no path means "allow everything" for that user-agent.

When Disallow and Allow rules conflict, the more specific rule wins. If two rules are equally specific, the Allow rule wins in Google's implementation.
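
A hypothetical illustration: under the rules below, URLs under /files/ are blocked, but anything under /files/downloads/ stays crawlable because the Allow directive is the longer, more specific match for those URLs:

User-agent: *
Disallow: /files/
Allow: /files/downloads/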

Wildcards

The * character matches any sequence of characters within a path. The $ character anchors the match to the end of the URL. Examples:

Disallow: /*.json$      # Block all URLs ending in .json
Disallow: /search?*     # Block all search result pages
Allow: /api/public/     # Allow this specific API path

Crawl Budget and robots.txt

Crawl budget is the number of URLs Googlebot will crawl on your site within a given time window. For large sites (millions of pages), crawl budget directly affects how quickly new or updated content appears in search results.

Disallowing low-value URLs — paginated URLs with parameters, internal search results, session-ID URLs, staging duplicates — preserves crawl budget for the pages that matter.
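
The rule set below is a sketch of that idea; the parameter names and directories are placeholders for whatever low-value URL patterns your own site actually generates:

User-agent: *
Disallow: /search
Disallow: /*?sessionid=
Disallow: /*?sort=
Disallow: /staging/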

Blocking AI Training Bots

Several AI companies crawl the web to build training datasets. These bots respect robots.txt — blocking them is as simple as adding a named user-agent rule:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

This does not affect Googlebot or Bingbot since each user-agent block applies only to the named crawler. The generator's user-agent selector includes all major AI bots so you can add these rules without memorising bot names.

Common Mistakes to Avoid

Blocking pages you want indexed

A common mistake is disallowing paths that contain important content — especially the JavaScript and CSS files Google needs to render your pages. If Googlebot cannot load your stylesheets, it may see a broken version of your site and rank it lower. Audit your robots.txt regularly with Google Search Console's URL Inspection tool.
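
If a directory that holds static assets has to stay blocked for other reasons, a hypothetical pattern like the one below keeps the stylesheets and scripts needed for rendering crawlable (the /assets/ path is an assumption, not a recommendation for every site):

User-agent: *
Disallow: /assets/
Allow: /assets/*.css$
Allow: /assets/*.js$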

Thinking robots.txt provides security

robots.txt is a courtesy document, not a security mechanism. Any bot can ignore it. Disallowing /secret-data/ keeps well-behaved crawlers away but does nothing to stop a scraper or attacker. Protect sensitive content with authentication, firewalls, or server-side access control — not a robots.txt directive.

Wrong file placement

The robots.txt file must be at the root of your domain — https://www.example.com/robots.txt. Placing it at /blog/robots.txt or any other sub-path is invalid and will be ignored by all crawlers.

Using robots.txt to remove pages from search

Disallowing a URL prevents crawling but does not remove it from Google's index. Google can still show the URL in results if other pages link to it, even if it cannot crawl the page content. To de-index a page, add a noindex meta tag to the page HTML or use the X-Robots-Tag: noindex HTTP header.
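
For reference, the two forms look like this; note the page must remain crawlable (not disallowed) for Google to see either directive. In the page HTML:

<meta name="robots" content="noindex">

Or as an HTTP response header:

X-Robots-Tag: noindex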

Common Questions

How often do crawlers re-read robots.txt?

Googlebot typically caches robots.txt for up to 24 hours. If you make a change and need it picked up sooner, you can request a recrawl of the file in Google Search Console, though there is no guarantee the cached copy will be refreshed immediately.

Can I have multiple User-agent blocks?

Yes. Each block applies only to the user-agents listed in its header. You can have as many blocks as needed. The wildcard block (User-agent: *) acts as a catch-all for any crawler not covered by a named block.

Does Disallow: / block everything?

Yes — Disallow: / tells the crawler it may not access any path on your site. This is commonly used in staging environments to prevent test content from being indexed. If you accidentally deploy a robots.txt with this rule to production, your entire site will stop being crawled within 24–48 hours.
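
A staging robots.txt that blocks every crawler is just two lines:

User-agent: *
Disallow: /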

Build Your robots.txt Now

Use the generator to create a complete, correctly formatted robots.txt file with user-agent rules, path directives, and your sitemap URL — all in your browser.

Open Robots.txt Generator