LLMs.txt File Generator

Pricing: Pay per usage


Generate an llms.txt file from a website sitemap. Crawls all URLs, extracts titles and meta descriptions, and creates a Markdown-formatted file following the llms.txt specification. Then upload the generated file directly to your website (Webflow, WordPress, etc.).


Developer: Benoit Eveillard

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago


llms.txt Generator

Generate an llms.txt file from any website sitemap. This Apify Actor crawls all URLs from a sitemap, extracts page titles and meta descriptions, and creates a Markdown-formatted file that helps LLMs understand your website's content.

What is llms.txt?

The llms.txt file is a standardized way to provide LLMs (Large Language Models) with information about your website. It follows a simple Markdown format:

# Website Name
Brief description of the website.
## Pages
- [Page Title](https://example.com/page): Page description
- [Another Page](https://example.com/another): Another description
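Assembling this format from crawled pages is a simple string-building exercise. The following TypeScript sketch is purely illustrative: the `PageRecord` shape and `buildLlmsTxt` helper are assumed names, not the Actor's actual `llms-txt.ts` implementation.

```typescript
// Illustrative sketch only: build llms.txt content from crawl results.
// PageRecord and buildLlmsTxt are hypothetical names, not the Actor's API.
interface PageRecord {
  url: string;
  title: string;
  description: string;
}

function buildLlmsTxt(siteName: string, siteDescription: string, pages: PageRecord[]): string {
  const lines = [`# ${siteName}`, siteDescription, "", "## Pages"];
  for (const page of pages) {
    // One bullet per page: - [Title](url): description
    lines.push(`- [${page.title}](${page.url}): ${page.description}`);
  }
  return lines.join("\n");
}

console.log(buildLlmsTxt("Website Name", "Brief description of the website.", [
  { url: "https://example.com/page", title: "Page Title", description: "Page description" },
]));
```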

Features

  • Crawls XML sitemaps (including sitemap index files with nested sitemaps)
  • Extracts page titles from <title> tags
  • Extracts descriptions from <meta name="description"> or <meta property="og:description">
  • Supports glob patterns for URL filtering (include/exclude)
  • Respects robots.txt directives
  • Configurable concurrency and request limits
  • Progress tracking with status updates
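To give a feel for what sitemap crawling involves, here is a standalone regex-based sketch of pulling `<loc>` URLs out of a sitemap. It is illustrative only: the Actor's `sitemap.ts` uses proper crawler tooling, and a real parser uses an XML parser and also follows nested `<sitemap>` index entries.

```typescript
// Illustrative only: extract <loc> URLs from sitemap XML with a regex.
// A production sitemap loader uses real XML parsing, not regexes.
function extractLocs(xml: string): string[] {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page</loc></url>
  <url><loc>https://example.com/another</loc></url>
</urlset>`;

console.log(extractLocs(sitemap));
```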

Input

Field               | Type    | Required | Default | Description
--------------------|---------|----------|---------|---------------------------------------
sitemapUrl          | string  | Yes      | -       | URL of the XML sitemap to crawl
maxConcurrency      | integer | No       | 5       | Maximum concurrent requests (1-50)
maxRequestsPerCrawl | integer | No       | 1000    | Maximum pages to crawl (0 = unlimited)
respectRobotsTxt    | boolean | No       | true    | Honor robots.txt restrictions
includeUrlPatterns  | array   | No       | ["**"]  | Glob patterns for URLs to include
excludeUrlPatterns  | array   | No       | []      | Glob patterns for URLs to exclude

Example Input

{
  "sitemapUrl": "https://example.com/sitemap.xml",
  "maxConcurrency": 5,
  "maxRequestsPerCrawl": 500,
  "includeUrlPatterns": ["**/blog/**", "**/docs/**"],
  "excludeUrlPatterns": ["**/tag/**", "**/author/**"],
  "respectRobotsTxt": true
}

URL Pattern Examples

  • ** - Match all URLs
  • **/blog/** - Match URLs containing /blog/
  • **/docs/* - Match direct children of /docs/
  • **/*.html - Match URLs ending with .html
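These patterns follow common glob conventions: `**` spans path separators, while `*` stays within a single path segment. As a sketch of how such matching can work (the Actor's `url-filter.ts` may rely on a glob library whose edge-case behavior differs), here is a minimal glob-to-regex approach:

```typescript
// Illustrative sketch: convert Actor-style glob patterns to RegExp.
// "**" matches anything (including "/"); "*" matches within one path segment.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except "*", which carries glob meaning.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000")  // placeholder so "**" survives the next step
    .replace(/\*/g, "[^/]*")     // "*"  = anything within one segment
    .replace(/\u0000/g, ".*");   // "**" = anything, across segments
  return new RegExp(`^${pattern}$`);
}

function matchesPatterns(url: string, include: string[], exclude: string[]): boolean {
  const hit = (p: string) => globToRegExp(p).test(url);
  return include.some(hit) && !exclude.some(hit);
}

console.log(matchesPatterns(
  "https://example.com/blog/post-1",
  ["**/blog/**"],
  ["**/tag/**"],
)); // true
```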

Output

The Actor produces two outputs:

1. llms.txt File (Key-Value Store)

The generated llms.txt file is stored in the default Key-Value Store and can be downloaded directly:

https://api.apify.com/v2/key-value-stores/{storeId}/records/llms.txt

2. Crawl Results (Dataset)

The Dataset contains a single item with crawl statistics:

{
  "llmsTxtUrl": "https://api.apify.com/v2/key-value-stores/{storeId}/records/llms.txt",
  "statistics": {
    "totalDiscovered": 150,
    "totalAfterFiltering": 120,
    "successCount": 118,
    "errorCount": 2,
    "robotsSkippedCount": 0,
    "limitSkippedCount": 0,
    "startedAt": "2024-01-15T10:00:00.000Z",
    "finishedAt": "2024-01-15T10:01:30.000Z",
    "durationMs": 90000
  },
  "errors": [
    { "url": "https://example.com/broken", "message": "404 Not Found" }
  ]
}
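For consumers of this dataset item, the shape can be typed roughly as follows. These interface names are hypothetical, for illustration only; the Actor's `types.ts` may differ. Note that `durationMs` is simply `finishedAt` minus `startedAt` in milliseconds.

```typescript
// Hypothetical typing of the dataset item; interface names are illustrative.
interface CrawlStatistics {
  totalDiscovered: number;
  totalAfterFiltering: number;
  successCount: number;
  errorCount: number;
  robotsSkippedCount: number;
  limitSkippedCount: number;
  startedAt: string;  // ISO 8601 timestamp
  finishedAt: string; // ISO 8601 timestamp
  durationMs: number; // finishedAt - startedAt, in milliseconds
}

interface CrawlError {
  url: string;
  message: string;
}

interface CrawlOutput {
  llmsTxtUrl: string;
  statistics: CrawlStatistics;
  errors: CrawlError[];
}

// The example above spans 90 seconds, i.e. 90000 ms.
const durationMs =
  Date.parse("2024-01-15T10:01:30.000Z") - Date.parse("2024-01-15T10:00:00.000Z");
console.log(durationMs); // 90000
```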

Local Development

Prerequisites

  • Node.js 18+
  • npm

Setup

# Install dependencies
npm install
# Create input file
mkdir -p storage/key_value_stores/default
cat > storage/key_value_stores/default/INPUT.json << 'EOF'
{
  "sitemapUrl": "https://crawlee.dev/sitemap.xml",
  "maxConcurrency": 5,
  "maxRequestsPerCrawl": 50
}
EOF
# Run locally
npm run start:dev

Available Scripts

Command           | Description
------------------|----------------------------
npm run start:dev | Run Actor locally with tsx
npm run build     | Compile TypeScript
npm run lint      | Run ESLint
npm run lint:fix  | Fix ESLint issues
npm test          | Run tests

Project Structure

.actor/
  actor.json                  # Actor configuration
  input_schema.json           # Input validation schema
  output_schema.json          # Output schema definition
  dataset_schema.json         # Dataset structure
  key_value_store_schema.json # KV store structure
src/
  main.ts                     # Entry point and orchestration
  types.ts                    # TypeScript interfaces
  services/
    crawler.ts                # CheerioCrawler configuration
    sitemap.ts                # Sitemap loading utilities
    url-filter.ts             # Glob-based URL filtering
    llms-txt.ts               # llms.txt generation
  utils/
    constants.ts              # Default values and config
storage/                      # Local storage (dev only)

Deploy to Apify

Using Git

  1. Push your code to a Git repository
  2. Go to Apify Console
  3. Click "Link Git Repository"
  4. Select your repository

Using CLI

# Login to Apify
apify login
# Deploy
apify push


License

Apache-2.0