AI Sitemap Content Extractor
Pricing
from $4.00 / 1,000 processed pages
Developer
Enos gabriel
Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries. Built for RAG pipelines, LLM applications, and content automation.
What does AI Sitemap Content Extractor do?
AI Sitemap Content Extractor converts any website into a structured, AI-ready dataset. Unlike basic sitemap extractors that only return URLs, this Actor fetches each page, cleans the content by removing navigation, headers, footers, and scripts, converts the main content to clean Markdown, and optionally enriches it with AI-generated summaries and content classification.
Simply provide a website URL or sitemap URL, and the Actor will:
- Discover all pages via sitemap
- Fetch and clean each page's content
- Convert HTML to Markdown
- Generate semantic chunks for LLM usage
- Optionally add AI summaries and classification
Try it with any website: https://example.com
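The discovery step above follows the standard sitemap protocol. As an illustrative sketch (the Actor's internal implementation is not shown here, so the function name and details below are assumptions), collecting page URLs from a sitemap document looks roughly like this:

```python
# Hypothetical sketch of sitemap discovery: parse a sitemap.xml document
# and collect its <loc> entries. Real sitemaps are fetched over HTTP; an
# inline sample is used here to keep the example self-contained.
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return all <loc> URLs from a standard sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS)]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

urls = extract_urls(sample)
```

Sitemap index files (sitemaps that list other sitemaps) use a `<sitemapindex>` root instead and would need one extra level of recursion.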
Why use AI Sitemap Content Extractor?
- RAG Pipeline Ready: Output is ready for LangChain, LlamaIndex, or any RAG framework
- Clean Content: Removes noise (nav, footer, scripts) to extract only meaningful content
- Semantic Chunking: Splits content into LLM-friendly chunks with configurable overlap
- AI Enrichment: Optional Groq-powered summarization and classification (free tier available)
- Cost Effective: Uses efficient HTTP-based scraping (Cheerio) - no expensive browser automation
- Smart Filtering: Automatically skips login pages, privacy policy, terms, and other low-value pages
- Quality Scoring: Built-in content quality assessment filters out thin or low-quality pages
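Semantic chunking with overlap, as described above, can be sketched in a few lines. This is not the Actor's actual chunker (that code is not public); it approximates tokens as whitespace-separated words to stay self-contained:

```python
# Illustrative chunker: split text into fixed-size windows with overlap,
# so context at chunk boundaries is not lost. Tokens are approximated
# as words here; a real implementation would use a proper tokenizer.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    words = text.split()
    if not words:
        return []
    step = max(chunk_size - overlap, 1)  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already reached the end of the text
    return chunks
```

A larger overlap improves retrieval recall at chunk boundaries at the cost of some duplicated storage.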
How to use AI Sitemap Content Extractor
- Enter Website URL: Provide the main website URL (e.g., https://example.com) or a direct sitemap URL
- Configure Options (optional):
  - Set maximum pages to process (default: 1000)
  - Enable/disable AI summarization and classification (enabled by default)
  - Adjust chunk size for LLM processing
  - Set content quality threshold
- Run the Actor: Click "Start" to begin extraction
- Download Results: Get structured JSON output with clean Markdown, tokens, chunks, and AI enrichment
Input
| Parameter | Description | Default |
|---|---|---|
| Website URL or Sitemap URL | The website to extract content from | Required |
| Maximum Pages | Maximum number of pages to process | 1000 |
| Maximum URL Depth | Maximum path depth to crawl (0 = unlimited) | 0 |
| Concurrency | Number of parallel requests | 20 |
| Chunk Size | Target tokens per chunk for LLM | 1000 |
| Enable AI Summary | Generate 2-4 sentence summary per page | true |
| Enable AI Classification | Classify content type (blog, docs, etc.) | true |
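Put together, a run input covering these parameters might look like the following. Note that the key names below are assumptions for illustration; the authoritative names come from the Actor's input schema shown in the Apify console:

```json
{
  "startUrl": "https://example.com",
  "maxPages": 500,
  "maxDepth": 0,
  "concurrency": 20,
  "chunkSize": 800,
  "enableSummary": true,
  "enableClassification": false
}
```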
Output
Each page in the dataset includes:
{
  "url": "https://example.com/blog/post-1",
  "title": "My Blog Post",
  "content_markdown": "# Introduction\n\nThis is the clean content...",
  "tokens": 1234,
  "word_count": 850,
  "reading_time_minutes": 4,
  "chunks": [
    {
      "index": 0,
      "content": "## Introduction\n\nThis is the first chunk...",
      "token_count": 450,
      "heading": "Introduction"
    }
  ],
  "summary": "This article covers the main topic with key insights...",
  "content_type": "blog_post",
  "metadata": {
    "depth": 2,
    "fetched_at": "2024-01-20T10:30:00Z",
    "content_quality_score": 85
  }
}
You can download the dataset in JSON, CSV, or Excel format.
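To feed the output into a RAG pipeline, each record is typically flattened into one document per chunk. A minimal sketch, assuming the output shape documented above (the `to_documents` helper and its id scheme are illustrative, not part of the Actor):

```python
# Sketch: turn one dataset record into per-chunk documents with stable ids
# and traceable source metadata, ready for a vector store.
record = {
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "chunks": [
        {"index": 0, "content": "## Introduction\n\nFirst chunk...",
         "token_count": 450, "heading": "Introduction"},
        {"index": 1, "content": "More content...",
         "token_count": 300, "heading": "Details"},
    ],
}

def to_documents(rec: dict) -> list[dict]:
    """One document per chunk; the id encodes URL + chunk index for dedup."""
    return [
        {
            "id": f"{rec['url']}#chunk-{c['index']}",
            "text": c["content"],
            "metadata": {
                "url": rec["url"],
                "title": rec["title"],
                "heading": c.get("heading"),
            },
        }
        for c in rec["chunks"]
    ]

docs = to_documents(record)
```

The same mapping works whether you iterate the dataset via the Apify API or load the downloaded JSON file.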
Data Schema
| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| title | string | Page title |
| content_markdown | string | Clean Markdown content |
| tokens | integer | Estimated token count |
| word_count | integer | Word count |
| reading_time_minutes | integer | Estimated reading time |
| chunks | array | Semantic chunks for LLM |
| summary | string | AI-generated summary (optional) |
| content_type | string | AI-classified type (optional) |
| metadata | object | Page metadata |
Pricing
This Actor is free to use on Apify's free tier.
AI features (summarization and classification) are included at no additional cost - the Actor uses a built-in Groq API key.
Tips and Advanced Options
- Increase concurrency (30-50) for faster extraction on fast servers
- Use proxy if targeting sites with anti-bot protection
- Adjust chunk size based on your LLM's context window (smaller = more chunks)
- Raise the content quality threshold to skip more thin or low-quality pages
- Use custom URL filters (regex patterns) to include or exclude specific paths
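How include/exclude regex filters interact can be sketched as follows (the option names in the Actor's input may differ; this only illustrates the filtering logic):

```python
# Sketch of regex-based URL filtering: a URL is kept only if it matches
# the include pattern (when one is given) and does not match the exclude
# pattern (when one is given).
import re

def filter_urls(urls, include=None, exclude=None):
    out = []
    for u in urls:
        if include and not re.search(include, u):
            continue  # include pattern set, URL does not match it
        if exclude and re.search(exclude, u):
            continue  # URL matches the exclude pattern
        out.append(u)
    return out

urls = [
    "https://example.com/blog/post-1",
    "https://example.com/privacy-policy",
    "https://example.com/docs/intro",
]
kept = filter_urls(urls, include=r"/(blog|docs)/", exclude=r"privacy")
```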
Limitations
- JavaScript-rendered content is not supported in this version; it would require browser automation such as Playwright
- Very large sites may take longer to process
- Some sites may block scraping - use proxy option if needed