Sitemap Content Crawler
Pricing: Pay per usage
Developer: Donny Nguyen (Maintained by Community)
Parse any sitemap.xml file and crawl every URL found to extract full page content. Returns structured data including title, text content, meta descriptions, headings hierarchy, and word count for every page.
Features
- Automatic sitemap parsing with support for sitemap index files and nested sitemaps
- Full page content extraction including title, meta tags, headings, and body text
- Word count analysis for content auditing and quality assessment
- Heading hierarchy preservation for understanding page structure
- Configurable page limits to control crawl scope and resource usage
- Proxy support for reliable access to any website
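To illustrate the first feature, here is a minimal sketch of how sitemap parsing can distinguish a sitemap index from a regular urlset and collect the `loc`/`lastmod` entries from either. This is an illustrative standalone helper, not the actor's actual implementation; the function name and return shape are assumptions.

```python
# Illustrative sketch only -- not the actor's internal code.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text: str):
    """Return (kind, entries) where kind is 'index' or 'urlset'
    and entries is a list of (loc, lastmod) tuples.

    A sitemap index lists nested sitemap files; a urlset lists pages.
    The caller would fetch and re-parse each nested sitemap when
    kind == 'index'.
    """
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    entries = []
    # <sitemap> children appear in an index, <url> children in a urlset.
    for node in root.findall("sm:sitemap", NS) + root.findall("sm:url", NS):
        loc = node.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=NS)
        if loc:
            entries.append((loc, lastmod))
    return kind, entries
```

Recursing on the entries whenever `kind == "index"` is what makes nested sitemaps work with a single entry URL.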
Use Cases
- Build comprehensive AI knowledge bases from entire websites
- Audit website content quality, coverage, and completeness
- Create search indexes for internal documentation or knowledge management
- Monitor content changes across a website over time
- Extract training data for NLP models from structured web content
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `sitemapUrl` | string | "https://docs.apify.com/sitemap.xml" | URL of the sitemap.xml file |
| `maxPages` | integer | 500 | Maximum number of pages to crawl |
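A minimal input object matching the table above might look like this (values are illustrative):

```json
{
  "sitemapUrl": "https://docs.apify.com/sitemap.xml",
  "maxPages": 500
}
```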
Output Format
Each page produces a dataset item with:
- `url` - Page URL from the sitemap
- `title` - HTML page title
- `metaDescription` - Meta description tag content
- `headings` - Array of headings with level and text
- `content` - Full text content of the page
- `wordCount` - Total word count
- `lastModified` - Last modified date from the sitemap, if available
- `scrapedAt` - ISO timestamp of when the page was scraped
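A dataset item could therefore look like the following (all values are made up for illustration):

```json
{
  "url": "https://docs.apify.com/platform",
  "title": "Apify platform",
  "metaDescription": "Documentation for the Apify platform.",
  "headings": [
    { "level": "h1", "text": "Apify platform" },
    { "level": "h2", "text": "Getting started" }
  ],
  "content": "Apify is a platform for ...",
  "wordCount": 1240,
  "lastModified": "2024-01-01",
  "scrapedAt": "2024-01-02T12:00:00.000Z"
}
```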
Integration Tips
The output is designed for feeding into AI systems like RAG pipelines, vector databases, and search engines. Each page is self-contained with metadata for proper chunking and indexing.
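As a sketch of that chunking step, the helper below splits one scraped page into overlapping word-window chunks and carries the page metadata along with each chunk, which is a common shape for vector-database ingestion. The function name, window sizes, and chunk fields are assumptions, not part of the actor's output.

```python
# Hypothetical RAG-ingestion helper; field names follow the actor's
# dataset items, but the chunking scheme itself is an assumption.
def chunk_page(item: dict, chunk_words: int = 200, overlap: int = 40):
    """Split item['content'] into overlapping word windows.

    Each chunk keeps the source URL and title so a retriever can
    cite the original page.
    """
    words = item["content"].split()
    step = chunk_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, max(len(words), 1), step):
        text = " ".join(words[start:start + chunk_words])
        if not text:
            break
        chunks.append({
            "text": text,
            "url": item["url"],
            "title": item.get("title", ""),
            "chunk": len(chunks),  # sequential chunk index
        })
    return chunks
```

The overlap keeps sentences that straddle a window boundary retrievable from both neighboring chunks.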
Limitations
- Only processes URLs found in the sitemap; pages not in the sitemap are skipped
- JavaScript-rendered content may not be fully captured with Cheerio
- Very large sitemaps (50K+ URLs) should be limited with the `maxPages` parameter to keep crawl time and cost manageable