Sitemap Content Extractor
Developer: Oaida Adrian (maintained by Community)
Crawl any website's sitemap.xml and extract structured content from every page. Full-text extraction, metadata, headings, and word counts — perfect for SEO audits, content inventories, and AI training data collection.
Features
- Sitemap Index Support: Handles both regular sitemaps (`<urlset>`) and sitemap indexes (`<sitemapindex>`), recursively following child sitemaps
- Gzip Support: Handles `.xml.gz` sitemaps automatically
- Full Content Extraction: Extracts clean article text using trafilatura
- Metadata Extraction: Title, meta description, meta keywords, H1 headings
- URL Filtering: Include/exclude URL patterns via regex
- Structured Output: Each item includes URL, content, word count, metadata, lastmod date
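To make the sitemap handling above more concrete, here is a minimal sketch of how a regular sitemap, a sitemap index, and a gzipped sitemap might be told apart and walked recursively. It is not the actor's actual code: `parse_sitemap`, `collect_page_urls`, and the caller-supplied `fetch` callable are hypothetical names, and detecting gzip by its magic bytes is an assumption about one reasonable approach.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
GZIP_MAGIC = b"\x1f\x8b"


def parse_sitemap(raw: bytes):
    """Return (kind, entries), where kind is 'urlset' or 'sitemapindex'."""
    # Assumption: gzipped sitemaps are detected by the two-byte gzip header.
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)
    kind = root.tag.replace(SITEMAP_NS, "")
    if kind not in ("urlset", "sitemapindex"):
        raise ValueError(f"unexpected root element: {root.tag}")
    child_tag = "url" if kind == "urlset" else "sitemap"
    entries = []
    for node in root.findall(f"{SITEMAP_NS}{child_tag}"):
        loc = node.findtext(f"{SITEMAP_NS}loc")
        if not loc:
            continue  # skip entries without a location
        entries.append({
            "loc": loc.strip(),
            "lastmod": node.findtext(f"{SITEMAP_NS}lastmod"),
        })
    return kind, entries


def collect_page_urls(root_url, fetch, max_urls=50):
    """Walk a sitemap or sitemap index; fetch(url) -> bytes is supplied by the caller."""
    pages, seen, stack = [], set(), [root_url]
    while stack and len(pages) < max_urls:
        url = stack.pop()
        if url in seen:
            continue  # guard against sitemap indexes that reference each other
        seen.add(url)
        kind, entries = parse_sitemap(fetch(url))
        if kind == "sitemapindex":
            # Child sitemaps are queued and parsed the same way.
            stack.extend(entry["loc"] for entry in entries)
        else:
            pages.extend(entries)
    return pages[:max_urls]
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client or proxy setup; in practice it would wrap whatever request function is in use.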
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `sitemapUrl` | string | Yes | (none) | URL to sitemap.xml or sitemap index |
| `maxUrls` | integer | No | 50 | Maximum URLs to process |
| `extractContent` | boolean | No | true | Extract full text content from each page |
| `includePatterns` | array | No | [] | Only process URLs matching these regex patterns |
| `excludePatterns` | array | No | [] | Skip URLs matching these regex patterns |
| `proxyConfiguration` | object | No | Apify Proxy | Proxy settings for scraping |
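As an illustration of the parameters in the table above, the following sketch assembles an input object and shows one plausible reading of the include/exclude filters. The helper names `build_run_input` and `url_passes_filters` are hypothetical. The `{"useApifyProxy": True}` shape for `proxyConfiguration`, the use of `re.search` (a match anywhere in the URL), and treating an empty include list as "include everything" are assumptions, not confirmed behavior of this actor.

```python
import re


def build_run_input(sitemap_url, max_urls=50, extract_content=True,
                    include_patterns=None, exclude_patterns=None):
    """Assemble an input object using the parameter names and defaults from the table."""
    if not sitemap_url:
        raise ValueError("sitemapUrl is required")
    return {
        "sitemapUrl": sitemap_url,
        "maxUrls": max_urls,
        "extractContent": extract_content,
        "includePatterns": list(include_patterns or []),
        "excludePatterns": list(exclude_patterns or []),
        # Assumption: the common Apify proxy input shape.
        "proxyConfiguration": {"useApifyProxy": True},
    }


def url_passes_filters(url, include_patterns, exclude_patterns):
    """Assumed semantics: must match an include pattern (if any) and no exclude pattern."""
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    return not any(re.search(p, url) for p in exclude_patterns)


run_input = build_run_input(
    "https://example.com/sitemap.xml",
    include_patterns=[r"/blog/"],
    exclude_patterns=[r"/tag/"],
)
```

Under these assumptions, `https://example.com/blog/post` would be processed, while `https://example.com/blog/tag/x` and `https://example.com/about` would be skipped.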
Output Fields
| Field | Type | Description |
|---|---|---|
| `url` | string | Page URL |
| `title` | string | Page title |
| `content` | string | Full extracted text content |
| `wordCount` | integer | Word count of extracted content |
| `metaDescription` | string | Meta description tag |
| `metaKeywords` | string | Meta keywords tag |
| `h1Headings` | array | List of H1 heading texts |
| `lastmod` | string | Last modification date from sitemap |
| `extractedAt` | string | ISO timestamp of extraction |
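For consumers of the dataset, the output fields above could be modeled as a typed record. This is a sketch only: `PageItem` and `make_item` are hypothetical names, counting words by whitespace splitting is an assumption about how `wordCount` might be derived, and allowing `lastmod` to be `None` assumes that some sitemap entries carry no `<lastmod>` value.

```python
from datetime import datetime, timezone
from typing import List, Optional, TypedDict


class PageItem(TypedDict):
    url: str
    title: str
    content: str
    wordCount: int
    metaDescription: str
    metaKeywords: str
    h1Headings: List[str]
    lastmod: Optional[str]  # assumption: may be absent from the sitemap
    extractedAt: str


def make_item(url, title, content, meta_description="", meta_keywords="",
              h1_headings=None, lastmod=None) -> PageItem:
    """Build a record in the documented shape from already-extracted values."""
    return PageItem(
        url=url,
        title=title,
        content=content,
        # Assumption: words are whitespace-separated tokens.
        wordCount=len(content.split()),
        metaDescription=meta_description,
        metaKeywords=meta_keywords,
        h1Headings=list(h1_headings or []),
        lastmod=lastmod,
        # ISO 8601 timestamp in UTC.
        extractedAt=datetime.now(timezone.utc).isoformat(),
    )
```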
Use Cases
- SEO Audits — Inventory all content on a domain, check titles and meta tags
- Content Migration — Extract all content from a legacy site before migration
- AI Training Data — Collect clean text from documentation sites for LLM training
- Competitor Analysis — Analyze competitor content structure and coverage
- Documentation Indexing — Build searchable indexes of documentation sites
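As a sketch of the SEO audit use case, the extracted items could be checked for common on-page issues. The function name `audit_items`, the specific checks, and the 300-word threshold for thin content are illustrative assumptions rather than part of the actor; only the field names come from the output table.

```python
def audit_items(items, min_words=300):
    """Return {url: [issue, ...]} for every item with at least one issue."""
    report = {}
    for item in items:
        issues = []
        if not (item.get("title") or "").strip():
            issues.append("missing title")
        if not (item.get("metaDescription") or "").strip():
            issues.append("missing meta description")
        h1_count = len(item.get("h1Headings") or [])
        if h1_count != 1:
            issues.append(f"expected 1 H1, found {h1_count}")
        # Hypothetical threshold; tune it to the site being audited.
        if (item.get("wordCount") or 0) < min_words:
            issues.append("thin content")
        if issues:
            report[item["url"]] = issues
    return report
```

The `items` argument is expected to be a list of dictionaries in the output shape, for example the records exported from a run's dataset as JSON.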
Pricing
This actor charges per extracted page (page-extracted event).