Pricing

from $10.00 / 1,000 sites analyzed

Sitemap Content Extractor

Crawl any website's sitemap.xml and extract structured content from each page. Full-text extraction, metadata, headings, and word counts for SEO audits and content inventories.

Developer

Oaida Adrian

Last modified

6 days ago

Sitemap Content Extractor — Crawl a Sitemap, Extract Clean Content

Point this Actor at any website's sitemap.xml and it crawls the listed URLs (up to maxUrls) and returns clean full-text content plus metadata for each page: title, meta description, keywords, H1 headings, word count, and last-modified date. It suits SEO audits, content inventories, site migrations, and building AI training corpora from documentation sites.

No browser automation to configure, no page-by-page URL lists to maintain — the sitemap is the input.

Why this Actor

Sitemap index aware — handles both plain sitemaps (<urlset>) and sitemap indexes (<sitemapindex>), recursively following every child sitemap.
Gzip support — reads .xml.gz sitemaps transparently.
Clean extraction — uses trafilatura to strip nav/ads/boilerplate and return just the article text.
Precise scoping — include/exclude URL regex patterns so you crawl only /blog/ or skip /tag/ pages.
Structured output — one dataset item per page, ready for search indexing, embeddings, or a content spreadsheet.

How it works

Give it a sitemapUrl. The Actor fetches and parses the sitemap (following index files and gzip automatically), applies your include/exclude filters, then visits up to maxUrls pages and extracts clean content and metadata from each. Set extractContent: false to inventory URLs and metadata only (faster, no page fetches).
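
The sitemap-walking step described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the Actor's actual source: the function name iter_sitemap_urls and the injected fetch callable are hypothetical, and gzip is detected by its magic bytes rather than by file extension.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set


def _local(tag: str) -> str:
    """Strip the XML namespace: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def iter_sitemap_urls(
    url: str,
    fetch: Callable[[str], bytes],  # hypothetical hook: returns the raw response body
    _seen: Optional[Set[str]] = None,
) -> Iterator[dict]:
    """Yield {'url', 'lastmod'} for every page URL, following nested indexes."""
    seen = _seen if _seen is not None else set()
    if url in seen:  # guard against index files that reference each other
        return
    seen.add(url)

    raw = fetch(url)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes, e.g. a .xml.gz sitemap
        raw = gzip.decompress(raw)

    root = ET.fromstring(raw)
    is_index = _local(root.tag) == "sitemapindex"
    for entry in root:  # <sitemap> entries in an index, <url> entries in a urlset
        fields = {_local(child.tag): (child.text or "").strip() for child in entry}
        loc = fields.get("loc")
        if not loc:
            continue
        if is_index:
            yield from iter_sitemap_urls(loc, fetch, seen)
        else:
            yield {"url": loc, "lastmod": fields.get("lastmod") or None}
```

For a quick local experiment, fetch could be something like lambda u: urllib.request.urlopen(u).read(). Injecting it keeps the parsing logic testable without network access.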

Input

{
  "sitemapUrl": "https://apify.com/sitemap.xml",
  "maxUrls": 50,
  "extractContent": true,
  "includePatterns": ["/blog/"],
  "excludePatterns": ["/tag/", "/author/"],
  "proxyConfiguration": { "useApifyProxy": true }
}

Field	Type	Required	Default	Description
`sitemapUrl`	string	Yes	—	URL to `sitemap.xml` or a sitemap index
`maxUrls`	integer	No	50	Maximum URLs to process
`extractContent`	boolean	No	true	Fetch each page and extract full text
`includePatterns`	array	No	[]	Only process URLs matching these regex patterns
`excludePatterns`	array	No	[]	Skip URLs matching these regex patterns
`proxyConfiguration`	object	No	Apify Proxy	Proxy settings for page fetches
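
To make the interaction of includePatterns, excludePatterns, and maxUrls concrete, here is a minimal sketch of one plausible filtering rule. The semantics are assumptions, not confirmed behavior of the Actor: patterns are matched anywhere in the URL (re.search), an exclude match always wins, and the cap is applied after filtering. The function name filter_urls is hypothetical.

```python
import re
from typing import Iterable, List, Sequence


def filter_urls(
    urls: Iterable[str],
    include_patterns: Sequence[str] = (),
    exclude_patterns: Sequence[str] = (),
    max_urls: int = 50,
) -> List[str]:
    """Keep URLs that match any include pattern and no exclude pattern."""
    include = [re.compile(p) for p in include_patterns]
    exclude = [re.compile(p) for p in exclude_patterns]
    kept: List[str] = []
    for url in urls:
        if len(kept) >= max_urls:  # assumed: the cap counts URLs that passed the filters
            break
        if include and not any(p.search(url) for p in include):
            continue  # an empty include list means "include everything"
        if any(p.search(url) for p in exclude):
            continue  # assumed: exclude takes precedence over include
        kept.append(url)
    return kept
```

Under these assumptions, includePatterns ["/blog/"] with excludePatterns ["/tag/"] keeps https://example.com/blog/post-1 but drops https://example.com/blog/tag/seo.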

Output

One item per page:

{
  "url": "https://apify.com/blog/web-scraping-guide",
  "title": "The Complete Web Scraping Guide",
  "content": "Web scraping is the process of ...",
  "wordCount": 2184,
  "metaDescription": "Learn web scraping from scratch...",
  "metaKeywords": "web scraping, crawling",
  "h1Headings": ["The Complete Web Scraping Guide"],
  "lastmod": "2026-06-30",
  "extractedAt": "2026-07-18T09:12:44Z"
}
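
As a rough illustration of how fields like title, metaDescription, h1Headings, and wordCount can be derived from a page, the sketch below uses only Python's html.parser. It is a stand-in for explanation purposes and not the Actor's implementation: the class PageMeta and function summarize_page are hypothetical, and the main text is passed in as content because boilerplate removal (which the Actor delegates to trafilatura) is out of scope here.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class PageMeta(HTMLParser):
    """Collect <title>, <meta name=...> values, and <h1> text from raw HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: Dict[str, str] = {}
        self.h1: List[str] = []
        self._capture: Optional[str] = None  # tag whose text is being collected

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") and attr.get("content") is not None:
            self.meta[attr["name"].lower()] = attr["content"]
        elif tag == "title":
            self._capture = "title"
        elif tag == "h1":
            self._capture = "h1"
            self.h1.append("")

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture == "title":
            self.title += data
        elif self._capture == "h1":
            self.h1[-1] += data  # also picks up text inside nested inline tags


def summarize_page(url: str, html: str, content: str) -> dict:
    """Build an item shaped like the output above from HTML plus extracted text."""
    parser = PageMeta()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title.strip(),
        "content": content,
        "wordCount": len(content.split()),  # assumed: whitespace-separated tokens
        "metaDescription": parser.meta.get("description"),
        "metaKeywords": parser.meta.get("keywords"),
        "h1Headings": [h.strip() for h in parser.h1 if h.strip()],
    }
```

The whitespace-based word count is the simplest possible definition; the Actor's own counting rule is not documented here and may differ.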

Use cases

🔍 SEO audits — inventory every indexable page, spot missing titles/meta descriptions, and measure content depth by word count.
🚚 Site migrations — pull all content from a legacy site into a structured dataset before rebuilding.
🤖 AI training data — collect clean text from documentation and blog sitemaps to feed LLM fine-tuning or RAG.
📚 Documentation indexing — build a searchable index of a docs site from its sitemap in one run.
🕵️ Competitor analysis — map a competitor's content coverage and structure across their whole site.
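
For the SEO audit use case, the dataset items can be checked with a few lines of Python. This is a minimal sketch that relies only on the output fields shown earlier. The function name audit_items and the 300-word threshold for "thin content" are arbitrary assumptions to adjust for your own site.

```python
from typing import Iterable, List


def audit_items(items: Iterable[dict], min_words: int = 300) -> List[dict]:
    """Flag pages with a missing title, missing meta description, or low word count."""
    findings = []
    for item in items:
        issues = []
        if not (item.get("title") or "").strip():
            issues.append("missing title")
        if not (item.get("metaDescription") or "").strip():
            issues.append("missing meta description")
        if (item.get("wordCount") or 0) < min_words:  # hypothetical threshold
            issues.append("thin content")
        if issues:
            findings.append({"url": item.get("url"), "issues": issues})
    return findings
```

The result is a list with one entry per problematic page, which is easy to export to a spreadsheet or to feed into a ticketing workflow.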

Run it from the API

curl -X POST "https://api.apify.com/v2/acts/darknezz~sitemap-content-extractor/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"sitemapUrl":"https://apify.com/sitemap.xml","maxUrls":100,"includePatterns":["/blog/"]}'
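
The same request can be made from Python with the standard library alone. The sketch below mirrors the curl call above (same endpoint, same JSON body). The helper names build_request and run_actor and the 300-second timeout are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = (
    "https://api.apify.com/v2/acts/"
    "darknezz~sitemap-content-extractor/run-sync-get-dataset-items"
)


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare the POST request without sending it."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout: float = 300) -> list:
    """Send the request and return the dataset items as parsed JSON."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    items = run_actor(
        "YOUR_TOKEN",
        {
            "sitemapUrl": "https://apify.com/sitemap.xml",
            "maxUrls": 100,
            "includePatterns": ["/blog/"],
        },
    )
    print(len(items), "pages extracted")
```

Splitting request construction from sending makes it easy to inspect the URL and payload before spending any credits.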

Scheduling: attach an Apify Schedule (e.g. daily) to re-crawl a sitemap and keep a content inventory or search index continuously fresh — the lastmod field lets you detect which pages changed.
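
One way to use lastmod for change detection is to compare the items from two scheduled runs. The sketch below assumes both runs produced items with the url and lastmod fields shown in the output example. The function name changed_urls is hypothetical, and pages whose sitemap omits lastmod are only detected when they are new.

```python
from typing import Iterable, List, Tuple


def changed_urls(previous: Iterable[dict], current: Iterable[dict]) -> List[Tuple[str, str]]:
    """Return (url, reason) pairs for pages that are new or have a different lastmod."""
    prev_lastmod = {item["url"]: item.get("lastmod") for item in previous}
    changes = []
    for item in current:
        url = item["url"]
        if url not in prev_lastmod:
            changes.append((url, "new"))
        elif item.get("lastmod") != prev_lastmod[url]:
            changes.append((url, "modified"))
    return changes
```

Only the URLs returned here would need re-indexing or re-embedding, which keeps downstream processing proportional to what actually changed.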

Pricing

Pay per event: a small fee per extracted page (page-extracted), plus Apify's standard platform events. You pay only for the pages you actually crawl — cap spend with maxUrls.

FAQ

Does it follow sitemap index files? Yes — nested <sitemapindex> files are followed recursively, so a single index URL crawls the whole site.

Can I crawl only part of a site? Yes — use includePatterns / excludePatterns with regex to scope to specific sections (e.g. only /blog/, skip /tag/).

What if a site has no sitemap? This Actor requires a sitemap URL. For arbitrary link-following crawls, use a general web crawler instead.

Do I need a proxy? Most sitemaps and pages fetch fine over Apify Proxy (the default). Sites with heavy anti-bot protection may need residential proxies.

How do I get metadata without full text? Set extractContent: false — you still get title, meta tags, and lastmod per URL, much faster.
