Sitemap Content Extractor

Pricing

Pay per usage

Crawl any website's sitemap.xml and extract structured content from each page: full text, metadata, headings, and word counts for SEO audits and content inventories.

Developer

Oaida Adrian

Maintained by Community

Crawl any website's sitemap.xml and extract structured content from every page. The Actor provides full-text extraction, metadata, headings, and word counts, which makes it a good fit for SEO audits, content inventories, and AI training data collection.

Features

  • Sitemap Index Support — Handles both regular sitemaps (<urlset>) and sitemap indexes (<sitemapindex>), recursively following child sitemaps
  • Gzip Support — Handles .xml.gz sitemaps automatically
  • Full Content Extraction — Extracts clean article text using trafilatura
  • Metadata Extraction — Title, meta description, meta keywords, H1 headings
  • URL Filtering — Include/exclude URL patterns via regex
  • Structured Output — Each item includes URL, content, word count, metadata, lastmod date
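
To illustrate the sitemap index and gzip handling described above, here is a minimal sketch using only the Python standard library. It is not the Actor's actual code: the function names `parse_sitemap` and `collect_urls`, and the `fetch` callable (anything that returns the raw bytes for a URL), are hypothetical and introduced for illustration only.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def parse_sitemap(data: bytes):
    """Return (kind, entries); kind is the root tag, e.g. 'urlset' or 'sitemapindex'."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    root = ET.fromstring(data)
    kind = root.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
    entries = []
    for node in root:  # <url> or <sitemap> elements
        loc = lastmod = None
        for child in node:
            name = child.tag.rsplit("}", 1)[-1]
            text = (child.text or "").strip()
            if name == "loc":
                loc = text
            elif name == "lastmod":
                lastmod = text or None
        if loc:
            entries.append({"loc": loc, "lastmod": lastmod})
    return kind, entries


def collect_urls(sitemap_url, fetch, max_urls=50, _seen=None):
    """Follow sitemap indexes recursively and return up to max_urls page entries.

    `fetch` is a hypothetical callable: fetch(url) -> bytes.
    """
    seen = set() if _seen is None else _seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)
    kind, entries = parse_sitemap(fetch(sitemap_url))
    if kind == "urlset":
        return entries[:max_urls]
    if kind != "sitemapindex":
        raise ValueError(f"unexpected root element: {kind}")
    pages = []
    for entry in entries:
        if len(pages) >= max_urls:
            break
        pages.extend(collect_urls(entry["loc"], fetch, max_urls - len(pages), seen))
    return pages
```

Detecting gzip by its magic bytes rather than by the `.xml.gz` extension is a design choice of this sketch; it also covers servers that serve compressed sitemaps under a plain `.xml` URL.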

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| sitemapUrl | string | Yes | | URL to sitemap.xml or sitemap index |
| maxUrls | integer | No | 50 | Maximum URLs to process |
| extractContent | boolean | No | true | Extract full text content from each page |
| includePatterns | array | No | [] | Only process URLs matching these regex patterns |
| excludePatterns | array | No | [] | Skip URLs matching these regex patterns |
| proxyConfiguration | object | No | Apify Proxy | Proxy settings for scraping |
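
As a rough illustration of how these parameters might fit together, the sketch below shows an example input and one plausible reading of the include/exclude filters. The page does not document the exact matching rules, so the behaviour here is an assumption: patterns are matched with `re.search`, an empty `includePatterns` list allows every URL, and exclusions take precedence. The helper `url_allowed`, the example URLs, and the example patterns are hypothetical.

```python
import re

# Example input using the parameter names from the table above.
# The URL and patterns are made up for illustration.
run_input = {
    "sitemapUrl": "https://example.com/sitemap.xml",
    "maxUrls": 50,
    "extractContent": True,
    "includePatterns": [r"/blog/"],
    "excludePatterns": [r"/tag/", r"\?page="],
}


def url_allowed(url, include_patterns, exclude_patterns):
    """Assumed semantics: must match some include pattern (if any are given)
    and must not match any exclude pattern."""
    if include_patterns and not any(re.search(p, url) for p in include_patterns):
        return False
    return not any(re.search(p, url) for p in exclude_patterns)


candidates = [
    "https://example.com/blog/first-post",
    "https://example.com/blog/tag/news",
    "https://example.com/about",
]
kept = [
    u for u in candidates
    if url_allowed(u, run_input["includePatterns"], run_input["excludePatterns"])
]
```

Under these assumptions only the first candidate survives: the second matches an exclude pattern and the third matches no include pattern.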

Output Fields

| Field | Type | Description |
|---|---|---|
| url | string | Page URL |
| title | string | Page title |
| content | string | Full extracted text content |
| wordCount | integer | Word count of extracted content |
| metaDescription | string | Meta description tag |
| metaKeywords | string | Meta keywords tag |
| h1Headings | array | List of H1 heading texts |
| lastmod | string | Last modification date from sitemap |
| extractedAt | string | ISO timestamp of extraction |
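
The sketch below shows how an item with these fields could be assembled from a page's HTML. It is an approximation, not the Actor's implementation: it uses the standard library's `html.parser` for the metadata, takes the already extracted text as an argument (the Actor is stated to use trafilatura for that step), and assumes the word count is a simple whitespace split. The names `_MetaParser` and `build_item` are hypothetical.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class _MetaParser(HTMLParser):
    """Collects <title>, <meta name=... content=...> and <h1> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.h1 = []
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.h1.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1[-1] += data


def build_item(url, html, content, lastmod=None):
    """Build one output record; `content` is the text extracted elsewhere."""
    parser = _MetaParser()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title.strip(),
        "content": content,
        "wordCount": len(content.split()),  # assumed definition of a "word"
        "metaDescription": parser.meta.get("description", ""),
        "metaKeywords": parser.meta.get("keywords", ""),
        "h1Headings": [h.strip() for h in parser.h1 if h.strip()],
        "lastmod": lastmod,
        "extractedAt": datetime.now(timezone.utc).isoformat(),
    }
```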

Use Cases

  • SEO Audits — Inventory all content on a domain, check titles and meta tags
  • Content Migration — Extract all content from a legacy site before migration
  • AI Training Data — Collect clean text from documentation sites for LLM training
  • Competitor Analysis — Analyze competitor content structure and coverage
  • Documentation Indexing — Build searchable indexes of documentation sites
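
For the SEO audit use case, a small post-processing pass over the output items might look like the following. This is only a sketch: the function `audit_items` and the thresholds `min_words` and `max_title_len` are arbitrary assumptions chosen for illustration, not recommendations from the Actor.

```python
def audit_items(items, min_words=300, max_title_len=60):
    """Flag common on-page issues using the documented output fields."""
    findings = []
    for item in items:
        issues = []
        title = item.get("title") or ""
        if not title:
            issues.append("missing title")
        elif len(title) > max_title_len:
            issues.append("long title")
        if not item.get("metaDescription"):
            issues.append("missing meta description")
        if len(item.get("h1Headings") or []) != 1:
            issues.append("expected exactly one H1")
        if item.get("wordCount", 0) < min_words:
            issues.append("thin content")
        if issues:
            findings.append({"url": item["url"], "issues": issues})
    return findings


# Made-up sample items in the documented output shape.
sample = [
    {"url": "https://example.com/a", "title": "Guide", "metaDescription": "Intro",
     "h1Headings": ["Guide"], "wordCount": 850},
    {"url": "https://example.com/b", "title": "", "metaDescription": "",
     "h1Headings": [], "wordCount": 40},
]
report = audit_items(sample)
```

With the sample data above, only the second page is reported, with all four issue types attached.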

Pricing

This actor charges per extracted page (page-extracted event).