Sitemap Content Crawler

Pricing: Pay per usage

Developer: Donny Nguyen (Maintained by Community)

Actor stats: 2 total users · 1 monthly active user · last modified 2 days ago

Parse any sitemap.xml file and crawl every URL found to extract full page content. Returns structured data including the title, text content, meta description, heading hierarchy, and word count for every page.

Features

  • Automatic sitemap parsing with support for sitemap index files and nested sitemaps
  • Full page content extraction including title, meta tags, headings, and body text
  • Word count analysis for content auditing and quality assessment
  • Heading hierarchy preservation for understanding page structure
  • Configurable page limits to control crawl scope and resource usage
  • Proxy support for reliable access to any website
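The Actor's internals are not published on this page, but the nested-sitemap handling described above can be sketched with Python's standard library. Here `fetch_xml` is a hypothetical stand-in for an HTTP fetch; the recursion into sitemap index files is the point of the example:

```python
# Minimal sketch of sitemap parsing with nested-index support.
# fetch_xml(url) -> str is a hypothetical callback that returns
# the XML body of a (possibly nested) sitemap.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text, fetch_xml):
    """Return every page URL, recursing into sitemap index files."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps, not pages; recurse.
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(extract_urls(fetch_xml(loc.text.strip()), fetch_xml))
        return urls
    # A plain urlset lists the page URLs directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```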

Use Cases

  • Build comprehensive AI knowledge bases from entire websites
  • Audit website content quality, coverage, and completeness
  • Create search indexes for internal documentation or knowledge management
  • Monitor content changes across a website over time
  • Extract training data for NLP models from structured web content

Input Configuration

| Parameter  | Type    | Default                                | Description                 |
| ---------- | ------- | -------------------------------------- | --------------------------- |
| sitemapUrl | string  | `"https://docs.apify.com/sitemap.xml"` | URL of the sitemap.xml file |
| maxPages   | integer | `500`                                  | Maximum pages to crawl      |
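A minimal example input, using the default values from the table above:

```json
{
    "sitemapUrl": "https://docs.apify.com/sitemap.xml",
    "maxPages": 500
}
```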

Output Format

Each page produces a dataset item with:

  • url - Page URL from the sitemap
  • title - HTML page title
  • metaDescription - Meta description tag content
  • headings - Array of headings with level and text
  • content - Full text content of the page
  • wordCount - Total word count
  • lastModified - Last modified date from sitemap if available
  • scrapedAt - ISO timestamp of when the page was scraped
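An illustrative dataset item with these fields might look like the following (all values are made up for illustration, not real output):

```json
{
    "url": "https://example.com/getting-started",
    "title": "Getting Started",
    "metaDescription": "A short intro page.",
    "headings": [
        { "level": 1, "text": "Getting Started" },
        { "level": 2, "text": "Installation" }
    ],
    "content": "Getting Started Installation ...",
    "wordCount": 384,
    "lastModified": "2024-05-01",
    "scrapedAt": "2024-05-10T12:00:00.000Z"
}
```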

Integration Tips

The output is designed for feeding into AI systems like RAG pipelines, vector databases, and search engines. Each page is self-contained with metadata for proper chunking and indexing.
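As one illustration of that chunking step, the sketch below splits a dataset item's content into overlapping word windows that each carry the page metadata. The `chunk_item` helper, chunk size, and overlap are example choices for this sketch, not part of the Actor:

```python
# Illustrative RAG-style chunking of one dataset item.
# Chunk size and overlap are arbitrary example values.
def chunk_item(item, chunk_words=200, overlap=40):
    """Split a page's content into overlapping word-window chunks,
    each carrying the metadata needed for indexing."""
    words = item["content"].split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        text = " ".join(words[start:start + chunk_words])
        if not text:
            break
        chunks.append({
            "url": item["url"],          # source page for citation
            "title": item["title"],      # context for retrieval
            "chunkIndex": len(chunks),   # position within the page
            "text": text,
        })
    return chunks
```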

Limitations

  • Only processes URLs found in the sitemap; pages not listed there are skipped
  • Cheerio parses static HTML and does not execute JavaScript, so client-rendered content may be missed
  • For very large sitemaps (50K+ URLs), limit the crawl scope with the maxPages parameter