Sitemap Content Crawler

Pricing: Pay per usage

Developer: Donny Nguyen (Maintained by Community)

Actor stats: 2 total users · 1 monthly active user · last modified 2 days ago

Parse any sitemap.xml file and crawl every URL found to extract full page content. Returns structured data including the title, text content, meta description, heading hierarchy, and word count for every page.

Features

  • Automatic sitemap parsing with support for sitemap index files and nested sitemaps
  • Full page content extraction including title, meta tags, headings, and body text
  • Word count analysis for content auditing and quality assessment
  • Heading hierarchy preservation for understanding page structure
  • Configurable page limits to control crawl scope and resource usage
  • Proxy support for reliable access to any website
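The Actor's internals are not published on this page, but the nested-sitemap handling described above can be sketched with Python's standard library. Here `fetch_xml` is a hypothetical stand-in for an HTTP fetch; the recursion into sitemap index files is the point of the example:

```python
# Minimal sketch of sitemap parsing with nested-index support.
# fetch_xml(url) -> str is a hypothetical callback that returns
# the XML body of a (possibly nested) sitemap.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(xml_text, fetch_xml):
    """Return every page URL, recursing into sitemap index files."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps, not pages; recurse.
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(extract_urls(fetch_xml(loc.text.strip()), fetch_xml))
        return urls
    # A plain urlset lists the page URLs directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
```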

Use Cases

  • Build comprehensive AI knowledge bases from entire websites
  • Audit website content quality, coverage, and completeness
  • Create search indexes for internal documentation or knowledge management
  • Monitor content changes across a website over time
  • Extract training data for NLP models from structured web content

Input Configuration

| Parameter  | Type    | Default                                | Description                 |
| ---------- | ------- | -------------------------------------- | --------------------------- |
| sitemapUrl | string  | `"https://docs.apify.com/sitemap.xml"` | URL of the sitemap.xml file |
| maxPages   | integer | `500`                                  | Maximum pages to crawl      |
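A minimal example input, using the default values from the table above:

```json
{
    "sitemapUrl": "https://docs.apify.com/sitemap.xml",
    "maxPages": 500
}
```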

Output Format

Each page produces a dataset item with:

  • url - Page URL from the sitemap
  • title - HTML page title
  • metaDescription - Meta description tag content
  • headings - Array of headings with level and text
  • content - Full text content of the page
  • wordCount - Total word count
  • lastModified - Last modified date from sitemap if available
  • scrapedAt - ISO timestamp of when the page was scraped
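An illustrative dataset item with these fields might look like the following (all values are made up for illustration, not real output):

```json
{
    "url": "https://example.com/getting-started",
    "title": "Getting Started",
    "metaDescription": "A short intro page.",
    "headings": [
        { "level": 1, "text": "Getting Started" },
        { "level": 2, "text": "Installation" }
    ],
    "content": "Getting Started Installation ...",
    "wordCount": 384,
    "lastModified": "2024-05-01",
    "scrapedAt": "2024-05-10T12:00:00.000Z"
}
```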

Integration Tips

The output is designed for feeding into AI systems like RAG pipelines, vector databases, and search engines. Each page is self-contained with metadata for proper chunking and indexing.
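As one illustration of that chunking step, the sketch below splits a dataset item's content into overlapping word windows that each carry the page metadata. The `chunk_item` helper, chunk size, and overlap are example choices for this sketch, not part of the Actor:

```python
# Illustrative RAG-style chunking of one dataset item.
# Chunk size and overlap are arbitrary example values.
def chunk_item(item, chunk_words=200, overlap=40):
    """Split a page's content into overlapping word-window chunks,
    each carrying the metadata needed for indexing."""
    words = item["content"].split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        text = " ".join(words[start:start + chunk_words])
        if not text:
            break
        chunks.append({
            "url": item["url"],          # source page for citation
            "title": item["title"],      # context for retrieval
            "chunkIndex": len(chunks),   # position within the page
            "text": text,
        })
    return chunks
```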

Limitations

  • Only processes URLs found in the sitemap; pages not listed there are skipped
  • Cheerio parses static HTML and does not execute JavaScript, so client-rendered content may be missed
  • For very large sitemaps (50K+ URLs), limit the crawl scope with the maxPages parameter