AI Sitemap Content Extractor

Pricing

from $4.00 / 1,000 processed pages

Developer

Enos gabriel

Maintained by Community

Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries. Built for RAG pipelines, LLM applications, and content automation.

What does AI Sitemap Content Extractor do?

AI Sitemap Content Extractor converts any website into a structured, AI-ready dataset. Unlike basic sitemap extractors that only return URLs, this Actor fetches each page, cleans the content by removing navigation, headers, footers, and scripts, converts the main content to clean Markdown, and optionally enriches it with AI-generated summaries and content classification.

Simply provide a website URL or sitemap URL, and the Actor will:

  1. Discover all pages via sitemap
  2. Fetch and clean each page's content
  3. Convert HTML to Markdown
  4. Generate semantic chunks for LLM usage
  5. Optionally add AI summaries and classification
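
The discovery step (step 1) can be illustrated with a minimal sketch that parses a sitemap document using only the Python standard library. This is not the Actor's own code: the function name `parse_sitemap` and the sample XML are assumptions for illustration, and fetching the sitemap over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, nested_sitemap_urls) found in one sitemap document.

    Hypothetical helper: a <urlset> yields page URLs, while a
    <sitemapindex> yields URLs of further sitemaps to parse.
    """
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        return [], locs
    return locs, []


# Sample input made up for this sketch.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

pages, nested = parse_sitemap(SAMPLE)
print(pages)
```

A sitemap index would return its entries in the second list, so a caller can keep parsing nested sitemaps until only page URLs remain.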

Try it with any website: https://example.com

Why use AI Sitemap Content Extractor?

  • RAG Pipeline Ready: Output is ready for LangChain, LlamaIndex, or any RAG framework
  • Clean Content: Removes noise (nav, footer, scripts) to extract only meaningful content
  • Semantic Chunking: Splits content into LLM-friendly chunks with configurable overlap
  • AI Enrichment: Optional Groq-powered summarization and classification (free tier available)
  • Cost Effective: Uses efficient HTTP-based scraping (Cheerio), with no expensive browser automation
  • Smart Filtering: Automatically skips login pages, privacy policy, terms, and other low-value pages
  • Quality Scoring: Built-in content quality assessment filters out thin or low-quality pages
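
To make "chunks with configurable overlap" concrete, here is a simplified sketch of fixed-window chunking. It is a plain sliding window, not the Actor's semantic chunker, and it assumes whitespace-separated words as a rough stand-in for tokens. The function name `chunk_words` and the `overlap` parameter are hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Consecutive windows share `overlap` words. Words approximate tokens
    here, which is an assumption of this sketch.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.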

How to use AI Sitemap Content Extractor

  1. Enter Website URL: Provide the main website URL (e.g., https://example.com) or a direct sitemap URL
  2. Configure Options (optional):
    • Set maximum pages to process (default: 1000)
    • Enable/disable AI summarization and classification (enabled by default)
    • Adjust chunk size for LLM processing
    • Set content quality threshold
  3. Run the Actor: Click "Start" to begin extraction
  4. Download Results: Get structured JSON output with clean Markdown, tokens, chunks, and AI enrichment

Input

| Parameter | Description | Default |
| --- | --- | --- |
| Website URL or Sitemap URL | The website to extract content from | Required |
| Maximum Pages | Maximum number of pages to process | 1000 |
| Maximum URL Depth | Maximum path depth to crawl (0 = unlimited) | 0 |
| Concurrency | Number of parallel requests | 20 |
| Chunk Size | Target tokens per chunk for LLM | 1000 |
| Enable AI Summary | Generate 2-4 sentence summary per page | true |
| Enable AI Classification | Classify content type (blog, docs, etc.) | true |
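
For programmatic runs, the parameters above could be assembled into a run input and passed to the Apify API client for Python. The sketch below is an assumption-heavy illustration: the input keys (`startUrl`, `maxPages`, and so on) are hypothetical names chosen to mirror the parameter labels, and the actor ID must be taken from the Store page. Check the Actor's input schema for the real field names before using it.

```python
from typing import Any

# Hypothetical input keys; the defaults mirror the parameter table.
DEFAULTS: dict[str, Any] = {
    "maxPages": 1000,
    "maxDepth": 0,  # 0 = unlimited
    "concurrency": 20,
    "chunkSize": 1000,
    "enableAiSummary": True,
    "enableAiClassification": True,
}


def build_run_input(url: str, **overrides: Any) -> dict[str, Any]:
    """Merge user overrides onto the documented defaults."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    return {"startUrl": url, **DEFAULTS, **overrides}


def run_actor(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict]:
    """Start a run and collect its dataset items (needs `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily: third-party package

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not finish successfully")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input("https://example.com", maxPages=50)
print(run_input)
```

Only `build_run_input` runs offline. `run_actor` needs an Apify API token and the Actor's real ID, neither of which is given in this description.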

Output

Each page in the dataset includes:

{
  "url": "https://example.com/blog/post-1",
  "title": "My Blog Post",
  "content_markdown": "# Introduction\n\nThis is the clean content...",
  "tokens": 1234,
  "word_count": 850,
  "reading_time_minutes": 4,
  "chunks": [
    {
      "index": 0,
      "content": "## Introduction\n\nThis is the first chunk...",
      "token_count": 450,
      "heading": "Introduction"
    }
  ],
  "summary": "This article covers the main topic with key insights...",
  "content_type": "blog_post",
  "metadata": {
    "depth": 2,
    "fetched_at": "2024-01-20T10:30:00Z",
    "content_quality_score": 85
  }
}

You can download the dataset in JSON, CSV, or Excel format.
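
For RAG use, each page's `chunks` array usually has to be flattened into one record per chunk before embedding. The sketch below shows one possible shape, using the field names from the example output above. The helper `to_chunk_records` and the record layout (`id`, `text`, `metadata`) are assumptions, not a format the Actor or any particular vector store prescribes.

```python
def to_chunk_records(page: dict) -> list[dict]:
    """Flatten one page item into one record per chunk.

    The record shape is hypothetical; adapt it to your vector store.
    """
    records = []
    for chunk in page.get("chunks", []):
        records.append({
            # URL plus chunk index gives a stable, unique id per chunk.
            "id": f'{page["url"]}#chunk-{chunk["index"]}',
            "text": chunk["content"],
            "metadata": {
                "url": page["url"],
                "title": page.get("title"),
                "heading": chunk.get("heading"),
                "content_type": page.get("content_type"),
                "token_count": chunk.get("token_count"),
            },
        })
    return records


# Trimmed copy of the example item shown above.
page = {
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "chunks": [
        {
            "index": 0,
            "content": "## Introduction\n\nThis is the first chunk...",
            "token_count": 450,
            "heading": "Introduction",
        }
    ],
    "content_type": "blog_post",
}

records = to_chunk_records(page)
print(records[0]["id"])
```

Carrying the page URL and heading in the metadata lets a retrieval step cite the source page for each chunk it returns.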

Data Schema

| Field | Type | Description |
| --- | --- | --- |
| url | string | Page URL |
| title | string | Page title |
| content_markdown | string | Clean Markdown content |
| tokens | integer | Estimated token count |
| word_count | integer | Word count |
| reading_time_minutes | integer | Estimated reading time |
| chunks | array | Semantic chunks for LLM |
| summary | string | AI-generated summary (optional) |
| content_type | string | AI-classified type (optional) |
| metadata | object | Page metadata |
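
Downstream code may want to check items against this schema before ingesting them. The sketch below is one way to do that, with the types taken from the table. The helper `validate_page` is hypothetical, and it treats `summary` and `content_type` as optional because the table marks them so.

```python
# Field names and types as listed in the schema table.
REQUIRED_FIELDS = {
    "url": str,
    "title": str,
    "content_markdown": str,
    "tokens": int,
    "word_count": int,
    "reading_time_minutes": int,
    "chunks": list,
    "metadata": dict,
}
OPTIONAL_FIELDS = {"summary": str, "content_type": str}


def validate_page(page: dict) -> list[str]:
    """Return a list of problems; an empty list means the item looks valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in page:
            problems.append(f"missing field: {name}")
        elif not isinstance(page[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(page[name]).__name__}"
            )
    for name, expected in OPTIONAL_FIELDS.items():
        value = page.get(name)
        # Optional fields may be absent or null when AI enrichment is off.
        if value is not None and not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


item = {
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "content_markdown": "# Introduction",
    "tokens": 1234,
    "word_count": 850,
    "reading_time_minutes": 4,
    "chunks": [],
    "metadata": {"depth": 2},
}
print(validate_page(item))
```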

Pricing

This Actor can be run on Apify's free tier. The Store listing prices usage from $4.00 per 1,000 processed pages.

AI features (summarization and classification) are included at no additional cost: the Actor uses a built-in Groq API key.

Tips and Advanced Options

  • Increase concurrency (30-50) for faster extraction on fast servers
  • Use a proxy when targeting sites with anti-bot protection
  • Adjust chunk size to fit your LLM's context window (a smaller chunk size produces more chunks)
  • Raise the quality threshold to skip more low-quality pages
  • Use custom URL filters (regex patterns) to include or exclude specific paths
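
Regex-based URL filtering combined with a depth limit can be sketched as follows. This is an illustration, not the Actor's filter: the default skip patterns are a guess at "low-value pages", the helper names are hypothetical, and depth is assumed to mean the number of non-empty path segments.

```python
import re
from typing import Sequence
from urllib.parse import urlparse

# Hypothetical patterns for low-value pages (login, privacy policy, terms).
DEFAULT_SKIP = (r"/login", r"/privacy", r"/terms")


def path_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /blog/post-1 has depth 2."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def keep_url(
    url: str,
    include: Sequence[str] = (),
    exclude: Sequence[str] = DEFAULT_SKIP,
    max_depth: int = 0,
) -> bool:
    """Decide whether a URL should be processed (max_depth 0 = unlimited)."""
    path = urlparse(url).path
    if max_depth and path_depth(url) > max_depth:
        return False
    if any(re.search(pattern, path) for pattern in exclude):
        return False
    # With no include patterns, everything not excluded is kept.
    if include and not any(re.search(pattern, path) for pattern in include):
        return False
    return True


print(keep_url("https://example.com/blog/post-1", include=[r"^/blog/"]))
print(keep_url("https://example.com/login"))
```

Exclusion is checked before inclusion here, so a URL matching both an include and an exclude pattern is skipped. That ordering is a design choice of this sketch.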

Limitations

  • JavaScript-rendered content is not supported in this version; extracting it would require a browser-based crawler such as Playwright
  • Very large sites take longer to process; use the Maximum Pages setting to cap a run
  • Some sites block scraping; use the proxy option if needed