Deprecated

Pricing

from $4.00 / 1,000 processed pages

AI Sitemap Content Extractor

Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries.

Developer

Enos Melo

Last modified: 2 months ago

What does AI Sitemap Content Extractor do?

AI Sitemap Content Extractor converts any website into a structured, AI-ready dataset. Unlike basic sitemap extractors that only return URLs, this Actor fetches each page, cleans the content by removing navigation, headers, footers, and scripts, converts the main content to clean Markdown, and optionally enriches it with AI-generated summaries and content classification.

Simply provide a website URL or sitemap URL, and the Actor will:

1. Discover all pages via the sitemap
2. Fetch and clean each page's content
3. Convert HTML to Markdown
4. Generate semantic chunks for LLM usage
5. Optionally add AI summaries and classification
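
The discovery step can be sketched as follows. This is a minimal illustration of reading `<loc>` entries from a sitemap using only the Python standard library, not the Actor's actual implementation. The function name `extract_sitemap_urls` and the sample XML are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value in a sitemap or sitemap index document.

    Hypothetical helper for illustration; the Actor's own discovery
    logic is not documented here.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Made-up sample sitemap used only to exercise the helper.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

urls = extract_sitemap_urls(sample)
print(urls)
```

For a sitemap index, the same helper returns the child sitemap URLs, which would then need to be fetched and parsed in turn.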

Try it with any website: https://example.com

Why use AI Sitemap Content Extractor?

RAG Pipeline Ready: Output is ready for LangChain, LlamaIndex, or any RAG framework
Clean Content: Removes noise (nav, footer, scripts) to extract only meaningful content
Semantic Chunking: Splits content into LLM-friendly chunks with configurable overlap
AI Enrichment: Optional Groq-powered summarization and classification (free tier available)
Cost Effective: Uses efficient HTTP-based scraping (Cheerio) - no expensive browser automation
Smart Filtering: Automatically skips login pages, privacy policy, terms, and other low-value pages
Quality Scoring: Built-in content quality assessment filters out thin or low-quality pages
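
To make "chunks with configurable overlap" concrete, here is a rough sketch of a sliding-window chunker. It is a simplified stand-in, not the Actor's semantic chunking: it counts whitespace-separated words instead of real tokens and ignores headings. The function name `chunk_words` and its parameters are assumptions.

```python
def chunk_words(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words.

    Assumption: words approximate tokens. A real pipeline would use the
    tokenizer of the target LLM.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
print(demo)
```

Each window repeats the last `overlap` words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.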

How to use AI Sitemap Content Extractor

1. Enter Website URL: Provide the main website URL (e.g., https://example.com) or a direct sitemap URL
2. Configure Options (optional):
   - Set maximum pages to process (default: 1000)
   - Enable/disable AI summarization and classification (enabled by default)
   - Adjust chunk size for LLM processing
   - Set content quality threshold
3. Run the Actor: Click "Start" to begin extraction
4. Download Results: Get structured JSON output with clean Markdown, token counts, chunks, and AI enrichment

Input

Parameter	Description	Default
Website URL or Sitemap URL	The website to extract content from	Required
Maximum Pages	Maximum number of pages to process	1000
Maximum URL Depth	Maximum path depth to crawl (0 = unlimited)	0
Concurrency	Number of parallel requests	20
Chunk Size	Target tokens per chunk for LLM	1000
Enable AI Summary	Generate 2-4 sentence summary per page	true
Enable AI Classification	Classify content type (blog, docs, etc.)	true
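
The Maximum URL Depth setting can be illustrated with a small sketch. The Actor's exact definition of depth is not documented here. This example assumes depth means the number of non-empty path segments, which is consistent with the output example further down (depth 2 for /blog/post-1). The names `url_depth` and `within_depth` are hypothetical.

```python
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count non-empty path segments (assumed definition of depth)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def within_depth(url: str, max_depth: int = 0) -> bool:
    """Mirror the documented convention that 0 means unlimited."""
    return max_depth == 0 or url_depth(url) <= max_depth


print(url_depth("https://example.com/blog/post-1"))
print(within_depth("https://example.com/a/b/c", max_depth=2))
```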

Output

Each page in the dataset includes:

{
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "content_markdown": "# Introduction\n\nThis is the clean content...",
    "tokens": 1234,
    "word_count": 850,
    "reading_time_minutes": 4,
    "chunks": [
        {
            "index": 0,
            "content": "## Introduction\n\nThis is the first chunk...",
            "token_count": 450,
            "heading": "Introduction"
        }
    ],
    "summary": "This article covers the main topic with key insights...",
    "content_type": "blog_post",
    "metadata": {
        "depth": 2,
        "fetched_at": "2024-01-20T10:30:00Z",
        "content_quality_score": 85
    }
}

You can download the dataset in JSON, CSV, or Excel format.
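
For RAG ingestion, each page record usually needs to be flattened into one row per chunk. The sketch below shows one way to do that with the fields from the example output. The row layout (`id`, `text`, `metadata`) and the function name `flatten_chunks` are assumptions for illustration, not a format required by any particular framework.

```python
def flatten_chunks(pages: list[dict]) -> list[dict]:
    """Turn page records into one row per chunk, keeping source metadata."""
    rows = []
    for page in pages:
        for chunk in page.get("chunks", []):
            rows.append({
                # Hypothetical ID scheme: page URL plus chunk index.
                "id": f'{page["url"]}#chunk-{chunk["index"]}',
                "text": chunk["content"],
                "metadata": {
                    "url": page["url"],
                    "title": page.get("title"),
                    "heading": chunk.get("heading"),
                    "token_count": chunk.get("token_count"),
                },
            })
    return rows


# One record shaped like the example output. A downloaded JSON export
# would be loaded with json.load() instead (assuming it is a JSON array).
pages = [{
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "chunks": [{
        "index": 0,
        "content": "## Introduction\n\nThis is the first chunk...",
        "token_count": 450,
        "heading": "Introduction",
    }],
}]

rows = flatten_chunks(pages)
print(rows[0]["id"])
```

Keeping the URL and heading in each row's metadata lets a retrieval layer cite the source page for every chunk it returns.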

Data Schema

Field	Type	Description
url	string	Page URL
title	string	Page title
content_markdown	string	Clean Markdown content
tokens	integer	Estimated token count
word_count	integer	Word count
reading_time_minutes	integer	Estimated reading time
chunks	array	Semantic chunks for LLM
summary	string	AI-generated summary (optional)
content_type	string	AI-classified type (optional)
metadata	object	Page metadata
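
A consumer of the dataset may want to check records against this schema before indexing them. The following sketch does a basic presence and type check. It assumes that every field not marked optional in the table is always present. The names `REQUIRED_FIELDS`, `OPTIONAL_FIELDS`, and `schema_problems` are hypothetical.

```python
# Field names and types taken from the Data Schema table.
REQUIRED_FIELDS = {
    "url": str,
    "title": str,
    "content_markdown": str,
    "tokens": int,
    "word_count": int,
    "reading_time_minutes": int,
    "chunks": list,
    "metadata": dict,
}
# Marked "(optional)" in the table, so absence is not an error.
OPTIONAL_FIELDS = {"summary": str, "content_type": str}


def schema_problems(record: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the record passed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        value = record.get(name)
        if value is not None and not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


good = {
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "content_markdown": "# Introduction",
    "tokens": 1234,
    "word_count": 850,
    "reading_time_minutes": 4,
    "chunks": [],
    "metadata": {"depth": 2},
}
print(schema_problems(good))
print(schema_problems({"url": "https://example.com/", "tokens": "many"}))
```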

Pricing

This Actor is free to use on Apify's free tier.

AI features (summarization and classification) are included at no additional cost - the Actor uses a built-in Groq API key.

Tips and Advanced Options

Concurrency: Raise it from the default of 20 to 30-50 for faster extraction when the target server can handle the load
Proxy: Use one when targeting sites with anti-bot protection
Chunk size: Match it to your LLM's context window; a smaller size produces more chunks per page
Quality threshold: Raise it to skip more low-quality pages
Custom URL filters: Use regex patterns to include or exclude specific paths
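
As an illustration of regex-based URL filtering, here is a small sketch of include/exclude matching. How the Actor combines the two lists is not documented here. This example assumes exclusions win over inclusions and that an empty include list allows everything. The name `make_url_filter` and the patterns are made up for the example.

```python
import re
from typing import Callable, Iterable


def make_url_filter(
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> Callable[[str], bool]:
    """Build a predicate that decides whether a URL should be processed."""
    include_patterns = [re.compile(p) for p in include]
    exclude_patterns = [re.compile(p) for p in exclude]

    def keep(url: str) -> bool:
        # Assumed precedence: any exclude match rejects the URL.
        if any(p.search(url) for p in exclude_patterns):
            return False
        # With no include patterns, every non-excluded URL is kept.
        return not include_patterns or any(p.search(url) for p in include_patterns)

    return keep


keep = make_url_filter(
    include=[r"/blog/", r"/docs/"],
    exclude=[r"/(login|privacy|terms)(/|$)"],
)
print(keep("https://example.com/blog/post-1"))
print(keep("https://example.com/docs/privacy"))
```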

Limitations

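JavaScript-rendered content is not supported in this version: the Actor uses HTTP-based scraping (Cheerio), and rendering such pages would require a browser tool such as Playwright
Very large sites take longer to process, and a run stops at the Maximum Pages limit (default: 1000)
Some sites block scraping; use the proxy option if requests are rejected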
