# AI Sitemap Content Extractor (`enosgb/ai-sitemap-content-extractor`) Actor

Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries.

- **URL**: https://apify.com/enosgb/ai-sitemap-content-extractor.md
- **Developed by:** [Enos gabriel](https://apify.com/enosgb) (community)
- **Categories:** AI, Developer tools, SEO tools
- **Stats:** 1 total users, 0 monthly users, 0.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

from $4.00 / 1,000 processed pages

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
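
As a rough illustration of the per-page event charge, the sketch below multiplies a page count by the listed starting price. The function name `estimate_event_cost_usd` is hypothetical. The $4.00 rate is the "from" price quoted above, so the result is best read as a lower bound for the event charge. Apify platform usage is billed on top and is not modeled here.

```python
from decimal import Decimal

# Listed starting price on this page: $4.00 per 1,000 processed pages.
PRICE_PER_1000_PAGES_USD = Decimal("4.00")


def estimate_event_cost_usd(processed_pages: int) -> Decimal:
    """Return the per-page event charge only (platform usage is extra)."""
    if processed_pages < 0:
        raise ValueError("processed_pages must be non-negative")
    return PRICE_PER_1000_PAGES_USD * processed_pages / Decimal(1000)


# Example: event charge for a 2,500-page run at the starting price.
print(estimate_event_cost_usd(2500))
```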

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Sitemap Content Extractor

Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries. Built for RAG pipelines, LLM applications, and content automation.

### What does AI Sitemap Content Extractor do?

AI Sitemap Content Extractor converts any website into a structured, AI-ready dataset. Unlike basic sitemap extractors that only return URLs, this Actor fetches each page, cleans the content by removing navigation, headers, footers, and scripts, converts the main content to clean Markdown, and optionally enriches it with AI-generated summaries and content classification.

Simply provide a website URL or sitemap URL, and the Actor will:

1. Discover all pages via sitemap
2. Fetch and clean each page's content
3. Convert HTML to Markdown
4. Generate semantic chunks for LLM usage
5. Optionally add AI summaries and classification

Try it with any website: `https://example.com`

### Why use AI Sitemap Content Extractor?

- **RAG Pipeline Ready**: Output is ready for LangChain, LlamaIndex, or any RAG framework
- **Clean Content**: Removes noise (nav, footer, scripts) to extract only meaningful content
- **Semantic Chunking**: Splits content into LLM-friendly chunks with configurable overlap
- **AI Enrichment**: Optional Groq-powered summarization and classification (free tier available)
- **Cost Effective**: Uses efficient HTTP-based scraping (Cheerio) - no expensive browser automation
- **Smart Filtering**: Automatically skips login pages, privacy policy, terms, and other low-value pages
- **Quality Scoring**: Built-in content quality assessment filters out thin or low-quality pages

### How to use AI Sitemap Content Extractor

1. **Enter Website URL**: Provide the main website URL (e.g., `https://example.com`) or a direct sitemap URL
2. **Configure Options** (optional):
    - Set maximum pages to process (default: 1000)
    - Enable/disable AI summarization and classification (enabled by default)
    - Adjust chunk size for LLM processing
    - Set content quality threshold
3. **Run the Actor**: Click "Start" to begin extraction
4. **Download Results**: Get structured JSON output with clean Markdown, tokens, chunks, and AI enrichment

### Input

| Parameter                  | Description                                 | Default  |
| -------------------------- | ------------------------------------------- | -------- |
| Website URL or Sitemap URL | The website to extract content from         | Required |
| Maximum Pages              | Maximum number of pages to process          | 1000     |
| Maximum URL Depth          | Maximum path depth to crawl (0 = unlimited) | 0        |
| Concurrency                | Number of parallel requests                 | 20       |
| Chunk Size                 | Target tokens per chunk for LLM             | 1000     |
| Enable AI Summary          | Generate 2-4 sentence summary per page      | true     |
| Enable AI Classification   | Classify content type (blog, docs, etc.)    | true     |

### Output

Each page in the dataset includes:

```json
{
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "content_markdown": "## Introduction\n\nThis is the clean content...",
    "tokens": 1234,
    "word_count": 850,
    "reading_time_minutes": 4,
    "chunks": [
        {
            "index": 0,
            "content": "### Introduction\n\nThis is the first chunk...",
            "token_count": 450,
            "heading": "Introduction"
        }
    ],
    "summary": "This article covers the main topic with key insights...",
    "content_type": "blog_post",
    "metadata": {
        "depth": 2,
        "fetched_at": "2024-01-20T10:30:00Z",
        "content_quality_score": 85
    }
}
````

You can download the dataset in JSON, CSV, or Excel format.
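
For programmatic downloads, the sketch below builds a dataset export URL for the three formats named above. It assumes the Apify API's dataset items endpoint with its `format` query parameter. The helper name `dataset_items_url` is hypothetical, and a private dataset would also need your API token, which is left out here.

```python
from urllib.parse import urlencode

# Formats mentioned in this README; "xlsx" is the Excel export.
SUPPORTED_FORMATS = {"json", "csv", "xlsx"}


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the export URL for a run's default dataset."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    query = urlencode({"format": fmt})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


# "<DATASET_ID>" is a placeholder for the run's defaultDatasetId.
print(dataset_items_url("<DATASET_ID>", "csv"))
```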

### Data Schema

| Field                | Type    | Description                     |
| -------------------- | ------- | ------------------------------- |
| url                  | string  | Page URL                        |
| title                | string  | Page title                      |
| content\_markdown     | string  | Clean Markdown content          |
| tokens               | integer | Estimated token count           |
| word\_count           | integer | Word count                      |
| reading\_time\_minutes | integer | Estimated reading time          |
| chunks               | array   | Semantic chunks for LLM         |
| summary              | string  | AI-generated summary (optional) |
| content\_type         | string  | AI-classified type (optional)   |
| metadata             | object  | Page metadata                   |
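
To show how these fields might feed a RAG index, here is a minimal sketch that flattens one dataset item into one record per chunk. The field names come from the schema above. The function `to_chunk_records` and its `id` scheme are hypothetical, and the record shape is only one plausible layout for a vector store.

```python
from typing import Any


def to_chunk_records(item: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn one dataset item into one record per chunk."""
    records = []
    # "chunks" may be missing or empty, e.g. when chunking is disabled.
    for chunk in item.get("chunks") or []:
        records.append(
            {
                # Hypothetical ID scheme: page URL plus chunk index.
                "id": f"{item['url']}#chunk-{chunk['index']}",
                "text": chunk["content"],
                "metadata": {
                    "url": item["url"],
                    "title": item.get("title"),
                    "heading": chunk.get("heading"),
                    "token_count": chunk.get("token_count"),
                    # Optional AI field; None when enrichment is off.
                    "content_type": item.get("content_type"),
                },
            }
        )
    return records


example_item = {
    "url": "https://example.com/blog/post-1",
    "title": "My Blog Post",
    "chunks": [
        {
            "index": 0,
            "content": "This is the first chunk...",
            "token_count": 450,
            "heading": "Introduction",
        }
    ],
    "content_type": "blog_post",
}

records = to_chunk_records(example_item)
print(records[0]["id"])
```

Carrying the source URL and heading in each record's metadata lets a retrieval step cite where a chunk came from.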

### Pricing

This Actor is free to use on Apify's free tier.

AI features (summarization and classification) are included at no additional cost - the Actor uses a built-in Groq API key.

### Tips and Advanced Options

- **Increase concurrency** (`maxConcurrency` of 30-50) for faster extraction on fast servers
- **Use proxy** (`useProxy`) if targeting sites with anti-bot protection; this requires an Apify proxy plan
- **Adjust chunk size** (`chunkSize`) based on your LLM's context window (smaller = more chunks)
- **Quality threshold** (`minContentQuality`): raise it to skip more low-quality pages
- **Custom URL filters** (`includePatterns`, `excludePatterns`): use regex patterns to include or exclude specific paths

### Limitations

- JavaScript-rendered content is not extracted: it requires a browser (Playwright), which this version does not support
- Very large sites may take longer to process
- Some sites may block scraping - use proxy option if needed

# Actor input Schema

## `startUrls` (type: `array`):

Enter the website's main URL (e.g., https://example.com) or a direct sitemap URL (e.g., https://example.com/sitemap.xml). The Actor will automatically find and parse the sitemap.

## `maxPages` (type: `integer`):

Maximum number of pages to fetch and process. Set to 0 for unlimited (not recommended for large sites).

## `maxDepth` (type: `integer`):

Maximum URL path depth to process. Pages deeper than this will be skipped. Set to 0 for no limit.
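
The Actor's exact depth rule is not documented here. A plausible reading, consistent with the README's output example where `/blog/post-1` has `"depth": 2`, is that depth counts the non-empty path segments. The sketch below implements that assumption; both function names are hypothetical.

```python
from urllib.parse import urlparse


def url_path_depth(url: str) -> int:
    """Count non-empty path segments (assumed definition of depth)."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def within_max_depth(url: str, max_depth: int) -> bool:
    """Apply the maxDepth rule, where 0 means no limit."""
    return max_depth == 0 or url_path_depth(url) <= max_depth


print(url_path_depth("https://example.com/blog/post-1"))
print(within_max_depth("https://example.com/docs/api/v2/users", 2))
```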

## `maxConcurrency` (type: `integer`):

Number of pages to fetch in parallel. Higher = faster but uses more memory. Recommended: 10-50.

## `excludePatterns` (type: `array`):

Additional URL patterns to exclude (one per line, supports regex). Built-in exclusions: login, privacy, terms, admin, feeds, media files.

## `includePatterns` (type: `array`):

If set, only URLs matching these patterns will be processed (one per line, supports regex). Leave empty to process all non-excluded URLs.
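
To preview how a pattern set might behave before a run, the sketch below applies include and exclude patterns to a URL. It assumes unanchored regex matching (`re.search`) and that exclusions take precedence over inclusions. Neither is confirmed by this page, and the built-in exclusions are not modeled. The function name `should_process` is hypothetical.

```python
import re


def should_process(
    url: str,
    include_patterns: list[str],
    exclude_patterns: list[str],
) -> bool:
    """Return True if the URL passes the (assumed) filter rules."""
    # Assumption: any exclude match rejects the URL outright.
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    # An empty include list means "process all non-excluded URLs".
    if include_patterns:
        return any(re.search(pattern, url) for pattern in include_patterns)
    return True


include = [r"/blog/"]
exclude = [r"/blog/tag/"]
print(should_process("https://example.com/blog/post-1", include, exclude))
print(should_process("https://example.com/blog/tag/news", include, exclude))
```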

## `minContentQuality` (type: `integer`):

Minimum quality score (0-100) for a page to be included. Pages below this threshold will be skipped. Set to 0 to include all pages.
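
How the score is computed is not documented here. If a run used a low threshold, a stricter one can still be applied afterwards on the client side, as in this sketch. It reads the `metadata.content_quality_score` field shown in the README's output example; the function name `filter_by_quality` is hypothetical.

```python
from typing import Any


def filter_by_quality(
    items: list[dict[str, Any]], min_quality: int
) -> list[dict[str, Any]]:
    """Keep items whose quality score meets the threshold."""
    kept = []
    for item in items:
        # Items without a score are treated as 0 (an assumption).
        score = (item.get("metadata") or {}).get("content_quality_score", 0)
        if score >= min_quality:
            kept.append(item)
    return kept


items = [
    {"url": "https://example.com/a", "metadata": {"content_quality_score": 85}},
    {"url": "https://example.com/b", "metadata": {"content_quality_score": 40}},
]
print([item["url"] for item in filter_by_quality(items, 60)])
```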

## `chunkSize` (type: `integer`):

Target number of tokens per chunk for LLM-ready content splitting. Set to 0 to disable chunking.

## `chunkOverlap` (type: `integer`):

Number of overlapping tokens between consecutive chunks for context continuity.
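
To illustrate how chunk size and overlap interact, here is a minimal fixed-window sketch. It is not the Actor's implementation: the Actor describes its chunks as semantic, and its tokenizer is not documented, so this example uses a pre-split token list as a stand-in. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Split tokens into windows that share chunk_overlap tokens."""
    if chunk_size <= 0:
        # Mirrors "set to 0 to disable chunking": return one chunk.
        return [tokens] if tokens else []
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


words = "a b c d e f g h i j".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

With the defaults of 1000 and 100, each window in this scheme would advance by 900 tokens.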

## `enableAiSummary` (type: `boolean`):

Generate a 2-4 sentence summary for each page using Groq AI.

## `enableAiClassification` (type: `boolean`):

Classify each page as blog\_post, documentation, landing\_page, etc. using Groq AI.

## `useProxy` (type: `boolean`):

Enable proxy rotation for sites with anti-bot protection. Requires Apify proxy plan.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://example.com"
    }
  ],
  "maxPages": 1000,
  "maxDepth": 0,
  "maxConcurrency": 20,
  "excludePatterns": [],
  "includePatterns": [],
  "minContentQuality": 30,
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "enableAiSummary": true,
  "enableAiClassification": true,
  "useProxy": false
}
```
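
Before starting a run, it can help to check an input object locally. The sketch below validates the integer fields against the `minimum` and `maximum` values listed in the OpenAPI specification on this page. The function name `validate_input` is hypothetical, and the platform performs its own validation regardless.

```python
from typing import Any

# (minimum, maximum) bounds copied from the OpenAPI input schema.
INTEGER_BOUNDS = {
    "maxPages": (0, 50000),
    "maxDepth": (0, 20),
    "maxConcurrency": (1, 100),
    "minContentQuality": (0, 100),
    "chunkSize": (0, 8000),
    "chunkOverlap": (0, 500),
}


def validate_input(actor_input: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []
    start_urls = actor_input.get("startUrls")
    if not start_urls:
        errors.append("startUrls is required and must not be empty")
    elif any(not isinstance(e, dict) or "url" not in e for e in start_urls):
        errors.append("each startUrls entry must be an object with a 'url' key")

    for name, (low, high) in INTEGER_BOUNDS.items():
        if name not in actor_input:
            continue  # optional field, the Actor applies its default
        value = actor_input[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{name} must be an integer")
        elif not low <= value <= high:
            errors.append(f"{name} must be between {low} and {high}")
    return errors


print(validate_input({"startUrls": [{"url": "https://example.com"}], "chunkSize": 9000}))
```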

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://example.com"
        }
    ],
    "excludePatterns": [],
    "includePatterns": []
};

// Run the Actor and wait for it to finish
const run = await client.actor("enosgb/ai-sitemap-content-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://example.com" }],
    "excludePatterns": [],
    "includePatterns": [],
}

# Run the Actor and wait for it to finish
run = client.actor("enosgb/ai-sitemap-content-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://example.com"
    }
  ],
  "excludePatterns": [],
  "includePatterns": []
}' |
apify call enosgb/ai-sitemap-content-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=enosgb/ai-sitemap-content-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Sitemap Content Extractor",
        "description": "Transform website sitemaps into clean, AI-ready content with Markdown, semantic chunks, and optional AI summaries.",
        "version": "1.1",
        "x-build-id": "8IeusLr5qPWG2CsE8"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/enosgb~ai-sitemap-content-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-enosgb-ai-sitemap-content-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/enosgb~ai-sitemap-content-extractor/runs": {
            "post": {
                "operationId": "runs-sync-enosgb-ai-sitemap-content-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/enosgb~ai-sitemap-content-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-enosgb-ai-sitemap-content-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Website URL or Sitemap URL",
                        "type": "array",
                        "description": "Enter the website's main URL (e.g., https://example.com) or a direct sitemap URL (e.g., https://example.com/sitemap.xml). The Actor will automatically find and parse the sitemap.",
                        "default": [
                            {
                                "url": "https://example.com"
                            }
                        ],
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxPages": {
                        "title": "Maximum Pages to Process",
                        "minimum": 0,
                        "maximum": 50000,
                        "type": "integer",
                        "description": "Maximum number of pages to fetch and process. Set to 0 for unlimited (not recommended for large sites).",
                        "default": 1000
                    },
                    "maxDepth": {
                        "title": "Maximum URL Depth",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum URL path depth to process. Pages deeper than this will be skipped. Set to 0 for no limit.",
                        "default": 0
                    },
                    "maxConcurrency": {
                        "title": "Concurrency",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Number of pages to fetch in parallel. Higher = faster but uses more memory. Recommended: 10-50.",
                        "default": 20
                    },
                    "excludePatterns": {
                        "title": "Exclude URL Patterns",
                        "type": "array",
                        "description": "Additional URL patterns to exclude (one per line, supports regex). Built-in exclusions: login, privacy, terms, admin, feeds, media files.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "includePatterns": {
                        "title": "Include Only URL Patterns",
                        "type": "array",
                        "description": "If set, only URLs matching these patterns will be processed (one per line, supports regex). Leave empty to process all non-excluded URLs.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "minContentQuality": {
                        "title": "Minimum Content Quality Score",
                        "minimum": 0,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Minimum quality score (0-100) for a page to be included. Pages below this threshold will be skipped. Set to 0 to include all pages.",
                        "default": 30
                    },
                    "chunkSize": {
                        "title": "Chunk Size (tokens)",
                        "minimum": 0,
                        "maximum": 8000,
                        "type": "integer",
                        "description": "Target number of tokens per chunk for LLM-ready content splitting. Set to 0 to disable chunking.",
                        "default": 1000
                    },
                    "chunkOverlap": {
                        "title": "Chunk Overlap (tokens)",
                        "minimum": 0,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Number of overlapping tokens between consecutive chunks for context continuity.",
                        "default": 100
                    },
                    "enableAiSummary": {
                        "title": "Enable AI Summarization",
                        "type": "boolean",
                        "description": "Generate a 2-4 sentence summary for each page using Groq AI.",
                        "default": true
                    },
                    "enableAiClassification": {
                        "title": "Enable AI Content Classification",
                        "type": "boolean",
                        "description": "Classify each page as blog_post, documentation, landing_page, etc. using Groq AI.",
                        "default": true
                    },
                    "useProxy": {
                        "title": "Use Proxy",
                        "type": "boolean",
                        "description": "Enable proxy rotation for sites with anti-bot protection. Requires Apify proxy plan.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
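
For projects that call the REST API directly, the sketch below builds a request for the `run-sync-get-dataset-items` path from the specification above, using only the Python standard library. The helper name `build_run_request` is hypothetical. The request is constructed but not sent, since sending it starts a billable run.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/enosgb~ai-sitemap-content-extractor/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build a POST request with the token as a query parameter, per the spec."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://example.com"}]},
)
print(request.get_method(), request.full_url)

# To actually send it (this starts a run and waits for the results):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```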
