You can access the AI Sitemap Content Extractor programmatically from your own applications through the Apify API, using whichever client you prefer: Python, JavaScript, CLI, OpenAPI, HTTP, or MCP. To use the Apify API, you'll need an Apify account and your API token, which you can find under Integrations settings in Apify Console.
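
As a rough illustration of the workflow described above, the following Python sketch builds a request to the Actor's run-sync-get-dataset-items endpoint using only the standard library. The endpoint path and the token query parameter come from the OpenAPI definition on this page; the helper name build_run_request, the placeholder token, and the sample input values are assumptions made for the example. The request is only constructed, not sent, because sending it needs a real token and network access.

```python
import json
import urllib.parse
import urllib.request

# Base URL and endpoint path as listed in the OpenAPI definition on this page.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/enosgb~ai-sitemap-content-extractor/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that runs the Actor synchronously.

    Hypothetical helper for illustration; it is not part of any Apify client.
    """
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    token="YOUR_APIFY_TOKEN",  # placeholder, replace with your own token
    actor_input={"startUrls": [{"url": "https://example.com"}], "maxPages": 50},
)
print(request.get_method(), request.full_url)

# Sending is left commented out because it needs a valid token:
# with urllib.request.urlopen(request) as response:
#     items = json.loads(response.read())
```

Keeping request construction separate from sending makes it easy to inspect the URL and body before spending any platform credits.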

{
  "openapi": "3.0.1",
  "info": {
    "version": "1.1",
    "x-build-id": "8IeusLr5qPWG2CsE8"
  },
  "servers": [
    {
      "url": "https://api.apify.com/v2"
    }
  ],
  "paths": {
    "/acts/enosgb~ai-sitemap-content-extractor/run-sync-get-dataset-items": {
      "post": {
        "operationId": "run-sync-get-dataset-items-enosgb-ai-sitemap-content-extractor",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    },
    "/acts/enosgb~ai-sitemap-content-extractor/runs": {
      "post": {
        "operationId": "runs-sync-enosgb-ai-sitemap-content-extractor",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor and returns information about the initiated run in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/runsResponseSchema"
                }
              }
            }
          }
        }
      }
    },
    "/acts/enosgb~ai-sitemap-content-extractor/run-sync": {
      "post": {
        "operationId": "run-sync-enosgb-ai-sitemap-content-extractor",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "inputSchema": {
        "type": "object",
        "required": [
          "startUrls"
        ],
        "properties": {
          "startUrls": {
            "title": "Website URL or Sitemap URL",
            "type": "array",
            "description": "Enter the website's main URL (e.g., https://example.com) or a direct sitemap URL (e.g., https://example.com/sitemap.xml). The Actor will automatically find and parse the sitemap.",
            "default": [
              {
                "url": "https://example.com"
              }
            ],
            "items": {
              "type": "object",
              "required": [
                "url"
              ],
              "properties": {
                "url": {
                  "type": "string",
                  "title": "URL of a web page",
                  "format": "uri"
                }
              }
            }
          },
          "maxPages": {
            "title": "Maximum Pages to Process",
            "minimum": 0,
            "maximum": 50000,
            "type": "integer",
            "description": "Maximum number of pages to fetch and process. Set to 0 for unlimited (not recommended for large sites).",
            "default": 1000
          },
          "maxDepth": {
            "title": "Maximum URL Depth",
            "minimum": 0,
            "maximum": 20,
            "type": "integer",
            "description": "Maximum URL path depth to process. Pages deeper than this will be skipped. Set to 0 for no limit.",
            "default": 0
          },
          "maxConcurrency": {
            "title": "Concurrency",
            "minimum": 1,
            "maximum": 100,
            "type": "integer",
            "description": "Number of pages to fetch in parallel. Higher = faster but uses more memory. Recommended: 10-50.",
            "default": 20
          },
          "excludePatterns": {
            "title": "Exclude URL Patterns",
            "type": "array",
            "description": "Additional URL patterns to exclude (one per line, supports regex). Built-in exclusions: login, privacy, terms, admin, feeds, media files.",
            "default": [],
            "items": {
              "type": "string"
            }
          },
          "includePatterns": {
            "title": "Include Only URL Patterns",
            "type": "array",
            "description": "If set, only URLs matching these patterns will be processed (one per line, supports regex). Leave empty to process all non-excluded URLs.",
            "default": [],
            "items": {
              "type": "string"
            }
          },
          "minContentQuality": {
            "title": "Minimum Content Quality Score",
            "minimum": 0,
            "maximum": 100,
            "type": "integer",
            "description": "Minimum quality score (0-100) for a page to be included. Pages below this threshold will be skipped. Set to 0 to include all pages.",
            "default": 30
          },
          "chunkSize": {
            "title": "Chunk Size (tokens)",
            "minimum": 0,
            "maximum": 8000,
            "type": "integer",
            "description": "Target number of tokens per chunk for LLM-ready content splitting. Set to 0 to disable chunking.",
            "default": 1000
          },
          "chunkOverlap": {
            "title": "Chunk Overlap (tokens)",
            "minimum": 0,
            "maximum": 500,
            "type": "integer",
            "description": "Number of overlapping tokens between consecutive chunks for context continuity.",
            "default": 100
          },
          "enableAiSummary": {
            "title": "Enable AI Summarization",
            "type": "boolean",
            "description": "Generate a 2-4 sentence summary for each page using Groq AI.",
            "default": true
          },
          "enableAiClassification": {
            "title": "Enable AI Content Classification",
            "type": "boolean",
            "description": "Classify each page as blog_post, documentation, landing_page, etc. using Groq AI.",
            "default": true
          },
          "useProxy": {
            "title": "Use Proxy",
            "type": "boolean",
            "description": "Enable proxy rotation for sites with anti-bot protection. Requires Apify proxy plan.",
            "default": false
          }
        }
      },
      "runsResponseSchema": {
        "type": "object",
        "properties": {
          "data": {
            "type": "object",
            "properties": {
              "id": {
                "type": "string"
              },
              "actId": {
                "type": "string"
              },
              "userId": {
                "type": "string"
              },
              "startedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "finishedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "status": {
                "type": "string",
                "example": "READY"
              },
              "meta": {
                "type": "object",
                "properties": {
                  "origin": {
                    "type": "string",
                    "example": "API"
                  },
                  "userAgent": {
                    "type": "string"
                  }
                }
              },
              "stats": {
                "type": "object",
                "properties": {
                  "inputBodyLen": {
                    "type": "integer",
                    "example": 2000
                  },
                  "rebootCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "restartCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "resurrectCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "computeUnits": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "options": {
                "type": "object",
                "properties": {
                  "build": {
                    "type": "string",
                    "example": "latest"
                  },
                  "timeoutSecs": {
                    "type": "integer",
                    "example": 300
                  },
                  "memoryMbytes": {
                    "type": "integer",
                    "example": 1024
                  },
                  "diskMbytes": {
                    "type": "integer",
                    "example": 2048
                  }
                }
              },
              "buildId": {
                "type": "string"
              },
              "defaultKeyValueStoreId": {
                "type": "string"
              },
              "defaultDatasetId": {
                "type": "string"
              },
              "defaultRequestQueueId": {
                "type": "string"
              },
              "buildNumber": {
                "type": "string",
                "example": "1.0.0"
              },
              "containerUrl": {
                "type": "string"
              },
              "usage": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "integer",
                    "example": 1
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "usageTotalUsd": {
                "type": "number",
                "example": 0.00005
              },
              "usageUsd": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "number",
                    "example": 0.00005
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
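
The inputSchema above lists a minimum, a maximum, and a default for each numeric option. A minimal sketch of how a client might fill in those defaults and check the limits before calling the API follows. prepare_input is a hypothetical helper, not part of any Apify client, and the definition does not state how the API itself treats out-of-range or unknown values, so the strict checks here are an assumption.

```python
# Numeric options copied from inputSchema: name -> (minimum, maximum, default).
NUMERIC_FIELDS = {
    "maxPages": (0, 50000, 1000),
    "maxDepth": (0, 20, 0),
    "maxConcurrency": (1, 100, 20),
    "minContentQuality": (0, 100, 30),
    "chunkSize": (0, 8000, 1000),
    "chunkOverlap": (0, 500, 100),
}

# Boolean options and their defaults, also copied from inputSchema.
BOOLEAN_DEFAULTS = {
    "enableAiSummary": True,
    "enableAiClassification": True,
    "useProxy": False,
}


def prepare_input(start_urls, **overrides):
    """Return an Actor input dict with schema defaults filled in and ranges checked.

    Hypothetical client-side helper; it only mirrors the published schema.
    """
    # startUrls is the only required property; rejecting an empty list is an
    # assumption of this sketch, since the schema does not set a minimum length.
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    actor_input = {"startUrls": [{"url": url} for url in start_urls]}

    for name, (minimum, maximum, default) in NUMERIC_FIELDS.items():
        value = overrides.pop(name, default)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not minimum <= value <= maximum:
            raise ValueError(f"{name} must be between {minimum} and {maximum}")
        actor_input[name] = value

    for name, default in BOOLEAN_DEFAULTS.items():
        actor_input[name] = bool(overrides.pop(name, default))

    for name in ("excludePatterns", "includePatterns"):
        actor_input[name] = list(overrides.pop(name, []))

    if overrides:
        # Assumption: unknown keys are treated as mistakes in this sketch.
        raise ValueError(f"unknown options: {sorted(overrides)}")
    return actor_input


example = prepare_input(["https://example.com/sitemap.xml"], maxPages=200, chunkSize=0)
print(example["maxPages"], example["chunkSize"], example["maxConcurrency"])
```

Here chunkSize=0 disables chunking and maxPages=200 caps the crawl, while every option not passed explicitly falls back to the default given in the schema.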

AI Sitemap Content Extractor OpenAPI definition

OpenAPI is a standard for designing and describing RESTful APIs, allowing developers to define API structure, endpoints, and data formats in a machine-readable way. It simplifies API development, integration, and documentation.

OpenAPI works well with AI agents and GPTs because it standardizes how these systems interact with APIs, which makes integrations more reliable and communication more efficient.

Because the specification is machine-readable, AI models such as GPTs can determine which endpoints, parameters, and data formats an API exposes and use them correctly. This speeds up development, reduces integration errors, and supports context-aware responses, which makes OpenAPI a core component of many AI applications.
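
To make the "machine-readable" point concrete, here is a small Python sketch that walks an OpenAPI document and lists the operations it describes. The embedded JSON is a trimmed excerpt of the definition on this page (two of the three paths, summaries only); list_operations is a hypothetical helper written for this example, and a real tool would likely use a dedicated OpenAPI parser instead.

```python
import json

# Trimmed excerpt of the OpenAPI definition shown on this page.
SPEC_EXCERPT = json.loads("""
{
  "servers": [{"url": "https://api.apify.com/v2"}],
  "paths": {
    "/acts/enosgb~ai-sitemap-content-extractor/runs": {
      "post": {
        "operationId": "runs-sync-enosgb-ai-sitemap-content-extractor",
        "summary": "Executes an Actor and returns information about the initiated run in response."
      }
    },
    "/acts/enosgb~ai-sitemap-content-extractor/run-sync": {
      "post": {
        "operationId": "run-sync-enosgb-ai-sitemap-content-extractor",
        "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response."
      }
    }
  }
}
""")


def list_operations(spec: dict) -> list:
    """Return one record per (path, HTTP method) pair found in the spec."""
    base_url = spec["servers"][0]["url"]
    operations = []
    for path, methods in spec["paths"].items():
        for method, operation in methods.items():
            operations.append(
                {
                    "method": method.upper(),
                    "url": base_url + path,
                    "operationId": operation["operationId"],
                    "summary": operation["summary"],
                }
            )
    return operations


for op in list_operations(SPEC_EXCERPT):
    print(op["method"], op["url"])
    print("  ", op["summary"])
```

An agent framework does essentially the same traversal to decide which calls are available and what each one does, which is why a complete, accurate definition matters.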

You can download the OpenAPI definition for AI Sitemap Content Extractor from the option below:

OpenAPI.json

If you’d like to learn more about how OpenAPI powers GPTs, read our blog post.

You can also check out our other API clients:

AI Sitemap Content Extractor API in Python

AI Sitemap Content Extractor API in JavaScript

AI Sitemap Content Extractor API through CLI

AI Sitemap Content Extractor API

Sitemap Scraper

scrapers-hub/sitemap-scraper

Sitemap scraper to crawl and extract URLs, pages, and structure from website sitemaps 🌐📊 Perfect for SEO analysis, website auditing, and data extraction. Fast, reliable, and scalable.

Scrapers Hub

Website Content Crawler

alizarin_refrigerator-owner/website-crawler

Crawl websites for SEO audits. Extracts HTML, title, meta tags, headings, links, & text content from pages. Features: automatic sitemap detection & parsing; metadata extraction (title, description, OG tags); heading structure (H1, H2, H3); internal & external link analysis; image extraction w/alt text; word count.

The Howlers

Website Content Crawler API - Markdown for RAG

tugelbay/website-content-crawler

Crawl public websites and extract clean Markdown, text, or HTML for RAG pipelines, AI agents, documentation indexing, and content monitoring. Guide: https://konabayev.com/tools/website-content-crawler/?utm_source=apify_info&utm_medium=referral&utm_campaign=website-content-crawler

Tugelbay Konabayev

Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Apify

126K

4.6

(198)

Website Metadata Extractor (meta tags, sitemap, robots) 🔎

powerful_bachelor/website-metadata-extractor

🔍 Website Metadata Extractor 🌐 Extract essential website data: meta tags, robots.txt, and sitemap.xml in one scan. 📊 Analyze SEO elements, crawler directives, and site structure. ✅ Perfect for SEO audits, 🔎 competitor research, and 🚀 understanding how search engines view your website.

Powerful Bachelor

Website to Markdown Crawler - AI/RAG Data Pipeline

sovereigntaylor/website-to-markdown

Crawl any website and convert every page to clean, structured Markdown. Perfect for RAG pipelines, LLM training data, vector database ingestion, knowledge base building, and AI-powered search. Extracts main content, strips boilerplate, handles metadata, and chunks output for embeddings. Works with L

Ricardo Akiyoshi

Website Content Extractor for RAG: Markdown, HTML, Text

nezha/website-content-crawler

Turn docs sites, help centers, blogs, and websites into clean markdown, text, or HTML for RAG, AI knowledge bases, and internal search. Crawl from start URLs or sitemaps and keep the crawl in scope.

nezha

5.0

(2)

Website Content Extractor

taroyamada/website-content-extractor

Extract clean text and markdown from docs, pricing, product, policy, and help-center URLs for RAG datasets and content operations.

太郎山田

Website Content Crawler

mikolabs/website-content-crawler

Deep-crawl websites to extract clean text, Markdown, or HTML for AI/LLM apps, RAG pipelines, and vector databases. Supports adaptive crawling, HTML cleaning, file downloads, and structured dataset output. Easily integrates with LangChain, LlamaIndex, and other LLM tools.

mikolabs

5.0

(1)

Website Content Crawler

parseforge/website-content-crawler

Crawl any website and pull clean Markdown content ready for AI! Follow links across a whole domain and extract page text, titles, headings, images, and metadata. Perfect for building RAG pipelines, training datasets, knowledge bases, and vector databases. Start crawling content in minutes!