Deprecated

Pricing

$10.00 / 1,000 results

Website Content Scraper

Developed by

Dolp

Efficient scraper for extracting meaningful text from URLs found in a sitemap. Supports nested sitemaps, processes large datasets, and filters out boilerplate content. Ideal for SEO research, content analysis, and text mining, providing clean and structured text data for further processing.

0.0 (0)

Runs succeeded

>99%

Last modified

5 months ago

SEO tools

Automation

E-commerce

🌐 Sitemap Text Extractor

A web scraping actor that extracts text content from URLs listed in a sitemap.

📝 Overview

This actor fetches all URLs from a given sitemap, including nested sitemaps, and extracts text content from each URL. The extracted text is then saved to an Apify dataset.
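The nested-sitemap traversal described above can be sketched as follows. This is a minimal illustration, not the actor's actual source: the names `fetch_xml` and `iter_sitemap_urls` are hypothetical, and the sketch assumes uncompressed XML that follows the sitemaps.org format (`sitemapindex` documents pointing to `urlset` documents).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set
from urllib.request import urlopen


def fetch_xml(url: str) -> bytes:
    """Download a sitemap document (hypothetical helper, plain XML only)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def _local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def iter_sitemap_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes] = fetch_xml,
    _seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, following nested sitemap indexes."""
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:
        return  # guard against sitemaps that reference each other
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    is_index = _local_name(root.tag) == "sitemapindex"

    for entry in root:  # <sitemap> or <url> elements
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text:
                loc = child.text.strip()
                if is_index:
                    # A nested sitemap: recurse into it.
                    yield from iter_sitemap_urls(loc, fetch, seen)
                else:
                    yield loc
```

Passing `fetch` as a parameter keeps the traversal testable without network access; calling `list(iter_sitemap_urls("https://example.com/sitemap.xml"))` would use the default HTTP fetcher.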

✨ Features

🔄 Recursive Sitemap Parsing: Handles nested sitemaps to fetch all URLs.
📄 Text Extraction: Extracts meaningful text content from web pages.
🚫 Static Asset Filtering: Excludes URLs pointing to static assets like images and PDFs.
⚠️ Error Handling: Logs warnings for failed page fetches and continues processing other URLs.
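As a rough illustration of how the filtering, extraction, and error-handling features could fit together, here is a sketch using only the Python standard library. The extension list, the set of skipped tags, and the names `is_static_asset`, `extract_text`, and `scrape` are all assumptions made for this example; the actor's real rules are not documented here.

```python
import logging
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse

# Assumed list of static-asset extensions; the actor's own list may differ.
STATIC_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".pdf", ".css", ".js")


def is_static_asset(url: str) -> bool:
    """Return True when the URL path ends with a known static-file extension."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)


class _TextCollector(HTMLParser):
    """Collect visible text, ignoring scripts, styles, and common boilerplate tags."""

    SKIPPED = {"script", "style", "noscript", "nav", "header", "footer"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's visible text as a single string."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def scrape(urls: Iterable[str], fetch_html: Callable[[str], str]) -> List[Dict[str, str]]:
    """Build {url, content} records; log a warning and continue on failures."""
    records = []
    for url in urls:
        if is_static_asset(url):
            continue
        try:
            records.append({"url": url, "content": extract_text(fetch_html(url))})
        except Exception as exc:
            logging.warning("Failed to fetch %s: %s", url, exc)
    return records
```

The `fetch_html` callable is left to the caller (for example, a thin wrapper around an HTTP client), so a single failing page only produces a warning and does not stop the run.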

📋 Requirements

⚙️ Input Configuration: Provide a sitemap_url in the actor's input configuration.
🌍 Publicly Accessible Sitemap: Ensure the sitemap is publicly accessible.

🚀 Usage

Input Configuration:
- Provide a sitemap_url in the actor's input configuration.
- Ensure the sitemap is publicly accessible.
Running the Actor:
- Start the actor in Apify.
- The actor will fetch URLs from the sitemap, extract text content, and save the results to a dataset.
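The steps above could also be scripted. The sketch below assumes the `apify-client` Python package and uses a placeholder `ACTOR_ID`, since this page does not state the actor's ID; `build_run_input` and `run_actor` are hypothetical helper names. Because the actor is marked as deprecated, the call may no longer succeed, so treat this as an illustration of the input shape rather than a guaranteed recipe.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Hypothetical placeholder: replace with the real "<username>/<actor-name>" ID.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(sitemap_url: str) -> Dict[str, str]:
    """Validate the sitemap URL and build the actor input described above."""
    parsed = urlparse(sitemap_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"sitemap_url must be an absolute http(s) URL, got {sitemap_url!r}")
    return {"sitemap_url": sitemap_url}


def run_actor(token: str, sitemap_url: str, actor_id: str = ACTOR_ID) -> List[dict]:
    """Start the actor, wait for it to finish, and return the dataset items."""
    # Imported lazily so the helper above works without the package installed
    # (assumes: pip install apify-client).
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=build_run_input(sitemap_url))
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `run_actor("<APIFY_TOKEN>", "https://example.com/sitemap.xml")` would then return the records described in the Output section.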

📤 Output

The actor saves the extracted text content to an Apify dataset. Each record in the dataset contains:

🔗 url: The URL from which the text was extracted.
📜 content: The extracted text content.

Example Output

[
  {
    "url": "https://example.com/page1",
    "content": "This is the text content extracted from page 1."
  },
  {
    "url": "https://example.com/page2",
    "content": "Page 2 contains different text content for analysis."
  },
  {
    "url": "https://example.com/blog/article",
    "content": "This article discusses important topics in detail."
  }
]
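Records in this shape are straightforward to post-process. As a small sketch, assuming the dataset has been exported as a JSON array like the one above, the hypothetical helper `summarize` below computes a word count per page:

```python
import json
from typing import Dict, List, Union


def summarize(records_json: str) -> List[Dict[str, Union[str, int]]]:
    """Return the URL and whitespace-separated word count for each record."""
    records = json.loads(records_json)
    return [
        {"url": record["url"], "words": len(record["content"].split())}
        for record in records
    ]
```

The same loop is a natural place to drop empty records or chunk long pages before passing the text to later processing.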

💡 Example

To use this actor, set sitemap_url to the URL of the sitemap you want to process. For example, to extract text from all pages listed in https://example.com/sitemap.xml, enter that URL as sitemap_url in the input configuration.

🎯 Use Cases

This actor is useful for extracting text data from websites for various purposes, such as:

📊 Data Collection: Gathering text content for analysis or processing.
🤖 AI Training: Feeding text data to AI models for training or fine-tuning.
📝 Content Summarization: Extracting key information from large volumes of text.

🤝 Contributing

Contributions are welcome! Feel free to report issues or suggest improvements.

Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Apify

59K

3.7

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

1.2K

4.4

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

5.0

AI Website Content Markdown Scraper

quaking_pail/ai-website-content-markdown-scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

AI_Builder

589

3.8

Deep Website Content Crawler

6sigmag/deep-website-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

427

4.1

Website Content Extractor

fastidious_drawer/website-content-extractor

This extractor lets you extract content from any website with a single or multiple URLs. Use selectors to choose specific sections like the body and exclude elements like headers or navigation. It also extracts images and links, providing data in JSON and DataTable formats for easy processing.

fastidious_drawer

Website Changes Detector

tri_angle/website-changes-detector

This actor uses Apify’s Website Content Crawler to track website changes by comparing new and previous crawl results, highlighting only relevant updates to save time and resources.

Tri⟁angle

Sitemap Change Orchestrator

tri_angle/sitemap-change-orchestrator

Monitor website sitemaps for new, updated, or removed URLs. Integration with the Website Content Crawler (WCC) allows feeding only relevant URLs. This ensures your web crawls are efficient, targeted, and resource-optimized, keeping your datasets fresh for any application.

Tri⟁angle

AI-Powered Web Content & Link Extractor

scrapercoder/ai-powered-web-content-link-extractor

Crawls websites to extract clean, structured content for AI/LLM use, ideal for training datasets, knowledge bases, and RAG systems. JSON output includes text (normalized page content) and links (extracted sub-URLs).

wallnut.ai

Deep URL Content Crawler

6sigmag/deep-url-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple URLs simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

Fast URL Content Crawler

6sigmag/fast-url-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple URLs simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng