Website Content Scraper

This Actor is under maintenance.

killerbees1982/website-content-scraper

Efficient scraper for extracting meaningful text from URLs found in a sitemap. Supports nested sitemaps, processes large datasets, and filters out boilerplate content. Ideal for SEO research, content analysis, and text mining, providing clean and structured text data for further processing.

Developer
Maintained by Community

🌐 Sitemap Text Extractor

A web scraping actor that extracts text content from URLs listed in a sitemap.

📝 Overview

This actor fetches all URLs from a given sitemap, including nested sitemaps, and extracts text content from each URL. The extracted text is then saved to an Apify dataset.

✨ Features

  • 🔄 Recursive Sitemap Parsing: Handles nested sitemaps to fetch all URLs.
  • 📄 Text Extraction: Extracts meaningful text content from web pages.
  • 🚫 Static Asset Filtering: Excludes URLs pointing to static assets like images and PDFs.
  • ⚠️ Error Handling: Logs warnings for failed page fetches and continues processing other URLs.
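The recursive parsing and static asset filtering described above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual source: the function names, the extension list in STATIC_EXTENSIONS, and the injectable fetch callable are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical extension list; the Actor's real filter is not documented.
STATIC_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".pdf")


def local_name(tag: str) -> str:
    """Strip the XML namespace prefix, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def is_static_asset(url: str) -> bool:
    """True when the URL path ends with a known static-asset extension."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: download the raw sitemap document over HTTP(S)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def iter_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    is_index = local_name(root.tag) == "sitemapindex"
    for entry in root:  # <sitemap> or <url> entries
        for child in entry:
            if local_name(child.tag) != "loc" or not child.text:
                continue
            loc = child.text.strip()
            if is_index:
                # A sitemap index points at further sitemaps: recurse.
                yield from iter_page_urls(loc, fetch, seen)
            elif not is_static_asset(loc):
                yield loc
```

Passing the fetcher as a parameter keeps the parser testable without network access; calling iter_page_urls("https://example.com/sitemap.xml") with the default fetcher would download the sitemap directly.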

📋 Requirements

  • ⚙️ Input Configuration: Provide a sitemap_url in the actor's input configuration.
  • 🌍 Publicly Accessible Sitemap: Ensure the sitemap is publicly accessible.

🚀 Usage

  1. Input Configuration:

    • Provide a sitemap_url in the actor's input configuration.
    • Ensure the sitemap is publicly accessible.
  2. Running the Actor:

    • Start the actor in Apify.
    • The actor will fetch URLs from the sitemap, extract text content, and save the results to a dataset.

📤 Output

The actor saves the extracted text content to an Apify dataset. Each record in the dataset contains:

  • 🔗 url: The URL from which the text was extracted.
  • 📜 content: The extracted text content.

Example Output

[
  {
    "url": "https://example.com/page1",
    "content": "This is the text content extracted from page 1."
  },
  {
    "url": "https://example.com/page2",
    "content": "Page 2 contains different text content for analysis."
  },
  {
    "url": "https://example.com/blog/article",
    "content": "This article discusses important topics in detail."
  }
]
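As a rough illustration of how records of this shape could be produced, the sketch below extracts visible text from HTML and skips pages that fail to load, logging a warning for each. It is an assumption-laden example rather than the Actor's implementation: the SKIPPED_TAGS set, the TextExtractor class, and the build_records helper are hypothetical names, and the real boilerplate filter may work differently.

```python
import logging
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List

# Hypothetical set of tags treated as boilerplate; an assumption for this sketch.
SKIPPED_TAGS = {"head", "script", "style", "noscript", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            # Collapse runs of whitespace inside each text chunk.
            self.chunks.append(" ".join(data.split()))


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_records(
    urls: Iterable[str], fetch_html: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Build {"url", "content"} records, skipping pages that fail to fetch."""
    records = []
    for url in urls:
        try:
            html = fetch_html(url)
        except Exception as error:  # log a warning and continue with other URLs
            logging.warning("Failed to fetch %s: %s", url, error)
            continue
        records.append({"url": url, "content": extract_text(html)})
    return records
```

Here fetch_html is any callable that returns a page's HTML as a string; a failure on one URL does not stop the remaining URLs from being processed.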

💡 Example

To use this actor, provide the URL of the sitemap you want to process. For example, to extract text from every page listed in https://example.com/sitemap.xml, enter that URL as sitemap_url in the input configuration.
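The same run could also be started programmatically. The sketch below assumes the third-party apify-client Python package is installed, that the Actor ID is the slug shown on this page, and that sitemap_url is the only input field; build_run_input and run_actor are hypothetical helper names. Since the Actor is marked as under maintenance, a real run may fail.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Actor slug as shown on this page.
ACTOR_ID = "killerbees1982/website-content-scraper"


def build_run_input(sitemap_url: str) -> Dict[str, str]:
    """Validate the sitemap URL and wrap it in the Actor's input object."""
    parsed = urlparse(sitemap_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid http(s) URL: {sitemap_url!r}")
    return {"sitemap_url": sitemap_url}


def run_actor(token: str, sitemap_url: str, actor_id: str = ACTOR_ID) -> List[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so the helper above works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=build_run_input(sitemap_url))
    # Assumes the run is returned as a dict with a "defaultDatasetId" key,
    # as in the 1.x releases of apify-client.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as run_actor("<YOUR_APIFY_TOKEN>", "https://example.com/sitemap.xml") would then return a list of records with url and content keys, matching the output format described above.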

🎯 Use Cases

This actor is useful for extracting text data from websites for various purposes, such as:

  • 📊 Data Collection: Gathering text content for analysis or processing.
  • 🤖 AI Training: Feeding text data to AI models for training or fine-tuning.
  • 📝 Content Summarization: Extracting key information from large volumes of text.

🤝 Contributing

Contributions are welcome! Feel free to report issues or suggest improvements.