Website Content Scraper
Deprecated

Pricing

$10.00 / 1,000 results

Efficient scraper for extracting meaningful text from URLs found in a sitemap. Supports nested sitemaps, processes large datasets, and filters out boilerplate content. Ideal for SEO research, content analysis, and text mining, providing clean and structured text data for further processing.

Rating: 0.0 (0)

Developer: Dolp

Maintained by Community

Actor stats

  • Bookmarked: 2
  • Total users: 20
  • Monthly active users: 20
  • Last modified: 10 months ago

🌐 Sitemap Text Extractor

A web scraping actor that extracts text content from URLs listed in a sitemap.

πŸ“ Overview

This actor fetches all URLs from a given sitemap, including nested sitemaps, and extracts text content from each URL. The extracted text is then saved to an Apify dataset.
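
The traversal described above can be sketched roughly as follows. This is a minimal illustration built on Python's standard library, not the actor's actual source. The names `collect_urls`, `local_name`, the injected `fetch` callable, and the in-memory `SITEMAPS` fixture are all hypothetical. The sketch assumes the standard sitemap protocol, in which a `sitemapindex` document lists child sitemaps and a `urlset` document lists pages.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def local_name(tag: str) -> str:
    """Drop the XML namespace prefix, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return every page URL reachable from sitemap_url, following nested sitemaps.

    fetch is any callable that returns the XML text for a URL (assumed helper).
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    is_index = local_name(root.tag) == "sitemapindex"
    urls: List[str] = []
    for entry in root:
        # Each <sitemap> or <url> entry carries its address in a <loc> child.
        loc = next(
            (child.text.strip() for child in entry
             if local_name(child.tag) == "loc" and child.text),
            None,
        )
        if loc is None:
            continue
        if is_index:
            urls.extend(collect_urls(loc, fetch, seen))  # recurse into child sitemap
        else:
            urls.append(loc)
    return urls


# Hypothetical in-memory fixture standing in for real HTTP responses.
SITEMAPS = {
    "https://example.com/sitemap.xml": (
        f'<sitemapindex xmlns="{NS}">'
        "<sitemap><loc>https://example.com/blog.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/blog.xml": (
        f'<urlset xmlns="{NS}">'
        "<url><loc>https://example.com/blog/article</loc></url>"
        "</urlset>"
    ),
    "https://example.com/pages.xml": (
        f'<urlset xmlns="{NS}">'
        "<url><loc>https://example.com/page1</loc></url>"
        "<url><loc>https://example.com/page2</loc></url>"
        "</urlset>"
    ),
}

print(collect_urls("https://example.com/sitemap.xml", SITEMAPS.__getitem__))
```

Passing `fetch` in as a parameter keeps the traversal independent of any HTTP library. In real use it could wrap whatever client fetches the sitemap.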

✨ Features

  • 🔄 Recursive Sitemap Parsing: Handles nested sitemaps to fetch all URLs.
  • 📄 Text Extraction: Extracts meaningful text content from web pages.
  • 🚫 Static Asset Filtering: Excludes URLs pointing to static assets like images and PDFs.
  • ⚠️ Error Handling: Logs warnings for failed page fetches and continues processing other URLs.
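
The static asset filtering and error handling listed above might look something like the sketch below. The extension list, the logger name, and the helpers `is_static_asset` and `process_urls` are assumptions made for illustration. The actor's real filter rules are not documented here.

```python
import logging
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse

logger = logging.getLogger("sitemap_text_extractor")  # hypothetical logger name

# Assumed extension list; the actor's real list is not documented.
STATIC_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".ico",
    ".pdf", ".css", ".js", ".zip", ".mp4",
)


def is_static_asset(url: str) -> bool:
    """True when the URL path (ignoring any query string) ends in a static extension."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)


def process_urls(
    urls: Iterable[str],
    fetch_text: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build {url, content} records, skipping static assets and failed fetches."""
    records: List[Dict[str, str]] = []
    for url in urls:
        if is_static_asset(url):
            continue
        try:
            content = fetch_text(url)
        except Exception as exc:  # log a warning and carry on with the next URL
            logger.warning("Failed to fetch %s: %s", url, exc)
            continue
        records.append({"url": url, "content": content})
    return records
```

Catching the exception per URL, rather than around the whole loop, is what lets a single failed page leave the rest of the run unaffected.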

📋 Requirements

  • βš™οΈ Input Configuration: Provide a sitemap_url in the actor's input configuration.
  • 🌍 Publicly Accessible Sitemap: Ensure the sitemap is publicly accessible.

🚀 Usage

  1. Input Configuration:

    • Set sitemap_url in the actor's input configuration to the full URL of the sitemap.
    • Confirm that the sitemap is publicly accessible before starting the run.
  2. Running the Actor:

    • Start the actor in Apify.
    • The actor will fetch URLs from the sitemap, extract text content, and save the results to a dataset.
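
Besides the Apify Console, the steps above could also be scripted with the `apify-client` Python package. The sketch below is only an illustration under several assumptions: `ACTOR_ID` is a placeholder that must be replaced with the actor's real ID, `build_run_input` and `run_actor` are hypothetical helper names, and the actor must still be runnable from your account despite its deprecated status.

```python
from typing import Dict, Iterator

# Placeholder only: substitute the actor's real ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(sitemap_url: str) -> Dict[str, str]:
    """Build the actor input; sitemap_url is the only field this page documents."""
    if not sitemap_url.startswith(("http://", "https://")):
        raise ValueError(f"sitemap_url must be an absolute HTTP(S) URL: {sitemap_url!r}")
    return {"sitemap_url": sitemap_url}


def run_actor(sitemap_url: str, token: str) -> Iterator[dict]:
    """Start a run, wait for it to finish, and yield the dataset records."""
    # Imported here so the rest of the sketch works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(sitemap_url))
    if run is None:
        raise RuntimeError("The actor run did not return a result")
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()
```

Calling `run_actor("https://example.com/sitemap.xml", token)` with a valid API token would then yield one record per processed page.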

📤 Output

The actor saves the extracted text content to an Apify dataset. Each record in the dataset contains:

  • 🔗 url: The URL from which the text was extracted.
  • 📜 content: The extracted text content.
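
As a rough idea of how a `content` value could be derived from a page, the sketch below strips markup with Python's built-in `html.parser`. The actor's real extraction logic is not documented, so the class `TextExtractor`, the helper `html_to_record`, and the `SKIPPED_TAGS` and `INLINE_TAGS` sets are assumptions chosen for illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List

# Assumed boilerplate containers; everything inside them is ignored.
SKIPPED_TAGS = {"head", "script", "style", "noscript", "nav", "header", "footer", "aside"}
# Assumed inline tags; any other tag is treated as a word boundary.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span", "code", "small", "sub", "sup"}


class TextExtractor(HTMLParser):
    """Collect visible text while skipping boilerplate containers."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        if tag not in INLINE_TAGS:
            self._chunks.append(" ")  # keep adjacent blocks from running together

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag not in INLINE_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_record(url: str, html: str) -> Dict[str, str]:
    """Produce one dataset-style record with the two documented fields."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "content": parser.text()}
```

A tag-based filter like this is deliberately crude. Pages that mark boilerplate with `div` classes rather than semantic tags would need different rules.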

Example Output

[
  {
    "url": "https://example.com/page1",
    "content": "This is the text content extracted from page 1."
  },
  {
    "url": "https://example.com/page2",
    "content": "Page 2 contains different text content for analysis."
  },
  {
    "url": "https://example.com/blog/article",
    "content": "This article discusses important topics in detail."
  }
]

💡 Example

To use this actor, provide the URL of the sitemap you want to process. For example, to extract text from every page listed in the sitemap at https://example.com/sitemap.xml, set sitemap_url to that URL in the input configuration.

🎯 Use Cases

This actor is useful for extracting text data from websites for various purposes, such as:

  • 📊 Data Collection: Gathering text content for analysis or processing.
  • 🤖 AI Training: Feeding text data to AI models for training or fine-tuning.
  • 📝 Content Summarization: Extracting key information from large volumes of text.

🤝 Contributing

Contributions are welcome! Feel free to report issues or suggest improvements.