
Website Content Scraper
Deprecated
Pricing: $10.00 / 1,000 results

Efficient scraper for extracting meaningful text from URLs found in a sitemap. Supports nested sitemaps, processes large datasets, and filters out boilerplate content. Ideal for SEO research, content analysis, and text mining, providing clean and structured text data for further processing.
Total users: 20
Monthly users: 20
Runs succeeded: >99%
Last modified: 5 months ago
Sitemap Text Extractor
A web scraping actor that extracts text content from URLs listed in a sitemap.
Overview
This actor fetches all URLs from a given sitemap, including nested sitemaps, and extracts text content from each URL. The extracted text is then saved to an Apify dataset.
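The nested-sitemap traversal described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actor's actual code. The function name `collect_urls` and the injected `fetch` helper (which returns the XML body for a URL) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return every page URL reachable from sitemap_url, following nested sitemaps.

    `fetch` is a hypothetical helper that returns the XML text for a URL.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemap indexes that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    urls: List[str] = []
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # A sitemap index lists child sitemaps; recurse into each one.
        for loc in root.iter(f"{SITEMAP_NS}loc"):
            child = (loc.text or "").strip()
            if child:
                urls.extend(collect_urls(child, fetch, seen))
    else:
        # A regular <urlset> lists page URLs directly.
        for loc in root.findall(f"{SITEMAP_NS}url/{SITEMAP_NS}loc"):
            page = (loc.text or "").strip()
            if page:
                urls.append(page)
    return urls
```

Passing `fetch` in as a parameter keeps the traversal testable without network access. In a real run it could wrap any HTTP client that returns the response body as text.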
Features
- Recursive Sitemap Parsing: Handles nested sitemaps to fetch all URLs.
- Text Extraction: Extracts meaningful text content from web pages.
- Static Asset Filtering: Excludes URLs pointing to static assets like images and PDFs.
- Error Handling: Logs warnings for failed page fetches and continues processing other URLs.
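To make the filtering and extraction features more concrete, here is a rough standard-library sketch. The extension list, the set of skipped tags, and the names `is_static_asset` and `extract_text` are assumptions for illustration. The actor's real rules are not documented here and may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed extension list; the actor's real list is not documented.
STATIC_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".pdf", ".css", ".js", ".zip")
# Assumed boilerplate containers whose text is ignored.
SKIPPED_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


def is_static_asset(url: str) -> bool:
    """True when the URL path ends in a known static-asset extension."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)


class _TextCollector(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the page's text with scripts, styles, and navigation removed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Checking the parsed path rather than the raw URL means a query string such as `?v=2` does not hide an image extension.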
Requirements
- Input Configuration: Provide a sitemap_url in the actor's input configuration.
- Publicly Accessible Sitemap: Ensure the sitemap is publicly accessible.
Usage
- Input Configuration:
  - Provide a sitemap_url in the actor's input configuration.
  - Ensure the sitemap is publicly accessible.
- Running the Actor:
  - Start the actor in Apify.
  - The actor will fetch URLs from the sitemap, extract text content, and save the results to a dataset.
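The run loop described above (fetch each URL, extract text, log a warning on failure, and carry on) might look roughly like this. It is a sketch under stated assumptions: `process_urls`, `fetch_page`, and `extract_text` are hypothetical names, and the records are returned as a list rather than written to an Apify dataset.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("sitemap_text_extractor")


def process_urls(
    urls: Iterable[str],
    fetch_page: Callable[[str], str],
    extract_text: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Fetch each URL and build {"url", "content"} records, skipping failed fetches."""
    records: List[Dict[str, str]] = []
    for url in urls:
        try:
            html = fetch_page(url)
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            logger.warning("Failed to fetch %s: %s", url, exc)
            continue
        records.append({"url": url, "content": extract_text(html)})
    return records
```

Catching the error per URL rather than around the whole loop is what lets the remaining pages be processed after a single failure.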
Output
The actor saves the extracted text content to an Apify dataset. Each record in the dataset contains:
- url: The URL from which the text was extracted.
- content: The extracted text content.
Example Output
[
  {
    "url": "https://example.com/page1",
    "content": "This is the text content extracted from page 1."
  },
  {
    "url": "https://example.com/page2",
    "content": "Page 2 contains different text content for analysis."
  },
  {
    "url": "https://example.com/blog/article",
    "content": "This article discusses important topics in detail."
  }
]
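Once the dataset has been exported as JSON in the shape shown above, downstream processing can be as simple as the following sketch. The `summarize` function is a hypothetical helper that computes a whitespace-based word count per page. It assumes every record carries the `url` and `content` fields.

```python
import json
from typing import Dict, List


def summarize(records_json: str) -> List[Dict[str, object]]:
    """Return a word count per record, skipping records with empty content."""
    records = json.loads(records_json)
    return [
        {"url": r["url"], "words": len(r["content"].split())}
        for r in records
        if r.get("content")
    ]
```

Records with empty content are dropped here so that pages that yielded no text do not skew later statistics.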
Example
To use this actor, provide the URL of the sitemap you want to process. For example, to extract text from all pages listed in the sitemap at https://example.com/sitemap.xml, enter that URL as sitemap_url in the input configuration.
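As a small illustration, the input for that example could be assembled and sanity-checked before starting a run. Only the `sitemap_url` field comes from the description above. The `build_input` helper and its validation rule are assumptions for this sketch.

```python
import json
from urllib.parse import urlparse


def build_input(sitemap_url: str) -> dict:
    """Return the actor input, rejecting values that are not absolute http(s) URLs."""
    parsed = urlparse(sitemap_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"sitemap_url must be an absolute http(s) URL, got {sitemap_url!r}")
    return {"sitemap_url": sitemap_url}


# Print the input as JSON, ready to paste into the actor's input configuration.
print(json.dumps(build_input("https://example.com/sitemap.xml"), indent=2))
```

The check only validates the URL's shape. It cannot confirm that the sitemap is publicly accessible.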
Use Cases
This actor is useful for extracting text data from websites for various purposes, such as:
- Data Collection: Gathering text content for analysis or processing.
- AI Training: Feeding text data to AI models for training or fine-tuning.
- Content Summarization: Extracting key information from large volumes of text.
Contributing
Contributions are welcome! Feel free to report issues or suggest improvements.