Website Content Scraper
Pricing
$10.00 / 1,000 results
Efficient scraper for extracting meaningful text from URLs found in a sitemap. Supports nested sitemaps, processes large datasets, and filters out boilerplate content. Ideal for SEO research, content analysis, and text mining, providing clean and structured text data for further processing.
Developer: Dolp
Sitemap Text Extractor
A web scraping actor that extracts text content from URLs listed in a sitemap.
Overview
This actor fetches all URLs from a given sitemap, including nested sitemaps, and extracts text content from each URL. The extracted text is then saved to an Apify dataset.
Features
- Recursive Sitemap Parsing: Handles nested sitemaps to fetch all URLs.
- Text Extraction: Extracts meaningful text content from web pages.
- Static Asset Filtering: Excludes URLs pointing to static assets like images and PDFs.
- Error Handling: Logs warnings for failed page fetches and continues processing other URLs.
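The recursive parsing and static asset filtering described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's actual source: the function names, the `STATIC_EXTENSIONS` list, and the injectable `fetch` parameter are all hypothetical, and gzip-compressed sitemaps are not handled.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set
from urllib.parse import urlparse
from urllib.request import urlopen

# Hypothetical extension list; the actor's real filter is not documented.
STATIC_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".css", ".js", ".pdf")


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: download a sitemap over HTTP(S)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def is_static_asset(url: str) -> bool:
    """True when the URL path ends in a known static-asset extension."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def iter_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, descending into nested sitemaps."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    is_index = local_name(root.tag) == "sitemapindex"

    # Each direct child is a <sitemap> (in an index) or a <url> (in a urlset).
    for entry in root:
        loc = next(
            (child.text.strip() for child in entry
             if local_name(child.tag) == "loc" and child.text),
            None,
        )
        if loc is None:
            continue
        if is_index:
            yield from iter_page_urls(loc, fetch, seen)
        elif not is_static_asset(loc):
            yield loc
```

Passing `fetch` as a parameter is a design choice for this sketch only: it lets the traversal be exercised against in-memory XML without network access.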
Requirements
- Input Configuration: Provide a sitemap_url in the actor's input configuration.
- Publicly Accessible Sitemap: Ensure the sitemap is publicly accessible.
Usage
- Input Configuration:
  - Provide a sitemap_url in the actor's input configuration.
  - Ensure the sitemap is publicly accessible.
- Running the Actor:
  - Start the actor in Apify.
  - The actor will fetch URLs from the sitemap, extract text content, and save the results to a dataset.
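The extract-and-save step of a run might look something like the sketch below. It is an assumption-laden illustration: the actor's real extraction logic is not documented here, so `SKIPPED_TAGS`, `TextExtractor`, `extract_text`, and `scrape` are hypothetical names, and the result is returned as a plain list rather than written to an Apify dataset.

```python
import logging
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen

# Assumed boilerplate containers; the actor's actual rules may differ.
SKIPPED_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect text nodes that sit outside the skipped tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data)


def extract_text(html: str) -> str:
    """Return the page text with text nodes joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def fetch_html(url: str) -> str:
    """Default fetcher: download a page and decode it to text."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape(urls: Iterable[str], fetch: Callable[[str], str] = fetch_html) -> List[Dict[str, str]]:
    """Build one {"url", "content"} record per page that could be fetched."""
    records = []
    for url in urls:
        try:
            html = fetch(url)
        except Exception as error:
            # Log a warning and carry on with the remaining URLs.
            logging.warning("Failed to fetch %s: %s", url, error)
            continue
        records.append({"url": url, "content": extract_text(html)})
    return records
```

A failed fetch only produces a warning, so one unreachable page does not stop the rest of the run, which matches the error handling described in the feature list.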
Output
The actor saves the extracted text content to an Apify dataset. Each record in the dataset contains:
- url: The URL from which the text was extracted.
- content: The extracted text content.
Example Output
[
  {
    "url": "https://example.com/page1",
    "content": "This is the text content extracted from page 1."
  },
  {
    "url": "https://example.com/page2",
    "content": "Page 2 contains different text content for analysis."
  },
  {
    "url": "https://example.com/blog/article",
    "content": "This article discusses important topics in detail."
  }
]
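Once the dataset has been exported as JSON, the records can be processed with a few lines of Python. The sketch below assumes the export has exactly the shape of the example output; `load_records` and `word_counts` are hypothetical helper names, not part of the actor.

```python
import json
from typing import Dict, List

# The example output from above, as it might appear in a JSON export.
EXPORTED = """
[
  {"url": "https://example.com/page1",
   "content": "This is the text content extracted from page 1."},
  {"url": "https://example.com/page2",
   "content": "Page 2 contains different text content for analysis."},
  {"url": "https://example.com/blog/article",
   "content": "This article discusses important topics in detail."}
]
"""


def load_records(json_text: str) -> List[Dict[str, str]]:
    """Parse an export and keep only records that have both expected fields."""
    return [
        record for record in json.loads(json_text)
        if isinstance(record, dict) and "url" in record and "content" in record
    ]


def word_counts(records: List[Dict[str, str]]) -> Dict[str, int]:
    """Map each URL to the number of whitespace-separated words in its content."""
    return {record["url"]: len(record["content"].split()) for record in records}


if __name__ == "__main__":
    for url, count in word_counts(load_records(EXPORTED)).items():
        print(f"{url}: {count} words")
```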
Example
To use this actor, simply provide the URL of the sitemap you want to process. For example, if you want to extract text from all pages listed in a sitemap located at https://example.com/sitemap.xml, you would enter this URL in the input configuration.
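For illustration, the input for that example could be built and checked like this. Only the `sitemap_url` field comes from this page; the `build_run_input` helper and its validation rule are assumptions, and the actor may accept other fields not shown here.

```python
import json
from typing import Dict
from urllib.parse import urlparse


def build_run_input(sitemap_url: str) -> Dict[str, str]:
    """Return the actor input, rejecting values that are not absolute HTTP(S) URLs."""
    parsed = urlparse(sitemap_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute http(s) URL: {sitemap_url!r}")
    return {"sitemap_url": sitemap_url}


if __name__ == "__main__":
    # Prints the JSON that would be pasted into the input configuration.
    print(json.dumps(build_run_input("https://example.com/sitemap.xml"), indent=2))
```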
Use Cases
This actor is useful for extracting text data from websites for various purposes, such as:
- Data Collection: Gathering text content for analysis or processing.
- AI Training: Feeding text data to AI models for training or fine-tuning.
- Content Summarization: Extracting key information from large volumes of text.
Contributing
Contributions are welcome! Feel free to report issues or suggest improvements.