Website Content Scraper
Pricing
$10.00 / 1,000 results
Efficient scraper for extracting meaningful text from URLs found in a sitemap. Supports nested sitemaps, processes large datasets, and filters out boilerplate content. Ideal for SEO research, content analysis, and text mining, providing clean and structured text data for further processing.
Developer
Dolp
Actor stats
- Bookmarked: 2
- Total users: 20
- Monthly active users: 20
- Last modified: a year ago
Sitemap Text Extractor
A web scraping actor that extracts text content from URLs listed in a sitemap.
Overview
This actor fetches all URLs from a given sitemap, including nested sitemaps, and extracts text content from each URL. The extracted text is then saved to an Apify dataset.
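The first step described above, collecting page URLs from a sitemap that may itself point to other sitemaps, could be sketched in Python roughly as follows. This is an illustration under assumptions, not the actor's actual source: the names iter_sitemap_urls, fetch_xml, and local_name are hypothetical, and the code relies only on the standard sitemap protocol elements (sitemapindex, urlset, loc).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional, Set
from urllib.request import urlopen


def fetch_xml(url: str) -> bytes:
    """Download a sitemap document (hypothetical helper, no retries)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{namespace}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def iter_sitemap_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes] = fetch_xml,
    seen: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield page URLs from a sitemap, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        return  # already visited; avoids loops between sitemaps
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    is_index = local_name(root.tag) == "sitemapindex"

    # Entries are <sitemap> elements in an index and <url> elements in a urlset.
    for entry in root:
        for child in entry:
            if local_name(child.tag) != "loc" or not child.text:
                continue
            loc = child.text.strip()
            if is_index:
                # A nested sitemap: recurse into it.
                yield from iter_sitemap_urls(loc, fetch, seen)
            else:
                yield loc
```

Passing a custom fetch function makes the traversal easy to try without network access, and the seen set guards against sitemaps that reference each other.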
Features
- Recursive Sitemap Parsing: Handles nested sitemaps to fetch all URLs.
- Text Extraction: Extracts meaningful text content from web pages.
- Static Asset Filtering: Excludes URLs pointing to static assets like images and PDFs.
- Error Handling: Logs warnings for failed page fetches and continues processing other URLs.
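Two of the features above, static asset filtering and text extraction, can be illustrated with a minimal standard-library sketch. The extension list, the set of skipped tags, and the names is_static_asset, TextExtractor, and extract_text are all assumptions made for this example; the actor's real filtering rules and extraction method are not documented here and may differ.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed list of static-asset extensions; the actor's real list is unknown.
STATIC_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".ico",
    ".css", ".js", ".pdf", ".zip", ".mp4",
}


def is_static_asset(url: str) -> bool:
    """Return True when the URL path ends in a known static-asset extension."""
    path = urlparse(url).path  # ignores the query string and fragment
    return PurePosixPath(path).suffix.lower() in STATIC_EXTENSIONS


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold boilerplate."""

    # Assumed boilerplate containers for this sketch.
    SKIPPED_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the whitespace-normalized visible text of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A tag-based skip list like this is a deliberately simple stand-in for boilerplate removal; dedicated content-extraction libraries apply more elaborate heuristics.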
Requirements
- Input Configuration: Provide a sitemap_url in the actor's input configuration.
- Publicly Accessible Sitemap: Ensure the sitemap is publicly accessible.
Usage
- Input Configuration:
  - Provide a sitemap_url in the actor's input configuration.
  - Ensure the sitemap is publicly accessible.
- Running the Actor:
  - Start the actor in Apify.
  - The actor will fetch URLs from the sitemap, extract text content, and save the results to a dataset.
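The run described above (fetch each URL, extract its text, save a record, and keep going when a page fails) might look roughly like the loop below. It is a sketch under assumptions: collect_records and the fetch_text callback are hypothetical names, and a plain list stands in for the Apify dataset.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("sitemap_text_extractor")


def collect_records(
    urls: Iterable[str],
    fetch_text: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build {"url", "content"} records, skipping pages that fail to load."""
    records = []
    for url in urls:
        try:
            content = fetch_text(url)
        except Exception as error:  # broad on purpose: one bad page must not stop the run
            logger.warning("Failed to fetch %s: %s", url, error)
            continue
        records.append({"url": url, "content": content})
    return records
```

Catching the exception inside the loop is what lets a single unreachable page produce a warning instead of ending the whole run.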
Output
The actor saves the extracted text content to an Apify dataset. Each record in the dataset contains:
- url: The URL from which the text was extracted.
- content: The extracted text content.
Example Output
[
  {
    "url": "https://example.com/page1",
    "content": "This is the text content extracted from page 1."
  },
  {
    "url": "https://example.com/page2",
    "content": "Page 2 contains different text content for analysis."
  },
  {
    "url": "https://example.com/blog/article",
    "content": "This article discusses important topics in detail."
  }
]
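Records in this shape are straightforward to post-process. As a small illustration, the Python sketch below parses a JSON export of the dataset and counts the words in each record; the function name summarize_records is hypothetical, and the code assumes only the url and content fields described above.

```python
import json
from typing import Dict, List, Union


def summarize_records(raw_json: str) -> List[Dict[str, Union[str, int]]]:
    """Return the URL and whitespace-delimited word count of each record."""
    records = json.loads(raw_json)
    return [
        {"url": record["url"], "words": len(record["content"].split())}
        for record in records
    ]


sample = """[
  {"url": "https://example.com/page1",
   "content": "This is the text content extracted from page 1."}
]"""

print(summarize_records(sample))
```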
Example
To use this actor, simply provide the URL of the sitemap you want to process. For example, if you want to extract text from all pages listed in a sitemap located at https://example.com/sitemap.xml, you would enter this URL in the input configuration.
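Expressed as data, that input is a single field. The Python sketch below builds it and rejects values that are not absolute HTTP(S) URLs; the helper name build_run_input and the validation rule are assumptions for illustration, and only the sitemap_url field comes from this README.

```python
import json
from urllib.parse import urlparse


def build_run_input(sitemap_url: str) -> dict:
    """Return the actor input, rejecting values that are not absolute HTTP(S) URLs."""
    parsed = urlparse(sitemap_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an absolute HTTP(S) URL: {sitemap_url!r}")
    return {"sitemap_url": sitemap_url}


# The resulting JSON is what would go into the actor's input configuration.
print(json.dumps(build_run_input("https://example.com/sitemap.xml"), indent=2))
```

The same dictionary could presumably be passed as the run input when starting the actor programmatically rather than through the Apify console.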
Use Cases
This actor is useful for extracting text data from websites for various purposes, such as:
- Data Collection: Gathering text content for analysis or processing.
- AI Training: Feeding text data to AI models for training or fine-tuning.
- Content Summarization: Extracting key information from large volumes of text.
Contributing
Contributions are welcome! Feel free to report issues or suggest improvements.