
Website Content Scraper
Efficient scraper for extracting meaningful text from URLs found in a sitemap. Supports nested sitemaps, processes large datasets, and filters out boilerplate content. Ideal for SEO research, content analysis, and text mining, providing clean and structured text data for further processing.
🌐 Sitemap Text Extractor
A web scraping actor that extracts text content from URLs listed in a sitemap.
📝 Overview
This actor fetches all URLs from a given sitemap, including nested sitemaps, and extracts text content from each URL. The extracted text is then saved to an Apify dataset.
✨ Features
- 🔄 Recursive Sitemap Parsing: Handles nested sitemaps to fetch all URLs.
- 📄 Text Extraction: Extracts meaningful text content from web pages.
- 🚫 Static Asset Filtering: Excludes URLs pointing to static assets like images and PDFs.
- ⚠️ Error Handling: Logs warnings for failed page fetches and continues processing other URLs.
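The recursive parsing and static asset filtering described above can be sketched with the Python standard library. This is a minimal illustration, not the actor's actual implementation: the function names, the `fetch` parameter, and the `STATIC_EXTENSIONS` list are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set
from urllib.parse import urlparse
from urllib.request import urlopen

# Assumed extension list; the actor's real filter list is not documented.
STATIC_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".pdf")


def fetch_bytes(url: str) -> bytes:
    """Download a URL with a plain HTTP GET and return the raw body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def local_name(tag: str) -> str:
    """Drop the XML namespace prefix, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def is_static_asset(url: str) -> bool:
    """Return True if the URL path ends with a known static file extension."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    # Each <sitemap> or <url> entry carries its address in a direct <loc> child.
    locs = [
        child.text.strip()
        for entry in root
        for child in entry
        if local_name(child.tag) == "loc" and child.text
    ]

    if local_name(root.tag) == "sitemapindex":
        # A sitemap index lists further sitemaps, so recurse into each one.
        urls: List[str] = []
        for nested in locs:
            urls.extend(collect_urls(nested, fetch, seen))
        return urls

    # A regular <urlset>: keep page URLs and drop static assets.
    return [url for url in locs if not is_static_asset(url)]
```

Calling `collect_urls("https://example.com/sitemap.xml")` would return the page URLs from that sitemap and from any sitemaps it references. Passing a custom `fetch` function makes the sketch easy to try against in-memory XML without network access.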
📋 Requirements
- ⚙️ Input Configuration: Provide a `sitemap_url` in the actor's input configuration.
- 🌍 Publicly Accessible Sitemap: Ensure the sitemap is publicly accessible.
🚀 Usage
- Input Configuration:
  - Provide a `sitemap_url` in the actor's input configuration.
  - Ensure the sitemap is publicly accessible.
- Running the Actor:
  - Start the actor in Apify.
  - The actor will fetch URLs from the sitemap, extract text content, and save the results to a dataset.
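The extraction step of that run can be approximated as follows. This is a hedged sketch using only the standard library: the `SKIPPED_TAGS` set, the `TextExtractor` class, and the `scrape` function are hypothetical names chosen for illustration, and the actor's real boilerplate filter may work differently. In an actual run the records would be written to the Apify dataset rather than returned as a list.

```python
import logging
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen

logger = logging.getLogger("sitemap_text_extractor")

# Assumed set of tags whose content counts as boilerplate or non-text.
SKIPPED_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring content inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data)

    def text(self) -> str:
        # Join the pieces and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_html(url: str) -> str:
    """Download a page and decode it using the declared charset (UTF-8 fallback)."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape(urls: Iterable[str], fetch: Callable[[str], str] = fetch_html) -> List[Dict[str, str]]:
    """Build one {url, content} record per page; log failures and keep going."""
    records: List[Dict[str, str]] = []
    for url in urls:
        try:
            html = fetch(url)
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            logger.warning("Failed to fetch %s: %s", url, exc)
            continue
        records.append({"url": url, "content": extract_text(html)})
    return records
```

The tag-based filter is deliberately simple. Pages with unclosed `nav` or `footer` elements could lose more text than intended, so a production extractor would likely need a more robust heuristic.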
📤 Output
The actor saves the extracted text content to an Apify dataset. Each record in the dataset contains:
- 🔗 url: The URL from which the text was extracted.
- 📜 content: The extracted text content.
Example Output
    [
      {
        "url": "https://example.com/page1",
        "content": "This is the text content extracted from page 1."
      },
      {
        "url": "https://example.com/page2",
        "content": "Page 2 contains different text content for analysis."
      },
      {
        "url": "https://example.com/blog/article",
        "content": "This article discusses important topics in detail."
      }
    ]
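For downstream processing, records of this shape can be loaded and checked with a few lines of Python. The sketch below assumes the dataset has been exported as a JSON array like the example output; `load_records` and `word_counts` are hypothetical helper names, not part of the actor.

```python
import json
from typing import Dict, List


def load_records(raw_json: str) -> List[Dict[str, str]]:
    """Parse an exported dataset and check each record has 'url' and 'content'."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of records")
    for record in data:
        if not isinstance(record, dict):
            raise ValueError(f"Record is not an object: {record!r}")
        if not isinstance(record.get("url"), str) or not isinstance(record.get("content"), str):
            raise ValueError(f"Record is missing 'url' or 'content': {record!r}")
    return data


def word_counts(records: List[Dict[str, str]]) -> Dict[str, int]:
    """Map each URL to its whitespace-separated word count, skipping empty pages."""
    return {
        record["url"]: len(record["content"].split())
        for record in records
        if record["content"].strip()
    }
```

A quick word count per URL is a cheap way to spot pages where extraction returned little or nothing before passing the text on to analysis.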
💡 Example
To use this actor, provide the URL of the sitemap you want to process. For example, to extract text from all pages listed in the sitemap at https://example.com/sitemap.xml, enter that URL as the `sitemap_url` in the input configuration.
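As an illustration of the input shape, the sketch below shows a Python dictionary holding `sitemap_url` and a small check that could run before a scrape starts. The `validate_input` helper is hypothetical and only reflects the single input field this README documents; the actor's own validation is not described here.

```python
from typing import Any, Mapping, Optional
from urllib.parse import urlparse


def validate_input(actor_input: Optional[Mapping[str, Any]]) -> str:
    """Return the sitemap URL from the input, or raise ValueError if it is unusable."""
    url = (actor_input or {}).get("sitemap_url")
    if not isinstance(url, str) or not url.strip():
        raise ValueError("Input must contain a non-empty 'sitemap_url' string")
    url = url.strip()
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"'sitemap_url' is not an absolute HTTP(S) URL: {url!r}")
    return url


# Input matching the example in this section.
example_input = {"sitemap_url": "https://example.com/sitemap.xml"}
sitemap_url = validate_input(example_input)
```

The check only confirms that the value looks like an absolute HTTP(S) URL. Whether the sitemap is actually publicly accessible can only be confirmed by fetching it.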
🎯 Use Cases
This actor is useful for extracting text data from websites for various purposes, such as:
- 📊 Data Collection: Gathering text content for analysis or processing.
- 🤖 AI Training: Feeding text data to AI models for training or fine-tuning.
- 📝 Content Summarization: Extracting key information from large volumes of text.
🤝 Contributing
Contributions are welcome! Feel free to report issues or suggest improvements.