Advanced Website Crawling Actor
Pricing
$15.00 / 1,000 results
A fast, reliable scraper that extracts clean HTML, Markdown, and text content from websites. It returns structured data and supports dynamic rendering, recursive sitemap discovery, SSL bypass, and API integration for your applications.
Developer
Techforce
Advanced Website Crawling & Content Extraction
Advanced Website Crawling & Content Extraction is an Apify Actor designed to crawl complex, dynamic, and bot-protected websites. It uses a hybrid architecture (static fetching with a Playwright browser fallback) to extract clean HTML, Markdown, and text.
It automatically handles sitemaps (recursive and gzipped), SSL certificate errors, and deduplication, which makes it well suited to RAG (Retrieval-Augmented Generation) pipelines and data archiving.
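To illustrate what recursive, gzip-aware sitemap handling involves, here is a minimal sketch using only the Python standard library. It is not the Actor's actual implementation: `collect_sitemap_urls` and the injectable `fetch` callable are hypothetical names chosen for this example.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Set

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def _local(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def collect_sitemap_urls(
    url: str,
    fetch: Callable[[str], bytes],
    seen: Optional[Set[str]] = None,
) -> Set[str]:
    """Return all page URLs reachable from a sitemap or sitemap index.

    `fetch` is a hypothetical callable that returns the raw bytes for a URL.
    """
    seen = set() if seen is None else seen
    if url in seen:  # guard against sitemaps that reference each other
        return set()
    seen.add(url)

    raw = fetch(url)
    if raw[:2] == GZIP_MAGIC:  # transparently handle .xml.gz sitemaps
        raw = gzip.decompress(raw)
    root = ET.fromstring(raw)

    # <loc> elements sit directly under each <url> or <sitemap> entry.
    locs = [
        child.text.strip()
        for entry in root
        for child in entry
        if _local(child.tag) == "loc" and child.text
    ]
    if _local(root.tag) == "sitemapindex":
        pages: Set[str] = set()
        for nested in locs:  # recurse into each nested sitemap
            pages |= collect_sitemap_urls(nested, fetch, seen)
        return pages
    return set(locs)
```

In a real crawler, `fetch` could wrap an HTTP client call. Returning a set also removes duplicate URLs that appear in more than one sitemap.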
⭐ Features
- Hybrid Crawling: Combines fast static fetching with a robust Playwright browser fallback for dynamic content.
- Universal Content: Extracts text from HTML files automatically.
- Smart Sitemap Discovery: Recursively parses `sitemap.xml`, `sitemap_index.xml`, and Gzip-compressed sitemaps.
- Strict & Relaxed Scoping: Intelligent domain filtering (handles `www.` vs non-`www` automatically).
- Output Formats: Save data as Clean Text or Markdown.
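The scoping feature can be pictured with a small sketch. It assumes that "strict" means an exact host match and "relaxed" means the `www.` prefix is ignored. The Actor's real rules are not documented here, and `in_scope` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def _host(url: str, strict: bool) -> str:
    """Lower-cased hostname; in relaxed mode a leading 'www.' is dropped."""
    host = (urlsplit(url).hostname or "").lower()
    if not strict and host.startswith("www."):
        host = host[4:]
    return host


def in_scope(start_url: str, candidate: str, strict: bool = False) -> bool:
    """Return True if `candidate` belongs to the same site as `start_url`."""
    if urlsplit(candidate).scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, tel:, and similar links
    return _host(start_url, strict) == _host(candidate, strict)
```

Under these assumptions, `in_scope("https://www.example.com/", "https://example.com/about")` is accepted in relaxed mode and rejected in strict mode, while other subdomains such as `blog.example.com` are rejected in both.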
📝 Example Use Cases
- LLM Training Data: Scrape entire documentation sites or knowledge bases for RAG pipelines.
- Competitor Analysis: Extract product details, pricing, and services from competitor websites.
- Archiving: Download and index reports from government or corporate portals.
- SEO Auditing: Crawl sites to check for broken links, metadata, and content structure.
🚀 How to Use
- Open the Actor on Apify.
- Enter the Start URL (`<<YOUR WEBSITE URL FOR SCRAPING>>`).
- Set Max Pages (default: 20).
- (Optional) Configure a proxy or adjust the output format.
- Click Run.
- Download your dataset as JSON or CSV, or retrieve it through the API.
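The same steps can be driven programmatically. The sketch below builds a request for the Apify API's "run Actor synchronously and get dataset items" endpoint using only the standard library. The Actor ID and token are placeholders you must supply, and `build_run_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an Actor and returns its dataset items.

    `actor_id` uses the 'username~actor-name' form; the value is a placeholder
    here because this Actor's exact ID is not stated in this document.
    """
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Sending the request (needs a real token and Actor ID, so it is left commented out):
# req = build_run_request("<username>~<actor-name>", "<APIFY_TOKEN>",
#                         {"startUrl": "https://example.com", "maxPages": 20})
# with urllib.request.urlopen(req) as resp:
#     items = json.load(resp)
```

The synchronous endpoint suits short crawls. For large `maxPages` values, starting the run asynchronously and reading the dataset afterwards is likely the safer pattern.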
🧩 Input Configuration
| Field | Type | Required | Description |
|---|---|---|---|
| startUrl | String | ✔️ Yes | The URL where the crawler should start. |
| maxPages | Number | No | Maximum number of pages to crawl. |
| proxy | Object | No | Apify Proxy configuration (recommended). |
Example Input
    {
      "startUrl": "https://techforceglobal.com",
      "maxPages": 200,
      "apifyProxyGroups": ["RESIDENTIAL"]
    }
📦 Output Structure
The Actor stores results in the default Apify dataset. Each item looks like this:
    {
      "url": "https://example.com/page",
      "title": "Page Title",
      "meta": { "description": "Page description..." },
      "headings": ["Header 1", "Subheader"],
      "type": "html",
      "content": "# Page Title\n\nContent converted to Markdown..."
    }
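For RAG pipelines, items of this shape usually need to be split into smaller passages before embedding. The following is one possible approach, not part of the Actor: `chunk_item` is a hypothetical helper that splits the Markdown `content` field at headings and caps each chunk at an assumed `max_chars` limit.

```python
from typing import Dict, Iterator, List


def chunk_item(item: Dict, max_chars: int = 1000) -> Iterator[Dict]:
    """Yield heading-aligned text chunks from one dataset item."""
    sections: List[str] = []
    current: List[str] = []
    for line in item.get("content", "").splitlines():
        # A Markdown heading starts a new section (naive: ignores code fences).
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    for section in sections:
        # Hard-split any section that exceeds the size limit.
        for start in range(0, len(section), max_chars):
            yield {
                "url": item["url"],
                "title": item.get("title", ""),
                "text": section[start:start + max_chars],
            }
```

Keeping `url` and `title` on every chunk lets a retriever cite the source page. A production splitter would likely break on sentence boundaries rather than at fixed character offsets.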
🆘 Support
For issues, questions, or feature requests:
Email: bhavin.shah@techforceglobal.com
Made with ❤️ for the Data Extraction community
📅 Book a 15-min consultation
This Actor can be integrated into a fully automated n8n workflow: no manual steps, just end-to-end automation.