Advanced Website Crawling Actor

Pricing

$15.00 / 1,000 results

A fast and reliable scraper for any website that extracts clean HTML, Markdown, and text content. It provides clean, structured data with support for dynamic rendering, recursive sitemap discovery, SSL bypass, and easy API integration for your applications.

Developer

Techforce

Maintained by Community

Advanced Website Crawling & Content Extraction

Advanced Website Crawling & Content Extraction is a powerful Apify Actor designed to crawl complex, dynamic, and bot-protected websites. It uses a hybrid architecture (static fetching plus Playwright) to extract clean HTML.

It automatically handles sitemaps (recursive & gzipped), SSL certificate errors, and deduplication, making it perfect for RAG (Retrieval-Augmented Generation) pipelines and data archiving.
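The Actor's internal deduplication logic is not documented here, so the following is only a minimal sketch of one common approach: normalize each URL, then keep the first occurrence of every normalized form. The function names `normalize_url` and `dedupe` are hypothetical, and treating a trailing slash as insignificant is an assumption that does not hold for every server.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep an explicit port only when it is not the scheme default.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    # Assumption: "/docs" and "/docs/" point at the same page.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe(urls):
    """Yield each URL once, in first-seen order, keyed by its normalized form."""
    seen = set()
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            yield key
```

A crawler built this way would run every discovered link through `dedupe` before queuing it, so the same page is not fetched twice under slightly different spellings.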


⭐ Features

  • Hybrid Crawling: Combines fast static fetching with a robust Playwright browser fallback for dynamic content.
  • Universal Content: Extracts text from HTML files automatically.
  • Smart Sitemap Discovery: Recursively parses sitemap.xml, sitemap_index.xml, and Gzip-compressed sitemaps.
  • Strict & Relaxed Scoping: Intelligent domain filtering (handles www. vs non-www automatically).
  • Output Formats: Save data as clean text or Markdown.
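To make the sitemap discovery and scoping features above more concrete, here is a rough sketch of how recursive, gzip-aware sitemap parsing and a www-insensitive domain check could work. It is not the Actor's actual code. The names `same_site`, `local_name`, and `collect_sitemap_urls` are hypothetical, and `fetch` stands for any caller-supplied function that returns a URL's raw bytes.

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def same_site(url: str, start_url: str) -> bool:
    """Relaxed scope check: treat www.example.com and example.com as one site."""
    def bare_host(u):
        host = (urlsplit(u).hostname or "").lower()
        return host[4:] if host.startswith("www.") else host
    return bare_host(url) == bare_host(start_url)


def local_name(tag: str) -> str:
    """Strip the XML namespace: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def collect_sitemap_urls(sitemap_url, fetch, seen=None):
    """Return page URLs from a sitemap, following nested sitemap indexes.

    `fetch` is a caller-supplied function (hypothetical here) that takes a
    URL and returns the raw response body as bytes.
    """
    seen = set() if seen is None else seen
    if sitemap_url in seen:        # guard against sitemap index loops
        return []
    seen.add(sitemap_url)

    body = fetch(sitemap_url)
    if body[:2] == b"\x1f\x8b":    # gzip magic number
        body = gzip.decompress(body)

    root = ET.fromstring(body)
    # Read only <url>/<loc> or <sitemap>/<loc>, not deeper extension tags.
    locs = [el.text.strip()
            for entry in root
            for el in entry
            if local_name(el.tag) == "loc" and el.text]

    if local_name(root.tag) == "sitemapindex":
        pages = []
        for child_sitemap in locs:
            pages.extend(collect_sitemap_urls(child_sitemap, fetch, seen))
        return pages
    return locs
```

In a strict scope the hosts would have to match exactly; the relaxed `same_site` check shown here only removes a leading `www.` before comparing, so subdomains such as `blog.example.com` still count as a different site.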

📝 Example Use Cases

  • LLM Training Data: Scrape entire documentation sites or knowledge bases for RAG pipelines.
  • Competitor Analysis: Extract product details, pricing, and services from competitor websites.
  • Archiving: Download and index reports from government or corporate portals.
  • SEO Auditing: Crawl sites to check for broken links, metadata, and content structure.

🚀 How to Use

  1. Open the Actor on Apify.
  2. Enter the Start URL (the address of the website you want to scrape).
  3. Set Max Pages (default: 20).
  4. (Optional) Configure proxy or adjust output format.
  5. Click Run.
  6. Download your dataset as JSON or CSV, or retrieve it through the API.
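The same steps can also be driven programmatically. The sketch below uses only the Python standard library and the Apify API's synchronous "run and get dataset items" endpoint. The Actor ID passed in is a placeholder, since the real ID is not given in this README, and `build_run_request` and `run_actor` are hypothetical helper names.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, start_url: str, max_pages: int = 20):
    """Build (but do not send) a request that runs the Actor and returns its items."""
    run_input = {"startUrl": start_url, "maxPages": max_pages}
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id, token, start_url, max_pages=20, timeout=300):
    """Send the request and return the dataset items as a list of dicts."""
    request = build_run_request(actor_id, token, start_url, max_pages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (placeholder Actor ID and token, not taken from this README):
# items = run_actor("<username>~<actor-name>", "<APIFY_TOKEN>", "https://example.com")
```

A synchronous call keeps the example short, but it holds the connection open for the whole crawl. For large `maxPages` values, starting the run asynchronously and reading the dataset afterwards is probably the safer choice.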

🧩 Input Configuration

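| Field    | Type   | Required | Description                                   |
|----------|--------|----------|-----------------------------------------------|
| startUrl | String | ✔️ Yes   | The URL where the crawler should start.       |
| maxPages | Number | No       | Maximum number of pages to crawl.             |
| proxy    | Object | No       | Apify Proxy configuration (recommended).      |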

Example Input

{
  "startUrl": "https://techforceglobal.com",
  "maxPages": 200,
  "apifyProxyGroups": ["RESIDENTIAL"]
}

📦 Output Structure

The Actor stores results in the default Apify dataset. Each item looks like this:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "meta": { "description": "Page description..." },
  "headings": ["Header 1", "Subheader"],
  "type": "html",
  "content": "# Page Title\n\nContent converted to Markdown..."
}
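For downstream use, such as feeding a RAG pipeline, items with this shape can be loaded into a small typed structure and split into chunks. The sketch below assumes only the fields shown in the sample item. `PageItem` and `chunk_markdown` are hypothetical names, and the paragraph-based chunking is one simple strategy among many.

```python
from dataclasses import dataclass, field


@dataclass
class PageItem:
    """One dataset item, limited to the fields shown in the sample output."""
    url: str
    title: str = ""
    description: str = ""
    headings: list = field(default_factory=list)
    type: str = "html"
    content: str = ""

    @classmethod
    def from_dict(cls, raw: dict) -> "PageItem":
        meta = raw.get("meta") or {}
        return cls(
            url=raw["url"],
            title=raw.get("title") or "",
            description=meta.get("description") or "",
            headings=list(raw.get("headings") or []),
            type=raw.get("type") or "html",
            content=raw.get("content") or "",
        )


def chunk_markdown(content: str, max_chars: int = 1000):
    """Split on blank lines and pack paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-sentence, so such a chunk can exceed the limit.
    """
    chunks, current = [], ""
    for paragraph in content.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be stored alongside `item.url` and `item.title`, so retrieved passages can be traced back to their source page.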

🆘 Support

For issues, questions, or feature requests:
Email: bhavin.shah@techforceglobal.com

Made with ❤️ for the Data Extraction community

📅 Book a 15-min consultation

This Actor can be integrated into a fully automated n8n workflow: no manual steps, just end-to-end automation.

🌐 Website: https://techforceglobal.com/