Under maintenance

Pricing

$15.00 / 1,000 results

Advanced Website Crawling Actor

A fast and reliable scraper for any website that extracts clean HTML, Markdown, and text content. Provides clean, structured data with support for dynamic rendering, recursive sitemap discovery, SSL bypass, and easy API integration for your applications.

Rating

0.0

(0)

Developer

Techforce

Actor stats

Last modified: a day ago

Advanced Website Crawling & Content Extraction

Advanced Website Crawling & Content Extraction is an Apify Actor designed to crawl complex, dynamic, and bot-protected websites. It uses a hybrid architecture (static fetching with a Playwright browser fallback) to extract clean HTML, Markdown, and text content.

It automatically handles sitemaps (recursive & gzipped), SSL certificate errors, and deduplication, making it perfect for RAG (Retrieval-Augmented Generation) pipelines and data archiving.
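As a rough illustration of what recursive, gzip-aware sitemap handling with deduplication can look like, here is a minimal Python sketch. It is not the Actor's actual code: the function name `collect_sitemap_urls` and the injected `fetch` callable are assumptions made for this example, and the Actor's real behavior may differ.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set


def _local_name(tag: str) -> str:
    """Strip the XML namespace: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def collect_sitemap_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes],  # hypothetical downloader: URL -> raw bytes
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return de-duplicated page URLs, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    raw = fetch(sitemap_url)
    if raw[:2] == b"\x1f\x8b":  # gzip magic bytes
        raw = gzip.decompress(raw)

    root = ET.fromstring(raw)
    # Only <loc> elements directly inside <sitemap> or <url> entries are read.
    locs = [
        child.text.strip()
        for entry in root
        for child in entry
        if _local_name(child.tag) == "loc" and child.text
    ]

    pages: List[str] = []
    if _local_name(root.tag) == "sitemapindex":
        for child_sitemap in locs:  # recurse into each nested sitemap
            pages.extend(collect_sitemap_urls(child_sitemap, fetch, seen))
    else:
        for url in locs:
            if url not in seen:  # drop URLs already listed elsewhere
                seen.add(url)
                pages.append(url)
    return pages
```

Passing the downloader in as `fetch` keeps the parsing logic separate from networking, so the same function can be exercised with in-memory documents or wired to any HTTP client.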

⭐ Features

Hybrid Crawling: Combines fast static fetching with a robust Playwright browser fallback for dynamic content.
Universal Content: Extracts text from HTML files automatically.
Smart Sitemap Discovery: Recursively parses sitemap.xml, sitemap_index.xml, and Gzip-compressed sitemaps.
Strict & Relaxed Scoping: Intelligent domain filtering (handles www. vs non-www automatically).
Output Formats: Save data as clean text or Markdown.
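To make the scoping feature above more concrete, the following Python sketch shows one way a crawler might decide whether a discovered link belongs to the start site while treating `www.` and non-`www` hosts as the same. The meaning given to "strict" (same host only) and "relaxed" (subdomains allowed) is an assumption for illustration, as are the names `in_scope` and `strict`; the listing does not define them.

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Lower-cased hostname with a leading 'www.' removed ('' if none)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_scope(candidate: str, start_url: str, strict: bool = True) -> bool:
    """Check whether `candidate` belongs to the site of `start_url`.

    Assumed semantics (not confirmed by the Actor's documentation):
    strict  -> same host, ignoring a 'www.' prefix
    relaxed -> same host or any of its subdomains
    """
    cand, base = _host(candidate), _host(start_url)
    if not cand or not base:  # relative links, mailto:, and similar have no host
        return False
    if strict:
        return cand == base
    return cand == base or cand.endswith("." + base)
```

Comparing with `"." + base` rather than a bare suffix avoids treating a lookalike domain such as `notexample.com` as part of `example.com`.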

📝 Example Use Cases

LLM Training Data: Scrape entire documentation sites or knowledge bases for RAG pipelines.
Competitor Analysis: Extract product details, pricing, and services from competitor websites.
Archiving: Download and index reports from government or corporate portals.
SEO Auditing: Crawl sites to check for broken links, metadata, and content structure.

🚀 How to Use

Open the Actor on Apify.
Enter the Start URL (<<YOUR WEBSITE URL FOR SCRAPING>>).
Set Max Pages (default: 20).
(Optional) Configure proxy or adjust output format.
Click Run.
Download your dataset in JSON, CSV, or using the API.

🧩 Input Configuration

Field	Type	Required	Description
`startUrl`	String	✔️ Yes	The URL where the crawler should start.
`maxPages`	Number	No	Maximum number of pages to crawl.
`proxy`	Object	No	Apify Proxy configuration (Recommended)

Example Input

{
    "startUrl": "https://techforceglobal.com",
    "maxPages": 200,
    "apifyProxyGroups": ["RESIDENTIAL"]
}
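When the input is assembled programmatically, a small helper can catch obvious mistakes before a run is started. The sketch below is an assumption-laden example, not part of the Actor: `build_run_input` is a hypothetical name, the default of 20 pages comes from the "How to Use" steps, and the proxy settings are passed through unchanged under the `proxy` key from the table because the listing does not spell out their exact shape (the example input above uses a top-level `apifyProxyGroups` key instead).

```python
from typing import Any, Dict, Optional
from urllib.parse import urlsplit


def build_run_input(
    start_url: str,
    max_pages: int = 20,  # default taken from the "How to Use" steps
    proxy: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Validate the documented fields and return a run-input dictionary."""
    parts = urlsplit(start_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"startUrl must be an absolute http(s) URL: {start_url!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("maxPages must be a positive integer")

    run_input: Dict[str, Any] = {"startUrl": start_url, "maxPages": max_pages}
    if proxy is not None:
        # Passed through as-is; the expected structure is an open assumption.
        run_input["proxy"] = proxy
    return run_input
```

The resulting dictionary can then be serialized to JSON and supplied as the run input through the Apify Console or API.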

📦 Output Structure

The Actor stores results in the default Apify dataset. Each item looks like this:

{
    "url": "https://example.com/page",
    "title": "Page Title",
    "meta": { "description": "Page description..." },
    "headings": ["Header 1", "Subheader"],
    "type": "html",
    "content": "# Page Title\n\nContent converted to Markdown..."
}
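Since the `content` field holds Markdown, a common next step for RAG pipelines is splitting each item into smaller chunks. The following Python sketch shows one simple approach that groups paragraphs up to a size limit. The function name `chunk_item`, the `max_chars` parameter, and the output keys `chunk_index` and `text` are assumptions made for this example; only `url`, `title`, and `content` come from the item structure shown above.

```python
from typing import Any, Dict, List


def chunk_item(item: Dict[str, Any], max_chars: int = 1000) -> List[Dict[str, Any]]:
    """Split item['content'] on blank lines and pack paragraphs into chunks.

    A chunk is closed once adding the next paragraph would exceed max_chars.
    A single paragraph longer than max_chars is kept whole, not cut.
    """
    chunks: List[str] = []
    current = ""
    for para in item.get("content", "").split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)

    # Keep the source URL and title on every chunk for later attribution.
    return [
        {
            "url": item["url"],
            "title": item.get("title", ""),
            "chunk_index": i,
            "text": text,
        }
        for i, text in enumerate(chunks)
    ]
```

Carrying `url` and `title` on each chunk lets a retrieval step cite the originating page without another lookup.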

🆘 Support

For issues, questions, or feature requests:
Email: bhavin.shah@techforceglobal.com

Made with ❤️ for the Data Extraction community

📅 Book a 15-min consultation

This Actor can be integrated into a fully automated n8n workflow: no manual steps, just end-to-end automation.

🌐 Website: https://techforceglobal.com/
