Website Content Crawler

Pricing: from $0.01 / result

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, and integrates with the wider LLM ecosystem.

Rating: 0.0 (0)

Developer: yun qing (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 8 hours ago

A focused first-version actor inspired by Apify's Website Content Crawler.

It crawls one or more websites, stays inside the start URL scope, extracts cleaned page content, and stores the result as markdown, text, or html.

Features

  • Crawl from one or more start URLs
  • Crawl from sitemap URLs in sitemap mode
  • Follow links recursively with maxDepth
  • Keep the crawl inside the same start URL scope
  • Filter out PDFs and other non-HTML files by extension
  • Extract cleaned page content from main, article, or a custom selector
  • Output markdown, text, or html
  • Store OUTPUT_SUMMARY, FAILED_PAGES, SKIPPED_PAGES, and CLEAN_HTML_INDEX
  • Store cleaned HTML separately in key-value store records like CLEAN_HTML_000001
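The "same start URL scope" rule above can be sketched as a simple URL comparison. This is an illustrative sketch, not the actor's actual code; the helper name `isInScope` is hypothetical:

```typescript
// Illustrative sketch: keep a crawl inside the start URL's scope.
// A candidate link is in scope when it shares the start URL's origin
// and its path begins with the start URL's directory.
// `isInScope` is a hypothetical helper, not the actor's real API.
function isInScope(startUrl: string, candidate: string): boolean {
  const start = new URL(startUrl);
  const target = new URL(candidate);
  if (start.origin !== target.origin) return false;
  // Treat the start URL's directory as the scope root.
  const basePath = start.pathname.endsWith("/")
    ? start.pathname
    : start.pathname.slice(0, start.pathname.lastIndexOf("/") + 1);
  return target.pathname.startsWith(basePath);
}
```

Under this reading, a crawl started at `https://example.com/docs/` would follow links under `/docs/` but skip `/blog/` and other hosts.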

Local Development

pnpm actor:dev websiteContentCrawler --example 0 --force-input
pnpm actor:dev websiteContentCrawler --example 2 --force-input

Notes:

  • input-examples.json is used by local actor:dev
  • Apify platform automated testing uses the prefill values from .actor/input_schema.json
  • The schema now uses a public default URL so automated testing can pass without relying on localhost
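For orientation, a minimal input might look like the sketch below. Only startUrls, sitemapUrls, and maxDepth are named elsewhere in this README; the other field names (crawlMode, contentSelector, outputFormat) are assumptions and may differ from the actual .actor/input_schema.json:

```json
{
  "startUrls": [{ "url": "https://example.com/docs/" }],
  "crawlMode": "website",
  "maxDepth": 2,
  "contentSelector": "main",
  "outputFormat": "markdown"
}
```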

Build

pnpm actor:build websiteContentCrawler

Publish

pnpm actor:push websiteContentCrawler
pnpm actor:push websiteContentCrawler --dry-run

Dataset Output

Each dataset item includes:

  • url
  • title
  • description
  • content
  • contentFormat
  • cleanHtml
  • markdown
  • text
  • html
  • wordCount
  • language
  • canonicalUrl
  • depth
  • httpStatusCode
  • crawledAt
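A dataset item might look like the following sketch; all values are illustrative, and only the field the chosen output format fills (here markdown) is assumed to be populated:

```json
{
  "url": "https://example.com/docs/intro",
  "title": "Introduction",
  "description": "Getting started guide",
  "content": "# Introduction\n...",
  "contentFormat": "markdown",
  "cleanHtml": "CLEAN_HTML_000001",
  "markdown": "# Introduction\n...",
  "text": null,
  "html": null,
  "wordCount": 345,
  "language": "en",
  "canonicalUrl": "https://example.com/docs/intro",
  "depth": 1,
  "httpStatusCode": 200,
  "crawledAt": "2025-01-01T12:00:00.000Z"
}
```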

Crawl Modes

  • website: start from startUrls, then follow links recursively
  • sitemap: load URLs from sitemapUrls, or fall back to origin + /sitemap.xml
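The sitemap fallback in the list above can be sketched with the WHATWG URL API; `fallbackSitemapUrl` is an illustrative name, not the actor's code:

```typescript
// Illustrative sketch of the sitemap-mode fallback: when no
// sitemapUrls are given, derive origin + /sitemap.xml from a
// start URL. `fallbackSitemapUrl` is a hypothetical helper.
function fallbackSitemapUrl(startUrl: string): string {
  // Resolving an absolute path against the start URL keeps only its origin.
  return new URL("/sitemap.xml", startUrl).toString();
}
```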

Separate Clean HTML Storage

  • CLEAN_HTML_INDEX stores the mapping between page URL and KVS record key
  • Individual cleaned HTML records are stored as CLEAN_HTML_000001, CLEAN_HTML_000002, and so on
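The sequential record keys above follow a zero-padded counter pattern, which can be sketched as follows; `cleanHtmlKey` is an illustrative name, not the actor's code:

```typescript
// Illustrative sketch: build key-value store record keys like
// CLEAN_HTML_000001 by zero-padding a 1-based page counter to
// six digits. `cleanHtmlKey` is a hypothetical helper.
function cleanHtmlKey(index: number): string {
  return `CLEAN_HTML_${String(index).padStart(6, "0")}`;
}
```

Fixed-width keys keep records in crawl order when listed lexicographically, which is presumably why the padded form is used.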