Pricing

from $0.10 / 1,000 results

Website Content Scraper

Extract clean Markdown, plain text, linked files, and RAG-ready chunks from websites, documentation, help centers, knowledge bases, and authenticated portals. Preserve structure, metadata, URLs, and crawl context for AI search, training, and retrieval workflows.

Developer

Muhammad Qaseem Iqbal

Last modified

a month ago

Why use Website Content Scraper?

Most websites are designed for people, not for AI systems or clean data exports. A page can include menus, banners, cookie popups, repeated footers, scripts, and links that are not useful for your final dataset.

This Actor helps by collecting the useful content and organizing it into records you can export as JSON, CSV, Excel, XML, or other Apify dataset formats.

Common use cases include:

Build a chatbot that answers questions from your website or docs.
Create a search index for internal or customer-facing support.
Export documentation pages to Markdown or plain text.
Feed website content into a vector database or AI workflow.
Track changed, unchanged, or deleted pages across repeat crawls.
Download and parse linked documents such as PDFs, spreadsheets, and JSON files.

Main features

Crawl one page, one section, or a larger website.
Extract clean text and Markdown from web pages.
Create AI-ready chunks, which are smaller pieces of content for search and chatbot systems.
Download and parse linked files, including PDF, DOCX, XLSX, CSV, TSV, Markdown, JSON, XML, and text files.
Discover extra URLs from sitemaps and llms.txt files.
Respect robots.txt by default.
Use fast crawling for simple sites and browser crawling for JavaScript-heavy pages.
Crawl pages behind login when you provide cookies or request headers.
Save run summaries, skipped URL diagnostics, and sync manifests.
Support incremental recrawls, so you can skip unchanged content in scheduled runs.

How it works

Website Content Scraper works in four simple steps:

Find pages

The Actor starts from the URLs you provide. It follows links that are in scope, can read sitemaps, and can use llms.txt files when available.

Clean the page

It removes common noise such as navigation, scripts, repeated layout content, and other page clutter where possible.

Extract content

It saves the page as clean text, Markdown, and optionally cleaned HTML. It can also download and parse supported linked files.

Prepare results

It writes page records, file records, and AI-ready chunks to the dataset. You can export the data or connect it to another workflow.

Quick start

For your first run, start small. You can increase the limits after you check the results.

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "crawlerType": "cheerio",
  "crawlScope": "startUrlPath",
  "maxCrawlPages": 25,
  "maxResults": 25,
  "discoverSitemaps": false,
  "discoverLlmsTxt": false,
  "discoverLlmsFullTxt": false,
  "saveMarkdown": true,
  "saveText": false,
  "createChunks": false,
  "saveFiles": false,
  "parseFiles": false,
  "maxFiles": 0,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

For many documentation and help sites, cheerio is the best first choice because it is fast and cost-efficient. Turn on sitemap discovery, chunks, file parsing, or browser rendering only when the first small run shows that you need them.

For AI search or chatbot workflows, use the rag preset or enable createChunks. For linked PDFs, spreadsheets, or JSON files, enable saveFiles, parseFiles, and set maxFiles to a small number first.
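
The two variations above can be written as small overrides on top of the quick-start input. The following Python sketch is only an illustration: the helper names `with_chunks` and `with_files` are hypothetical, and the field names are the ones used in this README.

```python
import copy

# Quick-start input from this README, trimmed to the fields the helpers touch.
BASE_INPUT = {
    "startUrls": [{"url": "https://docs.apify.com/"}],
    "crawlerType": "cheerio",
    "crawlScope": "startUrlPath",
    "maxCrawlPages": 25,
    "maxResults": 25,
    "saveMarkdown": True,
    "createChunks": False,
    "saveFiles": False,
    "parseFiles": False,
    "maxFiles": 0,
}


def with_chunks(run_input: dict) -> dict:
    """Return a copy of the input with chunk records enabled."""
    updated = copy.deepcopy(run_input)
    updated["createChunks"] = True
    return updated


def with_files(run_input: dict, max_files: int = 5) -> dict:
    """Return a copy of the input that downloads and parses a few linked files."""
    updated = copy.deepcopy(run_input)
    updated.update({"saveFiles": True, "parseFiles": True, "maxFiles": max_files})
    return updated


# Combine both overrides without modifying the tested base input.
rag_input = with_files(with_chunks(BASE_INPUT), max_files=5)
```

Working on copies keeps the small, tested input unchanged, so it stays available as a cheap baseline when a larger run behaves unexpectedly.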

Example output

The dataset contains different types of records. The most important field is recordType.

Page record

A page record represents one crawled web page.

{
  "recordType": "page",
  "url": "https://docs.example.com/getting-started",
  "title": "Getting started",
  "markdown": "# Getting started\n\nThis guide explains...",
  "text": "Getting started\n\nThis guide explains...",
  "contentQuality": {
    "confidence": 0.98,
    "wordCount": 1240,
    "isThin": false
  }
}

Chunk record

A chunk record is a smaller piece of a page or file. These records are useful for AI search, chatbots, and retrieval workflows.

{
  "recordType": "chunk",
  "url": "https://docs.example.com/getting-started",
  "title": "Getting started",
  "headingPath": ["Getting started", "Install"],
  "text": "Install the package and configure your project...",
  "tokenEstimate": 420
}
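
To show how a chunk record relates to its page, here is a rough Python sketch that splits Markdown into one chunk per heading section. It is not the Actor's actual chunking logic, which this page does not describe. The function name `chunk_markdown` and the estimate of four characters per token are assumptions made for illustration.

```python
import re


def chunk_markdown(markdown: str, url: str, title: str) -> list[dict]:
    """Split Markdown into one chunk per heading section (illustrative only)."""
    chunks: list[dict] = []
    heading_path: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk under the current heading path.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "recordType": "chunk",
                "url": url,
                "title": title,
                "headingPath": list(heading_path),
                "text": text,
                # Assumed heuristic: roughly four characters per token.
                "tokenEstimate": max(1, len(text) // 4),
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at this level or deeper, then add the new heading.
            del heading_path[level - 1:]
            heading_path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "# Getting started\n\nThis guide explains setup.\n\n## Install\n\nInstall the package."
chunks = chunk_markdown(sample, "https://docs.example.com/getting-started", "Getting started")
```

With this sample, the second chunk carries the heading path `["Getting started", "Install"]`, which mirrors the `headingPath` field in the record above.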

File record

A file record represents a downloaded or parsed file linked from a page.

{
  "recordType": "file",
  "url": "https://docs.example.com/api/openapi.json",
  "title": "JSON",
  "metadata": {
    "contentType": "application/json",
    "byteLength": 968704
  }
}

Understanding the results

Use recordType to filter the dataset:

Record type	What it means	When to use it
`page`	A full crawled web page	Markdown export, content review, documentation migration
`chunk`	A smaller text section	AI search, chatbots, vector databases, RAG workflows
`file`	A downloaded or parsed linked file	File archives, API specs, PDFs, spreadsheets
`skipped`	A URL skipped by the Actor	Debugging crawl limits or URL scope
`tombstone`	A previously seen item that disappeared	Incremental sync and delete handling

Apify dataset views select useful columns, but they do not filter rows by type. For page-only, chunk-only, or file-only exports, filter by recordType.
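
As a minimal sketch of that filtering step, the Python snippet below groups exported items by recordType. It assumes the dataset was already exported and loaded as a list of dictionaries. The function name `group_by_record_type` and the sample items are hypothetical.

```python
from collections import defaultdict


def group_by_record_type(items: list[dict]) -> dict[str, list[dict]]:
    """Group dataset items by their recordType field."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        # Items without a recordType are kept under "unknown" rather than dropped.
        grouped[item.get("recordType", "unknown")].append(item)
    return dict(grouped)


# Hypothetical items shaped like the example records in this README.
items = [
    {"recordType": "page", "url": "https://docs.example.com/getting-started"},
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
    {"recordType": "file", "url": "https://docs.example.com/api/openapi.json"},
    {"recordType": "skipped", "url": "https://docs.example.com/blog/"},
]

grouped = group_by_record_type(items)
pages = grouped.get("page", [])
chunks = grouped.get("chunk", [])
```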

Input settings explained

Setting	Plain-language description
`startUrls`	The page or website section where the crawl starts.
`crawlScope`	Controls which links are allowed. `startUrlPath` is safest for one docs section or blog section.
`maxCrawlPages`	Maximum number of page requests the crawler will process.
`maxResults`	Maximum number of page records saved to the dataset.
`crawlerType`	Choose fast crawling, adaptive crawling, or browser crawling.
`maxBrowserFallbacks`	Caps how many pages adaptive mode may retry in a browser.
`discoverSitemaps`	Finds more URLs from sitemap files. Leave off for the cheapest first run.
`discoverLlmsTxt`	Finds URLs from `llms.txt` files when a site provides them. Leave off unless you need extra discovery.
`discoverLlmsFullTxt`	Also reads `llms-full.txt`; keep off unless you want a larger crawl.
`saveMarkdown`	Saves page content in Markdown format.
`saveText`	Saves page content as plain text. Turn off when Markdown is enough.
`createChunks`	Splits content into smaller AI-friendly records. Useful for RAG, but creates more dataset rows.
`saveFiles`	Downloads supported linked files. Leave off unless you need file archives.
`parseFiles`	Extracts text from supported linked files. Leave off unless you need PDF, spreadsheet, or document text.
`maxFiles`	Limits how many linked files are processed.
`cookies`	Secret cookie string for logged-in pages.
`requestHeaders`	Secret custom headers for authenticated or special requests.

Crawler types

Crawler type	Best for
`cheerio`	Fast crawling of static pages, docs, blogs, and help centers.
`adaptive`	Starts fast and falls back to browser rendering when needed.
`playwright-firefox`	Pages that need a real browser, JavaScript, or login flows.
`playwright-chromium`	Browser crawling with Chromium.

Browser crawling is more powerful, but usually slower and more expensive. Start with cheerio unless the website content does not appear in the results.

AI and chatbot use cases

This Actor is especially useful when you want AI to answer questions from website content.

Examples:

Customer support chatbot trained on a help center.
Internal assistant that searches company documentation.
Product copilot that answers questions from API docs.
Custom GPT knowledge files created from website pages.
Vector database ingestion for tools such as Pinecone, Qdrant, Weaviate, or similar systems.

If you are not familiar with the term RAG, it simply means giving an AI model relevant information from your own content before it answers a question. The chunk records are designed for that kind of workflow.

Incremental crawling

If you run the Actor on a schedule, you may not want to process the same unchanged content every time.

Use incremental mode to track what changed:

{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "incrementalMode": "readWriteState",
  "stateKey": "docs-production",
  "skipUnchanged": true,
  "emitDeletedRecords": true
}

The Actor stores content hashes in the key-value store. On future runs, it can identify new, changed, unchanged, and deleted content.
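
The idea behind hash-based change detection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Actor's implementation: the use of SHA-256, the state layout (URL mapped to hash), and the function names are all hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash page content so two runs can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify(previous: dict[str, str], current_pages: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (url -> hash) with freshly crawled pages (url -> text)."""
    result: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": [], "deleted": []}
    for url, text in current_pages.items():
        digest = content_hash(text)
        if url not in previous:
            result["new"].append(url)
        elif previous[url] != digest:
            result["changed"].append(url)
        else:
            result["unchanged"].append(url)
    # URLs seen in the previous run but missing now are treated as deleted.
    result["deleted"] = [url for url in previous if url not in current_pages]
    return result


previous_state = {
    "https://docs.example.com/a": content_hash("Page A"),
    "https://docs.example.com/b": content_hash("Page B"),
    "https://docs.example.com/c": content_hash("Page C"),
}
current = {
    "https://docs.example.com/a": "Page A",
    "https://docs.example.com/b": "Page B, edited",
    "https://docs.example.com/d": "Page D",
}
changes = classify(previous_state, current)
```

In this sample, page `a` is unchanged, `b` is changed, `d` is new, and `c` is deleted. A deleted entry corresponds to the tombstone record type described earlier.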

Authenticated websites

For private pages or customer portals, provide cookies or request headers in the input.

These fields are marked as secret inputs:

cookies
requestHeaders

They are not written to dataset records or logs. You can also provide loginValidationUrl to check that authentication works before the crawl continues.
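
One way to keep credentials out of saved files is to read them from environment variables when building the input. The Python sketch below assumes two hypothetical variable names, `PORTAL_COOKIE` and `PORTAL_TOKEN`, and assumes that requestHeaders accepts an object of header names and values. Check the Actor's input schema before relying on that shape.

```python
import os


def build_auth_input(start_url: str, validation_url: str) -> dict:
    """Build an input with credentials taken from the environment."""
    cookie = os.environ.get("PORTAL_COOKIE")  # hypothetical variable name
    token = os.environ.get("PORTAL_TOKEN")    # hypothetical variable name

    run_input: dict = {
        "startUrls": [{"url": start_url}],
        "loginValidationUrl": validation_url,
    }
    if cookie:
        # The settings table describes this field as a cookie string.
        run_input["cookies"] = cookie
    if token:
        # Assumed shape: a mapping of header name to header value.
        run_input["requestHeaders"] = {"Authorization": f"Bearer {token}"}
    if "cookies" not in run_input and "requestHeaders" not in run_input:
        raise ValueError("Set PORTAL_COOKIE or PORTAL_TOKEN before building the input.")
    return run_input
```

Failing early when neither variable is set avoids starting a crawl that can only reach the login page.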

How much does it cost?

The cost depends on:

how many pages you crawl,
how many files you download or parse,
whether you use browser crawling,
how much data is written to datasets and key-value stores.

Tips to control cost:

Start with maxCrawlPages and maxResults set to 25.
Keep discoverLlmsFullTxt off unless you need it.
Keep discoverSitemaps and discoverLlmsTxt off for the first test run.
Use cheerio for static sites.
Use createChunks only when you need AI search or chatbot-ready records.
Keep saveFiles and parseFiles off unless linked files matter.
Turn off saveHtml and saveScreenshots unless you need them.
Set maxFiles to a small number, such as 5 or 10, before processing many files.
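
For a rough lower bound, the listed price of "from $0.10 / 1,000 results" can be turned into simple arithmetic. The Python sketch below assumes every dataset row counts as one result, which this page does not confirm, and it ignores any cost that is not billed per result. The function name is hypothetical.

```python
PRICE_PER_1000_RESULTS_USD = 0.10  # the "from" price listed on this page


def estimate_min_result_cost(result_count: int) -> float:
    """Estimate the minimum per-result charge in USD for a number of results."""
    if result_count < 0:
        raise ValueError("result_count must not be negative")
    return round(result_count / 1000 * PRICE_PER_1000_RESULTS_USD, 4)


first_test_run = estimate_min_result_cost(25)
large_run = estimate_min_result_cost(50_000)
```

Under these assumptions, a 25-result test run comes to about $0.0025 and a 50,000-result run to about $5.00. Because the price is a starting price, treat both figures as a floor rather than a quote.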

Troubleshooting

The page content is missing from the results

Try adaptive or a Playwright crawler. The page may need JavaScript rendering. You can also use keepElementsCssSelector to tell the Actor which part of the page to keep.

I got too many pages

Use a narrower start URL in startUrls, keep crawlScope set to startUrlPath, or add patterns to excludeUrlGlobs.
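
To preview which URLs a narrower scope would keep, you can approximate the two filters locally. The Python sketch below is an assumption about how a path-prefix scope and exclude globs might behave, not the Actor's matching logic. Both function names are hypothetical. Python's `fnmatchcase` lets `*` match across `/`, which may differ from the glob syntax the Actor uses.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def in_start_url_path(start_url: str, candidate: str) -> bool:
    """Return True when the candidate is on the same host and under the start path."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    # Compare paths as directory prefixes so "/guide" does not match "/guidebook".
    prefix = start.path if start.path.endswith("/") else start.path + "/"
    cand_path = cand.path if cand.path.endswith("/") else cand.path + "/"
    return start.netloc == cand.netloc and cand_path.startswith(prefix)


def is_excluded(candidate: str, exclude_globs: list[str]) -> bool:
    """Return True when the candidate matches any exclude pattern."""
    return any(fnmatchcase(candidate, pattern) for pattern in exclude_globs)


start = "https://docs.example.com/guide/"
candidates = [
    "https://docs.example.com/guide/install",
    "https://docs.example.com/guide/changelog/v1",
    "https://docs.example.com/blog/post",
    "https://docs.example.com/guidebook",
]
kept = [
    url for url in candidates
    if in_start_url_path(start, url) and not is_excluded(url, ["*/changelog/*"])
]
```

In this sample only the `/guide/install` URL is kept: the changelog page matches the exclude pattern, and the other two fall outside the start path.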

I did not get enough pages

Increase maxCrawlPages, maxResults, and maxCrawlDepth. Also enable discoverSitemaps to find more URLs from sitemap files.

My files are missing

Make sure saveFiles and parseFiles are enabled, and increase maxFiles if the site links to many files.

Some pages have low confidence scores

Low scores are common for index pages, category pages, and navigation-heavy pages. For AI workflows, the detailed content pages and chunk records are usually more useful.
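
If you want to drop low-scoring pages before indexing, a small filter over the page records is enough. The Python sketch below uses the `contentQuality` fields shown in the example page record. The 0.5 threshold and the function name `is_useful_page` are assumptions, so pick a threshold after reviewing your own results.

```python
def is_useful_page(record: dict, min_confidence: float = 0.5) -> bool:
    """Keep page records that are not thin and meet the confidence threshold."""
    if record.get("recordType") != "page":
        return False
    quality = record.get("contentQuality") or {}
    # Missing quality data is treated as low confidence rather than as a pass.
    return not quality.get("isThin", False) and quality.get("confidence", 0.0) >= min_confidence


# Hypothetical records shaped like the example page record in this README.
records = [
    {"recordType": "page", "url": "https://docs.example.com/getting-started",
     "contentQuality": {"confidence": 0.98, "wordCount": 1240, "isThin": False}},
    {"recordType": "page", "url": "https://docs.example.com/",
     "contentQuality": {"confidence": 0.31, "wordCount": 85, "isThin": True}},
    {"recordType": "chunk", "url": "https://docs.example.com/getting-started"},
]
useful_pages = [record["url"] for record in records if is_useful_page(record)]
```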

The website blocks the crawler

Try a browser crawler and enable a proxy in proxyConfiguration. Some sites block plain HTTP crawlers and only serve content to a real browser or to proxied requests.

Limitations

Legacy .doc files can be downloaded but are not text-extracted.
Very large files may be skipped based on fileMaxSizeMb.
Browser crawling is slower and may cost more than fast HTTP crawling.
llms.txt and llms-full.txt are used for discovery, not saved as normal file records.
Results depend on the structure and accessibility of the target website.

Best practices

Test with a small crawl before running a large one.
Review a few page records to confirm the extracted text looks right.
Use chunk records for chatbot and vector database workflows.
Use page records for full Markdown or text exports.
Use skipped records to understand why URLs were not saved.
Save a tested input as an Apify Task for repeat use.
