
Website Content Crawler — Markdown, Token Counts & RAG Chunks

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries a token estimate, JSON-LD metadata, and a link graph. Optional automatic chunk splitting drops your data straight into a vector database. Pay per page.

Built for AI engineers feeding RAG pipelines, LLM application teams indexing documentation, vector database operators ingesting knowledge bases, and content teams converting websites to clean Markdown for fine-tuning.

Keywords this actor ranks for: website to markdown, website crawler for LLM, RAG pipeline crawler, scrape website to JSON, website content scraper API, llamaindex web scraper, langchain web crawler, vector database ingestion, AI training data crawler, documentation to markdown, website to RAG chunks, html to markdown converter API, knowledge base crawler.


Why this actor

| Other crawlers | This actor |
| --- | --- |
| Raw HTML or plain text only | Markdown, plain text, AND cleaned HTML in one row |
| One extractor, take it or leave it | Three extractors race; the highest-scored result wins and is tagged |
| Manual chunking on your side | Auto-chunks at paragraph boundaries with token-aware overlap |
| No token info | Every row ships an estimated GPT and Claude token count |
| Sitemap configuration required | Auto-discovers sitemap.xml, sitemap_index.xml, and robots.txt |
| PII passes through to your index | Optional one-click PII redaction (emails, phones, SSNs, IBANs) |
| Link graph data missing | Every row carries internal vs external link counts and 25 samples |

How it works

```mermaid
flowchart LR
    A[Start URLs] --> B[Auto sitemap discovery<br/>sitemap.xml + robots.txt]
    A --> C[Adaptive crawler<br/>Playwright or Cheerio]
    B --> C
    C --> D[Strip nav header footer<br/>ads modals cookies]
    D --> E[Race three extractors<br/>Readability vs main vs body]
    E --> F[HTML to Markdown<br/>code blocks tables links]
    F --> G[Token count + chunk split]
    G --> H[(JSON CSV API<br/>vector database)]
```

Three extractors run on every page: Mozilla Readability, a custom main-content detector, and a body fallback each return text plus a content score. The highest-scoring result wins, and the row is tagged with the extractor that produced it, so you can audit quality row by row.
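
A minimal sketch of that selection logic, with illustrative names (the actor's internals may differ):

```ts
interface ExtractorResult {
  extractor: 'readability' | 'main' | 'body';
  text: string;
  contentScore: number;
}

// Run all three extractors on the same page, keep the best-scoring result.
// The winner's name lands in the row's `extractor` field, its score in `contentScore`.
function pickWinner(results: ExtractorResult[]): ExtractorResult {
  return results.reduce((best, r) => (r.contentScore > best.contentScore ? r : best));
}
```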


What you get per row

```mermaid
flowchart LR
    R[Page row] --> R1[Identity<br/>url loadedUrl title depth]
    R --> R2[Content<br/>markdown text html]
    R --> R3[Tokens<br/>estGpt chars]
    R --> R4[Metadata<br/>author publishedAt JSON-LD]
    R --> R5[Link graph<br/>internal external samples]
    R --> R6[Extractor<br/>winner + score]
```

Toggle chunkOutput and the same row format is split into RAG-ready chunks. Each chunk row has chunkIndex, totalChunks, the chunk's Markdown, and a token count, ready to push straight into Pinecone, Qdrant, Weaviate, or a Postgres pgvector table.
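
For example, a hedged sketch of that handoff into a Postgres pgvector table, using apify-client and pg; the dataset ID, table schema, and embed() stub are illustrative, not part of the actor:

```ts
import { ApifyClient } from 'apify-client';
import { Client } from 'pg';

// Placeholder: swap in your embedding provider (OpenAI, Cohere, a local model).
async function embed(text: string): Promise<number[]> {
  throw new Error('plug in an embedding model here');
}

const apify = new ApifyClient({ token: process.env.APIFY_TOKEN });
const pg = new Client({ connectionString: process.env.DATABASE_URL });
await pg.connect();

const { items } = await apify.dataset('YOUR_DATASET_ID').listItems();
for (const row of items as any[]) {
  const vector = await embed(row.markdown);
  // pgvector accepts the '[0.1,0.2,...]' literal form, which JSON.stringify produces.
  await pg.query(
    'INSERT INTO chunks (url, chunk_index, content, embedding) VALUES ($1, $2, $3, $4)',
    [row.url, row.chunkIndex, row.markdown, JSON.stringify(vector)],
  );
}
await pg.end();
```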


Quick start

Index a documentation site for RAG

```json
{
  "startUrls": ["https://docs.example.com/"],
  "maxPages": 500,
  "maxDepth": 5,
  "chunkOutput": true,
  "chunkSize": 1000,
  "chunkOverlap": 100
}
```

Convert a blog to clean Markdown

```json
{
  "startUrls": ["https://blog.example.com/"],
  "includeUrlPatterns": ["**/posts/**", "**/blog/**"],
  "outputFormats": ["markdown", "text"],
  "maxPages": 200
}
```

GDPR-safe RAG ingestion (PII redacted)

```json
{
  "startUrls": ["https://support.example.com/"],
  "redactPII": true,
  "chunkOutput": true,
  "removeFluff": true,
  "minContentLength": 200
}
```

Index a knowledge base with PDF download

```json
{
  "startUrls": ["https://kb.example.com/"],
  "downloadFiles": true,
  "downloadFileTypes": ["pdf", "docx"],
  "maxPages": 1000
}
```

Sample output

Page row

```json
{
  "url": "https://docs.apify.com/academy/scraping-basics-javascript",
  "loadedUrl": "https://docs.apify.com/academy/scraping-basics-javascript",
  "title": "Web scraping basics for JavaScript devs",
  "depth": 0,
  "extractor": "readability",
  "contentScore": 42.8,
  "markdown": "**Learn how to use JavaScript to extract information from websites...**\n\nIn this course we'll use JavaScript to create...",
  "text": "Learn how to use JavaScript to extract information from websites...",
  "tokens": { "estGpt": 1508, "chars": 6030 },
  "metadata": {
    "title": "Web scraping basics for JavaScript devs",
    "description": "Learn how to extract information from websites in this hands on course.",
    "author": null,
    "publishedAt": "2024-09-12T00:00:00.000Z",
    "modifiedAt": "2025-08-04T00:00:00.000Z",
    "language": "en",
    "jsonLdTypes": ["TechArticle"]
  },
  "links": { "outbound": 57, "internal": 43, "external": 14, "crawlable": 25, "samples": ["..."] },
  "crawledAt": "2026-04-25T16:00:00.000Z"
}
```

Chunk row (when chunkOutput is on)

```json
{
  "url": "https://docs.apify.com/academy/scraping-basics-javascript",
  "title": "Web scraping basics for JavaScript devs",
  "chunkIndex": 0,
  "totalChunks": 4,
  "markdown": "First 1000 token slice of the page...",
  "tokens": { "estGpt": 998, "chars": 3992 },
  "metadata": { "..." }
}
```

File row (when downloadFiles is on)

```json
{
  "url": "https://docs.example.com/whitepaper.pdf",
  "kind": "file",
  "extension": "pdf",
  "sizeBytes": 482194,
  "keyValueStoreKey": "https___docs_example_com_whitepaper_pdf-1714053000000.pdf"
}
```
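
To pull a downloaded file back out, read the record from the run's default key-value store. A sketch with apify-client; the store ID is a placeholder, and the key comes from the file row above:

```ts
import { writeFileSync } from 'node:fs';
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// run.defaultKeyValueStoreId identifies the store; keyValueStoreKey comes from the file row.
const record = await client
  .keyValueStore('YOUR_RUN_DEFAULT_KVS_ID')
  .getRecord('https___docs_example_com_whitepaper_pdf-1714053000000.pdf', { buffer: true });

if (record) writeFileSync('whitepaper.pdf', record.value as Buffer);
```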

Who uses this

| Role | Use case |
| --- | --- |
| AI engineer | Index docs, knowledge bases, and blogs into a RAG pipeline. Use chunk output to skip a chunking step. |
| LLM app team | Convert customer documentation into Markdown for prompt context or fine-tuning datasets. |
| Vector database operator | Pipe each chunk row straight into Pinecone, Qdrant, Weaviate, or pgvector. |
| Content team | Mirror an old website into clean Markdown for migration to a new CMS. |
| Compliance team | Redact PII at ingest time with redactPII: true. No post-processing on your side. |
| Researcher | Pull every page from a site with metadata, then run analysis on the link graph. |

Input reference

| Field | Type | What it does |
| --- | --- | --- |
| startUrls | string[] | Required. Entry URLs for the crawl. |
| crawlerType | enum | adaptive, playwright, or cheerio. |
| maxPages | integer | Hard cap across all start URLs. 0 means unlimited. |
| maxDepth | integer | Link hops from the start URL. 0 means seed only. |
| useSitemap | boolean | Auto-discover sitemap.xml and robots.txt. |
| respectRobotsTxt | boolean | Skip URLs disallowed by robots.txt. |
| includeUrlPatterns | string[] | Glob patterns. Pages must match at least one. |
| excludeUrlPatterns | string[] | Glob patterns. Pages matching any are skipped. |
| stayOnDomain | boolean | Stay on the registrable domain of the start URL. |
| stayOnSubdomain | boolean | Stricter than stayOnDomain. Same hostname only. |
| removeFluff | boolean | Strip nav, footer, ads, and modals before extracting. |
| extractor | enum | auto, readability, main, or body. |
| outputFormats | string[] | Any of markdown, text, html. |
| minContentLength | integer | Drop pages below this many characters. |
| chunkOutput | boolean | Split pages into RAG chunks and push one row per chunk. |
| chunkSize | integer | Target tokens per chunk. |
| chunkOverlap | integer | Tokens of overlap between consecutive chunks. |
| redactPII | boolean | Redact emails, phones, SSNs, and IBANs before output. |
| extractMetadata | boolean | Pull JSON-LD, Open Graph, author, and publish dates. |
| extractLinks | boolean | Per-row link graph counts and 25 samples. |
| infiniteScroll | boolean | Staged scrolling to render lazy content. Playwright only. |
| waitForSelector | string | Wait for a CSS selector before extraction. Playwright only. |
| cookies | object[] | Cookies to set for pages behind a login. |
| downloadFiles | boolean | Save linked PDF, DOC, and XLS files to the key-value store. |
| concurrency | integer | Pages processed in parallel. |
| proxyConfiguration | object | Apify Proxy settings. Datacenter is fine for most sites. |
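
Putting several of those fields together, a fuller input might look like this (values are illustrative; the proxyConfiguration shape follows the usual Apify convention):

```json
{
  "startUrls": ["https://docs.example.com/"],
  "crawlerType": "adaptive",
  "maxPages": 1000,
  "maxDepth": 10,
  "useSitemap": true,
  "respectRobotsTxt": true,
  "excludeUrlPatterns": ["**/changelog/**"],
  "stayOnDomain": true,
  "removeFluff": true,
  "outputFormats": ["markdown", "text"],
  "minContentLength": 200,
  "chunkOutput": true,
  "chunkSize": 1000,
  "chunkOverlap": 100,
  "extractMetadata": true,
  "extractLinks": true,
  "concurrency": 10,
  "proxyConfiguration": { "useApifyProxy": true }
}
```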

API call

```bash
curl -X POST \
  "https://api.apify.com/v2/acts/YOUR_USER~website-content-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": ["https://docs.example.com/"],
    "maxPages": 500,
    "chunkOutput": true,
    "chunkSize": 1000
  }'
```
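
The same call from Node, sketched with the official apify-client package (the actor ID is the same placeholder as above):

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the run and wait for it to finish.
const run = await client.actor('YOUR_USER~website-content-crawler').call({
  startUrls: ['https://docs.example.com/'],
  maxPages: 500,
  chunkOutput: true,
  chunkSize: 1000,
});

// Page the results out of the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} rows`);
```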

Pricing

The first few rows per run are free so you can validate output before paying. After that, one charge per dataset row pushed. Auto chunking, token estimation, link graph, PII redaction, and metadata extraction are all included at no extra cost. File downloads count as one row each.


FAQ

Why is this better than the official Website Content Crawler?

This actor races three extractors and tags the winner per row, ships token estimates on every row, auto chunks for RAG with a single toggle, redacts PII at the source, and adds a link graph (internal vs external counts plus samples) without extra config.

Will this actor scrape JavaScript-heavy sites?

Yes. Set crawlerType to playwright or leave it on adaptive. The browser pool ships fingerprinted Chrome with anti-detection patches. Use infiniteScroll: true for sites that load content as you scroll, and waitForSelector to wait for a specific element before extraction.

How accurate is the token count?

Token counts use an estimate of 4 characters per token for prose and 3 characters per token for fenced code blocks, calibrated against GPT and Claude tokenizers. Real tokenizer counts are typically within 5 to 10 percent on English content. Set chunkSize slightly under your model's limit to leave headroom.
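
A rough re-implementation of that heuristic, if you want to sanity-check counts on your side (the actor's exact calibration may differ):

```ts
// Estimate tokens: ~4 chars/token for prose, ~3 chars/token for fenced code.
function estimateTokens(markdown: string): number {
  const codeBlocks = markdown.match(/```[\s\S]*?```/g) ?? [];
  const codeChars = codeBlocks.join('').length;
  const proseChars = markdown.length - codeChars;
  return Math.ceil(proseChars / 4 + codeChars / 3);
}
```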

Does the chunk splitter respect paragraph boundaries?

Yes. The splitter walks paragraphs and packs them into chunks until the token budget is reached. Long paragraphs that exceed the chunk size are split at sentence boundaries. Adjacent chunks share chunkOverlap tokens for context continuity during retrieval.
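A simplified sketch of that packing strategy (the sentence-level fallback for oversized paragraphs is omitted here):

```ts
// Pack paragraphs greedily up to chunkSize tokens, carrying an overlap tail forward.
function chunkByParagraph(md: string, chunkSize = 1000, overlap = 100): string[] {
  const estTokens = (s: string) => Math.ceil(s.length / 4); // prose heuristic from above
  const paragraphs = md.split(/\n{2,}/);
  const chunks: string[] = [];
  let current: string[] = [];

  for (const para of paragraphs) {
    if (current.length && estTokens(current.join('\n\n')) + estTokens(para) > chunkSize) {
      chunks.push(current.join('\n\n'));
      // Keep trailing paragraphs worth ~overlap tokens as shared context.
      while (current.length > 1 && estTokens(current.join('\n\n')) > overlap) current.shift();
    }
    current.push(para);
  }
  if (current.length) chunks.push(current.join('\n\n'));
  return chunks;
}
```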

How does PII redaction work?

Set redactPII: true and emails, phone numbers, US Social Security numbers, and IBAN bank account numbers are replaced with [REDACTED_*] tokens before output. This applies to both the Markdown and plain text fields, which makes it useful for GDPR-safe RAG indexing of customer support content.
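
Illustrative patterns for what gets caught; the actor's actual regexes are not published, so treat these as approximations:

```ts
const PII_PATTERNS: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+\.[\w.-]+/g, '[REDACTED_EMAIL]'],
  [/\b\d{3}-\d{2}-\d{4}\b/g, '[REDACTED_SSN]'], // US SSN
  [/\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/g, '[REDACTED_IBAN]'],
  [/\+?\d[\d\s().-]{7,}\d/g, '[REDACTED_PHONE]'],
];

const redactPII = (text: string): string =>
  PII_PATTERNS.reduce((out, [pattern, tag]) => out.replace(pattern, tag), text);
```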

Can I crawl pages behind a login?

Yes. Pass authentication cookies in the cookies field. Format is an array of {name, value, domain} objects. The crawler sets these on every browser context before navigating.
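
For example (the cookie name and domain are illustrative):

```json
{
  "startUrls": ["https://app.example.com/docs"],
  "cookies": [
    { "name": "session", "value": "YOUR_SESSION_COOKIE", "domain": ".example.com" }
  ]
}
```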

Does it download PDF files for indexing?

Yes. Set downloadFiles: true and choose extensions in downloadFileTypes. PDFs, DOC, DOCX, XLS, XLSX, and CSV files are saved to the key value store with one dataset row per file pointing at the storage key.

Can I run this on a schedule?

Yes. Use the Apify scheduler for hourly, daily, or weekly runs. Combine with a sitemap to capture only new pages, or run a full crawl on a fixed cadence to refresh your vector database.

Is the data in the dataset compatible with LangChain or LlamaIndex?

Yes. The Markdown output, page URL, and metadata fields map directly to LangChain Document and LlamaIndex Node schemas. Use the Apify dataset reader from either framework, or pull the dataset via API and feed your own pipeline.
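
A sketch of that mapping in TypeScript; the dataset ID is a placeholder, and the field names follow the sample output above:

```ts
import { ApifyClient } from 'apify-client';
import { Document } from '@langchain/core/documents';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const { items } = await client.dataset('YOUR_DATASET_ID').listItems();

// One LangChain Document per page row; metadata rides along for retrieval filters.
const docs = items.map((row: any) => new Document({
  pageContent: row.markdown,
  metadata: { source: row.url, title: row.title, ...row.metadata },
}));
```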


Related actors

- TripAdvisor Property Rank Tracker — daily rank, rating, and competitor signals for hotels and restaurants
- LinkedIn Profile & Company Post Tracker — public LinkedIn posts without a cookie
- LinkedIn Hiring Tracker & Salary Intelligence — parsed salary, tech stack, seniority on every job row
- Google Maps Scraper — local business data with reviews
- Reddit Brand Monitor & Lead Finder — subreddit mentions and high intent leads