# Website to Markdown and RAG Dataset Crawler

Crawl public websites into clean Markdown, text, metadata, links, JSON-LD, and chunks for RAG and knowledge bases.
Turn public websites into clean Markdown and RAG-ready datasets. This Actor crawls user-supplied URLs, extracts readable page content, and saves structured page records plus a JSONL chunk export with metadata for retrieval systems. Each page record includes metadata, headings, links, JSON-LD, visible emails and phones, extraction-quality stats, and text chunks.
Use it for documentation sites, help centers, public knowledge bases, competitor content research, SEO audits, and AI search or chatbot projects.
## Run This First
Start with a tiny docs crawl so you can inspect chunk quality before scaling:
{"startUrls": [{ "url": "https://docs.apify.com/academy/getting-started" }],"maxPages": 3,"maxDepth": 1,"sameDomainOnly": true,"respectRobotsTxt": true,"includeMarkdown": true,"includeHtml": false,"datasetOutputMode": "chunks","chunkSizeChars": 1200,"chunkOverlapChars": 120,"dryRun": false}
Look first at `chunkId`, `text`, `canonicalUrl`, `title`, `headingPath`, and the `RAG_CHUNKS_JSONL` key-value output. A practical workflow recipe is in `docs/workflow-recipes/website-rag-dataset-pipeline.md` in the GitHub repository.
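If you prefer to trigger that first run from code, here is a minimal sketch using the Apify Python client. The token and Actor ID placeholders are yours to fill in; the field names follow the output table below.

```python
from apify_client import ApifyClient

# Placeholders: your Apify API token and this Actor's ID from its store page.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Start the run and wait for it to finish.
run = client.actor("<ACTOR_ID>").call(run_input={
    "startUrls": [{"url": "https://docs.apify.com/academy/getting-started"}],
    "maxPages": 3,
    "maxDepth": 1,
    "sameDomainOnly": True,
    "respectRobotsTxt": True,
    "datasetOutputMode": "chunks",
    "chunkSizeChars": 1200,
    "chunkOverlapChars": 120,
})

# In "chunks" mode, each dataset item is one chunk record.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["chunkId"], item.get("headingPath"), item["text"][:80])
```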
## What does this website crawler do?
Website to Markdown and RAG Dataset Crawler starts from one or more URLs, follows internal links up to your limits, and extracts useful page content instead of dumping raw HTML. The output is designed to be easy to export from Apify and load into a database, spreadsheet, vector store, LangChain/LlamaIndex pipeline, or internal research workflow.
It does not require an LLM API key. The extraction is deterministic and keeps costs predictable.
## What data can you extract?
| Field | Description |
|---|---|
| `url` | Final crawled page URL |
| `canonicalUrl` | Canonical URL when present |
| `title` | Page title |
| `metaDescription` | Meta description |
| `headings` | H1-H6 heading structure |
| `mainText` | Clean readable text |
| `markdown` | Markdown version of the main content |
| `links` | Internal and external links found on the page |
| `jsonLd` | JSON-LD/schema.org blocks |
| `emails` | Email addresses visibly present on the page |
| `phones` | Phone numbers visibly present on the page |
| `chunks` | Text chunks with character count and token estimate |
| `RAG_CHUNKS_JSONL` | Key-value output with one JSONL record per chunk |
| `RAG_CHUNKS` | Key-value output with the same chunk records as JSON |
| `MARKDOWN_BUNDLE` | One Markdown document combining all extracted pages |
| `URL_INVENTORY` | Compact page inventory with URL, title, depth, word count, and chunk count |
| `BUYER_BRIEF` | Short run brief for reviewing crawl coverage and extraction quality |
| `wordCount` | Approximate word count of extracted readable text |
| `markdownLength` | Character length of generated Markdown |
| `linkCount` | Number of unique links included in the record |
| `headingCount` | Number of extracted H1-H3 headings |
| `chunkCount` | Number of generated text chunks |
| `extractionMethod` | Content root used, such as `article`, `main`, or `body` |
| `depth` | Crawl depth from the start URL |
By default, the dataset contains one row per chunk because that is what most embedding and vector-database imports expect. Full page records are also stored in `PAGE_RECORDS`. Set `datasetOutputMode` to `pages` if you prefer one dataset row per crawled page, or `both` if you want both shapes in the dataset.
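As an illustration, here is a short sketch (assuming the Apify Python client and that `PAGE_RECORDS` holds a JSON array of page records) that fetches the full page records regardless of the dataset mode:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
run = client.actor("<ACTOR_ID>").call(run_input={
    "startUrls": [{"url": "https://example.com/"}],
    "maxPages": 5,
})

# Full page records live in the run's key-value store under PAGE_RECORDS,
# independent of datasetOutputMode.
store = client.key_value_store(run["defaultKeyValueStoreId"])
record = store.get_record("PAGE_RECORDS")
for page in (record["value"] if record else []):
    print(page["url"], page["wordCount"], page["chunkCount"])
```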
## Quick start
- Add one or more start URLs.
- Set `maxPages` and `maxDepth` before the first run.
- Keep `sameDomainOnly` enabled unless you want to follow external links.
- Keep `respectRobotsTxt` enabled for normal public-site crawling.
- Start with `maxPages: 10`, inspect the output, then scale.
For a docs site, use the docs homepage or sitemap page as the start URL. For a single-page extraction, set `maxDepth: 0`.
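For example, a fixed-list input for single-page extraction might look like this (the URLs are illustrative):

```json
{
  "startUrls": [
    { "url": "https://example.com/pricing" },
    { "url": "https://example.com/docs/faq" }
  ],
  "maxDepth": 0,
  "maxPages": 2,
  "sameDomainOnly": true,
  "datasetOutputMode": "chunks"
}
```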
## Use With n8n, Make, or Zapier
Run the Actor with wait-for-finish enabled, then read the default dataset items or the `RAG_CHUNKS_JSONL` key-value output.
Typical workflow:

- Trigger from a new website URL, docs URL, or scheduled refresh.
- Run this Actor with a small `maxPages` limit (see the sketch after this list).
- Send chunk rows to your vector database, spreadsheet, or agent knowledge store.
- Store `canonicalUrl`, `title`, and `chunkId` so answers can cite sources.
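The same pattern works from any HTTP step or plain script. Here is a minimal sketch against Apify's generic run-sync-get-dataset-items endpoint; the token and Actor ID placeholders are yours, and the start URL is illustrative.

```python
import requests

# Starts a run, waits for it to finish, and returns the default
# dataset items (chunk rows in "chunks" mode) in one response.
url = (
    "https://api.apify.com/v2/acts/<ACTOR_ID>"
    "/run-sync-get-dataset-items?token=<YOUR_APIFY_TOKEN>"
)
resp = requests.post(url, json={
    "startUrls": [{"url": "https://example.com/docs"}],
    "maxPages": 10,
    "datasetOutputMode": "chunks",
}, timeout=300)
resp.raise_for_status()

for chunk in resp.json():
    # Keep the citation fields alongside the text for your knowledge store.
    print(chunk["chunkId"], chunk.get("canonicalUrl"), chunk.get("title"))
```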
## Input example
{"startUrls": [{ "url": "https://example.com/" }],"maxPages": 25,"maxDepth": 2,"sameDomainOnly": true,"respectRobotsTxt": true,"includeMarkdown": true,"includeHtml": false,"datasetOutputMode": "chunks","chunkSizeChars": 2500,"chunkOverlapChars": 250}
## Output example
{"url": "https://example.com/","canonicalUrl": "https://example.com/","title": "Example Domain","metaDescription": "Example page used for documentation and tests.","headings": [{"level": 1,"text": "Example Domain"}],"mainText": "Example Domain This domain is for use in illustrative examples in documents.","markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.","links": [{"url": "https://www.iana.org/domains/example","text": "More information","internal": false}],"jsonLd": [],"emails": [],"phones": [],"chunks": [{"chunkId": "https://example.com/#chunk-0","text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.","charCount": 77,"tokenEstimate": 20}],"wordCount": 12,"markdownLength": 77,"linkCount": 1,"headingCount": 1,"chunkCount": 1,"extractionMethod": "main","depth": 0}
## Pricing
This Actor uses pay-per-event pricing. Dry-run examples are not charged. Apify free-plan users get the first 25 page records without this Actor's custom event charge; after that, normal pay-per-event pricing and the user's run spending limit apply.
| Event | Price | What counts |
|---|---|---|
| `page-record` | $0.003 | One crawled page with extracted text, metadata, links, and chunks |
That is $3 per 1,000 emitted page records, plus normal Apify platform usage.
Use `maxPages`, `maxDepth`, and `sameDomainOnly` to control cost.
## Tips for better crawls
- Start small. A `maxPages: 10` run usually tells you whether the site structure works.
- Use `maxDepth: 0` for a fixed list of URLs.
- Use `sameDomainOnly: true` to avoid crawling unrelated sites.
- Set `includeHtml: false` unless you need source HTML.
- Shorter chunks are easier to embed; longer chunks keep more context.
- Use `RAG_CHUNKS_JSONL` when your downstream pipeline wants one JSON object per line for embeddings or batch import (see the sketch after this list).
- Use `datasetOutputMode: "pages"` when you want a spreadsheet-style page inventory instead of chunk rows.
- Check `wordCount`, `markdownLength`, and `extractionMethod` to spot thin or poorly structured pages.
- Failed URLs are recorded in `RUN_SUMMARY`.
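As a sketch of that JSONL path (assuming the Apify Python client and that the record's value arrives as JSONL text with the chunk fields shown in the output example):

```python
import json

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
run = client.actor("<ACTOR_ID>").call(run_input={
    "startUrls": [{"url": "https://example.com/docs"}],
    "maxPages": 10,
    "datasetOutputMode": "chunks",
})

store = client.key_value_store(run["defaultKeyValueStoreId"])
record = store.get_record("RAG_CHUNKS_JSONL")

# One JSON object per line; shape each into an embedding-ready payload.
chunks = [json.loads(line) for line in record["value"].splitlines() if line.strip()]
payloads = [
    {
        "id": c["chunkId"],
        "text": c["text"],
        "metadata": {"source": c.get("canonicalUrl"), "title": c.get("title")},
    }
    for c in chunks
]
print(f"{len(payloads)} chunks ready for embedding")
```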
## Limits and compliance
This Actor crawls public pages reachable from user-supplied URLs. It does not log in, bypass paywalls, solve CAPTCHAs, or access private systems.
The `respectRobotsTxt` option applies the `User-agent: *` disallow rules for start domains on a best-effort basis, using a short timeout when fetching robots.txt. Buyers are responsible for checking site terms and permitted use of crawled content.