Fast Website to Markdown & RAG JSONL Crawler
Pricing
$1.00 / 1,000 page records
Paste a homepage or sitemap and get clean Markdown, metadata, JSONL chunks, and source URLs for RAG at a low per-page price.
Developer
Orbiscribe Labs
Maintained by Community
Turn public websites into clean Markdown and RAG-ready JSONL without wrestling with a broad web crawler. Paste a homepage or sitemap, set a page limit, and get structured page records plus chunk rows with stable source metadata for retrieval systems. Each page record includes metadata, headings, links, JSON-LD, visible emails and phones, extraction-quality stats, and text chunks.
Use it for documentation sites, help centers, public knowledge bases, competitor content research, SEO audits, and AI search or chatbot projects.
Why use this instead of a generic crawler?
Use Apify's broad crawlers when you need maximum crawling flexibility. Use this Actor when you want the shortest path from public URLs to an embeddings-ready dataset:
- paste a homepage or `sitemap.xml`
- auto-discover `/sitemap.xml` in the default mode
- output chunk-level JSONL by default
- keep full page records in key-value storage
- include source URLs, canonical URLs, headings, content stats, and crawl source
- pay a predictable per-extracted-page price
The goal is not to expose every crawler knob. The goal is to make the common RAG ingestion run obvious enough that a first run succeeds without tuning.
Run This First
Start with a tiny docs crawl so you can inspect chunk quality before scaling:

    {
      "startUrls": [{ "url": "https://docs.apify.com/" }],
      "sitemapUrls": [],
      "crawlStrategy": "auto",
      "maxPages": 5,
      "maxDepth": 1,
      "sameDomainOnly": true,
      "respectRobotsTxt": true,
      "includeMarkdown": true,
      "includeHtml": false,
      "datasetOutputMode": "chunks",
      "chunkSizeChars": 1200,
      "chunkOverlapChars": 120,
      "dryRun": false
    }

Look first at chunkId, text, canonicalUrl, title, crawlSource, and
the RAG_CHUNKS_JSONL key-value output. A practical workflow recipe is in
docs/workflow-recipes/website-rag-dataset-pipeline.md in the GitHub
repository.
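
To make that first inspection concrete, here is a minimal Python sketch that parses JSONL text and prints a one-line summary per chunk. It assumes you have already downloaded the RAG_CHUNKS_JSONL record as a string; the sample row is hypothetical and only mirrors the field names listed above.

```python
import json

def read_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    # Split on "\n" only, so unusual Unicode line separators inside a
    # JSON string cannot break a record in two.
    return [json.loads(line) for line in text.split("\n") if line.strip()]

def preview(rows, limit=3, width=60):
    """Return short one-line summaries for the first few chunk rows."""
    lines = []
    for row in rows[:limit]:
        snippet = " ".join(row.get("text", "").split())[:width]
        lines.append(f'{row.get("chunkId")} | {row.get("title")} | {snippet}')
    return lines

# Hypothetical sample row; real rows may carry more fields.
sample = json.dumps({
    "chunkId": "https://example.com/#chunk-0",
    "text": "# Example Domain\n\nThis domain is for use in illustrative examples.",
    "canonicalUrl": "https://example.com/",
    "title": "Example Domain",
    "crawlSource": "startUrl",
}) + "\n"

rows = read_jsonl(sample)
for line in preview(rows):
    print(line)
```

If the snippets look like navigation text or cookie banners rather than page content, that is a sign to check the extraction before raising maxPages.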
What does this website crawler do?
Website to RAG Dataset Crawler starts from one or more URLs or sitemaps, follows
internal links up to your limits, and extracts useful page content instead of
dumping raw HTML. In auto mode, it tries /sitemap.xml for each start domain
so docs/help-center crawls are more complete with less setup. The output is
designed to be easy to export from Apify and load into a database, spreadsheet,
vector store, LangChain/LlamaIndex pipeline, or internal research workflow.
It does not require an LLM API key. The extraction is deterministic and keeps costs predictable.
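
The sitemap lookup described above can be approximated offline. The sketch below is not the Actor's implementation; it only shows, under the assumption that the sitemap follows the standard sitemaps.org schema, how a default sitemap URL can be derived from a start URL and how page URLs can be read from sitemap XML. The function names and sample XML are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def default_sitemap_url(start_url):
    """Return the /sitemap.xml URL on the same scheme and host as start_url."""
    parts = urlsplit(start_url)
    return urlunsplit((parts.scheme, parts.netloc, "/sitemap.xml", "", ""))

def sitemap_locs(xml_text):
    """Extract every <loc> value from sitemap XML text."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]

# Hypothetical sitemap content for illustration.
sample_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://docs.example.com/</loc></url>"
    "<url><loc>https://docs.example.com/guide</loc></url>"
    "</urlset>"
)

print(default_sitemap_url("https://docs.example.com/guide/intro?x=1"))
print(sitemap_locs(sample_xml))
```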
What data can you extract?
| Field | Description |
|---|---|
| `url` | Final crawled page URL |
| `canonicalUrl` | Canonical URL when present |
| `title` | Page title |
| `metaDescription` | Meta description |
| `headings` | H1-H6 heading structure |
| `mainText` | Clean readable text |
| `markdown` | Markdown version of the main content |
| `links` | Internal and external links found on the page |
| `jsonLd` | JSON-LD/schema.org blocks |
| `emails` | Email addresses visibly present on the page |
| `phones` | Phone numbers visibly present on the page |
| `chunks` | Text chunks with character count and token estimate |
| `crawlSource` | Whether a page came from a start URL, sitemap, or discovered sitemap |
| `RAG_CHUNKS_JSONL` | Key-value output with one JSONL record per chunk |
| `RAG_CHUNKS` | Key-value output with the same chunk records as JSON |
| `MARKDOWN_BUNDLE` | One Markdown document combining all extracted pages |
| `URL_INVENTORY` | Compact page inventory with URL, title, depth, word count, and chunk count |
| `BUYER_BRIEF` | Short run brief for reviewing crawl coverage and extraction quality |
| `wordCount` | Approximate word count of extracted readable text |
| `markdownLength` | Character length of generated Markdown |
| `linkCount` | Number of unique links included in the record |
| `headingCount` | Number of extracted H1-H3 headings |
| `chunkCount` | Number of generated text chunks |
| `extractionMethod` | Content root used, such as article, main, or body |
| `depth` | Crawl depth from the start URL |
By default, the dataset contains one row per chunk because that is what most
embedding and vector-database imports expect. Full page records are also stored
in PAGE_RECORDS. Set datasetOutputMode to pages if you prefer one dataset
row per crawled page, or both if you want both shapes in the dataset.
Quick start
- Add one or more start URLs.
- Leave `crawlStrategy` on `auto` unless you know you want links-only or sitemap-only.
- Add `sitemapUrls` when you already know the right sitemap.
- Set `maxPages` before the first run.
- Keep `sameDomainOnly` enabled unless you want to follow external links.
- Keep `respectRobotsTxt` enabled for normal public-site crawling.
- Start with `maxPages: 10`, inspect the output, then scale.
For a docs site, use the docs homepage and let auto mode check the sitemap, or
paste the sitemap directly into sitemapUrls. For a single-page extraction, set
crawlStrategy: "linksOnly" and maxDepth: 0.
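
As an illustration of the single-page case, the following Python sketch builds a run input from a fixed list of URLs. The helper name and the example URL are hypothetical, and capping maxPages at the number of listed URLs is an assumption of this sketch rather than a documented requirement; the field names come from the input examples in this README.

```python
import json

def single_page_input(urls):
    """Build a run input that extracts only the listed URLs, with no link following."""
    return {
        "startUrls": [{"url": url} for url in urls],
        "sitemapUrls": [],
        "crawlStrategy": "linksOnly",
        "maxPages": len(urls),  # assumption: cap the run at the number of listed URLs
        "maxDepth": 0,          # do not follow links found on the pages
        "sameDomainOnly": True,
        "respectRobotsTxt": True,
        "datasetOutputMode": "chunks",
    }

# Hypothetical URL for illustration.
run_input = single_page_input(["https://docs.example.com/getting-started"])
print(json.dumps(run_input, indent=2))
```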
Use With n8n, Make, or Zapier
Run the Actor with wait-for-finish enabled, then read the default dataset items
or the RAG_CHUNKS_JSONL key-value output.
Typical workflow:
- Trigger from a new website URL, docs URL, or scheduled refresh.
- Run this Actor with a small `maxPages` limit.
- Send chunk rows to your vector database, spreadsheet, or agent knowledge store.
- Store `canonicalUrl`, `title`, and `chunkId` so answers can cite sources.
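
The last two workflow steps can be sketched in Python as a small mapping step between the Actor output and a vector store. This is a sketch under assumptions: the record shape (`id`, `text`, `metadata`) is a generic, hypothetical target format, not the schema of any particular database, and the fallback to a `url` field assumes chunk rows carry one.

```python
def to_vector_records(chunk_rows):
    """Map chunk rows to generic records, keeping the fields needed for citations."""
    records = []
    seen = set()
    for row in chunk_rows:
        chunk_id = row.get("chunkId")
        text = (row.get("text") or "").strip()
        # Skip rows that cannot be stored or cited, and duplicate chunk ids.
        if not chunk_id or not text or chunk_id in seen:
            continue
        seen.add(chunk_id)
        source = row.get("canonicalUrl") or row.get("url")  # assumed fallback field
        records.append({
            "id": chunk_id,
            "text": text,
            "metadata": {"source": source, "title": row.get("title")},
        })
    return records

def format_citation(record):
    """Render a short source line an answer can append."""
    meta = record["metadata"]
    return f'{meta["title"]} ({meta["source"]})'

# Hypothetical chunk rows for illustration.
rows = [
    {"chunkId": "https://example.com/#chunk-0", "text": "Example text.",
     "canonicalUrl": "https://example.com/", "title": "Example Domain"},
    {"chunkId": "https://example.com/#chunk-0", "text": "Duplicate row.",
     "canonicalUrl": "https://example.com/", "title": "Example Domain"},
    {"chunkId": "https://example.com/#chunk-1", "text": "   ",
     "canonicalUrl": "https://example.com/", "title": "Example Domain"},
]

records = to_vector_records(rows)
print(format_citation(records[0]))
```

Using chunkId as the record id makes a scheduled refresh overwrite existing entries instead of piling up duplicates, provided your store supports upserts by id.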
Input example

    {
      "startUrls": [{ "url": "https://docs.example.com/" }],
      "sitemapUrls": ["https://docs.example.com/sitemap.xml"],
      "crawlStrategy": "auto",
      "maxPages": 25,
      "maxDepth": 2,
      "sameDomainOnly": true,
      "respectRobotsTxt": true,
      "includeMarkdown": true,
      "includeHtml": false,
      "datasetOutputMode": "chunks",
      "chunkSizeChars": 2500,
      "chunkOverlapChars": 250
    }

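
To build intuition for how chunkSizeChars and chunkOverlapChars interact, here is a minimal Python sketch of fixed-size character chunking with overlap. It is not the Actor's actual chunking algorithm, which may respect headings or sentence boundaries. The `chunkId` pattern only mirrors the shape shown in this README's output example, and the roughly four characters per token estimate is an assumed heuristic.

```python
import math

def chunk_text(text, size=1200, overlap=120):
    """Split text into windows of `size` chars, each starting `size - overlap` after the last."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

def to_chunk_rows(url, text, size=1200, overlap=120):
    """Wrap chunks in records shaped like the chunk objects in the output example."""
    return [
        {
            "chunkId": f"{url}#chunk-{i}",
            "text": piece,
            "charCount": len(piece),
            "tokenEstimate": math.ceil(len(piece) / 4),  # assumed heuristic
        }
        for i, piece in enumerate(chunk_text(text, size, overlap))
    ]

# Synthetic 3,000-character text so the overlap is easy to verify.
text = "".join(str(i % 10) for i in range(3000))
rows = to_chunk_rows("https://docs.example.com/", text)
print([row["charCount"] for row in rows])
```

With these settings each chunk repeats the last 120 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.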
Output example

    {
      "url": "https://example.com/",
      "canonicalUrl": "https://example.com/",
      "title": "Example Domain",
      "metaDescription": "Example page used for documentation and tests.",
      "headings": [
        { "level": 1, "text": "Example Domain" }
      ],
      "mainText": "Example Domain This domain is for use in illustrative examples in documents.",
      "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
      "links": [
        {
          "url": "https://www.iana.org/domains/example",
          "text": "More information",
          "internal": false
        }
      ],
      "jsonLd": [],
      "emails": [],
      "phones": [],
      "chunks": [
        {
          "chunkId": "https://example.com/#chunk-0",
          "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
          "charCount": 77,
          "tokenEstimate": 20
        }
      ],
      "wordCount": 12,
      "markdownLength": 77,
      "linkCount": 1,
      "headingCount": 1,
      "chunkCount": 1,
      "extractionMethod": "main",
      "crawlSource": "startUrl",
      "depth": 0
    }

Pricing
This Actor uses pay-per-event pricing. Dry-run examples are not charged. Apify free-plan users get the first 25 page records without this Actor's custom event charge; after that, normal pay-per-event pricing and the user's run spending limit apply.
| Event | Price | What counts |
|---|---|---|
| `page-record` | $0.001 | One crawled page with extracted text, metadata, links, and chunks |
That is $1 per 1,000 emitted page records, plus normal Apify platform usage.
Use maxPages, maxDepth, and sameDomainOnly to control cost.
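
For budgeting, the event charge can be estimated with simple arithmetic. The Python sketch below covers only the page-record event and excludes normal Apify platform usage. It treats the 25 free page records as a single allowance subtracted from the total you pass in, which is an assumption, since how that allowance is applied across runs is not spelled out here.

```python
from decimal import Decimal

PRICE_PER_PAGE_RECORD = Decimal("0.001")  # USD, from the pricing table
FREE_PLAN_PAGE_RECORDS = 25               # free-plan allowance stated above

def estimate_event_cost(page_records, free_plan=False):
    """Estimate the page-record event charge in USD (platform usage not included)."""
    if page_records < 0:
        raise ValueError("page_records must be non-negative")
    billable = page_records
    if free_plan:
        # Assumption: the allowance is applied once to this total.
        billable = max(0, page_records - FREE_PLAN_PAGE_RECORDS)
    return billable * PRICE_PER_PAGE_RECORD

print(estimate_event_cost(1000))                 # paid plan, 1,000 page records
print(estimate_event_cost(100, free_plan=True))  # free plan, 100 page records
```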
Tips for better crawls
- Start small. A `maxPages: 10` run usually tells you whether the site structure works.
- Use `maxDepth: 0` for a fixed list of URLs.
- Use `crawlStrategy: "sitemapOnly"` when a docs site has a clean sitemap and you do not want link discovery noise.
- Use `sameDomainOnly: true` to avoid crawling unrelated sites.
- Set `includeHtml: false` unless you need source HTML.
- Shorter chunks are easier to embed; longer chunks keep more context.
- Use `RAG_CHUNKS_JSONL` when your downstream pipeline wants one JSON object per line for embeddings or batch import.
- Use `datasetOutputMode: "pages"` when you want a spreadsheet-style page inventory instead of chunk rows.
- Check `wordCount`, `markdownLength`, and `extractionMethod` to spot thin or poorly structured pages.
- Failed URLs are recorded in `RUN_SUMMARY`.
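
The quality check in the tips above can be automated once you have page records loaded as dictionaries. This Python sketch flags pages worth a manual look. The 50-word threshold is a hypothetical starting point, and treating an `extractionMethod` of `body` as a sign of weak structure is an assumption of this sketch, not documented Actor behavior.

```python
def flag_thin_pages(page_records, min_words=50):
    """Return (url, reasons) pairs for pages that look thin or poorly structured."""
    flagged = []
    for page in page_records:
        reasons = []
        if page.get("wordCount", 0) < min_words:
            reasons.append("low wordCount")
        if page.get("extractionMethod") == "body":
            # Assumption: no article or main root was found on the page.
            reasons.append("content root is body")
        if page.get("chunkCount", 0) == 0:
            reasons.append("no chunks")
        if reasons:
            flagged.append((page.get("url"), reasons))
    return flagged

# Hypothetical page records for illustration.
pages = [
    {"url": "https://docs.example.com/", "wordCount": 640,
     "extractionMethod": "main", "chunkCount": 4},
    {"url": "https://docs.example.com/search", "wordCount": 12,
     "extractionMethod": "body", "chunkCount": 1},
]

for url, reasons in flag_thin_pages(pages):
    print(url, "->", ", ".join(reasons))
```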
Limits and compliance
This Actor crawls public pages reachable from user-supplied URLs. It does not log in, bypass paywalls, solve CAPTCHAs, or access private systems.
The respectRobotsTxt option applies User-agent: * disallow rules on a
best-effort basis for start domains, using a short timeout when fetching
robots.txt. Buyers are responsible for checking site terms and permitted use
of crawled content.
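
If you want to pre-check a URL against a site's rules yourself, Python's standard library includes a robots.txt parser. The sketch below is independent of the Actor and only illustrates how `User-agent: *` disallow rules are evaluated; the robots.txt content and URLs are hypothetical, and fetching the real file is left out.

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical robots.txt content for illustration.
robots_txt = """User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(robots_txt, "https://example.com/docs/intro"))
print(allowed_by_robots(robots_txt, "https://example.com/private/report"))
```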