Pricing

$1.00 / 1,000 page records

Fast Website to Markdown & RAG JSONL Crawler

Paste a homepage or sitemap and get clean Markdown, metadata, JSONL chunks, and source URLs for RAG at a low per-page price.

Developer

Orbiscribe Labs

Why use this instead of a generic crawler?

Use Apify's broad crawlers when you need maximum crawling flexibility. Use this Actor when you want the shortest path from public URLs to an embeddings-ready dataset:

paste a homepage or sitemap.xml
auto-discover /sitemap.xml in the default mode
output chunk-level JSONL by default
keep full page records in key-value storage
include source URLs, canonical URLs, headings, content stats, and crawl source
pay a predictable per-extracted-page price

The goal is not to expose every crawler knob. The goal is to make the common RAG ingestion run obvious enough that a first run succeeds without tuning.

Run This First

Start with a tiny docs crawl so you can inspect chunk quality before scaling:

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "sitemapUrls": [],
  "crawlStrategy": "auto",
  "maxPages": 5,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "respectRobotsTxt": true,
  "includeMarkdown": true,
  "includeHtml": false,
  "datasetOutputMode": "chunks",
  "chunkSizeChars": 1200,
  "chunkOverlapChars": 120,
  "dryRun": false
}

Look first at chunkId, text, canonicalUrl, title, crawlSource, and the RAG_CHUNKS_JSONL key-value output. A practical workflow recipe is in docs/workflow-recipes/website-rag-dataset-pipeline.md in the GitHub repository.
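To inspect chunk quality offline, a minimal Python sketch like the one below can parse JSONL text downloaded from the RAG_CHUNKS_JSONL key and report missing fields. The sample record is illustrative, and the assumption that every chunk row carries these five fields should be checked against your own run's output.

```python
import json

# Illustrative JSONL text; in practice this would be the downloaded
# RAG_CHUNKS_JSONL value. The per-row field set is an assumption.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "chunkId": "https://example.com/#chunk-0",
        "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
        "canonicalUrl": "https://example.com/",
        "title": "Example Domain",
        "crawlSource": "startUrl",
    }),
    "",  # blank lines are skipped by the parser below
])

FIELDS_TO_CHECK = ["chunkId", "text", "canonicalUrl", "title", "crawlSource"]


def parse_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def missing_fields(record, fields=FIELDS_TO_CHECK):
    """Return the names of expected fields that are absent or empty."""
    return [name for name in fields if not record.get(name)]


chunks = parse_jsonl(SAMPLE_JSONL)
for chunk in chunks:
    # Print the chunk ID, text length, and any fields that need attention.
    print(chunk["chunkId"], len(chunk["text"]), missing_fields(chunk))
```

An empty list in the last column means the row has everything needed for citation and embedding.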

What does this website crawler do?

Website to RAG Dataset Crawler starts from one or more URLs or sitemaps, follows internal links up to your limits, and extracts useful page content instead of dumping raw HTML. In auto mode, it tries /sitemap.xml for each start domain so docs/help-center crawls are more complete with less setup. The output is designed to be easy to export from Apify and load into a database, spreadsheet, vector store, LangChain/LlamaIndex pipeline, or internal research workflow.

It does not require an LLM API key. The extraction is deterministic and keeps costs predictable.

What data can you extract?

Field	Description
`url`	Final crawled page URL
`canonicalUrl`	Canonical URL when present
`title`	Page title
`metaDescription`	Meta description
`headings`	H1-H6 heading structure
`mainText`	Clean readable text
`markdown`	Markdown version of the main content
`links`	Internal and external links found on the page
`jsonLd`	JSON-LD/schema.org blocks
`emails`	Email addresses visibly present on the page
`phones`	Phone numbers visibly present on the page
`chunks`	Text chunks with character count and token estimate
`crawlSource`	Whether a page came from a start URL, sitemap, or discovered sitemap
`RAG_CHUNKS_JSONL`	Key-value output with one JSONL record per chunk
`RAG_CHUNKS`	Key-value output with the same chunk records as JSON
`MARKDOWN_BUNDLE`	One Markdown document combining all extracted pages
`URL_INVENTORY`	Compact page inventory with URL, title, depth, word count, and chunk count
`BUYER_BRIEF`	Short run brief for reviewing crawl coverage and extraction quality
`wordCount`	Approximate word count of extracted readable text
`markdownLength`	Character length of generated Markdown
`linkCount`	Number of unique links included in the record
`headingCount`	Number of extracted H1-H3 headings
`chunkCount`	Number of generated text chunks
`extractionMethod`	Content root used, such as `article`, `main`, or `body`
`depth`	Crawl depth from the start URL

By default, the dataset contains one row per chunk, the shape most embedding and vector-database imports expect. Full page records are also stored under the PAGE_RECORDS key in the key-value store. Set datasetOutputMode to pages for one dataset row per crawled page, or to both for both shapes in the dataset.
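The relationship between the two shapes can be sketched in Python: each page record carries a chunks array, and a chunk row is one of those chunks plus the page-level fields needed for citation. Which page fields the Actor actually copies onto each chunk row is an assumption here, and flatten_page is a hypothetical helper, not part of the Actor.

```python
def flatten_page(page):
    """Turn one page record into chunk rows (hypothetical row shape)."""
    rows = []
    for chunk in page.get("chunks", []):
        row = dict(chunk)  # keeps chunkId, text, and any chunk-level stats
        # Assumed page-level fields copied onto every chunk row.
        for key in ("url", "canonicalUrl", "title", "crawlSource", "depth"):
            row[key] = page.get(key)
        rows.append(row)
    return rows


page = {
    "url": "https://example.com/",
    "canonicalUrl": "https://example.com/",
    "title": "Example Domain",
    "crawlSource": "startUrl",
    "depth": 0,
    "chunks": [
        {"chunkId": "https://example.com/#chunk-0", "text": "# Example Domain"},
    ],
}

rows = flatten_page(page)
print(rows[0]["chunkId"], rows[0]["title"])
```

This is useful if you run in pages mode and later decide you want chunk rows without re-crawling.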

Quick start

Add one or more start URLs.
Leave crawlStrategy on auto unless you know you want links-only or sitemap-only.
Add sitemapUrls when you already know the right sitemap.
Set maxPages before the first run.
Keep sameDomainOnly enabled unless you want to follow external links.
Keep respectRobotsTxt enabled for normal public-site crawling.
Start with maxPages: 10, inspect the output, then scale.

For a docs site, use the docs homepage and let auto mode check the sitemap, or paste the sitemap directly into sitemapUrls. For a single-page extraction, set crawlStrategy: "linksOnly" and maxDepth: 0.

Use With n8n, Make, or Zapier

Run the Actor with wait-for-finish enabled, then read the default dataset items or the RAG_CHUNKS_JSONL key-value output.

Typical workflow:

Trigger from a new website URL, docs URL, or scheduled refresh.
Run this Actor with a small maxPages limit.
Send chunk rows to your vector database, spreadsheet, or agent knowledge store.
Store canonicalUrl, title, and chunkId so answers can cite sources.
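The last step can be sketched in Python as a mapping from a chunk row to a generic record with citation metadata. The {id, text, metadata} target shape and both helper names are hypothetical; adapt them to whatever your vector database or knowledge store expects.

```python
def to_store_record(chunk_row):
    """Map a chunk row to a generic {id, text, metadata} record."""
    return {
        "id": chunk_row["chunkId"],
        "text": chunk_row["text"],
        "metadata": {
            "canonicalUrl": chunk_row.get("canonicalUrl"),
            "title": chunk_row.get("title"),
        },
    }


def format_citation(record):
    """Build a human-readable source line from the stored metadata."""
    meta = record["metadata"]
    title = meta.get("title") or "Untitled page"
    return f"{title} ({meta.get('canonicalUrl')}) [{record['id']}]"


row = {
    "chunkId": "https://example.com/#chunk-0",
    "text": "This domain is for use in illustrative examples in documents.",
    "canonicalUrl": "https://example.com/",
    "title": "Example Domain",
}
record = to_store_record(row)
print(format_citation(record))
# prints: Example Domain (https://example.com/) [https://example.com/#chunk-0]
```

Keeping chunkId as the record ID means a refresh run can overwrite the same chunk instead of creating a duplicate, provided the IDs stay stable between runs.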

Input example

{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "sitemapUrls": ["https://docs.example.com/sitemap.xml"],
  "crawlStrategy": "auto",
  "maxPages": 25,
  "maxDepth": 2,
  "sameDomainOnly": true,
  "respectRobotsTxt": true,
  "includeMarkdown": true,
  "includeHtml": false,
  "datasetOutputMode": "chunks",
  "chunkSizeChars": 2500,
  "chunkOverlapChars": 250
}
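To build intuition for how chunkSizeChars and chunkOverlapChars interact, here is a minimal Python sketch of a sliding character window. It is not the Actor's actual chunking algorithm, which may split on headings or sentence boundaries; chunk_text is a hypothetical helper for experimenting with sizes.

```python
def chunk_text(text, chunk_size=2500, overlap=250):
    """Split text into overlapping character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "x" * 6000
pieces = chunk_text(sample)
print([len(p) for p in pieces])
# prints: [2500, 2500, 1500]
```

With these values each window advances 2,250 characters, so a 6,000-character page yields three chunks and the last 250 characters of one chunk repeat at the start of the next.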

Output example

{
  "url": "https://example.com/",
  "canonicalUrl": "https://example.com/",
  "title": "Example Domain",
  "metaDescription": "Example page used for documentation and tests.",
  "headings": [
    {
      "level": 1,
      "text": "Example Domain"
    }
  ],
  "mainText": "Example Domain This domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "links": [
    {
      "url": "https://www.iana.org/domains/example",
      "text": "More information",
      "internal": false
    }
  ],
  "jsonLd": [],
  "emails": [],
  "phones": [],
  "chunks": [
    {
      "chunkId": "https://example.com/#chunk-0",
      "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
      "charCount": 79,
      "tokenEstimate": 20
    }
  ],
  "wordCount": 12,
  "markdownLength": 79,
  "linkCount": 1,
  "headingCount": 1,
  "chunkCount": 1,
  "extractionMethod": "main",
  "crawlSource": "startUrl",
  "depth": 0
}

Pricing

This Actor uses pay-per-event pricing. Dry-run examples are not charged. Apify free-plan users get the first 25 page records without this Actor's custom event charge; after that, normal pay-per-event pricing and the user's run spending limit apply.

Event	Price	What counts
`page-record`	`$0.001`	One crawled page with extracted text, metadata, links, and chunks

That is $1 per 1,000 emitted page records, plus normal Apify platform usage. Use maxPages, maxDepth, and sameDomainOnly to control cost.
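For budgeting, the event charge can be estimated with a few lines of Python. This sketch uses the $0.001 price from the table above and excludes Apify platform usage. Treating the 25 free page records as a one-time deduction is an assumption; the exact accounting may differ.

```python
PRICE_PER_PAGE_RECORD = 0.001  # USD per page-record event, from the table above
FREE_PLAN_RECORDS = 25  # free-plan allowance; how it is applied is an assumption


def estimate_event_cost(page_records, free_plan=False):
    """Estimate the page-record event charge in USD (platform usage excluded)."""
    if page_records < 0:
        raise ValueError("page_records must not be negative")
    billable = page_records
    if free_plan:
        billable = max(0, page_records - FREE_PLAN_RECORDS)
    return round(billable * PRICE_PER_PAGE_RECORD, 6)


print(estimate_event_cost(1000))                 # 1.0
print(estimate_event_cost(125, free_plan=True))  # 0.1
```

Because the charge scales with emitted page records, maxPages acts as a hard upper bound on the event cost of a run.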

Tips for better crawls

Start small. A maxPages: 10 run usually tells you whether the site structure works.
Use maxDepth: 0 for a fixed list of URLs.
Use crawlStrategy: "sitemapOnly" when a docs site has a clean sitemap and you do not want link discovery noise.
Use sameDomainOnly: true to avoid crawling unrelated sites.
Set includeHtml: false unless you need source HTML.
Shorter chunks are easier to embed; longer chunks keep more context.
Use RAG_CHUNKS_JSONL when your downstream pipeline wants one JSON object per line for embeddings or batch import.
Use datasetOutputMode: "pages" when you want a spreadsheet-style page inventory instead of chunk rows.
Check wordCount, markdownLength, and extractionMethod to spot thin or poorly structured pages.
Failed URLs are recorded in RUN_SUMMARY.
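The quality check suggested in the tips above can be sketched in Python. The 50-word threshold is arbitrary, and reading an extractionMethod of body as a sign that no article or main element was found is an assumption; flag_thin_pages is a hypothetical helper.

```python
def flag_thin_pages(pages, min_words=50):
    """Return (url, reason) pairs for pages that may need a closer look."""
    flagged = []
    for page in pages:
        if page.get("wordCount", 0) < min_words:
            flagged.append((page["url"], "low wordCount"))
        elif page.get("extractionMethod") == "body":
            # Assumed to mean no narrower content root was available.
            flagged.append((page["url"], "extracted from body"))
    return flagged


pages = [
    {"url": "https://example.com/", "wordCount": 12, "extractionMethod": "main"},
    {"url": "https://example.com/a", "wordCount": 500, "extractionMethod": "body"},
    {"url": "https://example.com/b", "wordCount": 500, "extractionMethod": "article"},
]
for url, reason in flag_thin_pages(pages):
    print(url, reason)
```

Run this against page records (datasetOutputMode set to pages, or the stored page records) after a small crawl and review the flagged URLs before scaling up.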

Limits and compliance

This Actor crawls public pages reachable from user-supplied URLs. It does not log in, bypass paywalls, solve CAPTCHAs, or access private systems.

When respectRobotsTxt is enabled, the Actor applies the Disallow rules listed under User-agent: * for each start domain on a best-effort basis, using a short timeout when fetching robots.txt. Buyers are responsible for checking site terms and permitted use of crawled content.
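To preview which URLs a User-agent: * rule set would exclude before you crawl, Python's standard urllib.robotparser can evaluate a robots.txt file locally. This sketch illustrates the general idea only; the Actor's own matching logic may differ, and the sample rules and URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def allowed_urls(robots_txt, urls, user_agent="*"):
    """Return the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network fetch
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "https://docs.example.com/guide",
    "https://docs.example.com/private/notes",
    "https://docs.example.com/drafts/v2",
]
print(allowed_urls(ROBOTS_TXT, urls))
# prints: ['https://docs.example.com/guide']
```

If pages you expected are missing from a run, a local check like this can help tell robots.txt exclusions apart from extraction failures.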

Website & PDF to RAG JSONL Crawler

orbiscribe/linked-pdf-website-rag-crawler

Paste webpage and PDF URLs and get Markdown, JSONL chunks, PDF inventory, source warnings, and RAG-ready records.

Orbiscribe Labs

Docs & Help Center to RAG JSONL

orbiscribe/docs-help-center-rag-snapshot

Paste a docs or help center URL and get clean Markdown, breadcrumbs, page records, and JSONL chunks for RAG.

Orbiscribe Labs

Sitemap to Changed-Only RAG JSONL

orbiscribe/sitemap-to-rag-delta-dataset

Crawl sitemap.xml files and emit only added, changed, or deleted Markdown/JSONL chunks for cheaper RAG reindexing.

Orbiscribe Labs

URL List to RAG & Vector JSONL

orbiscribe/url-list-to-vector-jsonl

Paste a curated URL list and get clean Markdown, document JSONL, vector chunks, ingest manifest, and failed URL report.

Orbiscribe Labs

AI Training Data Scraper - LLM and RAG-Ready

george.the.developer/ai-training-data-scraper

Extract web content formatted for LLM fine-tuning and RAG pipelines. Output in OpenAI JSONL, Claude JSONL, Markdown, or raw text.

George Kioko

PDF URL to Markdown, Tables & RAG Extractor

thescrapelab/Apify-PDF-url-scraper

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Inus Grobler

Docs-to-RAG Optimizer

vamsi-krishna/docs-to-rag-optimizer

Convert public developer documentation into clean Markdown, semantic RAG chunks, token counts, duplicate hashes, JSONL exports, and quality warnings for AI assistants.