Fast Website to Markdown & RAG JSONL Crawler

Pricing

$1.00 / 1,000 page records

Paste a homepage or sitemap and get clean Markdown, metadata, JSONL chunks, and source URLs for RAG at a low per-page price.

Developer: Orbiscribe Labs (maintained by Community)

Last modified: 11 days ago

Turn public websites into clean Markdown and RAG-ready JSONL without wrestling with a broad web crawler. Paste a homepage or sitemap, set a page limit, and get structured page records plus chunk rows with stable source metadata for retrieval systems. Each page record includes metadata, headings, links, JSON-LD, visible emails and phones, extraction-quality stats, and text chunks.

Use it for documentation sites, help centers, public knowledge bases, competitor content research, SEO audits, and AI search or chatbot projects.

Why use this instead of a generic crawler?

Use Apify's broad crawlers when you need maximum crawling flexibility. Use this Actor when you want the shortest path from public URLs to an embeddings-ready dataset:

  • paste a homepage or sitemap.xml
  • auto-discover /sitemap.xml in the default mode
  • output chunk-level JSONL by default
  • keep full page records in key-value storage
  • include source URLs, canonical URLs, headings, content stats, and crawl source
  • pay a predictable per-extracted-page price

The goal is not to expose every crawler knob. The goal is to make the common RAG ingestion run obvious enough that a first run succeeds without tuning.

Run this first

Start with a tiny docs crawl so you can inspect chunk quality before scaling:

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "sitemapUrls": [],
  "crawlStrategy": "auto",
  "maxPages": 5,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "respectRobotsTxt": true,
  "includeMarkdown": true,
  "includeHtml": false,
  "datasetOutputMode": "chunks",
  "chunkSizeChars": 1200,
  "chunkOverlapChars": 120,
  "dryRun": false
}

Look first at chunkId, text, canonicalUrl, title, crawlSource, and the RAG_CHUNKS_JSONL key-value output. A practical workflow recipe is in docs/workflow-recipes/website-rag-dataset-pipeline.md in the GitHub repository.
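
To make that first inspection concrete, here is a minimal Python sketch that parses RAG_CHUNKS_JSONL text and builds one summary line per chunk. It assumes each line is a JSON object carrying the fields named above. The names load_chunks and preview and the sample rows are hypothetical, for illustration only, and are not part of the Actor.

```python
import json


def load_chunks(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    # Split on "\n" only, so unusual Unicode line separators inside
    # JSON strings cannot break a record in two.
    return [json.loads(line) for line in jsonl_text.split("\n") if line.strip()]


def preview(chunks, width=60):
    """Return one line per chunk: chunkId, canonicalUrl, start of text."""
    lines = []
    for chunk in chunks:
        text = " ".join(chunk.get("text", "").split())  # collapse whitespace
        lines.append(f'{chunk.get("chunkId")} | {chunk.get("canonicalUrl")} | {text[:width]}')
    return lines


# Hypothetical sample rows; real rows come from the RAG_CHUNKS_JSONL output.
sample = "\n".join([
    json.dumps({
        "chunkId": "https://example.com/#chunk-0",
        "canonicalUrl": "https://example.com/",
        "title": "Example Domain",
        "crawlSource": "startUrl",
        "text": "# Example Domain\n\nThis domain is for use in examples.",
    }),
    json.dumps({
        "chunkId": "https://example.com/#chunk-1",
        "canonicalUrl": "https://example.com/",
        "title": "Example Domain",
        "crawlSource": "startUrl",
        "text": "More information is available from IANA.",
    }),
])

for line in preview(load_chunks(sample)):
    print(line)
```

If the previews start mid-sentence or repeat navigation text, adjust chunkSizeChars or the crawl scope before scaling up.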

What does this website crawler do?

This Actor starts from one or more URLs or sitemaps, follows internal links up to your limits, and extracts useful page content instead of dumping raw HTML. In auto mode, it tries /sitemap.xml for each start domain, so docs and help-center crawls are more complete with less setup. The output is designed to be easy to export from Apify and load into a database, spreadsheet, vector store, LangChain/LlamaIndex pipeline, or internal research workflow.

It does not require an LLM API key. The extraction is deterministic and keeps costs predictable.
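
As a rough illustration of the auto-mode sitemap check described above, the sketch below derives one /sitemap.xml candidate per start domain. It is an assumption about the idea, not the Actor's actual code, and sitemap_candidates is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def sitemap_candidates(start_urls):
    """Return one /sitemap.xml URL per distinct scheme and host, in first-seen order."""
    seen = []
    for url in start_urls:
        parts = urlsplit(url)
        # Ignore anything that is not an absolute http(s) URL.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        candidate = f"{parts.scheme}://{parts.netloc}/sitemap.xml"
        if candidate not in seen:
            seen.append(candidate)
    return seen


print(sitemap_candidates([
    "https://docs.apify.com/",
    "https://docs.apify.com/platform",
    "https://example.com/help",
]))
```

Two start URLs on the same domain produce a single candidate, which is why a docs homepage alone is usually enough input in auto mode.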

What data can you extract?

| Field | Description |
| --- | --- |
| url | Final crawled page URL |
| canonicalUrl | Canonical URL when present |
| title | Page title |
| metaDescription | Meta description |
| headings | H1-H6 heading structure |
| mainText | Clean readable text |
| markdown | Markdown version of the main content |
| links | Internal and external links found on the page |
| jsonLd | JSON-LD/schema.org blocks |
| emails | Email addresses visibly present on the page |
| phones | Phone numbers visibly present on the page |
| chunks | Text chunks with character count and token estimate |
| crawlSource | Whether a page came from a start URL, sitemap, or discovered sitemap |
| RAG_CHUNKS_JSONL | Key-value output with one JSONL record per chunk |
| RAG_CHUNKS | Key-value output with the same chunk records as JSON |
| MARKDOWN_BUNDLE | One Markdown document combining all extracted pages |
| URL_INVENTORY | Compact page inventory with URL, title, depth, word count, and chunk count |
| BUYER_BRIEF | Short run brief for reviewing crawl coverage and extraction quality |
| wordCount | Approximate word count of extracted readable text |
| markdownLength | Character length of generated Markdown |
| linkCount | Number of unique links included in the record |
| headingCount | Number of extracted H1-H3 headings |
| chunkCount | Number of generated text chunks |
| extractionMethod | Content root used, such as article, main, or body |
| depth | Crawl depth from the start URL |

By default, the dataset contains one row per chunk because that is what most embedding and vector-database imports expect. Full page records are also stored in PAGE_RECORDS. Set datasetOutputMode to pages if you prefer one dataset row per crawled page, or both if you want both shapes in the dataset.
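
If you work from full page records and want chunk rows, the relationship between the two shapes can be sketched as below. The row layout is an assumption built from fields this README names (chunkId, text, canonicalUrl, title, crawlSource); the Actor's real chunk rows may carry more or differently named fields. Falling back to url when canonicalUrl is missing is also an assumption, and page_to_chunk_rows is a hypothetical helper.

```python
def page_to_chunk_rows(page):
    """Flatten one page record into a list of chunk rows with source metadata."""
    rows = []
    for chunk in page.get("chunks", []):
        rows.append({
            "chunkId": chunk["chunkId"],
            "text": chunk["text"],
            "url": page["url"],
            # Assumption: use the crawled URL when no canonical URL is present.
            "canonicalUrl": page.get("canonicalUrl") or page["url"],
            "title": page.get("title", ""),
            "crawlSource": page.get("crawlSource"),
        })
    return rows


# Trimmed-down page record, modeled on the output example in this README.
page = {
    "url": "https://example.com/",
    "canonicalUrl": "https://example.com/",
    "title": "Example Domain",
    "crawlSource": "startUrl",
    "chunks": [
        {"chunkId": "https://example.com/#chunk-0", "text": "# Example Domain"},
    ],
}

for row in page_to_chunk_rows(page):
    print(row["chunkId"], row["canonicalUrl"])
```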

Quick start

  1. Add one or more start URLs.
  2. Leave crawlStrategy on auto unless you know you want links-only or sitemap-only.
  3. Add sitemapUrls when you already know the right sitemap.
  4. Set maxPages before the first run.
  5. Keep sameDomainOnly enabled unless you want to follow external links.
  6. Keep respectRobotsTxt enabled for normal public-site crawling.
  7. Start with maxPages: 10, inspect the output, then scale.

For a docs site, use the docs homepage and let auto mode check the sitemap, or paste the sitemap directly into sitemapUrls. For a single-page extraction, set crawlStrategy: "linksOnly" and maxDepth: 0.
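
The quick-start choices above can be captured in a small input builder. This is a sketch under the assumption that the input fields behave as this README describes; build_input and its parameters are hypothetical names, and the defaults simply mirror the recommendations above.

```python
import json


def build_input(start_url, *, sitemap_url=None, single_page=False, max_pages=10):
    """Build an Actor input dict following the quick-start recommendations."""
    run_input = {
        "startUrls": [{"url": start_url}],
        "sitemapUrls": [sitemap_url] if sitemap_url else [],
        "crawlStrategy": "auto",
        "maxPages": max_pages,
        "maxDepth": 1,
        "sameDomainOnly": True,
        "respectRobotsTxt": True,
        "datasetOutputMode": "chunks",
    }
    if single_page:
        # Single-page extraction, as described above: links only, depth 0.
        run_input["crawlStrategy"] = "linksOnly"
        run_input["maxDepth"] = 0
    return run_input


print(json.dumps(build_input("https://docs.example.com/"), indent=2))
print(json.dumps(build_input("https://docs.example.com/faq", single_page=True), indent=2))
```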

Use with n8n, Make, or Zapier

Run the Actor with wait-for-finish enabled, then read the default dataset items or the RAG_CHUNKS_JSONL key-value output.

Typical workflow:

  1. Trigger from a new website URL, docs URL, or scheduled refresh.
  2. Run this Actor with a small maxPages limit.
  3. Send chunk rows to your vector database, spreadsheet, or agent knowledge store.
  4. Store canonicalUrl, title, and chunkId so answers can cite sources.
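
Steps 3 and 4 can be sketched in Python as a mapping from a chunk row to a generic vector-store record plus a citation string. The record layout (id, text, metadata) is an assumption that resembles what many vector stores accept, not a specific product's API, and to_vector_record and cite are hypothetical helpers.

```python
def to_vector_record(chunk):
    """Map a chunk row to a generic record: stable id, text to embed, citation metadata."""
    return {
        "id": chunk["chunkId"],
        "text": chunk["text"],
        "metadata": {
            "canonicalUrl": chunk.get("canonicalUrl"),
            "title": chunk.get("title"),
        },
    }


def cite(record):
    """Format a source citation for an answer that used this record."""
    meta = record["metadata"]
    return f'{meta["title"]} ({meta["canonicalUrl"]}) [{record["id"]}]'


# Hypothetical chunk row for illustration.
chunk = {
    "chunkId": "https://example.com/#chunk-0",
    "text": "# Example Domain",
    "canonicalUrl": "https://example.com/",
    "title": "Example Domain",
}

record = to_vector_record(chunk)
print(cite(record))
```

Using chunkId as the record id means a scheduled refresh can overwrite existing entries instead of creating duplicates, provided chunk IDs stay stable between runs.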

Input example

{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "sitemapUrls": ["https://docs.example.com/sitemap.xml"],
  "crawlStrategy": "auto",
  "maxPages": 25,
  "maxDepth": 2,
  "sameDomainOnly": true,
  "respectRobotsTxt": true,
  "includeMarkdown": true,
  "includeHtml": false,
  "datasetOutputMode": "chunks",
  "chunkSizeChars": 2500,
  "chunkOverlapChars": 250
}
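
To build intuition for how chunkSizeChars and chunkOverlapChars interact, here is a plain sliding-window chunker in Python. This is an illustrative assumption only: the Actor's real chunking may split on headings or paragraphs and is not documented here. chunk_text is a hypothetical function.

```python
def chunk_text(text, size=2500, overlap=250):
    """Split text into windows of `size` characters, each sharing `overlap`
    characters with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Tiny values so the overlap is visible: each chunk repeats the previous
# chunk's last character.
print(chunk_text("abcdefghij", size=4, overlap=1))
```

With the values from the input example, each new window would advance 2,250 characters under this scheme, so a larger overlap means more chunks for the same page.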

Output example

{
  "url": "https://example.com/",
  "canonicalUrl": "https://example.com/",
  "title": "Example Domain",
  "metaDescription": "Example page used for documentation and tests.",
  "headings": [
    {
      "level": 1,
      "text": "Example Domain"
    }
  ],
  "mainText": "Example Domain This domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "links": [
    {
      "url": "https://www.iana.org/domains/example",
      "text": "More information",
      "internal": false
    }
  ],
  "jsonLd": [],
  "emails": [],
  "phones": [],
  "chunks": [
    {
      "chunkId": "https://example.com/#chunk-0",
      "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
      "charCount": 77,
      "tokenEstimate": 20
    }
  ],
  "wordCount": 12,
  "markdownLength": 77,
  "linkCount": 1,
  "headingCount": 1,
  "chunkCount": 1,
  "extractionMethod": "main",
  "crawlSource": "startUrl",
  "depth": 0
}

Pricing

This Actor uses pay-per-event pricing. Dry-run examples are not charged. Apify free-plan users get the first 25 page records without this Actor's custom event charge; after that, normal pay-per-event pricing and the user's run spending limit apply.

| Event | Price | What counts |
| --- | --- | --- |
| page-record | $0.001 | One crawled page with extracted text, metadata, links, and chunks |

That is $1 per 1,000 emitted page records, plus normal Apify platform usage. Use maxPages, maxDepth, and sameDomainOnly to control cost.
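
The event charge is simple arithmetic, sketched below in Python. It covers only the page-record event, not Apify platform usage. How the 25-record free-plan allowance is counted (per run, per account, or otherwise) is not specified here, so the sketch assumes it is subtracted once from the count you pass in. estimate_event_cost is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_PAGE_RECORD = Decimal("0.001")  # USD, from the pricing table above
FREE_PLAN_PAGE_RECORDS = 25               # free-plan allowance stated above


def estimate_event_cost(page_records, free_plan=False):
    """Estimate the page-record event charge in USD (platform usage excluded)."""
    billable = page_records
    if free_plan:
        # Assumption: the allowance is applied once to this count.
        billable = max(0, page_records - FREE_PLAN_PAGE_RECORDS)
    return billable * PRICE_PER_PAGE_RECORD


print(estimate_event_cost(1000))                 # 1,000 page records
print(estimate_event_cost(100, free_plan=True))  # 75 billable page records
```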

Tips for better crawls

  • Start small. A maxPages: 10 run usually tells you whether the site structure works.
  • Use maxDepth: 0 for a fixed list of URLs.
  • Use crawlStrategy: "sitemapOnly" when a docs site has a clean sitemap and you do not want link discovery noise.
  • Use sameDomainOnly: true to avoid crawling unrelated sites.
  • Set includeHtml: false unless you need source HTML.
  • Shorter chunks are easier to embed; longer chunks keep more context.
  • Use RAG_CHUNKS_JSONL when your downstream pipeline wants one JSON object per line for embeddings or batch import.
  • Use datasetOutputMode: "pages" when you want a spreadsheet-style page inventory instead of chunk rows.
  • Check wordCount, markdownLength, and extractionMethod to spot thin or poorly structured pages.
  • Failed URLs are recorded in RUN_SUMMARY.
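
The quality check suggested in the tips above can be automated with a small filter over page records. The 50-word threshold and the treatment of extractionMethod "body" as a weaker fallback are assumptions chosen for illustration, so tune both for your sites. flag_thin_pages is a hypothetical helper.

```python
def flag_thin_pages(pages, min_words=50):
    """Return (url, reasons) pairs for pages that look thin or poorly structured."""
    flagged = []
    for page in pages:
        reasons = []
        if page.get("wordCount", 0) < min_words:
            reasons.append("low wordCount")
        # Assumption: "body" means no article or main element was found.
        if page.get("extractionMethod") == "body":
            reasons.append("content root is body")
        if reasons:
            flagged.append((page.get("url"), reasons))
    return flagged


# Hypothetical page records for illustration.
pages = [
    {"url": "https://example.com/", "wordCount": 12, "extractionMethod": "main"},
    {"url": "https://example.com/guide", "wordCount": 900, "extractionMethod": "article"},
    {"url": "https://example.com/old", "wordCount": 400, "extractionMethod": "body"},
]

for url, reasons in flag_thin_pages(pages):
    print(url, "->", ", ".join(reasons))
```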

Limits and compliance

This Actor crawls public pages reachable from user-supplied URLs. It does not log in, bypass paywalls, solve CAPTCHAs, or access private systems.

When respectRobotsTxt is enabled, the Actor applies the Disallow rules from the User-agent: * group of each start domain's robots.txt on a best-effort basis, using a short timeout when fetching the file. Buyers are responsible for checking site terms and permitted use of crawled content.
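
To see what a User-agent: * disallow rule means for a given URL, you can check it yourself with Python's standard-library robots.txt parser. This shows the general rule semantics only. It is not the Actor's implementation, and allowed_by_robots and the sample robots.txt are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
robots_txt = """User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(robots_txt, "https://example.com/docs/"))
print(allowed_by_robots(robots_txt, "https://example.com/private/page"))
```

Running a check like this on a few target URLs before a crawl helps explain why some pages are missing from the output.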