Website to Markdown and RAG Dataset Crawler

Crawl public websites into clean Markdown, text, metadata, links, JSON-LD, and chunks for RAG and knowledge bases.

Pricing: $3.00 / 1,000 page records
Developer: Orbiscribe Labs

Turn public websites into clean Markdown and RAG-ready datasets. This Actor crawls user-supplied URLs, extracts readable page content, and saves structured page records plus a JSONL chunk export with metadata for retrieval systems. Each page record includes metadata, headings, links, JSON-LD, visible emails and phones, extraction-quality stats, and text chunks.

Use it for documentation sites, help centers, public knowledge bases, competitor content research, SEO audits, and AI search or chatbot projects.

Run this first

Start with a tiny docs crawl so you can inspect chunk quality before scaling:

{
  "startUrls": [{ "url": "https://docs.apify.com/academy/getting-started" }],
  "maxPages": 3,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "respectRobotsTxt": true,
  "includeMarkdown": true,
  "includeHtml": false,
  "datasetOutputMode": "chunks",
  "chunkSizeChars": 1200,
  "chunkOverlapChars": 120,
  "dryRun": false
}

Look first at chunkId, text, canonicalUrl, title, headingPath, and the RAG_CHUNKS_JSONL key-value output. A practical workflow recipe is in docs/workflow-recipes/website-rag-dataset-pipeline.md in the GitHub repository.
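
To sanity-check the JSONL export programmatically, here is a minimal sketch, assuming the apify-client Python package, your API token, and the ID of a finished run:

import json

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # assumption: your Apify API token

run = client.run("RUN_ID").get()  # assumption: RUN_ID is a finished run of this Actor
store_id = run["defaultKeyValueStoreId"]

# RAG_CHUNKS_JSONL holds one JSON object per line, one line per chunk.
record = client.key_value_store(store_id).get_record("RAG_CHUNKS_JSONL")
for line in record["value"].splitlines():
    chunk = json.loads(line)
    print(chunk["chunkId"], chunk.get("canonicalUrl"), chunk.get("title"))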

What does this website crawler do?

Website to Markdown and RAG Dataset Crawler starts from one or more URLs, follows internal links up to your limits, and extracts useful page content instead of dumping raw HTML. The output is designed to be easy to export from Apify and load into a database, spreadsheet, vector store, LangChain/LlamaIndex pipeline, or internal research workflow.

It does not require an LLM API key. The extraction is deterministic and keeps costs predictable.
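
If your pipeline uses LangChain, for example, the chunk rows map naturally onto its Document objects. A minimal sketch, assuming the apify-client and langchain-core Python packages and the default dataset of a finished chunk-mode run whose ID you substitute:

from apify_client import ApifyClient
from langchain_core.documents import Document

client = ApifyClient("YOUR_APIFY_TOKEN")  # assumption: your Apify API token

# Assumption: DATASET_ID is the default dataset of a finished chunk-mode run.
docs = [
    Document(
        page_content=item["text"],
        metadata={
            "source": item.get("canonicalUrl") or item.get("url"),
            "title": item.get("title"),
            "chunkId": item.get("chunkId"),
        },
    )
    for item in client.dataset("DATASET_ID").iterate_items()
]

The same mapping works for LlamaIndex or a plain database insert; only the document construction changes.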

What data can you extract?

Field               Description
url                 Final crawled page URL
canonicalUrl        Canonical URL when present
title               Page title
metaDescription     Meta description
headings            H1-H6 heading structure
mainText            Clean readable text
markdown            Markdown version of the main content
links               Internal and external links found on the page
jsonLd              JSON-LD/schema.org blocks
emails              Email addresses visibly present on the page
phones              Phone numbers visibly present on the page
chunks              Text chunks with character count and token estimate
RAG_CHUNKS_JSONL    Key-value output with one JSONL record per chunk
RAG_CHUNKS          Key-value output with the same chunk records as JSON
MARKDOWN_BUNDLE     One Markdown document combining all extracted pages
URL_INVENTORY       Compact page inventory with URL, title, depth, word count, and chunk count
BUYER_BRIEF         Short run brief for reviewing crawl coverage and extraction quality
wordCount           Approximate word count of extracted readable text
markdownLength      Character length of generated Markdown
linkCount           Number of unique links included in the record
headingCount        Number of extracted H1-H3 headings
chunkCount          Number of generated text chunks
extractionMethod    Content root used, such as article, main, or body
depth               Crawl depth from the start URL
By default, the dataset contains one row per chunk because that is what most embedding and vector-database imports expect. Full page records are also stored in PAGE_RECORDS. Set datasetOutputMode to pages if you prefer one dataset row per crawled page, or both if you want both shapes in the dataset.

Quick start

  1. Add one or more start URLs.
  2. Set maxPages and maxDepth before the first run.
  3. Keep sameDomainOnly enabled unless you want to follow external links.
  4. Keep respectRobotsTxt enabled for normal public-site crawling.
  5. Start with maxPages: 10, inspect the output, then scale.

For a docs site, use the docs homepage or sitemap page as the start URL. For a single-page extraction, set maxDepth: 0.
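
For example, a minimal single-page input could look like this (the URL is a placeholder; options you leave out keep their defaults):

{
  "startUrls": [{ "url": "https://example.com/some-page" }],
  "maxPages": 1,
  "maxDepth": 0,
  "datasetOutputMode": "chunks"
}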

Use with n8n, Make, or Zapier

Run the Actor with wait-for-finish enabled, then read the default dataset items or the RAG_CHUNKS_JSONL key-value output. A minimal Python sketch follows the workflow steps below.

Typical workflow:

  1. Trigger from a new website URL, docs URL, or scheduled refresh.
  2. Run this Actor with a small maxPages limit.
  3. Send chunk rows to your vector database, spreadsheet, or agent knowledge store.
  4. Store canonicalUrl, title, and chunkId so answers can cite sources.
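
Outside those platforms, the same run-and-read pattern looks roughly like this in Python, assuming the apify-client package; the Actor ID slug below is inferred from the store listing, so check the Actor's API tab for the exact ID:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # assumption: your Apify API token

# .call() starts the run and waits for it to finish.
run = client.actor("orbiscribe-labs/website-to-markdown-and-rag-dataset-crawler").call(
    run_input={
        "startUrls": [{"url": "https://example.com/"}],
        "maxPages": 10,
        "datasetOutputMode": "chunks",
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Replace the print with your vector-database or spreadsheet upsert.
    print(item["chunkId"], item["text"][:80])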

Input example

{
  "startUrls": [{ "url": "https://example.com/" }],
  "maxPages": 25,
  "maxDepth": 2,
  "sameDomainOnly": true,
  "respectRobotsTxt": true,
  "includeMarkdown": true,
  "includeHtml": false,
  "datasetOutputMode": "chunks",
  "chunkSizeChars": 2500,
  "chunkOverlapChars": 250
}

Output example

{
  "url": "https://example.com/",
  "canonicalUrl": "https://example.com/",
  "title": "Example Domain",
  "metaDescription": "Example page used for documentation and tests.",
  "headings": [
    { "level": 1, "text": "Example Domain" }
  ],
  "mainText": "Example Domain This domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "links": [
    { "url": "https://www.iana.org/domains/example", "text": "More information", "internal": false }
  ],
  "jsonLd": [],
  "emails": [],
  "phones": [],
  "chunks": [
    {
      "chunkId": "https://example.com/#chunk-0",
      "text": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
      "charCount": 77,
      "tokenEstimate": 20
    }
  ],
  "wordCount": 12,
  "markdownLength": 77,
  "linkCount": 1,
  "headingCount": 1,
  "chunkCount": 1,
  "extractionMethod": "main",
  "depth": 0
}

Pricing

This Actor uses pay-per-event pricing. Dry-run examples are not charged. Apify free-plan users get the first 25 page records without this Actor's custom event charge; after that, normal pay-per-event pricing and the user's run spending limit apply.

Event         Price     What counts
page-record   $0.003    One crawled page with extracted text, metadata, links, and chunks

That is $3 per 1,000 emitted page records, plus normal Apify platform usage. Use maxPages, maxDepth, and sameDomainOnly to control cost.
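
For example, a run capped at maxPages: 500 emits at most 500 page records, so the custom event charge is at most 500 × $0.003 = $1.50, plus platform usage.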

Tips for better crawls

  • Start small. A maxPages: 10 run usually tells you whether the site structure works.
  • Use maxDepth: 0 for a fixed list of URLs.
  • Use sameDomainOnly: true to avoid crawling unrelated sites.
  • Set includeHtml: false unless you need source HTML.
  • Shorter chunks are easier to embed; longer chunks keep more context. A sketch of how size and overlap interact follows this list.
  • Use RAG_CHUNKS_JSONL when your downstream pipeline wants one JSON object per line for embeddings or batch import.
  • Use datasetOutputMode: "pages" when you want a spreadsheet-style page inventory instead of chunk rows.
  • Check wordCount, markdownLength, and extractionMethod to spot thin or poorly structured pages.
  • Failed URLs are recorded in RUN_SUMMARY.
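
To build intuition for chunkSizeChars and chunkOverlapChars, here is a rough, hypothetical sketch of character chunking with overlap; the Actor's actual boundary handling may differ, for example by preferring heading or sentence breaks:

def chunk_text(text: str, size: int = 1200, overlap: int = 120) -> list[str]:
    # Naive fixed-size character windows: consecutive windows share
    # `overlap` characters so sentences are less likely to be cut in half.
    if not text:
        return []
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

# With size=1200 and overlap=120, a 3,000-character page yields three
# windows of 1200, 1200, and 840 characters at offsets 0, 1080, and 2160.
print([len(c) for c in chunk_text("x" * 3000)])  # [1200, 1200, 840]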

Limits and compliance

This Actor crawls public pages reachable from user-supplied URLs. It does not log in, bypass paywalls, solve CAPTCHAs, or access private systems.

When respectRobotsTxt is enabled, the crawler applies the User-agent: * disallow rules from each start domain's robots.txt on a best-effort basis, using a short fetch timeout. Buyers are responsible for checking site terms and permitted use of crawled content.