Knowledge Intelligence Engine — Website to Markdown for RAG

Pricing

from $20.00 / 1,000 pages converted

Turn any website, documentation site or help centre into a retrieval-ready knowledge corpus for RAG and AI search. Clean Markdown plus chunks, change detection, deduplication, retrieval scoring, version awareness and a full corpus audit, in one run.

Developer: Ryan Clinton (maintained by Community)

Knowledge Intelligence Engine — turn any site into a retrieval-ready RAG corpus

Most crawlers extract pages. This actor builds knowledge corpora.

Most teams think they need a crawler. What they actually need is a retrieval-ready knowledge corpus. A typical RAG pipeline is eight stages:

Crawl → Boilerplate removal → Deduplication → Chunking → Change detection
→ Quality filtering → Documentation classification → Corpus auditing

Most tools stop after the first step and hand you HTML. This actor does all eight in a single run and hands you a production-ready knowledge corpus: clean Markdown, embedding-ready chunks, change intelligence, retrieval scoring, version awareness, and a full corpus audit.

Feed the output straight into LangChain, LlamaIndex, Pinecone, Weaviate, Qdrant, Chroma, or any vector database. No browser, no JavaScript runtime, no preprocessing pipeline to build and maintain.

Note: the actor's slug is still website-content-to-markdown (its original name). The Markdown conversion is the foundation; everything above is built on top of it.

What makes this different?

Most crawlers extract pages. This actor manages knowledge. It does not just tell you what content exists; it tells you:

  • What changed since the last run (and returns only that, if you want)
  • What matters (an importance and density ranking, not just word counts)
  • What to embed and what to skip (a single retrieval score per page)
  • What is duplicated, stale, or orphaned before it reaches production
  • What is missing (crawl coverage vs the sitemap)

Capability | Generic scraper | Typical extractor | This actor
Clean Markdown
Boilerplate removal
Change detection across runs | Partial
Delta crawls (return only changed pages)
Per-page retrieval scoring
Embedding-ready chunks
Documentation classification
Version awareness
Duplicate / orphan / stale detection
Corpus audit + recommendations
Embedding-cost analysis

Most extraction tools tell you what content exists. This actor tells you what changed, what should be embedded, what should be removed, what is missing, and what is hurting retrieval. That is the moat.

Before and after

BEFORE
docs.example.com
542 pages
~2.1M tokens
unknown quality
unknown duplicates
unknown coverage

AFTER (one run)
Corpus Score: 91 / 100
✓ 31 duplicate pages flagged
✓ 14 orphan pages found
✓ 27 stale pages found
✓ 38% embedding-token reduction
✓ 96% coverage confidence
93% of pages retrieval-ready
→ ready for Pinecone / Qdrant / Weaviate

(Illustrative; real figures depend on the site.)

Re-embed only what changed (the expensive part of RAG)

Most crawlers return every page every time, so you re-embed, and re-pay for, the whole site on every refresh:

Typical crawler
1,000 pages
↓ crawl 1,000
↓ embed 1,000
(next refresh)
↓ embed 1,000 AGAIN

This actor (delta mode)
1,000 pages
↓ crawl 1,000
17 changed
↓ embed 17
~98% less embedding work

This actor tracks a content hash per page across runs. Set a watchlistName and deltaOnly, and a 1,000-page documentation site with 17 changed pages returns 17 records. The exact saving varies with how much your docs change, but on a stable site it is enormous.
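
The per-page hashing idea can be sketched in a few lines. This is an illustrative approximation, not the actor's actual implementation: the function names `content_hash` and `classify_changes` and the in-memory `last_run` mapping are assumptions made for the example, and the real actor also distinguishes a `structure` change, which this sketch omits.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Return a SHA-256 hex digest of a page's Markdown."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def classify_changes(previous: dict[str, str], pages: dict[str, str]) -> dict[str, str]:
    """Label each URL 'new', 'content', or 'unchanged' against stored hashes.

    previous maps url -> hash from the last run; pages maps url -> Markdown.
    """
    labels = {}
    for url, markdown in pages.items():
        digest = content_hash(markdown)
        if url not in previous:
            labels[url] = "new"
        elif previous[url] != digest:
            labels[url] = "content"
        else:
            labels[url] = "unchanged"
    return labels


# Hypothetical two-run example: one page edited, one untouched, one added.
last_run = {
    "https://docs.example.com/a": content_hash("# A\n\nOld text"),
    "https://docs.example.com/b": content_hash("# B\n\nSame text"),
}
this_run = {
    "https://docs.example.com/a": "# A\n\nNew text",
    "https://docs.example.com/b": "# B\n\nSame text",
    "https://docs.example.com/c": "# C\n\nBrand new page",
}
labels = classify_changes(last_run, this_run)
to_embed = [url for url, label in labels.items() if label != "unchanged"]
print(to_embed)
```

Only the URLs in `to_embed` would need re-embedding; the unchanged page is skipped.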

What happens after the crawl?

Most crawlers stop at extraction. This one keeps going:

Most crawlers stop here
Website
↓ Discovery
↓ Markdown
(you build the rest)

This actor continues
Website
↓ Discovery
↓ Markdown
↓ Deduplication
↓ Classification (type · archetype · intent)
↓ Change detection (delta vs last run)
↓ Chunking (heading-aware, embedding-ready)
↓ Corpus analysis (coverage · health · versions)
↓ Retrieval audit (what to embed, what to skip)
→ Knowledge corpus → your vector database

The intelligence stack: from raw website through deduplication, classification, change detection, chunking and corpus audit to a retrieval-ready knowledge corpus

What you don't have to build

Most website crawlers stop at extraction. Everything else is a pipeline you build and maintain yourself:

| Build it yourself with a typical crawler | With this actor |
| --- | --- |
| Boilerplate-removal pipeline | ✅ built in |
| Deduplication pipeline | ✅ built in |
| Chunking pipeline | ✅ built in |
| Change-detection system | ✅ built in |
| Documentation classification | ✅ built in |
| Retrieval / quality scoring | ✅ built in |
| Corpus quality auditing | ✅ built in |
| Token-cost estimation | ✅ built in |

A typical extraction tool returns Markdown plus a little metadata. This actor returns Markdown, chunks, change intelligence, a corpus audit, retrieval scoring, coverage analysis, version awareness, and a knowledge graph. One side is a converter; the other is infrastructure.

Why RAG teams choose this

Built for AI ingestion, not generic scraping. The result: lower embedding costs (dedup + delta), higher retrieval quality (scoring + classification), smaller vector stores (skip thin and duplicate pages), faster refresh cycles (delta crawls), and far less post-processing code.

When a page renders its content client-side or restricts automated access, you get an explicit diagnostic record (jsRenderingRequired / botProtection) instead of a silently missing page, and those pages are never charged.

Corpus Intelligence Report

Why this matters: most teams never discover their duplicate documentation, orphaned pages, stale content, missing sitemap coverage, thin pages, or version conflicts until retrieval quality has already degraded in production. This actor finds them automatically, on every run, before they reach your model.

The feature almost no other extraction tool offers: the actor does not just hand you text, it tells you what is wrong with your knowledge corpus. Every run writes a SUMMARY to the key-value store with a corpusScore, a coverageConfidence ("did I crawl the whole thing?"), a retrievalAudit (excellent / good / poor pages + the top issues), an embedding-cost compression report, corpusDrift since the last run, and plain-English recommendations like "31 duplicate pages detected, dedupe on canonicalUrl" and "captured 500 of 560 sitemap pages (89.3%), raise maxPagesPerDomain for full coverage". Most tools tell you what content exists; this one tells you what to fix. (Full SUMMARY shape is documented in the output section below.)

A run summary reads like a corpus health dashboard:

Corpus Score: 91 / 100
Coverage Confidence: 94%
Retrieval-ready: 412 excellent · 83 good · 17 poor
Duplicate pages: 31
Orphan pages: 14
Stale pages: 27
Embedding savings: 2.1M → 1.3M tokens (38% reduction)
Corpus drift: 17% (vs last run)
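
One way to act on these numbers is a simple ingestion gate. The sketch below assumes the SUMMARY has already been loaded into a dict; the function name `corpus_gate` and both thresholds are hypothetical choices for illustration, not values the actor prescribes.

```python
def corpus_gate(summary: dict, min_score: int = 80, min_coverage: int = 90) -> list[str]:
    """Return the reasons a corpus should not be ingested yet (empty list = OK)."""
    problems = []
    score = summary.get("corpusScore") or 0
    coverage = summary.get("coverageConfidence") or 0
    if score < min_score:
        problems.append(f"corpusScore {score} is below {min_score}")
    if coverage < min_coverage:
        problems.append(f"coverageConfidence {coverage} is below {min_coverage}")
    return problems


# Hypothetical SUMMARY values mirroring the dashboard above. In a real run the
# dict would come from the SUMMARY key of the run's key-value store (for
# example through the Apify client's key-value store record lookup).
summary = {"corpusScore": 91, "coverageConfidence": 94}
print(corpus_gate(summary))
print(corpus_gate(summary, min_coverage=95))
```

An empty list means both thresholds passed; any entries can be logged or used to block a scheduled re-embedding job.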

Who is this for?

Built for: RAG engineers, AI platform teams, documentation teams, internal-search teams, and knowledge-management teams who need a retrieval-ready corpus, not just scraped text.

Not for: general web scraping, ecommerce product extraction, lead generation, or browser automation. If you only need raw HTML or a few fields, a generic crawler is simpler.

When a crawler is enough vs when you need this

You only need... (use a generic crawler)

  • Markdown + basic metadata: a plain extraction tool is fine

You need... (use this actor)

  • Change detection + delta ingestion
  • Retrieval scoring + quality filtering
  • Corpus audit + coverage analysis
  • Version awareness + documentation classification
  • A knowledge graph + embedding-cost analysis

A real example

Point it at a documentation site and you get an ingestion-ready corpus plus a report, not just pages:

Input: https://docs.example.com
542 pages discovered (from sitemap + link following)
511 pages extracted to clean Markdown
31 duplicate pages detected (canonical variants)
14 orphan pages detected (in sitemap, no inbound links)
27 stale pages detected (not updated in over a year)
38% token reduction after dedup
93% of pages retrieval-ready (pageScore ≥ 75)
96% coverage confidence
Result: ~1.3M retrieval-ready tokens, embedding-ready, with a corpus health report

(Illustrative figures; actual numbers depend on the target site.)

What data can you extract?

| Data Point | Source | Example |
| --- | --- | --- |
| 📄 Markdown content | Converted page HTML | # Getting Started\n\nThis guide covers... |
| 🔗 Page URL | Request URL | https://docs.pinnacletech.io/guides/setup |
| 📌 Page title | OpenGraph, <title>, or <h1> | Getting Started — Pinnacle Docs |
| 📝 Meta description | OpenGraph or <meta name="description"> | Learn how to set up your Pinnacle account... |
| 🔢 Word count | Counted from Markdown output | 1,843 |
| 🧮 Token estimate | Markdown length ÷ 4 (GPT-style) | 2,410 |
| Content quality | wordCount + extraction method | rich |
| 🧩 Extraction method | Which content path matched | semantic-main |
| 🖥️ JS rendering required | Framework shell + thin static HTML | false |
| 🛡️ Bot protection | Anti-bot challenge detection | { "detected": false } |
| 🌐 Language | <html lang> attribute | en |
| 🔽 Crawl depth | Hops from starting URL | 2 |
| 🕐 Crawled at | Apify runtime timestamp | 2026-06-06T09:12:44.000Z |

Sample dataset output: per-page pageScore, content archetype, intent, token estimate and change type

Why use Website Content to Markdown?

Four built-in capabilities: delta mode, page-score embed gate, corpus intelligence report, and embedding-ready chunks

Large language models and RAG pipelines need clean text — not raw HTML packed with <nav> elements, cookie consent banners, sidebar widgets, and tracking scripts. Preparing web content for AI consumption by hand means copy-pasting from dozens of pages, reformatting manually, and re-doing the work every time the source changes. That process does not scale.

This actor automates the entire pipeline: it discovers pages through sitemap.xml and internal link following, extracts the main content using semantic HTML selectors, strips more than 30 categories of boilerplate, and converts the result to GitHub Flavored Markdown in a single run. Every page becomes a clean, consistently formatted document ready for downstream AI processing.

Beyond the conversion itself, the Apify platform gives you tools that matter at scale:

  • Scheduling — run weekly or on a custom cron to keep your knowledge base snapshots current
  • API access — trigger runs from Python, JavaScript, or any HTTP client and pipe results directly into your pipeline
  • Proxy rotation — scrape at scale without IP blocks using Apify's built-in residential and datacenter proxy infrastructure
  • Monitoring — get Slack or email alerts when runs fail or produce unexpected results
  • Integrations — connect output to LangChain, LlamaIndex, Pinecone, Weaviate, Zapier, Make, or webhooks in minutes

Features

Grouped by the job they do. Every field below is documented in full in the output fields table.

Content extraction

  • Semantic main-content extraction (10 selectors, <main>/<article> first), 30+ categories of boilerplate stripped, clean GitHub Flavored Markdown (headings, code, tables, task lists).
  • Sitemap discovery across five common locations + index files, breadth-first link following (depth ≤ 5), per-domain page caps, concurrent crawling with a session pool.

Retrieval optimization

  • pageScore (the headline 0-100), with retrievalScore, qualityScore, knowledgeDensity and extractionConfidence as components.
  • Heading-aware chunks (opt-in, embedding-ready with token counts), tokenEstimate per page, contentArchetype + intent + pageType for filtering and query routing.
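
To make "heading-aware" concrete, here is a minimal sketch of splitting Markdown on headings while tracking the heading path. It is not the actor's chunker: `chunk_by_heading` is a hypothetical name, the token count reuses the length ÷ 4 estimate described for tokenEstimate, and the sketch ignores chunkSize and does not skip `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section, keeping the heading path."""
    chunks = []
    path = []    # stack of (level, title) for the current heading hierarchy
    buffer = []  # body lines collected since the last heading

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({
                "headingPath": [title for _, title in path],
                "content": text,
                "tokens": len(text) // 4,  # rough estimate, same rule as tokenEstimate
            })
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Pop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "# Guide\n\nIntro text.\n\n## Install\n\nRun the installer.\n\n## Usage\n\nCall the API."
for chunk in chunk_by_heading(sample):
    print(chunk["headingPath"], chunk["tokens"])
```

Carrying the heading path on each chunk lets a retriever show where a passage sits in the document without re-parsing the page.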

Corpus intelligence

  • Run-level SUMMARY: corpusScore, coverageConfidence, retrievalAudit, embedding-cost compression, crawl-gap coverage, corpus recommendations.
  • corpusFingerprint + corpusDrift to measure how much the whole corpus changed between runs.

Change management

  • watchlistName content hashing + changeType (new / content / structure / unchanged), and deltaOnly to emit (and charge for) only changed pages.

Documentation intelligence

  • documentationVersion + run-level version-family resolution, API-reference detection (apiReference + endpoints), freshnessScore / stalenessDays, completeness, toc, codeBlocks, breadcrumbs.

Deduplication & hygiene

  • canonicalUrl + duplicateContent so the same page is never embedded twice, orphan / thin / stale detection at run level, and a link graph (extractLinks) for graph-RAG.

Reliability

  • JS-SPA + anti-bot detection emits an explicit diagnostic record (never a silent gap, never charged), failed URLs are classified with a failureType + recommendation, and proxy support for sites that restrict automated access.

The 3 fields most teams actually use

Don't let the field count intimidate you. Most pipelines branch on just three:

  • pageScore — the headline 0-100. Embed pages above your threshold.
  • retrievalScore — the embed gate (>= 75 is a clean cutoff).
  • changeType: new / content / structure / unchanged, for re-embedding only what moved.

Everything else is optional intelligence: turn it on when you need corpus audits, chunking, classification, or the knowledge graph. The defaults give you clean Markdown plus pageScore and retrievalScore; changeType is populated once you set a watchlistName.
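
A pipeline branching on these three fields might look like the sketch below. The function name `select_for_embedding` and the default thresholds of 75 are assumptions for illustration; tune them to your corpus.

```python
def select_for_embedding(pages, min_page_score=75, min_retrieval_score=75):
    """Keep pages worth (re-)embedding, based on pageScore, retrievalScore, changeType."""
    selected = []
    for page in pages:
        if page.get("failureType"):
            continue  # diagnostic record, no content to embed
        if (page.get("pageScore") or 0) < min_page_score:
            continue
        if (page.get("retrievalScore") or 0) < min_retrieval_score:
            continue
        if page.get("changeType") == "unchanged":
            continue  # already embedded on a previous run
        selected.append(page)
    return selected


# Hypothetical records shaped like the dataset output.
pages = [
    {"url": "https://docs.example.com/a", "pageScore": 89, "retrievalScore": 91, "changeType": "content"},
    {"url": "https://docs.example.com/b", "pageScore": 88, "retrievalScore": 90, "changeType": "unchanged"},
    {"url": "https://docs.example.com/c", "pageScore": 40, "retrievalScore": 35, "changeType": "new"},
    {"url": "https://docs.example.com/d", "failureType": "js-required"},
]
print([page["url"] for page in select_for_embedding(pages)])
```

Records without a changeType (runs with no watchlist) pass the last check, so the same function works for a first full ingestion.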

Use cases for converting websites to Markdown

RAG pipeline ingestion

AI engineers building retrieval-augmented generation systems need clean text to chunk and embed. This actor converts entire documentation sites into structured Markdown pages in a single run. The wordCount field on each record lets you estimate token cost before committing to chunking and embedding, avoiding expensive surprises downstream.

LLM fine-tuning dataset preparation

Teams preparing fine-tuning datasets for instruction-following or domain-specific models need high-quality, boilerplate-free text. This actor converts blog posts, knowledge bases, and technical documentation with all navigation and ad content removed — so training data reflects actual prose, not menu structures.

AI chatbot and knowledge base construction

Product teams building internal chatbots or customer-facing support tools need to ingest their documentation into a vector store. This actor converts company wikis, help centers, and product docs into Markdown that integrates directly with LangChain's UnstructuredMarkdownLoader and LlamaIndex's SimpleDirectoryReader.

Competitive content analysis

Marketing and strategy teams analyzing competitor websites can convert entire competitor blogs and resource libraries to Markdown, then run LLM-based content gap analysis, keyword extraction, and tone comparison — all from structured text rather than raw HTML.

Documentation archival and migration

Engineering teams migrating from legacy CMS platforms or creating offline documentation snapshots need a reliable way to extract content as portable Markdown. This actor crawls the full site and produces files ready for import into Hugo, Jekyll, Astro, or any Markdown-based documentation tool.

Content monitoring and freshness tracking

Set a watchlistName and schedule the actor: each run hashes every page and marks it new, content, structure, or unchanged against the prior run. Re-embed only the pages whose changed flag is true instead of rebuilding your whole vector store on every refresh. The change intelligence is built in, with no second actor and no external state to manage.

How to convert a website to Markdown

  1. Enter your URLs — Paste one or more website URLs into the "Website URLs" field. Bare domains like pinnacletech.io are auto-prefixed with https://. Use section-specific URLs like https://docs.pinnacletech.io to target only relevant content rather than an entire homepage.
  2. Set your depth and page limit — The defaults (10 pages, depth 2) work for most documentation sections. Set depth to 0 if you only need the specific pages you listed. Increase maxPagesPerDomain up to 100 for larger sites.
  3. Run the actor — Click "Start" and wait. A 10-page run typically completes in under 30 seconds. A 100-page run takes 2–5 minutes depending on page size.
  4. Download your Markdown — Open the Dataset tab and export as JSON. Each record contains the markdown field with clean, LLM-ready content. You can also stream results via the API as they arrive.

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | string[] | Yes | | Starting URLs to crawl. Bare domains are auto-prefixed with https://. One URL per line in the UI |
| maxPagesPerDomain | integer | No | 10 | Maximum pages to convert per domain (range: 1–100) |
| maxCrawlDepth | integer | No | 2 | Link-following depth from each starting URL. 0 = only the starting pages, 5 = maximum |
| includeMetadata | boolean | No | true | Include title, meta description, language code, and word count in each output record |
| onlyMainContent | boolean | No | true | Strip navigation, footers, sidebars, and ads. Extract only main article content |
| generateChunks | boolean | No | false | Split each page into heading-aware chunks ready for embedding (each carries its heading path + token count). Skips the chunking step in your RAG pipeline |
| chunkSize | integer | No | 1000 | Approximate target chunk size in tokens (used only when generateChunks is on; range 100–8000) |
| extractLinks | boolean | No | false | Include per-page internal/external links and a run-level navigation graph (inbound link counts) in the key-value store |
| extractAssets | boolean | No | false | Include page images that have alt text (type, alt, URL) on each record |
| watchlistName | string | No | | Name a watchlist to enable cross-run change detection. The actor stores each page's content hash under this name and marks pages new / content / structure / unchanged on the next run |
| deltaOnly | boolean | No | false | Requires watchlistName. Crawls the whole site but emits (and charges for) only pages that are new or changed since the last run, so you re-embed just what moved |
| proxyConfiguration | object | No | Apify Proxy | Proxy settings for crawling. Defaults to Apify's datacenter proxy pool. Use residential proxies for sites that restrict automated access |
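
The documented ranges and the deltaOnly dependency can be checked client-side before a run is started. This is a sketch under assumptions: `validate_input` and `RANGES` are hypothetical names, and whether the actor itself rejects or clamps out-of-range values is not stated here.

```python
# Ranges taken from the parameter table above.
RANGES = {
    "maxPagesPerDomain": (1, 100),
    "maxCrawlDepth": (0, 5),
    "chunkSize": (100, 8000),
}


def validate_input(actor_input: dict) -> list[str]:
    """Return human-readable problems with an input object (empty list = valid)."""
    errors = []
    if not actor_input.get("urls"):
        errors.append("urls is required and must list at least one URL")
    for name, (low, high) in RANGES.items():
        value = actor_input.get(name)
        if value is not None and not (low <= value <= high):
            errors.append(f"{name} must be between {low} and {high}, got {value}")
    if actor_input.get("deltaOnly") and not actor_input.get("watchlistName"):
        errors.append("deltaOnly requires watchlistName")
    return errors


good = {"urls": ["https://docs.pinnacletech.io"], "maxPagesPerDomain": 50, "maxCrawlDepth": 3}
bad = {"urls": [], "maxCrawlDepth": 9, "deltaOnly": True}
print(validate_input(good))
print(validate_input(bad))
```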

Input examples

Convert a documentation site (most common use case):

{
  "urls": ["https://docs.pinnacletech.io"],
  "maxPagesPerDomain": 50,
  "maxCrawlDepth": 3,
  "includeMetadata": true,
  "onlyMainContent": true
}

Batch-convert multiple sites for a knowledge base:

{
  "urls": [
    "https://docs.pinnacletech.io",
    "https://help.betaindustries.com",
    "https://support.acmecorp.com/articles"
  ],
  "maxPagesPerDomain": 25,
  "maxCrawlDepth": 2,
  "includeMetadata": true,
  "onlyMainContent": true
}

Single-page extraction (no link following):

{
  "urls": [
    "https://blog.acmecorp.com/2026/03/product-launch-guide",
    "https://blog.acmecorp.com/2026/02/api-best-practices"
  ],
  "maxPagesPerDomain": 1,
  "maxCrawlDepth": 0,
  "includeMetadata": true,
  "onlyMainContent": true
}

Input tips

  • Start with depth 0 for specific pages — if you already know which pages you need, list them explicitly and set maxCrawlDepth: 0 to avoid crawling unrelated content.
  • Use section-specific URLs — targeting https://docs.acmecorp.com/api-reference rather than https://acmecorp.com means the crawler starts in the right area and page limits apply to the relevant section.
  • Keep "main content only" enabled for AI workflows — disabling it includes navigation and sidebar text in the Markdown, which degrades chunking quality and wastes token budget in downstream LLM calls.
  • Process multiple sites in one run — the actor deduplicates by domain and tracks page limits per domain independently, so batching 10 sites in one run is more efficient than 10 separate runs.
  • Check word counts before embedding — the wordCount field lets you filter out near-empty pages and estimate token costs before sending to an embedding API.
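
The last tip can be sketched as a small pre-embedding budget check. The helper names are hypothetical, the 50-word floor is borrowed from the lower bound of the `moderate` contentQuality band, and the fallback estimate uses the same Markdown length ÷ 4 rule described for tokenEstimate (the rounding used by the actor is an assumption).

```python
def estimate_tokens(markdown: str) -> int:
    """Rough GPT-style token estimate: Markdown length divided by 4."""
    return len(markdown) // 4


def embedding_budget(pages, min_words=50):
    """Drop near-empty pages and total the estimated tokens of the rest."""
    kept = [page for page in pages if (page.get("wordCount") or 0) >= min_words]
    tokens = sum(
        page.get("tokenEstimate") or estimate_tokens(page.get("markdown", ""))
        for page in kept
    )
    return len(kept), tokens


# Hypothetical records: one rich page, one near-empty page, one without tokenEstimate.
pages = [
    {"wordCount": 412, "tokenEstimate": 540},
    {"wordCount": 12, "tokenEstimate": 20},
    {"wordCount": 60, "markdown": "x" * 400},
]
kept_count, total_tokens = embedding_budget(pages)
print(kept_count, total_tokens)
```

Multiplying `total_tokens` by your embedding provider's per-token price gives a rough cost before any API call is made.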

Output example

{
  "recordType": "page",
  "schemaVersion": "1.1.0",
  "url": "https://docs.pinnacletech.io/guides/getting-started",
  "title": "Getting Started — Pinnacle Docs",
  "description": "Everything you need to set up your first Pinnacle integration in under 10 minutes.",
  "markdown": "# Getting Started\n\nThis guide walks you through creating your first Pinnacle integration.\n\n## Prerequisites\n\nBefore you begin, make sure you have:\n\n- A Pinnacle account ([sign up free](https://pinnacletech.io/signup))\n- Node.js 18+ or Python 3.10+\n- Your API key from the [dashboard](https://dashboard.pinnacletech.io)\n\n## Step 1: Install the SDK\n\n```bash\nnpm install @pinnacle/sdk\n```\n\n## Next steps\n\n- [Authentication guide](/guides/auth)\n- [API reference](/api)",
  "wordCount": 412,
  "tokenEstimate": 540,
  "language": "en",
  "crawlDepth": 0,
  "extractionMethod": "semantic-main",
  "contentQuality": "rich",
  "pageScore": 89,
  "qualityScore": 87,
  "retrievalScore": 91,
  "knowledgeDensity": 84,
  "intent": "implement",
  "completeness": { "hasOverview": true, "hasExamples": true, "hasCode": true, "hasFaq": false, "hasTroubleshooting": false },
  "documentationVersion": "v2",
  "extractionConfidence": 95,
  "confidenceReason": ["main_element", "high_text_density", "low_navigation_ratio"],
  "htmlSize": 118000,
  "contentSize": 2480,
  "boilerplateReduction": 97.9,
  "pageType": "documentation",
  "contentArchetype": "how-to",
  "siteSection": "guides",
  "faqCount": 0,
  "procedure": true,
  "procedureSteps": 2,
  "breadcrumbs": ["Docs", "Guides", "Getting Started"],
  "freshnessScore": 92,
  "stalenessDays": 7,
  "discoveredVia": "sitemap",
  "publishedAt": "2026-01-12T00:00:00.000Z",
  "modifiedAt": "2026-05-30T00:00:00.000Z",
  "apiReference": false,
  "endpoints": [],
  "toc": [
    { "level": 1, "title": "Getting Started" },
    { "level": 2, "title": "Prerequisites" },
    { "level": 2, "title": "Next steps" }
  ],
  "codeBlocks": [{ "language": "bash", "lines": 1 }],
  "containsCode": true,
  "contentHash": "9f2c…",
  "structureHash": "4a1b…",
  "canonicalUrl": "https://docs.pinnacletech.io/guides/getting-started",
  "duplicateContent": false,
  "changeType": "content",
  "changed": true,
  "previousHash": "7d3e…",
  "parentPage": "https://docs.pinnacletech.io/guides",
  "jsRenderingRequired": false,
  "jsFramework": null,
  "botProtection": { "detected": false, "vendor": null },
  "crawledAt": "2026-06-06T09:12:44.331Z"
}

A diagnostic record for a page that needs browser rendering looks like this (empty markdown, contentQuality: "empty", failureType: "js-required", and an actionable recommendation — this record is not charged):

{
  "recordType": "page",
  "schemaVersion": "1.1.0",
  "url": "https://app.pinnacletech.io/dashboard",
  "title": "Pinnacle",
  "description": null,
  "markdown": "",
  "wordCount": 0,
  "tokenEstimate": 0,
  "language": "en",
  "crawlDepth": 1,
  "extractionMethod": "body-fallback",
  "contentQuality": "empty",
  "jsRenderingRequired": true,
  "jsFramework": "Next.js",
  "botProtection": { "detected": false, "vendor": null },
  "failureType": "js-required",
  "scrapeError": "Page renders content client-side (Next.js) — static HTML yielded no content. Browser rendering required.",
  "recommendation": "This page renders its content with client-side JavaScript. Use a browser-based crawler to extract it.",
  "crawledAt": "2026-06-06T09:12:47.882Z"
}

Pages that never loaded (network error, repeated block, timeout) also land as records with failureType set to blocked / timeout / no-data and a matching recommendation, so no requested URL silently disappears. Filter WHERE failureType IS NOT NULL (or open the Failures & diagnostics dataset view) to see everything that needs attention.
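
Outside the dataset view, the same filter is a few lines of client-side code. A minimal sketch, assuming the records have already been downloaded as a list of dicts (`group_failures` is a hypothetical helper name):

```python
from collections import defaultdict


def group_failures(records):
    """Group the URLs of diagnostic records by their failureType."""
    groups = defaultdict(list)
    for record in records:
        failure = record.get("failureType")
        if failure:  # null / missing on content-bearing pages
            groups[failure].append(record["url"])
    return dict(groups)


# Hypothetical records shaped like the examples above.
records = [
    {"url": "https://docs.pinnacletech.io/guides/getting-started", "failureType": None},
    {"url": "https://app.pinnacletech.io/dashboard", "failureType": "js-required"},
    {"url": "https://app.pinnacletech.io/settings", "failureType": "js-required"},
    {"url": "https://docs.pinnacletech.io/slow", "failureType": "timeout"},
]
failures = group_failures(records)
print(failures)
```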

Output fields

| Field | Type | Description |
| --- | --- | --- |
| recordType | string | page for a converted (or diagnostic) page, error for a run-level failure record. Stable enum for filtering |
| schemaVersion | string | Output schema version (semver). Branch on this if you pin output shape in a pipeline |
| url | string | Full URL of the converted page |
| title | string | Page title from OpenGraph og:title, then <title> tag, then first <h1>. Empty string if includeMetadata is false |
| description | string | Meta description from OpenGraph or <meta name="description">. On a diagnostic record, this carries the reason no content was extracted |
| markdown | string | Full page content converted to GitHub Flavored Markdown. The primary output field. Empty on diagnostic records |
| wordCount | integer | Word count of the Markdown text |
| tokenEstimate | integer | Rough GPT-style token count (Markdown length ÷ 4) for context-window budgeting and RAG chunk sizing |
| language | string or null | Language code from <html lang> attribute, lowercased and trimmed of region suffix (e.g., "en-US" becomes "en"). Null if not set |
| crawlDepth | integer | Number of link hops from the starting URL. 0 means the starting page itself |
| extractionMethod | string | How content was located: semantic-main, body-fallback, or full-page. Stable enum |
| contentQuality | string | Richness band: rich (300+ words), moderate (50-299), thin, or empty. Filter a RAG corpus on this. Stable enum |
| jsRenderingRequired | boolean | True when a JS-framework shell was detected and the static HTML yielded little content. The page likely needs browser rendering |
| jsFramework | string or null | Detected client-side framework (Next.js, Nuxt, React, Angular, Vue, Svelte) when JS rendering is required |
| botProtection | object | { detected: boolean, vendor: string or null }. Set when the response looks like an anti-bot challenge rather than content |
| failureType | string or null | Why a page produced no content: no-data / blocked / timeout / js-required / parse-error. Null on content-bearing pages. Filter WHERE failureType IS NOT NULL for pages that need attention |
| scrapeError | string or null | Human-readable reason a page produced no content. Null on success |
| recommendation | string or null | Actionable next step for a failed page (enable proxy, use a browser-based crawler, etc.). Null on success |
| qualityScore | integer | 0-100 content quality (word volume + heading structure + extraction cleanliness + richness). Sort on this; filter on the contentQuality band |
| extractionConfidence | integer | 0-100 confidence that extraction captured the right content, with confidenceReason[] tags. Distinct from qualityScore (which rates the content) |
| htmlSize / contentSize / boilerplateReduction | mixed | Raw HTML bytes, extracted Markdown bytes, and the % of boilerplate removed (the per-page embedding-volume saving) |
| pageScore | integer | The single headline score (0-100). Read this one number; the scores below are its drill-down components |
| retrievalScore | integer | Component. 0-100 embed-worthiness gate (quality + extraction confidence + structure + dedup). retrievalScore >= 75 is a clean ingestion filter |
| knowledgeDensity | integer | Component. 0-100 information density (value per word). A short API reference can outscore a long marketing page |
| completeness | object | { hasOverview, hasExamples, hasCode, hasFaq, hasTroubleshooting }. Use it to find docs lacking examples or troubleshooting |
| intent | string | Retrieval-routing intent: learn / implement / troubleshoot / reference / other. A query-router branches on this |
| documentationVersion | string or null | Version detected in the URL (v2, 2025, latest, ...). Keeps multi-version docs from mixing in one corpus |
| versionFamily | string or null | Version-independent path so versions of the same doc group together. The SUMMARY resolves the latest version per family |
| pageType | string | documentation / reference / blog / article / landing / other. Filter a mixed corpus by type |
| contentArchetype | string | Finer intent: how-to / tutorial / faq / reference / changelog / release-notes / explanation / blog-post / landing / other. "Ingest docs but skip release notes" |
| siteSection | string | URL-path section: api / reference / docs / guides / tutorials / blog / changelog / help / faq / other |
| faqCount / procedure / procedureSteps | mixed | Knowledge-object signals: question-style heading count, whether the page is step-by-step, and how many steps |
| breadcrumbs | array | Semantic breadcrumb trail (labels) from breadcrumb nav / schema.org BreadcrumbList |
| freshnessScore / stalenessDays | integer or null | Freshness from modifiedAt (100 = recent, decays to 0 at ~2 years; days since modified). Null when no date metadata |
| discoveredVia | string | How the URL entered the crawl: seed / sitemap / internal-link. For coverage troubleshooting |
| publishedAt / modifiedAt | string or null | Publish / last-modified dates from page metadata. Null when the page declares none |
| apiReference / endpoints | mixed | apiReference flags API docs; endpoints lists detected { method, path } pairs |
| toc | array | Heading hierarchy: { level, title }[], for navigation and semantic chunking |
| codeBlocks | array | Fenced code blocks: { language, lines }[] |
| containsCode | boolean | True when the page has at least one code block |
| chunks | array | Heading-aware chunks for RAG: { chunkId, headingPath, content, tokens, quality }[]. Each chunk carries a 0-100 quality. Empty unless generateChunks is on |
| contentHash | string | sha256 of the Markdown. Diff across runs to skip re-embedding unchanged pages |
| structureHash | string | sha256 of the heading skeleton. Distinguishes a restructure from a prose edit |
| canonicalUrl | string or null | <link rel="canonical"> target, when present |
| duplicateContent | boolean | True when the canonical URL differs from the crawled URL (e.g. a ?utm= variant). Dedupe your corpus on this |
| changed / changeType / previousHash | mixed | Watchlist mode only (null otherwise). changeType is new / content / structure / unchanged |
| parentPage | string or null | The page this URL was discovered from. Null for seed URLs |
| internalLinks / externalLinks | array | Same-domain / cross-domain links. Empty unless extractLinks is on |
| assets | array | Page images with alt text: { type, alt, url }[]. Empty unless extractAssets is on |
| crawledAt | string | ISO 8601 timestamp of when the page was crawled and converted |
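
As one example of using these fields, deduplicating on canonicalUrl before embedding can be sketched as follows. `dedupe_by_canonical` is a hypothetical helper, and preferring the record whose duplicateContent is false is a design choice of this sketch rather than documented actor behaviour.

```python
def dedupe_by_canonical(pages):
    """Keep one record per canonical URL, preferring the canonical page itself."""
    kept = {}
    for page in pages:
        key = page.get("canonicalUrl") or page["url"]
        current = kept.get(key)
        # Replace a stored variant (duplicateContent true) with the canonical record.
        if current is None or (current.get("duplicateContent") and not page.get("duplicateContent")):
            kept[key] = page
    return list(kept.values())


# Hypothetical records: a ?utm= variant, its canonical page, and a page with no canonical tag.
pages = [
    {"url": "https://docs.example.com/setup?utm=mail", "canonicalUrl": "https://docs.example.com/setup", "duplicateContent": True},
    {"url": "https://docs.example.com/setup", "canonicalUrl": "https://docs.example.com/setup", "duplicateContent": False},
    {"url": "https://docs.example.com/faq", "canonicalUrl": None, "duplicateContent": False},
]
unique = dedupe_by_canonical(pages)
print([page["url"] for page in unique])
```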

A run-level SUMMARY record turns the run into a full Corpus Intelligence Report in the key-value store:

{
  corpusScore, // 0-100 overall corpus health
  corpusFingerprint, // hash of the whole corpus
  corpusDrift, corpusSimilarity, // % change vs the prior watchlist run
  coverageConfidence, // 0-100 "did we crawl the whole site?"
  retrievalAudit: { excellent, good, poor, topIssues: [ ... ] },
  pages, words, tokens,
  compression: { rawTokens, dedupedTokens, reduction }, // embedding-cost saving from dedup
  embeddingSavingsPct,
  crawlCoverage: { expectedFromSitemap, captured, coveragePct },
  coverage: { bySection, byArchetype },
  versions, // doc-version breakdown
  versionFamilies: { "<family>": { versions, latest, pages } },
  corpusHealth: { excellent, good, poor },
  retrievalReadyPages, pagesToSkip,
  duplicatePages, orphanPages, thinPages, stalePages, documentationDebtPages,
  averageQuality, languages,
  changedPages, skippedUnchanged,
  recommendations: [ ... ], // plain-English findings
  pagesNeedingJsRendering, pagesBotBlocked, pagesFailed, chargedEvents, generatedAt
}

The recommendations[] array surfaces plain-English findings like "31 duplicate pages detected — dedupe on canonicalUrl", "14 orphan pages found (in the sitemap, no inbound links)", and "Captured 500 of 560 sitemap pages (89.3%) — raise maxPagesPerDomain for full coverage". When extractLinks is on, a NAVIGATION_GRAPH key holds { nodes: [{ url, inboundLinks, importanceScore, orphan }], edges: [{ source, target }], generatedAt } — a PageRank-lite importance ranking plus the link edges for graph-based retrieval.
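
For intuition, inbound link counts and orphan detection can be recomputed from the `nodes` and `edges` arrays alone. This sketch assumes a graph dict of the shape shown above; the helper names are hypothetical, and it does not reproduce the actor's importanceScore.

```python
from collections import Counter


def inbound_counts(graph: dict) -> dict[str, int]:
    """Count how many edges point at each node URL."""
    counts = Counter(edge["target"] for edge in graph["edges"])
    return {node["url"]: counts.get(node["url"], 0) for node in graph["nodes"]}


def find_orphans(graph: dict) -> list[str]:
    """Return node URLs that no other crawled page links to."""
    return [url for url, count in inbound_counts(graph).items() if count == 0]


# Hypothetical three-page graph: /old is in the crawl but nothing links to it.
graph = {
    "nodes": [
        {"url": "https://docs.example.com/"},
        {"url": "https://docs.example.com/setup"},
        {"url": "https://docs.example.com/old"},
    ],
    "edges": [
        {"source": "https://docs.example.com/", "target": "https://docs.example.com/setup"},
        {"source": "https://docs.example.com/setup", "target": "https://docs.example.com/"},
    ],
}
print(inbound_counts(graph))
print(find_orphans(graph))
```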

How much does it cost?

This actor uses Apify's pay-per-event pricing: $0.02 per page converted to Markdown, plus a negligible $0.00005 per run to start. Always check the Store page for the latest rate.

You only pay for content you receive. Pages that are skipped, blocked, require JavaScript rendering, fail to load, or are unchanged in delta mode are not charged; they are reported as diagnostic records at no cost.

Two controls keep spend predictable:

  • maxPagesPerDomain caps how many pages each domain can convert, so a run can never exceed your page budget.
  • deltaOnly (with a watchlistName) charges only for pages that changed since the last run. On a stable site, a scheduled refresh costs a small fraction of a full crawl, which is the core cost advantage over crawlers that re-process everything every time.
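
The arithmetic behind both controls is simple enough to sketch. The rates below are the ones quoted in this section and may change, so treat them as assumptions; `estimate_cost` is a hypothetical helper, and `Decimal` is used only to avoid floating-point rounding in currency sums.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.02")      # per page converted, as quoted above
PRICE_PER_START = Decimal("0.00005")  # per run start, as quoted above


def estimate_cost(pages_converted: int, runs: int = 1) -> Decimal:
    """Estimate the charge for a number of converted pages across one or more runs."""
    return pages_converted * PRICE_PER_PAGE + runs * PRICE_PER_START


full_crawl = estimate_cost(1000)  # every page converted and charged
delta_refresh = estimate_cost(17)  # delta mode: only the 17 changed pages
print(full_crawl, delta_refresh)
```

With the figures from the delta example earlier, a refresh that emits 17 changed pages costs a small fraction of re-converting all 1,000.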

Compared with building this yourself: manual copy-paste preparation of 100 pages for a RAG pipeline takes hours and produces inconsistent formatting; this actor returns clean, scored, embedding-ready Markdown in minutes.

Convert websites to Markdown using the API

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish
run = client.actor("ryanclinton/website-content-to-markdown").call(run_input={
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 30,
    "maxCrawlDepth": 2,
    "includeMetadata": True,
    "onlyMainContent": True,
})

for page in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Diagnostic records (JS-required, bot-blocked) carry empty markdown; skip them
    if not page.get("markdown"):
        continue
    print(f"{page['url']}: {page['wordCount']} words (~{int(page['wordCount'] * 1.3)} tokens)")
    # Save each page as a .md file for LangChain / LlamaIndex ingestion
    safe_name = page["url"].replace("https://", "").replace("/", "_")
    with open(f"{safe_name}.md", "w", encoding="utf-8") as f:
        f.write(page["markdown"])

JavaScript

import { ApifyClient } from "apify-client";
import { writeFileSync } from "fs";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

// Start the actor and wait for the run to finish
const run = await client.actor("ryanclinton/website-content-to-markdown").call({
    urls: ["https://docs.pinnacletech.io"],
    maxPagesPerDomain: 30,
    maxCrawlDepth: 2,
    includeMetadata: true,
    onlyMainContent: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const page of items) {
    // Diagnostic records (JS-required, bot-blocked) carry empty markdown; skip them
    if (!page.markdown) continue;
    console.log(`${page.url}: ${page.wordCount} words`);
    // Feed into LangChain UnstructuredMarkdownLoader or a vector database
    const safeName = page.url.replace("https://", "").replace(/\//g, "_");
    writeFileSync(`${safeName}.md`, page.markdown);
}

cURL

# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-content-to-markdown/runs?token=YOUR_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://docs.pinnacletech.io"],
"maxPagesPerDomain": 30,
"maxCrawlDepth": 2,
"includeMetadata": true,
"onlyMainContent": true
}'
# Fetch results (replace DATASET_ID from the run response above)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

How Website Content to Markdown works

Phase 1: URL discovery and sitemap parsing

When the actor starts, it normalizes each input URL (adding https:// for bare domains, validating format) and deduplicates by hostname. For each unique starting URL, it fetches /sitemap.xml using a 10-second timeout with an ApifyBot/1.0 User-Agent. The sitemap parser handles both standard sitemaps (extracting <loc> tags) and sitemap index files (fetching the first child sitemap). URLs ending in a non-HTML file extension (jpg, png, gif, pdf, zip, mp4, xml) are excluded. The combined list of starting URLs and sitemap-discovered URLs forms the initial request queue.
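
The discovery steps above might look roughly like the following in Python. This is a sketch under assumptions, not the actor's source: the function names are hypothetical, the regex stands in for whatever sitemap parser the actor really uses, and network fetching is left out so the logic can be read on its own.

```python
import re
from urllib.parse import urlparse

# Extensions listed in the paragraph above.
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip", ".mp4", ".xml")


def normalize_url(raw: str) -> str:
    """Add https:// to bare domains and reject input that has no hostname."""
    url = raw.strip()
    if not re.match(r"^https?://", url, re.IGNORECASE):
        url = "https://" + url
    if not urlparse(url).hostname:
        raise ValueError(f"invalid URL: {raw!r}")
    return url


def dedupe_by_hostname(urls: list[str]) -> list[str]:
    """Keep the first starting URL seen for each hostname."""
    seen: set[str] = set()
    result: list[str] = []
    for raw in urls:
        url = normalize_url(raw)
        host = urlparse(url).hostname
        if host not in seen:
            seen.add(host)
            result.append(url)
    return result


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Pull <loc> values out of sitemap XML and drop excluded extensions."""
    locs = re.findall(r"<loc>\s*(.*?)\s*</loc>", xml_text, re.IGNORECASE | re.DOTALL)
    return [u for u in locs if not urlparse(u).path.lower().endswith(SKIPPED_EXTENSIONS)]


starts = dedupe_by_hostname(["docs.example.com", "https://docs.example.com/guide", "example.org"])
sitemap = """<urlset>
  <url><loc>https://docs.example.com/guide</loc></url>
  <url><loc> https://docs.example.com/logo.png </loc></url>
</urlset>"""
discovered = extract_sitemap_urls(sitemap)
print(starts, discovered)
```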

Phase 2: Breadth-first crawling

A CheerioCrawler runs with 10 concurrent workers at up to 120 requests/minute, with session pooling and persistent cookies for stable multi-page crawls. Three retries are attempted on failure. The handler skips responses without an html Content-Type to avoid processing XML sitemaps or JSON feeds that sneak through. Per-domain page counts and visited URL sets are tracked in a shared Map<string, DomainState> that enforces both the page cap and URL deduplication (trailing slash normalized). For each successfully processed page, the handler enqueues same-domain internal links from <a href> elements, filtering out fragments, external domains, and binary file extensions, up to maxCrawlDepth levels deep using BFS with __crawlDepth userData propagation.
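
The depth-limited, capped traversal can be illustrated with a simplified sketch. It replaces real HTTP fetching with an in-memory link map and ignores concurrency, retries, and binary-extension filtering; bfs_crawl and its parameters are hypothetical names, not the actor's API.

```python
from collections import deque
from urllib.parse import urlparse


def normalize(url: str) -> str:
    """Treat /docs and /docs/ as the same page."""
    return url.rstrip("/")


def bfs_crawl(start: str, links: dict[str, list[str]], max_depth: int, max_pages: int) -> list[str]:
    """Visit same-host pages breadth-first, honoring a depth limit and a page cap."""
    host = urlparse(start).hostname
    first = normalize(start)
    visited = {first}
    queue = deque([(first, 0)])
    processed: list[str] = []
    while queue and len(processed) < max_pages:
        url, depth = queue.popleft()
        processed.append(url)
        if depth >= max_depth:
            continue  # do not enqueue links beyond the depth limit
        for href in links.get(url, []):
            target = normalize(href.split("#", 1)[0])  # drop fragments
            if urlparse(target).hostname != host or target in visited:
                continue  # skip external domains and already-seen URLs
            visited.add(target)
            queue.append((target, depth + 1))
    return processed


# Hypothetical site: page URL -> hrefs found on that page.
site = {
    "https://docs.example.com": [
        "https://docs.example.com/a/",
        "https://docs.example.com/b#top",
        "https://other.example.org/x",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/c", "https://docs.example.com"],
    "https://docs.example.com/c": ["https://docs.example.com/d"],
}

shallow = bfs_crawl("https://docs.example.com/", site, max_depth=1, max_pages=10)
deeper = bfs_crawl("https://docs.example.com/", site, max_depth=2, max_pages=10)
capped = bfs_crawl("https://docs.example.com/", site, max_depth=2, max_pages=2)
print(shallow, deeper, capped)
```

Carrying the depth alongside each queued URL plays the same role as the __crawlDepth userData mentioned above.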

Phase 3: Content extraction and Markdown conversion

The extraction pipeline runs in sequence for each page. First, extractContent() tries the 10 semantic selectors in order (<main>, <article>, [role="main"], #content, .content, .post-content, .entry-content, .article-body, .page-content, .main-content) and uses the first element with 200+ characters of inner HTML. A second Cheerio pass strips the 30+ non-content selectors from within the matched container. If no semantic container matches, the full <body> is used with the same stripping applied globally.

Next, htmlToMarkdown() passes the cleaned HTML to a pre-configured Turndown instance with ATX heading style, fenced code blocks (triple backtick), bullet markers, inline links, and preformattedCode: true to preserve code block whitespace. The turndown-plugin-gfm plugin adds table and strikethrough support. Two custom Turndown rules are applied: images without alt text or with data-URI sources are dropped entirely; anchor tags with empty text content are removed. The resulting Markdown is cleaned with cleanMarkdown() which collapses multiple blank lines, trims line-trailing whitespace, and strips whitespace-only lines.
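
The clean-up pass described for cleanMarkdown() is simple enough to sketch in a few lines. This Python version is an approximation written for illustration; the final strip of leading and trailing blank lines is an assumption, and the actor's own implementation may differ in detail.

```python
import re


def clean_markdown(md: str) -> str:
    """Trim trailing whitespace, empty out whitespace-only lines, collapse blank runs."""
    # rstrip() removes trailing whitespace and turns whitespace-only lines into empty ones.
    lines = [line.rstrip() for line in md.splitlines()]
    text = "\n".join(lines)
    # Collapse any run of two or more blank lines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Assumption: leading and trailing blank lines are dropped as well.
    return text.strip()


messy = "# Title  \n\n\n   \nText\t\n\n\n\nEnd\n"
cleaned = clean_markdown(messy)
print(repr(cleaned))
```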

Pages producing fewer than 50 characters of Markdown are checked for a JS-framework shell and anti-bot challenge markers. If either is detected, a diagnostic record is emitted (jsRenderingRequired / botProtection set, empty markdown, not charged) so the page is visible rather than silently lost; otherwise the page is skipped as genuinely empty. The final record includes the record type and schema version, URL, title (OpenGraph > <title> > <h1>), description (OpenGraph > meta description), Markdown, word count, token estimate, content-quality band, extraction method, JS-rendering and bot-protection signals, language, crawl depth, and ISO timestamp.
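
The short-page triage and the title fallback order can be expressed as two small functions. This is a sketch of the decision logic as described above; the function names and return labels are hypothetical, and detecting a JS shell or bot challenge is assumed to have happened already.

```python
from typing import Optional

MIN_MARKDOWN_CHARS = 50  # threshold stated above


def triage_page(markdown: str, js_shell: bool, bot_challenge: bool) -> str:
    """Return 'convert', 'diagnostic', or 'skip' for a processed page."""
    if len(markdown) >= MIN_MARKDOWN_CHARS:
        return "convert"
    if js_shell or bot_challenge:
        return "diagnostic"  # emitted with empty markdown and not charged
    return "skip"            # genuinely empty page


def pick_title(og_title: Optional[str], title_tag: Optional[str], h1: Optional[str]) -> Optional[str]:
    """Apply the precedence OpenGraph > <title> > <h1>, ignoring blank values."""
    for candidate in (og_title, title_tag, h1):
        if candidate and candidate.strip():
            return candidate.strip()
    return None


print(triage_page("x" * 120, js_shell=False, bot_challenge=False))
print(triage_page("", js_shell=True, bot_challenge=False))
print(pick_title(None, "  Install guide ", "Installing"))
```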

Tips for best results

  1. Target section roots, not homepages. Pointing at https://docs.acmecorp.com rather than https://acmecorp.com ensures your page budget is spent on documentation rather than marketing pages. The crawler follows internal links from the starting point.

  2. Use depth 0 when you have a URL list. If you already know which pages to convert, list them all in urls and set maxCrawlDepth: 0. This is faster and more predictable than relying on link discovery.

  3. Estimate token budgets before embedding. Sum the wordCount values in your output and multiply by 1.3. A 100-page documentation site averaging 800 words per page produces roughly 104,000 tokens — helpful to know before choosing an embedding model.

  4. Disable metadata for bulk training data. If you are building a fine-tuning dataset and only need raw Markdown text, set includeMetadata: false. It has negligible cost impact but keeps output records leaner.

  5. Run on a schedule for living knowledge bases. Use Apify's scheduling to re-run this actor weekly against your source sites. Pair it with the Website Change Monitor to trigger re-conversion only when content actually changes.

  6. For SPAs, use the Pro version. If a site loads content through JavaScript (React, Vue, Angular apps), this actor receives only the skeleton HTML, not the rendered content, and flags the page with jsRenderingRequired instead of converting it. See the Limitations section.

  7. Combine with Company Deep Research for enterprise content. Feed company website Markdown directly into the Company Deep Research actor for comprehensive intelligence reports that include the company's own published content.

  8. Set proxy for rate-limited sites. Enable proxyConfiguration with Apify residential proxies if a site returns 429 or 403 errors during crawling. The session pool will rotate identities across requests.
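
The token arithmetic from tip 3 can be checked with a short sketch. It assumes the dataset items have been loaded as a list of dicts carrying the wordCount field; estimate_tokens is a hypothetical helper, and 1.3 tokens per word is the rough ratio this page uses rather than an exact tokenizer count.

```python
TOKENS_PER_WORD = 1.3  # rough ratio used throughout this page


def estimate_tokens(pages: list[dict]) -> int:
    """Sum wordCount across records and scale by the tokens-per-word ratio."""
    total_words = sum(page.get("wordCount", 0) for page in pages)
    return round(total_words * TOKENS_PER_WORD)


# The example from tip 3: 100 pages averaging 800 words each.
pages = [{"wordCount": 800} for _ in range(100)]
total_tokens = estimate_tokens(pages)
print(total_tokens)
```

Records without a wordCount, such as diagnostic records, contribute zero to the estimate.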

Combine with other Apify actors

| Actor | How to combine |
| --- | --- |
| AI Training Data Curator | Convert websites to Markdown, then pass to the curator for deduplication, quality filtering, and fine-tuning dataset formatting |
| Website Change Monitor | Detect when source pages change, then trigger this actor to re-convert only the updated pages for incremental knowledge base updates |
| Company Deep Research | Convert a company's public website to Markdown and feed the content into deep research workflows for comprehensive intelligence reports |
| Website Contact Scraper | Run both actors on the same domain: one extracts contacts, the other extracts page content for enriched company profiles |
| Website Tech Stack Detector | Identify a site's technology stack first, then convert its content to Markdown, which is useful for contextualizing technical documentation |
| Competitor Analysis Report | Convert competitor sites to Markdown, then run competitive analysis on the structured text using LLMs |
| B2B Lead Gen Suite | Enrich lead profiles with content extracted from their company websites converted to Markdown |

Use in Dify

Drop this actor into Dify workflows via the Apify plugin's Run Actor node. Every page returns clean Markdown plus the classification enums your downstream node branches on. A generic HTML scraper pointed at the same site returns raw markup you then have to clean, classify, and triage by hand; this returns the Markdown already filtered and tagged for ingestion.

  • Actor ID: ryanclinton/website-content-to-markdown
  • Sample input (convert a documentation site for a RAG knowledge base):
{
    "urls": ["https://docs.pinnacletech.io"],
    "maxPagesPerDomain": 50,
    "maxCrawlDepth": 3,
    "onlyMainContent": true
}

Branching example — a Dify if/else node routes each record on stable enums, no prose parsing:

  • retrievalScore >= 75 → chunk + embed into the vector store; below that → skip (keeps low-value pages out of the corpus)
  • contentArchetype == "release-notes" or "changelog" → route to a separate index (or drop) instead of the main docs corpus
  • failureType == "js-required" → route the URL to a browser-based crawler, then re-ingest
  • failureType == "blocked" → route to a proxy-enabled re-run
  • failureType != null (any value) → error branch for review

Because failureType is null on every content-bearing page and set on every page that produced no content, one equality check (failureType IS NULL) cleanly separates the embed path from the triage path. retrievalScore is the single embed-gate; tokenEstimate feeds chunk-sizing on the embed branch; and with deltaOnly + a watchlist, the actor only emits changed pages so the whole Dify flow runs over the delta, not the full corpus.
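
Outside Dify, the same branching can be written as one plain function. The sketch below mirrors the rules listed above; the route function and the branch labels it returns are hypothetical names chosen for illustration, and only the field names and enum values come from this page.

```python
def route(record: dict) -> str:
    """Map one output record to a downstream branch using the stable enums."""
    failure = record.get("failureType")
    if failure == "js-required":
        return "browser-crawler"   # re-fetch with a browser-based crawler
    if failure == "blocked":
        return "proxy-rerun"       # retry through a proxy-enabled run
    if failure is not None:
        return "review"            # any other failure goes to the error branch
    if record.get("contentArchetype") in {"release-notes", "changelog"}:
        return "separate-index"    # keep out of the main docs corpus
    if record.get("retrievalScore", 0) >= 75:
        return "embed"             # chunk and embed into the vector store
    return "skip"                  # low-value page


# Hypothetical records for illustration.
records = [
    {"failureType": None, "contentArchetype": "guide", "retrievalScore": 82},
    {"failureType": None, "contentArchetype": "changelog", "retrievalScore": 90},
    {"failureType": "js-required"},
    {"failureType": None, "contentArchetype": "guide", "retrievalScore": 40},
]
branches = [route(r) for r in records]
print(branches)
```

Checking failureType first keeps the triage path separate from the embed path, as in the single-null-check approach described above.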

Limitations

  • No JavaScript rendering — the actor uses CheerioCrawler, which parses the server-delivered HTML response. Single-page applications (React, Vue, Angular, Next.js with client-side rendering) that load content via JavaScript return an empty shell. Rather than dropping these pages silently, the actor flags them with jsRenderingRequired: true and the detected jsFramework, so you can route them to a browser-based actor. JS-detected pages are not charged.
  • No authenticated content — only publicly accessible pages are processed. Login walls, paywalls, and members-only content produce their gate page, not the protected content behind it.
  • Same-domain crawling only — the crawler never follows links to external domains. If a site's documentation is split across multiple subdomains (e.g., docs.acmecorp.com and api.acmecorp.com), list both as separate starting URLs.
  • 100-page maximum per domain — set by the input schema's maximum constraint. For very large sites, run multiple targeted crawls against specific sections.
  • Discovery depends on links + sitemaps — the crawler checks five common sitemap locations (/sitemap.xml, /sitemap_index.xml, /sitemap-index.xml, /sitemaps.xml, /sitemap/sitemap.xml) and follows internal links, but a page that is neither linked from a crawled page nor listed in one of those sitemaps will not be discovered. Orphaned pages require explicit URL input.
  • No PDF or binary content — only HTML pages are converted. PDF documents, Word files, and embedded media are skipped.
  • English-biased class name selectors — the semantic content selectors use English CSS class names (.content, .post-content, .entry-content). Sites using non-English or unusual class naming conventions may need onlyMainContent: false to capture all content, at the cost of including some boilerplate.
  • No JavaScript execution in content — dynamically inserted content (lazy-loaded sections, infinite scroll, tab-hidden content) is not captured because it requires browser execution.

Integrations

  • LangChain / LlamaIndex — use the Apify integration to load Markdown output directly into your RAG pipeline as document chunks
  • Zapier — send converted Markdown pages to Notion, Google Docs, Confluence, or Slack on run completion
  • Make — chain conversion runs with Airtable, HubSpot, or Slack steps in automated content workflows
  • Google Sheets — export URL, title, word count, and language to a spreadsheet for content audits
  • Apify API — trigger runs programmatically from CI/CD pipelines and retrieve Markdown via REST for embedding workflows
  • Webhooks — receive a POST notification with the dataset URL when conversion finishes, enabling async pipeline triggers
  • Vector databases (Pinecone, Weaviate, Qdrant, Chroma) — pipe the markdown field directly into your embedding and upsert pipeline after chunking

Troubleshooting

  • Output Markdown is empty or very short — check the jsRenderingRequired and botProtection fields on the record. If jsRenderingRequired is true, the source site renders content client-side and needs a browser-based actor. If botProtection.detected is true, the site restricts automated access from shared IPs (enable residential proxies). These pages are reported but not charged.

  • Getting unexpected navigation or sidebar content in Markdown — some sites use non-standard markup without semantic HTML elements (<main>, <article>). The actor falls back to body-level stripping, which may miss some structural elements. Try disabling onlyMainContent and stripping the specific selectors yourself in post-processing, or provide a more specific section URL.

  • Run stopped before reaching the page limit — the actor logs a warning when the per-domain pageCount cap is reached. Increase maxPagesPerDomain (up to 100) or run multiple crawls targeting different sections of the site.

  • Some pages failing with 403 or 429 errors — the target site is blocking the crawler. Enable proxyConfiguration with "useApifyProxy": true and optionally set proxyUrls to residential proxies. The session pool will rotate IPs across requests.

  • Sitemap URLs not being picked up — some sites serve their sitemap at a non-standard path (e.g., /sitemaps/pages.xml). The actor only checks the common sitemap locations listed under Limitations. For sites with non-standard sitemap locations, add the specific page URLs manually to the urls input.

Responsible use

  • This actor only accesses publicly available web pages.
  • Respect robots.txt directives and website terms of service regarding automated access.
  • Do not use converted content in ways that violate the original site's copyright or content license.
  • Comply with applicable data protection laws (GDPR, CCPA) when storing or processing scraped content.
  • For guidance on web scraping legality, see Apify's guide.

FAQ

How do I convert a website to Markdown for a RAG pipeline? Enter the documentation site URL, set maxPagesPerDomain to the number of pages you want, and set onlyMainContent: true. Each output record contains a markdown field ready for chunking and embedding. The wordCount field helps you estimate token counts before sending to your embedding API.

What types of websites does this actor convert best? Text-heavy, server-rendered sites: documentation portals, developer guides, help centers, blogs, knowledge bases, and informational pages. Sites that rely on JavaScript to render their content (React SPAs, Angular apps) are not supported — use a headless browser approach for those.

Can I use the Markdown output directly with ChatGPT, Claude, or Gemini? Yes. The Markdown format is natively understood by all major LLMs. Feed the markdown field directly into prompts, or use the word count to gauge how many pages fit within a context window (rough estimate: 1 word ≈ 1.3 tokens).

How many pages can I convert in one run? Up to 100 pages per domain per run, across as many domains as you provide. For larger sites, run multiple targeted crawls against different sections and merge the datasets. There is no limit on the number of domains in a single run.

Does this actor follow links to other domains? No. The crawler only follows internal links within the same domain (and subdomain) as each starting URL. If you need content from multiple domains, add each as a separate entry in the urls input.

How is this different from manually copy-pasting website content? Manual copy-paste for 50 pages takes 2–4 hours and produces inconsistent formatting. This actor processes 50 pages in under 2 minutes, produces consistently formatted GitHub Flavored Markdown, strips all boilerplate automatically, and runs unattended on a schedule. The per-page word count and metadata fields are not available from manual copying.

How does the "main content only" mode work? The actor tries 10 semantic HTML selectors in priority order — <main>, <article>, [role="main"], and 7 common content class names. The first matching element with 200+ characters of inner HTML is used as the content container. Non-content elements (nav, footer, sidebar, ads, etc.) are then stripped from within that container. If no semantic container is found, the full <body> is used with the same stripping applied.

Is it legal to convert website content to Markdown? Accessing publicly available web pages is generally legal in most jurisdictions. However, you should review each target website's terms of service, respect robots.txt directives, and ensure your use of the converted content complies with copyright law. For commercial AI training use cases, some site terms explicitly restrict automated scraping. See Apify's guide on web scraping legality for a detailed overview.

Can I schedule this actor to run automatically? Yes. Apify's scheduling feature lets you set recurring runs on a cron schedule (daily, weekly, or custom). This is ideal for keeping documentation snapshots current or monitoring competitor content.

What happens to pages that fail to load? Failed requests are retried up to 3 times with exponential backoff. If still failing after retries, the page is logged as a warning and skipped. Skipped pages do not count toward the per-domain page limit, so your budget is not wasted on failures.

How is this different from Apify's Website Content Crawler? Both convert web pages to text, but this actor is a lightweight, cost-efficient solution for straightforward HTML sites. It uses CheerioCrawler (no browser, ~256 MB memory) and outputs structured JSON with per-page metadata. Apify's Website Content Crawler uses a full browser and supports JavaScript rendering but runs at higher cost. Choose this actor for static and server-rendered sites; choose a browser-based solution for SPAs.

Can I use this actor's output with LangChain or LlamaIndex? Yes. Once the markdown field is saved as .md files, it loads with LangChain's UnstructuredMarkdownLoader or LlamaIndex's SimpleDirectoryReader. Apify also provides a native LangChain integration that loads dataset items directly as LangChain documents.

Help us improve

If you encounter issues, you can help us debug faster by enabling run sharing in your Apify account:

  1. Go to Account Settings > Privacy
  2. Enable Share runs with public Actor creators

This lets us see your run details when something goes wrong, so we can fix issues faster. Your data is only visible to the actor developer, not publicly.

Support

Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom solutions or enterprise integrations, reach out through the Apify platform.