URL to Markdown (JustHTML) - Clean Markdown Extractor

Pricing

Pay per usage

Convert webpages to clean Markdown for RAG and archiving. Uses JustHTML and supports optional Cloudflare/Turnstile bypass plus CSS selector extraction.

  • Rating: 0.0 (0)
  • Developer: Anass Seb (maintained by Community)
  • Bookmarked: 1
  • Total users: 3
  • Monthly active users: 1
  • Last modified: 8 days ago

Link to Markdown (JustHTML + Cloudflare Bypass)

🔗 URL → 🧼 Clean Markdown • 🛡️ Optional bypass • 🎯 CSS selector

Convert web links into clean Markdown for RAG, archiving, content pipelines, and AI agents.

This Actor fetches a URL, optionally bypasses Cloudflare challenges using the Camoufox-based open-source bypass approach from this repository, and converts the resulting HTML to Markdown with JustHTML (a pure-Python HTML5 parser with built-in safe output).

Keywords

link to markdown, html to markdown, webpage to markdown, url to markdown, cloudflare bypass, turnstile, anti-bot, RAG, LLM, AI agent, markdown extractor

Why this Actor

If you need a dependable URL → Markdown converter for RAG pipelines, you usually hit three problems:

  1. Broken or messy HTML that produces garbage Markdown
  2. Heavy JavaScript pages that hide the real content
  3. Anti-bot / Cloudflare interstitials that block simple fetchers

This Actor is built to be a practical extractor for AI agents, vector databases, knowledge bases, and content archiving workflows.

Common use cases

  • Convert product docs pages into Markdown for RAG
  • Build internal knowledge base snapshots from URLs
  • Extract “article” content with a CSS selector (main, article, .content)
  • Prepare clean Markdown for embedding/search indexing

Tips for better extraction

  • Set selector to target the content container (article, main, .markdown-body)
  • Use includeHtml=true only when debugging extraction
  • Keep safe=true when ingesting untrusted pages into downstream systems

What you get

  • Markdown output per URL (optionally for a specific CSS selector like article, main, or .markdown-body)
  • Safe-by-default sanitization for untrusted HTML
  • Optional Cloudflare challenge bypass fallback when normal fetching fails
  • Dataset output suitable for exporting to JSON/CSV

Input

  • urls (array) or url (string)
  • selector (string, optional)
  • safe (boolean, default: true)
  • useCloudflareBypass (boolean, default: true)
  • bypassCache (boolean, default: false)
  • proxyUrl (string, optional)
  • includeHtml (boolean, default: false)
  • maxConcurrency (int, default: 2)
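As a rough illustration of how these fields fit together, the sketch below assembles an input object in Python and fills in the defaults listed above. The helper name `build_input` is hypothetical and is not part of the Actor; only the field names and default values come from this README.

```python
from __future__ import annotations

from typing import Any

# Defaults copied from the input list in this README.
DEFAULTS: dict[str, Any] = {
    "safe": True,
    "useCloudflareBypass": True,
    "bypassCache": False,
    "includeHtml": False,
    "maxConcurrency": 2,
}

# Optional string fields that have no default.
OPTIONAL_FIELDS = {"selector", "proxyUrl"}


def build_input(urls: str | list[str], **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: build an input object for this Actor."""
    # Accept a single URL string or a list of URLs.
    if isinstance(urls, str):
        urls = [urls]
    if not urls:
        raise ValueError("at least one URL is required")

    # Reject misspelled or unsupported field names early.
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")

    run_input: dict[str, Any] = {"urls": list(urls), **DEFAULTS}
    # Overrides set to None are treated as "not provided".
    run_input.update({k: v for k, v in overrides.items() if v is not None})
    return run_input
```

For example, `build_input("https://example.com", selector="article")` yields an object with one URL, the `article` selector, and every other field at its documented default.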

Output (dataset items)

Each item contains:

  • url, finalUrl
  • status (success or failed)
  • title
  • markdown
  • statusCode, contentType
  • bypassed (boolean)
  • error (string, if failed)
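A downstream consumer will usually want to separate successful items from failed ones before indexing. The following is a minimal sketch that assumes items are plain dictionaries with the fields listed above; the function names `split_items` and `to_document` are illustrative, and the header layout of the rendered document is a choice made for this example, not something the Actor produces.

```python
from __future__ import annotations

from typing import Any, Iterable

Item = dict[str, Any]


def split_items(items: Iterable[Item]) -> tuple[list[Item], list[Item]]:
    """Partition dataset items by their `status` field."""
    ok: list[Item] = []
    failed: list[Item] = []
    for item in items:
        (ok if item.get("status") == "success" else failed).append(item)
    return ok, failed


def to_document(item: Item) -> str:
    """Render one successful item as a Markdown document with a source line."""
    # Prefer the post-redirect URL when the item provides one.
    source = item.get("finalUrl") or item.get("url", "")
    title = item.get("title") or source
    return f"# {title}\n\nSource: {source}\n\n{item.get('markdown', '')}\n"
```

Failed items carry an `error` string, so the second list returned by `split_items` can be logged or retried separately.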

Example input

{
  "urls": [
    "https://github.com/EmilStenstrom/justhtml"
  ],
  "selector": ".markdown-body",
  "safe": true,
  "useCloudflareBypass": true
}
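One way to send an input like this from a script is the Apify REST API's synchronous run endpoint, which starts a run and returns the dataset items in the response. The sketch below uses only the Python standard library. The Actor ID `username~actor-name` is a placeholder (this README does not state the real ID), and the function names and timeout value are assumptions made for illustration.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Any

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "username~actor-name"  # placeholder: replace with the real Actor ID


def build_request(
    run_input: dict[str, Any], token: str, actor_id: str = ACTOR_ID
) -> urllib.request.Request:
    """Build the POST request that runs the Actor and returns dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(
    run_input: dict[str, Any],
    token: str,
    actor_id: str = ACTOR_ID,
    timeout: float = 300.0,  # assumed client-side limit for this sketch
) -> list[dict[str, Any]]:
    """Send the request and parse the JSON array of dataset items."""
    request = build_request(run_input, token, actor_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `run_actor(run_input, token)` with the example input and a valid API token should return a list of items shaped as described under "Output (dataset items)". Long-running jobs are likely better served by an asynchronous run followed by a dataset fetch.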

Deploy to Apify

  1. Install the Apify CLI and log in
  2. From this Actor directory, run:
$ apify push

Then publish from the Apify Console with a title/description similar to this README for strong discoverability:

  • Keywords: link to markdown, html to markdown, justhtml, cloudflare bypass, turnstile, RAG

Licensing

  • This Actor’s code in this repository follows the repository’s license.
  • JustHTML is vendored and distributed under its own license (see its LICENSE file).