Pricing

from $1.50 / 1,000 pages

Site to Markdown — any site to clean, LLM-ready markdown

Scrape any website to clean, LLM-ready markdown: a compliant Firecrawl alternative for RAG ingestion, with robots.txt compliance always on.

Developer

Connor Teskey

Last modified: 2 months ago

Site to Markdown

Turn any website into clean, LLM-ready markdown — one document per page, with robots.txt compliance locked on.

Built for AI agents, RAG builders, and documentation pipelines that need a website-to-markdown step without running crawler infrastructure. Point it at a URL: it crawls breadth-first, strips navigation, ads, and boilerplate, and keeps only the main content as tidy markdown. If you have been looking for a Firecrawl alternative on Apify for scrape-to-markdown jobs, this is that actor.
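The breadth-first crawl described above can be pictured with a small sketch. It is not the actor's code: `bfs_crawl_order` and the in-memory `links` map are hypothetical stand-ins for real fetching, and the way `max_pages` and `max_depth` are applied is an assumption modeled on the `maxPages` and `maxDepth` inputs.

```python
from collections import deque
from urllib.parse import urlsplit


def bfs_crawl_order(start_url, links, max_pages=10, max_depth=1):
    """Return URLs in breadth-first order, staying on the start URL's host.

    `links` maps a URL to the URLs found on that page; it stands in for
    real fetching, which this sketch does not do.
    """
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in links.get(url, []):
            if link not in seen and urlsplit(link).netloc == host:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Pages closer to the start URL are visited first, so a small `max_pages` budget is spent on the top of the site before any deep pages.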

What you get

One dataset item per page:

Field	Meaning
`url`	The URL that was requested.
`finalUrl`	URL after redirects.
`status`	HTTP status code (0 when the fetch itself failed).
`title`	Page title, when found.
`markdown`	Clean, LLM-ready markdown of the page's main content.
`text`	Plain-text version (only when `outputFormat` is `markdown+text`).
`linksCount`	Number of links discovered on the page.
`fetchedAt`	ISO-8601 fetch timestamp.
`rendered`	Whether a headless browser rendered the page (always `false` in v1).
`error`	Error message when the page failed, otherwise `null`.

Every run also writes a RUN_SUMMARY record to the key-value store with page counts and a failure breakdown.

Quick start

{
    "startUrls": [{ "url": "https://docs.python.org/3/" }],
    "crawlMode": "site-crawl",
    "maxPages": 10,
    "maxDepth": 1
}

A run like this returns one markdown document per crawled page and typically finishes in well under a minute; the verification crawl of docs.python.org converted 5 of 5 pages.
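For callers working from Python, the quick-start input above could be built and submitted roughly as follows. This is a sketch under assumptions: `build_run_input` and `run_actor` are hypothetical helpers, `actor_id` must be the actor's ID as shown in the Apify Store (it is not stated in this README), and `run_actor` relies on the third-party `apify-client` package. Omitting `maxPages` and `maxDepth` in `single-page` mode is also an assumption, based on the FAQ's description of that mode.

```python
def build_run_input(urls, crawl_mode="site-crawl", max_pages=10, max_depth=1):
    """Build an input object shaped like the quick-start JSON above."""
    if crawl_mode not in ("site-crawl", "single-page"):
        raise ValueError(f"unknown crawlMode: {crawl_mode!r}")
    run_input = {
        "startUrls": [{"url": u} for u in urls],
        "crawlMode": crawl_mode,
    }
    if crawl_mode == "site-crawl":
        # Assumption: crawl limits only matter when links are followed.
        run_input["maxPages"] = max_pages
        run_input["maxDepth"] = max_depth
    return run_input


def run_actor(token, actor_id, run_input):
    """Start the actor and return its dataset items.

    Requires `pip install apify-client`; imported lazily so the helper
    above stays usable without it.
    """
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `build_run_input(["https://docs.python.org/3/"])` reproduces the quick-start input.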

Output example

{
    "url": "https://docs.python.org/3/tutorial/index.html",
    "finalUrl": "https://docs.python.org/3/tutorial/index.html",
    "status": 200,
    "title": "The Python Tutorial — Python 3.14.6 documentation",
    "markdown": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data st...",
    "linksCount": 35,
    "fetchedAt": "2026-06-11T00:49:18+00:00",
    "rendered": false,
    "error": null
}

Why this one

Robots-locked by design. Compliance is hard-coded into the crawler call, not an input default someone can flip. That makes the output safe to build a product on.
Selector-free extraction. Main content is found by trafilatura with an automatic readability-style fallback — no CSS selectors to maintain when a site redesigns.
Honest zero-yield. If no pages produce markdown, the run fails with a classified failure breakdown instead of finishing green on an empty dataset.
Precise scope control. Include/exclude glob patterns match against the full URL, exclude wins, and same-domain crawling is the default.
Open foundation. Built on trawl (MIT), a clean-room crawler, with trafilatura as the quality extraction engine — the exact wheel is vendored into the image.
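The "honest zero-yield" behavior can be illustrated with a short sketch. The failure labels come from this README; everything else is an assumption, including the `summarize_run` helper, the shape of each result, and the summary field names, none of which are taken from the actor's actual RUN_SUMMARY format.

```python
from collections import Counter

FAILURE_CLASSES = ("robots-blocked", "http-4xx", "http-5xx", "timeout", "extract-fail")


def summarize_run(results):
    """Count successes and failures; raise if no page yielded markdown.

    Each result is a dict with an optional `markdown` string and an
    optional `failure` label from FAILURE_CLASSES (hypothetical shape).
    """
    succeeded = sum(1 for r in results if r.get("markdown"))
    failures = Counter(r["failure"] for r in results if r.get("failure"))
    summary = {
        "pagesSucceeded": succeeded,
        "pagesFailed": sum(failures.values()),
        "failures": {name: failures.get(name, 0) for name in FAILURE_CLASSES},
    }
    if succeeded == 0:
        # Fail loudly instead of finishing green on an empty dataset.
        raise RuntimeError(f"zero pages produced markdown: {summary['failures']}")
    return summary
```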

Compliance and reliability

Topsail actors are built compliance-first and ship with self-healing plumbing:

robots.txt is always respected and locked on. Every fetch goes through the crawler with robots compliance hard-coded; there is no input to turn it off. Pages disallowed by robots.txt are never fetched and are reported as robots-blocked. A robots.txt Crawl-delay is honored whenever it is larger than your politeness delay.
This actor reads only the public, static HTML pages you point it at — the same documents any browser receives without logging in — and only where robots.txt permits.
Transient failures retry with backoff (408, 425, 429, and 5xx responses, honoring Retry-After); persistent failures are reported, not hidden.
Every run writes a HEALTH summary (RUN_SUMMARY) to the key-value store. It contains page counts, a failure breakdown (robots-blocked, http-4xx, http-5xx, timeout, extract-fail), and a per-URL failedPages list, so you can see exactly which pages delivered and which were blocked, empty, or erroring. Only successful pages become dataset results.
No PII, no paywalled or login-gated content, no circumvention.
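The retry rule in the list above (408, 425, 429, and 5xx, honoring Retry-After) might look like the following sketch. The status set comes from this README; the exponential schedule, the `base` and `cap` values, and the helper names are assumptions, not the actor's real settings. Only the numeric form of Retry-After is handled here.

```python
RETRYABLE = {408, 425, 429}


def is_retryable(status):
    """True for the transient statuses listed above: 408, 425, 429, and 5xx."""
    return status in RETRYABLE or 500 <= status <= 599


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (1-based).

    Uses capped exponential backoff, but never waits less than a numeric
    Retry-After header value. HTTP-date values are ignored in this sketch.
    """
    delay = min(cap, base * 2 ** (attempt - 1))
    if retry_after is not None:
        try:
            delay = max(delay, float(retry_after))
        except ValueError:
            pass  # not a number of seconds; keep the backoff delay
    return delay
```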

Pricing

Pay per result: $1.50 per 1,000 pages successfully extracted ($0.0015 per page), plus a fraction-of-a-cent actor start fee. Every dataset result is one extracted page; robots-blocked pages, failed fetches, and pages dropped by your URL filters never become results, so they cost nothing. The 10-page quick start above costs at most 10 × $0.0015 = $0.015 in page charges, or about two cents once the start fee is included.
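The pricing arithmetic can be checked with a few lines of Python. The per-page rate is the one stated above; the exact start fee is not given in this README, so `start_fee` is a caller-supplied assumption that defaults to zero, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_1000_PAGES = Decimal("1.50")  # rate stated in the pricing section


def estimate_cost(pages_extracted, start_fee=Decimal("0")):
    """Return the estimated charge in dollars for one run.

    Only successfully extracted pages are billed, so pass the number of
    dataset results, not the number of URLs attempted.
    """
    per_page = PRICE_PER_1000_PAGES / 1000
    return per_page * pages_extracted + start_fee
```

`Decimal` is used instead of floats so that small per-page amounts add up exactly.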

Honest limits

No JavaScript rendering. Static HTML only — SPAs that render entirely client-side will come back thin. Headless rendering is on the roadmap for v2.
No sitemap.xml seeding yet; discovery is link-following from your start URLs.
One markdown document per page; no site-level concatenated export (easy to build downstream from the dataset).
robots.txt compliance cannot be disabled. If your use case requires ignoring robots.txt, this actor is not for you — by design.
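The site-level export mentioned in the list above can be assembled downstream from the dataset. A minimal sketch, assuming items shaped like the output example; `concatenate_pages`, the HTML-comment source marker, and the `---` separator are illustrative choices, not part of the actor.

```python
def concatenate_pages(items):
    """Join per-page markdown into one document, skipping failed or empty pages."""
    sections = []
    for item in items:
        if item.get("error") or not item.get("markdown"):
            continue
        source = item.get("finalUrl") or item["url"]
        # Keep the source URL next to each section for later citation.
        sections.append(f"<!-- source: {source} -->\n\n{item['markdown'].strip()}")
    return "\n\n---\n\n".join(sections) + "\n"
```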

FAQ

Is this a Firecrawl alternative? For the core scrape and crawl endpoints, yes: website to markdown, one clean document per page, ready for RAG ingestion — as an Apify actor instead of separate infrastructure. It does not replicate Firecrawl's JS rendering or search features in v1.

Can it scrape JavaScript-heavy sites? Not in v1. It fetches static HTML, so server-rendered sites, documentation, and blogs work well; client-side SPAs come back thin.

How do I scrape a single page to markdown? Set crawlMode to single-page and list your URLs in startUrls; each one is converted on its own with no link following.

How do I keep a crawl focused on one section of a site? Use full-URL glob patterns: include https://docs.example.com/en/* and exclude */changelog/*, for example. Exclude always wins.
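The include/exclude rule can be sketched with Python's standard `fnmatch` module. This is an approximation: the actor's exact glob dialect is not documented here, `fnmatchcase` lets `*` match across `/`, and treating an empty include list as "allow everything" is an assumption. `url_in_scope` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def url_in_scope(url, include=(), exclude=()):
    """Apply include/exclude globs to the full URL; exclude always wins."""
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    # Assumption: an empty include list means "allow everything".
    return not include or any(fnmatchcase(url, pattern) for pattern in include)
```

With the patterns from the answer above, `https://docs.example.com/en/guide/intro.html` is in scope, while `https://docs.example.com/en/changelog/v2.html` is excluded even though it matches the include pattern.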

Can I turn off robots.txt compliance? No. It is hard-coded on, with no input to disable it. Disallowed pages are reported as robots-blocked so you can see what was skipped.
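To see what a robots check involves, here is a sketch built on Python's standard `urllib.robotparser`. It is not the actor's implementation (the README says that is built on trawl); `check_robots`, the user-agent string, and the default politeness delay are hypothetical. It also applies the Crawl-delay rule from the compliance section: the larger of the two delays wins.

```python
from urllib.robotparser import RobotFileParser


def check_robots(robots_txt, user_agent, url, politeness_delay=1.0):
    """Return (allowed, delay_seconds) for one URL under a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    crawl_delay = parser.crawl_delay(user_agent)  # None when not declared
    delay = max(politeness_delay, float(crawl_delay or 0))
    return allowed, delay
```

A URL that comes back with `allowed` set to `False` would be recorded as robots-blocked and never fetched.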

More compliant data feeds from Topsail

GTA 6 Countdown & Developments Tracker — countdown, confirmed facts, diffed developments, market odds
Commodity Intel — oil, gold, uranium headlines from permitted sources
Crypto News — BTC/ETH/DeFi headlines from major outlets
AI Research Radar — new papers and lab announcements
