Site to Markdown — any site to clean, LLM-ready markdown

Pricing

from $1.50 / 1,000 pages

Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.

Developer: Connor Teskey (maintained by Community)

Actor stats: 0 bookmarks, 2 total users, 1 monthly active user; last modified 3 days ago.

Site to Markdown

Turn any website into clean, LLM-ready markdown — one document per page, with robots.txt compliance locked on.

Built for AI agents, RAG builders, and documentation pipelines that need a website-to-markdown step without running crawler infrastructure. Point it at a URL: it crawls breadth-first, strips navigation, ads, and boilerplate, and keeps only the main content as tidy markdown. If you have been looking for a Firecrawl alternative on Apify for scrape-to-markdown jobs, this is that actor.

What you get

One dataset item per page:

Field        Meaning
url          The URL that was requested.
finalUrl     URL after redirects.
status       HTTP status code (0 when the fetch itself failed).
title        Page title, when found.
markdown     Clean, LLM-ready markdown of the page's main content.
text         Plain-text version (only when outputFormat is markdown+text).
linksCount   Number of links discovered on the page.
fetchedAt    ISO-8601 fetch timestamp.
rendered     Whether a headless browser rendered the page (always false in v1).
error        Error message when the page failed, otherwise null.

Every run also writes a RUN_SUMMARY record to the key-value store with page counts and a failure breakdown.

Quick start

{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "crawlMode": "site-crawl",
  "maxPages": 10,
  "maxDepth": 1
}

A run like this returns one markdown document per crawled page and typically finishes in well under a minute; the verification crawl of docs.python.org converted 5 of 5 pages.
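
The quick-start input can also be assembled programmatically. The sketch below is a minimal, hypothetical helper (build_run_input is not part of the actor) that uses only the field names shown above plus the single-page mode mentioned in the FAQ. The validation rules are assumptions, not documented constraints.

```python
import json

# Crawl modes named in this README; anything else is rejected by this sketch.
CRAWL_MODES = {"site-crawl", "single-page"}


def build_run_input(urls, crawl_mode="site-crawl", max_pages=10, max_depth=1):
    """Build an actor input dict using the field names from the quick start.

    Hypothetical helper: the bounds checked here are assumptions.
    """
    if crawl_mode not in CRAWL_MODES:
        raise ValueError(f"unknown crawlMode: {crawl_mode!r}")
    if not urls:
        raise ValueError("at least one start URL is required")
    if max_pages < 1 or max_depth < 0:
        raise ValueError("maxPages must be >= 1 and maxDepth must be >= 0")
    return {
        "startUrls": [{"url": u} for u in urls],
        "crawlMode": crawl_mode,
        "maxPages": max_pages,
        "maxDepth": max_depth,
    }


run_input = build_run_input(["https://docs.python.org/3/"])
print(json.dumps(run_input, indent=2))
```

With the defaults, the printed JSON matches the quick-start input; paste it into the actor's input editor or pass it to whichever client you use to start runs.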

Output example

{
  "url": "https://docs.python.org/3/tutorial/index.html",
  "finalUrl": "https://docs.python.org/3/tutorial/index.html",
  "status": 200,
  "title": "The Python Tutorial — Python 3.14.6 documentation",
  "markdown": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data st...",
  "linksCount": 35,
  "fetchedAt": "2026-06-11T00:49:18+00:00",
  "rendered": false,
  "error": null
}
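
For RAG ingestion, a downstream step usually filters items before indexing. The following is a rough sketch that assumes items shaped like the example above; usable_pages and the min_chars threshold are hypothetical and not part of the actor. The error and status checks are defensive, and the length check is one possible way to skip pages that came back thin.

```python
def usable_pages(items, min_chars=50):
    """Keep items that fetched cleanly and carry enough markdown to index.

    min_chars is an arbitrary, assumed threshold; tune it for your corpus.
    """
    kept = []
    for item in items:
        if item.get("error") is not None:
            continue  # page reported a failure
        if not 200 <= item.get("status", 0) < 300:
            continue  # status 0 means the fetch itself failed
        markdown = item.get("markdown") or ""
        if len(markdown) < min_chars:
            continue  # too little content to be worth indexing
        kept.append({
            "source": item.get("finalUrl") or item["url"],
            "title": item.get("title"),
            "text": markdown,
        })
    return kept


sample_items = [
    {
        "url": "https://docs.python.org/3/tutorial/index.html",
        "finalUrl": "https://docs.python.org/3/tutorial/index.html",
        "status": 200,
        "title": "The Python Tutorial",
        "markdown": "# The Python Tutorial\n\nPython is an easy to learn, "
                    "powerful programming language.",
        "error": None,
    },
    # Made-up items below, only to exercise the filter.
    {"url": "https://example.com/app", "status": 200,
     "markdown": "Loading...", "error": None},
    {"url": "https://example.com/down", "status": 0,
     "markdown": None, "error": "timeout"},
]

documents = usable_pages(sample_items)
print([d["source"] for d in documents])
```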

Why this one

  • Robots-locked by design. Compliance is hard-coded into the crawler call, not an input default someone can flip. That makes the output safe to build a product on.
  • Selector-free extraction. Main content is found by trafilatura with an automatic readability-style fallback — no CSS selectors to maintain when a site redesigns.
  • Honest zero-yield. If no pages produce markdown, the run fails with a classified failure breakdown instead of finishing green on an empty dataset.
  • Precise scope control. Include/exclude glob patterns match against the full URL, exclude wins, and same-domain crawling is the default.
  • Open foundation. Built on trawl (MIT), a clean-room crawler, with trafilatura as the quality extraction engine — the exact wheel is vendored into the image.

Compliance and reliability

Topsail actors are built compliance-first and ship with self-healing plumbing:

  • robots.txt is always respected — locked on. Every fetch goes through the crawler with robots compliance hard-coded; there is no input to turn it off. Pages disallowed by robots.txt are reported as robots-blocked, never fetched, and robots Crawl-delay is honored when larger than your politeness delay.
  • This actor reads only the public, static HTML pages you point it at — the same documents any browser receives without logging in — and only where robots.txt permits.
  • Transient failures retry with backoff (408, 425, 429, and 5xx responses, honoring Retry-After); persistent failures are reported, not hidden.
  • Every run writes a HEALTH summary (the RUN_SUMMARY record) to the key-value store. It contains page counts, a failure breakdown (robots-blocked, http-4xx, http-5xx, timeout, extract-fail), and a per-URL failedPages list, so you can see exactly which pages delivered and which were blocked, empty, or erroring. Only successful pages become dataset results.
  • No PII, no paywalled or login-gated content, no circumvention.
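
The retry rule in the list above can be sketched as two small functions. This is an illustration under stated assumptions, not the actor's implementation: the exponential schedule, the base and cap values, and the function names are all hypothetical, and only the numeric (delta-seconds) form of Retry-After is handled.

```python
# Statuses the README lists as transient, in addition to any 5xx.
RETRYABLE_STATUSES = {408, 425, 429}


def is_retryable(status):
    """True for 408, 425, 429, and any 5xx response."""
    return status in RETRYABLE_STATUSES or 500 <= status <= 599


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (1-based).

    Assumed schedule: exponential backoff capped at `cap`, never shorter
    than a numeric Retry-After value when the server sent one.
    """
    backoff = min(cap, base * 2 ** (attempt - 1))
    if retry_after is None:
        return backoff
    try:
        requested = float(retry_after)
    except ValueError:
        # HTTP-date form of Retry-After is not parsed in this sketch.
        return backoff
    return max(backoff, requested)


for status in (200, 404, 408, 429, 503):
    print(status, is_retryable(status))
print(retry_delay(3), retry_delay(1, retry_after="10"))
```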

Pricing

Pay per result: $1.50 per 1,000 pages successfully extracted ($0.0015 per page), plus a fraction-of-a-cent actor start fee. Every dataset result is one extracted page — robots-blocked pages, failed fetches, and pages dropped by your URL filters never become results, so they cost nothing. The 10-page quick start above costs about two cents.
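
As a quick check of the arithmetic, the sketch below estimates the per-page charge from the rate stated above. The start fee is not given exactly in this README, so it is left as a parameter defaulting to zero; estimate_cost is a hypothetical helper, not a billing API.

```python
from decimal import Decimal

# Rate from the pricing section: $1.50 per 1,000 extracted pages.
PRICE_PER_1000_PAGES = Decimal("1.50")


def estimate_cost(pages_extracted, start_fee=Decimal("0")):
    """Estimated charge in dollars for one run.

    start_fee is an assumption: the README only says it is a fraction
    of a cent, so pass the real value if you know it.
    """
    return PRICE_PER_1000_PAGES * pages_extracted / 1000 + start_fee


print(estimate_cost(10))
print(estimate_cost(1000))
```

Ten extracted pages come to $0.015 before the start fee, which is consistent with the "about two cents" figure for the quick start.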

Honest limits

  • No JavaScript rendering. Static HTML only — SPAs that render entirely client-side will come back thin. Headless rendering is on the roadmap for v2.
  • No sitemap.xml seeding yet; discovery is link-following from your start URLs.
  • One markdown document per page; no site-level concatenated export (easy to build downstream from the dataset).
  • robots.txt compliance cannot be disabled. If your use case requires ignoring robots.txt, this actor is not for you — by design.
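
For the site-level export mentioned in the list above, one possible downstream approach is to join the per-page markdown yourself. This is a minimal sketch that assumes dataset items with the url and markdown fields described earlier; concat_site, the sort order, and the separator are illustrative choices.

```python
def concat_site(items, separator="\n\n---\n\n"):
    """Join per-page markdown into one document, ordered by URL.

    Each page is prefixed with an HTML comment naming its source so the
    combined file can still be traced back to individual pages.
    """
    parts = []
    for item in sorted(items, key=lambda i: i["url"]):
        markdown = (item.get("markdown") or "").strip()
        if not markdown:
            continue  # skip pages with no extracted content
        parts.append(f"<!-- source: {item['url']} -->\n{markdown}")
    return separator.join(parts)


# Made-up pages, only to show the shape of the output.
pages = [
    {"url": "https://docs.example.com/b", "markdown": "# Page B\n\nSecond."},
    {"url": "https://docs.example.com/a", "markdown": "# Page A\n\nFirst."},
    {"url": "https://docs.example.com/c", "markdown": ""},
]

combined = concat_site(pages)
print(combined)
```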

FAQ

Is this a Firecrawl alternative? For the core scrape and crawl endpoints, yes: website to markdown, one clean document per page, ready for RAG ingestion — as an Apify actor instead of separate infrastructure. It does not replicate Firecrawl's JS rendering or search features in v1.

Can it scrape JavaScript-heavy sites? Not in v1. It fetches static HTML, so server-rendered sites, documentation, and blogs work well; client-side SPAs come back thin.

How do I scrape a single page to markdown? Set crawlMode to single-page and list your URLs in startUrls; each one is converted on its own with no link following.

How do I keep a crawl focused on one section of a site? Use full-URL glob patterns: include https://docs.example.com/en/* and exclude */changelog/*, for example. Exclude always wins.

Can I turn off robots.txt compliance? No. It is hard-coded on, with no input to disable it. Disallowed pages are reported as robots-blocked so you can see what was skipped.
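
To illustrate what a robots check involves, the sketch below uses Python's standard urllib.robotparser on an in-memory robots.txt. It is not the actor's code: robots_decision, the example-bot user agent, and the sample rules are hypothetical. It mirrors two behaviors described above: disallowed URLs are labeled robots-blocked, and Crawl-delay applies only when it is larger than the politeness delay.

```python
from urllib.robotparser import RobotFileParser


def robots_decision(robots_txt, user_agent, url, politeness_delay=1.0):
    """Decide whether url may be fetched and how long to wait between fetches."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return {"allowed": False, "reason": "robots-blocked", "delay": None}
    crawl_delay = parser.crawl_delay(user_agent)
    delay = politeness_delay
    if crawl_delay is not None:
        # Crawl-delay wins only when it is larger than the politeness delay.
        delay = max(politeness_delay, float(crawl_delay))
    return {"allowed": True, "reason": None, "delay": delay}


# Sample rules, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

blocked = robots_decision(ROBOTS_TXT, "example-bot", "https://example.com/private/page")
allowed = robots_decision(ROBOTS_TXT, "example-bot", "https://example.com/docs/")
print(blocked)
print(allowed)
```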
