Site to Markdown — any site to clean, LLM-ready markdown
Pricing
from $1.50 / 1,000 pages
Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.
Developer
Connor Teskey
Maintained by Community
3 days ago
Last modified
Site to Markdown
Turn any website into clean, LLM-ready markdown — one document per page, with robots.txt compliance locked on.
Built for AI agents, RAG builders, and documentation pipelines that need a website-to-markdown step without running crawler infrastructure. Point it at a URL: it crawls breadth-first, strips navigation, ads, and boilerplate, and keeps only the main content as tidy markdown. If you have been looking for a Firecrawl alternative on Apify for scrape-to-markdown jobs, this is that actor.
What you get
One dataset item per page:
| Field | Meaning |
|---|---|
| `url` | The URL that was requested. |
| `finalUrl` | URL after redirects. |
| `status` | HTTP status code (0 when the fetch itself failed). |
| `title` | Page title, when found. |
| `markdown` | Clean, LLM-ready markdown of the page's main content. |
| `text` | Plain-text version (only when `outputFormat` is `markdown+text`). |
| `linksCount` | Number of links discovered on the page. |
| `fetchedAt` | ISO-8601 fetch timestamp. |
| `rendered` | Whether a headless browser rendered the page (always false in v1). |
| `error` | Error message when the page failed, otherwise null. |
Every run also writes a RUN_SUMMARY record to the key-value store with page counts and a failure breakdown.
Quick start
    {
        "startUrls": [{ "url": "https://docs.python.org/3/" }],
        "crawlMode": "site-crawl",
        "maxPages": 10,
        "maxDepth": 1
    }
A run like this returns one markdown document per crawled page and typically finishes in well under a minute; the verification crawl of docs.python.org converted 5 of 5 pages.
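The same input can also be submitted from Python. The sketch below assumes the official `apify-client` package is installed and that `ACTOR_ID` is replaced with this actor's real ID; the value shown is a placeholder, not taken from this page. `build_run_input` and `run_actor` are hypothetical helper names, while the input field names are the ones used in the quick start.

```python
# Placeholder: replace with this actor's real ID from the Apify Console.
ACTOR_ID = "<username>/site-to-markdown"


def build_run_input(start_urls, crawl_mode="site-crawl", max_pages=10, max_depth=1):
    """Build the actor input using the field names from the quick start."""
    if crawl_mode not in ("site-crawl", "single-page"):
        raise ValueError(f"unknown crawlMode: {crawl_mode!r}")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "crawlMode": crawl_mode,
        "maxPages": max_pages,
        "maxDepth": max_depth,
    }


def run_actor(run_input, token):
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call would be `run_actor(build_run_input(["https://docs.python.org/3/"]), token)`, with `token` holding your Apify API token.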
Output example
    {
        "url": "https://docs.python.org/3/tutorial/index.html",
        "finalUrl": "https://docs.python.org/3/tutorial/index.html",
        "status": 200,
        "title": "The Python Tutorial — Python 3.14.6 documentation",
        "markdown": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data st...",
        "linksCount": 35,
        "fetchedAt": "2026-06-11T00:49:18+00:00",
        "rendered": false,
        "error": null
    }
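Items shaped like the one above can be turned into records for a RAG pipeline with a few lines of Python. This is a minimal sketch, not part of the actor: `to_rag_records` and the record keys (`id`, `source`, `text`) are hypothetical, and the defensive filtering is an assumption about what a downstream consumer might want.

```python
import hashlib


def to_rag_records(items):
    """Turn dataset items into records ready for chunking and embedding.

    Items with an error, a non-2xx status, or empty markdown are skipped.
    """
    records = []
    for item in items:
        markdown = (item.get("markdown") or "").strip()
        status = item.get("status") or 0
        if item.get("error") or not 200 <= status < 300 or not markdown:
            continue
        source = item.get("finalUrl") or item["url"]
        records.append({
            # Stable ID derived from the post-redirect URL.
            "id": hashlib.sha256(source.encode("utf-8")).hexdigest()[:16],
            "source": source,
            "title": item.get("title") or "",
            "text": markdown,
            "fetchedAt": item.get("fetchedAt"),
        })
    return records
```

Deriving the ID from `finalUrl` keeps it stable across re-crawls, so a later run can overwrite the same record instead of duplicating it.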
Why this one
- Robots-locked by design. Compliance is hard-coded into the crawler call, not an input default someone can flip. That makes the output safe to build a product on.
- Selector-free extraction. Main content is found by trafilatura with an automatic readability-style fallback — no CSS selectors to maintain when a site redesigns.
- Honest zero-yield. If no pages produce markdown, the run fails with a classified failure breakdown instead of finishing green on an empty dataset.
- Precise scope control. Include/exclude glob patterns match against the full URL, exclude wins, and same-domain crawling is the default.
- Open foundation. Built on trawl (MIT), a clean-room crawler, with trafilatura as the quality extraction engine — the exact wheel is vendored into the image.
Compliance and reliability
Topsail actors are built compliance-first and ship with self-healing plumbing:
- robots.txt is always respected, locked on. Every fetch goes through the crawler with robots compliance hard-coded; there is no input to turn it off. Pages disallowed by robots.txt are reported as `robots-blocked` and never fetched, and a robots `Crawl-delay` is honored when it is larger than your politeness delay.
- This actor reads only the public, static HTML pages you point it at (the same documents any browser receives without logging in), and only where robots.txt permits.
- Transient failures retry with backoff (408, 425, 429, and 5xx responses, honoring `Retry-After`); persistent failures are reported, not hidden.
- Every run writes a HEALTH summary (`RUN_SUMMARY`) to the key-value store with page counts, a failure breakdown (`robots-blocked`, `http-4xx`, `http-5xx`, `timeout`, `extract-fail`), and a per-URL `failedPages` list, so you can see exactly which pages delivered and which were blocked, empty, or erroring. Only successful pages become dataset results.
- No PII, no paywalled or login-gated content, no circumvention.
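The retry rule in the list above can be illustrated with a small sketch. It is not the actor's code: the base delay, the cap, and the exponential schedule are assumptions, and only the delta-seconds form of `Retry-After` is handled (an HTTP-date value falls back to plain backoff).

```python
TRANSIENT_STATUSES = {408, 425, 429}


def is_transient(status):
    """True for the statuses listed as retryable: 408, 425, 429, and 5xx."""
    return status in TRANSIENT_STATUSES or 500 <= status <= 599


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After value takes precedence over exponential backoff.
    `base` and `cap` are assumed values, not the actor's real settings.
    """
    if retry_after is not None:
        try:
            return max(int(retry_after), 0)
        except (ValueError, TypeError):
            pass  # HTTP-date form is not handled in this sketch; fall back.
    return min(base * 2 ** attempt, cap)
```

Under these assumptions, a 404 is never retried, while a 429 that carries `Retry-After: 120` waits the full 120 seconds regardless of the attempt number.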
Pricing
Pay per result: $1.50 per 1,000 pages successfully extracted ($0.0015 per page), plus a fraction-of-a-cent actor start fee. Every dataset result is one extracted page — robots-blocked pages, failed fetches, and pages dropped by your URL filters never become results, so they cost nothing. The 10-page quick start above costs about two cents.
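The arithmetic behind that estimate can be written as a short helper. This is only a sketch: `estimate_cost` is a hypothetical name, and because the exact start fee is not stated here, it is passed in explicitly rather than hard-coded.

```python
from decimal import Decimal

PRICE_PER_1000_PAGES = Decimal("1.50")  # USD, from the pricing above


def estimate_cost(extracted_pages, start_fee):
    """Estimated run cost in USD: per-page charge plus the actor start fee.

    Only successfully extracted pages count; `start_fee` is supplied by the
    caller because its exact value is not documented here.
    """
    per_page = PRICE_PER_1000_PAGES / 1000  # $0.0015 per page
    return per_page * extracted_pages + Decimal(str(start_fee))
```

With an assumed start fee of half a cent, `estimate_cost(10, "0.005")` comes to $0.02, which is consistent with the "about two cents" figure for the quick start.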
Honest limits
- No JavaScript rendering. Static HTML only — SPAs that render entirely client-side will come back thin. Headless rendering is on the roadmap for v2.
- No sitemap.xml seeding yet; discovery is link-following from your start URLs.
- One markdown document per page; no site-level concatenated export (easy to build downstream from the dataset).
- robots.txt compliance cannot be disabled. If your use case requires ignoring robots.txt, this actor is not for you — by design.
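The site-level export mentioned in the list above can be built downstream in a few lines. The sketch below is one possible approach, not a feature of the actor: `concatenate_pages`, the `---` separator, and the HTML-comment source marker are all illustrative choices.

```python
def concatenate_pages(items, separator="\n\n---\n\n"):
    """Join per-page markdown into a single document, one section per page."""
    sections = []
    # Sort by URL so the output is deterministic across runs.
    for item in sorted(items, key=lambda i: i.get("finalUrl") or i.get("url") or ""):
        markdown = (item.get("markdown") or "").strip()
        if not markdown:
            continue  # nothing to export for this page
        source = item.get("finalUrl") or item.get("url") or ""
        # An HTML comment records the source URL without affecting rendering.
        sections.append(f"<!-- source: {source} -->\n{markdown}")
    return separator.join(sections) + ("\n" if sections else "")
```

Sorting by URL is a simple stand-in for crawl order, which the dataset fields shown in this document do not expose.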
FAQ
Is this a Firecrawl alternative?
For the core scrape and crawl endpoints, yes: website to markdown, one clean document per page, ready for RAG ingestion, as an Apify actor instead of separate infrastructure. It does not replicate Firecrawl's JS rendering or search features in v1.
Can it scrape JavaScript-heavy sites?
Not in v1. It fetches static HTML, so server-rendered sites, documentation, and blogs work well; client-side SPAs come back thin.
How do I scrape a single page to markdown?
Set crawlMode to single-page and list your URLs in startUrls; each one is converted on its own with no link following.
How do I keep a crawl focused on one section of a site?
Use full-URL glob patterns: include https://docs.example.com/en/* and exclude */changelog/*, for example. Exclude always wins.
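To preview how such patterns behave before spending a run, the "exclude wins" rule can be approximated locally. This sketch uses Python's `fnmatch`, whose glob dialect may differ from the actor's in edge cases; `url_in_scope` is a hypothetical helper, and treating an empty include list as "include everything" is an assumption.

```python
from fnmatch import fnmatchcase


def url_in_scope(url, include=(), exclude=()):
    """Return True if `url` passes the include/exclude globs; exclude wins."""
    # Patterns are matched against the full URL, not just the path.
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    # Assumption: an empty include list means everything is included.
    return not include or any(fnmatchcase(url, pattern) for pattern in include)
```

With the patterns from the answer above, `https://docs.example.com/en/guide/intro` is in scope, while `https://docs.example.com/en/changelog/v2` is rejected because the exclude pattern matches even though the include pattern does too.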
Can I turn off robots.txt compliance?
No. It is hard-coded on, with no input to disable it. Disallowed pages are reported as robots-blocked so you can see what was skipped.
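For intuition, the behavior described here can be reproduced with Python's standard `urllib.robotparser`. This is an illustration under stated assumptions, not the actor's implementation: `plan_fetch` and the `example-crawler` user agent are hypothetical, and the returned dict shape is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; the actor's real one is not documented here.
USER_AGENT = "example-crawler"


def plan_fetch(robots_txt, url, politeness_delay=1.0):
    """Decide whether `url` may be fetched and how long to wait between requests."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(USER_AGENT, url):
        # Disallowed pages are reported, never requested.
        return {"url": url, "fetch": False, "reason": "robots-blocked"}
    crawl_delay = parser.crawl_delay(USER_AGENT) or 0
    # Crawl-delay only takes effect when it exceeds the politeness delay.
    return {"url": url, "fetch": True, "delay": max(politeness_delay, crawl_delay)}
```

Given a robots.txt with `Disallow: /private/` and `Crawl-delay: 5`, a URL under `/private/` is marked `robots-blocked`, and any permitted URL is fetched with a delay of 5 seconds unless the politeness delay is already longer.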
More compliant data feeds from Topsail
- GTA 6 Countdown & Developments Tracker — countdown, confirmed facts, diffed developments, market odds
- Commodity Intel — oil, gold, uranium headlines from permitted sources
- Crypto News — BTC/ETH/DeFi headlines from major outlets
- AI Research Radar — new papers and lab announcements