Pricing

Pay per usage

Site to Agent Feed (URL to RAG-ready Markdown)

Turn any URL into clean, RAG-ready Markdown + structured JSON for LLMs and AI agents. Self-healing main-content extraction (survives redesigns), headings/links/tables, optional change-detection. No paid APIs.

Last modified: 16 days ago

Site to Agent Feed (URL → RAG-ready Markdown)

Give it any URL(s); get back clean Markdown + structured JSON built for LLMs and AI agents — main-content extraction (via trafilatura, which adapts to page layout instead of relying on brittle CSS selectors), plus title, headings, links, and a table count. Optional change-detection turns it into a site monitor.

Why

Agents and RAG pipelines want Markdown as a first-class return type (not raw HTML), and extraction that doesn't break on every redesign.

How it works

1. Fetches each URL's HTML over HTTP (httpx), optionally through Apify Proxy with an automatic direct-request fallback.
2. Extracts the main content with trafilatura → Markdown + plain text. Falls back to a BeautifulSoup strip + markdownify if trafilatura returns nothing.
3. Pulls structure (title, h1–h3 headings, links, table count) with BeautifulSoup.
4. If `detectChanges` is on, stores a SHA-256 content hash per URL in the actor's key-value store and sets `changed: true` when it differs from the previous run.
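
The change-detection step can be sketched roughly as follows. This is a minimal illustration, not the actor's code: the function names are hypothetical, a plain dict stands in for the key-value store, and the first-run behavior (a URL with no stored hash is reported as unchanged) is an assumption.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_change(store: dict, url: str, text: str) -> bool:
    """Compare the new hash with the one stored by a previous run.

    `store` is a plain dict standing in for the actor's key-value store.
    Assumption: a URL seen for the first time counts as unchanged.
    """
    new_hash = content_hash(text)
    old_hash = store.get(url)
    store[url] = new_hash  # remember the latest hash for the next run
    return old_hash is not None and old_hash != new_hash


store = {}
first = detect_change(store, "https://example.com", "hello")    # nothing stored yet
second = detect_change(store, "https://example.com", "hello")   # same content
third = detect_change(store, "https://example.com", "hello!")   # one character differs
print(first, second, third)
```

Because the hash covers the whole extracted text, a single changed character is enough to flip the flag.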

Input

Field	Type	Default	Notes
`urls`	array of strings (required)	`["https://en.wikipedia.org/wiki/Retrieval-augmented_generation"]`	Pages to convert to Markdown + structured data.
`outputFormat`	string	`"markdown"`	`"markdown"` = structured item with `markdown` (no raw `text`); `"both"` = also include the raw extracted `text`.
`detectChanges`	boolean	`false`	Remember each URL's content hash and flag `changed` when it differs across runs.
`maxChars`	integer	`50000`	Truncate each page's `markdown` and `text` to at most this many characters (range 1000–500000).
`proxyConfiguration`	object	off (direct request)	Optionally route through Apify Proxy to get past simple datacenter-IP blocks. The actor falls back to a direct request if a proxy attempt fails.
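
As a rough illustration of the table above, an input object could be assembled and checked like this. The helper `build_input` is hypothetical and exists only for this sketch; the field names, defaults, and the `maxChars` range come from the table, and `proxyConfiguration` is left out because it is off by default.

```python
import json


def build_input(urls, output_format="markdown", detect_changes=False, max_chars=50000):
    """Build an input dict using the documented field names and defaults."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if output_format not in ("markdown", "both"):
        raise ValueError("outputFormat must be 'markdown' or 'both'")
    if not 1000 <= max_chars <= 500000:
        raise ValueError("maxChars must be between 1000 and 500000")
    return {
        "urls": list(urls),
        "outputFormat": output_format,
        "detectChanges": detect_changes,
        "maxChars": max_chars,
    }


run_input = build_input(
    ["https://en.wikipedia.org/wiki/Retrieval-augmented_generation"],
    detect_changes=True,
)
print(json.dumps(run_input, indent=2))
```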

Per-URL output

Each successfully fetched page produces a Dataset item with: `url`, `fetched_at` (UTC ISO timestamp), `title`, `markdown`, `headings[]` (h1–h3, capped at 50), `links[]` (`{text, href}`, capped at 200), `table_count`, `word_count`, `content_hash` (SHA-256 of the extracted text), and (if `detectChanges` is on) `changed`. The raw `text` field is included only when `outputFormat` is `"both"`. Both `text` and `markdown` are truncated to `maxChars` per page.

If a URL fails to fetch or its content can't be extracted, its item is just { "url": ..., "error": ... } — the run still completes successfully.

outputFormat: "markdown" (default) returns the structured item with markdown (no raw text field); "both" additionally includes the raw extracted text. markdown, headings, links, and all other structured fields are present in both modes.
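
Since failed URLs come back as items carrying only `url` and `error`, a consumer will probably want to separate them from successful pages before ingesting anything. A minimal sketch, assuming the items have already been loaded as a list of dicts; `split_items` and the sample data (including the error message) are illustrative, not actual actor output.

```python
def split_items(items):
    """Separate successful page items from error items."""
    pages, failures = [], []
    for item in items:
        # Error items are documented as {"url": ..., "error": ...}.
        if "error" in item:
            failures.append(item)
        else:
            pages.append(item)
    return pages, failures


sample_items = [
    {"url": "https://example.com/a", "title": "A", "markdown": "# A", "word_count": 1},
    {"url": "https://example.com/b", "error": "fetch failed"},
]

pages, failures = split_items(sample_items)
print(len(pages), "page(s),", len(failures), "failure(s)")
```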

Use as a monitor

Schedule it with detectChanges: true — each run flags which pages changed, so an agent only re-ingests what's new.
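
On the consuming side, an agent could pick out only the pages flagged as changed. A short sketch under the same assumption that dataset items are available as dicts; `urls_to_reingest` is a hypothetical helper and the sample items are made up.

```python
def urls_to_reingest(items):
    """Return URLs of successfully fetched pages whose `changed` flag is true."""
    return [
        item["url"]
        for item in items
        if "error" not in item and item.get("changed") is True
    ]


run_items = [
    {"url": "https://example.com/docs", "markdown": "# Docs", "changed": True},
    {"url": "https://example.com/about", "markdown": "# About", "changed": False},
    {"url": "https://example.com/gone", "error": "fetch failed"},
]

print(urls_to_reingest(run_items))
```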

Pricing

Billed on standard Apify platform usage (compute units for the run, plus Apify Proxy traffic only if you enable it). It calls no paid third-party APIs of its own.

Limitations — read this

Server-rendered HTML only. No JavaScript execution. It uses a plain HTTP fetch, not a browser. Single-page apps and content injected by JS will be missing or sparse. Use a browser-based scraper for those.
Heavily bot-protected sites return 403. Sites behind Akamai/Cloudflare-class bot protection (e.g. SEC.gov, FINRA.org) block non-browser TLS fingerprints, so requests fail even through a residential proxy. This lightweight fetcher is meant for ordinary server-rendered pages; use a real-browser scraper for protected sites. The optional Apify Proxy (off by default) helps only with simple datacenter-IP blocks, not with bot protection.
Extraction quality depends on trafilatura. On unusual layouts it may grab too much or too little; the fallback is a coarse text strip.
Change-detection is whole-page hashing. Any change (including dynamic timestamps, view counters, or rotating banners) flips changed to true — it does not diff what changed. Hashes persist in the actor's default key-value store, so change flags are meaningful only across runs of the same actor/store.
No anti-bot handling, JS challenges, logins, or pagination. Pages behind Cloudflare/auth or requiring clicks won't work. The actor fetches only the exact URLs you pass — it does not crawl or follow links.
links and headings are capped (200 / 50) and may be truncated on large pages; markdown/text are capped at maxChars.
Sends only a basic browser User-Agent; you are responsible for honoring each site's terms and robots policy.
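
The caps described above amount to simple slicing. The sketch below shows the idea with the documented limits (50 headings, 200 links, `maxChars` characters); `apply_caps` is a hypothetical name and the actor's real truncation logic may differ in detail.

```python
def apply_caps(item, max_chars=50000, max_headings=50, max_links=200):
    """Return a copy of the item with the documented size limits applied."""
    capped = dict(item)
    for key in ("markdown", "text"):
        if key in capped:
            capped[key] = capped[key][:max_chars]  # character-level truncation
    capped["headings"] = list(capped.get("headings", []))[:max_headings]
    capped["links"] = list(capped.get("links", []))[:max_links]
    return capped


big_item = {
    "url": "https://example.com/long",
    "markdown": "x" * 60000,
    "headings": [f"Heading {i}" for i in range(60)],
    "links": [{"text": str(i), "href": f"https://example.com/{i}"} for i in range(250)],
}

capped_item = apply_caps(big_item)
print(len(capped_item["markdown"]), len(capped_item["headings"]), len(capped_item["links"]))
```

Anything past the caps is dropped, so downstream code should not assume that `links` or `headings` are complete on large pages.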

AI Web Extractor: URL → Clean Markdown + JSON for LLM/RAG

boxbox10/ai-web-extractor

Turn any URL into clean, LLM-ready Markdown + structured JSON (title, headings, main content, links, metadata, token count). Perfect for RAG pipelines, AI agents, and LLM context.

Marvin Eguilos

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Dev with Bobby

PDF to Markdown & JSON (RAG-Ready)

basisweb/pdf-to-markdown-rag

Convert PDFs to clean Markdown and structured JSON (text + tables) for RAG, LLMs, and vector DBs. Batch URLs, pay per page.

BasisWeb

URL to Markdown for LLMs (polite, robots-respecting)

weltverbenzer/url-to-markdown-for-llms

Turn any URL into clean, LLM-ready Markdown for AI agents and RAG pipelines. Enforces robots.txt, extracts main content (Readability) and converts to Markdown. Returns title, byline and markdown.

Johannes Witt

PDF URL to Markdown, Tables & RAG Extractor

thescrapelab/Apify-PDF-url-scraper

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Inus Grobler

Website to Markdown for LLMs

agentictools/website-to-markdown-llm

Crawl a site and export clean Markdown with token counts and chunks, ready for RAG.

Ken Agland

LLM-Ready Web Extractor: URL to Clean Markdown & JSON

f0rty7even/llm-web-extractor

Turn any web page or site into clean, LLM-ready Markdown and structured JSON for RAG, agents, and fine-tuning. Strips nav/ads/boilerplate; returns main content + metadata.

Michael Yousrie

Website to Markdown for LLMs and RAG

rodrgds/website-to-markdown

Convert webpages into clean markdown for LLMs, RAG pipelines, AI datasets, archives, and content extraction. Simple pay-per-page pricing.

Rodrigo Dias

PDF to Markdown & JSON Extractor for LLMs

f0rty7even/pdf-extractor

Turn any PDF URL into clean, LLM-ready Markdown and structured JSON. Extracts text + tables + document metadata for RAG, agents, and fine-tuning. No AGPL components.

Michael Yousrie