Site to Agent Feed (URL to RAG-ready Markdown)

Pricing

Pay per usage

Turn any URL into clean, RAG-ready Markdown + structured JSON for LLMs and AI agents. Self-healing main-content extraction (survives redesigns), headings/links/tables, optional change-detection. No paid APIs.

Developer

CQ

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 16 hours ago

Site to Agent Feed (URL → RAG-ready Markdown)

Give it one or more URLs and get back clean Markdown plus structured JSON built for LLMs and AI agents: main-content extraction (via trafilatura, which adapts to page layout instead of relying on brittle CSS selectors), plus title, headings, links, and a table count. Optional change-detection turns it into a site monitor.

Why

Agents and RAG pipelines want Markdown as a first-class return type (not raw HTML), and extraction that doesn't break on every redesign. Pairs well with MCP-based agent stacks.

How it works

  1. Fetches each URL's HTML over HTTP (httpx).
  2. Extracts the main content with trafilatura → Markdown + plain text. Falls back to a BeautifulSoup strip + markdownify if trafilatura returns nothing.
  3. Pulls structure (title, h1–h3 headings, links, table count) with BeautifulSoup.
  4. If detectChanges is on, stores a content hash per URL and sets changed: true when it differs from the previous run.
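
Step 3 above (pulling title, headings, links, and a table count) can be sketched roughly as follows. The actor itself uses BeautifulSoup. This sketch swaps in Python's standard-library html.parser so it runs without extra dependencies. The names StructureParser and extract_structure are hypothetical, and the shape of headings (plain strings) is an assumption, not something the README specifies.

```python
from html.parser import HTMLParser

MAX_HEADINGS = 50   # caps stated in the README
MAX_LINKS = 200


class StructureParser(HTMLParser):
    """Collects title, h1-h3 headings, links and a table count (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.headings = []
        self.links = []
        self.table_count = 0
        self._capture = None   # tag whose text is currently being collected
        self._buffer = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_count += 1
        elif tag in ("title", "h1", "h2", "h3", "a") and self._capture is None:
            # Simplification: a link nested inside a heading is folded into
            # the heading text and not recorded as a separate link.
            self._capture = tag
            self._buffer = []
            self._href = dict(attrs).get("href") if tag == "a" else None

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        elif tag == "a":
            if self._href and len(self.links) < MAX_LINKS:
                self.links.append({"text": text, "href": self._href})
        elif text and len(self.headings) < MAX_HEADINGS:
            self.headings.append(text)
        self._capture = None


def extract_structure(html):
    """Return the structural fields for one HTML document."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "headings": parser.headings,
        "links": parser.links,
        "table_count": parser.table_count,
    }
```

Calling extract_structure(html) on a fetched page returns a dict that can be merged into the per-URL item. Steps 1 and 2 (the httpx fetch and the trafilatura extraction) are left out because they need network access and third-party packages.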

Per-URL output

Each successfully fetched page produces one Dataset item with these fields: url, fetched_at (UTC ISO timestamp), title, markdown, headings[] (h1–h3, capped at 50), links[] ({text, href}, capped at 200), table_count, word_count, content_hash (SHA-256 of the extracted text), and, if detectChanges is on, changed. The raw text field is included only when outputFormat is "both". Both text and markdown are truncated to maxChars per page.

If a URL fails to fetch, its item is just { "url": ..., "error": ... }.

outputFormat accepts two values. "markdown" (the default) returns the structured item with markdown and no raw text field; "both" adds the raw extracted text. All other structured fields, including headings and links, are present in both modes.
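
To make the item shape concrete, here is a minimal sketch of how one output item might be assembled from already-extracted content. The function names build_item and error_item, the default max_chars value, and the choice to hash and count words before truncation are all assumptions for illustration. The README only states the field names, the caps, and that content_hash is a SHA-256 of the extracted text.

```python
import hashlib
from datetime import datetime, timezone


def build_item(url, title, markdown, text, structure,
               max_chars=50000, output_format="markdown"):
    """Assemble one Dataset-style item (hypothetical helper, not the actor's code)."""
    item = {
        "url": url,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "title": title,
        "markdown": markdown[:max_chars],            # truncated to maxChars
        "headings": structure["headings"][:50],      # cap from the README
        "links": structure["links"][:200],           # cap from the README
        "table_count": structure["table_count"],
        # Assumption: word count and hash use the full, untruncated text.
        "word_count": len(text.split()),
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    if output_format == "both":
        item["text"] = text[:max_chars]              # raw text only in "both" mode
    return item


def error_item(url, exc):
    """Shape of the item emitted when a fetch fails."""
    return {"url": url, "error": str(exc)}
```

A consumer can tell the two shapes apart by checking for the error key before reading markdown.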

Use as a monitor

Schedule it with detectChanges: true — each run flags which pages changed, so an agent only re-ingests what's new.
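
The monitoring flow can be pictured with a small sketch like the one below. It compares each item's content_hash with the hash stored from the previous run and then picks out the pages worth re-ingesting. The helper names flag_changes and pages_to_reingest are hypothetical, and a plain dict stands in for whatever storage the actor actually uses. Treating a never-seen URL as changed is an assumption and the actor may behave differently on a first run.

```python
def flag_changes(items, previous_hashes):
    """Set item["changed"] per URL and return the updated hash store.

    previous_hashes maps url -> content_hash from the last run.
    """
    updated = dict(previous_hashes)
    for item in items:
        if "error" in item:
            continue  # failed fetches carry no hash to compare
        url, digest = item["url"], item["content_hash"]
        # Assumption: a URL with no stored hash counts as changed.
        item["changed"] = previous_hashes.get(url) != digest
        updated[url] = digest
    return updated


def pages_to_reingest(items):
    """URLs an agent would re-ingest after a run."""
    return [item["url"] for item in items if item.get("changed")]
```

On the consumer side, only pages_to_reingest is needed: it filters the run's Dataset items down to those flagged changed: true.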

Limitations — read this

  • Server-rendered HTML only. No JavaScript execution. It uses a plain HTTP fetch, not a browser. Single-page apps and content injected by JS will be missing or sparse. Use a browser-based scraper for those.
  • Heavily bot-protected sites return 403. Sites behind Akamai/Cloudflare-class bot protection (e.g. SEC.gov, FINRA.org) block non-browser TLS fingerprints and will fail even through a residential proxy; use a real-browser scraper for them. This lightweight fetcher is meant for ordinary server-rendered pages. The optional Apify Proxy (off by default) helps only with simple datacenter-IP blocks, not with bot protection.
  • Extraction quality depends on trafilatura. On unusual layouts it may grab too much or too little; the fallback is a coarse text strip.
  • Change-detection is whole-page hashing. Any change (including dynamic timestamps, view counters, or rotating banners) flips changed to true — it does not diff what changed.
  • No handling for anti-bot measures, JS challenges, logins, or pagination. Pages behind Cloudflare or authentication, or pages that require clicks, won't work.
  • links and headings are capped (200 / 50) and may be truncated on large pages.
  • Sends a basic User-Agent and nothing more. It does not check robots.txt or site terms, so you are responsible for honoring each site's terms and robots policy.
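
Because honoring robots policy is left to the caller, one option is to filter URLs before passing them to the actor. The sketch below uses Python's standard-library urllib.robotparser on robots.txt content you have already downloaded. The function name allowed_by_robots and the user-agent string are hypothetical, and this check is not part of the actor.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="my-agent-bot"):
    """Return True if robots_txt permits user_agent to fetch url.

    robots_txt is the text of the site's robots.txt, fetched separately.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

URLs for which this returns False can simply be dropped from the input list before a run.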