Site to LLM Knowledge Base

Pricing

from $0.75 / 1,000 results

Turn any website or documentation set into clean, LLM-ready Markdown for RAG and AI agents: one record per page, each with a token estimate. Sitemap- and robots.txt-aware, with predictable per-page pricing (no token credits), for simple knowledge-base ingestion.

Developer

Mohamed Adam BOUNHAR

Maintained by Community

Crawl an entire site or documentation set into clean, LLM-ready Markdown — one record per page, each with a token estimate. Built for RAG ingestion and feeding knowledge bases to AI agents.

What sets it apart: predictable per-page pricing rather than opaque token credits, a built-in est_tokens estimate on every page (so you can budget your context window before ingesting), and robots.txt and sitemap awareness. These target the gaps developers run into with tools such as Firecrawl and Jina.

Why it's ToS-clean: the caller chooses the site, robots.txt is checked for every page by default, only public content is read, and maxPages imposes a hard page cap. No logins, no paywalls.
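As a rough illustration of the robots.txt check and the page cap, here is a minimal sketch using Python's standard urllib.robotparser. It is not the Actor's actual crawl code (that lives in shared/crawl.py and is not shown here); the function name allowed_urls, the example.com URLs, and the inline robots.txt body are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawl would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

def allowed_urls(candidates, robots_txt, user_agent="*", max_pages=25):
    """Return (permitted URLs capped at max_pages, sitemaps listed in robots.txt)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    kept = []
    for url in candidates:
        if len(kept) >= max_pages:
            break  # hard page cap reached
        if parser.can_fetch(user_agent, url):
            kept.append(url)
    # site_maps() requires Python 3.8+ and returns None if no Sitemap line exists.
    return kept, parser.site_maps()

urls = [
    "https://example.com/docs/intro",
    "https://example.com/private/admin",
    "https://example.com/docs/setup",
]
kept, sitemaps = allowed_urls(urls, ROBOTS_TXT, max_pages=2)
print(kept)
print(sitemaps)
```

The disallowed /private/ URL is skipped without counting against the cap, which is one plausible reading of "hard cap on pages"; the Actor may count differently.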

Input

| Field | Type | Notes |
| --- | --- | --- |
| startUrl | string | Site/docs URL to crawl. Uses sitemap.xml if present, else same-domain links. |
| maxPages | integer | Hard cap on pages (default 25, max 500). |
| respectRobots | boolean | Default true. Disable only for sites you own. |
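To make the input fields concrete, the sketch below builds an input object with the three documented fields and checks the documented maxPages limit before a run. The helper name build_run_input is hypothetical, and the lower bound of 1 and the http(s) check are assumptions; only the default of 25 and the maximum of 500 come from the table.

```python
import json

def build_run_input(start_url, max_pages=25, respect_robots=True):
    """Build an input dict using the documented field names."""
    # Assumption: startUrl must be an absolute http(s) URL.
    if not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be an absolute http(s) URL")
    # Documented: default 25, max 500. The minimum of 1 is an assumption.
    if not 1 <= max_pages <= 500:
        raise ValueError("maxPages must be between 1 and 500")
    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "respectRobots": respect_robots,
    }

run_input = build_run_input("https://example.com/docs", max_pages=100)
print(json.dumps(run_input, indent=2))
```

The resulting JSON can be pasted into the Actor's input editor or saved as the input file for a local run.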

Output (one dataset item per page)

{ "url": "https://site/docs/intro", "title": "Intro",
"markdown": "# Intro...", "word_count": 540, "est_tokens": 720 }
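One way to use est_tokens for context-window budgeting is a simple greedy pass over the dataset items, sketched below. The function select_within_budget and the second and third sample records are hypothetical; only the field names and the first record's values come from the output sample above.

```python
def select_within_budget(records, token_budget):
    """Keep pages in order, skipping any page that would exceed the budget."""
    chosen, used = [], 0
    for rec in records:
        cost = rec["est_tokens"]
        if used + cost <= token_budget:
            chosen.append(rec)
            used += cost
    return chosen, used

records = [
    {"url": "https://site/docs/intro", "title": "Intro",
     "markdown": "# Intro...", "word_count": 540, "est_tokens": 720},
    # Hypothetical extra pages for illustration.
    {"url": "https://site/docs/api", "title": "API",
     "markdown": "# API...", "word_count": 1125, "est_tokens": 1500},
    {"url": "https://site/docs/faq", "title": "FAQ",
     "markdown": "# FAQ...", "word_count": 300, "est_tokens": 400},
]

chosen, used = select_within_budget(records, token_budget=2000)
print([r["title"] for r in chosen], used)
```

Because est_tokens is an estimate, leaving some headroom below the model's real context limit is prudent.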

Run locally

python scripts/new_actor.py --sync # from repo root
cd actors/site-to-knowledge-base
apify run

Monetization

Pay-per-event: one page event is charged for each crawled page. See docs/pricing.md for details. Crawl logic is shared (shared/crawl.py), so edit it there and then re-run python scripts/new_actor.py --sync from the repo root.
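Since each crawled page is one billable event, the worst-case cost of a run follows directly from maxPages. The sketch below assumes the store's listed "from $0.75 / 1,000 results" rate applies to every page event; the actual rates are defined in docs/pricing.md and may differ. The names PRICE_PER_1000_PAGES and estimate_cost are hypothetical.

```python
from decimal import Decimal

# Assumption: the listed "from" price applies to each page event.
PRICE_PER_1000_PAGES = Decimal("0.75")

def estimate_cost(pages):
    """Upper-bound cost in USD for a run that crawls `pages` pages."""
    return PRICE_PER_1000_PAGES * pages / 1000

for pages in (25, 500):  # documented default and maximum for maxPages
    print(pages, "pages ->", estimate_cost(pages), "USD")
```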

Known limits

Fetches server-rendered HTML (no headless browser), so JavaScript-only pages return little content. A renderJs premium mode is the natural future upgrade.
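Until a rendering mode exists, one way to spot pages that probably depend on JavaScript is to flag dataset items with very little extracted text. This is a heuristic sketch, not a feature of the Actor: the function flag_thin_pages, the 30-word threshold, and the sample records are all assumptions.

```python
def flag_thin_pages(records, min_words=30):
    """Return URLs whose extracted text is suspiciously short."""
    return [r["url"] for r in records if r["word_count"] < min_words]

# Hypothetical dataset items using the documented output fields.
records = [
    {"url": "https://site/docs/intro", "word_count": 540, "est_tokens": 720},
    {"url": "https://site/app/dashboard", "word_count": 4, "est_tokens": 6},
]

print(flag_thin_pages(records))
```

Flagged URLs are candidates for manual review or for a separate browser-based fetch; short pages that are genuinely short will also be flagged.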