Site to LLM Knowledge Base
Pricing
from $0.75 / 1,000 results
Turn any website or docs into clean, LLM-ready Markdown for RAG and AI agents — one record per page, each with a token count. Sitemap- and robots.txt-aware, with predictable per-page pricing (no token credits). Simple knowledge-base ingestion.
Developer
Mohamed Adam BOUNHAR
Maintained by Community
Crawl an entire site or documentation set into clean, LLM-ready Markdown — one record per page, each with a token estimate. Built for RAG ingestion and feeding knowledge bases to AI agents.
The wedge: predictable per-page pricing (not opaque token credits), a built-in
est_tokens count per page (so you can budget your context window before ingesting),
and full robots.txt + sitemap awareness. These are the exact gaps developers hit
with Firecrawl/Jina.
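As a rough illustration of budgeting with est_tokens, the sketch below greedily keeps pages while their running total fits a context budget. The function name select_within_budget, the sample pages, and the 1500-token budget are hypothetical; only the est_tokens field comes from the Actor's output.

```python
from typing import Iterable


def select_within_budget(pages: Iterable[dict], budget: int) -> list[dict]:
    """Keep pages, in order, while the running est_tokens total fits the budget."""
    selected = []
    used = 0
    for page in pages:
        cost = page["est_tokens"]
        if used + cost > budget:
            # Skip a page that would overflow; a later, smaller page may still fit.
            continue
        selected.append(page)
        used += cost
    return selected


# Hypothetical dataset items, shaped like the Actor's per-page records.
pages = [
    {"url": "https://site/docs/intro", "est_tokens": 720},
    {"url": "https://site/docs/api", "est_tokens": 5000},
    {"url": "https://site/docs/faq", "est_tokens": 300},
]

chosen = select_within_budget(pages, budget=1500)
print([p["url"] for p in chosen])
```

Because est_tokens is an estimate, leaving some headroom below the model's real context limit is probably wise.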
Why it's ToS-clean: the caller chooses the site, robots.txt is respected on every
page, only public content is read, and there's a hard page cap. No logins, no paywalls.
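The per-page robots.txt check can be approximated with Python's standard urllib.robotparser. This is a minimal sketch, not the Actor's actual code (which lives in shared/crawl.py and is not shown here); build_robots_checker and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() takes the robots.txt body as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Hypothetical robots.txt for illustration.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed = build_robots_checker(ROBOTS)
print(allowed("https://site/docs/intro"))
print(allowed("https://site/private/keys"))
```

A crawler would call allowed(url) before each fetch and skip any URL for which it returns False.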
Input
| Field | Type | Notes |
|---|---|---|
| startUrl | string | Site/docs URL to crawl. Uses sitemap.xml if present, else same-domain links. |
| maxPages | integer | Hard cap on pages (default 25, max 500). |
| respectRobots | boolean | Default true. Disable only for sites you own. |
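To make the defaults and limits in the table concrete, here is a sketch of how a caller might normalize input before starting a run. The helper normalize_input is hypothetical, and clamping out-of-range maxPages values (rather than rejecting them) is an assumption; the Actor's own validation may behave differently.

```python
DEFAULT_MAX_PAGES = 25   # default from the input table
MAX_PAGES_LIMIT = 500    # hard maximum from the input table


def normalize_input(raw: dict) -> dict:
    """Apply the documented defaults and clamp maxPages into the 1..500 range."""
    start_url = raw.get("startUrl")
    if not isinstance(start_url, str) or not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be an http(s) URL")

    max_pages = int(raw.get("maxPages", DEFAULT_MAX_PAGES))
    max_pages = max(1, min(max_pages, MAX_PAGES_LIMIT))  # assumption: clamp, do not reject

    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "respectRobots": bool(raw.get("respectRobots", True)),
    }


print(normalize_input({"startUrl": "https://site/docs"}))
print(normalize_input({"startUrl": "https://site/docs", "maxPages": 9999}))
```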
Output (one dataset item per page)
{
  "url": "https://site/docs/intro",
  "title": "Intro",
  "markdown": "# Intro...",
  "word_count": 540,
  "est_tokens": 720
}
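The sample record is consistent with a simple words-to-tokens heuristic of about 4 tokens per 3 words (540 words gives 720 tokens). The sketch below builds a record that way; the ratio is inferred from that single example and is an assumption, not the Actor's documented estimator, and make_record is a hypothetical helper.

```python
def estimate_tokens(word_count: int) -> int:
    """Assumed heuristic: roughly 4 tokens per 3 words, rounded up."""
    return (word_count * 4 + 2) // 3  # integer ceiling of word_count * 4 / 3


def make_record(url: str, title: str, markdown: str) -> dict:
    """Build one dataset item with the fields shown in the sample output."""
    word_count = len(markdown.split())  # whitespace-separated words
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        "word_count": word_count,
        "est_tokens": estimate_tokens(word_count),
    }


record = make_record("https://site/docs/intro", "Intro", "# Intro\n\nHello world again")
print(record["word_count"], record["est_tokens"])
```

For exact counts, a model-specific tokenizer would be needed; a word-based estimate like this only supports rough budgeting.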
Run locally
python scripts/new_actor.py --sync  # from repo root
cd actors/site-to-knowledge-base
apify run
Monetization
Pay-per-event, charging one page event per crawled page. See docs/pricing.md.
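Since each crawled page is one billable event, a run's cost can be estimated up front from the page count. The sketch below assumes the listed starting rate of $0.75 per 1,000 results applies to every page; the actual rate may depend on your plan, and estimate_cost is a hypothetical helper.

```python
from decimal import Decimal

# Assumption: the listing's "from $0.75 / 1,000 results" rate applies to all pages.
PRICE_PER_1000_PAGES = Decimal("0.75")


def estimate_cost(pages: int) -> Decimal:
    """Estimated charge in USD for crawling the given number of pages."""
    return PRICE_PER_1000_PAGES * pages / 1000


print(estimate_cost(25))   # default maxPages
print(estimate_cost(500))  # maximum maxPages
```

Because maxPages is a hard cap, estimate_cost(maxPages) also serves as an upper bound on the page charges for a run under this assumption.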
Crawl logic is shared (shared/crawl.py): edit it there, then re-run python scripts/new_actor.py --sync.
Known limits
Fetches server-rendered HTML (no headless browser), so JavaScript-only pages return
little content. A renderJs premium mode is the natural future upgrade.