Pricing

from $0.75 / 1,000 results

Site to LLM Knowledge Base

Turn any website or docs into clean, LLM-ready Markdown for RAG and AI agents — one record per page, each with a token count. Sitemap- and robots.txt-aware, with predictable per-page pricing (no token credits). Simple knowledge-base ingestion.

Developer

Mohamed Adam BOUNHAR

Actor stats

Last modified: 2 months ago

Input

Field	Type	Notes
`startUrl`	string	Site/docs URL to crawl. Uses `sitemap.xml` if present, else same-domain links.
`maxPages`	integer	Hard cap on pages (default 25, max 500).
`respectRobots`	boolean	Default `true`. Disable only for sites you own.
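
As a minimal sketch of how the fields above fit together, the helper below builds an input object and applies the documented default and cap for `maxPages`. The function name `build_input` and the decision to raise on out-of-range values are assumptions for illustration; the Actor itself may clamp or reject such values differently.

```python
import json
from urllib.parse import urlparse

MAX_PAGES_DEFAULT = 25   # documented default for maxPages
MAX_PAGES_LIMIT = 500    # documented hard maximum for maxPages


def build_input(start_url: str, max_pages: int = MAX_PAGES_DEFAULT,
                respect_robots: bool = True) -> dict:
    """Return an input dict using the field names from the table above.

    Hypothetical helper: the validation rules here are assumptions,
    not the Actor's own behavior.
    """
    parsed = urlparse(start_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"startUrl must be an absolute http(s) URL: {start_url!r}")
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"maxPages must be between 1 and {MAX_PAGES_LIMIT}")
    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "respectRobots": respect_robots,
    }


if __name__ == "__main__":
    # example.com is a placeholder URL, not a site named in this README.
    print(json.dumps(build_input("https://example.com/docs", max_pages=100), indent=2))
```

The resulting JSON can be pasted into the Actor's input editor or saved as the input file for a local run.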

Output (one dataset item per page)

{ "url": "https://site/docs/intro", "title": "Intro",
  "markdown": "# Intro...", "word_count": 540, "est_tokens": 720 }
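
Because every record carries `est_tokens`, a consumer can budget context before embedding or prompting. The sketch below assumes the dataset has already been loaded as a list of dicts shaped like the item above; `total_tokens` and `take_within_budget` are hypothetical helper names, and the greedy selection is just one simple policy.

```python
from typing import Iterable, List


def total_tokens(records: Iterable[dict]) -> int:
    """Sum the est_tokens field across records (missing values count as 0)."""
    return sum(int(r.get("est_tokens", 0)) for r in records)


def take_within_budget(records: Iterable[dict], budget: int) -> List[dict]:
    """Greedily keep records, in order, while the running total fits the budget.

    Pages that would overflow the budget are skipped, so a smaller page
    later in the list can still be included.
    """
    selected: List[dict] = []
    used = 0
    for record in records:
        cost = int(record.get("est_tokens", 0))
        if used + cost > budget:
            continue
        selected.append(record)
        used += cost
    return selected


if __name__ == "__main__":
    # Sample records are made up for illustration.
    pages = [
        {"url": "https://site/docs/intro", "title": "Intro", "est_tokens": 720},
        {"url": "https://site/docs/api", "title": "API", "est_tokens": 4000},
        {"url": "https://site/docs/faq", "title": "FAQ", "est_tokens": 250},
    ]
    kept = take_within_budget(pages, budget=1000)
    print([p["title"] for p in kept], total_tokens(kept))
```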

Run locally

python scripts/new_actor.py --sync     # from repo root
cd actors/site-to-knowledge-base
apify run

Monetization

Pricing is pay-per-event: one page event is charged per crawled page. See docs/pricing.md. Crawl logic is shared (shared/crawl.py); edit it there, then re-run scripts/new_actor.py with --sync.
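
Since one page event is charged per crawled page, a rough cost estimate is simple arithmetic. The sketch below assumes the listed "from $0.75 / 1,000 results" rate applies as a flat per-page price; the actual rate may be higher depending on plan, so treat the result as a lower bound. `estimate_cost` is a hypothetical helper name.

```python
LISTED_PRICE_PER_1000 = 0.75  # USD, the "from" price shown on the listing


def estimate_cost(pages: int, price_per_1000: float = LISTED_PRICE_PER_1000) -> float:
    """Estimate the page-event charge in USD for a crawl of `pages` pages.

    Assumes a flat rate with one event per page; other platform
    charges, if any, are not modeled here.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * price_per_1000 / 1000


if __name__ == "__main__":
    for pages in (25, 500):  # the documented default and maximum for maxPages
        print(f"{pages} pages: about ${estimate_cost(pages):.4f}")
```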

Known limits

Fetches server-rendered HTML (no headless browser), so JavaScript-only pages return little content. A renderJs premium mode is the natural future upgrade.
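
One way to spot pages affected by this limit is to flag records whose `word_count` is suspiciously low. The sketch below assumes records shaped like the output item shown earlier; the helper name `flag_thin_pages` and the threshold of 50 words are arbitrary assumptions, and a short page is not necessarily a JavaScript-only page.

```python
from typing import Iterable, List

THIN_PAGE_WORDS = 50  # hypothetical cutoff; tune it for the site being crawled


def flag_thin_pages(records: Iterable[dict], min_words: int = THIN_PAGE_WORDS) -> List[str]:
    """Return URLs of records whose word_count falls below min_words.

    Records without a word_count are treated as having 0 words.
    """
    return [
        r.get("url", "")
        for r in records
        if int(r.get("word_count", 0)) < min_words
    ]


if __name__ == "__main__":
    # Sample records are made up for illustration.
    pages = [
        {"url": "https://site/docs/intro", "word_count": 540},
        {"url": "https://site/app", "word_count": 7},
    ]
    for url in flag_thin_pages(pages):
        print("Possibly JavaScript-rendered:", url)
```

Flagged URLs are candidates for manual review or for a tool that renders JavaScript.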

Site Crawler: Website → Markdown Corpus for LLM/RAG

boxbox10/site-crawler

Crawl a whole website or docs site and get one clean, LLM-ready Markdown + JSON record per page (title, headings, content, links, token count). Built for RAG ingestion and AI knowledge bases.

Marvin Eguilos

PDF Extractor: PDF → Clean Markdown + JSON for LLM/RAG

boxbox10/pdf-extractor

Turn any PDF URL into clean, LLM-ready Markdown + structured JSON (title, metadata, per-page text, page count, word count, token count). Perfect for RAG pipelines, AI agents, and LLM document ingestion.

Marvin Eguilos

Website to RAG Knowledge Dataset

ghostgrid/website-to-rag-knowledge-dataset

Convert a website sitemap into clean text and markdown rows for RAG, AI search, chatbots, and knowledge base workflows.

GhostGrid

Website Content to Markdown (LLM-ready)

vivid_astronaut/website-content-to-markdown

Turn any website into clean, LLM-ready Markdown for RAG pipelines, AI agents and knowledge bases. Scrape single pages or crawl entire sites. Compliance-first: robots.txt honored.

Fabio Suizu

Site to Markdown — any site to clean, LLM-ready markdown

topsail/site-to-markdown

Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.

Connor Teskey

AI Web Extractor: URL → Clean Markdown + JSON for LLM/RAG

boxbox10/ai-web-extractor

Turn any URL into clean, LLM-ready Markdown + structured JSON (title, headings, main content, links, metadata, token count). Perfect for RAG pipelines, AI agents, and LLM context.

Marvin Eguilos

Website To LLM Knowledge Pack

attainable_iota/website-to-llm-knowledge-pack

Crawl any website and turn it into an LLM-ready knowledge pack. This Actor extracts clean main text + metadata, follows links with depth/URL filters, and outputs per-page dataset items plus knowledge.jsonl, knowledge.md, and manifest.json for RAG/embeddings pipelines.

M Junaid Shaukat

Website Content Crawler — Text, Markdown & HTML for AI/LLM

hichemdev/website-content-crawler

Crawl any website and extract clean text, Markdown, and HTML from every page — ready for LLM, RAG, and AI ingestion.

Hichem Ben Moussa

LLM-Ready Web Extractor

phantom_horse/my-actor-1

Turn any web page into clean, LLM-ready Markdown. Strips scripts, nav, and page chrome, then converts the main content to tidy Markdown with title, meta description, and token counts. Perfect for AI prompts and RAG ingestion pipelines.