🧠 Universal Knowledge Base Scraper (RAG Ready)

Feed your AI Agents with clean, structured Markdown. Stop feeding them HTML garbage.


🚀 What is Universal RAG Scraper?

Universal RAG Scraper is an "ETL-in-a-Box" for AI Developers. It turns messy Help Centers (Zendesk, Intercom, Docusaurus, Notion) into pure, train-ready Markdown (.md) files.

If you are building RAG Pipelines (Retrieval-Augmented Generation) or AI Agents, you know that HTML noise (navbars, footers, cookie banners) ruins your vector embeddings. This Actor solves that problem instantly.

Why not just use a generic scraper?

Generic scrapers give you the page. We give you the content.

  • Auto-Detect: We identify the platform (e.g., Zendesk) and apply surgical clean-up rules.
  • Markdown Native: We don't just "strip tags"; we convert tables, lists, and code blocks into perfect Markdown.
  • Metadata Rich: We extract the Title, URL, and Last Updated Date for your Vector DB.

⚡ Enterprise-Grade Features

Built for scale and reliability:

  1. 🛡️ Zero-Config Proxies: Scrape protected Help Centers without getting 403 Blocked. Request rotation is built-in.
  2. ⏰ Auto-Sync Scheduling: Set it to run every Friday night. Keep your RAG Knowledge Base in sync with your product docs automatically.
  3. 💾 Infinite Storage: Scrape 10,000 pages or 10 million. All data is stored, indexed, and ready for export (JSON, CSV, Excel).
  4. 🔌 Native Integrations: Pipe the Markdown directly to Pinecone, LangChain, or Zapier. No glue code needed (see the sketch after this list).
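
For the LangChain route specifically, the hand-off is small: each dataset item becomes one Document. Below is a minimal sketch using the Apify Python client and langchain-core, assuming a finished run; the dataset ID is a placeholder, and chunking/embedding stay in your own pipeline.

import os
from apify_client import ApifyClient
from langchain_core.documents import Document

client = ApifyClient(os.environ["APIFY_TOKEN"])  # your Apify API token

# "<dataset-id>" is a placeholder for the dataset produced by an Actor run.
items = client.dataset("<dataset-id>").list_items().items

docs = [
    Document(
        page_content=item["markdown"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "platform": item["platform"],
        },
    )
    for item in items
]
# docs can now go straight into a text splitter and a vector store such as Pinecone.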

🎯 Supported Platforms (Auto-Detected)

Platform   | Capability
-----------|-------------------------------------------------------------
Zendesk    | Full support. Strips "Related Articles" & sidebars.
Intercom   | Full support. Handles dynamic loading.
Docusaurus | Perfect for V2/V3 docs. Preserves code block languages.
Notion     | Scrapes public Notion Knowledge Bases.
Generic    | Smart Fallback: If we don't recognize the platform, we use advanced readability algorithms to extract the main content.

📚 How to scrape a Knowledge Base in 3 steps

  1. Paste the URL: Go to the input tab and enter the URL of the Help Center home page (e.g., https://support.zoom.us/hc/en-us).
  2. Set Depth: Choose how many links to follow (default: 2 levels deep).
  3. Run: Click "Start". In minutes, you can download a JSON file containing all articles in Markdown.

💰 Pricing & Usage

This is a Rental Actor, priced at $49.00/month + usage.

  • Free Trial: You can test the scraper for a limited time to verify the Markdown quality.
  • Rental Plan: Access unlimited scale, high-frequency scheduling, and priority support.

Cost Estimation:

  • Scraping a typical Help Center (500 pages) takes ~5-10 minutes.
  • The output is "Vector Ready" - no post-processing costs.

📤 Input & Output

Input Configuration

Simple, developer-friendly input:

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 10,
  "outputFormat": "markdown"
}
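
The same input can also be sent programmatically instead of through the UI. Below is a minimal sketch with the Apify Python client; the Actor identifier is a placeholder (use the one shown on this page), and the API token is read from an environment variable.

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])  # your Apify API token

run_input = {
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxDepth": 10,
    "outputFormat": "markdown",
}

# "<username>/<actor-name>" is a placeholder for this Actor's ID.
run = client.actor("<username>/<actor-name>").call(run_input=run_input)

# Each dataset item is one article (url, title, platform, scrapedAt, markdown).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "->", len(item["markdown"]), "chars of Markdown")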

Output (JSON/Dataset)

Each item in the dataset is one article:

{
  "url": "https://docs.apify.com/academy/web-scraping",
  "title": "Web Scraping Academy",
  "platform": "Docusaurus",
  "scrapedAt": "2023-10-27T10:00:00Z",
  "markdown": "# Web Scraping Academy\n\nLearn how to scrape..."
}
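
Because the markdown field keeps headings, lists, and code blocks intact, a common next step is to split each article at heading boundaries before embedding. Here is a minimal sketch of that pattern, assuming the record shape above; the heading-level chunking itself is an illustration, not something the Actor does for you.

import re

def chunk_by_headings(record: dict) -> list[dict]:
    """Split one scraped article into chunks at Markdown headings (#, ##, ###)."""
    sections = re.split(r"\n(?=#{1,3} )", record["markdown"])
    return [
        {
            "text": section.strip(),
            "source": record["url"],
            "title": record["title"],
            "scraped_at": record["scrapedAt"],
        }
        for section in sections
        if section.strip()
    ]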

❓ FAQ

Can I scrape a custom-built Help Center?

Yes. The Actor uses a "Smart Fallback" (Readability algorithm). If it doesn't detect Zendesk/Intercom, it will still scan the page, identify the visual "main content" area, and extract it.

Does this handle dynamic JavaScript sites?

Yes. We use Playwright (headless browser) under the hood. We render the full page, execute JavaScript, and then scrape. This works even on React/Vue/Angular apps.
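
For context, the render-then-extract pattern those two answers describe looks roughly like the sketch below. It uses Playwright, readability-lxml, and html2text as stand-ins; it is an illustration of the approach, not the Actor's actual code.

from playwright.sync_api import sync_playwright
from readability import Document  # readability-lxml
import html2text

def page_to_markdown(url: str) -> str:
    """Render a JavaScript-heavy page, keep the main content, convert to Markdown."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()  # fully rendered DOM, after JavaScript has executed
        browser.close()
    main_html = Document(html).summary()  # readability-style main-content extraction
    return html2text.html2text(main_html)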

How do I feed this into my LLM?

  1. Run the Actor.
  2. Download the JSON output.
  3. Use the markdown field as the content in your LLM Prompt or Embedding request (see the sketch below).
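
As a concrete example of step 3, this minimal sketch embeds the markdown field with the OpenAI embeddings client; the file name and model are assumptions, so swap in whatever embedding provider your stack uses.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "items.json" is the dataset export downloaded in step 2 (name is an assumption).
with open("items.json", encoding="utf-8") as f:
    articles = json.load(f)

response = client.embeddings.create(
    model="text-embedding-3-small",  # assumed model; pick your own
    input=[article["markdown"] for article in articles],
)

# Pair each vector with its source URL before upserting into your vector DB.
vectors = [
    {"url": article["url"], "embedding": item.embedding}
    for article, item in zip(articles, response.data)
]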

📞 Support & Feedback

Found a site we can't scrape? Missing a platform?

  • Report a Bug: Use the "Issues" tab.
  • Request a Feature: We add new Platforms (e.g., Gitbook, ReadTheDocs) based on user votes!