🧠 Universal Knowledge Base Scraper (RAG Ready)

Feed your AI Agents with clean, structured Markdown. Stop feeding them HTML garbage.


🚀 What is Universal RAG Scraper?

Universal RAG Scraper is an "ETL-in-a-Box" for AI Developers. It turns messy Help Centers (Zendesk, Intercom, Docusaurus, Notion) into pure, train-ready Markdown (.md) files.

If you are building RAG Pipelines (Retrieval-Augmented Generation) or AI Agents, you know that HTML noise (navbars, footers, cookie banners) ruins your vector embeddings. This Actor solves that problem instantly.

Why not just use a generic scraper?

Generic scrapers give you the page. We give you the content.

  • Auto-Detect: We identify the platform (e.g., Zendesk) and apply surgical clean-up rules.
  • Markdown Native: We don't just "strip tags"; we convert tables, lists, and code blocks into perfect Markdown.
  • Metadata Rich: We extract the Title, URL, and Last Updated Date for your Vector DB.

⚡ Enterprise-Grade Features

Built for scale and reliability:

  1. 🛡️ Zero-Config Proxies: Scrape protected Help Centers without getting 403 Blocked. Request rotation is built-in.
  2. ⏰ Auto-Sync Scheduling: Set it to run every Friday night. Keep your RAG Knowledge Base in sync with your product docs automatically.
  3. 💾 Infinite Storage: Scrape 10,000 pages or 10 million. All data is stored, indexed, and ready for export (JSON, CSV, Excel).
  4. 🔌 Native Integrations: Pipe the Markdown directly to Pinecone, LangChain, or Zapier. No glue code needed (see the sketch after this list).
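
For the LangChain route specifically, the hand-off is small: each dataset item becomes one Document. Below is a minimal sketch using the Apify Python client and langchain-core, assuming a finished run; the dataset ID is a placeholder, and chunking/embedding stay in your own pipeline.

import os
from apify_client import ApifyClient
from langchain_core.documents import Document

client = ApifyClient(os.environ["APIFY_TOKEN"])  # your Apify API token

# "<dataset-id>" is a placeholder for the dataset produced by an Actor run.
items = client.dataset("<dataset-id>").list_items().items

docs = [
    Document(
        page_content=item["markdown"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "platform": item["platform"],
        },
    )
    for item in items
]
# docs can now go straight into a text splitter and a vector store such as Pinecone.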

🎯 Supported Platforms (Auto-Detected)

Platform   | Capability
-----------|-------------------------------------------------------------
Zendesk    | Full support. Strips "Related Articles" & sidebars.
Intercom   | Full support. Handles dynamic loading.
Docusaurus | Perfect for V2/V3 docs. Preserves code block languages.
Notion     | Scrapes public Notion Knowledge Bases.
Generic    | Smart Fallback: If we don't recognize the platform, we use advanced readability algorithms to extract the main content.

📚 How to scrape a Knowledge Base in 3 steps

  1. Paste the URL: Go to the input tab and enter the URL of the Help Center home page (e.g., https://support.zoom.us/hc/en-us).
  2. Set Depth: Choose how many links to follow (default: 2 levels deep).
  3. Run: Click "Start". In minutes, you can download a JSON file containing all articles in Markdown.

💰 Pricing & Usage

This is a Rental Actor, priced at $49.00/month + usage.

  • Free Trial: You can test the scraper for a limited time to verify the Markdown quality.
  • Rental Plan: Access unlimited scale, high-frequency scheduling, and priority support.

Cost Estimation:

  • Scraping a typical Help Center (500 pages) takes ~5-10 minutes.
  • The output is "Vector Ready" - no post-processing costs.

📤 Input & Output

Input Configuration

Simple, developer-friendly input:

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxDepth": 10,
  "outputFormat": "markdown"
}
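
The same input can also be sent programmatically instead of through the UI. Below is a minimal sketch with the Apify Python client; the Actor identifier is a placeholder (use the one shown on this page), and the API token is read from an environment variable.

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])  # your Apify API token

run_input = {
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxDepth": 10,
    "outputFormat": "markdown",
}

# "<username>/<actor-name>" is a placeholder for this Actor's ID.
run = client.actor("<username>/<actor-name>").call(run_input=run_input)

# Each dataset item is one article (url, title, platform, scrapedAt, markdown).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "->", len(item["markdown"]), "chars of Markdown")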

Output (JSON/Dataset)

Each item in the dataset is one article:

{
  "url": "https://docs.apify.com/academy/web-scraping",
  "title": "Web Scraping Academy",
  "platform": "Docusaurus",
  "scrapedAt": "2023-10-27T10:00:00Z",
  "markdown": "# Web Scraping Academy\n\nLearn how to scrape..."
}
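
Because the markdown field keeps headings, lists, and code blocks intact, a common next step is to split each article at heading boundaries before embedding. Here is a minimal sketch of that pattern, assuming the record shape above; the heading-level chunking itself is an illustration, not something the Actor does for you.

import re

def chunk_by_headings(record: dict) -> list[dict]:
    """Split one scraped article into chunks at Markdown headings (#, ##, ###)."""
    sections = re.split(r"\n(?=#{1,3} )", record["markdown"])
    return [
        {
            "text": section.strip(),
            "source": record["url"],
            "title": record["title"],
            "scraped_at": record["scrapedAt"],
        }
        for section in sections
        if section.strip()
    ]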

❓ FAQ

Can I scrape a custom-built Help Center?

Yes. The Actor uses a "Smart Fallback" (Readability algorithm). If it doesn't detect Zendesk/Intercom, it will still scan the page, identify the visual "main content" area, and extract it.

Does this handle dynamic JavaScript sites?

Yes. We use Playwright (headless browser) under the hood. We render the full page, execute JavaScript, and then scrape. This works even on React/Vue/Angular apps.
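
For context, the render-then-extract pattern those two answers describe looks roughly like the sketch below. It uses Playwright, readability-lxml, and html2text as stand-ins; it is an illustration of the approach, not the Actor's actual code.

from playwright.sync_api import sync_playwright
from readability import Document  # readability-lxml
import html2text

def page_to_markdown(url: str) -> str:
    """Render a JavaScript-heavy page, keep the main content, convert to Markdown."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()  # fully rendered DOM, after JavaScript has executed
        browser.close()
    main_html = Document(html).summary()  # readability-style main-content extraction
    return html2text.html2text(main_html)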

How do I feed this into my LLM?

  1. Run the Actor.
  2. Download the JSON output.
  3. Use the markdown field as the content in your LLM Prompt or Embedding request (see the sketch below).
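
As a concrete example of step 3, this minimal sketch embeds the markdown field with the OpenAI embeddings client; the file name and model are assumptions, so swap in whatever embedding provider your stack uses.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "items.json" is the dataset export downloaded in step 2 (name is an assumption).
with open("items.json", encoding="utf-8") as f:
    articles = json.load(f)

response = client.embeddings.create(
    model="text-embedding-3-small",  # assumed model; pick your own
    input=[article["markdown"] for article in articles],
)

# Pair each vector with its source URL before upserting into your vector DB.
vectors = [
    {"url": article["url"], "embedding": item.embedding}
    for article, item in zip(articles, response.data)
]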

📞 Support & Feedback

Found a site we can't scrape? Missing a platform?

  • Report a Bug: Use the "Issues" tab.
  • Request a Feature: We add new Platforms (e.g., Gitbook, ReadTheDocs) based on user votes!