AI Web Scraper for RAG - Markdown & Chunking
Pricing
Pay per usage
Convert any URL to clean markdown, structured JSON, or auto-chunked text for RAG/LLM pipelines. Removes ads, nav, footers. Firecrawl alternative at $0.05/page. AI training data extraction.
Developer: daehwan kim
Convert any URL to clean markdown, structured JSON, or RAG-ready text chunks at $0.05 per page, a fraction of what Firecrawl or Jina charge.
What does AI Web Scraper for RAG do?
AI Web Scraper for RAG fetches any public URL and returns its content in a format ready for LLM pipelines, vector databases, or AI training datasets. It removes ads, navigation bars, cookie banners, footers, and other noise using Cheerio-based HTML cleaning before producing the output you request.
The Actor supports four output modes. Markdown mode gives you clean, readable text with preserved structure — headings, links, images, and tables all converted correctly. Structured mode returns a parsed JSON object with individual fields for title, headings, paragraphs, links, images, and tables. Chunks mode auto-splits the page content into fixed-size overlapping segments ready to insert into Pinecone, Weaviate, Chroma, or any vector store. Full mode combines all three in a single run.
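The chunks mode described above can be approximated with a simple sliding window. The sketch below assumes character-based chunks where each chunk repeats the last `overlap` characters of its predecessor; the Actor's actual boundary handling may differ:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size segments with a fixed character overlap.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters of context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)
            if text[i:i + chunk_size]]
```

With the defaults (1,000-character chunks, 200-character overlap), a 2,500-character page yields four chunks, and the first 200 characters of each chunk repeat the tail of the one before it.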
Unlike Firecrawl (starting at $19/month with rate limits) or Jina Reader (metered API), this Actor charges only when it succeeds — $0.05 per page, nothing for failures. There are no monthly seats, no rate-limit tiers, and no API key required beyond your Apify account.
Key features
- Four output modes: markdown, structured JSON, auto-chunked text, or all formats combined in one run
- Noise removal: strips navigation, ads, footers, cookie banners, and script/style blocks before extraction
- Configurable chunking: set chunk size (200–5,000 chars) and overlap (0–1,000 chars) to match your embedding model's context window
- Structured data extraction: tables parsed as arrays, links as `{ text, href }` pairs, images as `{ src, alt }` pairs
- Full metadata extraction: page title, meta description, Open Graph tags, canonical URL, and language
- Word count and reading time: top-level summary fields surfaced for quick dataset inspection
- Selective output: toggle links, images, tables, and metadata independently
- Pay-per-event pricing: charged only on successful extraction, not on errors or invalid URLs
- Clean markdown output: heading levels preserved, inline formatting intact, suitable for direct LLM prompt injection
Use cases
- AI developers building RAG pipelines: ingest documentation, blog posts, or product pages as clean chunks ready for embedding
- LLM fine-tuning teams: collect structured training data from web sources without building a scraping pipeline
- Content teams: convert competitor pages or research articles into editable markdown
- Automation engineers: integrate page extraction into n8n, Make, or Zapier workflows without maintaining a scraper
- Data scientists: extract tables and structured content from report pages for downstream analysis
- No-code builders: use Apify's scheduled runs to refresh content snapshots on a recurring basis
How to use AI Web Scraper for RAG
- Configure input — provide the URL to scrape and select your output mode (markdown, structured, chunks, or full); optionally set chunk size, overlap, and toggle which content types to include
- Run the Actor — click "Start" in Apify Console or call via the Apify API
- Get structured results — output is pushed to the Apify dataset as structured JSON
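The steps above can be scripted with Apify's official Python client. This is a sketch: the actor ID placeholder must be replaced with the ID shown on this Actor's page, and the input field names follow the parameter table below:

```python
def build_run_input(url: str, mode: str = "markdown",
                    chunk_size: int = 1000, chunk_overlap: int = 200) -> dict:
    """Assemble the Actor's input using the documented parameter names and defaults."""
    if mode not in {"markdown", "structured", "chunks", "full"}:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"url": url, "mode": mode,
            "chunkSize": chunk_size, "chunkOverlap": chunk_overlap}

def run_actor(token: str, run_input: dict, actor_id: str = "<username>/<actor-name>"):
    """Run the Actor and return its dataset items via the official client."""
    # Import here so build_run_input works without the package installed;
    # install with: pip install apify-client
    from apify_client import ApifyClient
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return client.dataset(run["defaultDatasetId"]).list_items().items
```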
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | — | The URL of the web page to extract content from |
| `mode` | string | No | `markdown` | Output format: `markdown`, `structured`, `chunks`, or `full` |
| `chunkSize` | integer | No | 1000 | Target chunk size in characters (200–5,000); applies to `chunks` and `full` modes |
| `chunkOverlap` | integer | No | 200 | Overlapping characters between consecutive chunks (0–1,000) to prevent context loss at boundaries |
| `includeLinks` | boolean | No | true | Include hyperlinks found in the page content |
| `includeImages` | boolean | No | true | Include image URLs and alt text |
| `includeTables` | boolean | No | true | Extract and include tables as structured data |
| `includeMetadata` | boolean | No | true | Include page metadata: title, meta description, Open Graph tags, canonical URL, language |
Output example
{"url": "https://blog.apify.com/web-scraping-guide/","mode": "chunks","title": "The Complete Guide to Web Scraping","wordCount": 3842,"chunkCount": 18,"chunks": [{"index": 0,"total": 18,"text": "The Complete Guide to Web Scraping Web scraping is the automated extraction of data from websites. It powers price monitoring, lead generation, research, and countless other use cases across industries...","charCount": 998,"metadata": {"url": "https://blog.apify.com/web-scraping-guide/","title": "The Complete Guide to Web Scraping","description": "Learn how web scraping works, which tools to use, and how to avoid common pitfalls."}},{"index": 1,"total": 18,"text": "...avoid common pitfalls. How Web Scraping Works At its core, web scraping involves three steps: fetching the HTML of a page, parsing the structure, and extracting the data you need...","charCount": 1001,"metadata": {"url": "https://blog.apify.com/web-scraping-guide/","title": "The Complete Guide to Web Scraping","description": "Learn how web scraping works, which tools to use, and how to avoid common pitfalls."}}],"avgChunkSize": 987,"metadata": {"title": "The Complete Guide to Web Scraping","description": "Learn how web scraping works, which tools to use, and how to avoid common pitfalls.","ogTitle": "The Complete Guide to Web Scraping","canonicalUrl": "https://blog.apify.com/web-scraping-guide/","language": "en"},"timestamp": "2025-03-21T09:14:22.003Z"}
Pricing
Each successful page extraction costs $0.05 under Apify's pay-per-event model. You only pay when the extraction completes and data is pushed to the dataset. Failed runs, invalid URLs, and unreachable pages are not charged. Learn more about pay-per-event pricing.
API and integrations
Call this Actor via the Apify API, schedule recurring runs, or connect to Make, n8n, or Zapier to trigger extractions from other tools. Results are available as JSON, CSV, or Excel from the Apify dataset. You can also pass the output directly into vector database ingestion workflows using the Apify API output endpoint.
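For direct API access without a client library, Apify's synchronous `run-sync-get-dataset-items` endpoint starts an Actor and returns its dataset items in one request. A stdlib-only sketch; the actor ID format (`username~actor-name`) follows Apify's API path convention, and you must substitute your own ID and token:

```python
import json
import urllib.request

def run_sync_url(actor_id: str, token: str) -> str:
    """Build the URL for Apify's synchronous run endpoint."""
    return (f"https://api.apify.com/v2/acts/{actor_id}"
            f"/run-sync-get-dataset-items?token={token}")

def fetch_page(actor_id: str, token: str, url: str, mode: str = "markdown"):
    """POST the Actor input and return the parsed dataset items."""
    body = json.dumps({"url": url, "mode": mode}).encode()
    request = urllib.request.Request(
        run_sync_url(actor_id, token),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```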
Limitations
- JavaScript-rendered content (single-page apps that load data client-side) may return incomplete results, as the Actor uses Cheerio rather than a full browser
- Pages behind login walls, CAPTCHAs, or aggressive bot detection are not supported
- Very large pages (100,000+ words) may produce many chunks; use a larger `chunkSize` to reduce the count
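To gauge how many chunks a large page will produce before running it, note that each chunk advances `chunkSize - chunkOverlap` new characters, so the count depends on that step, not on `chunkSize` alone. A quick estimator (the six-characters-per-word figure is an assumption for English text):

```python
def estimated_chunk_count(total_chars: int, chunk_size: int = 1000,
                          overlap: int = 200) -> int:
    """Rough chunk count: ceiling of total characters over the step size."""
    step = chunk_size - overlap
    return max(1, -(-total_chars // step))  # -(-a // b) is ceiling division
```

A 100,000-word page at roughly six characters per word is about 600,000 characters: around 750 chunks at the defaults, but only about 215 with `chunkSize` 3,000 and `chunkOverlap` 200.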