AI Web Scraper for RAG - Markdown & Chunking
Pricing
Pay per usage
Convert any URL to clean markdown, structured JSON, or auto-chunked text for RAG/LLM pipelines. Removes ads, nav, footers. Firecrawl alternative at $0.05/page. AI training data extraction.
Developer: daehwan kim
Convert any URL to clean markdown, structured JSON, or RAG-ready text chunks at $0.05 per page, a fraction of what Firecrawl or Jina charge.
What does AI Web Scraper for RAG do?
AI Web Scraper for RAG fetches any public URL and returns its content in a format ready for LLM pipelines, vector databases, or AI training datasets. It removes ads, navigation bars, cookie banners, footers, and other noise using Cheerio-based HTML cleaning before producing the output you request.
The Actor supports four output modes. Markdown mode gives you clean, readable text with preserved structure — headings, links, images, and tables all converted correctly. Structured mode returns a parsed JSON object with individual fields for title, headings, paragraphs, links, images, and tables. Chunks mode auto-splits the page content into fixed-size overlapping segments ready to insert into Pinecone, Weaviate, Chroma, or any vector store. Full mode combines all three in a single run.
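The chunks mode described above can be approximated with a simple sliding window. The sketch below assumes character-based chunks where each chunk repeats the last `overlap` characters of its predecessor; the Actor's actual boundary handling may differ:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size segments with a fixed character overlap.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so consecutive chunks share `overlap` characters of context.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)
            if text[i:i + chunk_size]]
```

With the defaults (1,000-character chunks, 200-character overlap), a 2,500-character page yields four chunks, and the first 200 characters of each chunk repeat the tail of the one before it.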
Unlike Firecrawl (starting at $19/month with rate limits) or Jina Reader (metered API), this Actor charges only when it succeeds — $0.05 per page, nothing for failures. There are no monthly seats, no rate-limit tiers, and no API key required beyond your Apify account.
Key features
- Four output modes: markdown, structured JSON, auto-chunked text, or all formats combined in one run
- Noise removal: strips navigation, ads, footers, cookie banners, and script/style blocks before extraction
- Configurable chunking: set chunk size (200–5,000 chars) and overlap (0–1,000 chars) to match your embedding model's context window
- Structured data extraction: tables parsed as arrays, links as `{ text, href }` pairs, images as `{ src, alt }` pairs
- Full metadata extraction: page title, meta description, Open Graph tags, canonical URL, and language
- Word count and reading time: top-level summary fields surfaced for quick dataset inspection
- Selective output: toggle links, images, tables, and metadata independently
- Pay-per-event pricing: charged only on successful extraction, not on errors or invalid URLs
- Clean markdown output: heading levels preserved, inline formatting intact, suitable for direct LLM prompt injection
Use cases
- AI developers building RAG pipelines: ingest documentation, blog posts, or product pages as clean chunks ready for embedding
- LLM fine-tuning teams: collect structured training data from web sources without building a scraping pipeline
- Content teams: convert competitor pages or research articles into editable markdown
- Automation engineers: integrate page extraction into n8n, Make, or Zapier workflows without maintaining a scraper
- Data scientists: extract tables and structured content from report pages for downstream analysis
- No-code builders: use Apify's scheduled runs to refresh content snapshots on a recurring basis
How to use AI Web Scraper for RAG
- Configure input — provide the URL to scrape and select your output mode (markdown, structured, chunks, or full); optionally set chunk size, overlap, and toggle which content types to include
- Run the Actor — click "Start" in Apify Console or call via the Apify API
- Get structured results — output is pushed to the Apify dataset as structured JSON
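The steps above can be scripted with Apify's official Python client. This is a sketch: the actor ID placeholder must be replaced with the ID shown on this Actor's page, and the input field names follow the parameter table below:

```python
def build_run_input(url: str, mode: str = "markdown",
                    chunk_size: int = 1000, chunk_overlap: int = 200) -> dict:
    """Assemble the Actor's input using the documented parameter names and defaults."""
    if mode not in {"markdown", "structured", "chunks", "full"}:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"url": url, "mode": mode,
            "chunkSize": chunk_size, "chunkOverlap": chunk_overlap}

def run_actor(token: str, run_input: dict, actor_id: str = "<username>/<actor-name>"):
    """Run the Actor and return its dataset items via the official client."""
    # Import here so build_run_input works without the package installed;
    # install with: pip install apify-client
    from apify_client import ApifyClient
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return client.dataset(run["defaultDatasetId"]).list_items().items
```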
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | — | The URL of the web page to extract content from |
| `mode` | string | No | `markdown` | Output format: `markdown`, `structured`, `chunks`, or `full` |
| `chunkSize` | integer | No | 1000 | Target chunk size in characters (200–5,000); applies to `chunks` and `full` modes |
| `chunkOverlap` | integer | No | 200 | Overlapping characters between consecutive chunks (0–1,000) to prevent context loss at boundaries |
| `includeLinks` | boolean | No | true | Include hyperlinks found in the page content |
| `includeImages` | boolean | No | true | Include image URLs and alt text |
| `includeTables` | boolean | No | true | Extract and include tables as structured data |
| `includeMetadata` | boolean | No | true | Include page metadata: title, meta description, Open Graph tags, canonical URL, language |
Output example
{"url": "https://blog.apify.com/web-scraping-guide/","mode": "chunks","title": "The Complete Guide to Web Scraping","wordCount": 3842,"chunkCount": 18,"chunks": [{"index": 0,"total": 18,"text": "The Complete Guide to Web Scraping Web scraping is the automated extraction of data from websites. It powers price monitoring, lead generation, research, and countless other use cases across industries...","charCount": 998,"metadata": {"url": "https://blog.apify.com/web-scraping-guide/","title": "The Complete Guide to Web Scraping","description": "Learn how web scraping works, which tools to use, and how to avoid common pitfalls."}},{"index": 1,"total": 18,"text": "...avoid common pitfalls. How Web Scraping Works At its core, web scraping involves three steps: fetching the HTML of a page, parsing the structure, and extracting the data you need...","charCount": 1001,"metadata": {"url": "https://blog.apify.com/web-scraping-guide/","title": "The Complete Guide to Web Scraping","description": "Learn how web scraping works, which tools to use, and how to avoid common pitfalls."}}],"avgChunkSize": 987,"metadata": {"title": "The Complete Guide to Web Scraping","description": "Learn how web scraping works, which tools to use, and how to avoid common pitfalls.","ogTitle": "The Complete Guide to Web Scraping","canonicalUrl": "https://blog.apify.com/web-scraping-guide/","language": "en"},"timestamp": "2025-03-21T09:14:22.003Z"}
Pricing
Each successful page extraction costs $0.05 under Apify's pay-per-event model. You only pay when the extraction completes and data is pushed to the dataset. Failed runs, invalid URLs, and unreachable pages are not charged. Learn more about pay-per-event pricing.
API and integrations
Call this Actor via the Apify API, schedule recurring runs, or connect to Make, n8n, or Zapier to trigger extractions from other tools. Results are available as JSON, CSV, or Excel from the Apify dataset. You can also pass the output directly into vector database ingestion workflows using the Apify API output endpoint.
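For direct API access without a client library, Apify's synchronous `run-sync-get-dataset-items` endpoint starts an Actor and returns its dataset items in one request. A stdlib-only sketch; the actor ID format (`username~actor-name`) follows Apify's API path convention, and you must substitute your own ID and token:

```python
import json
import urllib.request

def run_sync_url(actor_id: str, token: str) -> str:
    """Build the URL for Apify's synchronous run endpoint."""
    return (f"https://api.apify.com/v2/acts/{actor_id}"
            f"/run-sync-get-dataset-items?token={token}")

def fetch_page(actor_id: str, token: str, url: str, mode: str = "markdown"):
    """POST the Actor input and return the parsed dataset items."""
    body = json.dumps({"url": url, "mode": mode}).encode()
    request = urllib.request.Request(
        run_sync_url(actor_id, token),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```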
Limitations
- JavaScript-rendered content (single-page apps that load data client-side) may return incomplete results, as the Actor uses Cheerio rather than a full browser
- Pages behind login walls, CAPTCHAs, or aggressive bot detection are not supported
- Very large pages (100,000+ words) may produce many chunks; use a larger `chunkSize` to reduce the count
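To gauge how many chunks a large page will produce before running it, note that each chunk advances `chunkSize - chunkOverlap` new characters, so the count depends on that step, not on `chunkSize` alone. A quick estimator (the six-characters-per-word figure is an assumption for English text):

```python
def estimated_chunk_count(total_chars: int, chunk_size: int = 1000,
                          overlap: int = 200) -> int:
    """Rough chunk count: ceiling of total characters over the step size."""
    step = chunk_size - overlap
    return max(1, -(-total_chars // step))  # -(-a // b) is ceiling division
```

A 100,000-word page at roughly six characters per word is about 600,000 characters: around 750 chunks at the defaults, but only about 215 with `chunkSize` 3,000 and `chunkOverlap` 200.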