AI-Ready Web Content Crawler (LLM/RAG Optimized)

Pricing

from $20.00 / 1,000 results

Deep-crawl websites and extract LLM-ready Markdown with OG tags, JSON-LD, author, dates, token estimates, native RAG chunking, language filtering, content-hash dedup, and per-page error reporting. Enforced timeouts. Zero silent failures.

Rating: 0.0 (0)

Developer: Yuliia Kulakova

Maintained by Community

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user

Last modified: 7 days ago

AI-Ready Web Content Crawler

Crawl any website and get clean, structured Markdown ready for your AI pipeline. Built for developers working on RAG applications, fine-tuning datasets, and AI-powered content workflows.


What you get

Every page you crawl is returned as a clean, structured record with:

  • Clean Markdown — nav, ads, footers, cookie banners automatically removed
  • Plain text — stripped version for embeddings and search indexes
  • Rich metadata — title, author, publish date, Open Graph, Twitter Card, JSON-LD structured data, language, canonical URL, hreflang
  • Token estimate — per-page token count so you know your LLM costs upfront
  • Content type — automatically classified as article, documentation, product, or landing page
  • RAG-ready chunks — split at semantic boundaries (headings, paragraphs) with configurable overlap
  • Link graph — internal links, external links, and PDF links per page
  • Crawl analytics — word counts, token totals, language distribution, depth distribution

Quick start

Just paste a URL and click Run. That's it.

{
  "startUrls": [{ "url": "https://docs.example.com" }]
}

With the defaults, the crawler saves up to 100 pages, follows links up to 5 levels deep from the start URL, extracts clean Markdown with full metadata, and returns everything as structured JSON.


Use cases

Build a RAG knowledge base

Crawl your documentation site and get chunks ready to embed — no post-processing needed.

{
  "startUrls": [{ "url": "https://docs.yoursite.com" }],
  "maxCrawlPages": 500,
  "languageFilter": ["en"],
  "chunkContent": true,
  "chunkSize": 1500,
  "chunkOverlap": 150,
  "deduplicateByContent": true
}

Each page comes with a chunks array. Each chunk includes its text, its position in the page, and a token estimate. Pass the chunks to an embedding API such as OpenAI's, then store the vectors in Pinecone, Weaviate, or any other vector database.
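Before embedding, it is often convenient to flatten the per-page chunks arrays into one record per chunk. The sketch below assumes the field names from the output example (url, metadata.title, chunks, chunkIndex, text, tokenEstimate); the flatten_chunks helper and its ID scheme are hypothetical, not part of the crawler.

```python
def flatten_chunks(pages):
    """Turn crawled page records into one flat record per chunk."""
    flat = []
    for page in pages:
        title = page.get("metadata", {}).get("title")
        for chunk in page.get("chunks", []):
            flat.append({
                # Hypothetical ID scheme: page URL plus chunk index.
                "id": f"{page['url']}#chunk-{chunk['chunkIndex']}",
                "text": chunk["text"],
                "tokenEstimate": chunk.get("tokenEstimate"),
                "source": {"url": page["url"], "title": title},
            })
    return flat


# Sample record shaped like the output example in this README.
pages = [{
    "url": "https://example.com/blog/ai-trends",
    "metadata": {"title": "Top AI Trends for 2025"},
    "chunks": [
        {"chunkIndex": 0, "text": "# Top AI Trends...", "tokenEstimate": 461},
    ],
}]
records = flatten_chunks(pages)
print(records[0]["id"])
```

Each flat record carries its source URL and title, so a retrieved chunk can be traced back to the page it came from.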

Monitor competitor content

Track what your competitors publish, when they update it, and how they structure it.

{
  "startUrls": [{ "url": "https://blog.competitor.com" }],
  "globs": ["https://blog.competitor.com/posts/**"],
  "excludeGlobs": ["**/tag/**", "**/author/**"],
  "extractMetadata": true,
  "extractLinks": true,
  "maxCrawlPages": 200
}

Get author names, publish dates, content types, and full link graphs for every article.
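To see what changed between two runs, one option is to compare snapshots of the dataset yourself. This is a minimal sketch that assumes each record has the url and markdown fields from the output example; fingerprint and diff_snapshots are hypothetical helper names, and SHA-256 is just one convenient stable hash.

```python
import hashlib


def fingerprint(pages):
    """Map each page URL to a digest of its Markdown content."""
    return {
        p["url"]: hashlib.sha256(p.get("markdown", "").encode("utf-8")).hexdigest()
        for p in pages
    }


def diff_snapshots(previous, current):
    """Report which URLs are new, removed, or changed between two crawls."""
    old, new = fingerprint(previous), fingerprint(current)
    return {
        "new": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in new.keys() & old.keys() if new[u] != old[u]),
    }


base = "https://blog.competitor.com/posts"
previous = [
    {"url": f"{base}/a", "markdown": "# A"},
    {"url": f"{base}/b", "markdown": "# B"},
]
current = [
    {"url": f"{base}/a", "markdown": "# A (updated)"},
    {"url": f"{base}/c", "markdown": "# C"},
]
changes = diff_snapshots(previous, current)
print(changes)
```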

Scrape a static site fast

Don't need JavaScript rendering? Switch to Cheerio mode for 3-5x faster crawling at lower cost.

{
  "startUrls": [{ "url": "https://static-site.com" }],
  "crawlerType": "cheerio",
  "maxConcurrency": 10,
  "maxCrawlPages": 1000
}

Crawl behind authentication

Pass session cookies and crawl pages that require login.

{
  "startUrls": [{ "url": "https://app.example.com/dashboard" }],
  "initialCookies": [
    { "name": "session", "value": "abc123", "domain": "app.example.com", "path": "/" }
  ],
  "maxCrawlDepth": 3
}
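If you copy the Cookie request header from your browser's developer tools, a small helper can turn it into the initialCookies shape shown above. This is a sketch using only the Python standard library; cookies_for_input is a hypothetical name, and the domain and path are supplied by you because the header does not carry them.

```python
from http.cookies import SimpleCookie


def cookies_for_input(cookie_header, domain, path="/"):
    """Convert a 'name=value; name2=value2' header into initialCookies entries."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    return [
        {"name": name, "value": morsel.value, "domain": domain, "path": path}
        for name, morsel in jar.items()
    ]


initial_cookies = cookies_for_input("session=abc123; theme=dark", "app.example.com")
print(initial_cookies)
```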

Why this crawler?

Built-in proxy with automatic fallback

Every request goes through a residential proxy. If it gets blocked, the crawler automatically switches to a backup proxy and retries. You don't configure anything — it just works.

Filtered pages don't burn your budget

The language filter, content-length filter, and deduplication all run before a page is counted against your page limit. If you set maxCrawlPages: 100 and 30 pages get filtered out, you still get 100 saved pages, provided the site has enough pages that pass the filters.
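The counting rule can be illustrated with a short simulation. This is not the crawler's actual code, only a sketch of the behaviour described above: select_pages is a hypothetical function, the languageCode and text fields follow the output example, and MD5 is used because the input reference names it for content deduplication.

```python
import hashlib


def select_pages(candidates, max_pages, languages=None, min_length=100):
    """Keep up to max_pages pages; filtered pages do not use up the budget."""
    saved, seen_hashes = [], set()
    for page in candidates:
        if len(saved) >= max_pages:
            break  # budget is spent only by pages that were actually saved
        if languages and page["languageCode"] not in languages:
            continue  # wrong language: skipped, not counted
        if len(page["text"]) < min_length:
            continue  # too short: skipped, not counted
        digest = hashlib.md5(page["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # duplicate content: skipped, not counted
        seen_hashes.add(digest)
        saved.append(page)
    return saved


candidates = [
    {"url": "/a", "languageCode": "en", "text": "x" * 200},
    {"url": "/b", "languageCode": "de", "text": "y" * 200},  # filtered by language
    {"url": "/c", "languageCode": "en", "text": "short"},    # filtered by length
    {"url": "/d", "languageCode": "en", "text": "x" * 200},  # duplicate of /a
    {"url": "/e", "languageCode": "en", "text": "z" * 200},
    {"url": "/f", "languageCode": "en", "text": "w" * 200},  # never reached
]
saved = select_pages(candidates, max_pages=2, languages=["en"])
print([p["url"] for p in saved])
```

Three of the first four candidates are filtered, yet the limit of two saved pages is still reached because only saved pages count.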

No silent failures

Some crawlers report "SUCCEEDED" even when the dataset is empty. This crawler tracks every failed URL with a reason (CAPTCHA, 403, timeout, proxy error) and stores the list in the key-value store. You always know what happened.
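Once you have downloaded the failure list from the key-value store, a quick tally by reason shows where a crawl went wrong. The record shape below (a list of objects with url and reason) is an assumption for illustration; check the actual key name and fields in your run's key-value store.

```python
from collections import Counter


def summarize_failures(failures):
    """Count failed URLs per failure reason."""
    return Counter(f["reason"] for f in failures)


# Hypothetical failure records; the real field names may differ.
failures = [
    {"url": "https://example.com/a", "reason": "403"},
    {"url": "https://example.com/b", "reason": "timeout"},
    {"url": "https://example.com/c", "reason": "403"},
]
counts = summarize_failures(failures)
for reason, count in counts.most_common():
    print(f"{reason}: {count}")
```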

Graceful timeout handling

Apify hard-kills actors after 1 hour. This crawler monitors the remaining time and stops gracefully 90 seconds before the limit — no partial records, no data loss.
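The general pattern behind a graceful stop is a deadline check between pages. The sketch below is an illustration of that pattern, not the crawler's implementation: the 90-second margin comes from the text above, while crawl_until_deadline, process, and the injectable clock are hypothetical names.

```python
import time

SAFETY_MARGIN_SECS = 90  # stop this long before the hard limit


def crawl_until_deadline(urls, process, hard_limit_secs, clock=time.monotonic):
    """Process URLs until the safety margin before the hard limit is reached."""
    deadline = clock() + hard_limit_secs - SAFETY_MARGIN_SECS
    done, pending = [], list(urls)
    while pending and clock() < deadline:
        url = pending.pop(0)
        # A page is either fully processed or left pending, never half-written.
        done.append(process(url))
    return done, pending


# Fake clock that advances 30 seconds per reading, so the example runs instantly.
ticks = iter(range(0, 1000, 30))
done, pending = crawl_until_deadline(
    ["a", "b", "c", "d", "e"],
    process=str.upper,
    hard_limit_secs=200,
    clock=lambda: next(ticks),
)
print(done, pending)
```

With a 200-second limit the deadline falls at 110 seconds, so three pages are processed and the remaining two are returned as pending instead of being cut off mid-record.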

Smart content extraction

Uses Mozilla Readability (the same engine behind Firefox Reader View) to extract article content. Automatically removes navigation, ads, sidebars, cookie banners, and other noise. Falls back to raw HTML extraction when Readability can't parse the page.


Output example

{
  "url": "https://example.com/blog/ai-trends",
  "metadata": {
    "title": "Top AI Trends for 2025",
    "author": "Jane Doe",
    "publishDate": "2025-01-15T10:00:00.000Z",
    "languageCode": "en",
    "contentType": "article",
    "wordCount": 1842,
    "tokenEstimate": 2456,
    "ogImage": "https://example.com/img/ai-trends.jpg",
    "jsonLd": [{ "@type": "Article", "..." : "..." }]
  },
  "markdown": "# Top AI Trends for 2025\n\nClean article content...",
  "text": "Top AI Trends for 2025. Clean article content...",
  "chunks": [
    {
      "chunkIndex": 0,
      "text": "# Top AI Trends...",
      "tokenEstimate": 461
    }
  ],
  "depth": 1,
  "httpStatusCode": 200
}

Free analytics with every run

The last record in your dataset is a crawl summary — total words, tokens, pages by language, pages by content type, pages by depth. Use it to estimate LLM costs or monitor content changes over time.
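As a rough way to turn token counts into a cost figure, you can also sum the per-page estimates yourself. The summary record's field names are not shown in this README, so the sketch below reads metadata.tokenEstimate from each page record instead; estimate_llm_cost and the price per million tokens are hypothetical, and you should substitute your model's real rate.

```python
def estimate_llm_cost(pages, usd_per_million_tokens):
    """Sum per-page token estimates and convert them to an approximate cost."""
    total_tokens = sum(
        p.get("metadata", {}).get("tokenEstimate", 0) for p in pages
    )
    return total_tokens, total_tokens / 1_000_000 * usd_per_million_tokens


pages = [
    {"url": "https://example.com/a", "metadata": {"tokenEstimate": 2456}},
    {"url": "https://example.com/b", "metadata": {"tokenEstimate": 7544}},
]
# 0.50 USD per million tokens is a made-up rate for illustration only.
total_tokens, cost = estimate_llm_cost(pages, usd_per_million_tokens=0.50)
print(f"{total_tokens} tokens, about ${cost:.4f}")
```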


Crawler engines

| Engine | Best for | Speed |
|---|---|---|
| Playwright Chrome (default) | JavaScript-heavy sites, SPAs, bot-protected pages | Standard |
| Playwright Firefox | Sites that block Chrome specifically | Standard |
| Cheerio | Static HTML sites, blogs, documentation | 3-5x faster |

Key features at a glance

| Feature | Details |
|---|---|
| Output format | Markdown + plain text + metadata JSON |
| RAG chunking | Semantic splits with configurable size and overlap |
| Metadata | OG tags, JSON-LD, author, dates, Twitter Card, hreflang |
| Token estimate | Per page and total across the crawl |
| Content type | Auto-classified: article, documentation, product, landing |
| Language filter | Filter by ISO 639-1 codes without wasting page budget |
| Deduplication | URL + canonical + optional content-hash (MD5) |
| Link extraction | Internal, external, and PDF links per page |
| Error tracking | Every failed URL logged with reason in KV store |
| Proxy | Built-in residential with automatic fallback |
| Timeout safety | Graceful stop 90s before Apify hard-kill |
| Cookie banners | Auto-dismissed before extraction |
| Authentication | Cookie injection for logged-in crawling |

Pricing

Pay per page crawled. No monthly fees. No hidden costs.

| What you pay for | Price |
|---|---|
| Page crawled | $0.02 per page |
| Apify platform usage | Standard compute costs |

Crawl 100 pages = $2. Crawl 1,000 pages = $20.


Input reference

| Field | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | One or more seed URLs |
| maxCrawlDepth | integer | 5 | Max link depth from seed (0 = seed only) |
| maxCrawlPages | integer | 100 | Max pages saved (filtered pages don't count) |
| crawlerType | select | playwright:chrome | Rendering engine |
| globs | string[] | | Only crawl matching URL patterns |
| excludeGlobs | string[] | | Skip matching URL patterns |
| useSitemaps | boolean | false | Auto-discover URLs from sitemap.xml |
| htmlTransformer | select | readability | Content extraction method |
| languageFilter | string[] | | Only save pages in these languages |
| contentMinLength | integer | 100 | Skip pages with fewer characters |
| deduplicateByContent | boolean | false | Skip duplicate content (MD5 hash) |
| chunkContent | boolean | false | Enable RAG chunking |
| chunkSize | integer | 2000 | Target chunk size in characters |
| chunkOverlap | integer | 200 | Overlap between chunks |
| extractMetadata | boolean | true | Extract rich metadata |
| extractLinks | boolean | false | Extract page links |
| saveMarkdown | boolean | true | Include Markdown in output |
| saveText | boolean | true | Include plain text in output |
| saveHtml | boolean | false | Save cleaned HTML to KV store |
| aggressivePrune | boolean | false | Remove sidebars, comments, widgets |
| dismissCookieBanners | boolean | true | Auto-click cookie consent dialogs |
| maxConcurrency | integer | 3 | Parallel requests |
| requestTimeoutSecs | integer | 60 | Hard timeout per page |

FAQ

Is this compatible with apify/website-content-crawler? Yes. Same output format (url, crawl, metadata, markdown, text). You can switch without changing your pipeline.

Can I crawl JavaScript-rendered pages? Yes. The default Playwright Chrome engine renders JavaScript, handles SPAs, and bypasses basic bot protection.

How do I crawl only specific sections of a site? Use globs to include patterns (e.g. https://example.com/blog/**) and excludeGlobs to exclude patterns (e.g. **/tag/**).
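To preview which URLs a pair of include and exclude patterns would let through, you can approximate the logic locally. This sketch uses Python's fnmatch, whose wildcard rules are an assumption here and may differ from the crawler's own glob matching (in fnmatch, a single * already matches across slashes); should_crawl is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def should_crawl(url, globs=(), exclude_globs=()):
    """Exclusions win; otherwise the URL must match an include glob, if any are set."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    return not globs or any(fnmatchcase(url, pattern) for pattern in globs)


globs = ["https://blog.competitor.com/posts/**"]
exclude_globs = ["**/tag/**", "**/author/**"]

for url in [
    "https://blog.competitor.com/posts/2025/launch",
    "https://blog.competitor.com/posts/tag/ai",
    "https://blog.competitor.com/about",
]:
    print(url, should_crawl(url, globs, exclude_globs))
```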

What happens if a page is blocked? The crawler detects CAPTCHA and bot-wall pages, retries with a fresh session, and logs the failure. Blocked pages don't count against your page limit.

Can I use this for multiple languages? Yes. Set languageFilter to ["en", "de", "fr"] to keep only those languages. Pages in other languages are skipped but don't waste your budget.

How does chunking work? Content is split at semantic boundaries (headings, paragraph breaks, code blocks). Each chunk includes position data and a token estimate. chunkSize is measured in characters, not tokens, so tune chunkSize and chunkOverlap until the per-chunk token estimates fit within your embedding model's input limit.
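To get a feel for how chunkSize and chunkOverlap interact, here is a simplified sketch of boundary-aware chunking. It is not the crawler's algorithm: it splits only on blank lines between paragraphs, treats chunk_size as a target rather than a hard cap, and estimates tokens as characters divided by 4, which is a common rule of thumb and an assumption here.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Greedily pack paragraphs into chunks of roughly chunk_size characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            # Carry the tail of the finished chunk into the next one.
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {"chunkIndex": i, "text": c, "tokenEstimate": max(1, len(c) // 4)}
        for i, c in enumerate(chunks)
    ]


text = "\n\n".join(["a" * 50, "b" * 50, "c" * 50])
chunks = chunk_text(text, chunk_size=110, overlap=10)
for chunk in chunks:
    print(chunk["chunkIndex"], len(chunk["text"]), chunk["tokenEstimate"])
```

A paragraph longer than chunk_size is kept whole in this sketch, which is one reason to check the token estimates rather than rely on the character target alone.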