Website Content Crawler
A powerful web crawler that extracts text content from websites, optimized for AI models, Large Language Models (LLMs), vector databases, and Retrieval-Augmented Generation (RAG) pipelines.
Features
- Multiple Crawling Engines: Choose Playwright (Chrome/Firefox), Cheerio (a fast HTTP client), or JSDOM, depending on your needs
- Markdown Output: Automatically converts HTML content to clean Markdown format
- Smart Content Extraction: Removes unwanted elements like cookie banners, navigation, ads, and more
- Customizable Selectors: Keep or remove specific elements using CSS selectors
- Deep Crawling: Recursively crawl websites with configurable depth limits
- AI-Ready Output: Structured data perfect for feeding into AI models and vector databases
- Proxy Support: Built-in proxy configuration for reliable crawling
- Screenshot Capture: Optional screenshot capture for visual documentation (Playwright only)
- File Downloads: Download and save linked files like PDFs and documents
Use Cases
- Knowledge Base Extraction: Crawl documentation sites and help centers
- Content Aggregation: Collect articles, blog posts, and web content at scale
- AI Training Data: Extract clean text for training or fine-tuning language models
- RAG Pipelines: Feed content into retrieval-augmented generation systems
- Vector Database Population: Prepare text content for embedding and semantic search (see the sketch after this list)
- Website Migration: Extract content from existing websites for migration
- Competitive Analysis: Monitor and analyze competitor content
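For the RAG and vector-database use cases, the crawler's output usually needs to be split into smaller passages before embedding. The following minimal Python sketch chunks the `markdown` field of dataset items (shaped as described under Output Format below); the chunk size, overlap, and record shape are illustrative assumptions, not part of the Actor's output.

```python
# Minimal sketch: split crawled Markdown into overlapping chunks for embedding.
# The chunking strategy and record shape are assumptions for illustration.
from typing import Iterable


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> Iterable[str]:
    """Yield overlapping character-based chunks of the extracted text."""
    step = size - overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + size]
        if chunk.strip():
            yield chunk


def prepare_for_embedding(items: list[dict]) -> list[dict]:
    """Turn crawler dataset items into chunk records for a vector database."""
    records = []
    for item in items:
        source = item.get("markdown") or item.get("text", "")
        for i, chunk in enumerate(chunk_text(source)):
            records.append({
                "id": f'{item["url"]}#{i}',  # source URL plus chunk index
                "text": chunk,
                "metadata": {"url": item["url"], "title": item.get("title")},
            })
    return records
```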
Input Parameters
Required
- Start URLs (`startUrls`): Array of URLs where the crawler will begin. The crawler will only process pages under these URLs.
Crawler Configuration
- Crawler Type (`crawlerType`): Select the crawling engine
  - `cheerio` (default): Fast HTTP client, best for static websites
  - `playwright:chrome`: Chrome browser with full JavaScript support
  - `playwright:firefox`: Firefox browser, useful for sites with anti-bot measures
  - `jsdom`: Experimental JavaScript-capable crawler
- Max Crawling Depth (`maxCrawlDepth`): Maximum link depth from the start URLs (default: 1)
  - 0 = Only crawl the start URLs
  - 1 = Crawl the start URLs and pages directly linked from them
  - 2+ = Continue crawling to the specified depth
- Max Pages (`maxCrawlPages`): Maximum number of pages to crawl (default: 100)
- Max Requests Per Minute (`maxRequestsPerMinute`): Rate limiting (default: 0 = unlimited)
Content Extraction
- Readable Text Char Threshold (`readableTextCharThreshold`): Minimum number of characters required to save a page (default: 100)
- Remove Cookie Warnings (`removeCookieWarnings`): Automatically remove cookie consent dialogs (default: true)
- Click Elements CSS Selector (`clickElementsCssSelector`): CSS selector for elements to click before extraction (e.g., "Show more" buttons)
- HTML Transformer (`htmlTransformer`): How to process the HTML
  - `readableText` (default): Remove scripts, styles, and navigation
  - `none`: Keep the original HTML
- Remove Elements CSS Selector (`removeElementsCssSelector`): CSS selector for elements to remove (e.g., `nav, footer, .ads`)
- Keep Elements CSS Selector (`keepElementsCssSelector`): CSS selector for elements to keep (everything else is removed)
Output Options
- Save Markdown (`saveMarkdown`): Convert content to Markdown format (default: true)
- Save HTML (`saveHtml`): Save raw HTML to the key-value store (default: false; see the sketch after this list)
- Save Screenshots (`saveScreenshots`): Capture page screenshots (Playwright only, default: false)
- Save Files (`saveFiles`): Download linked files such as PDFs and documents (default: false)
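Raw HTML (and, presumably, screenshots and downloaded files) ends up in the run's key-value store rather than the dataset. The sketch below uses the Apify Python client to list and fetch records after a run; the record key names depend on the Actor's implementation and are not documented here, so the listing is intentionally generic, and the token and store ID are placeholders.

```python
# Sketch: list and download key-value store records produced by a run.
# Requires the Apify Python client (pip install apify-client).
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token
store = client.key_value_store("<RUN_KEY_VALUE_STORE_ID>")  # placeholder store ID

# Key names are Actor-specific, so we simply iterate over everything.
for key_info in store.list_keys()["items"]:
    record = store.get_record(key_info["key"])
    if record is not None:
        print(key_info["key"], type(record["value"]))
```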
Advanced Options
- Max Scroll Height (`maxScrollHeightPixels`): How far to scroll down pages with infinite scroll, in pixels (default: 0 = disabled)
- Proxy Configuration (`proxyConfiguration`): Proxy settings for the crawler
- Max Request Retries (`maxRequestRetries`): Number of retry attempts for failed requests (default: 3)
- Debug Mode (`debugMode`): Enable detailed logging (default: false)
Output Format
Each crawled page produces a dataset item with the following structure:
{"url": "https://example.com/page","title": "Page Title","description": "Page meta description","canonicalUrl": "https://example.com/page","text": "Extracted plain text content...","markdown": "# Page Title\n\nExtracted content in Markdown...","crawl": {"loadedUrl": "https://example.com/page","depth": 1,"httpStatusCode": 200,"loadedAt": "2024-01-01T12:00:00.000Z"}}
Example Usage
Basic Crawl
{"startUrls": [{ "url": "https://example.com/docs" }],"crawlerType": "cheerio","maxCrawlDepth": 2,"maxCrawlPages": 50}
Advanced Configuration
{"startUrls": [{ "url": "https://example.com/blog" }],"crawlerType": "playwright:chrome","maxCrawlDepth": 3,"maxCrawlPages": 200,"removeElementsCssSelector": "nav, footer, .sidebar, .comments","removeCookieWarnings": true,"saveMarkdown": true,"saveScreenshots": false,"maxRequestRetries": 5,"proxyConfiguration": {"useApifyProxy": true}}
Extract Specific Content
{"startUrls": [{ "url": "https://example.com" }],"keepElementsCssSelector": "article, .content, main","htmlTransformer": "readableText","readableTextCharThreshold": 500,"saveMarkdown": true}
How It Works
The crawler starts from your specified URLs and:
- Fetches and processes each page using your selected crawling engine
- Extracts and cleans the content by removing unwanted elements
- Converts the content to your preferred format (Markdown, plain text, or HTML)
- Follows links to discover and crawl additional pages (up to your depth limit)
- Saves all extracted data to the dataset for easy access
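Conceptually, this is a depth-limited breadth-first crawl. The sketch below illustrates the idea using only the Python standard library; it is not the Actor's actual implementation, and the naive link extraction and URL filtering are simplifying assumptions.

```python
# Conceptual sketch of a depth-limited crawl loop (not the Actor's real code).
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen


def crawl(start_urls, max_depth=1, max_pages=100):
    queue = deque((url, 0) for url in start_urls)
    visited, pages = set(), []

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # a real crawler would retry (see maxRequestRetries)

        pages.append({"url": url, "depth": depth, "html": html})

        if depth < max_depth:
            # Naive link discovery; the Actor also cleans the content, converts
            # it to Markdown, and only follows pages under the start URLs.
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)
                if link.startswith(tuple(start_urls)) and link not in visited:
                    queue.append((link, depth + 1))
    return pages
```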