
AI Web Scraper - Powered by Crawl4AI
A blazing‑fast, AI‑ready web scraper built on top of the open‑source Crawl4AI library. Perfect for feeding data to LLMs, AI agents, or model‑training pipelines. Supports BFS/DFS/BestFirst deep crawls, multiple extraction strategies (including JSON and LLM), and flexible output (markdown or JSON).
AI Web Scraper
Do you need reliable data for your AI agents, LLM pipelines, or training workflows? The AI Web Scraper Actor is your key to fast, flexible, and AI‑friendly web extraction on Apify. Under the hood, it relies on the open‑source Crawl4AI engine to handle anything from simple single‑page scrapes to deep multi‑link traversals (BFS/DFS/BestFirst). Whether you want clean markdown, JSON extraction, or LLM summarization, just specify your desired strategy via the Actor’s input UI, and you’re set.
Below is an overview of each setting you’ll see in the Apify interface and how it affects your crawls.
Quick How-To
- Start with URLs: At a minimum, provide startUrls in the UI (or JSON input)—the pages you want to scrape.
- Pick a Crawler & Extraction Style: Choose between various crawl strategies (e.g., BFS or DFS) and extraction methods (simple markdown, LLM-based, JSON CSS, etc.). You can also enable content filtering or deeper link exploration.
- Review the Output: Once the Actor finishes, your results will appear in the Apify Dataset—structured JSON, markdown, or whichever format you chose.
Input Fields Explained
These fields appear in the Actor’s input UI. Customize them to match your use case.
1. startUrls (Required)
List of pages to scrape. For each entry, just provide a "url".
Example:
```json
{
  "startUrls": [
    { "url": "https://example.com" }
  ]
}
```
2. browserConfig (Optional)
Configure Playwright’s browser behavior—headless mode, custom user agent, viewport size, etc.
- browser_type: "chromium", "firefox", or "webkit"
- headless: Boolean to run in headless mode
- verbose_logging: Extra debug logs
- ignore_https_errors: Accept invalid certs
- user_agent: E.g. "random" or a custom string
- proxy: Proxy server URL
- viewport_width / viewport_height: Window size
- accept_downloads: Whether downloads are allowed
- extra_headers: Additional request headers
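For example, a browserConfig that runs headless Chromium with a random user agent behind a proxy might look like this (the proxy URL and viewport values are placeholders, not defaults):
```json
{
  "browserConfig": {
    "browser_type": "chromium",
    "headless": true,
    "user_agent": "random",
    "proxy": "http://proxy.example.com:8000",
    "viewport_width": 1280,
    "viewport_height": 800,
    "ignore_https_errors": true
  }
}
```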
3. crawlerConfig (Optional)
Core crawling settings—time limits, caching, JavaScript hooks, or multi‑page concurrency.
- cache_mode: "BYPASS" (no cache), "ENABLED", etc.
- page_timeout: Milliseconds to wait for page loads
- simulate_user: Stealth by mimicking user actions
- remove_overlay_elements: Attempt to remove popups
- delay_before_return_html: Extra wait before final extraction
- wait_for: Wait time or wait condition
- screenshot / pdf: Capture screenshot or PDF
- enable_rate_limiting: Rate limit large URL lists
- semaphore_count: Concurrency limit
- memory_threshold_percent: Pause if memory is too high
- word_count_threshold: Discard short text blocks
- css_selector, excluded_tags, excluded_selector: Further refine or skip sections of the DOM
- only_text: Keep plain text only
- prettify: Attempt to clean up HTML
- keep_data_attributes: Keep or drop data-* attributes
- remove_forms: Strip <form> elements
- bypass_cache / disable_cache / no_cache_read / no_cache_write: Fine‑grained caching controls
- wait_until: E.g. "domcontentloaded" or "networkidle"
- wait_for_images: Wait for images to fully load
- check_robots_txt: Respect robots.txt?
- mean_delay, max_range: Introduce a random delay range
- js_code: Custom JS to run on each page
- js_only: Reuse the same page context without re-navigation
- ignore_body_visibility: Include hidden elements
- scan_full_page: Scroll from top to bottom for lazy loading
- scroll_delay: Delay between scroll steps
- process_iframes: Also parse iframes
- override_navigator: Additional stealth tweak
- magic: Enable multiple advanced anti-bot tricks
- adjust_viewport_to_content: Resize viewport to fit content
- screenshot_wait_for: Wait time before taking a screenshot
- screenshot_height_threshold: Max doc height to screenshot
- image_description_min_word_threshold: Filter out images with minimal alt text
- image_score_threshold: Remove lower‑score images
- exclude_external_images: No external images
- exclude_social_media_domains, exclude_domains: Avoid these domains entirely
- exclude_external_links, exclude_social_media_links: Strip external or social media links
- verbose: Extra logs
- log_console: Show browser console logs?
- stream: Stream results as they come in, or wait until done
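As a sketch, a crawlerConfig that bypasses the cache, waits for the network to go idle, scrolls for lazy-loaded content, and drops navigation/footer markup could look like this (timeouts and thresholds are illustrative and worth tuning per site):
```json
{
  "crawlerConfig": {
    "cache_mode": "BYPASS",
    "page_timeout": 60000,
    "wait_until": "networkidle",
    "scan_full_page": true,
    "remove_overlay_elements": true,
    "excluded_tags": ["nav", "footer"],
    "word_count_threshold": 10
  }
}
```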
4. deepCrawlConfig (Optional)
When you select BFS, DFS, or BestFirst crawling, this config guides link exploration.
- max_pages: Stop after crawling this many pages
- max_depth: Depth of link‑following
- include_external: Follow off‑domain links?
- score_threshold: Filter out low‑score links (BestFirst)
- filter_chain: Extra link filter rules
- url_scorer: If you want a custom approach to scoring discovered URLs
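For instance, to cap a deep crawl at two levels and 50 pages while staying on-domain, you might set (pair this with one of the deep-crawl strategies described below; the numbers are illustrative):
```json
{
  "deepCrawlConfig": {
    "max_depth": 2,
    "max_pages": 50,
    "include_external": false
  }
}
```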
5. markdownConfig (Optional)
For HTML→Markdown conversions.
- ignore_links: Skip anchor links
- ignore_images: Omit markdown images
- escape_html: Escape raw HTML (e.g. turn <div> into &lt;div&gt;)
- skip_internal_links: Remove same‑page anchors
- include_sup_sub: Preserve <sup>/<sub> text
- citations: Put footnotes at bottom of file
- body_width: Wrap lines at N chars
- fit_markdown: Use advanced "fit" mode if also using a filter
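A markdownConfig that produces link-free, image-free markdown without line wrapping might look like this (values are illustrative):
```json
{
  "markdownConfig": {
    "ignore_links": true,
    "ignore_images": true,
    "skip_internal_links": true,
    "body_width": 0
  }
}
```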
6. contentFilterConfig (Optional)
Prune out nav bars, sidebars, or extra text using “pruning,” “bm25,” or a second LLM filter.
- type: e.g. "pruning", "bm25"
- threshold: Score cutoff
- min_word_threshold: Minimum words to keep
- bm25_threshold: BM25 filter param
- apply_llm_filter: If true, do a second pass with an LLM
- semantic_filter: Keep only text about a certain topic
- word_count_threshold: Another word threshold
- sim_threshold, max_dist, top_k, linkage_method: For advanced clustering
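For example, a pruning filter that discards low-scoring, very short blocks could be configured like this (the threshold is a placeholder to tune against your pages):
```json
{
  "contentFilterConfig": {
    "type": "pruning",
    "threshold": 0.45,
    "min_word_threshold": 5
  }
}
```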
7. userAgentConfig (Optional)
Rotate or fix your user agent.
- user_agent_mode: "random" or "fixed"
- device_type: "desktop" or "mobile"
- browser_type: e.g. "chrome"
- num_browsers: If rotating among multiple agents
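A simple rotation setup might look like this (illustrative values):
```json
{
  "userAgentConfig": {
    "user_agent_mode": "random",
    "device_type": "desktop",
    "browser_type": "chrome"
  }
}
```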
8. llmConfig (Optional)
For LLM-based extraction or filtering.
- provider: e.g. "openai/gpt-4", "groq/deepseek-r1-distill-llama-70b"
- api_token: Model's API key
- instruction: Prompt the LLM about how to parse or summarize
- base_url: For custom endpoints
- chunk_token_threshold: Big pages → chunk them
- apply_chunking: Boolean for chunking
- input_format: "markdown" or "html"
- temperature, max_tokens: Standard LLM config
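For example, pairing LLMExtractionStrategy with an llmConfig for summarization might look like this (the instruction text and token limits are placeholders; supply your own API key):
```json
{
  "extractionStrategy": "LLMExtractionStrategy",
  "llmConfig": {
    "provider": "openai/gpt-4",
    "api_token": "YOUR_API_KEY",
    "instruction": "Summarize each page in three bullet points",
    "input_format": "markdown",
    "temperature": 0.0,
    "max_tokens": 1000
  }
}
```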
9. session_id (Optional)
Provide a session ID to reuse browser context across multiple runs (logins, multi-step flows, etc.).
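For example (the ID is an arbitrary label you choose):
```json
{
  "session_id": "login-flow-1"
}
```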
10. extractionStrategy (Optional)
Pick one:
- SimpleExtractionStrategy: Simple HTML→Markdown.
- LLMExtractionStrategy: Let an LLM parse or summarize.
- CosineStrategy: Group similar text blocks.
- JsonCssExtractionStrategy / JsonXPathExtractionStrategy: Provide a schema to produce structured JSON.
11. crawlStrategy (Optional)
- SimpleCrawlStrategy: Just the given start URLs
- BFSDeepCrawlStrategy: Breadth-first approach
- DFSDeepCrawlStrategy: Depth-first approach
- BestFirstCrawlingStrategy: Score links, pick the best first
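As a sketch, a BestFirst crawl that only follows higher-scoring links would combine the strategy with deepCrawlConfig, for example (the threshold is a placeholder):
```json
{
  "crawlStrategy": "BestFirstCrawlingStrategy",
  "deepCrawlConfig": {
    "max_depth": 3,
    "max_pages": 100,
    "score_threshold": 0.4
  }
}
```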
12. extractionSchema (Optional)
If using CSS/XPath extraction.
- name: Your extraction scheme name
- baseSelector: Parent selector for repeated elements
- fields: Each with name, selector, type, and an optional attribute
Example:
```json
"extractionSchema": {
  "name": "Custom Extraction",
  "baseSelector": "div.article",
  "fields": [
    { "name": "title", "selector": "h1", "type": "text" },
    { "name": "link", "selector": "a", "type": "attribute", "attribute": "href" }
  ]
}
```
Usage Examples
Minimal
1{ 2 "startUrls": [ 3 { "url": "https://example.com" } 4 ] 5}
Scrapes a single page in headless mode with standard markdown output.
JSON CSS Extraction
1{ 2 "startUrls": [ 3 { "url": "https://news.ycombinator.com/" } 4 ], 5 "extractionStrategy": "JsonCssExtractionStrategy", 6 "extractionSchema": { 7 "name": "HackerNews", 8 "baseSelector": "tr.athing", 9 "fields": [ 10 { 11 "name": "title", 12 "selector": ".titleline a", 13 "type": "text" 14 }, 15 { 16 "name": "link", 17 "selector": ".titleline a", 18 "type": "attribute", 19 "attribute": "href" 20 } 21 ] 22 } 23}
Generates a JSON array, each object containing “title” and “link.”
Pro Tips
- Deep crawling: If you want BFS or DFS, set crawlStrategy to "BFSDeepCrawlStrategy" or "DFSDeepCrawlStrategy" and configure deepCrawlConfig.
- Content filtering: Combine contentFilterConfig with extractionStrategy for maximum clarity and minimal noise.
- LLM-based: Choose "LLMExtractionStrategy" plus llmConfig for advanced summarization or structured data. Great for building AI pipelines.
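Putting those tips together, a deep crawl that prunes boilerplate and hands the result to an LLM might look roughly like this (the URL, thresholds, instruction text, and API key are placeholders):
```json
{
  "startUrls": [
    { "url": "https://example.com/blog" }
  ],
  "crawlStrategy": "BFSDeepCrawlStrategy",
  "deepCrawlConfig": {
    "max_depth": 2,
    "max_pages": 30
  },
  "contentFilterConfig": {
    "type": "pruning",
    "threshold": 0.5
  },
  "extractionStrategy": "LLMExtractionStrategy",
  "llmConfig": {
    "provider": "openai/gpt-4",
    "api_token": "YOUR_API_KEY",
    "instruction": "Extract the main article content and summarize it"
  }
}
```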
Thanks for trying out the AI Web Scraper—enjoy harnessing rich, clean data for your Apify-based AI solutions!