Pricing

$8.00 / 1,000 results

AI Web Scraper - Powered by Crawl4AI

Developed by

Raizen Technology

A blazing-fast AI web scraper powered by Crawl4AI. Perfect for LLMs, AI agents, AI automation, model training, sentiment analysis, and content generation. Supports deep crawling, multiple extraction strategies and flexible output (Markdown/JSON). Seamlessly integrates with Make.com, n8n, and Zapier.

1.0 (1)

Last modified

4 months ago

Agents

Automation

You can access the AI Web Scraper - Powered by Crawl4AI programmatically from your own applications through the Apify API, using Python, JavaScript, the CLI, an OpenAPI definition, plain HTTP, or MCP. To use the Apify API, you need an Apify account and your API token, found under Integrations in Apify Console. The example below uses plain HTTP with curl.

# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare Actor input
cat > input.json << 'EOF'
{
  "startUrls": [
    {
      "url": "https://www.cnbc.com/2025/03/12/googles-deepmind-says-it-will-use-ai-models-to-power-physical-robots.html"
    }
  ],
  "browserConfig": {
    "browser_type": "chromium",
    "headless": true,
    "verbose_logging": false,
    "ignore_https_errors": true,
    "user_agent": "random",
    "proxy": "",
    "viewport_width": 1280,
    "viewport_height": 720,
    "accept_downloads": false,
    "extra_headers": {}
  },
  "crawlerConfig": {
    "cache_mode": "BYPASS",
    "page_timeout": 20000,
    "simulate_user": true,
    "override_navigator": true,
    "magic": true,
    "remove_overlay_elements": true,
    "delay_before_return_html": 0.75,
    "wait_for": "",
    "screenshot": false,
    "pdf": false,
    "enable_rate_limiting": false,
    "memory_threshold_percent": 90,
    "word_count_threshold": 200,
    "css_selector": "",
    "excluded_tags": [],
    "excluded_selector": "",
    "only_text": false,
    "prettify": false,
    "keep_data_attributes": false,
    "remove_forms": false,
    "bypass_cache": false,
    "disable_cache": false,
    "no_cache_read": false,
    "no_cache_write": false,
    "wait_until": "domcontentloaded",
    "wait_for_images": false,
    "check_robots_txt": false,
    "mean_delay": 0.1,
    "max_range": 0.3,
    "js_code": "",
    "js_only": false,
    "ignore_body_visibility": true,
    "scan_full_page": false,
    "scroll_delay": 0.2,
    "process_iframes": false,
    "adjust_viewport_to_content": false,
    "screenshot_wait_for": 0,
    "screenshot_height_threshold": 20000,
    "image_description_min_word_threshold": 50,
    "image_score_threshold": 3,
    "exclude_external_images": false,
    "exclude_social_media_domains": [],
    "exclude_external_links": false,
    "exclude_social_media_links": false,
    "exclude_domains": [],
    "verbose": true,
    "log_console": false,
    "stream": false
  },
  "deepCrawlConfig": {
    "max_pages": 100,
    "max_depth": 3,
    "include_external": false,
    "score_threshold": 0.5,
    "filter_chain": [],
    "keywords": [
      "crawl",
      "example",
      "async",
      "configuration"
    ],
    "weight": 0.7
  },
  "markdownConfig": {
    "ignore_links": false,
    "ignore_images": false,
    "escape_html": true,
    "skip_internal_links": false,
    "include_sup_sub": false,
    "citations": false,
    "body_width": 80,
    "fit_markdown": false
  },
  "contentFilterConfig": {
    "type": "pruning",
    "user_query": "",
    "threshold": 0.45,
    "min_word_threshold": 5,
    "bm25_threshold": 1.2,
    "apply_llm_filter": false,
    "semantic_filter": "",
    "word_count_threshold": 10,
    "sim_threshold": 0.3,
    "max_dist": 0.2,
    "top_k": 3,
    "linkage_method": "ward"
  },
  "userAgentConfig": {
    "user_agent_mode": "random",
    "device_type": "desktop",
    "browser_type": "chrome",
    "num_browsers": 1
  },
  "llmConfig": {
    "provider": "groq/deepseek-r1-distill-llama-70b",
    "api_token": "",
    "instruction": "Summarize content in clean markdown.",
    "base_url": "",
    "chunk_token_threshold": 2048,
    "apply_chunking": true,
    "input_format": "markdown",
    "temperature": 0.7,
    "max_tokens": 4096
  },
  "extractionSchema": {
    "name": "Custom Extraction",
    "baseSelector": "div.article",
    "fields": [
      {
        "name": "title",
        "selector": "h1",
        "type": "text"
      },
      {
        "name": "link",
        "selector": "a",
        "type": "attribute",
        "attribute": "href"
      }
    ]
  }
}
EOF

# Run the Actor using an HTTP API
# See the full API reference at https://docs.apify.com/api/v2
curl "https://api.apify.com/v2/acts/raizen~ai-web-scraper/runs?token=$API_TOKEN" \
  -X POST \
  -d @input.json \
  -H 'Content-Type: application/json'
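
For readers calling the API from Python, the same request can be sketched with only the standard library. The helper names `build_run_request` and `start_run` are illustrative, not part of any Apify client, and the trimmed input assumes the Actor applies defaults for omitted fields, which this page does not confirm.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "raizen~ai-web-scraper"  # ID as it appears in the endpoint URLs on this page


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an Actor run."""
    url = f"{API_BASE}/acts/{ACTOR_ID}/runs?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def start_run(token: str, actor_input: dict) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_run_request(token, actor_input)) as resp:
        return json.load(resp)


# Assumption: a minimal input with only startUrls is accepted.
# The URL is a placeholder, not a recommendation.
example_input = {"startUrls": [{"url": "https://example.com/"}]}
```

Calling `start_run("<YOUR_API_TOKEN>", example_input)` would perform the network request; `build_run_request` alone can be inspected offline.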

AI Web Scraper - Crawl4AI for LLMs, AI Agents & Automation API

Below, you can find a list of relevant HTTP API endpoints for calling the AI Web Scraper - Powered by Crawl4AI Actor. For this, you’ll need an Apify account. Replace <YOUR_API_TOKEN> in the URLs with your Apify API token, which you can find under Integrations in Apify Console. For details, see the API reference.

Run Actor

POST

https://api.apify.com/v2/acts/raizen~ai-web-scraper/runs?token=<YOUR_API_TOKEN>

Note: By adding the method=POST query parameter, this API endpoint can be called using a GET request and thus used in third-party webhooks. Please refer to our Run Actor API documentation.
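
As a rough illustration of the note above, a URL suitable for a GET-only webhook could be assembled as follows. `build_webhook_run_url` is a hypothetical helper that only constructs the URL and sends nothing.

```python
from urllib.parse import urlencode

RUN_ENDPOINT = "https://api.apify.com/v2/acts/raizen~ai-web-scraper/runs"


def build_webhook_run_url(token: str) -> str:
    """Return a Run Actor URL that can be triggered with a plain GET request."""
    # method=POST tells the API to treat the GET call as a POST.
    query = urlencode({"token": token, "method": "POST"})
    return f"{RUN_ENDPOINT}?{query}"
```

Because the token is embedded in the URL, a link built this way should be treated as a secret.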

Run Actor synchronously and get dataset items

POST

https://api.apify.com/v2/acts/raizen~ai-web-scraper/run-sync-get-dataset-items?token=<YOUR_API_TOKEN>

Note: This endpoint supports both POST and GET request methods. However, only the POST method allows you to pass input data. For more information, please refer to our Run Actor synchronously and get dataset items API documentation.
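
The POST/GET distinction described above might look like this in Python. This is a sketch under assumptions: the helper names are made up for illustration, and the response is assumed to be a JSON array of dataset items, which should be checked against the API reference.

```python
import json
import urllib.request
from urllib.parse import urlencode

SYNC_ENDPOINT = (
    "https://api.apify.com/v2/acts/raizen~ai-web-scraper/run-sync-get-dataset-items"
)


def build_sync_request(token: str, actor_input=None) -> urllib.request.Request:
    """Build a POST request when input is given, otherwise a GET without input."""
    url = f"{SYNC_ENDPOINT}?{urlencode({'token': token})}"
    if actor_input is None:
        # GET is supported, but it cannot carry input data.
        return urllib.request.Request(url, method="GET")
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def run_and_fetch_items(token: str, actor_input: dict):
    """Run the Actor and return the parsed response body.

    Assumption: the body is JSON (a list of dataset items).
    """
    with urllib.request.urlopen(build_sync_request(token, actor_input)) as resp:
        return json.load(resp)
```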

Get Actor

GET

https://api.apify.com/v2/acts/raizen~ai-web-scraper?token=<YOUR_API_TOKEN>

For more information, please refer to our Get Actor API documentation.

Actors can be used to scrape web pages, extract data, or automate browser tasks. Use the AI Web Scraper - Powered by Crawl4AI API programmatically via the Apify API.

You can choose from:

AI Web Scraper - Powered by Crawl4AI API in Python

AI Web Scraper - Powered by Crawl4AI API in JavaScript

AI Web Scraper - Powered by Crawl4AI API through CLI

AI Web Scraper - Powered by Crawl4AI OpenAPI definition

You can start AI Web Scraper - Powered by Crawl4AI with the Apify API by sending an HTTP POST request to the Run Actor endpoint. The Actor's input and its content type are passed as the payload of the POST request, and additional options can be specified using URL query parameters. Within the API, the AI Web Scraper - Powered by Crawl4AI is identified by its ID, which combines the creator's username and the name of the Actor.
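
A small sketch of how the ID and query parameters fit together. The tilde separator is taken from the endpoint URLs on this page; the helper names are hypothetical, and the `timeout` option used in the usage note is an assumption to verify against the Run Actor API documentation.

```python
from urllib.parse import urlencode


def actor_id(username: str, actor_name: str) -> str:
    """Join the creator's username and the Actor name with a tilde."""
    return f"{username}~{actor_name}"


def run_url(username: str, actor_name: str, token: str, **options) -> str:
    """Build the Run Actor URL, appending extra options as query parameters."""
    query = urlencode({"token": token, **options})
    return f"https://api.apify.com/v2/acts/{actor_id(username, actor_name)}/runs?{query}"
```

For example, `run_url("raizen", "ai-web-scraper", "<YOUR_API_TOKEN>", timeout=60)` would add a `timeout=60` query parameter, assuming the endpoint accepts that option.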

When the AI Web Scraper - Powered by Crawl4AI run finishes, you can list the data from its default dataset (storage) via the API, or preview the data directly in Apify Console.
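
Listing the dataset items could be sketched as below. Neither the dataset endpoint path nor the `defaultDatasetId` field is shown on this page; both are assumptions based on the general Apify API v2 layout and should be confirmed in the API reference. `dataset_items_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def dataset_items_url(run_response: dict, token: str) -> str:
    """Derive the dataset items URL from a Run Actor response.

    Assumptions: the response wraps the run object in "data", the run
    exposes "defaultDatasetId", and items live under /datasets/{id}/items.
    """
    dataset_id = run_response["data"]["defaultDatasetId"]
    query = urlencode({"token": token, "format": "json"})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


# Illustrative response shape only; real responses contain many more fields.
sample_run = {"data": {"id": "run123", "defaultDatasetId": "ds456"}}
```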

Universal AI GPT Scraper

louisdeconinck/ai-gpt-scraper

Transform any website into structured data with AI-powered extraction. This versatile tool combines advanced web scraping with intelligent content analysis to deliver clean, customized JSON output - perfect for automating data collection from any web source.

Louis Deconinck

105

5.0

Crawl4AI

janbuchar/crawl4ai

Wraps the Crawl4AI open-source library for retrieving text content from websites.

Jan Buchar

563

5.0

RAG Web Browser

apify/rag-web-browser

Web browser for OpenAI Assistants, RAG pipelines, or AI agents, similar to a web browser in ChatGPT. It queries Google Search, scrapes the top N pages, and returns their content as Markdown for further processing by an LLM. It can also scrape individual URLs. Supports Model Context Protocol (MCP).

Apify

5.4K

4.4

Smartcontext AI Web Crawler

bluelightco/smartcontext-ai-crawler

Scrape any website and extract structured data using AI-powered instructions. Provide URLs and a natural language prompt to get tailored JSON outputs.

Bluelight

5.0

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

EasyApi

106

5.0

Smart Scrape AI

llayaa112/smart-scrape-ai

Smart Scrape AI is an autonomous web automation and scraping actor powered by Playwright and AI. It dynamically interprets prompts, navigates websites, performs tasks, extracts data, and provides intelligent answers. Ideal for zero-code, prompt-driven data extraction and interaction workflows.

laya albshlawy

AI-Powered Web Content & Link Extractor

scrapercoder/ai-powered-web-content-link-extractor

Crawls websites to extract clean, structured content for AI/LLM use, ideal for training datasets, knowledge bases, and RAG systems. JSON output includes text (normalized page content) and links (extracted sub-URLs).

wallnut.ai

114

AI Website Content Markdown Scraper

quaking_pail/ai-website-content-markdown-scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

AI_Builder

674

4.6

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

122

3.8

Ai Web Scraper - Extract Data With Ease

eloquent_mountain/ai-web-scraper-extract-data-with-ease

Ai Web Scraper enables scraping for everyone, including non-techies! It uses Google's Gemini LLM to scrape websites with natural language commands. It dynamically extracts data, no selector input needed, handles dynamic content and cookie consent, avoids bot detection, outputs JSON or other formats.