Pricing

from $20.00 / 1,000 page extractions

AI Web Scraper

AI-first web scraper that extracts structured data from any website using natural-language prompts. No programming knowledge required. No hard-coded logic that breaks when a website changes.

Rating: 4.3 (12)
Developer: Apify

Actor stats

Bookmarked: 151
Total users: 7.9K
Monthly active users: 283
Issues response: 2.5 days
Last modified: 5 days ago

What is AI Web Scraper?

This Actor combines web scraping with large language model (LLM) technologies.

The Actor has 3 operating modes:

Single page: visits every URL in the Start URLs list and uses the Page extraction prompt to extract the data you need from each page.
Scout: extracts the sitemap for each domain in the Start URLs list and returns the Markdown of the pages that are relevant to the Page extraction prompt.
Agentic: extracts the sitemap for each domain in the Start URLs list and returns structured data from the pages that are relevant to the Page extraction prompt.

This scraper "sees" a website like a human does, so you can describe what you want in plain language. Using LLMs also makes the scraper resilient to website changes. While traditional scrapers rely on hard-coded logic, the AI Web Scraper adapts automatically.

While you focus on the prompt, the Actor handles the technical heavy lifting:

Browser emulation: Full support for dynamic, JavaScript-heavy websites.
Smart anti-blocking: Integrated proxy pools and browser fingerprinting to access any website.
LLM integration: No external LLM subscription required. AI tokens are included in the Actor cost.

Note: If you don't provide a page extraction prompt, the Actor returns the content of each page as Markdown.

How to use this Actor

Click Try for free in the top-right corner.
Set up the input (see below).
Click Save & Start.
Wait a few seconds and your data will be ready in the Output tab.

Input

Field	Type	Required	Default	Description
`startUrls`	`array`	Yes	-	URLs to start from.
`prompt`	`string`	No	`""`	Extraction instruction in natural language. This prompt runs on every page.
`extractionMode`	`string`	No	`"single"`	Extraction mode: `single` (Single page), `scout` (Scout), or `agentic` (Agentic).
`maxPagesToVisit`	`integer`	No	`100`	Maximum number of pages to visit per Start URL in the Scout and Agentic modes.
`maxCrawlDepth`	`integer`	No	`5`	Maximum link depth to follow in the Agentic mode.
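
As a rough illustration of the table above, the following Python sketch assembles an input object with the documented field names and defaults. The `build_input` helper is hypothetical, the `scout` and `agentic` string values are assumed from the mode names used in this README, and the `{"url": ...}` shape for Start URLs follows the example input shown elsewhere on this page.

```python
import json
from typing import Any

# Assumption: the accepted mode strings match the lowercase names in this README.
VALID_MODES = ("single", "scout", "agentic")


def build_input(
    start_urls: list[str],
    prompt: str = "",
    extraction_mode: str = "single",
    max_pages_to_visit: int = 100,
    max_crawl_depth: int = 5,
) -> dict[str, Any]:
    """Assemble an Actor input using the documented field names and defaults."""
    if not start_urls:
        # startUrls is the only required field; rejecting an empty list
        # is an assumption of this sketch, not documented behavior.
        raise ValueError("startUrls must contain at least one URL")
    if extraction_mode not in VALID_MODES:
        raise ValueError(f"extractionMode must be one of {VALID_MODES}")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "prompt": prompt,
        "extractionMode": extraction_mode,
        "maxPagesToVisit": max_pages_to_visit,
        "maxCrawlDepth": max_crawl_depth,
    }


if __name__ == "__main__":
    # https://example.com is a placeholder URL.
    actor_input = build_input(
        ["https://example.com"],
        prompt="Extract every product on this page. For each product save: name and price.",
    )
    print(json.dumps(actor_input, indent=2))
```

The printed JSON can be pasted into the Actor's JSON input editor; only `startUrls` has to be set explicitly.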

Extraction modes

single (default) - reads only the Start URLs and never follows links (maxCrawlDepth is ignored). Each Start URL is rendered once and every matching item on it is extracted. Use it when you already have the exact pages and just want to extract results from each of them.
scout - a page finder rather than a data extractor. The output is the list of pages that contain what you asked for - one result per relevant page (the link plus the page's rendered Markdown), no extracted fields. Best when the pages you want aren't reachable through on-page navigation but the site publishes a sitemap. Ignores maxCrawlDepth. Fails if the sitemap is not found.
agentic - finds the right pages through the sitemap and Start URLs, then crawls and extracts structured data from them. From each relevant page the crawler opens individual item pages, follows links to more items, and extracts one richer record per item page - fields that only live on the item's own page (for example, full product description, specs, article body), grounded in what the page actually shows. Use it when the pages you want are buried deep in the site but listed in its sitemap, e.g. "find the pricing pages and extract the plans from each".
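
The differences between the three modes can be summarized in a small lookup, shown here as a hedged Python sketch. `MODE_LIMITS` and `choose_mode` are hypothetical names for illustration only; the flags restate which limits each mode is described as honoring, and the decision rule is a simplification of the "use it when" guidance above.

```python
# Which input limits each mode honors, restated from the descriptions above.
# Assumption: single mode ignores maxPagesToVisit because that field is
# documented only for the Scout and Agentic modes.
MODE_LIMITS = {
    "single": {"maxPagesToVisit": False, "maxCrawlDepth": False},
    "scout": {"maxPagesToVisit": True, "maxCrawlDepth": False},
    "agentic": {"maxPagesToVisit": True, "maxCrawlDepth": True},
}


def choose_mode(have_exact_pages: bool, need_extracted_fields: bool) -> str:
    """Pick a mode from two yes/no questions (a simplified rule of thumb)."""
    if have_exact_pages:
        # The exact pages are already known, so no page discovery is needed.
        return "single"
    # Pages must be discovered first: Scout returns pages as Markdown,
    # Agentic returns structured records.
    return "agentic" if need_extracted_fields else "scout"


if __name__ == "__main__":
    mode = choose_mode(have_exact_pages=False, need_extracted_fields=True)
    print(mode, MODE_LIMITS[mode])
```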

How to write a good prompt

A well-written prompt is key to getting good results with this Actor. The examples below are based on Apify Store.

Be specific about what data you want:

✅ Good: Extract all Apify Actors from this page. For each Actor, save its name and description.
❌ Bad: Extract all Actor information.

Avoid using colors to describe elements:

✅ Good: Get the link in the "Go to Console" button.
❌ Bad: Get the link in the black button.

For the Single page mode, specify the data you want from this page; don't provide instructions for other pages:

✅ Good: Extract every product in the table on this page. For each product save: name, price in USD, rating, and number of reviews. Set fields that the page doesn't show to null.
❌ Bad: Follow links to each product and extract full specs.

For the Scout mode, describe your search intent in detail:

✅ Good: Find recipe pages for vegetarian dinners. A page is relevant if it's a single recipe with ingredients and steps - not a category or listing page or an article that only mentions recipes.
❌ Bad: Get the ingredients, cooking time, and calories for each recipe.

For the Agentic mode, split the prompt into target pages and target data:

✅ Good: Find Google Map Actor pages. For each return the monthly users count.
❌ Bad: Get the monthly users count for Google Map Actors.

For the Agentic mode, be as specific as possible about the navigation and about what you'd like to extract:

✅ Good: Open each individual product page in this store and extract: product name, current price in USD, brand, availability, and the product URL. Navigation: follow product links, pagination ("next", "load more"), and the main category pages — categories and listings are how you REACH the products. Do NOT follow filter/sort variants, tag, login, cart, blog, or FAQ pages. Store a record ONLY from a real product detail page (one full product). Category and listing pages produce no record — keep crawling them for links.
❌ Bad: Extract product data.

Optionally steer the crawl from the prompt. You can add navigation hints and the model will follow them, e.g. "Only follow product pages and pagination; ignore tag, author, and login pages."
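
If you generate Agentic prompts programmatically, one way to keep target pages, target data, and navigation hints separate is a small template function. This is only a sketch: `build_agentic_prompt` and its wording are hypothetical, and the Actor does not require any particular phrasing.

```python
from typing import Sequence


def build_agentic_prompt(
    target_pages: str,
    fields: Sequence[str],
    follow: Sequence[str] = (),
    ignore: Sequence[str] = (),
) -> str:
    """Compose a prompt that names the target pages, then the target data."""
    if not fields:
        raise ValueError("list at least one field to extract")
    parts = [
        f"Find {target_pages}.",
        f"For each page, extract: {', '.join(fields)}.",
    ]
    # Navigation hints are optional; they are added only when provided.
    if follow:
        parts.append(f"Navigation: follow {', '.join(follow)}.")
    if ignore:
        parts.append(f"Do NOT follow {', '.join(ignore)}.")
    return " ".join(parts)


if __name__ == "__main__":
    print(
        build_agentic_prompt(
            "individual product pages in this store",
            ["product name", "current price in USD", "availability"],
            follow=["product links", "pagination"],
            ignore=["login", "cart", "blog pages"],
        )
    )
```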

Cost consideration

This Actor charges you per extracted page regardless of the extraction mode used.

When using the single page mode, the number of extracted pages will match the number of URLs in Start URLs.

When using Scout or Agentic mode, you might get multiple results per input domain. You can limit the extraction with the maxPagesToVisit parameter. Note that this limit applies per Start URL: if you provide 2 URLs with maxPagesToVisit set to 5, you might be charged for up to 10 results.
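
The arithmetic above can be sketched as follows. The function names are hypothetical, and the rate uses the "from $20.00 / 1,000 page extractions" starting price listed for this Actor, so treat the result as a rough estimate rather than a quote; your actual rate may differ.

```python
# Assumption: the listed starting price; the rate on your plan may differ.
PRICE_PER_1000_PAGES_USD = 20.00


def max_billable_pages(num_start_urls: int, mode: str, max_pages_to_visit: int = 100) -> int:
    """Upper bound on charged pages for a run."""
    if mode == "single":
        # One extracted page per Start URL.
        return num_start_urls
    # Scout and Agentic: the maxPagesToVisit limit applies per Start URL.
    return num_start_urls * max_pages_to_visit


def estimate_cost_usd(pages: int, price_per_1000: float = PRICE_PER_1000_PAGES_USD) -> float:
    """Estimated charge for the given number of extracted pages."""
    return pages * price_per_1000 / 1000


if __name__ == "__main__":
    pages = max_billable_pages(num_start_urls=2, mode="agentic", max_pages_to_visit=5)
    print(pages, f"${estimate_cost_usd(pages):.2f}")  # 10 $0.20
```

With the default maxPagesToVisit of 100, the same two Start URLs could yield up to 200 charged pages, so it is worth lowering the limit while you are still tuning a prompt.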

Schedule recurring scrapes

To schedule regular data extraction, use the Apify built-in scheduler.

Using low-code tools like n8n

You can embed this Actor in your automation workflow using low-code tools like n8n. The Apify platform integrates with Zapier, Make, n8n, Google Sheets, Google Drive, and many others.

You can also use webhooks to trigger actions automatically when a run finishes.

Why use the AI Web Scraper?

Get structured data without custom development

You don't need to know what a CSS selector is. The AI handles that for you. Just provide a prompt in plain language.

Use one prompt for multiple websites

A traditional scraper typically requires custom code for each website. With AI Web Scraper, you can reuse the same prompt across multiple websites.

For example, to find the author of blog posts across different sites:

{
  "startUrls": [
    { "url": "https://blog.apify.com/web-scraping-report-2026/" },
    { "url": "https://crawlee.dev/blog/crawlee-for-python-v1" }
  ],
  "prompt": "Return the blog post name, author name, and publication date."
}

Expected output:

[
    {
        "url": "https://blog.apify.com/web-scraping-report-2026/",
        "markdown": "# State of web scraping report 2026\n\nBy Theo Vasilis · Jan 29, 2026\n\n…",
        "data": {
            "blog_post_name": "State of web scraping report 2026",
            "author_name": "Theo Vasilis",
            "publication_date": "Jan 29, 2026"
        }
    },
    {
        "url": "https://crawlee.dev/blog/crawlee-for-python-v1",
        "markdown": "# Crawlee for Python v1\n\nBy Vlada Dusek · September 15, 2025\n\n…",
        "data": {
            "blog_post_name": "Crawlee for Python v1",
            "author_name": "Vlada Dusek",
            "publication_date": "September 15, 2025"
        }
    }
]
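
Because each result nests the extracted fields under `data`, a little post-processing is usually needed before loading the results into a spreadsheet. The sketch below flattens items shaped like the expected output above into CSV text; `results_to_csv` is a hypothetical helper, and it assumes every item has a `url` and an optional `data` object with flat values.

```python
import csv
import io
from typing import Any


def results_to_csv(items: list[dict[str, Any]]) -> str:
    """Flatten result items into CSV: one row per page, one column per field."""
    # Collect column names in first-seen order, since the model may return
    # different keys for different pages.
    fieldnames = ["url"]
    for item in items:
        for key in item.get("data") or {}:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for item in items:
        # Fields missing from a page are written as empty cells.
        writer.writerow({"url": item["url"], **(item.get("data") or {})})
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {
            "url": "https://crawlee.dev/blog/crawlee-for-python-v1",
            "data": {"author_name": "Vlada Dusek", "publication_date": "September 15, 2025"},
        }
    ]
    print(results_to_csv(sample))
```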

Identify relevant pages to feed your AI

The Scout extraction mode is well suited to feeding your AI system clean Markdown.

Because the AI Web Scraper returns only relevant pages, you avoid flooding your AI with irrelevant context and save space in the context window.
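
As a minimal sketch of that workflow, the function below concatenates Scout results into a single context string while respecting a size budget. It assumes each result carries `url` and `markdown` fields, as in the example output on this page; `build_context` is a hypothetical helper, and the budget is measured in characters rather than tokens for simplicity.

```python
SEPARATOR = "\n---\n"


def build_context(pages: list[dict[str, str]], max_chars: int) -> str:
    """Join page Markdown into one context string of at most max_chars characters."""
    chunks: list[str] = []
    used = 0
    for page in pages:
        chunk = f"Source: {page['url']}\n\n{page['markdown']}"
        # Count the separator too, so the final string stays within budget.
        extra = len(chunk) + (len(SEPARATOR) if chunks else 0)
        if used + extra > max_chars:
            # Stop at the first page that does not fit; pages are kept whole.
            break
        chunks.append(chunk)
        used += extra
    return SEPARATOR.join(chunks)


if __name__ == "__main__":
    pages = [
        {"url": "https://example.com/recipes/lentil-curry", "markdown": "# Lentil curry\n\n..."},
        {"url": "https://example.com/recipes/veggie-chili", "markdown": "# Veggie chili\n\n..."},
    ]
    print(build_context(pages, max_chars=2000))
```

Keeping pages whole avoids handing the model a truncated document; a token-based budget would need the tokenizer of whichever model you use.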

Typical use cases

AI Web Scraper works best on websites with varied page structures, where building a traditional scraper would be too expensive:

Blogs
E-commerce websites
Real estate listings
Job boards

It's also a great fit for monitoring websites that update frequently, for example a competitor's pricing page that gets redesigned every few weeks.

AI Web Scraper and an MCP server

With the Apify API, you can use almost any Actor with a Model Context Protocol (MCP) server. You can connect using clients like Claude Desktop and LibreChat, or build your own. Read more about how to set up Apify Actors with MCP.

FAQ

Why choose AI Web Scraper over a traditional scraper?

Here's a quick comparison with Cheerio Scraper and Playwright Scraper:

Feature	AI Web Scraper	Cheerio Scraper	Playwright Scraper
Requires programming skills	No	Yes	Yes
Adapts to website changes	Yes	No	No
Reads JavaScript and dynamic content	Yes	No	Yes
Proxy pool and anti-blocking	Yes	Yes	Yes
Cost per run	$$$	$	$$

Can I control the crawling behavior?

Yes. You control crawling from the input:

maxCrawlDepth - how deep to follow links (Agentic mode only).
maxPagesToVisit - how many pages to visit per Start URL (Scout and Agentic modes).
extractionMode - how pages are found and read (see Extraction modes above).

Concurrency is managed automatically: the crawler scales parallelism as far as resources allow. You can also steer where the crawler goes from the prompt itself, e.g. "only follow product pages and pagination."

Do I need a ChatGPT subscription?

No. AI tokens are included in the Actor cost. No external setup needed.

Can I use proxies?

This Actor uses Apify Proxy automatically.

How do I access and export the scraped data?

Scraped results are stored in a dataset. You can export it in JSON, XML, CSV, or Excel format.

Download results via the Apify API or Apify Console. You can also push data to tools like Make, n8n, or Zapier using the available integrations.
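
For downloads over the API, the sketch below builds a dataset export URL and fetches it with the Python standard library. It assumes the Apify API v2 "get dataset items" endpoint with a `format` query parameter and Bearer-token authentication; check the Apify API reference before relying on it. The helper names and the dataset ID in the usage example are placeholders.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Assumption: base URL and endpoint layout of the Apify API v2.
API_BASE = "https://api.apify.com/v2"
# Format values assumed to correspond to JSON, XML, CSV, and Excel exports.
EXPORT_FORMATS = ("json", "xml", "csv", "xlsx")


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the export URL for a dataset in the requested format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"format must be one of {EXPORT_FORMATS}")
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/items?{query}"


def download_items(dataset_id: str, token: str, fmt: str = "json") -> bytes:
    """Download the dataset contents; performs a network request."""
    request = Request(
        dataset_items_url(dataset_id, fmt),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    # "YOUR_DATASET_ID" is a placeholder; use the dataset ID of your run.
    print(dataset_items_url("YOUR_DATASET_ID", "csv"))
```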

Which scraping tool is best for beginners?

If you don't have programming skills, an AI scraper is the best starting point. AI Web Scraper lets you extract structured data from any website using a plain-language prompt.

For a more technical introduction to web scraping, check out Apify Academy.
