AI Smart Scraper — Extract Data from Any Website

Under maintenance

Pricing

from $0.00005 / actor start

AI web scraper: describe the data you want in plain English, get clean JSON from any webpage. No CSS selectors needed. For lead gen, price monitoring, RAG, and AI agents. Powered by Gemini AI.

Rating: 0.0 (0)

Developer

亲晖 林

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 1
Monthly active users: 1
Last modified: 4 days ago

AI Smart Scraper — Extract Structured Data from Any Website

Extract structured JSON data from any webpage using plain English prompts. No CSS selectors, no XPath, no coding required. Just describe the data you want, and AI does the rest.

✨ Key Features

  • Natural language extraction — Describe what you want: "Get all product names, prices, and ratings"
  • Any website — Works on news sites, e-commerce, directories, job boards, real estate listings, and more
  • Structured JSON output — Clean, machine-readable data ready for your pipeline
  • Zero configuration — No CSS selectors or page structure knowledge needed
  • Custom schemas — Optionally define exact output structure with JSON Schema
  • Batch processing — Process multiple URLs in a single run
  • Built-in AI — Powered by Google Gemini 2.5 Flash. No API keys needed

🎯 Use Cases

| Use Case | Example Prompt |
| --- | --- |
| Lead generation | "Extract company names, emails, phone numbers, and addresses" |
| Price monitoring | "Get all product names, current prices, and discount percentages" |
| Job scraping | "Extract job titles, companies, locations, salaries, and posting dates" |
| News aggregation | "Get article titles, authors, publish dates, and summaries" |
| Real estate | "Extract property addresses, prices, bedrooms, bathrooms, and square footage" |
| Restaurant data | "Get restaurant names, ratings, review counts, cuisine types, and price ranges" |
| Academic research | "Extract paper titles, authors, publication years, and citation counts" |
| Social media | "Get post text, like counts, comment counts, and timestamps" |

📥 Input

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | String | Yes* | Target webpage URL |
| urls | Array | Yes* | List of URLs for batch processing |
| prompt | String | Yes | Natural language description of data to extract |
| schema | Object | No | Optional JSON Schema for output validation |
| maxPages | Integer | No | Maximum pages to process (default: 1, max: 100) |
| openaiApiKey | String | No | Optional: Use your own OpenAI key instead of built-in AI |

*Provide either url or urls (or both).
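The rules in the table can be checked before a run is started. The following is a minimal sketch, not part of the Actor itself: `build_run_input` is a hypothetical helper, and it assumes the Actor rejects a missing prompt, a missing URL, and a `maxPages` value outside 1 to 100, as the table suggests.

```python
from typing import Any, Optional


def build_run_input(
    prompt: str,
    url: Optional[str] = None,
    urls: Optional[list[str]] = None,
    schema: Optional[dict[str, Any]] = None,
    max_pages: int = 1,
) -> dict[str, Any]:
    """Assemble an input object using the parameter names from the Input table."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    if not url and not urls:
        raise ValueError("provide url, urls, or both")
    if not 1 <= max_pages <= 100:
        raise ValueError("maxPages must be between 1 and 100")

    run_input: dict[str, Any] = {"prompt": prompt.strip(), "maxPages": max_pages}
    # Only include optional keys that were actually supplied.
    if url:
        run_input["url"] = url
    if urls:
        run_input["urls"] = list(urls)
    if schema is not None:
        run_input["schema"] = schema
    return run_input
```

Calling `build_run_input("Extract titles", url="https://example.com")` returns a dictionary that can be serialized to JSON and used as the Actor input.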

📤 Output

Each result in the dataset contains:

{
  "url": "https://example.com/products",
  "data": [
    {
      "name": "Wireless Headphones",
      "price": 79.99,
      "rating": 4.5,
      "reviews": 2847
    }
  ],
  "metadata": {
    "tokensUsed": 1250,
    "model": "google/gemini-2.5-flash",
    "extractedAt": "2026-02-24T15:37:46.831Z",
    "contentLength": 15420,
    "status": "success"
  }
}
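A dataset record in this shape can be turned into a typed object for downstream code. This is a sketch under the assumption that every record carries the `url`, `data`, and `metadata` fields shown above; `ExtractionResult` and `parse_result` are illustrative names, not part of the Actor.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class ExtractionResult:
    url: str
    items: list[dict[str, Any]]
    tokens_used: int
    model: str
    extracted_at: datetime
    status: str


def parse_result(record: dict[str, Any]) -> ExtractionResult:
    """Convert one dataset record into an ExtractionResult."""
    meta = record["metadata"]
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11,
    # so replace it with an explicit UTC offset.
    timestamp = meta["extractedAt"].replace("Z", "+00:00")
    return ExtractionResult(
        url=record["url"],
        items=list(record["data"]),
        tokens_used=int(meta["tokensUsed"]),
        model=meta["model"],
        extracted_at=datetime.fromisoformat(timestamp),
        status=meta["status"],
    )
```

Checking `result.status == "success"` before using `result.items` is one way to skip failed pages in a batch.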

💡 Examples

Example 1: Extract top articles from Hacker News

Input:

{
  "url": "https://news.ycombinator.com",
  "prompt": "Extract the top 5 articles with their title, score, and comment count"
}

Output (abridged):

{
  "data": [
    { "title": "Show HN: I built a new tool", "score": 285, "comment_count": 63 },
    { "title": "Why AI agents need better tools", "score": 141, "comment_count": 45 }
  ]
}

Example 2: Scrape product listings with custom schema

Input:

{
  "url": "https://example-shop.com/laptops",
  "prompt": "Extract all laptop listings with name, price, specs, and availability",
  "schema": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "cpu": { "type": "string" },
        "ram_gb": { "type": "integer" },
        "in_stock": { "type": "boolean" }
      }
    }
  }
}
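When a schema is supplied, it can be useful to re-check the returned items on the client side. The sketch below is a deliberately small type checker that covers only the keywords used in this example (`type`, `items`, `properties`). It is not a full JSON Schema validator and says nothing about how the Actor validates internally; `conforms` is a hypothetical helper.

```python
from typing import Any

# Mapping from JSON Schema type names to Python types.
_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def conforms(value: Any, schema: dict[str, Any]) -> bool:
    """Return True if value matches the type, items, and properties in schema."""
    expected = schema.get("type")
    if isinstance(expected, str) and expected in _TYPES:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            return False
        if not isinstance(value, _TYPES[expected]):
            return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    if isinstance(value, dict):
        # Properties absent from the value are ignored ("required" is not handled).
        for key, subschema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], subschema):
                return False
    return True
```

For production use, a complete validator such as the `jsonschema` package is a safer choice than this sketch.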

Example 3: Batch URL processing

Input:

{
  "urls": [
    "https://company-a.com/about",
    "https://company-b.com/about",
    "https://company-c.com/about"
  ],
  "prompt": "Extract the company name, founding year, number of employees, and headquarters location"
}

💰 Pricing

This Actor uses Pay Per Event pricing:

| Event | Price |
| --- | --- |
| Page extracted | $0.01 per page |
| Actor start | $0.00005 per start |

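Cost example: Extracting data from 100 product pages costs 100 × $0.01 = $1.00 in page events, plus platform usage (~$0.40), for about $1.40 total. The $0.00005 Actor start fee is negligible at this scale.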

No monthly fees. No subscriptions. Pay only for what you use.
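The cost arithmetic can be reproduced with a few lines of code. This sketch uses the two event prices from the pricing table. The platform usage figure is not a fixed price, so it is passed in by the caller; the ~$0.40 used in the check below is taken from the cost example and is only an estimate. `estimate_cost` is an illustrative name.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.01")      # "Page extracted" event
PRICE_PER_START = Decimal("0.00005")  # "Actor start" event


def estimate_cost(
    pages: int,
    runs: int = 1,
    platform_usage: Decimal = Decimal("0"),
) -> Decimal:
    """Estimate the total in USD for a number of pages spread over a number of runs."""
    if pages < 0 or runs < 0:
        raise ValueError("pages and runs must be non-negative")
    return pages * PRICE_PER_PAGE + runs * PRICE_PER_START + platform_usage


# 100 pages in one run with the example's platform usage estimate.
total = estimate_cost(100, runs=1, platform_usage=Decimal("0.40"))
```

`Decimal` is used instead of `float` so that small per-event prices add up without rounding noise.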

🔌 Integrations

This Actor works with:

  • Apify API — Call via REST API from any language
  • Apify MCP Server — Use directly from AI agents (Claude, ChatGPT, etc.)
  • Zapier / Make — Automate workflows with no-code tools
  • Python / JavaScript SDK — Native Apify client libraries
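As one illustration of the REST route, the sketch below builds a request for Apify's `run-sync-get-dataset-items` endpoint using only the Python standard library. The Actor ID and token are placeholders that must be replaced with real values, and `build_sync_request` and `run_actor` are hypothetical helper names. The official `apify-client` package is likely the more convenient option in practice.

```python
import json
import urllib.request
from typing import Any

API_BASE = "https://api.apify.com/v2"


def build_sync_request(
    actor_id: str, token: str, run_input: dict[str, Any]
) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(
    actor_id: str, token: str, run_input: dict[str, Any], timeout: float = 300
) -> Any:
    """Send the request and decode the JSON response (performs a network call)."""
    request = build_sync_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would look like `run_actor("someuser~some-actor", "<APIFY_TOKEN>", {"url": "https://example.com", "prompt": "Extract the page title"})`, where the Actor ID shown is a placeholder in Apify's `username~actor-name` format.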

🤔 FAQ

Q: Do I need an API key?
A: No. The Actor uses a built-in AI model (Google Gemini). Optionally, you can provide your own OpenAI API key to use GPT-4o-mini instead.

Q: What websites does it work on?
A: Any publicly accessible webpage. The Actor parses HTML with Cheerio, which does not execute JavaScript, so content that JavaScript-heavy SPAs render on the client may be missing and may need additional configuration.

Q: How accurate is the extraction?
A: Powered by Gemini 2.5 Flash, extraction accuracy is typically 90-95% for well-structured pages. Complex or unusual layouts may require more specific prompts.

Q: Can I use this for large-scale scraping?
A: Yes. Use the urls parameter for batch processing and maxPages to control scope. For very large jobs, consider running multiple Actor instances.
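One way to spread a very large job over several runs is to split the URL list into fixed-size batches and start one run per batch. The sketch below assumes that the `maxPages` limit of 100 also bounds how many URLs a single run will process, which the documentation does not state explicitly. `chunk_urls` is an illustrative helper.

```python
from typing import Iterator


def chunk_urls(urls: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of urls, each with at most size entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Example: 250 placeholder URLs become three batches.
all_urls = [f"https://example.com/page/{i}" for i in range(250)]
batches = list(chunk_urls(all_urls))
```

Each batch can then be passed as the `urls` input of a separate run, with `maxPages` set to the batch length.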

📋 Changelog

  • v0.1 — Initial release with Gemini 2.5 Flash, Cheerio crawler, PPE pricing