AI Smart Scraper — Extract Data from Any Website
Pricing
from $0.00005 / actor start
AI web scraper: describe the data you want in plain English, get clean JSON from any webpage. No CSS selectors needed. For lead gen, price monitoring, RAG, and AI agents. Powered by Gemini AI.
Developer: 亲晖 林
AI Smart Scraper — Extract Structured Data from Any Website
Extract structured JSON data from any webpage using plain English prompts. No CSS selectors, no XPath, no coding required. Just describe the data you want, and AI does the rest.
✨ Key Features
- Natural language extraction — Describe what you want: "Get all product names, prices, and ratings"
- Any website — Works on news sites, e-commerce, directories, job boards, real estate listings, and more
- Structured JSON output — Clean, machine-readable data ready for your pipeline
- Zero configuration — No CSS selectors or page structure knowledge needed
- Custom schemas — Optionally define exact output structure with JSON Schema
- Batch processing — Process multiple URLs in a single run
- Built-in AI — Powered by Google Gemini 2.5 Flash. No API keys needed
🎯 Use Cases
| Use Case | Example Prompt |
|---|---|
| Lead generation | "Extract company names, emails, phone numbers, and addresses" |
| Price monitoring | "Get all product names, current prices, and discount percentages" |
| Job scraping | "Extract job titles, companies, locations, salaries, and posting dates" |
| News aggregation | "Get article titles, authors, publish dates, and summaries" |
| Real estate | "Extract property addresses, prices, bedrooms, bathrooms, and square footage" |
| Restaurant data | "Get restaurant names, ratings, review counts, cuisine types, and price ranges" |
| Academic research | "Extract paper titles, authors, publication years, and citation counts" |
| Social media | "Get post text, like counts, comment counts, and timestamps" |
📥 Input
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | String | Yes* | Target webpage URL |
| urls | Array | Yes* | List of URLs for batch processing |
| prompt | String | Yes | Natural language description of data to extract |
| schema | Object | No | Optional JSON Schema for output validation |
| maxPages | Integer | No | Maximum pages to process (default: 1, max: 100) |
| openaiApiKey | String | No | Optional: use your own OpenAI key instead of the built-in AI |
*Provide either url or urls (or both).
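The rules in the input table can be checked before a run is started. The following is a minimal sketch, not part of the Actor itself: the helper name `build_run_input` and its Python argument names are hypothetical, while the resulting keys (`url`, `urls`, `prompt`, `schema`, `maxPages`) come from the table above.

```python
from typing import Any, Optional


def build_run_input(
    prompt: str,
    url: Optional[str] = None,
    urls: Optional[list[str]] = None,
    schema: Optional[dict[str, Any]] = None,
    max_pages: int = 1,
) -> dict[str, Any]:
    """Assemble an Actor input dict, enforcing the rules from the input table.

    Hypothetical helper: the Actor only sees the returned dict.
    """
    if not prompt or not prompt.strip():
        raise ValueError("prompt is required")
    if not url and not urls:
        raise ValueError("provide url, urls, or both")
    if not 1 <= max_pages <= 100:
        raise ValueError("maxPages must be between 1 and 100")

    run_input: dict[str, Any] = {"prompt": prompt.strip(), "maxPages": max_pages}
    if url:
        run_input["url"] = url
    if urls:
        run_input["urls"] = list(urls)
    if schema is not None:
        run_input["schema"] = schema
    return run_input
```

Calling `build_run_input("Extract titles", url="https://example.com")` returns a dict that can be pasted into the Actor input as JSON; a missing prompt or an out-of-range `max_pages` raises `ValueError` before any paid event is triggered.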
📤 Output
Each result in the dataset contains:
{
  "url": "https://example.com/products",
  "data": [
    {
      "name": "Wireless Headphones",
      "price": 79.99,
      "rating": 4.5,
      "reviews": 2847
    }
  ],
  "metadata": {
    "tokensUsed": 1250,
    "model": "google/gemini-2.5-flash",
    "extractedAt": "2026-02-24T15:37:46.831Z",
    "contentLength": 15420,
    "status": "success"
  }
}
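Downstream code usually wants one flat row per extracted record rather than the nested item. The sketch below shows one way to do that for the item shape documented above. It makes two assumptions that the listing does not confirm: any `status` other than `"success"` is skipped, and `data` may be a single object as well as a list. The function name `rows_from_item` is hypothetical.

```python
from typing import Any

# Sample dataset item, copied from the output example above.
SAMPLE_ITEM: dict[str, Any] = {
    "url": "https://example.com/products",
    "data": [
        {"name": "Wireless Headphones", "price": 79.99, "rating": 4.5, "reviews": 2847}
    ],
    "metadata": {
        "tokensUsed": 1250,
        "model": "google/gemini-2.5-flash",
        "extractedAt": "2026-02-24T15:37:46.831Z",
        "contentLength": 15420,
        "status": "success",
    },
}


def rows_from_item(item: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one flat row per extracted record, tagged with its source URL."""
    # Assumption: items whose status is not "success" carry no usable data.
    if item.get("metadata", {}).get("status") != "success":
        return []
    data = item.get("data") or []
    # Assumption: a single extracted object may arrive as a dict, not a list.
    if isinstance(data, dict):
        data = [data]
    return [{"source_url": item.get("url"), **record} for record in data]


rows = rows_from_item(SAMPLE_ITEM)
```

Rows in this shape can be passed directly to `csv.DictWriter` or loaded into a DataFrame.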
💡 Examples
Example 1: Extract top articles from Hacker News
Input:
{
  "url": "https://news.ycombinator.com",
  "prompt": "Extract the top 5 articles with their title, score, and comment count"
}
Output (abridged):
{
  "data": [
    { "title": "Show HN: I built a new tool", "score": 285, "comment_count": 63 },
    { "title": "Why AI agents need better tools", "score": 141, "comment_count": 45 }
  ]
}
Example 2: Scrape product listings with custom schema
Input:
{
  "url": "https://example-shop.com/laptops",
  "prompt": "Extract all laptop listings with name, price, specs, and availability",
  "schema": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "cpu": { "type": "string" },
        "ram_gb": { "type": "integer" },
        "in_stock": { "type": "boolean" }
      }
    }
  }
}
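Even with a schema supplied, it can be worth re-checking field types on the client side before loading results into a database. The sketch below is a deliberately small type check for flat schemas like the one in Example 2. It is not a full JSON Schema validator (a library such as `jsonschema` would be the usual choice for that), and the function name `type_errors` is hypothetical.

```python
from typing import Any

# Schema copied from Example 2 above.
LAPTOP_SCHEMA: dict[str, Any] = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "cpu": {"type": "string"},
            "ram_gb": {"type": "integer"},
            "in_stock": {"type": "boolean"},
        },
    },
}

# Only the four primitive types used in the example schema are covered.
_JSON_TYPES: dict[str, Any] = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}


def type_errors(records: list[dict[str, Any]], schema: dict[str, Any]) -> list[str]:
    """List fields whose Python type does not match the schema's declared type."""
    props = schema["items"]["properties"]
    errors: list[str] = []
    for index, record in enumerate(records):
        for field, spec in props.items():
            if field not in record:
                continue  # the example schema declares no required fields
            value = record[field]
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_bool = isinstance(value, bool)
            ok = isinstance(value, _JSON_TYPES[spec["type"]]) and (
                spec["type"] == "boolean" or not is_bool
            )
            if not ok:
                errors.append(f"item {index}: {field} should be {spec['type']}")
    return errors
```

As a simplification, a float such as `16.0` is rejected for an `integer` field here, although JSON Schema itself would accept it.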
Example 3: Batch URL processing
Input:
{
  "urls": [
    "https://company-a.com/about",
    "https://company-b.com/about",
    "https://company-c.com/about"
  ],
  "prompt": "Extract the company name, founding year, number of employees, and headquarters location"
}
💰 Pricing
This Actor uses Pay Per Event pricing:
| Event | Price |
|---|---|
| Page extracted | $0.01 per page |
| Actor start | $0.00005 per start |
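Cost example: extracting data from 100 product pages in one run costs 100 × $0.01 = $1.00 in page events, plus $0.00005 for the Actor start and roughly $0.40 of platform usage, for a total of about $1.40.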
No monthly fees. No subscriptions. Pay only for what you use.
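The arithmetic behind the cost example can be written as a small estimator. This is a sketch using the two event prices from the table above. Platform usage is passed in as a plain number because it depends on the run, and the roughly $0.40 figure is the listing's own estimate for 100 pages. The function name `estimate_cost_usd` is hypothetical.

```python
PRICE_PER_PAGE_USD = 0.01      # "Page extracted" event
PRICE_PER_START_USD = 0.00005  # "Actor start" event


def estimate_cost_usd(pages: int, runs: int = 1, platform_usage_usd: float = 0.0) -> float:
    """Estimate the total cost of extracting `pages` pages across `runs` Actor starts."""
    if pages < 0 or runs < 1:
        raise ValueError("pages must be >= 0 and runs must be >= 1")
    event_cost = pages * PRICE_PER_PAGE_USD + runs * PRICE_PER_START_USD
    return round(event_cost + platform_usage_usd, 5)


total = estimate_cost_usd(100, runs=1, platform_usage_usd=0.40)
```

For 100 pages in one run this comes to about $1.40, matching the example above. The start fee is negligible unless a job is split into a very large number of runs.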
🔌 Integrations
This Actor works with:
- Apify API — Call via REST API from any language
- Apify MCP Server — Use directly from AI agents (Claude, ChatGPT, etc.)
- Zapier / Make — Automate workflows with no-code tools
- Python / JavaScript SDK — Native Apify client libraries
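For the REST route, the sketch below builds a request against the Apify API endpoint that runs an Actor synchronously and returns its dataset items. It uses only the standard library and does not send anything. The Actor ID `your-username~ai-smart-scraper` and the token value are placeholders, since this listing does not state the real Actor ID, and the helper name `build_sync_request` is hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Any

API_BASE = "https://api.apify.com/v2"


def build_sync_request(actor_id: str, token: str, run_input: dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a POST that runs an Actor and returns its dataset items."""
    path = urllib.parse.quote(actor_id, safe="~")
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{path}/run-sync-get-dataset-items",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder Actor ID and token: replace both with real values.
request = build_sync_request(
    "your-username~ai-smart-scraper",
    "APIFY_TOKEN_PLACEHOLDER",
    {"url": "https://news.ycombinator.com", "prompt": "Extract the top 5 articles"},
)
```

Passing `request` to `urllib.request.urlopen` would start a billable run and should return the dataset items as a JSON array. For long-running jobs, the official `apify-client` package is likely the more convenient option.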
🤔 FAQ
Q: Do I need an API key? A: No. The Actor uses a built-in AI model (Google Gemini 2.5 Flash). Optionally, you can pass your own OpenAI API key in openaiApiKey to use GPT-4o-mini instead.
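Q: What websites does it work on? A: Any publicly accessible webpage. The Actor parses the HTML returned by the server with Cheerio and does not execute JavaScript, so content that JavaScript-heavy single-page apps render in the browser may be missing and may need additional configuration.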
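One way to guess in advance whether a page falls into the JavaScript-heavy category is to look at how much visible text its raw HTML contains. The sketch below is a rough heuristic, not something the Actor is known to do: the 200-character threshold is an arbitrary assumption, and the names `_TextCounter` and `looks_client_rendered` are hypothetical.

```python
from html.parser import HTMLParser

_SKIPPED_TAGS = ("script", "style", "noscript")


class _TextCounter(HTMLParser):
    """Count visible text characters and <script> tags in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.text_chars = 0
        self.script_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in _SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text inside script, style, and noscript elements.
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Return True when the HTML has scripts but almost no visible text."""
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars
```

A page that returns little more than an empty root `div` and a script bundle would be flagged, while a server-rendered article would not. False positives and negatives are both possible, so treat the result as a hint only.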
Q: How accurate is the extraction? A: With Gemini 2.5 Flash, extraction accuracy is typically 90-95% on well-structured pages. Complex or unusual layouts may need more specific prompts.
Q: Can I use this for large-scale scraping? A: Yes! Use the urls parameter for batch processing and maxPages to control scope. For very large jobs, consider running multiple Actor instances.
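Splitting a large job across several runs can be sketched as a chunking step over the URL list. The batch size of 100 below mirrors the documented maximum for maxPages. Treating that limit as the number of URLs a single run should receive is an assumption, and the helper name `chunk_urls` is hypothetical.

```python
from typing import Iterable, Iterator


def chunk_urls(urls: Iterable[str], size: int = 100) -> Iterator[list[str]]:
    """Yield de-duplicated URL batches of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    unique = list(dict.fromkeys(urls))  # drop repeats so no page is billed twice
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


batches = list(chunk_urls(["https://a.example", "https://b.example", "https://a.example", "https://c.example"], size=2))
```

Each batch can then become the `urls` value of a separate run, with `maxPages` set to the batch length.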
📋 Changelog
- v0.1 — Initial release with Gemini 2.5 Flash, Cheerio crawler, PPE pricing