AI Web Crawler
Under maintenance
Pricing: from $0.00005 / actor start
Developer: Angel Rojo (maintained by Community)
🤖 AI Web Scraper — GPT-Powered Data Extraction
Extract structured data from any website using AI. No custom selectors needed — just a URL and natural language instructions. Supports OpenAI, OpenRouter, LM Studio, Ollama, Groq, and any OpenAI-compatible API.
🎯 What It Does
AI Web Scraper uses GPT-4o-mini (or GPT-4o/GPT-4.1) to intelligently extract structured data from any webpage. Unlike traditional scrapers that require specific CSS selectors or XPath expressions, this Actor understands natural language instructions and adapts to any website structure.
✨ Key Features
- 🧠 Natural Language Extraction — Describe what you want in plain English, GPT does the rest
- 🔄 Universal Compatibility — Works on any website without custom coding per site
- 📊 Structured JSON Output — Returns clean, parseable data pushed to Apify Dataset
- 📄 Multi-Page Support — Automatic pagination handling (up to 50 pages)
- 🚀 Fast Processing — Pages processed in seconds with headless Playwright
- 🔒 Anti-Detection — Blocks images/ads, uses realistic user-agent
- ⚡ Multiple AI Models — gpt-4o-mini, gpt-4o, gpt-4.1 (or any OpenAI-compatible API)
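The image and ad blocking listed above is commonly done in Playwright through request interception. The sketch below shows one way this could look; it is not the Actor's actual code. The block lists and both function names are assumptions made for illustration. Only `page.route`, `route.abort`, `route.continue_`, and `request.resource_type` come from Playwright's Python API.

```python
from urllib.parse import urlparse

# Hypothetical block lists; the Actor's real rules are not documented here.
BLOCKED_TYPES = {"image", "media", "font"}
AD_HOST_HINTS = ("doubleclick.net", "googlesyndication.com", "adservice.")


def should_block(resource_type: str, url: str) -> bool:
    """Return True for requests that are not needed to read the page text."""
    if resource_type in BLOCKED_TYPES:
        return True
    host = urlparse(url).hostname or ""
    return any(hint in host for hint in AD_HOST_HINTS)


def install_blocking(page) -> None:
    """Attach the filter to a Playwright sync-API page object."""
    def handle(route):
        request = route.request
        if should_block(request.resource_type, request.url):
            route.abort()       # drop the request entirely
        else:
            route.continue_()   # let everything else through

    page.route("**/*", handle)
```

Keeping the decision in a pure function (`should_block`) makes the rule easy to test without launching a browser.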
💡 Use Cases
| Industry | What to Extract |
|---|---|
| 🛒 E-commerce | Product names, prices, ratings, descriptions, reviews count |
| 🏠 Real Estate | Listings, prices, locations, agent info, property details |
| 📧 Lead Generation | Company names, emails, phone numbers, social profiles |
| 💼 Job Boards | Job titles, salaries, companies, locations, requirements |
| 📰 Research | Articles, papers, reviews, social media content |
| 🔍 SEO | Meta tags, headings, content structure, internal links |
📥 Input Schema
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | ✅ | - | Target URL to scrape |
| `prompt` | string | ✅ | - | What data to extract (natural language) |
| `apiKey` | string | ❌ | env `OPENAI_API_KEY` | OpenAI API key (sk-...) |
| `model` | string | ❌ | `gpt-4o-mini` | AI model: `gpt-4o-mini`, `gpt-4o`, `gpt-4.1` |
| `maxPages` | integer | ❌ | 1 | Max pages to process (1–50) |
| `waitForSelector` | string | ❌ | - | CSS selector to wait for before extracting |
Example Input

    {
      "url": "https://www.example.com/products",
      "prompt": "Extract all product names, prices, ratings, and review counts",
      "model": "gpt-4o-mini",
      "maxPages": 3
    }
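Before starting a run, it can help to check an input object against the schema above on the client side. The following is a minimal sketch under stated assumptions: `validate_input` is a hypothetical helper, not part of the Actor, and the `http://`/`https://` check is an extra assumption that the schema itself does not state.

```python
def validate_input(raw: dict) -> dict:
    """Check required fields and apply the documented defaults."""
    for field in ("url", "prompt"):
        value = raw.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{field}' is required and must be a non-empty string")

    # Assumption: only web URLs make sense as scrape targets.
    if not raw["url"].startswith(("http://", "https://")):
        raise ValueError("'url' must start with http:// or https://")

    max_pages = raw.get("maxPages", 1)
    is_int = isinstance(max_pages, int) and not isinstance(max_pages, bool)
    if not is_int or not 1 <= max_pages <= 50:
        raise ValueError("'maxPages' must be an integer between 1 and 50")

    result = {
        "url": raw["url"],
        "prompt": raw["prompt"],
        "model": raw.get("model", "gpt-4o-mini"),  # documented default
        "maxPages": max_pages,
    }
    # Optional fields are copied only when the caller supplied them.
    for optional in ("apiKey", "waitForSelector"):
        if optional in raw:
            result[optional] = raw[optional]
    return result
```

The model name is deliberately not restricted to the three listed values, since the Actor is described as working with any OpenAI-compatible API.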
📤 Output
Each extracted item is pushed to the Apify Dataset as a separate record. For product and listing pages, records typically use these fields:
| Field | Type | Description |
|---|---|---|
| `title` | string | Title or name of the extracted item |
| `description` | string | Description or summary |
| `price` | string | Price value if available |
| `url` | string | Source URL of the item |
| `image_url` | string | Image URL if available |
| `rating` | number | Rating score (0–5 scale) |
| `reviews_count` | integer | Number of reviews |
| `availability` | string | Availability status |
| `category` | string | Category or type |
| `source_page` | string | Page where item was found |
| `extracted_at` | datetime | ISO timestamp of extraction |
⚠️ Note: Field names are dynamic — GPT determines them based on your prompt. The schema above covers common extraction patterns for products/listings.
Example Output

    [
      {
        "title": "Wireless Headphones Pro",
        "price": "$79.99",
        "rating": 4.5,
        "reviews_count": 1234,
        "url": "https://example.com/products/wireless-headphones-pro"
      },
      {
        "title": "Bluetooth Speaker",
        "price": "$49.99",
        "rating": 4.2,
        "reviews_count": 856,
        "url": "https://example.com/products/bluetooth-speaker"
      }
    ]
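Because field names depend on the prompt and `price` arrives as a string, downstream code may want to normalize records before using them. The sketch below is one possible approach, not something the Actor provides: `COMMON_FIELDS`, `parse_price`, `normalize`, and the `price_value` and `extra` keys are all hypothetical names, and the price parser assumes US-style formatting such as `$1,299.00`.

```python
import re

# Field names taken from the output table; anything else is prompt-specific.
COMMON_FIELDS = (
    "title", "description", "price", "url", "image_url", "rating",
    "reviews_count", "availability", "category", "source_page", "extracted_at",
)


def parse_price(text):
    """Pull the first number out of a price string; return None if there is none."""
    if not isinstance(text, str):
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def normalize(item: dict) -> dict:
    """Fill missing common fields with None and add a numeric price."""
    record = {field: item.get(field) for field in COMMON_FIELDS}
    record["price_value"] = parse_price(item.get("price"))
    # Keep prompt-specific fields separate so nothing is silently dropped.
    record["extra"] = {k: v for k, v in item.items() if k not in COMMON_FIELDS}
    return record
```

Applied to the first example record, `normalize` keeps `title` and `rating` as they are, sets absent fields such as `description` to `None`, and adds a float `price_value` parsed from `"$79.99"`.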
🧪 How to Use
Option 1: Run via Apify Console
- Go to Apify Console
- Find "AI Web Scraper" in the Store
- Click "Try for free" or "Run Actor"
- Enter your URL and extraction prompt
- Click "Run" — results appear in the Dataset
Option 2: Run via API

    curl -X POST "https://api.apify.com/v2/acts/gek0v~ai-web-scraper/runs" \
      -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "url": "https://example.com/products",
        "prompt": "Extract product names and prices",
        "model": "gpt-4o-mini"
      }'
Option 3: Python SDK

    from apify_client import ApifyClient

    client = ApifyClient("your-apify-token")

    # Start the Actor and wait for the run to finish
    run = client.actor("gek0v/ai-web-scraper").call(run_input={
        "url": "https://example.com",
        "prompt": "Extract all article titles and authors",
        "model": "gpt-4o-mini",
    })

    # Print each record from the run's default dataset
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item)
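Since records from the same run can carry different keys, exporting them to CSV needs a column set built from all items. The following sketch shows one way to do that with the standard library; `items_to_csv` is a hypothetical helper, and nested values (lists or objects) would simply be written as their string form.

```python
import csv
import io


def items_to_csv(items) -> str:
    """Serialize dataset items to CSV text, using the union of all keys as columns."""
    rows = list(items)
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:  # preserve first-seen column order
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)  # missing keys become empty cells
    return buffer.getvalue()
```

With the SDK example above, this could be called as `items_to_csv(client.dataset(run["defaultDatasetId"]).iterate_items())`.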
💰 Pricing
| Component | Cost |
|---|---|
| Actor Compute (Actor Start) | ~$0.000002/run (based on memory allocation) |
| Dataset Storage | ~$0.002 per stored item |
| Platform Fee | 20% of compute + storage costs |
| OpenAI GPT API | Passed directly to user at model pricing |
💡 Typical cost per run: Most extractions cost < $0.01 (with gpt-4o-mini) plus ~$0.002 per extracted item stored.
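The pricing components above can be combined into a rough per-run estimate. This is a sketch under assumptions, not an official calculator: it takes the table's approximate figures as defaults, treats the 20% platform fee as added on top of compute plus storage, and expects the caller to supply the model API cost, which depends on the model and token counts. `estimate_run_cost` is a hypothetical name.

```python
def estimate_run_cost(
    items_stored: int,
    llm_cost: float = 0.0,
    compute_cost: float = 0.000002,   # approximate figure from the table
    storage_per_item: float = 0.002,  # approximate figure from the table
    platform_fee_rate: float = 0.20,
) -> float:
    """Estimate the USD cost of one run."""
    platform_base = compute_cost + items_stored * storage_per_item
    platform_fee = platform_base * platform_fee_rate
    return platform_base + platform_fee + llm_cost
```

For example, `estimate_run_cost(10, llm_cost=0.005)` works out to (0.000002 + 10 × 0.002) × 1.2 + 0.005, or about $0.029, which shows that storage dominates once more than a handful of items are saved.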
🔧 Local Development

    # Clone
    git clone https://github.com/gek0v/ai-web-scraper.git
    cd ai-web-scraper

    # Install dependencies
    pip install -r requirements.txt

    # Run locally
    python src/main.py --input '{"url": "https://example.com", "prompt": "Extract all headings"}'
📝 Tips for Best Results
- Be specific in your prompt: "Extract product name, price in USD, and star rating" works better than "extract product info"
- Test with gpt-4o-mini first: it's 10x cheaper and often good enough. Upgrade to gpt-4o for complex pages
- Use `waitForSelector`: for dynamic SPAs (React, Vue, Angular), wait for the content container
- Limit `maxPages`: start with 1 page to test, then scale up
- Provide your API key: set the `OPENAI_API_KEY` env var or pass it via input
⚠️ Limitations
- Very large pages (>100K chars) are truncated to fit GPT's context window
- JavaScript-heavy SPAs may need `waitForSelector` for rendering
- Some anti-bot protections (Cloudflare, etc.) may block access
- GPT costs are passed through to the user (OpenAI/compatible API pricing applies)
- Requires an OpenAI-compatible API key (not included)
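To illustrate the truncation limit mentioned above, here is a minimal sketch of trimming page text to a character budget. How the Actor actually truncates is not documented, so this is an assumption: `truncate_for_model` is a hypothetical function, and the choice to cut at a nearby line break rather than mid-line is a design preference, not the Actor's confirmed behavior.

```python
def truncate_for_model(text: str, max_chars: int = 100_000) -> str:
    """Trim page text to a character budget, cutting at a line break when one is nearby."""
    if len(text) <= max_chars:
        return text
    # Prefer the last line break before the limit so no line is cut in half.
    cut = text.rfind("\n", 0, max_chars)
    # Fall back to a hard cut if that break is missing or far from the limit.
    if cut < max_chars * 0.9:
        cut = max_chars
    return text[:cut]
```

Content past the cut never reaches the model, so on long listing pages it may be worth narrowing the target URL or the prompt rather than relying on truncation.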
📄 License
MIT License — free to use and modify.
🏷️ Tags
web-scraping artificial-intelligence data-extraction playwright gpt automation developer-tools