Scrape GPT - Universal AI Web Scraper Agent
Pricing: from $3.00 / 1,000 results

AI-powered universal web scraper that works on ANY website without configuration. Extract data from e-commerce, news sites, social media, and more using intelligent LLM-based field mapping. Features JSON-first extraction, automatic pagination, anti-bot bypass, and cost-effective caching.
Developer: Paradox Analytics
Universal LLM Scraper
AI-powered web scraping that works on ANY website without configuration!
What Makes This Special?
This scraper delivers universal web scraping powered by Large Language Models:
- ✅ Universal - Works on ANY website without configuration
- ✅ Intelligent - LLM-powered semantic field mapping and extraction
- ✅ JSON-First - Automatically detects and extracts from embedded JSON data
- ✅ Auto-Pagination - Automatically handles multi-page content
- ✅ Anti-Bot Bypass - Web Unblocker support for Kasada, Cloudflare, and more
- ✅ Cost-Effective - Caching reduces LLM costs by up to 99% on repeated requests
- ✅ Production Ready - Tested on diverse website types (e-commerce, news, social media)
How It Works
Intelligent Extraction Flow
1. Fetch HTML from the target website (with anti-detection browser)
2. Detect embedded JSON data (faster, more reliable)
3. Analyze structure using LLM (cached per domain+fields)
4. Extract data with semantic field mapping
5. Fall back to Direct LLM extraction if needed
6. Cache results for future requests
Key Features
- JSON-First Extraction: Automatically detects and extracts from embedded JSON (faster, more reliable)
- Direct LLM Extraction: Fallback method that extracts directly from HTML using LLM
- Pre-warming Cache: Domain+fields cache skips LLM analysis for pages 2-N in pagination
- Semantic Field Mapping: Understands field synonyms (e.g., "title" = "name", "productName")
- Auto-Pagination: Automatically detects and scrapes multiple pages
- Web Unblocker: Automatic fallback to Bright Data Web Unblocker for anti-bot challenges
Quick Start
Simple Example
{"startUrls": [{"url": "https://news.ycombinator.com"}],"fields": ["title", "url", "score"],"openaiApiKey": "sk-..."}
That's it! The system will:
- Fetch the page with anti-detection browser
- Detect JSON data or analyze HTML structure
- Extract the requested fields using semantic mapping
- Cache extraction patterns for future use
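You can also start runs programmatically. Below is a minimal sketch using the official Apify Python client; the actor ID string is a placeholder, so substitute the ID shown on this actor's page.
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "startUrls": [{"url": "https://news.ycombinator.com"}],
    "fields": ["title", "url", "score"],
    "openaiApiKey": "sk-...",
}

# "paradox-analytics/scrape-gpt" is a placeholder actor ID; copy the real one
# from this actor's page in the Apify Console.
run = client.actor("paradox-analytics/scrape-gpt").call(run_input=run_input)

# Each scraped record is stored as an item in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item.get("url"))
```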
E-commerce Example
{"startUrls": [{"url": "https://www.chewy.com/b/wet-food-389"}],"fields": ["title", "price", "rating", "review count", "product url"],"openaiApiKey": "sk-...","enableAutoPagination": true,"maxPages": 10}
With Web Unblocker (for protected sites)
{"startUrls": [{"url": "https://protected-site.com/products"}],"fields": ["title", "price", "description"],"openaiApiKey": "sk-...","webUnblockerApiKey": "your-bright-data-api-key","webUnblockerZone": "web_unlocker1"}
โ๏ธ Configuration Options
Required Fields
| Field | Type | Description |
|---|---|---|
| `startUrls` | Array | URLs to scrape |
| `fields` | Array | Fields to extract (e.g., `["title", "price"]`) |
| `openaiApiKey` | String | Your OpenAI API key (can also be set as an environment variable) |
Optional: Direct LLM Extraction
{"useDirectLLM": true,"directLLMQualityMode": "balanced"}
| Option | Values | Description |
|---|---|---|
useDirectLLM | true/false | Enable Direct LLM extraction (default: true) |
directLLMQualityMode | conservative/balanced/aggressive | Quality vs quantity tradeoff (default: balanced) |
Optional: Auto-Pagination
{"enableAutoPagination": true,"maxPages": 10}
| Option | Description |
|---|---|
enableAutoPagination | Automatically scrape multiple pages when pagination detected |
maxPages | Maximum pages to scrape (0 = all pages) |
Optional: Web Unblocker (Anti-Bot Bypass)
{"webUnblockerApiKey": "your-api-key","webUnblockerZone": "web_unlocker1"}
Automatically falls back to Web Unblocker when standard proxies are blocked by:
- Kasada
- Cloudflare
- PerimeterX
- Other advanced anti-bot systems
Optional: Proxy Configuration
{"proxyConfiguration": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}}
Or use external proxy:
{"useExternalProxy": true,"externalProxyServer": "http://proxy.example.com:8080","externalProxyUsername": "username","externalProxyPassword": "password"}
Pricing
Price: $3.00 per 1,000 requests (i.e., $0.003 per request)
This pricing covers infrastructure costs and is competitive with alternatives such as ScrapeGraphAI ($30+ per 1,000 pages). The actor uses LLM-powered extraction with advanced features including Web Unblocker support, automatic pagination, and intelligent caching.
$3.00 per 1,000 requests
- Each URL in `startUrls` counts as 1 request
- Pagination pages count as additional requests
- No hidden fees - pay only for what you use
Cost Breakdown
- Actor Usage: $3.00 per 1,000 requests
- OpenAI API: You pay OpenAI directly (typically $0.01-0.05 per page)
- Web Unblocker: You pay Bright Data directly (if used)
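As a rough back-of-the-envelope estimate using the figures above (the OpenAI numbers are estimates and drop substantially once patterns are cached):
```python
# Rough cost estimate for a 1,000-page run, using the figures listed above.
# OpenAI cost per page is an estimate and falls sharply once patterns are cached.
pages = 1_000
actor_cost = pages / 1_000 * 3.00      # $3.00 per 1,000 requests
openai_low = pages * 0.01              # ~$0.01 per page (low estimate)
openai_high = pages * 0.05             # ~$0.05 per page (high estimate)

print(f"Actor usage:  ${actor_cost:.2f}")
print(f"OpenAI (est): ${openai_low:.2f} - ${openai_high:.2f}")
# -> Actor usage:  $3.00
# -> OpenAI (est): $10.00 - $50.00
```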
Cost Optimization Tips
- Enable Caching: Repeated requests to same domain+fields reuse cached patterns
- Use JSON-First: Automatically detects embedded JSON (faster, cheaper)
- Pre-warming: First page analyzes structure, subsequent pages reuse it
- Batch Similar Sites: Group similar sites together to maximize cache hits
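For example, putting several listing pages from the same domain with the same fields into one run lets everything after the first analyzed page reuse the cached mapping. The extra category URLs below are illustrative, not verified paths:
```python
# Batched run_input: several listing pages from the same domain, same fields.
# After the first page is analyzed, the cached domain+fields mapping is reused
# for the remaining URLs and for any paginated pages. URLs are illustrative.
run_input = {
    "startUrls": [
        {"url": "https://www.chewy.com/b/wet-food-389"},
        {"url": "https://www.chewy.com/b/dry-food-388"},
        {"url": "https://www.chewy.com/b/treats-373"},
    ],
    "fields": ["title", "price", "rating", "review count", "product url"],
    "openaiApiKey": "sk-...",
    "enableAutoPagination": True,
    "maxPages": 5,
}
```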
Use Cases
Perfect For:
- E-commerce Scraping - Product listings, prices, reviews, ratings
- News Aggregation - Articles, headlines, authors, dates
- Social Media - Posts, comments, likes, shares
- Job Boards - Listings, companies, locations, salaries
- Real Estate - Properties, prices, locations, features
- Market Research - Competitive intelligence, pricing analysis
- Data Collection - Any structured data from websites
Why It's Better:
- No Configuration - Works immediately on any site
- Handles Anti-Bot - Automatic Web Unblocker fallback
- Scales Efficiently - Caching reduces costs dramatically
- Production Ready - Tested on diverse website types
- Universal - One tool for all websites
Technical Details
Extraction Methods
1. JSON-First: Detects embedded JSON data (fastest, most reliable)
   - Analyzes JSON structure with LLM (cached per domain+fields)
   - Semantic field mapping (e.g., "title" → "name", "productName")
   - Handles nested structures and arrays
2. Direct LLM Extraction: Fallback for HTML-only pages
   - Converts HTML to Markdown
   - Chunks large pages intelligently
   - Extracts directly using LLM
   - Caches results by structure hash
3. Code Generation: Traditional pattern-based extraction (legacy)
Caching Strategy
- Pre-warming Cache: Domain+fields → field mappings (checked before fetch)
- Structure Cache: Domain+structure+fields → analysis (checked after fetch)
- Direct LLM Cache: Structure+fields → extraction results (persistent)
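To illustrate why the pre-warming cache is cheap to hit, a domain+fields key can be derived without any LLM call. This is a conceptual sketch only, not the actor's actual key format:
```python
import hashlib
from urllib.parse import urlparse

def cache_key(url: str, fields: list[str]) -> str:
    """Conceptual sketch of a domain+fields cache key (not the actor's actual format)."""
    domain = urlparse(url).netloc
    # Sorting makes the key independent of the order the fields are requested in.
    normalized = domain + "|" + ",".join(sorted(f.lower() for f in fields))
    return hashlib.sha256(normalized.encode()).hexdigest()

# Page 1 and page 2 of the same listing produce the same key,
# so the LLM analysis from page 1 can be reused for page 2.
print(cache_key("https://example.com/products?page=1", ["title", "price"]))
print(cache_key("https://example.com/products?page=2", ["price", "Title"]))
```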
Anti-Detection
- Camoufox Browser: Advanced anti-detection Firefox-based browser
- Random Profiles: Rotating browser fingerprints
- Human-like Behavior: Realistic mouse movements, scrolling
- Web Unblocker: Automatic fallback for advanced anti-bot systems
Performance Metrics
Based on testing across diverse website types:
- Success Rate: 95%+ on standard sites, 90%+ on protected sites
- Extraction Speed: 5-60 seconds per page (depends on complexity)
- Cache Hit Rate: 80-99% on repeated requests
- Field Coverage: 90%+ of requested fields extracted
Advanced Features
Semantic Field Mapping
The system understands field synonyms automatically:
- "title" โ "name", "productName", "heading"
- "price" โ "cost", "amount", "value"
- "rating" โ "score", "stars", "review"
- "url" โ "link", "href", "productUrl"
Website-Specific Context
Provides LLM with context about what fields mean on specific websites:
- E-commerce: "title" = full product name (not short labels)
- News: "title" = article headline
- Social: "title" = post title
Pagination Detection
Automatically detects and handles:
- URL-based pagination (`?page=2`)
- Infinite scroll
- "Load More" buttons
- JSON API pagination
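For `?page=N` style listings, enabling auto-pagination with a page cap is usually sufficient. A sketch of such an input (the URL is illustrative):
```python
# Auto-pagination on a ?page=N style listing. maxPages caps the crawl;
# setting it to 0 follows pagination until no further pages are detected.
run_input = {
    "startUrls": [{"url": "https://example.com/products?page=1"}],
    "fields": ["title", "price", "product url"],
    "openaiApiKey": "sk-...",
    "enableAutoPagination": True,
    "maxPages": 10,
}
```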
Output Format
Dataset Items
{"title": "Example Product","price": "$99.99","rating": 4.5,"review_count": 1234,"product_url": "https://example.com/product/123","_url": "https://example.com/products","_metadata": {"fetch_method": "browser","extraction_source": "json","execution_time": 12.5}}
Metadata
Each item includes:
- `_url`: Source URL where the item was found
- `_metadata.fetch_method`: How the page was fetched (browser, static, api)
- `_metadata.extraction_source`: How the data was extracted (json, direct_llm, code)
- `_metadata.execution_time`: Time taken to extract (seconds)
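This metadata makes it easy to audit a run after the fact. Below is a small sketch using the Apify Python client that groups items by extraction source; `DATASET_ID` is a placeholder for the run's default dataset ID (see the Quick Start sketch):
```python
from collections import Counter

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# "DATASET_ID" is the default dataset of a finished run (run["defaultDatasetId"]).
items = list(client.dataset("DATASET_ID").iterate_items())

sources = Counter(item["_metadata"]["extraction_source"] for item in items)
avg_time = sum(item["_metadata"]["execution_time"] for item in items) / max(len(items), 1)

print(sources)                              # e.g. Counter({'json': 180, 'direct_llm': 20})
print(f"avg extraction time: {avg_time:.1f}s")
```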
Troubleshooting
"No API key provided"
- Add your OpenAI API key in the input or set the `OPENAI_API_KEY` environment variable
"Failed to extract data"
- Check if fields match actual page content
- Try enabling Direct LLM extraction (`useDirectLLM: true`)
- Review the output metadata for the extraction source
"Site is blocking requests"
- Enable Apify residential proxies
- Use Web Unblocker for advanced anti-bot systems
- Add delays between requests
"Missing fields in output"
- Check field names match page content
- Try semantic variations (e.g., "name" instead of "title")
- Enable Direct LLM extraction for better field detection
Best Practices
- Start Small - Test on a few URLs before scaling
- Use Caching - Repeated requests to same domain+fields reuse cached patterns
- Enable Auto-Pagination - Set `maxPages` to limit pagination
- Use Web Unblocker - For sites with advanced anti-bot protection
- Monitor Costs - Check execution time and extraction source in metadata
- Group Similar Sites - Scrape similar sites together to maximize cache hits
Security & Privacy
- API Keys: Stored securely as Apify secrets
- No Data Retention: We don't store scraped data
- Pattern Privacy: Cached patterns stored in your Apify Key-Value Store
- GDPR Compliant: No personal data collected
Examples
E-commerce Product Scraping
{"startUrls": [{"url": "https://www.chewy.com/b/wet-food-389"}],"fields": ["title", "price", "rating", "review count", "product url"],"openaiApiKey": "sk-...","enableAutoPagination": true,"maxPages": 5}
News Article Scraping
{"startUrls": [{"url": "https://techcrunch.com"}],"fields": ["title", "author", "date", "content"],"openaiApiKey": "sk-..."}
Social Media Scraping
{"startUrls": [{"url": "https://www.reddit.com/r/github/"}],"fields": ["title", "author", "score", "comments", "url"],"openaiApiKey": "sk-..."}
Support
- Documentation: Full guides in actor repository
- Issues: Report bugs via Apify support
- Updates: Follow actor for new features
License
This actor is available under the MIT license.
Get Started Now!
- Add your OpenAI API key
- Provide URLs and fields to extract
- Hit Run!
The system handles everything else automatically. Start scraping any website today!