Scrape GPT - Universal AI Web Scraper Agent

AI-powered universal web scraper that works on ANY website without configuration. Extract data from e-commerce, news sites, social media, and more using intelligent LLM-based field mapping. Features JSON-first extraction, automatic pagination, anti-bot bypass, and cost-effective caching.

Pricing: from $3.00 / 1,000 results

Rating: 5.0 (1)

Developer: Paradox Analytics (Maintained by Community)

Actor stats: 0 bookmarks · 4 total users · 3 monthly active users · last modified 3 hours ago


🚀 Universal LLM Scraper

AI-powered web scraping that works on ANY website without configuration!

🎯 What Makes This Special?

This scraper delivers universal web scraping powered by Large Language Models:

  • ✅ Universal - Works on ANY website without configuration
  • ✅ Intelligent - LLM-powered semantic field mapping and extraction
  • ✅ JSON-First - Automatically detects and extracts from embedded JSON data
  • ✅ Auto-Pagination - Automatically handles multi-page content
  • ✅ Anti-Bot Bypass - Web Unblocker support for Kasada, Cloudflare, and more
  • ✅ Cost-Effective - Caching reduces LLM costs by up to 99% on repeated requests
  • ✅ Production Ready - Tested on diverse website types (e-commerce, news, social media)

🎬 How It Works

Intelligent Extraction Flow

1. Fetch HTML from target website (with anti-detection browser)
2. Detect embedded JSON data (faster, more reliable)
3. Analyze structure using LLM (cached per domain+fields)
4. Extract data with semantic field mapping
5. Fallback to Direct LLM extraction if needed
6. Cache results for future requests
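
The numbered flow above can be illustrated with a small, self-contained Python sketch. It only demonstrates the "JSON-first, LLM-fallback" decision from step 2 (detecting embedded JSON such as JSON-LD); the anti-detection browser, the LLM structure analysis, and the caching used by the actual actor are not shown, and the helper below is a simplified assumption rather than the actor's real code.

import json
import re
import urllib.request

def find_embedded_json(html: str):
    """Return the first <script type="application/ld+json"> payload, if any."""
    match = re.search(
        r'<script[^>]+type="application/ld\+json"[^>]*>(.*?)</script>',
        html, re.DOTALL | re.IGNORECASE,
    )
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

html = urllib.request.urlopen("https://news.ycombinator.com").read().decode("utf-8")
data = find_embedded_json(html)
if data is not None:
    print("Embedded JSON found; map requested fields directly from it.")
else:
    print("No embedded JSON; fall back to LLM extraction over the HTML.")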

Key Features

  • JSON-First Extraction: Automatically detects and extracts from embedded JSON (faster, more reliable)
  • Direct LLM Extraction: Fallback method that extracts directly from HTML using LLM
  • Pre-warming Cache: Domain+fields cache skips LLM analysis for pages 2-N in pagination
  • Semantic Field Mapping: Understands field synonyms (e.g., "title" = "name", "productName")
  • Auto-Pagination: Automatically detects and scrapes multiple pages
  • Web Unblocker: Automatic fallback to Bright Data Web Unblocker for anti-bot challenges

🚀 Quick Start

Simple Example

{
  "startUrls": [
    {"url": "https://news.ycombinator.com"}
  ],
  "fields": ["title", "url", "score"],
  "openaiApiKey": "sk-..."
}

That's it! The system will:

  1. Fetch the page with anti-detection browser
  2. Detect JSON data or analyze HTML structure
  3. Extract the requested fields using semantic mapping
  4. Cache extraction patterns for future use
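
If you prefer to start runs from code instead of the Apify console, the snippet below shows one possible way to do it with the official apify-client Python package. The Actor ID is a placeholder (use the ID or username/actor-name slug shown on this Actor's store page), and you supply your own Apify token and OpenAI key.

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "startUrls": [{"url": "https://news.ycombinator.com"}],
    "fields": ["title", "url", "score"],
    "openaiApiKey": "sk-...",
}

# "<ACTOR_ID>" is a placeholder; replace it with this Actor's actual ID.
run = client.actor("<ACTOR_ID>").call(run_input=run_input)

# Results are stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)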

E-commerce Example

{
  "startUrls": [
    {"url": "https://www.chewy.com/b/wet-food-389"}
  ],
  "fields": ["title", "price", "rating", "review count", "product url"],
  "openaiApiKey": "sk-...",
  "enableAutoPagination": true,
  "maxPages": 10
}

With Web Unblocker (for protected sites)

{
  "startUrls": [
    {"url": "https://protected-site.com/products"}
  ],
  "fields": ["title", "price", "description"],
  "openaiApiKey": "sk-...",
  "webUnblockerApiKey": "your-bright-data-api-key",
  "webUnblockerZone": "web_unlocker1"
}

โš™๏ธ Configuration Options

Required Fields

Field         Type    Description
startUrls     Array   URLs to scrape
fields        Array   Fields to extract (e.g., ["title", "price"])
openaiApiKey  String  Your OpenAI API key (can also be set as an environment variable)

Optional: Direct LLM Extraction

{
  "useDirectLLM": true,
  "directLLMQualityMode": "balanced"
}

Option                 Values                                 Description
useDirectLLM           true / false                           Enable Direct LLM extraction (default: true)
directLLMQualityMode   conservative / balanced / aggressive   Quality vs. quantity tradeoff (default: balanced)

Optional: Auto-Pagination

{
  "enableAutoPagination": true,
  "maxPages": 10
}

Option                  Description
enableAutoPagination    Automatically scrape multiple pages when pagination is detected
maxPages                Maximum pages to scrape (0 = all pages)

Optional: Web Unblocker (Anti-Bot Bypass)

{
  "webUnblockerApiKey": "your-api-key",
  "webUnblockerZone": "web_unlocker1"
}

Automatically falls back to Web Unblocker when standard proxies are blocked by:

  • Kasada
  • Cloudflare
  • PerimeterX
  • Other advanced anti-bot systems

Optional: Proxy Configuration

{
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Or use external proxy:

{
  "useExternalProxy": true,
  "externalProxyServer": "http://proxy.example.com:8080",
  "externalProxyUsername": "username",
  "externalProxyPassword": "password"
}

💰 Pricing

Price: $3.00 per 1,000 requests ($0.003 per request)

This pricing covers infrastructure costs and is competitive with alternatives such as ScrapeGraphAI ($30+ per 1,000 pages). The actor uses LLM-powered extraction with advanced features including Web Unblocker support, automatic pagination, and intelligent caching.

  • Each URL in startUrls counts as 1 request
  • Pagination pages count as additional requests
  • No hidden fees - pay only for what you use

Cost Breakdown

  • Actor Usage: $3.00 per 1,000 requests
  • OpenAI API: You pay OpenAI directly (typically $0.01-0.05 per page)
  • Web Unblocker: You pay Bright Data directly (if used)

Cost Optimization Tips

  1. Enable Caching: Repeated requests to same domain+fields reuse cached patterns
  2. Use JSON-First: Automatically detects embedded JSON (faster, cheaper)
  3. Pre-warming: First page analyzes structure, subsequent pages reuse it
  4. Batch Similar Sites: Group similar sites together to maximize cache hits
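
To make the numbers above concrete, here is a rough back-of-the-envelope estimate for a single run; the 10-page figure is an assumption, and the OpenAI range is the typical per-page cost quoted in this section, before any cache savings.

# Back-of-the-envelope cost estimate using the figures quoted in this section.
# Assumes one start URL that paginates to 10 pages (each page counts as a request).
pages = 10
actor_cost = pages * 3.00 / 1000                      # $3.00 per 1,000 requests
openai_low, openai_high = pages * 0.01, pages * 0.05  # $0.01-0.05 per page, before caching
print(f"Actor usage:      ${actor_cost:.2f}")                      # $0.03
print(f"OpenAI (approx.): ${openai_low:.2f}-${openai_high:.2f}")   # $0.10-$0.50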

🎯 Use Cases

Perfect For:

  • E-commerce Scraping - Product listings, prices, reviews, ratings
  • News Aggregation - Articles, headlines, authors, dates
  • Social Media - Posts, comments, likes, shares
  • Job Boards - Listings, companies, locations, salaries
  • Real Estate - Properties, prices, locations, features
  • Market Research - Competitive intelligence, pricing analysis
  • Data Collection - Any structured data from websites

Why It's Better:

  • No Configuration - Works immediately on any site
  • Handles Anti-Bot - Automatic Web Unblocker fallback
  • Scales Efficiently - Caching reduces costs dramatically
  • Production Ready - Tested on diverse website types
  • Universal - One tool for all websites

🔬 Technical Details

Extraction Methods

  1. JSON-First: Detects embedded JSON data (fastest, most reliable)

    • Analyzes JSON structure with LLM (cached per domain+fields)
    • Semantic field mapping (e.g., "title" → "name", "productName")
    • Handles nested structures and arrays
  2. Direct LLM Extraction: Fallback for HTML-only pages

    • Converts HTML to Markdown
    • Chunks large pages intelligently
    • Extracts directly using LLM
    • Caches results by structure hash
  3. Code Generation: Traditional pattern-based extraction (legacy)
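
The "convert to Markdown and chunk" step of the Direct LLM fallback (method 2 above) can be illustrated in a simplified form. This sketch only strips tags and splits on a fixed character budget; the actor's real converter and chunking heuristics are not published here.

import re

def html_to_text(html: str) -> str:
    # Drop scripts/styles, then remaining tags, then collapse whitespace.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, max_chars: int = 8000) -> list[str]:
    # Naive fixed-size chunking; real chunking would respect document structure.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

pieces = chunk(html_to_text("<html><body><h1>Example</h1><p>Some content...</p></body></html>"))
print(len(pieces), pieces[0])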

Caching Strategy

  • Pre-warming Cache: Domain+fields → field mappings (checked before fetch)
  • Structure Cache: Domain+structure+fields → analysis (checked after fetch)
  • Direct LLM Cache: Structure+fields → extraction results (persistent)
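
The exact cache key format is internal to the actor, but the idea can be sketched as follows; the key layouts and the SHA-256 fingerprint below are illustrative assumptions, not the actor's actual scheme.

import hashlib
from urllib.parse import urlparse

def prewarm_key(url: str, fields: list[str]) -> str:
    # Same domain + same requested fields -> same pre-warming cache entry.
    domain = urlparse(url).netloc
    return f"{domain}:{','.join(sorted(fields))}"

def structure_key(url: str, structure_fingerprint: str, fields: list[str]) -> str:
    # Adds a hash of the page's structural fingerprint for the structure cache.
    digest = hashlib.sha256(structure_fingerprint.encode()).hexdigest()[:16]
    return f"{prewarm_key(url, fields)}:{digest}"

print(prewarm_key("https://www.chewy.com/b/wet-food-389", ["title", "price"]))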

Anti-Detection

  • Camoufox Browser: Advanced anti-detection Firefox-based browser
  • Random Profiles: Rotating browser fingerprints
  • Human-like Behavior: Realistic mouse movements, scrolling
  • Web Unblocker: Automatic fallback for advanced anti-bot systems

📊 Performance Metrics

Based on testing across diverse website types:

  • Success Rate: 95%+ on standard sites, 90%+ on protected sites
  • Extraction Speed: 5-60 seconds per page (depends on complexity)
  • Cache Hit Rate: 80-99% on repeated requests
  • Field Coverage: 90%+ of requested fields extracted

🛠️ Advanced Features

Semantic Field Mapping

The system understands field synonyms automatically:

  • "title" โ†’ "name", "productName", "heading"
  • "price" โ†’ "cost", "amount", "value"
  • "rating" โ†’ "score", "stars", "review"
  • "url" โ†’ "link", "href", "productUrl"

Website-Specific Context

Provides the LLM with context about what fields mean on specific websites:

  • E-commerce: "title" = full product name (not short labels)
  • News: "title" = article headline
  • Social: "title" = post title

Pagination Detection

Automatically detects and handles:

  • URL-based pagination (?page=2)
  • Infinite scroll
  • "Load More" buttons
  • JSON API pagination
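
A minimal sketch of the simplest case, URL-based pagination, follows; it is illustrative only, does not cover infinite scroll, "Load More" buttons, or JSON API pagination, and the "page" parameter name is an assumption.

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url: str, page_param: str = "page") -> str:
    # Increment the page query parameter, defaulting to page 1 if absent.
    parts = urlparse(url)
    query = parse_qs(parts.query)
    current = int(query.get(page_param, ["1"])[0])
    query[page_param] = [str(current + 1)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

print(next_page_url("https://example.com/products?page=2"))
# -> https://example.com/products?page=3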

📈 Output Format

Dataset Items

{
  "title": "Example Product",
  "price": "$99.99",
  "rating": 4.5,
  "review_count": 1234,
  "product_url": "https://example.com/product/123",
  "_url": "https://example.com/products",
  "_metadata": {
    "fetch_method": "browser",
    "extraction_source": "json",
    "execution_time": 12.5
  }
}

Metadata

Each item includes:

  • _url: Source URL where item was found
  • _metadata.fetch_method: How page was fetched (browser, static, api)
  • _metadata.extraction_source: How data was extracted (json, direct_llm, code)
  • _metadata.execution_time: Time taken to extract (seconds)
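
When fetching results programmatically, these metadata fields make it easy to see which extraction path was taken. One possible pattern with the official apify-client Python package is shown below; the token and dataset ID are placeholders (the dataset ID is available as run["defaultDatasetId"] after calling the Actor).

from collections import Counter
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
sources = Counter()

# Count how each item was extracted (json, direct_llm, or code).
for item in client.dataset("<DATASET_ID>").iterate_items():
    meta = item.get("_metadata", {})
    sources[meta.get("extraction_source", "unknown")] += 1

print(sources)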

🆘 Troubleshooting

"No API key provided"

  • Add your OpenAI API key in the input or set OPENAI_API_KEY environment variable

"Failed to extract data"

  • Check if fields match actual page content
  • Try enabling Direct LLM extraction (useDirectLLM: true)
  • Review output metadata for extraction source

"Site is blocking requests"

  • Enable Apify residential proxies
  • Use Web Unblocker for advanced anti-bot systems
  • Add delays between requests

"Missing fields in output"

  • Check field names match page content
  • Try semantic variations (e.g., "name" instead of "title")
  • Enable Direct LLM extraction for better field detection

🎓 Best Practices

  1. Start Small - Test on a few URLs before scaling
  2. Use Caching - Repeated requests to same domain+fields reuse cached patterns
  3. Enable Auto-Pagination - Set maxPages to limit pagination
  4. Use Web Unblocker - For sites with advanced anti-bot protection
  5. Monitor Costs - Check execution time and extraction source in metadata
  6. Group Similar Sites - Scrape similar sites together to maximize cache hits

๐Ÿ” Security & Privacy

  • API Keys: Stored securely as Apify secrets
  • No Data Retention: We don't store scraped data
  • Pattern Privacy: Cached patterns stored in your Apify Key-Value Store
  • GDPR Compliant: No personal data collected

📚 Examples

E-commerce Product Scraping

{
  "startUrls": [
    {"url": "https://www.chewy.com/b/wet-food-389"}
  ],
  "fields": ["title", "price", "rating", "review count", "product url"],
  "openaiApiKey": "sk-...",
  "enableAutoPagination": true,
  "maxPages": 5
}

News Article Scraping

{
  "startUrls": [
    {"url": "https://techcrunch.com"}
  ],
  "fields": ["title", "author", "date", "content"],
  "openaiApiKey": "sk-..."
}

Social Media Scraping

{
  "startUrls": [
    {"url": "https://www.reddit.com/r/github/"}
  ],
  "fields": ["title", "author", "score", "comments", "url"],
  "openaiApiKey": "sk-..."
}

๐Ÿค Support

  • Documentation: Full guides in actor repository
  • Issues: Report bugs via Apify support
  • Updates: Follow actor for new features

📄 License

This actor is available under the MIT license.


🎉 Get Started Now!

  1. Add your OpenAI API key
  2. Provide URLs and fields to extract
  3. Hit Run!

The system handles everything else automatically. Start scraping any website today! 🚀