AI Web Crawler

Under maintenance

Pricing

from $0.00005 / actor start

Extract structured data from any website using AI. No custom selectors needed.

Rating: 0.0 (0)

Developer

Angel Rojo

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 1
Monthly active users: 0
Last modified: 3 days ago

🤖 AI Web Scraper — GPT-Powered Data Extraction

Extract structured data from any website using AI. No custom selectors needed — just a URL and natural language instructions. Supports OpenAI, OpenRouter, LM Studio, Ollama, Groq, and any OpenAI-compatible API.


🎯 What It Does

AI Web Scraper uses GPT-4o-mini (or GPT-4o/GPT-4.1) to intelligently extract structured data from any webpage. Unlike traditional scrapers that require specific CSS selectors or XPath expressions, this Actor understands natural language instructions and adapts to any website structure.

✨ Key Features

  • 🧠 Natural Language Extraction — Describe what you want in plain English, GPT does the rest
  • 🔄 Universal Compatibility — Works on any website without custom coding per site
  • 📊 Structured JSON Output — Returns clean, parseable data pushed to Apify Dataset
  • 📄 Multi-Page Support — Automatic pagination handling (up to 50 pages)
  • 🚀 Fast Processing — Pages processed in seconds with headless Playwright
  • 🔒 Anti-Detection — Blocks images/ads, uses realistic user-agent
  • Multiple AI Models — gpt-4o-mini, gpt-4o, gpt-4.1 (or any OpenAI-compatible API)

💡 Use Cases

| Industry | What to Extract |
| --- | --- |
| 🛒 E-commerce | Product names, prices, ratings, descriptions, review counts |
| 🏠 Real Estate | Listings, prices, locations, agent info, property details |
| 📧 Lead Generation | Company names, emails, phone numbers, social profiles |
| 💼 Job Boards | Job titles, salaries, companies, locations, requirements |
| 📰 Research | Articles, papers, reviews, social media content |
| 🔍 SEO | Meta tags, headings, content structure, internal links |

📥 Input Schema

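| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | (none) | Target URL to scrape |
| prompt | string | Yes | (none) | What data to extract (natural language) |
| apiKey | string | No | env OPENAI_API_KEY | OpenAI API key (sk-...) |
| model | string | No | gpt-4o-mini | AI model: gpt-4o-mini, gpt-4o, gpt-4.1 |
| maxPages | integer | No | 1 | Max pages to process (1–50) |
| waitForSelector | string | No | (none) | CSS selector to wait for before extracting |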

Example Input

{
  "url": "https://www.example.com/products",
  "prompt": "Extract all product names, prices, ratings, and review counts",
  "model": "gpt-4o-mini",
  "maxPages": 3
}
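
Before starting a run, it can help to check an input object locally against the fields documented above. The following is a minimal sketch, not the Actor's own validation code: `validate_input` is a hypothetical helper, and it assumes that `url` and `prompt` are the only mandatory fields and that the defaults and the 1–50 page limit match the input table.

```python
from urllib.parse import urlparse

# Defaults and limits taken from the input table; treat them as assumptions.
DEFAULT_MODEL = "gpt-4o-mini"
MIN_PAGES, MAX_PAGES = 1, 50


def validate_input(raw: dict) -> dict:
    """Hypothetical helper: check an input dict and fill in documented defaults."""
    url = raw.get("url")
    if not isinstance(url, str):
        raise ValueError("url is required and must be a string")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")

    prompt = raw.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt is required and must be a non-empty string")

    max_pages = raw.get("maxPages", MIN_PAGES)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not MIN_PAGES <= max_pages <= MAX_PAGES:
        raise ValueError(f"maxPages must be between {MIN_PAGES} and {MAX_PAGES}")

    cleaned = {
        "url": url,
        "prompt": prompt.strip(),
        "model": raw.get("model", DEFAULT_MODEL),
        "maxPages": max_pages,
    }
    # Optional fields are passed through only when they are set.
    for key in ("apiKey", "waitForSelector"):
        if raw.get(key):
            cleaned[key] = raw[key]
    return cleaned
```

Calling `validate_input` on the example input returns the same values unchanged; an input that omits `model` and `maxPages` gets `gpt-4o-mini` and `1` filled in.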

📤 Output

Each extracted item is pushed to the Apify Dataset as a separate record with these standard fields:

| Field | Type | Description |
| --- | --- | --- |
| title | string | Title or name of the extracted item |
| description | string | Description or summary |
| price | string | Price value if available |
| url | string | Source URL of the item |
| image_url | string | Image URL if available |
| rating | number | Rating score (0–5 scale) |
| reviews_count | integer | Number of reviews |
| availability | string | Availability status |
| category | string | Category or type |
| source_page | string | Page where item was found |
| extracted_at | datetime | ISO timestamp of extraction |

⚠️ Note: Field names are dynamic — GPT determines them based on your prompt. The schema above covers common extraction patterns for products/listings.

Example Output

[
  {
    "title": "Wireless Headphones Pro",
    "price": "$79.99",
    "rating": 4.5,
    "reviews_count": 1234,
    "url": "https://example.com/products/wireless-headphones-pro"
  },
  {
    "title": "Bluetooth Speaker",
    "price": "$49.99",
    "rating": 4.2,
    "reviews_count": 856,
    "url": "https://example.com/products/bluetooth-speaker"
  }
]
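
Because field names depend on the prompt, records from one run may not all share the same keys. The sketch below shows one way to flatten such records into a CSV with Python's standard `csv` module. `items_to_csv` is a hypothetical helper, not part of the Actor, and the sample records (including the `availability` value) are illustrative only.

```python
import csv
import io


def items_to_csv(items: list[dict]) -> str:
    """Serialize records with differing keys into a single CSV string."""
    # Collect column names in first-seen order so the header is stable.
    columns: list[str] = []
    for item in items:
        for key in item:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval fills in an empty cell when a record lacks a column.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# Illustrative records with partly different keys.
sample = [
    {"title": "Wireless Headphones Pro", "price": "$79.99", "rating": 4.5},
    {"title": "Bluetooth Speaker", "price": "$49.99", "availability": "In stock"},
]
print(items_to_csv(sample))
```

Missing values come out as empty cells, so every row has the same number of columns regardless of which fields the model returned for that item.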

🧪 How to Use

Option 1: Run via Apify Console

  1. Go to Apify Console
  2. Find "AI Web Scraper" in the Store
  3. Click "Try for free" or "Run Actor"
  4. Enter your URL and extraction prompt
  5. Click "Run" — results appear in the Dataset

Option 2: Run via API

curl -X POST "https://api.apify.com/v2/acts/gek0v~ai-web-scraper/runs" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products",
    "prompt": "Extract product names and prices",
    "model": "gpt-4o-mini"
  }'

Option 3: Python SDK

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("your-apify-token")

# Start the Actor and wait for the run to finish
run = client.actor("gek0v/ai-web-scraper").call(run_input={
    "url": "https://example.com",
    "prompt": "Extract all article titles and authors",
    "model": "gpt-4o-mini",
})

# Print each record from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

💰 Pricing

| Component | Cost |
| --- | --- |
| Actor Compute (Actor Start) | ~$0.000002/run (based on memory allocation) |
| Dataset Storage | ~$0.002 per stored item |
| Platform Fee | 20% of compute + storage costs |
| OpenAI GPT API | Passed directly to user at model pricing |

💡 Typical cost per run: Most extractions cost < $0.01 (with gpt-4o-mini) plus ~$0.002 per extracted item stored.
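
To see how these components add up, here is a rough estimate written as a short Python function. It is a sketch under stated assumptions: the rates are the approximate figures listed above, the 20% platform fee is assumed to be added on top of compute plus storage, and the model cost is whatever your API provider bills. `estimate_run_cost` is a hypothetical helper name.

```python
# Approximate rates from the pricing list; actual billing may differ.
COMPUTE_PER_RUN_USD = 0.000002
STORAGE_PER_ITEM_USD = 0.002
PLATFORM_FEE_RATE = 0.20


def estimate_run_cost(items_stored: int, llm_cost_usd: float = 0.0) -> float:
    """Hypothetical helper: estimate the total cost of one run in USD."""
    if items_stored < 0 or llm_cost_usd < 0:
        raise ValueError("items_stored and llm_cost_usd must be non-negative")
    # Compute and storage form the base the platform fee is applied to.
    platform_base = COMPUTE_PER_RUN_USD + STORAGE_PER_ITEM_USD * items_stored
    # Assumption: the fee is added on top; the model cost is billed separately.
    return platform_base * (1 + PLATFORM_FEE_RATE) + llm_cost_usd


# Example: 50 stored items and one cent of model usage.
print(f"${estimate_run_cost(50, llm_cost_usd=0.01):.4f}")
```

With these rates the storage term dominates as soon as a run stores more than a handful of items, so the number of extracted records matters more to the bill than the per-run compute charge.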


🔧 Local Development

# Clone
git clone https://github.com/gek0v/ai-web-scraper.git
cd ai-web-scraper
# Install dependencies
pip install -r requirements.txt
# Run locally
python src/main.py --input '{"url": "https://example.com", "prompt": "Extract all headings"}'

📝 Tips for Best Results

  1. Be specific in your prompt: "Extract product name, price in USD, and star rating" works better than "extract product info"
  2. Test with gpt-4o-mini first — It's 10x cheaper and often good enough. Upgrade to gpt-4o for complex pages
  3. Use waitForSelector — For dynamic SPAs (React, Vue, Angular), wait for the content container
  4. Limit maxPages — Start with 1 page to test, then scale up
  5. Provide your API key — Set OPENAI_API_KEY env var or pass via input

⚠️ Limitations

  • Very large pages (>100K chars) are truncated to fit GPT's context window
  • JavaScript-heavy SPAs may need waitForSelector for rendering
  • Some anti-bot protections (Cloudflare, etc.) may block access
  • GPT costs are passed through to the user (OpenAI/compatible API pricing applies)
  • Requires an OpenAI-compatible API key (not included)
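
The truncation limit in the first point can be illustrated with a small sketch. How the Actor actually truncates is not documented here, so this is only one plausible approach: `truncate_page_text` is a hypothetical helper that cuts at the last space before the limit and reports whether anything was dropped.

```python
MAX_CHARS = 100_000  # limit quoted in the list above


def truncate_page_text(text: str, max_chars: int = MAX_CHARS) -> tuple[str, bool]:
    """Return the (possibly shortened) text and whether truncation happened."""
    if len(text) <= max_chars:
        return text, False
    # Prefer cutting at the last space before the limit so no word is split.
    cut = text.rfind(" ", 0, max_chars)
    if cut <= 0:
        # No usable space found: fall back to a hard cut at the limit.
        cut = max_chars
    return text[:cut], True


page_text = "word " * 30_000  # 150,000 characters of filler
shortened, was_truncated = truncate_page_text(page_text)
print(len(shortened), was_truncated)
```

If a page is truncated, items near the end of the page may be missing from the output; narrowing the prompt or targeting a more specific URL is usually a better fix than relying on the cut.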

📄 License

MIT License — free to use and modify.


🏷️ Tags

web-scraping artificial-intelligence data-extraction playwright gpt automation developer-tools