Universal AI GPT Scraper

louisdeconinck/ai-gpt-scraper
Transform any website into structured data with AI-powered extraction. This Actor combines web scraping with intelligent content analysis to deliver clean, customized JSON output, making it easy to automate data collection from any web source.

Simply specify the fields you need, and an advanced AI model parses the page content into clean, structured JSON, saving you hours of manual data collection and processing. It is aimed at businesses and developers who need reliable, automated data extraction without complex coding or maintenance.

Use Cases

  • Extract product information from e-commerce sites
  • Gather pricing data from service providers
  • Collect structured data from blog posts or articles
  • Extract specific fields from documentation pages
  • Convert any web content into structured JSON data

Features

  • 🎯 Custom Field Extraction: Define exactly what fields you want to extract
  • 🤖 AI-Powered: Uses advanced language models to understand and extract content
  • 📊 Structured Output: Get clean JSON or CSV data with your specified fields
  • 🔄 Type Support: Specify the data type for each field (string, number, boolean, etc.)
  • 🎛️ Model Selection: Choose from predefined AI models or use your own
  • 🎯 CSS Selector Support: Target specific page elements using CSS selectors
  • 🔒 Secure: Support for secret API keys and proxy configuration

Input Configuration

Required Fields

  • URLs (array): List of web pages to scrape
  • Fields (array): Specification of fields to extract (see the sketch after this list), each containing:
    • name: Field name in the output
    • description: Description to guide the AI, be as specific and descriptive as possible
    • type: Data type (string, number, boolean, array, object)
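
A minimal sketch of these two required inputs, written as a TypeScript object; the URL and field definitions are placeholders for illustration, not part of the actor's documentation:

// Minimal sketch of the required inputs described above.
// The URL and field definitions are placeholders; adapt them to your target pages.
const input = {
    urls: [
        'https://example.com/some-product-page',
    ],
    fields: [
        // Each field tells the AI what to extract and which data type to return.
        { name: 'title', description: 'The main product title', type: 'string' },
        { name: 'price', description: 'The price as a plain number, without currency symbol', type: 'number' },
        { name: 'inStock', description: 'Whether the product is currently in stock', type: 'boolean' },
    ],
};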

Content Extraction Options

  • CSS Selector (string, optional): CSS selector used to target specific elements on the page. This can greatly reduce the AI cost by cutting down the number of input tokens, and it can also improve accuracy. If provided, only text from elements matching this selector is extracted; if not, the default content extraction is used. This is an advanced option: if you are not familiar with CSS selectors, leave it empty. Inspect the HTML of the page to find the correct selector (see the example selectors and the DevTools snippet below).

Example CSS selectors:

  • main: selects elements with tag "main".
  • #price: selects elements with id "price".
  • .product-details-container .price, .product-details-container .description: selects elements with class "price" and "description" that are descendants of elements with class "product-details-container".
  • article.main-story, .article-body > p: selects elements with tag "article" and class "main-story", as well as direct child "p" elements under elements with class "article-body".
  • .documentation-content h2, .documentation-content .method-signature: selects "h2" elements and elements with class "method-signature" that are descendants of elements with class "documentation-content".
  • .post-container[data-type="user-post"] .content: selects elements with class "content" that are descendants of elements with both class "post-container" and data-type attribute "user-post".
  • #product-listing div.item:not(.ad) .details h3, .price-info span.current-price: selects "h3" elements under elements with class "details" that are descendants of "div" elements with class "item" but not class "ad" under element with ID "product-listing", as well as "span" elements with class "current-price" under elements with class "price-info".
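
Before putting a selector into the input, you can verify it in your browser. The following sketch can be pasted into the DevTools console on the target page; it prints the text of all matching elements, which is roughly the text that would be sent to the AI model (the selector shown is just an example):

// Paste into the browser DevTools console on the page you want to scrape.
// Prints the text of every matching element so you can check that the selector
// captures the data you care about and nothing else.
const selector = '.product-details-container .price, .product-details-container .description';
const matches = Array.from(document.querySelectorAll(selector));
console.log(`Matched ${matches.length} element(s)`);
console.log(matches.map((el) => el.textContent?.trim()).join('\n---\n'));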

AI Model Configuration

You can either use one of our predefined models, which we have verified to work well, or specify your own model from OpenRouter. If you use a predefined model, you don't have to bring your own API key: we cover the AI cost and charge it to you through your Apify usage. If you bring your own OpenRouter API key, you will not be charged for the AI cost. Your API key is stored securely and encrypted by Apify.

After some testing we found Google Gemini Flash 2.0 to give the best quality for the lowest price.

Free Apify users can only process 1 URL every 24 hours with predefined models, so they can test out this functionality. To go beyond that, you will have to either upgrade your Apify account to a paid subscription tier to use predefined models, or bring your own OpenRouter API key.

Option 1: Predefined Models

  • Predefined Model (string): Choose from supported models:
    • Google Gemini Flash 1.5
    • Google Gemini Flash 2.0 (recommended)
    • OpenAI GPT-4o-mini
    • Google Gemini Pro 1.5
    • OpenAI GPT-4o

Option 2: Custom Model

  • Use Custom Model (boolean): Toggle to use your own model
  • Custom Model Name (string): OpenRouter model identifier, e.g. google/gemini-2.0-flash-001
  • OpenRouter API Key (string): Your API key for custom model access (stored encrypted)

Make sure your model supports structured outputs. Check model compatibility at: https://openrouter.ai/models?supported_parameters=structured_outputs
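
As a rough sketch of what a custom-model run could look like, the input below reuses the keys from the example input further down (useCustomModel, predefinedModel); the customModelName and openRouterApiKey property names are assumptions, so check the actor's input schema in the Apify Console for the exact keys:

// Rough sketch of an input using a custom OpenRouter model.
// "customModelName" and "openRouterApiKey" are assumed property names --
// verify them against the actor's input schema before running.
const customModelInput = {
    urls: ['https://example.com/article'],
    fields: [
        { name: 'headline', description: 'The article headline', type: 'string' },
    ],
    useCustomModel: true,
    customModelName: 'google/gemini-2.0-flash-001', // any OpenRouter model with structured outputs
    openRouterApiKey: 'sk-or-...',                  // your key, stored encrypted by Apify
};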

Proxies

  • Proxy Configuration (object): Configure proxy settings for web scraping

Example input

{
    "urls": [
        "https://apify.com/clockworks/free-tiktok-scraper"
    ],
    "fields": [
        {
            "name": "name",
            "description": "The name/title of the scraper tool",
            "type": "string"
        },
        {
            "name": "price",
            "description": "The price per 1000 results, only the number",
            "type": "number"
        },
        {
            "name": "author",
            "description": "The author or maintainer of the scraper",
            "type": "string"
        }
    ],
    "cssSelector": "main > article",
    "useCustomModel": false,
    "predefinedModel": "google/gemini-2.0-flash-001",
    "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ]
    }
}
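
If you prefer to run the actor programmatically rather than from the Apify Console, a sketch along these lines should work with the apify-client package for Node.js (the token is a placeholder; the actor ID is the one shown at the top of this page):

// Sketch: running the actor from Node.js with the apify-client package
// (npm install apify-client). Replace the token placeholder with your own.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: '<YOUR_APIFY_TOKEN>' });

// Start the run with a minimal input and wait for it to finish.
const run = await client.actor('louisdeconinck/ai-gpt-scraper').call({
    urls: ['https://apify.com/clockworks/free-tiktok-scraper'],
    fields: [
        { name: 'name', description: 'The name/title of the scraper tool', type: 'string' },
        { name: 'price', description: 'The price per 1000 results, only the number', type: 'number' },
    ],
    useCustomModel: false,
    predefinedModel: 'google/gemini-2.0-flash-001',
});

// Each input URL becomes one item in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);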

Output

The actor outputs a dataset where each item contains:

  • url: The source URL
  • Custom fields as specified in your input configuration

Example output:

{
    "url": "https://apify.com/clockworks/free-tiktok-scraper",
    "author": "Clockworks",
    "name": "TikTok Data Extractor",
    "price": 4
}

Cost

There are three costs to using this actor: a startup cost, a cost per result, and the AI cost. We split the pricing up like this to keep it as competitive as possible.

  • There's a one-time charge of $0.05 (5 cents) every time you start an actor run. This covers server startup time.
  • Every result pushed to the dataset (= every input URL) is charged at $0.001 (1/10th of a cent).
  • If you use a predefined model, you are also charged per 1,000 tokens depending on the AI model used. If you bring your own API key, this charge does not apply.
    • Google Gemini Flash 1.5: $0.0006 / 1,000 tokens (6/100th of a cent)
    • Google Gemini Flash 2.0: $0.0008 / 1,000 tokens (8/100th of a cent, best value)
    • OpenAI GPT-4o-mini: $0.0012 / 1,000 tokens (12/100th of a cent)
    • Google Gemini Pro 1.5: $0.02 / 1,000 tokens (2 cents)
    • OpenAI GPT-4o: $0.04 / 1,000 tokens (4 cents)

You can check how many tokens are in a given text with the OpenAI Tokenizer: https://platform.openai.com/tokenizer. As a rough rule of thumb, 1,000 tokens corresponds to about 750 English words.
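
To make this concrete, here is a small sketch that estimates the total charge for a run on the predefined Gemini Flash 2.0 model. The number of URLs and the tokens-per-page figure are assumptions for illustration; real pages vary widely, especially if you use a CSS selector to trim the input:

// Back-of-the-envelope cost estimate for a run using a predefined model.
const startupCost = 0.05;          // charged once per actor run
const costPerResult = 0.001;       // charged per input URL / dataset item
const tokenPrice = 0.0008 / 1000;  // Gemini Flash 2.0: $0.0008 per 1,000 tokens

const urlCount = 100;              // assumed number of input URLs
const tokensPerPage = 3000;        // assumed average tokens per page

const total =
    startupCost +
    urlCount * costPerResult +
    urlCount * tokensPerPage * tokenPrice;

console.log(`Estimated cost: $${total.toFixed(2)}`); // about $0.39 under these assumptions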

Limitations

  • The AI models require clear, well-structured content for best results
  • Some models may have token limits affecting the amount of text they can process
  • Custom models must support structured output format
  • Rate limits may apply based on the chosen AI provider

Cost of Usage

  • When using predefined models, AI costs are covered by the actor and billed through your Apify usage (see the Cost section above)
  • Custom model usage requires your own OpenRouter API key and credits
  • Standard Apify platform charges apply (proxy usage if enabled)

Tips for Best Results

  1. Be specific in your field descriptions
  2. Choose appropriate data types for each field
  3. Test with a small number of URLs first
  4. Use the model that best fits your needs (faster models for simple extraction, more powerful models for complex tasks)
  5. Consider using proxies when scraping at scale
  6. Use CSS selectors when you know exactly which elements contain the relevant information
  7. Test your CSS selectors first in browser DevTools to ensure they match the desired elements

Technical Details

  • Built with TypeScript
  • Uses Crawlee for web scraping
  • Integrates with OpenRouter for AI processing
  • Supports structured output with JSON schema validation (see the illustration after this list)
  • Includes automatic error handling and retries
  • Supports both default content extraction and CSS selector-based extraction
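
As a rough illustration of the JSON schema validation mentioned above (this is an assumption about the internals, not the actor's actual code), field definitions can be mapped to a structured-output response format in the style OpenRouter supports:

// Illustration only: an assumed mapping from the "fields" input to a JSON schema
// for structured outputs. The actor's real request may differ.
const fields = [
    { name: 'name', description: 'The name/title of the scraper tool', type: 'string' },
    { name: 'price', description: 'The price per 1000 results, only the number', type: 'number' },
];

const responseFormat = {
    type: 'json_schema',
    json_schema: {
        name: 'extraction',
        strict: true,
        schema: {
            type: 'object',
            properties: Object.fromEntries(
                fields.map((f) => [f.name, { type: f.type, description: f.description }]),
            ),
            required: fields.map((f) => f.name),
            additionalProperties: false,
        },
    },
};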