Pay $9.00 for 1,000 pages

GPT Scraper

drobnikj/gpt-scraper

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

GPT Scraper is a powerful tool that leverages OpenAI's API to modify text obtained from a scraper. You can use the scraper to extract content from a website and then pass that content to the OpenAI API to make the GPT magic happen.

How does GPT Scraper work?

The scraper first loads the page using Playwright, then converts the content to Markdown and sends it to GPT together with your instructions.

If the content doesn't fit into the GPT limit, the scraper will truncate the content. You can find the message about truncated content in the log.

How much does it cost?

GPT Scraper costs $0.009 per processed page. This price also includes the cost of the OpenAI API. A free Apify account gives you $5 free usage credit each month, so you can scrape up to 555 pages for free.
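The pricing arithmetic above can be checked with a short calculation. This is only a sketch: the constants come from the figures quoted in this section, and the function name pages_for_budget is made up for illustration.

```python
from decimal import Decimal

# Figures quoted in this README; adjust if the pricing changes.
PRICE_PER_PAGE_USD = Decimal("0.009")
FREE_MONTHLY_CREDIT_USD = Decimal("5.00")


def pages_for_budget(budget_usd: Decimal, price_per_page: Decimal = PRICE_PER_PAGE_USD) -> int:
    """Return how many whole pages a budget covers (hypothetical helper)."""
    return int(budget_usd // price_per_page)


def cost_for_pages(pages: int, price_per_page: Decimal = PRICE_PER_PAGE_USD) -> Decimal:
    """Return the total cost in USD for a number of processed pages."""
    return price_per_page * pages


free_pages = pages_for_budget(FREE_MONTHLY_CREDIT_USD)
thousand_page_cost = cost_for_pages(1000)
print(free_pages, thousand_page_cost)
```

Decimal is used instead of float so that the division is not affected by binary rounding.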

Extended version

If you are looking for a more powerful GPT Scraper that lets you select the GPT model you want to use and provides more features, check out Extended GPT Scraper.

How to use GPT Scraper

To get started with GPT Scraper, set up the pages you want to scrape using Start URLs, then set up instructions that tell GPT Scraper how to handle each page. For instance, you can use a simple scraper to load the URL https://news.ycombinator.com/ and instruct GPT to extract information from it.

You can configure the scraper and GPT using Input configuration to set up a more complex workflow.

Input configuration

GPT Scraper accepts a number of configuration settings. These can be entered either manually in the user interface in Apify Console or programmatically in a JSON object using the Apify API. For a complete list of input fields and their types, please see the outline of the Actor's input schema.
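As a rough illustration of the JSON object mentioned above, the sketch below builds an input in Python and serializes it. The field names startUrls, linkSelector, globs and proxyConfiguration appear in this README; the names instructions, maxCrawlingDepth and maxPagesPerCrawl, and the exact shape of each value, are assumptions and should be checked against the Actor's input schema.

```python
import json

# Hypothetical input object; verify every field against the input schema.
run_input = {
    # Named in this README. The list-of-objects shape is an assumption.
    "startUrls": [{"url": "https://news.ycombinator.com/"}],
    # Assumed field name for "Instructions and prompts for GPT".
    "instructions": "Summarize this page in three sentences.",
    # Named in this README: CSS selector for links to follow.
    "linkSelector": "a[href]",
    # Named in this README. Plain strings are an assumption.
    "globs": ["https://news.ycombinator.com/**/*"],
    # Assumed field names for "Max crawling depth" and "Max pages per run".
    "maxCrawlingDepth": 1,
    "maxPagesPerCrawl": 10,
    # Named in this README. The useApifyProxy key is an assumption.
    "proxyConfiguration": {"useApifyProxy": True},
}

# This JSON string is what would be sent as the input body.
input_json = json.dumps(run_input, indent=2)
print(input_json)
```

The same object can be pasted into the JSON view of the input form in Apify Console.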

Start URLs

The Start URLs (startUrls) field represents the initial list of page URLs that the scraper will visit. You can enter a group of URLs together using file upload or one by one.

The scraper supports adding new URLs to scrape on the fly, using the Link selector and Glob patterns options.

Link selector

The Link selector (linkSelector) field contains a CSS selector that is used to find links to other web pages (items with href attributes, e.g. <div class="my-class" href="...">).

On every page that is loaded, the scraper looks for all links matching Link selector, and checks that the target URL matches one of the Glob patterns. If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.

If Link selector is empty, the page links are ignored, and the scraper only loads pages specified in Start URLs.

Glob patterns

The Glob patterns (globs) field specifies which types of URLs found by Link selector should be added to the request queue.

A glob pattern is simply a string with wildcard characters.

For example, a glob pattern http://www.example.com/pages/**/* will match all the following URLs:

http://www.example.com/pages/deeper-level/page
http://www.example.com/pages/my-awesome-page
http://www.example.com/pages/something
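To make the matching above concrete, here is a simplified glob matcher. It is not the matcher the scraper uses internally; it is a sketch that assumes `**/` matches zero or more path segments, `**` matches anything, and `*` matches within a single path segment. The function names are made up for illustration.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more whole path segments
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")  # a single non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # literal character
            i += 1
    return re.compile("".join(parts))


def url_matches(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the glob pattern."""
    return glob_to_regex(pattern).fullmatch(url) is not None


pattern = "http://www.example.com/pages/**/*"
for url in [
    "http://www.example.com/pages/deeper-level/page",
    "http://www.example.com/pages/my-awesome-page",
    "http://www.example.com/pages/something",
    "http://www.example.com/other/page",
]:
    print(url, url_matches(pattern, url))
```

Under these assumptions the three example URLs match and the last one, which lies outside /pages/, does not.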

Instructions and prompts for GPT

This option tells GPT how to handle page content. For example, you can send the following prompts:

"Summarize this page in three sentences."
"Find all sentences that contain 'Apify Proxy' and return them as a list."

You can also instruct OpenAI to answer with "skip this page" if you don't want to process all the scraped content, e.g.

"Summarize this page in three sentences. If the page is about proxies, answer with 'skip this page'.".
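The effect of the "skip this page" answer can be pictured as a filter over GPT answers. How the scraper compares the answer internally is not documented here, so the normalization below (ignoring case, surrounding quotes and a trailing period) is an assumption, and should_skip is a hypothetical helper.

```python
SKIP_MARKER = "skip this page"


def should_skip(answer: str) -> bool:
    """Return True if the GPT answer asks for the page to be left out."""
    # Assumed normalization: trim whitespace, quotes and periods, ignore case.
    normalized = answer.strip().strip(".'\"").lower()
    return normalized == SKIP_MARKER


answers = [
    {"url": "https://example.com/a", "answer": "A short summary of page A."},
    {"url": "https://example.com/proxies", "answer": "Skip this page."},
]

# Keep only the pages whose answer is not the skip marker.
kept = [item for item in answers if not should_skip(item["answer"])]
print([item["url"] for item in kept])
```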

Max crawling depth

This specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers.

Max pages per run

The maximum number of pages that the scraper will open. 0 means unlimited.
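The two limits above can be illustrated with a small breadth-first crawl over an in-memory link graph. This is a sketch and not the scraper's actual crawling code: the graph stands in for real page loads, and treating a depth of 0 as "start URLs only" is an assumption. Only the "0 means unlimited" rule for pages comes from this README.

```python
from collections import deque


def crawl(start_urls, links, max_depth, max_pages):
    """Visit pages breadth-first, honoring depth and page limits.

    links maps each URL to the URLs it links to (a stand-in for page loads).
    max_pages == 0 means unlimited, as described above.
    """
    visited = []
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    while queue:
        if max_pages and len(visited) >= max_pages:
            break  # page budget exhausted
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site structure used only for this example.
site = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
print(crawl(["a"], site, max_depth=1, max_pages=0))
print(crawl(["a"], site, max_depth=3, max_pages=2))
```

With a depth limit of 1 only the start page and its direct links are visited; with a page limit of 2 the crawl stops after two pages even though deeper links exist.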

Formatted output

If you want to get data in a structured format, you can define a JSON schema using the Schema input option and enable the Use JSON schema to format answer option. This schema will be used to format the data into a structured JSON object, which will be stored in the output in the jsonAnswer attribute.
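A schema for the Schema input option might look like the sketch below, written here as a Python dictionary. The attribute names are borrowed from the contact details example in this README; the choice of required fields is an assumption, and missing_required is a hypothetical helper that checks required keys only, not a full JSON Schema validator.

```python
import json

# Hypothetical JSON schema describing the desired jsonAnswer object.
contact_schema = {
    "type": "object",
    "properties": {
        "companyEmail": {"type": "string"},
        "companyWeb": {"type": "string"},
        "githubUrl": {"type": "string"},
    },
    "required": ["companyEmail"],
}


def missing_required(schema: dict, obj: dict) -> list:
    """Return the required keys that are absent from obj (illustration only)."""
    return [key for key in schema.get("required", []) if key not in obj]


# Example of what a jsonAnswer matching the schema could contain.
json_answer = {"companyEmail": "hello@apify.com", "companyWeb": "https://apify.com"}

print(json.dumps(contact_schema, indent=2))
print(missing_required(contact_schema, json_answer))
```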

Proxy configuration

The Proxy configuration (proxyConfiguration) option enables you to set proxies. The scraper will use these to prevent its detection by target websites. You can use both Apify Proxy and custom HTTP or SOCKS5 proxy servers.

Limits

The GPT model itself has a limit on the amount of content it can handle (i.e. maximum token limit). The scraped content will be truncated when this limit is reached. If you are looking for a more powerful version that lets you use more than 4096 tokens, you can check out Extended GPT Scraper.
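The truncation described above can be sketched as follows. This is only an approximation: real GPT token counts come from the model's tokenizer, whereas this sketch treats whitespace-separated words as a rough stand-in for tokens. The function name and the returned flag are made up for illustration.

```python
def truncate_to_limit(text: str, max_tokens: int = 4096):
    """Cut text to at most max_tokens pseudo-tokens.

    Returns the (possibly shortened) text and a flag saying whether it was cut.
    Words are used as a rough stand-in for real model tokens.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return text, False
    return " ".join(words[:max_tokens]), True


content = "word " * 5000
shortened, was_truncated = truncate_to_limit(content)
if was_truncated:
    # The scraper reports truncation in its log; here we simply print.
    print(f"Content truncated to {len(shortened.split())} pseudo-tokens")
```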

Tips & tricks

Here are a few hidden features that you might find helpful.

Skip pages from the output

You can skip pages from the output by asking GPT to answer with "skip this page", for example:

"Summarize this page in three sentences. If the page is about proxies, answer with 'skip this page'.".

Structured data answer with JSON [DEPRECATED]

Deprecated: Use Schema input option instead.

You can instruct GPT to answer with JSON, and the scraper under the hood will parse this JSON and store it as a structured answer, for example:

"Find all links on this page and return them as JSON. There will be one attribute, links, containing an array of URLs."
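The parsing step described above can be pictured with the sketch below. It is an assumption about the behavior, not the scraper's actual code: the raw answer string is parsed as JSON, and anything that is not a JSON object or array yields None, which matches the null jsonAnswer shown in the plain-text examples in this README.

```python
import json


def parse_json_answer(answer: str):
    """Return the parsed JSON object or array, or None if the answer is plain text."""
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return None
    # Bare numbers or strings are not treated as structured answers here.
    return parsed if isinstance(parsed, (dict, list)) else None


raw_answer = '{"links": ["https://apify.com", "https://blog.apify.com"]}'
result = {
    "url": "https://apify.com/contact",
    "answer": raw_answer,
    "jsonAnswer": parse_json_answer(raw_answer),
}
print(result["jsonAnswer"])
```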

Example usage

Here are some example use cases that you can use as a starting point for your own GPT scraping experiments.

Summarize a page

Start URL:

https://en.wikipedia.org/wiki/COVID-19_pandemic

Instructions for GPT:

Summarize this page in three sentences.

Results:

[
  {
    "url": "https://en.wikipedia.org/wiki/COVID-19_pandemic",
    "answer": "This page on Wikipedia provides comprehensive information on the COVID-19 pandemic, including its epidemiology, disease symptoms and prevention strategies. The page also covers the history of the pandemic, national responses, and other measures taken by organizations such as the WHO and UN. The information is organized through a series of subsections for easy navigation.",
    "jsonAnswer": null
  }
]

Extract keywords from a blog post

Start URL:

https://blog.apify.com/step-by-step-guide-to-scraping-amazon/

Prompt for GPT:

Extract keywords from this blog post.

Results:

[{
  "url": "https://blog.apify.com/step-by-step-guide-to-scraping-amazon/",
  "answer": "Keywords: Web Scraping, Tutorial, Apify, Updates, Automation, Data Extraction, Ecommerce, Amazon, Product Data, API, Title, URL, Descriptions, Features, Prices, Images, Seller, Stock Status, ASINs, Proxy, Scraping."
}]

Summarize reviews of movies, games, or products

Start URL:

https://www.imdb.com/title/tt10366206/reviews

Instructions for GPT:

Analyze all user reviews for this movie and summarize the consensus.

Results:

[{
  "url": "https://www.imdb.com/title/tt10366206/reviews",
  "answer": "The consensus among user reviews for John Wick: Chapter 4 (2023) is that it delivers exceptional action scenes and lives up to the high standards set by the previous films in the franchise. Many users praised the creativity and variety of the fight scenes, and Donnie Yen's performance in particular. Some noted minor flaws, such as an anticlimactic ending and a subplot with a tracker that did not feel consequential. Overall, users highly recommended the film to fans of the series and action movies in general."
}]

Find contact details on a web page

Start URL:

https://apify.com/contact

Instructions for GPT:

Please find contact details on this page and return them as JSON.
There will be attributes, companyEmail, companyWeb, githubUrl, twitterUrl,
vatId, businessId and backAccountNumber.

Results:

[
  {
    "url": "https://apify.com/contact",
    "answer": "{\n    \"companyEmail\": \"hello@apify.com\",\n    \"companyWeb\": \"https://apify.com\",\n    \"githubUrl\": \"https://github.com/apify\",\n    \"twitterUrl\": \"https://twitter.com/apify\",\n    \"vatId\": \"CZ04788290\",\n    \"businessId\": \"04788290\",\n    \"backAccountNumber\": \"CZ0355000000000027434378\"\n}",
    "jsonAnswer": {
      "companyEmail": "hello@apify.com",
      "companyWeb": "https://apify.com",
      "githubUrl": "https://github.com/apify",
      "twitterUrl": "https://twitter.com/apify",
      "vatId": "CZ04788290",
      "businessId": "04788290",
      "backAccountNumber": "CZ0355000000000027434378"
    }
  }
]

Other suggested use cases

Find typos and grammatical errors across your entire website
Analyze competing content to find keywords or ideas
Examine code examples in content to find errors or suggest improvements

Developer

Jakub Drobník

Actor metrics

201 monthly users
39 stars
99.1% runs succeeded
Created in Mar 2023
Modified 2 days ago

Categories

Lead generation

Business

