Pay $9.00 for 1,000 pages

GPT Scraper

drobnikj/gpt-scraper

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

Start URLs

startUrls (array, required)

A static list of URLs to scrape.

For details, see Start URLs in README.

Instructions for GPT

instructions (string, required)

Instruct GPT how to generate text. For example: "Summarize this page in three sentences."

You can instruct GPT to answer with "skip this page", which will skip the page. For example: "Summarize this page in three sentences. If the page is about Apify Proxy, answer with 'skip this page'."
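A minimal input sketch for the two required fields might look like the following. It assumes that each Start URL is given as an object with a url key, which is the usual shape of Apify request lists but is not confirmed on this page; the URL itself is a placeholder.

```python
import json

# Hypothetical minimal input: only the two required fields are set.
run_input = {
    # Assumption: each Start URL is an object with a "url" key.
    "startUrls": [{"url": "https://example.com/blog"}],
    "instructions": (
        "Summarize this page in three sentences. "
        "If the page is about Apify Proxy, answer with 'skip this page'."
    ),
}

# Serialize to the JSON form that would be pasted into the Actor input.
print(json.dumps(run_input, indent=2))
```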

Include URLs (globs)

includeUrlGlobs (array, optional)

Glob patterns matching URLs of pages that will be included in crawling. Combine them with the link selector to tell the scraper where to find links. You need to use both globs and link selector to crawl further pages.

Default value of this property is []

Exclude URLs (globs)

excludeUrlGlobs (array, optional)

Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.

Default value of this property is []
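The interplay of include and exclude globs can be sketched as below. This is a rough approximation for illustration, not the Actor's actual matcher: glob_to_regex and should_follow are hypothetical helpers, and the translation only handles `**`, `*`, and `?`. It applies only to links found on pages, since Start URLs are always crawled.

```python
import re

def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simple URL glob to a regex (approximation only)."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")       # '**' crosses path separators
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")    # '*' stays within one path segment
            i += 1
        elif glob[i] == "?":
            parts.append("[^/]")     # '?' matches a single non-slash character
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))

def should_follow(url, include_globs, exclude_globs):
    """A found link is followed if some include glob matches and no exclude glob does."""
    included = any(glob_to_regex(g).fullmatch(url) for g in include_globs)
    excluded = any(glob_to_regex(g).fullmatch(url) for g in exclude_globs)
    return included and not excluded

include = ["https://example.com/blog/**"]
exclude = ["https://example.com/blog/tag/**"]
print(should_follow("https://example.com/blog/2023/post", include, exclude))
print(should_follow("https://example.com/blog/tag/news", include, exclude))
```

With an empty include list, should_follow returns False for every link, which mirrors the note that globs are needed to crawl further pages.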

Max crawling depth

maxCrawlingDepth (integer, optional)

This specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers.

If set to 0, there is no limit.

Default value of this property is 99999999

Max pages per run

maxPagesPerCrawl (integer, optional)

Maximum number of pages that the scraper will open. 0 means unlimited.

Default value of this property is 10
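How the depth and page limits bound a crawl can be illustrated with a toy breadth-first simulation over an in-memory link graph. This is a sketch under stated assumptions, not the Actor's implementation: simulate_crawl and the links graph are hypothetical, Start URLs are treated as depth 0, and 0 means "no limit" for both settings, as described above.

```python
from collections import deque

def simulate_crawl(links, start_urls, max_depth=0, max_pages=0):
    """Return the pages opened, in order, under the two limits (0 = unlimited)."""
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    opened = []
    while queue:
        if max_pages and len(opened) >= max_pages:
            break                      # page budget exhausted
        url, depth = queue.popleft()
        opened.append(url)
        if max_depth and depth >= max_depth:
            continue                   # do not descend below the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return opened

# Hypothetical site: a -> b, c; b -> d; d -> e
links = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
print(simulate_crawl(links, ["a"], max_depth=1))   # one link away from the start
print(simulate_crawl(links, ["a"], max_pages=2))   # at most two pages opened
```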

Link selector

linkSelector (string, optional)

This is a CSS selector that says which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Include URLs (globs) and Exclude URLs (globs) settings.

If Link selector is empty, the page links are ignored.

For details, see Link selector in README.

Initial cookies

initialCookies (array, optional)

Cookies that will be pre-set on all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with name, value, domain, and path properties. For example: [{"name": "cookieName", "value": "cookieValue", "domain": ".domain.com", "path": "/"}].

You can use the EditThisCookie browser extension to copy browser cookies in this format and paste them here.

Default value of this property is []
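Because a malformed cookie array is easy to produce by hand, it can help to check the four expected properties before pasting the JSON. The sketch below uses a hypothetical helper, check_cookies, and placeholder cookie values; it only verifies that each object carries the name, value, domain, and path keys.

```python
import json

REQUIRED_KEYS = {"name", "value", "domain", "path"}

def check_cookies(cookies):
    """Return the cookies unchanged, or raise if one lacks a required key."""
    for index, cookie in enumerate(cookies):
        missing = REQUIRED_KEYS - cookie.keys()
        if missing:
            raise ValueError(f"cookie {index} is missing {sorted(missing)}")
    return cookies

initial_cookies = check_cookies([
    {"name": "cookieName", "value": "cookieValue", "domain": ".domain.com", "path": "/"},
])

# The serialized array is what goes into the initialCookies field.
print(json.dumps(initial_cookies))
```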

Proxy configuration

proxyConfiguration (object, optional)

This specifies the proxy servers that will be used by the scraper in order to hide its origin.

For details, see Proxy configuration in README.

Default value of this property is {"useApifyProxy":false}

Temperature

temperature (string, optional)

Controls randomness: lowering the temperature results in less random completions. As the temperature approaches zero, the model becomes deterministic and repetitive. For consistent results, we recommend setting the temperature to 0.

Default value of this property is "0"

TopP

topP (string, optional)

Controls diversity via nucleus sampling: 0.5 means half of all likelihood-weighted options are considered.

Default value of this property is "1"
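The idea behind nucleus sampling can be sketched on a toy distribution: candidates are ranked by probability and kept until their cumulative probability reaches topP. The function nucleus and the probabilities below are illustrative assumptions, not the model's actual implementation.

```python
def nucleus(probabilities, top_p):
    """Return the most likely tokens whose cumulative probability reaches top_p."""
    kept = []
    cumulative = 0.0
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    for token, probability in ranked:
        kept.append(token)
        cumulative += probability
        if cumulative >= top_p:
            break                      # the nucleus is complete
    return kept

# Hypothetical next-token distribution.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(nucleus(probs, 0.5))   # only the top half of the probability mass
print(nucleus(probs, 1.0))   # every candidate stays eligible
```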

Frequency penalty

frequencyPenalty (string, optional)

How much to penalize new tokens based on their existing frequency in the text so far. Decreases the model's likelihood to repeat the same line verbatim.

Default value of this property is "0"

Presence penalty

presencePenalty (string, optional)

How much to penalize new tokens based on whether they appear in the text so far. Increases the model's likelihood to talk about new topics.

Default value of this property is "0"
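Note that the four sampling settings are typed as strings, not numbers. A small sketch of preparing them follows; sampling_settings is a hypothetical helper, and the ranges are an assumption taken from the OpenAI API's documented limits (temperature 0 to 2, top_p 0 to 1, penalties -2 to 2), which the Actor may or may not enforce itself.

```python
# Assumed ranges, based on the OpenAI API; not stated on this page.
RANGES = {
    "temperature": (0.0, 2.0),
    "topP": (0.0, 1.0),
    "frequencyPenalty": (-2.0, 2.0),
    "presencePenalty": (-2.0, 2.0),
}

def sampling_settings(**values):
    """Range-check numeric values and return them as strings for the input."""
    settings = {}
    for name, value in values.items():
        low, high = RANGES[name]
        if not low <= float(value) <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        settings[name] = str(value)
    return settings

# These values reproduce the documented defaults.
settings = sampling_settings(temperature=0, topP=1, frequencyPenalty=0, presencePenalty=0)
print(settings)
```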

Content selector

targetSelector (string, optional)

A CSS selector of the HTML element on the page that will be used in the instruction. Instead of a whole page, you can use only part of the page. For example: "div#content".

Remove HTML elements (CSS selector)

removeElementsCssSelector (string, optional)

A CSS selector matching HTML elements that will be removed from the DOM before the page content is sent to GPT. This is useful for skipping irrelevant page content and saving on GPT input tokens.

By default, the Actor removes commonly unwanted elements such as scripts, styles, and inline images. You can disable the removal by setting this value to a non-existent CSS selector such as dummy_keep_everything.

Default value of this property is "script, style, noscript, path, svg, xlink"
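The effect of the default selector can be approximated with the standard library. The sketch below is a simplification, not the Actor's code: the class VisibleText is hypothetical, and it treats the selector as a plain list of tag names (which happens to be true of the default value) instead of evaluating real CSS selectors. Unclosed void tags inside a removed element are not handled.

```python
from html.parser import HTMLParser

DEFAULT_REMOVE = "script, style, noscript, path, svg, xlink"

class VisibleText(HTMLParser):
    """Collect text while skipping elements whose tag name is in the list."""

    def __init__(self, remove_selector=DEFAULT_REMOVE):
        super().__init__()
        self.removed = {tag.strip() for tag in remove_selector.split(",")}
        self.skip_depth = 0            # > 0 while inside a removed element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in self.removed:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

page = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    '<body><p>Hello</p><svg><path d="M0 0"/></svg><p>World</p></body></html>'
)
parser = VisibleText()
parser.feed(page)
print(parser.chunks)
```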

Page format in request

pageFormatInRequest (Enum, optional)

The format in which the content extracted from the page is sent to GPT. Markdown takes less space, allowing for larger requests, while HTML may help include some information, such as attributes, that may otherwise be omitted.

Value options:

"HTML": string
"Markdown": string

Default value of this property is "Markdown"

Wait for dynamic content (seconds)

dynamicContentWaitSecs (integer, optional)

The maximum time to wait for dynamic page content to load. The crawler will continue either if this time elapses, or if it detects the network became idle as there are no more requests for additional resources.

Default value of this property is 0

Remove link URLs

removeLinkUrls (boolean, optional)

Removes web link URLs while keeping the text content they display.

This helps reduce the total page content by eliminating unnecessary URLs before sending it to GPT. It is useful if you are hitting the maximum input token limit.

Default value of this property is false
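For Markdown content, the transformation can be pictured as replacing each link with its visible text. The sketch below is only an illustration under that assumption: the regex and the helper strip_link_urls are hypothetical, image links are left alone, and the Actor's own processing may differ.

```python
import re

# Matches [text](url) but not image syntax ![alt](src).
MARKDOWN_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]*\)")

def strip_link_urls(markdown: str) -> str:
    """Replace each Markdown link with just its visible text."""
    return MARKDOWN_LINK.sub(r"\1", markdown)

before = "See [the docs](https://example.com/docs) and [pricing](https://example.com/pricing)."
after = strip_link_urls(before)
print(after)
print(len(before), len(after))   # the shorter text costs fewer input tokens
```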

Use JSON schema to format answer

useStructureOutput (boolean, optional)

If true, the answer will be transformed into a structured format based on the schema in the jsonAnswer attribute.

JSON schema format

schema (object, optional)

Defines how the output will be stored in a structured format, using JSON Schema. Keep in mind that the schema is passed to GPT as a function, so setting descriptions for the fields and an accurate title can give better results.

Schema description

schemaDescription (string, optional)

Description of the schema function. Use this to provide more context for the schema.

By default, the value of the instructions field is used as the schema description. You can change it here.
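Taken together, the three structured-output fields might be filled in as in the sketch below. The schema contents (a summary and a sentiment field) are a hypothetical example, chosen only to show a title and per-field descriptions; only the three top-level field names come from this page.

```python
import json

structured_output_input = {
    "useStructureOutput": True,
    "schema": {
        "type": "object",
        "title": "Page summary",   # a descriptive title may improve results
        "properties": {
            "summary": {
                "type": "string",
                "description": "Three-sentence summary of the page",
            },
            "sentiment": {
                "type": "string",
                "enum": ["positive", "neutral", "negative"],
                "description": "Overall sentiment of the page content",
            },
        },
        "required": ["summary", "sentiment"],
    },
    # Optional: overrides the instructions text as the schema description.
    "schemaDescription": "Summarize the page and classify its overall sentiment.",
}

print(json.dumps(structured_output_input, indent=2))
```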

Save debug snapshots

saveSnapshots (boolean, optional)

For each page, stores its HTML, screenshot, and parsed content (Markdown or HTML, as it was sent to ChatGPT) and adds links to these to the output.

Default value of this property is true

Developer

Jakub Drobník

Actor metrics

302 monthly users
97.4% runs succeeded
9.5 days response time
Created in Mar 2023
Modified 8 days ago

Categories

Lead generation

Business
