Extended GPT Scraper
Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.
Start URLs
startUrls
array, required
A static list of URLs to scrape.
For details, see Start URLs in README.
OpenAI API key
openaiApiKey
string, required
The API key for accessing OpenAI. You can get it from the OpenAI platform.
Instructions for GPT
instructions
string, required
Instruct GPT how to generate text. For example: "Summarize this page in three sentences."
You can also instruct GPT to answer with "skip this page", in which case the page is skipped. For example: "Summarize this page in three sentences. If the page is about Apify Proxy, answer with 'skip this page'."
GPT model
model
Enum, required
Select a GPT model. See models overview. Keep in mind that each model has different pricing and features.
Value options:
"gpt-3.5-turbo": string
"gpt-3.5-turbo-16k": string
"gpt-4": string
"gpt-4-32k": string
"text-davinci-003": string
"gpt-4-turbo": string
"gpt-4o": string
"gpt-4o-mini": string
Default value of this property is "gpt-3.5-turbo"
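As a rough illustration, the four required fields could be assembled and sanity-checked as below. The `{"url": ...}` shape of each Start URL entry, the placeholder key, and the helper `check_required_input` are assumptions made for this sketch, not something this page specifies.

```python
import json

# Model names copied from the "Value options" list above.
ALLOWED_MODELS = {
    "gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4", "gpt-4-32k",
    "text-davinci-003", "gpt-4-turbo", "gpt-4o", "gpt-4o-mini",
}
REQUIRED_FIELDS = ("startUrls", "openaiApiKey", "instructions", "model")


def check_required_input(run_input):
    """Hypothetical helper: list problems in the required fields (empty if none)."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if not run_input.get(name)
    ]
    model = run_input.get("model")
    if model and model not in ALLOWED_MODELS:
        problems.append(f"unknown model: {model}")
    return problems


run_input = {
    "startUrls": [{"url": "https://example.com"}],  # assumed entry shape
    "openaiApiKey": "<YOUR_OPENAI_API_KEY>",        # placeholder, not a real key
    "instructions": "Summarize this page in three sentences.",
    "model": "gpt-3.5-turbo",                       # the documented default
}

print(check_required_input(run_input))
print(json.dumps(run_input, indent=2))
```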
Include URLs (globs)
includeUrlGlobs
array, optional
Glob patterns matching URLs of pages that will be included in crawling. Combine them with the link selector to tell the scraper where to find links. You need to use both globs and link selector to crawl further pages.
Default value of this property is []
Exclude URLs (globs)
excludeUrlGlobs
array, optional
Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.
Default value of this property is []
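The include and exclude rules above can be sketched roughly as follows. The glob dialect here is a simplification chosen for the example (`**` matches anything, `*` matches anything except `/`); the Actor's real matcher may behave differently, and `glob_to_regex` and `should_crawl` are hypothetical names.

```python
import re


def glob_to_regex(glob):
    """Translate a simplified URL glob into a compiled regex (assumed dialect)."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")       # '**' crosses path separators
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")    # '*' stays within one path segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def should_crawl(url, include_globs, exclude_globs, is_start_url=False):
    """Decide whether a URL is crawled under the rules described above."""
    if is_start_url:
        return True  # Start URLs are always crawled
    if any(glob_to_regex(g).fullmatch(url) for g in exclude_globs):
        return False
    # With no include globs, no discovered link qualifies.
    return any(glob_to_regex(g).fullmatch(url) for g in include_globs)


include = ["https://example.com/blog/**"]
exclude = ["https://example.com/blog/drafts/**"]
print(should_crawl("https://example.com/blog/2024/post", include, exclude))
print(should_crawl("https://example.com/blog/drafts/wip", include, exclude))
```

Checking the exclude list first means an excluded link is dropped even when it also matches an include glob; that precedence is an assumption of this sketch.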
Max crawling depth
maxCrawlingDepth
integer, optional
This specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers.
If set to 0, there is no limit.
Default value of this property is 99999999
Max pages per run
maxPagesPerCrawl
integer, optional
Maximum number of pages that the scraper will open. 0 means unlimited.
Default value of this property is 10
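To see how the depth and page limits interact, here is a small sketch over an in-memory link graph. It assumes a breadth-first crawl and treats 0 as "unlimited" for both limits, as described above; the function `crawl_order` and the toy site are made up for illustration.

```python
from collections import deque


def crawl_order(links, start_urls, max_depth=99999999, max_pages=10):
    """Return the URLs a breadth-first crawl would open, in order.

    `links` maps a URL to the URLs it links to. A value of 0 for
    `max_depth` or `max_pages` means no limit.
    """
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    opened = []
    while queue:
        if max_pages and len(opened) >= max_pages:
            break  # page budget used up
        url, depth = queue.popleft()
        opened.append(url)
        if max_depth and depth >= max_depth:
            continue  # do not follow links any deeper
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return opened


site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
print(crawl_order(site, ["a"], max_depth=1, max_pages=0))
print(crawl_order(site, ["a"], max_depth=0, max_pages=2))
```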
Link selector
linkSelector
string, optional
This is a CSS selector that says which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Include URLs (globs) and Exclude URLs (globs) settings.
If Link selector is empty, the page links are ignored.
For details, see Link selector in README.
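As a minimal sketch of what link extraction amounts to, the snippet below collects every `<a>` element that has an `href` attribute and resolves it against the page URL. It uses only the Python standard library and supports just that one fixed pattern rather than arbitrary CSS selectors; `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = '<a href="/docs">Docs</a><a name="top">No href</a><a href="https://other.example/p">P</a>'
print(extract_links(page, "https://example.com/start"))
```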
Initial cookies
initialCookies
array, optional
Cookies that will be pre-set on all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with name, value, domain, and path properties. For example: [{"name": "cookieName", "value": "cookieValue", "domain": ".domain.com", "path": "/"}].
You can use the EditThisCookie browser extension to copy browser cookies in this format and paste them here.
Default value of this property is []
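Before pasting cookies into the input, it can help to check that the JSON parses and that each object carries the four expected properties. The following is a sketch only; `parse_cookies` is a hypothetical helper, and treating all four properties as mandatory is an assumption.

```python
import json

EXPECTED_KEYS = ("name", "value", "domain", "path")


def parse_cookies(raw):
    """Parse a JSON cookie array and verify the expected properties exist."""
    cookies = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(cookies, list):
        raise ValueError("cookies must be a JSON array")
    for index, cookie in enumerate(cookies):
        if not isinstance(cookie, dict):
            raise ValueError(f"cookie {index} is not an object")
        missing = [key for key in EXPECTED_KEYS if key not in cookie]
        if missing:
            raise ValueError(f"cookie {index} is missing: {', '.join(missing)}")
    return cookies


raw = '[{"name": "cookieName", "value": "cookieValue", "domain": ".domain.com", "path": "/"}]'
print(parse_cookies(raw))
```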
Proxy configuration
proxyConfiguration
object, optional
This specifies the proxy servers that will be used by the scraper in order to hide its origin.
For details, see Proxy configuration in README.
Default value of this property is {"useApifyProxy":false}
Temperature
temperature
string, optional
Controls randomness: lowering the temperature results in less random completions. As the temperature approaches zero, the model becomes deterministic and repetitive. For consistent results, we recommend setting the temperature to 0.
Default value of this property is "0"
TopP
topP
string, optional
Controls diversity via nucleus sampling: 0.5 means half of all likelihood-weighted options are considered.
Default value of this property is "1"
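The effect of temperature and nucleus (top-p) sampling can be illustrated on a toy distribution. This is a textbook-style sketch of the two techniques, not OpenAI's implementation; the function names and the example logits are made up. Note that both settings are passed to the Actor as strings, so a caller would convert them with `float(...)` first.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the result."""
    if temperature <= 0:
        # Limit case: all probability mass on the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, float("0")))  # documented default "0"
print(top_p_filter([0.5, 0.3, 0.2], float("0.5")))
```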
Frequency penalty
frequencyPenalty
string, optional
How much to penalize new tokens based on their existing frequency in the text so far. Decreases the model's likelihood to repeat the same line verbatim.
Default value of this property is "0"
Presence penalty
presencePenalty
string, optional
How much to penalize new tokens based on whether they appear in the text so far. Increases the model's likelihood to talk about new topics.
Default value of this property is "0"
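The two penalties can be sketched as an adjustment to each token's logit, following the formula given in OpenAI's documentation: the frequency penalty scales with how often a token has already appeared, while the presence penalty is applied once if it has appeared at all. The helper `apply_penalties` and the sample tokens are illustrative assumptions.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return logits adjusted for tokens already present in the output.

    adjusted = logit - count * frequency_penalty - (count > 0) * presence_penalty
    """
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts[token]
        seen = 1.0 if count > 0 else 0.0
        adjusted[token] = logit - count * frequency_penalty - seen * presence_penalty
    return adjusted


logits = {"the": 2.0, "cat": 1.0, "dog": 1.0}
history = ["the", "the", "cat"]
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.1))
```

With both penalties at their default of 0, the logits are returned unchanged.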
Content selector
targetSelector
string, optional
A CSS selector for the HTML element whose content is sent to GPT along with the instructions. Use it to process only part of the page instead of the whole page. For example: "div#content".
Remove HTML elements (CSS selector)
removeElementsCssSelector
string, optional
A CSS selector matching HTML elements that will be removed from the DOM before the page is sent to GPT for processing. This is useful for skipping irrelevant page content and saving GPT input tokens.
By default, the Actor removes commonly unwanted elements such as scripts, styles, and inline images. You can disable the removal by setting this value to a non-existent CSS selector such as dummy_keep_everything.
Default value of this property is "script, style, noscript, path, svg, xlink"
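The idea of stripping elements before the page reaches GPT can be sketched with the standard library alone. This simplified version only understands a comma-separated list of bare tag names (which is what the default value happens to be), not general CSS selectors, and it decodes character entities in text; `ElementStripper` and `strip_elements` are hypothetical names.

```python
from html.parser import HTMLParser

DEFAULT_SELECTOR = "script, style, noscript, path, svg, xlink"
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementStripper(HTMLParser):
    """Rebuild the HTML while dropping the listed elements and their content."""

    def __init__(self, removed_tags):
        super().__init__()
        self.removed = set(removed_tags)
        self.skip_depth = 0  # > 0 while inside a removed element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.handle_startendtag(tag, attrs)  # void tags never get a close
        elif self.skip_depth or tag in self.removed:
            self.skip_depth += 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and tag not in self.removed:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def strip_elements(html, selector=DEFAULT_SELECTOR):
    tags = [part.strip() for part in selector.split(",") if part.strip()]
    stripper = ElementStripper(tags)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)


page = ('<div id="content"><script>var x = 1;</script><p>Hello</p>'
        '<svg><path d="M0 0"/></svg></div>')
print(strip_elements(page))
```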
Page format in request
pageFormatInRequest
Enum, optional
The format in which the content extracted from the page is sent to GPT. Markdown takes less space, allowing for larger requests, while HTML may help include some information, such as attributes, that would otherwise be omitted.
Value options:
"HTML": string
"Markdown": string
Default value of this property is "Markdown"
Skip GPT processing for Globs
skipGptGlobs
array, optional
This setting allows you to specify certain page URLs to skip GPT instructions for. Pages matching these glob patterns will only be crawled for links, excluding them from GPT processing. Useful for intermediary pages used for navigation or undesired content.
Default value of this property is []
Wait for dynamic content (seconds)
dynamicContentWaitSecs
integer, optional
The maximum time to wait for dynamic page content to load. The crawler will continue either if this time elapses, or if it detects the network became idle as there are no more requests for additional resources.
Default value of this property is 0
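The "whichever comes first" behavior described above can be sketched as a polling loop. How the Actor actually detects network idleness is not stated here, so the `is_network_idle` callback, the polling interval, and the function name are all assumptions; the clock and sleep functions are injectable only to make the sketch easy to test.

```python
import time


def wait_for_dynamic_content(max_wait_secs, is_network_idle, poll_secs=0.1,
                             clock=time.monotonic, sleep=time.sleep):
    """Wait until the network is idle or the time budget runs out.

    Returns "idle" or "timeout" to say which condition ended the wait.
    """
    deadline = clock() + max_wait_secs
    while clock() < deadline:
        if is_network_idle():
            return "idle"
        sleep(poll_secs)
    return "timeout"


# With the default of 0 seconds the function returns without waiting.
print(wait_for_dynamic_content(0, lambda: False))
```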
Remove link URLs
removeLinkUrls
boolean, optional
Removes web link URLs while keeping the text content they display.
- This helps reduce the total page content by eliminating unnecessary URLs before sending to GPT
- Useful if you are hitting the maximum input token limit
Default value of this property is false
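For Markdown content, removing link URLs while keeping the link text could look roughly like this. The regular expression is an illustrative assumption that covers only inline links of the form `[text](url)` or `[text](url "title")`; it leaves images alone and is not the Actor's actual implementation.

```python
import re

# Inline Markdown link, optionally with a quoted title; '!' before '[' marks an image.
MARKDOWN_LINK = re.compile(r'(?<!!)\[([^\]]+)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def remove_link_urls(markdown):
    """Replace each inline link with its visible text."""
    return MARKDOWN_LINK.sub(r"\1", markdown)


text = 'See the [docs](https://example.com/docs) and [blog](https://example.com/blog "Blog").'
print(remove_link_urls(text))
```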
Use JSON schema to format answer
useStructureOutput
boolean, optional
If true, the answer will be transformed into a structured format based on the schema in the jsonAnswer attribute.
JSON schema format
schema
object, optional
Defines how the output will be stored in a structured format using JSON Schema. Keep in mind that it uses function calling, so setting a correct title and descriptions for the fields can give better results.
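A schema with a title and per-field descriptions might look like the sketch below. The field names, the title, and the helper `fields_missing_description` are invented for illustration; only the `useStructureOutput` and `schema` keys come from this page.

```python
import json

structured_input = {
    "useStructureOutput": True,
    "schema": {
        "type": "object",
        "title": "PageSummary",  # hypothetical title
        "description": "Summary of a single scraped page",
        "properties": {
            "summary": {
                "type": "string",
                "description": "Three-sentence summary of the page",
            },
            "sentiment": {
                "type": "string",
                "enum": ["positive", "neutral", "negative"],
                "description": "Overall tone of the page content",
            },
        },
        "required": ["summary", "sentiment"],
    },
}


def fields_missing_description(schema):
    """List top-level properties that lack a description."""
    return [
        name
        for name, spec in schema.get("properties", {}).items()
        if not spec.get("description")
    ]


print(fields_missing_description(structured_input["schema"]))
print(json.dumps(structured_input, indent=2))
```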
Actor Metrics
78 monthly users
59 stars
73% runs succeeded
11 days response time
Created in Jun 2023
Modified 11 days ago