Extended GPT Scraper

drobnikj/extended-gpt-scraper

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

Start URLs

startUrls
array
Required

A static list of URLs to scrape.

For details, see Start URLs in README.
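For illustration, a minimal sketch of running this Actor with a static list of Start URLs through the apify-client package might look like the following; the token, URL, and instruction values are placeholders, and all other inputs fall back to their defaults.

import { ApifyClient } from 'apify-client';

// Placeholder token; replace with your own Apify API token.
const client = new ApifyClient({ token: '<YOUR_APIFY_TOKEN>' });

// Each Start URL is an object with a `url` property.
const run = await client.actor('drobnikj/extended-gpt-scraper').call({
    startUrls: [{ url: 'https://example.com' }],
    instructions: 'Summarize this page in three sentences.',
    openaiApiKey: '<YOUR_OPENAI_API_KEY>',
});

// Results are stored in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);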

Glob patterns

globs
array
Optional

Glob patterns let you match the links on the page that you want to enqueue. Combine them with the Link selector to tell the scraper where to find links. If you omit the glob patterns, the scraper will enqueue all links matched by the Link selector.

Default value of this property is

[]
Link selector

linkSelector
string
Optional

This is a CSS selector that specifies which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Glob patterns setting above.

If the Link selector is empty, the page links are ignored.

For details, see Link selector in README.
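As a sketch of how the Link selector and Glob patterns work together, the following hypothetical input considers only links inside the main content area and enqueues only those pointing into the docs section of the site. The object-with-glob format follows Apify's glob editor; the selector and patterns are assumptions for illustration.

// Hypothetical crawling setup: which <a> elements to consider,
// and which of the matched links to actually enqueue.
const crawlingInput = {
    startUrls: [{ url: 'https://example.com/docs' }],
    linkSelector: 'div.content a[href]',
    globs: [{ glob: 'https://example.com/docs/**' }],
};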

OpenAI API key

openaiApiKey
string
Required

The API key for accessing OpenAI. You can get it from the OpenAI platform.

Instructions for GPT

instructions
string
Required

Instruct GPT how to generate text. For example: "Summarize this page in three sentences."

You can instruct OpenAI to answer with "skip this page", which tells the scraper to skip that page. For example: "Summarize this page in three sentences. If the page is about Apify Proxy, answer with 'skip this page'."
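Expressed as an input value, that instruction would look like this sketch:

// The same instruction as above, written as an input string.
const instructions =
    'Summarize this page in three sentences. ' +
    "If the page is about Apify Proxy, answer with 'skip this page'.";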

GPT model

model
Enum
Required

Select a GPT model. See models overview. Keep in mind that each model has different pricing and features.

Value options:

"gpt-3.5-turbo": string"gpt-3.5-turbo-16k": string"gpt-4": string"gpt-4-32k": string"text-davinci-003": string

Default value of this property is

"gpt-3.5-turbo"

Content selector

targetSelector
string
Optional

A CSS selector of the HTML element on the page whose content will be passed to GPT together with the instructions. Instead of the whole page, you can send only a part of it. For example: "div#content".
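As a sketch, restricting GPT to the main article element of a hypothetical blog might look like this (the selector and URL are assumptions):

// Send only the article body to GPT, not the whole page.
const contentInput = {
    startUrls: [{ url: 'https://example.com/blog/post-1' }],
    targetSelector: 'article.post-body',
    instructions: 'Extract the author name and the publication date.',
};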

Max crawling depth

maxCrawlingDepth
integer
Optional

This specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers.

If set to 0, there is no limit.

Default value of this property is

0

Max pages per run

maxPagesPerCrawl
integer
Optional

Maximum number of pages that the scraper will open. 0 means unlimited.

Default value of this property is

10
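A sketch combining both crawl safeguards, with arbitrary example limits:

// Crawl at most two hops away from the Start URLs
// and open no more than 50 pages in total.
const limits = {
    maxCrawlingDepth: 2,
    maxPagesPerCrawl: 50,
};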

Use JSON schema to format answer

useStructureOutput
boolean
Optional

If true, the answer will be transformed into a structured format based on the Schema field and stored in the jsonAnswer attribute of the output.

Schema

schema
object
Optional

Defines how the output will be stored in a structured format, using JSON Schema. Keep in mind that it relies on OpenAI function calling, so setting descriptive field descriptions and a correct title can improve the results.
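For illustration, a minimal schema for extracting contact details might look like the sketch below; the field names, titles, and descriptions are hypothetical.

// Hypothetical structured-output configuration.
const structuredOutputInput = {
    useStructureOutput: true,
    schema: {
        type: 'object',
        title: 'contact_details',
        properties: {
            email: { type: 'string', description: 'Contact e-mail address found on the page' },
            phone: { type: 'string', description: 'Contact phone number found on the page' },
        },
        required: ['email'],
    },
};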

Proxy configuration

proxyConfiguration
object
Optional

This specifies the proxy servers that will be used by the scraper in order to hide its origin.

For details, see Proxy configuration in README.

Default value of this property is

{"useApifyProxy":false}