Start URLs
A static list of URLs to scrape.
For details, see Start URLs in the README.
Glob patterns
Glob patterns let you match the links on the page that you want to enqueue. Combine them with the link selector to tell the scraper where to find links. If you omit the glob patterns, the scraper enqueues every link matched by the link selector.
Default value of this property is
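The interaction between the link selector and the glob patterns can be sketched roughly as follows. This is a minimal illustration, not the scraper's own code: `filter_links` is a hypothetical helper, the URLs are made up, and Python's `fnmatch` only approximates real glob semantics (its `*` also matches `/`, so `*` and `**` behave the same here).

```python
from fnmatch import fnmatchcase


def filter_links(links, globs):
    """Keep links that match at least one glob; keep everything if no globs are set."""
    if not globs:
        # No glob patterns: every link found by the link selector is enqueued.
        return list(links)
    return [url for url in links if any(fnmatchcase(url, g) for g in globs)]


# Hypothetical links already matched by the link selector.
links = [
    "https://example.com/docs/intro",
    "https://example.com/docs/api/v1",
    "https://example.com/blog/post-1",
]

docs_only = filter_links(links, ["https://example.com/docs/**"])
everything = filter_links(links, [])
print(docs_only)
print(everything)
```

With the pattern above only the two `/docs/` links are kept, while an empty pattern list keeps all three.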
Link selector
This is a CSS selector that says which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Glob patterns setting.
If Link selector is empty, the page links are ignored.
For details, see Link selector in the README.
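As a rough illustration of what "<a> elements with an href attribute" means in practice, the sketch below collects such links with Python's standard library. It is an assumption-laden stand-in: `LinkCollector` and `collect_links` are hypothetical names, and only the literal selector "a[href]" is supported, whereas the scraper accepts arbitrary CSS selectors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a> elements that carry an href attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url, link_selector="a[href]"):
    """Hypothetical helper: an empty selector means page links are ignored."""
    if not link_selector:
        return []
    if link_selector != "a[href]":
        raise ValueError("this sketch only understands the selector 'a[href]'")
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = '<a href="/docs">Docs</a><a name="top">No href</a><a href="https://other.example/x">Ext</a>'
found = collect_links(page, "https://example.com/")
ignored = collect_links(page, "https://example.com/", link_selector="")
print(found)
print(ignored)
```

The anchor without an href is not collected, and passing an empty selector returns no links at all.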
OpenAI API key
The API key for accessing OpenAI. You can get it from the OpenAI platform.
Instructions for GPT
Tell GPT how to generate the text. For example: "Summarize this page in three sentences."
You can also instruct GPT to answer with "skip this page", in which case the page is skipped. For example: "Summarize this page in three sentences. If the page is about Apify Proxy, answer with 'skip this page'."
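The skip convention could be handled along these lines. This is only a sketch under assumptions: the source does not say how the scraper compares the answer, so the case-insensitive substring check and the names `SKIP_MARKER` and `should_skip` are hypothetical.

```python
SKIP_MARKER = "skip this page"

instructions = (
    "Summarize this page in three sentences. "
    "If the page is about Apify Proxy, answer with 'skip this page'."
)


def should_skip(answer: str) -> bool:
    """Return True if the model's answer asks for the page to be skipped.

    Assumption: a case-insensitive substring match is good enough here.
    """
    return SKIP_MARKER in answer.strip().lower()


print(should_skip("Skip this page."))
print(should_skip("The page describes three pricing tiers."))
```

A tolerant match like this copes with stray capitalization or punctuation in the answer, at the cost of also skipping a page whose real summary happens to contain the phrase.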
GPT model
Select a GPT model. See the models overview. Keep in mind that each model has different pricing and features.
Value options:
Default value of this property is
Content selector
A CSS selector for the HTML element whose content is passed to GPT together with the instructions. This lets you use only part of the page instead of the whole page. For example: "div#content".
Max crawling depth
This specifies how many links away from the Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers.
If set to 0, there is no limit.
Default value of this property is
Max pages per run
The maximum number of pages that the scraper will open. 0 means unlimited.
Default value of this property is
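How the maximum crawling depth and the maximum number of pages bound a crawl can be sketched with a small breadth-first traversal. This is an illustrative model rather than the scraper's implementation: `crawl`, `get_links`, and the toy `site` graph are hypothetical, and 0 is treated as "no limit" for both settings, as described above.

```python
from collections import deque


def crawl(start_urls, get_links, max_depth=0, max_pages=0):
    """Breadth-first crawl; 0 for max_depth or max_pages means unlimited."""
    visited = []
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))  # Start URLs sit at depth 0.

    while queue:
        if max_pages and len(visited) >= max_pages:
            break  # Page budget exhausted.
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth and depth >= max_depth:
            continue  # Do not follow links any deeper.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for a website.
site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
links_of = lambda url: site.get(url, [])

print(crawl(["a"], links_of, max_depth=1))
print(crawl(["a"], links_of, max_pages=2))
print(crawl(["a"], links_of))
```

With a depth of 1 only the start page and the pages it links to directly are opened, and with a page limit of 2 the crawl stops after the second page regardless of depth.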
Use JSON schema to format answer
If true, the answer will be transformed into a structured format based on the schema in the jsonAnswer attribute.
Schema
Defines how the output will be stored in a structured format, using JSON Schema. Keep in mind that it relies on function calling, so setting a correct title and a description for each field helps you get better results.
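A schema with a title and per-field descriptions might look like the sketch below. The field names (`title`, `summary`, `topics`) and their wording are invented for illustration; only the JSON Schema keywords themselves are standard.

```python
import json

# Hypothetical schema for a page summary; adjust the fields to your own use case.
schema = {
    "type": "object",
    "title": "Page summary",
    "description": "A short structured summary of a single web page.",
    "properties": {
        "title": {
            "type": "string",
            "description": "The title of the page.",
        },
        "summary": {
            "type": "string",
            "description": "A summary of the page in three sentences.",
        },
        "topics": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Main topics covered on the page.",
        },
    },
    "required": ["title", "summary"],
}

# Serialize to JSON text that can be pasted into the Schema field.
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Because the descriptions are what the model sees for each field, specific wording such as "in three sentences" tends to matter more than the property names themselves.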
Proxy configuration
This specifies the proxy servers that the scraper will use to hide its origin.
For details, see Proxy configuration in the README.
Default value of this property is