Start URLs
A static list of URLs to scrape.
For details, see Start URLs in README.
Glob patterns
Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.
Default value of this property is
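The interaction between Glob patterns and Link selector can be illustrated with a minimal sketch. It uses Python's standard `fnmatch` module as a stand-in for the scraper's glob matching, which is an assumption: the scraper's own implementation may treat `*` and `**` differently. The `filter_links` helper and the example URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def filter_links(links, globs):
    """Keep links that match at least one glob pattern.

    With no glob patterns, every link found by the Link selector is kept,
    mirroring the behaviour described above.
    """
    if not globs:
        return list(links)
    return [link for link in links if any(fnmatchcase(link, g) for g in globs)]


# Hypothetical links, as if already matched by the Link selector.
links = [
    "https://example.com/docs/intro",
    "https://example.com/blog/post-1",
    "https://example.com/docs/api/v2",
]

print(filter_links(links, ["https://example.com/docs/*"]))
print(filter_links(links, []))
```

Note that in `fnmatch` a single `*` also matches `/`, so the pattern above keeps nested paths under `/docs/`.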
Link selector
A CSS selector specifying which links on the page (<a> elements with an href attribute) should be followed and added to the request queue. To filter the links added to the queue, use the Glob patterns setting.
If Link selector is empty, the page links are ignored.
For details, see Link selector in README.
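As a rough illustration of the behaviour described above, the sketch below collects the href values of <a> elements using only the standard library. It is a simplification, not the scraper's implementation: it supports only the equivalent of the selector `a[href]` and otherwise checks just whether the selector is empty. The names `AnchorCollector` and `collect_links` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values from <a> elements that have an href attribute."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(html, link_selector):
    # An empty Link selector means page links are ignored entirely.
    if not link_selector:
        return []
    # Assumption: any non-empty selector is treated as "a[href]" in this sketch.
    parser = AnchorCollector()
    parser.feed(html)
    return parser.hrefs


page = '<a href="/docs">Docs</a><a name="top">Top</a><a href="/blog">Blog</a>'
print(collect_links(page, "a[href]"))  # ['/docs', '/blog']
print(collect_links(page, ""))         # []
```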
Instructions for GPT
Instruct GPT how to generate text. For example: "Summarize this page into three sentences."
You can instruct GPT to answer with "skip this page", which will cause the page to be skipped. For example: "Summarize this page into three sentences. If the page is about Apify Proxy, answer with 'skip this page'."
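The skip convention can be sketched as a simple check on the model's answer. How the scraper actually detects the phrase is not stated here, so the case-insensitive substring match below is an assumption, and `should_skip` is a hypothetical helper.

```python
SKIP_MARKER = "skip this page"

instructions = (
    "Summarize this page into three sentences. "
    "If the page is about Apify Proxy, answer with 'skip this page'."
)


def should_skip(answer: str) -> bool:
    """Return True when the answer contains the skip phrase (assumed matching rule)."""
    return SKIP_MARKER in answer.strip().lower()


print(should_skip("Skip this page."))
print(should_skip("The page describes pricing tiers for the product."))
```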
Content selector
A CSS selector for the HTML element whose content will be used with the instructions. Use it to pass only part of the page instead of the whole page. For example: "div#content".
Max crawling depth
Specifies how many links away from Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers.
If set to 0, there is no limit.
Default value of this property is
Max pages per run
Maximum number of pages that the scraper will open. 0 means unlimited.
Default value of this property is
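The two limits above, Max crawling depth and Max pages per run, can be illustrated with a small breadth-first crawl over an in-memory link graph. This is a sketch under stated assumptions, not the scraper's code: `crawl`, `get_links`, and the `site` graph are hypothetical, and 0 is treated as "no limit" for both settings as described above.

```python
from collections import deque


def crawl(start_urls, get_links, max_depth=0, max_pages=0):
    """Breadth-first crawl that honours a depth limit and a page limit.

    A value of 0 for either limit means unlimited.
    """
    queue = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    opened = []
    while queue:
        # Stop once the page budget is used up.
        if max_pages and len(opened) >= max_pages:
            break
        url, depth = queue.popleft()
        opened.append(url)
        # Do not follow links from pages that already sit at the depth limit.
        if max_depth and depth >= max_depth:
            continue
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return opened


# Hypothetical site: each key links to the pages in its list.
site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}


def get_links(url):
    return site.get(url, [])


print(crawl(["a"], get_links, max_depth=1))  # ['a', 'b', 'c']
print(crawl(["a"], get_links, max_pages=2))  # ['a', 'b']
print(crawl(["a"], get_links))               # ['a', 'b', 'c', 'd', 'e']
```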
Use JSON schema to format answer
If true, the answer will be transformed into a structured format based on the schema in the jsonAnswer attribute.
Schema
This defines how the output will be stored in a structured format using JSON Schema. Keep in mind that it uses function calling, so setting a correct title and descriptions for the fields can give better results.
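A schema with a title and per-field descriptions might look like the sketch below. The field names (`summary`, `topics`) and their wording are hypothetical, chosen only to show where titles and descriptions go; the dictionary is serialized with `json.dumps` so the result can be pasted as JSON.

```python
import json

# Hypothetical schema: titles and descriptions give the model extra guidance.
answer_schema = {
    "type": "object",
    "title": "Page summary",
    "description": "A short structured summary of a single web page.",
    "properties": {
        "summary": {
            "type": "string",
            "description": "Summary of the page in three sentences.",
        },
        "topics": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Main topics covered by the page.",
        },
    },
    "required": ["summary"],
}

print(json.dumps(answer_schema, indent=2))
```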
Proxy configuration
Specifies proxy servers that will be used by the scraper in order to hide its origin.
For details, see Proxy configuration in README.
Default value of this property is