One or more URLs of pages where the crawler will start. Note that the Actor will only crawl sub-pages of these URLs. For example, for the start URL https://www.example.com/blog, it will crawl pages like https://example.com/blog/article-1, but will skip https://example.com/docs/something-else.
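The sub-page rule above can be sketched as a simple prefix check. This is only an illustration, not the Actor's actual implementation: the function names are hypothetical, and treating `www.example.com` and `example.com` as the same host is an assumption inferred from the example URLs above.

```python
from urllib.parse import urlsplit


def _normalize(url):
    """Return (host, path) with a leading 'www.' and trailing '/' removed.

    Ignoring 'www.' is an assumption made for this sketch.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/")


def is_in_scope(start_url, candidate_url):
    """True if candidate_url is the start URL itself or one of its sub-pages."""
    start_host, start_path = _normalize(start_url)
    cand_host, cand_path = _normalize(candidate_url)
    if cand_host != start_host:
        return False
    # Compare whole path segments so that /blogging does not match /blog.
    return cand_path == start_path or cand_path.startswith(start_path + "/")
```

Under these assumptions, `is_in_scope("https://www.example.com/blog", "https://example.com/blog/article-1")` is true, while the same call with `https://example.com/docs/something-else` is false.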
Max pages
maxCrawlPages integer Optional
The maximum number of pages to crawl. This count includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.
Default value of this property is 9999999
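To show how a page cap prevents runaway crawls, here is a minimal sketch of a breadth-first crawl loop that stops once the limit is reached. The `crawl` function and its `get_links` callback are hypothetical stand-ins for illustration; the Actor's real crawling logic is not described in this reference.

```python
from collections import deque


def crawl(start_urls, get_links, max_crawl_pages=9999999):
    """Visit pages breadth-first and stop after max_crawl_pages pages.

    get_links is a hypothetical callback that returns the links found on a URL.
    Every visited page counts toward the limit, including the start URLs.
    """
    unique_starts = list(dict.fromkeys(start_urls))  # drop duplicates, keep order
    queue = deque(unique_starts)
    seen = set(unique_starts)
    visited = []
    while queue and len(visited) < max_crawl_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Even if `get_links` keeps producing new URLs forever (for example, endless pagination), the loop ends after `max_crawl_pages` visits.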
OpenAI API key
openAIApiKey string Optional
Enter the API key for your OpenAI account. It is needed to vectorize the crawled data and to prompt the OpenAI model.
Query
query string Optional
The query you want to ask the model about the crawled data.
Re-crawl the data
forceRecrawl boolean Optional
If enabled, the data will be re-crawled even if a cached vector index is available.
Default value of this property is false
Load URLs from Sitemaps
loadUrlsFromSitemaps boolean Optional
If enabled, the scraper will automatically find and load URLs from sitemap.xml files.
Default value of this property is false
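As a rough illustration of what loading URLs from a sitemap involves, the sketch below extracts the `<loc>` entries from sitemap XML using Python's standard library. The helper name and the sample document are made up for this example, and how the Actor itself discovers and reads sitemap.xml files is not specified here.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


# Hypothetical sitemap content used only for this example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/blog/article-1</loc></url>
  <url><loc>https://www.example.com/blog/article-2</loc></url>
</urlset>"""

sitemap_urls = urls_from_sitemap(SAMPLE_SITEMAP)
```

The resulting list could then be added to the crawl queue alongside the start URLs.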
Respect robots.txt file
respectRobotsTxt boolean Optional
If enabled, the scraper will respect the robots.txt file and avoid crawling disallowed pages.
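The following sketch shows one way a robots.txt check can work, using Python's standard `urllib.robotparser` module. It is not the Actor's own code: the `build_robots_checker` helper, the sample rules, and the `*` user agent are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that tells whether a URL may be crawled."""
    parser = RobotFileParser()
    # Parse the rules from text, so no network request is made.
    parser.parse(robots_txt.splitlines())

    def is_allowed(url):
        return parser.can_fetch(user_agent, url)

    return is_allowed


# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

is_allowed = build_robots_checker(SAMPLE_ROBOTS_TXT)
```

With these sample rules, a URL under `/private/` is rejected and would be skipped, while other pages on the site remain crawlable.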