Website Content Vector Retriever
URL
url
stringRequired
The URL of the website to fetch web pages from. It can be a whole domain such as https://example.com, a subdirectory such as https://example.com/some-directory/, or a specific page such as https://example.com/some-directory/page.html.
Maximum number of pages
maxCrawlPages
integerOptional
The maximum number of pages to crawl. This count includes the start URLs, pagination pages, pages with no content, and so on. The crawler stops automatically after reaching this number, which helps prevent accidental crawler runaway.
Default value of this property is 1000
Max results
maxResults
integerOptional
The maximum number of resulting web pages to store. The crawler stops automatically after reaching this number, which helps prevent accidental crawler runaway. If both Maximum number of pages and Max results are defined, the crawler stops when the first limit is reached. Note that the crawler skips any page whose canonical URL matches a page that has already been crawled, so it might crawl more pages than there are results.
Default value of this property is 9999999
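The interaction between the two limits can be illustrated with a small simulation. This is only a sketch of the behavior described above, not the actor's actual code: the function name `simulate_crawl` and the `(url, canonical_url)` tuple format are assumptions made for illustration.

```python
def simulate_crawl(pages, max_crawl_pages=1000, max_results=9999999):
    """Simulate the two stop limits over an iterable of (url, canonical_url) tuples.

    Returns (pages_crawled, results). Hypothetical helper, for illustration only.
    """
    seen_canonicals = set()
    results = []
    pages_crawled = 0
    for url, canonical_url in pages:
        # Stop as soon as either limit has been reached.
        if pages_crawled >= max_crawl_pages or len(results) >= max_results:
            break
        pages_crawled += 1
        # A page whose canonical URL was already crawled counts as crawled
        # but produces no new result.
        if canonical_url in seen_canonicals:
            continue
        seen_canonicals.add(canonical_url)
        results.append(canonical_url)
    return pages_crawled, results


if __name__ == "__main__":
    pages = [("a?x=1", "a"), ("a?x=2", "a"), ("b", "b"), ("c", "c")]
    print(simulate_crawl(pages, max_results=2))
```

In this sketch the duplicate canonical URL is why the number of crawled pages can exceed the number of stored results.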
Chunk size
chunkSize
integerOptional
The maximum size of each chunk in characters or tokens.
Default value of this property is 1000
Chunk overlap
chunkOverlap
integerOptional
The number of overlapping characters or tokens between consecutive chunks.
Default value of this property is 200
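To make the relationship between chunk size and chunk overlap concrete, here is a minimal character-based sketch. It assumes a simple sliding window; the actor may split on tokens or on separators instead, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters. Illustrative sketch only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already covers the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(chunk)
```

With the defaults of 1000 and 200, each new chunk in this sketch starts 800 characters after the previous one.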
Crawler type
crawlerType
EnumOptional
Value options:
"playwright:firefox": string
"cheerio": string
Default value of this property is "cheerio"
Max cache age in days
maxCacheAge
integerOptional
The maximum age, in days, at which a cached record is still considered usable.
Default value of this property is 30
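A freshness check of this kind might look like the sketch below. The function name `is_cache_fresh` and the idea of storing a timestamp with each record are assumptions; the source does not say how the actor records cache age or whether the boundary is inclusive.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_cache_fresh(stored_at: datetime, max_cache_age_days: int = 30,
                   now: Optional[datetime] = None) -> bool:
    """Return True if a record stored at stored_at is within the allowed age.

    Hypothetical helper; treating the limit as inclusive is an assumption.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - stored_at <= timedelta(days=max_cache_age_days)


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    print(is_cache_fresh(datetime(2024, 1, 10, tzinfo=timezone.utc), now=now))
    print(is_cache_fresh(datetime(2023, 12, 1, tzinfo=timezone.utc), now=now))
```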
Cache key-value store name
cacheKeyValueStoreName
stringOptional
Name of the key-value store where the actor will store its cache.
Default value of this property is "website-content-vector-retriever-cache"
Index Website Content Crawled Run ID
indexWebsiteContentCrawledRunId
stringOptional
When set, this invokes a special mode in which the actor indexes the crawl into the database rather than answering user queries. This mode is invoked by a webhook from Website Content Crawler, scheduled by this actor. Note that the URL and Query must also be passed, but the URL can be a dummy value because it is ignored.
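The fields and defaults listed above can be assembled into an input object along the following lines. This is a sketch under stated assumptions: the helper `build_input` is hypothetical, it only covers the fields documented on this page, and it does not call any API.

```python
import json

# Defaults copied from the field descriptions on this page.
DEFAULTS = {
    "maxCrawlPages": 1000,
    "maxResults": 9999999,
    "chunkSize": 1000,
    "chunkOverlap": 200,
    "crawlerType": "cheerio",
    "maxCacheAge": 30,
    "cacheKeyValueStoreName": "website-content-vector-retriever-cache",
}
CRAWLER_TYPES = {"playwright:firefox", "cheerio"}
# Documented optional field that has no default value.
OPTIONAL_WITHOUT_DEFAULT = {"indexWebsiteContentCrawledRunId"}


def build_input(url, **overrides):
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_WITHOUT_DEFAULT
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    actor_input = {"url": url, **DEFAULTS, **overrides}
    if actor_input["crawlerType"] not in CRAWLER_TYPES:
        raise ValueError(f"unsupported crawlerType: {actor_input['crawlerType']!r}")
    return actor_input


if __name__ == "__main__":
    example = build_input("https://example.com/some-directory/", chunkSize=500)
    print(json.dumps(example, indent=2))
```

Rejecting unknown keys is a design choice of this sketch, intended to catch misspelled field names early.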
Actor Metrics
1 monthly user
5 stars
Created in Sep 2023
Modified a year ago