Pricing

Pay per usage

Website Content Vector Retriever

Developed by

Hamza Alwan

0.0 (0)

Total users

Monthly users

Runs succeeded

>99%

Last modified

2 years ago

URL

url (string, required)

The URL of the website to fetch web pages from. It can be a whole domain such as https://example.com, a subdirectory such as https://example.com/some-directory/, or a specific page such as https://example.com/some-directory/page.html.

Query

query (string, required)

Text query to look for on the website.

OpenAI API key

openaiApiKey (string, required)

OpenAI API key to generate vector embeddings.
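
This page does not say how the query is matched against the embedded page content. A common approach is to rank chunks by cosine similarity between the query embedding and each chunk embedding. The sketch below assumes that approach and uses toy 3-dimensional vectors in place of real OpenAI embeddings. The names cosine_similarity, top_k, and TOY_CHUNKS are hypothetical and not part of the actor.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: List[float],
    chunk_vecs: Dict[str, List[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption): real embeddings have far more dimensions.
TOY_CHUNKS = {
    "pricing": [0.9, 0.1, 0.0],
    "contact": [0.0, 1.0, 0.0],
    "blog": [0.1, 0.1, 0.9],
}
```

Calling top_k([1.0, 0.0, 0.0], TOY_CHUNKS, k=2) ranks the "pricing" chunk first because its vector points in nearly the same direction as the query vector.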

Maximum number of pages

maxCrawlPages (integer, optional)

The maximum number of pages to crawl. This includes the start URLs, pagination pages, pages with no content, etc. The crawler finishes automatically after reaching this number, which helps prevent accidental crawler runaway.

Default value of this property is 1000

Max results

maxResults (integer, optional)

The maximum number of resulting web pages to store. The crawler finishes automatically after reaching this number, which helps prevent accidental crawler runaway. If both Maximum number of pages and Max results are defined, the crawler finishes when the first limit is reached. Note that the crawler skips pages whose canonical URL matches a page that has already been crawled, so it might crawl more pages than there are results.

Default value of this property is 9999999
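
The interaction between the two limits and canonical-URL deduplication can be illustrated with a small simulation. This is a sketch under stated assumptions, not the crawler's actual code: the function simulate_crawl and its (url, canonical_url, has_content) tuples are hypothetical, and it assumes that skipped duplicates and empty pages still count toward the page limit.

```python
from typing import Iterable, List, Tuple

Page = Tuple[str, str, bool]  # (url, canonical_url, has_content)


def simulate_crawl(
    pages: Iterable[Page],
    max_crawl_pages: int = 1000,
    max_results: int = 9999999,
) -> Tuple[int, List[str]]:
    """Return (pages_crawled, stored_result_urls) for a list of pages."""
    seen_canonicals = set()
    results: List[str] = []
    crawled = 0
    for url, canonical, has_content in pages:
        # Stop as soon as either limit is reached, whichever comes first.
        if crawled >= max_crawl_pages or len(results) >= max_results:
            break
        crawled += 1
        # Assumption: a duplicate canonical URL is crawled but not stored.
        if canonical in seen_canonicals:
            continue
        seen_canonicals.add(canonical)
        if has_content:
            results.append(url)
    return crawled, results


PAGES = [
    ("https://example.com/a", "https://example.com/a", True),
    ("https://example.com/a?ref=1", "https://example.com/a", True),  # duplicate
    ("https://example.com/b", "https://example.com/b", False),  # no content
    ("https://example.com/c", "https://example.com/c", True),
    ("https://example.com/d", "https://example.com/d", True),
]
```

With max_results=2, this simulation crawls four pages to collect two results, which matches the note that more pages may be crawled than there are results.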

Chunk size

chunkSize (integer, optional)

The maximum size of each chunk in characters or tokens.

Default value of this property is 1000

Chunk overlap

chunkOverlap (integer, optional)

The number of overlapping characters or tokens between consecutive chunks.

Default value of this property is 200
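
To show how chunk size and chunk overlap relate, here is a minimal character-based splitter. It is only an illustration: the page does not state which splitter the actor uses or whether it counts characters or tokens, and the function name chunk_text is hypothetical.

```python
from typing import List


def chunk_text(
    text: str,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> List[str]:
    """Split text into character chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With the defaults, consecutive chunks start 800 characters apart, so the last 200 characters of one chunk are repeated at the start of the next. A 2,500-character text produces three chunks of 1000, 1000, and 900 characters.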

Crawler type

crawlerType (enum, optional)

Value options:

"playwright:firefox": string
"cheerio": string

Default value of this property is "cheerio"

Max cache age in days

maxCacheAge (integer, optional)

The maximum age, in days, at which cached records are still considered usable.

Default value of this property is 30
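
A freshness check along these lines would implement the maximum cache age. This is a sketch, not the actor's code: the function name is_cache_fresh is hypothetical, and treating a record that is exactly at the limit as still fresh is an assumption.

```python
from datetime import datetime, timedelta, timezone


def is_cache_fresh(
    cached_at: datetime,
    now: datetime,
    max_cache_age_days: int = 30,
) -> bool:
    """True if the cached record is no older than max_cache_age_days."""
    # Assumption: a record exactly at the limit still counts as fresh.
    return now - cached_at <= timedelta(days=max_cache_age_days)


# Example timestamps (timezone-aware, UTC).
NOW = datetime(2024, 1, 31, tzinfo=timezone.utc)
THIRTY_DAYS_OLD = datetime(2024, 1, 1, tzinfo=timezone.utc)
THIRTY_ONE_DAYS_OLD = datetime(2023, 12, 31, tzinfo=timezone.utc)
```

Under this sketch, is_cache_fresh(THIRTY_DAYS_OLD, NOW) is true and is_cache_fresh(THIRTY_ONE_DAYS_OLD, NOW) is false, so the older record would trigger a new crawl.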

Cache key-value store name

cacheKeyValueStoreName (string, optional)

Name of the key-value store where the actor will store its cache.

Default value of this property is "website-content-vector-retriever-cache"

Index Website Content Crawled Run ID

indexWebsiteContentCrawledRunId (string, optional)

When set, the actor runs in a special mode in which it indexes the crawl into the database rather than answering user queries. This mode is invoked by a webhook from Website Content Crawler, scheduled by this actor. Note that URL and Query must still be passed, but the URL can be a dummy value because it is ignored.

Force rescrape

forceRescrape (boolean, optional)

When set to true, the actor will ignore the cache and scrape the website again.
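
The input fields above can be assembled into a single run input. The helper below is a hedged sketch: the field names and defaults are taken from this page, but build_run_input itself is hypothetical, the False default for forceRescrape is an assumption (the page lists none), and the validation shown is illustrative rather than what the actor enforces.

```python
from typing import Any, Dict

# Defaults as listed on this page. forceRescrape has no documented
# default; False is an assumption.
DEFAULTS: Dict[str, Any] = {
    "maxCrawlPages": 1000,
    "maxResults": 9999999,
    "chunkSize": 1000,
    "chunkOverlap": 200,
    "crawlerType": "cheerio",
    "maxCacheAge": 30,
    "cacheKeyValueStoreName": "website-content-vector-retriever-cache",
    "forceRescrape": False,
}
CRAWLER_TYPES = {"playwright:firefox", "cheerio"}
OPTIONAL_WITHOUT_DEFAULT = {"indexWebsiteContentCrawledRunId"}


def build_run_input(
    url: str, query: str, openai_api_key: str, **overrides: Any
) -> Dict[str, Any]:
    """Merge the required fields, page defaults, and caller overrides."""
    required = (("url", url), ("query", query), ("openaiApiKey", openai_api_key))
    for name, value in required:
        if not value:
            raise ValueError(f"{name} is required")
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_WITHOUT_DEFAULT
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {
        "url": url,
        "query": query,
        "openaiApiKey": openai_api_key,
        **DEFAULTS,
        **overrides,
    }
    if run_input["crawlerType"] not in CRAWLER_TYPES:
        raise ValueError(f"crawlerType must be one of {sorted(CRAWLER_TYPES)}")
    return run_input
```

The resulting dictionary could then be passed as run_input when starting the actor, for example through the Apify Python client. The actor's ID is not stated on this page, so it is left out here.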

OpenAI Vector Store Integration

jiri.spilka/openai-vector-store-integration

The Apify OpenAI Vector Store integration uploads data from Apify Actors to the OpenAI Vector Store linked to OpenAI Assistant.

Jiří Spilka

180

4.8

Website extract

mrahil/my-actor

It is a website extractor.

Mohammed Rahil

tsboi index

trim_flag/tsboi-index

Indexing for LLMs. This application crawls specified websites, processes their content into a searchable vector database, and enables users to ask natural language questions about the content.

Ikenna Chidoka

Fast Website Content Crawler

6sigmag/fast-website-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

1.4K

4.6

Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Apify

64K

4.3

Qdrant Integration

apify/qdrant-integration

Transfer data from Apify Actors to a Qdrant vector database.

Apify

4.5

Website Scraper

grihithbhoir707/website-scraper

Grihith Bhoir

Deep Website Content Crawler

6sigmag/deep-website-content-crawler

Scrape Failed Killer! A high-performance web scraper that rapidly extracts and analyzes content from multiple websites simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng

460

4.4

TikTok Media and Metadata Retriever

gratenes/tiktok-media-and-metadata-retriever

An API for gathering media and metadata from any TikTok media URL. Supports vm.tiktok.com, vt.tiktok.com, and other TikTok short links.

Fast URL Content Crawler

6sigmag/fast-url-content-crawler

A high-performance web scraper that rapidly extracts and analyzes content from multiple URLs simultaneously. Perfect for competitive research, content aggregation, and website structure analysis.

David Deng