Under maintenance

Pricing

Pay per usage

WCC Pinecone Integration

Developed by

Tri⟁angle

Crawl any website and store its content in your Pinecone vector database. Enhance the accuracy and reliability of your own AI Assistant with facts fetched from external sources or connect this integration to our Pinecone GPT Chatbot assistant available in Apify Store.

3.8 (5)

Issues response

52 days

Last modified

7 months ago

Automation

Integrations

Website URL

url (string, optional)

The URL of the website to fetch web pages from. The URL can be a top-level domain like https://example.com, a subdirectory like https://example.com/some-directory/, or a specific page like https://example.com/some-directory/page.html.

Vector database query

query (string, optional)

Text query used to search for relevant documents in the vector database via similarity search. The query is converted into an embedding vector using an OpenAI embedding function and compared with the vectors of the documents stored in the vector database.

No website crawling and vector DB update (query only)

noCrawling (boolean, optional)

If enabled, the crawler will not be started and the actor will only search the vector database for the given query.

Default value of this property is false

OpenAI API key

openaiApiKey (string, required)

OpenAI API key used to generate vector embeddings for the documents stored in the vector database and for the database query.

Pinecone API key

pineconeApiKey (string, required)

Your Pinecone API key.

Pinecone index name

pineconeIndexName (string, required)

The name of the Pinecone index where you want to store the vectors.
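As an illustration, the fields described so far could be assembled into an input object like the one below. This is only a sketch: the field names come from this page, but the values, the environment variable names, and the missing_required helper are placeholders invented for the example.

```python
import os

# Field names follow the input schema on this page; all values are placeholders.
run_input = {
    "url": "https://example.com/some-directory/",
    "query": "How do I reset my password?",
    "noCrawling": False,  # True would skip crawling and only query the index
    # Hypothetical environment variable names; use your own secret storage.
    "openaiApiKey": os.environ.get("OPENAI_API_KEY", "<your-openai-key>"),
    "pineconeApiKey": os.environ.get("PINECONE_API_KEY", "<your-pinecone-key>"),
    "pineconeIndexName": "my-index",
}

# The three fields marked as required on this page.
REQUIRED_FIELDS = ("openaiApiKey", "pineconeApiKey", "pineconeIndexName")


def missing_required(fields: dict) -> list:
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


print(missing_required(run_input))
```

Checking the required fields before starting a run is a cheap way to catch a forgotten key early.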

Top K results

topKResults (integer, optional)

The number of top results to return from the vector database. The results will be sorted by similarity to the query vector.

Default value of this property is 10
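To make the idea of "top K results sorted by similarity" concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the metric and uses tiny hand-written vectors; the metric actually used depends on how the Pinecone index is configured, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=10):
    """Return up to k (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in documents.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings", for illustration only.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], docs, k=2))
```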

Cache key-value store

cacheKeyValueStoreName (string, optional)

The name of the key-value store where the actor will cache URLs of the fetched websites. If the website is already being crawled, the actor will be aborted.

Default value of this property is "website-content-vector-cache"

Max results

maxResults (integer, optional)

The maximum number of resulting web pages to store. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway. If both Max pages and Max results are defined, the crawler will finish when the first limit is reached. Note that the crawler skips pages with the canonical URL of a page that has already been crawled, hence it might crawl more pages than there are results.

Default value of this property is 9999999

Chunk size

chunkSize (integer, optional)

The maximum size of each chunk in characters.

Default value of this property is 2000

Chunk overlap

chunkOverlap (integer, optional)

The number of overlapping characters between consecutive chunks.

Default value of this property is 200
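The interaction between chunk size and chunk overlap can be sketched with a simple sliding window over characters. This is an assumption-laden illustration, not the Actor's splitter: a real splitter may prefer to break on paragraph or sentence boundaries, and split_into_chunks is a hypothetical helper name.

```python
def split_into_chunks(text, chunk_size=2000, chunk_overlap=200):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(5000))
print([len(chunk) for chunk in split_into_chunks(sample)])
```

With the defaults, each new chunk starts 1800 characters after the previous one, so text near a chunk boundary appears in both neighbouring chunks.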

Crawler type

crawlerType (enum, optional)

Select the crawling engine:

Headless web browser - Useful for modern websites with anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it requires more computing resources and is slower. It is recommended to use at least 8 GB of RAM.
Stealthy web browser (default) - Another headless web browser with anti-blocking measures enabled. Try this if you encounter bot protection while scraping. For best performance, use with Apify Proxy residential IPs.
Adaptive switching between Chrome and raw HTTP client - The crawler automatically switches between raw HTTP for static pages and Chrome browser (via Playwright) for dynamic pages, to get the maximum performance wherever possible.
Raw HTTP client - High-performance crawling mode that uses raw HTTP requests to fetch the pages. It is faster and cheaper, but it might not work on all websites.

Value options:

"playwright:firefox" (string)
"playwright:chrome" (string)
"playwright:adaptive" (string)
"cheerio" (string)
"jsdom" (string)

Default value of this property is "playwright:firefox"

Include URLs (globs)

includeUrlGlobs (array, optional)

Glob patterns matching URLs of pages that will be included in crawling.

Setting this option will disable the default Start URLs based scoping and will allow you to customize the crawling scope yourself. Note that this affects only links found on pages, but not Start URLs - if you want to crawl a page, make sure to specify its URL in the Start URLs field.

For example, https://{store,docs}.example.com/** lets the crawler access all URLs starting with https://store.example.com/ or https://docs.example.com/, and https://example.com/**/*\?*foo=* allows the crawler to access all URLs that contain the foo query parameter with any value.

Learn more about globs and test them here.

Default value of this property is []
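To show how the two example patterns behave, the sketch below translates a glob into a regular expression and tests a few URLs. It is a deliberately simplified translator written for this illustration (glob_to_regex and url_matches are hypothetical names); the crawler's real glob engine may differ in edge cases, for instance in how ** treats empty path segments.

```python
import re


def glob_to_regex(glob):
    """Translate a small subset of glob syntax into a compiled regex.

    Supported: ** (anything), * (anything except "/"), ? (one non-"/"
    character), {a,b} alternatives, and backslash escapes.
    """
    parts = []
    i = 0
    while i < len(glob):
        char = glob[i]
        if char == "\\" and i + 1 < len(glob):
            parts.append(re.escape(glob[i + 1]))  # escaped literal
            i += 2
        elif glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif char == "*":
            parts.append("[^/]*")
            i += 1
        elif char == "?":
            parts.append("[^/]")
            i += 1
        elif char == "{":
            end = glob.index("}", i)
            options = glob[i + 1:end].split(",")
            parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end + 1
        else:
            parts.append(re.escape(char))
            i += 1
    return re.compile("".join(parts))


def url_matches(glob, url):
    """True if the whole URL matches the glob."""
    return glob_to_regex(glob).fullmatch(url) is not None


print(url_matches("https://{store,docs}.example.com/**", "https://docs.example.com/guide"))
print(url_matches(r"https://example.com/**/*\?*foo=*", "https://example.com/shop/items?foo=1"))
```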

Exclude URLs (globs)

excludeUrlGlobs (array, optional)

Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.

For example, https://{store,docs}.example.com/** excludes all URLs starting with https://store.example.com/ or https://docs.example.com/, and https://example.com/**/*\?*foo=* excludes all URLs that contain the foo query parameter with any value.

Learn more about globs and test them here.

Default value of this property is []

Ignore canonical URLs

ignoreCanonicalUrl (boolean, optional)

If enabled, the Actor will ignore the canonical URL reported by the page, and use the actual URL instead. You can use this feature for websites that report invalid canonical URLs, which causes the Actor to skip those pages in results.

Default value of this property is false

Max crawling depth

maxCrawlDepth (integer, optional)

The maximum number of links starting from the start URL that the crawler will recursively follow. The start URLs have depth 0, the pages linked directly from the start URLs have depth 1, and so on.

This setting is useful to prevent accidental crawler runaway. By setting it to 0, the Actor will only crawl the Start URLs.

Default value of this property is 20
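The depth rule can be pictured as a breadth-first walk over the link graph that stops following links once the limit is reached. The sketch below runs on a small in-memory graph rather than real pages; crawl_depths and the sample links are assumptions made for the example.

```python
from collections import deque


def crawl_depths(links, start_urls, max_crawl_depth=20):
    """Return {url: depth} for every URL reachable within the depth limit.

    links maps a URL to the URLs found on that page.
    """
    depth_of = {url: 0 for url in start_urls}  # start URLs have depth 0
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_crawl_depth:
            continue  # at the limit: keep the page, do not follow its links
        for target in links.get(url, []):
            if target not in depth_of:
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of


site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
}
print(crawl_depths(site, ["/"], max_crawl_depth=1))
```

With a limit of 0 only the start URLs are returned, which matches the behaviour described above.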

Max pages

maxCrawlPages (integer, optional)

The maximum number of pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.

Default value of this property is 9999999

Initial concurrency

initialConcurrency (integer, optional)

The initial number of web browsers or HTTP clients running in parallel. The system scales the concurrency up and down based on the current CPU and memory load. If the value is set to 0 (default), the Actor uses the default setting for the specific crawler type.

Note that if you set this value too high, the Actor will run out of memory and crash. If too low, it will be slow at start before it scales the concurrency up.

Default value of this property is 0

Max concurrency

maxConcurrency (integer, optional)

The maximum number of web browsers or HTTP clients running in parallel. This setting is useful to avoid overloading the target websites and to avoid getting blocked.

Default value of this property is 200

Initial cookies

initialCookies (array, optional)

Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with name and value properties. For example: [{"name": "cookieName", "value": "cookieValue"}].

You can use the EditThisCookie browser extension to copy browser cookies in this format and paste them here.

Default value of this property is []
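If the cookies are available as simple name/value pairs, the expected array of objects can be produced with a small helper like the hypothetical to_initial_cookies below. The cookie names and values are made up for the example.

```python
import json


def to_initial_cookies(cookies):
    """Convert {name: value} pairs into the array-of-objects format."""
    return [{"name": name, "value": value} for name, value in cookies.items()]


initial_cookies = to_initial_cookies({"sessionId": "abc123", "lang": "en"})
print(json.dumps(initial_cookies))
```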

Proxy configuration

proxyConfiguration (object, optional)

Enables loading websites from IP addresses in specific geographies and helps circumvent blocking.

Default value of this property is {"useApifyProxy":true}

Maximum number of session rotations

maxSessionRotations (integer, optional)

The maximum number of times the crawler will rotate the session (IP address + browser configuration) on anti-scraping measures like CAPTCHAs. If the crawler rotates the session more than this number and the page is still blocked, it will finish with an error.

Default value of this property is 10

Maximum number of retries on network / server errors

maxRequestRetries (integer, optional)

The maximum number of times the crawler will retry the request on network, proxy or server errors. If the (n+1)-th request still fails, the crawler will mark this request as failed.

Default value of this property is 5
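The "(n+1)-th request" wording means that a retry limit of n allows one initial attempt plus n retries. The sketch below shows that counting with a generic callable; fetch_with_retries is a hypothetical helper, and the choice of exception types treated as retryable is an assumption.

```python
def fetch_with_retries(fetch, max_request_retries=5):
    """Call fetch() once, then retry up to max_request_retries times.

    Raises RuntimeError once the final attempt has failed.
    """
    last_error = None
    for _ in range(max_request_retries + 1):  # 1 initial attempt + n retries
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as error:
            last_error = error  # assumed retryable: network-style failures
    raise RuntimeError(
        f"request failed after {max_request_retries + 1} attempts"
    ) from last_error


attempts = {"count": 0}


def flaky_fetch():
    """Fails twice, then succeeds (stands in for a real HTTP request)."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(fetch_with_retries(flaky_fetch))
```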

Request timeout

requestTimeoutSecs (integer, optional)

Timeout (in seconds) for making the request and processing its response. Defaults to 60s.

Default value of this property is 60

Minimum file download speed (kilobytes per second)

minFileDownloadSpeedKBps (integer, optional)

The minimum viable file download speed in kilobytes per second. If the file download speed is lower than this value for a prolonged duration, the crawler will consider the file download as failing, abort it, and retry it (up to "Maximum number of retries" times). This is useful to avoid your crawls being stuck on slow file downloads.

Default value of this property is 128

Wait for dynamic content (seconds)

dynamicContentWaitSecs (integer, optional)

The maximum time to wait for dynamic page content to load. By default, it is 10 seconds. The crawler will continue either if this time elapses, or if it detects the network became idle as there are no more requests for additional resources.

Note that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or load any dynamic resources.

Default value of this property is 10

Maximum scroll height (pixels)

maxScrollHeightPixels (integer, optional)

The crawler will scroll down the page until all content is loaded (and network becomes idle), or until this maximum scrolling height is reached. Setting this value to 0 disables scrolling altogether.

Note that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or load any dynamic resources.

Default value of this property is 5000

Remove HTML elements (CSS selector)

removeElementsCssSelector (string, optional)

A CSS selector matching HTML elements that will be removed from the DOM before converting it to text, Markdown, or saving as HTML. This is useful to skip irrelevant page content.

By default, the Actor removes common navigation elements, headers, footers, modals, scripts, and inline images. You can disable the removal by setting this value to some non-existent CSS selector like dummy_keep_everything.

Default value of this property is "nav, footer, script, style, noscript, svg,\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]"

Remove cookie warnings

removeCookieWarnings (boolean, optional)

If enabled, the Actor will try to remove cookie consent dialogs or modals, using the I don't care about cookies browser extension, to improve the accuracy of the extracted text. Note that there is a small performance penalty if this feature is enabled.

This setting is ignored when using the raw HTTP crawler type.

Default value of this property is true

Expand clickable elements

clickElementsCssSelector (string, optional)

A CSS selector matching DOM elements that will be clicked. This is useful for expanding collapsed sections, in order to capture their text content.

Default value of this property is "[aria-expanded=\"false\"]"

HTML transformer

htmlTransformer (enum, optional)

Specify how to transform the HTML to extract meaningful content without any extra fluff, like navigation or modals. The HTML transformation happens after removing and clicking the DOM elements.

Readable text with fallback - Extracts the main contents of the webpage, without navigation and other fluff, while carefully checking the content integrity.
Readable text (default) - Extracts the main contents of the webpage, without navigation and other fluff.
Extractus - Uses the Extractus library.
None - Only removes the HTML elements specified via 'Remove HTML elements' option.

You can examine the output of all transformers by enabling the debug mode.

Value options:

"readableTextIfPossible" (string)
"readableText" (string)
"extractus" (string)
"none" (string)

Default value of this property is "readableText"

Readable text extractor character threshold

readableTextCharThreshold (integer, optional)

A configuration option for the "Readable text" HTML transformer. It contains the minimum number of characters an article must have in order to be considered relevant.

Default value of this property is 100

Remove duplicate text lines

aggressivePrune (boolean, optional)

This is an experimental feature. If enabled, the crawler will prune content lines that are very similar to the ones already crawled on other pages, using the Count-Min Sketch algorithm. This is useful to strip repeating content in the scraped data like menus, headers, footers, etc. In some (not very likely) cases, it might remove relevant content from some pages.

Default value of this property is false
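For readers unfamiliar with the Count-Min Sketch, the following is a minimal sketch of the data structure and of how it could drive line pruning. It is not the Actor's implementation: the table dimensions, the hash function, the exact-match normalisation (the description above speaks of "very similar" lines), and the prune_repeated_lines helper are all assumptions chosen to keep the example short.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate, never underestimates."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, derived by salting BLAKE2b.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))


def prune_repeated_lines(pages):
    """Drop lines that were already seen on a previously processed page."""
    sketch = CountMinSketch()
    pruned_pages = []
    for page in pages:
        kept, seen_here = [], []
        for line in page.splitlines():
            key = line.strip().lower()
            if key and sketch.estimate(key) > 0:
                continue  # probably seen on an earlier page
            kept.append(line)
            if key:
                seen_here.append(key)
        for key in seen_here:  # register this page's lines afterwards
            sketch.add(key)
        pruned_pages.append("\n".join(kept))
    return pruned_pages


pages = ["Home | Docs\nWelcome to page one", "Home | Docs\nSecond page body"]
print(prune_repeated_lines(pages))
```

Because hash collisions can only inflate a count, a line is occasionally pruned by mistake, which is consistent with the warning above that relevant content might be removed in rare cases.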

Debug mode (stores output of all HTML transformers)

debugMode (boolean, optional)

If enabled, the Actor will store the output of all types of HTML transformers, including the ones that are not used by default, and it will also store the HTML to Key-value Store with a link. All this data is stored under the debug field in the resulting Dataset.

Default value of this property is false

Debug log

debugLog (boolean, optional)

If enabled, the actor log will include debug messages. Beware that this can be quite verbose.

Default value of this property is false
