Product Finder Plus: Crawler & Extractor

Pricing

from $1.00 / 1,000 product details

Product Finder Plus is a high-end e-commerce crawler built for websites where standard scraping tools fall short. It is designed to extract structured product data from complex, dynamic e-commerce stores and platforms.

Actor stats

  • Rating: 0.0 (0 reviews)
  • Developer: Datavault (Maintained by Community)
  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago


Product Finder Plus - Crawler & Extractor

Recommendation: For simpler sites, we highly recommend trying the Product Finder Crawler & Extractor as a first step. It is generally faster and more cost-effective. This "Plus" version is designed for sites that require more complex solutions, specifically those with dynamic content or advanced anti-bot protections.

Product Finder Plus: Crawler & Extractor is an enhanced, high-performance implementation of our versatile e-commerce scraper. It is designed to extract product information from virtually any website, including modern Single Page Applications (SPAs) and PWA-based stores. It leverages multi-threaded concurrency and sophisticated parsing strategies (JSON-LD, Microdata, and JS-global objects) to ensure maximum data yield with minimal overhead.

Features

  • High-Performance Concurrency: Uses a worker pool to crawl multiple pages in parallel, significantly reducing total execution time.
  • State Persistence & Resume: Automatically saves crawl progress (visited URLs and queue) to the Apify Key-Value Store. If the run is interrupted, it resumes exactly where it left off.
  • Comprehensive Product Discovery: Automatically identifies and extracts products using Schema.org (JSON-LD, Microdata), Meta Tags, and Next.js __NEXT_DATA__.
  • Dynamic JS-Object Extraction: Specifically tuned for ScandiPWA and React stores by extracting data directly from window.actionName and other global JavaScript objects.
  • Multi-Country Proxy Support: Fully integrated with Apify Proxy to bypass geo-blocks and analyze price differences across regions.
  • Pay-per-event (PPE) Integration: Fully compatible with Apify's PPE model, charging only for successful page loads and products found.
  • Configurable Limits: Control maxPagesPerCrawl, maxConcurrency, and maxRetries to manage depth and operational costs.
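The state-persistence behaviour above can be pictured with a small sketch. This is illustrative only, not the actor's internal code: the real actor stores its state in the Apify Key-Value Store, while here a plain `Map` stands in for it, and all names are hypothetical.

```javascript
// Illustrative sketch of crawl-state persistence and resume.
// A plain Map stands in for the Apify Key-Value Store.
const store = new Map();

function saveState(visited, queue) {
  // Sets are not JSON-serializable, so persist the visited set as an array.
  store.set('CRAWL_STATE', JSON.stringify({ visited: [...visited], queue }));
}

function loadState() {
  const raw = store.get('CRAWL_STATE');
  if (!raw) return { visited: new Set(), queue: [] };
  const { visited, queue } = JSON.parse(raw);
  return { visited: new Set(visited), queue };
}

// Simulate a run that is interrupted mid-crawl...
saveState(new Set(['https://a.example/page-1']), ['https://a.example/page-2']);

// ...and a fresh process picking up exactly where it left off.
const resumed = loadState();
console.log(resumed.visited.has('https://a.example/page-1')); // true
console.log(resumed.queue[0]); // https://a.example/page-2
```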

Input Parameters

  • startUrls: An array of URLs to start the crawl.
  • crawlSubpages: If checked (default: true), the crawler will follow links found on the pages.
  • maxPagesPerCrawl: The maximum number of pages to visit in a single run.
  • maxConcurrency: How many pages to process in parallel (Default: 5).
  • maxRetries: Number of times to retry a failed page fetch (Default: 3).
  • minRequestDelay: Minimum time in milliseconds to wait between requests.
  • proxyConfiguration: Apify Proxy configuration. Recommended for residential proxies on protected sites.
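To illustrate how the documented defaults interact with a partial input, here is a small caller-side helper. It is hypothetical (not part of the actor), and simply fills in the defaults listed above before the input is submitted:

```javascript
// Hypothetical helper: apply the documented defaults to a partial input.
function withDefaults(input) {
  return {
    crawlSubpages: true, // documented default
    maxConcurrency: 5,   // documented default
    maxRetries: 3,       // documented default
    ...input,            // caller-provided values win
  };
}

const input = withDefaults({
  startUrls: [{ url: 'https://www.example-store.com' }],
  maxPagesPerCrawl: 50,
});

console.log(input.maxConcurrency); // 5
console.log(input.maxPagesPerCrawl); // 50
```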

Output

The scraper outputs a dataset where each item represents a found product. Fields include:

  • url: The product page URL.
  • name: Product name.
  • description: Product description.
  • sku: Stock Keeping Unit.
  • brand: Brand name.
  • price: Product price.
  • currency: Currency code (e.g., USD, NOK).
  • image: URL of the product image.
  • availability: Availability status (e.g., InStock).
  • gtin: Global Trade Item Number (GTIN) such as EAN, UPC, ISBN.
  • rawSchema: The full extracted object for debugging or extra fields.
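A short example of post-processing the resulting dataset. The sample products are made up; only the field names match the schema documented above:

```javascript
// Hypothetical dataset items using the documented output fields.
const items = [
  { name: 'Mug', price: 9.9, currency: 'USD', availability: 'InStock' },
  { name: 'Cap', price: 14.5, currency: 'USD', availability: 'OutOfStock' },
  { name: 'Tee', price: 129, currency: 'NOK', availability: 'InStock' },
];

// Keep only purchasable products.
const inStock = items.filter((p) => p.availability === 'InStock');

// Find the cheapest in-stock product per currency.
const cheapestByCurrency = {};
for (const p of inStock) {
  const best = cheapestByCurrency[p.currency];
  if (!best || p.price < best.price) cheapestByCurrency[p.currency] = p;
}

console.log(inStock.length); // 2
console.log(cheapestByCurrency.USD.name); // Mug
```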

Sample Input

```json
{
  "startUrls": [
    { "url": "https://www.example-store.com" }
  ],
  "crawlSubpages": true,
  "maxPagesPerCrawl": 200,
  "maxConcurrency": 5,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```

How it works

  1. Initialization: The crawler loads any existing state and charges the apify-actor-start event.
  2. Concurrent Fetching: Workers pick URLs from the queue and fetch them using a persistent HTTP client.
  3. Advanced Parsing: It parses the page content using various strategies:
    • Schema.org (JSON-LD, Microdata)
    • Next.js and ScandiPWA data structures
    • Global JavaScript objects and Meta Tags
  4. Resilient Storage: Products are pushed to the Apify Dataset, and the crawl state is periodically saved to the Key-Value Store.
  5. Smart Discovery: New links are identified from both HTML anchors and dynamic JavaScript content to ensure deep coverage.
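The JSON-LD strategy in step 3 can be sketched as follows. This is a simplified stand-in for the actor's parser (a regex instead of a full HTML parser) and only handles top-level `@type: Product` nodes:

```javascript
// Simplified sketch of JSON-LD product extraction: scan
// <script type="application/ld+json"> blocks and keep Product nodes.
function extractJsonLdProducts(html) {
  const products = [];
  const re =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      const data = JSON.parse(m[1]);
      const nodes = Array.isArray(data) ? data : [data];
      for (const node of nodes) {
        if (node['@type'] === 'Product') products.push(node);
      }
    } catch {
      // Ignore malformed JSON-LD blocks and keep scanning.
    }
  }
  return products;
}

const html = `
  <script type="application/ld+json">
    { "@context": "https://schema.org", "@type": "Product",
      "name": "Example Mug", "sku": "MUG-1" }
  </script>`;

const found = extractJsonLdProducts(html);
console.log(found[0].name); // Example Mug
```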

Common issues when there are no results

  • Blocking: Some sites might require Residential Proxies or specific User-Agent headers.
  • Non-Standard Structures: If a site doesn't use standard markup or common HTML patterns, generic extraction might fail.

Tip

Try setting just one URL of your site in startUrls and setting crawlSubpages to false. Confirm you get results for that single page before scaling up the crawl.
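Assuming the sample domain used earlier (substitute a real product page from your site), a minimal debug input for this tip might look like:

```json
{
  "startUrls": [
    { "url": "https://www.example-store.com" }
  ],
  "crawlSubpages": false,
  "maxPagesPerCrawl": 1
}
```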


Feedback & Improvements

If the results don't align with your goals, please reach out and leave us a message. We use your feedback to continuously refine our extraction engine, helping us make the Product Finder better for everyone.