Universal Store Catalog Scraper (Shopify/Woo/Generic)

Pricing

from $4.00 / 1,000 results

Extract clean, normalized product-catalog data from any Shopify, WooCommerce, or generic e-commerce storefront. AI-agent / MCP ready: every store comes back in one identical schema.

Developer

Maninder Pal Singh

Maintained by Community

Universal Store Catalog Scraper 🛍️ → 🤖

Turn any Shopify, WooCommerce, or generic e-commerce storefront into a clean, normalized product feed — the same schema for every store, every time. Built MCP-first so an AI agent can consume it as a structured catalog tool.

This is a storefront-platform scraper, not a marketplace scraper. It targets independent merchant stores (each its own site) using public, non-personal product data only. The core value is cross-platform schema consistency: a Shopify store, a WooCommerce store, and a plain JSON-LD store all come back in one identical shape.


✨ Why this Actor

  • One schema, every platform. Agents don't have to special-case Shopify vs. Woo vs. generic — the output is always the same record.
  • Tiered, cheap-first extraction. Uses fast public JSON endpoints where they exist and only spins up a browser as a last resort, so runs stay fast and cheap.
  • Reliable & graceful. A malformed or blocked store degrades to fewer fields or a skip record — it never fails the whole run.
  • Polite by default. Respects robots.txt, rate-limits requests per domain, sends an honest User-Agent, and collects public data only.

🧱 How it extracts (tiered engine)

| Platform | Primary path | Fallback |
|---|---|---|
| Shopify | Public /products.json (paginated, ?limit=250&page=N) | Generic JSON-LD |
| WooCommerce | Store API /wp-json/wc/store/v1/products | Generic JSON-LD / HTML |
| Generic | schema.org Product JSON-LD | Open Graph → microdata → Playwright render |

Detection normally takes a single lightweight request that checks asset, header, and global fingerprints, plus a cheap endpoint probe when the result is ambiguous. It can be overridden with forcePlatform.
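
The cheap-first ordering in the table can be sketched roughly as below. This is only an illustration under stated assumptions, not the Actor's actual code: `shopify_page_url`, `run_tiers`, and the tier labels passed in are hypothetical names, and the extractors are plain callables so the sketch makes no network requests.

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlencode, urljoin

Product = Dict[str, object]
Extractor = Callable[[str], List[Product]]


def shopify_page_url(store_url: str, page: int, limit: int = 250) -> str:
    """Build the paginated /products.json URL for one page of a Shopify store."""
    query = urlencode({"limit": limit, "page": page})
    # A leading slash makes urljoin resolve against the site root.
    return urljoin(store_url, "/products.json") + "?" + query


def run_tiers(
    store_url: str, tiers: List[Tuple[str, Extractor]]
) -> Tuple[Optional[str], List[Product]]:
    """Try each (label, extractor) pair in order; return the first non-empty result."""
    for label, extract in tiers:
        try:
            products = extract(store_url)
        except Exception:
            # A failing tier falls through to the next one (assumed behavior).
            continue
        if products:
            return label, products
    # Nothing worked: the caller would record a skip row for this store.
    return None, []
```

Ordering the tiers from cheapest to most expensive means a browser render, if it is in the list at all, only runs after every cheaper path has returned nothing.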


📥 Input

Two modes — use either or combine them.

Mode 1 — Direct URLs (default, safest)

{
  "storeUrls": ["https://www.allbirds.com", "https://some-woostore.com/shop"],
  "maxProductsPerStore": 200,
  "includeVariants": true,
  "includeImages": true
}

Mode 2 — Discovery (opt-in, capped)

{
  "discovery": {
    "enabled": true,
    "keywords": ["handmade ceramic mugs", "merino wool socks"],
    "platformHint": "any",
    "maxStores": 10
  },
  "maxProductsPerStore": 100
}

| Field | Type | Default | Notes |
|---|---|---|---|
| storeUrls | string[] | [] | Store home / collection / product URLs |
| discovery.enabled | bool | false | Opt-in store discovery |
| discovery.keywords | string[] | [] | Niche/product terms |
| discovery.platformHint | enum | any | shopify, woocommerce, or any |
| discovery.maxStores | int | 10 | Hard max 50 |
| maxProductsPerStore | int | 200 | Hard max 5000 (cost ceiling) |
| includeVariants | bool | true | Per-variant array |
| includeImages | bool | true | Image URLs (primary first) |
| includeDescriptionHtml | bool | false | Raw HTML descriptions (bulky) |
| forcePlatform | enum | auto | Override detection |
| respectRobotsTxt | bool | true | Honor robots.txt |
| maxConcurrency | int | 5 | Stores in parallel |
| proxyConfiguration | object | Apify auto | |
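
If you assemble the input programmatically, the defaults and hard caps listed above can be applied client-side before starting a run. The following is a minimal sketch: `build_input` is a hypothetical helper, and clamping oversized values (rather than rejecting them) is an assumption about convenient behavior, not a statement of how the Actor itself validates input.

```python
import copy
from typing import Any, Dict

# Defaults and hard caps copied from the input table.
DEFAULTS: Dict[str, Any] = {
    "storeUrls": [],
    "maxProductsPerStore": 200,
    "includeVariants": True,
    "includeImages": True,
    "includeDescriptionHtml": False,
    "forcePlatform": "auto",
    "respectRobotsTxt": True,
    "maxConcurrency": 5,
}
DISCOVERY_DEFAULTS: Dict[str, Any] = {
    "enabled": False,
    "keywords": [],
    "platformHint": "any",
    "maxStores": 10,
}
MAX_PRODUCTS_PER_STORE = 5000
MAX_DISCOVERY_STORES = 50


def build_input(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge user overrides onto the defaults and clamp values to the hard caps."""
    merged = copy.deepcopy(DEFAULTS)
    merged.update({k: v for k, v in overrides.items() if k != "discovery"})

    discovery = copy.deepcopy(DISCOVERY_DEFAULTS)
    discovery.update(overrides.get("discovery", {}))
    discovery["maxStores"] = min(int(discovery["maxStores"]), MAX_DISCOVERY_STORES)
    merged["discovery"] = discovery

    merged["maxProductsPerStore"] = min(
        int(merged["maxProductsPerStore"]), MAX_PRODUCTS_PER_STORE
    )

    # At least one of the two modes has to supply stores to crawl.
    if not merged["storeUrls"] and not discovery["enabled"]:
        raise ValueError("Provide storeUrls or enable discovery")
    return merged
```

For example, `build_input({"storeUrls": ["https://www.allbirds.com"], "maxProductsPerStore": 9000})` yields an input with `maxProductsPerStore` set to 5000 and every other field at its default.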

📤 Output — one normalized record per product

{
  "storeUrl": "https://www.allbirds.com",
  "storeName": "Allbirds",
  "platform": "shopify",
  "productId": "6592289505353",
  "productUrl": "https://www.allbirds.com/products/mens-wool-runners",
  "title": "Men's Wool Runners",
  "descriptionText": "Our iconic everyday sneaker made from merino wool...",
  "brand": "Allbirds",
  "productType": "Shoes",
  "categories": ["Shoes"],
  "price": 110.0,
  "compareAtPrice": null,
  "currency": "USD",
  "onSale": false,
  "availability": "in_stock",
  "sku": "WR-NAT-GRY-8",
  "variants": [
    {
      "variantId": "39472394814025",
      "title": "Natural Grey / 8",
      "price": 110.0,
      "compareAtPrice": null,
      "sku": "WR-NAT-GRY-8",
      "availability": "in_stock",
      "options": { "Color": "Natural Grey", "Size": "8" }
    }
  ],
  "images": ["https://cdn.shopify.com/.../wool-runner-1.jpg"],
  "tags": ["wool", "runners"],
  "ratingValue": null,
  "ratingCount": null,
  "scrapedAt": "2026-06-30T10:15:42.118Z",
  "extractionMethod": "shopify:products.json"
}

Field guarantees that matter for agents:

  • price / compareAtPrice are numbers, never strings such as "$19.99" (a missing value is null); currency is a separate ISO-4217 code.
  • availability is one of in_stock | out_of_stock | preorder | unknown.
  • platform is one of shopify | woocommerce | generic.
  • extractionMethod tells you exactly which path produced the record.
  • Stores that yield nothing produce a { storeUrl, skipped, error } row instead of disappearing.

See ./samples for a full input + output example per platform.
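
A consumer that wants to verify these guarantees before trusting a record could run a small check like the one below. It is a sketch under assumptions: `check_record` is a hypothetical helper, null is treated as an acceptable value for the price fields (as in the sample record above), and the currency test only checks the three-uppercase-letter shape rather than looking the code up in the ISO-4217 list.

```python
from typing import Any, Dict, List

AVAILABILITY = {"in_stock", "out_of_stock", "preorder", "unknown"}
PLATFORMS = {"shopify", "woocommerce", "generic"}


def _is_number_or_none(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return value is None or (
        isinstance(value, (int, float)) and not isinstance(value, bool)
    )


def check_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of guarantee violations; an empty list means the record passes."""
    if "skipped" in record:
        return []  # skip rows only carry storeUrl / skipped / error

    problems: List[str] = []
    for field in ("price", "compareAtPrice"):
        if not _is_number_or_none(record.get(field)):
            problems.append(f"{field} is not a number or null")
    if record.get("availability") not in AVAILABILITY:
        problems.append("availability is outside the allowed set")
    if record.get("platform") not in PLATFORMS:
        problems.append("platform is outside the allowed set")

    currency = record.get("currency")
    if currency is not None and not (
        isinstance(currency, str)
        and len(currency) == 3
        and currency.isalpha()
        and currency.isupper()
    ):
        problems.append("currency does not look like an ISO-4217 code")
    return problems
```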


🤖 MCP / AI-agent usage

Apify exposes every Actor as an MCP tool, so an agent can call this scraper directly. Because the input schema is small and the output schema is identical across platforms, an agent can call it without any store-specific handling.

Connect (Apify MCP server): https://mcp.apify.com — add YOUR_USERNAME/store-catalog-scraper as an available tool.

An agent calls it like this (conceptual tool call):

{
  "tool": "YOUR_USERNAME/store-catalog-scraper",
  "input": {
    "storeUrls": ["https://www.allbirds.com"],
    "maxProductsPerStore": 50,
    "includeVariants": true
  }
}

The agent then reads the run's dataset — every item is one normalized product it can reason over (compare prices across stores, check stock, build a catalog) without per-store parsing logic.

Tip for agent builders: keep maxProductsPerStore modest for interactive use, and rely on platform + extractionMethod to gauge data confidence.
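
As an illustration of the kind of reasoning an agent can do over the dataset, the sketch below picks the cheapest in-stock offer for a product across stores. It assumes the dataset items have already been fetched into a list of dicts; `cheapest_in_stock` is a hypothetical helper, and matching products by exact (case-insensitive) title is a deliberate simplification, since real catalogs rarely name the same product identically.

```python
from typing import Any, Dict, Iterable, Optional

Item = Dict[str, Any]


def cheapest_in_stock(
    items: Iterable[Item], title: str, currency: str = "USD"
) -> Optional[Item]:
    """Return the lowest-priced in-stock item with the given title, or None."""
    wanted = title.lower()
    candidates = [
        item
        for item in items
        # Skip rows and records with a null title simply fail the title match.
        if (item.get("title") or "").lower() == wanted
        and item.get("availability") == "in_stock"
        and item.get("currency") == currency
        and isinstance(item.get("price"), (int, float))
    ]
    return min(candidates, key=lambda item: item["price"], default=None)
```

Because prices are numeric and currency is a separate field, the comparison needs no string parsing; the only filtering is on currency, so that prices in different currencies are never compared directly.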


💸 Pricing (pay-per-event)

| Event | Rate (USD) | When |
|---|---|---|
| Result (product scraped) | $0.004 ($4 / 1,000) | Per normalized product written to the dataset (the main charge) |
| Actor start | $0.02 | Once per run |

Platform compute + proxy are included (absorbed by the Actor), so you pay only for the events above. The per-product rate stays the same regardless of a store's size or speed.

Worked cost example — scrape 10 stores × 200 products = 2,000 products:

Actor start:       1 × $0.02      = $0.02
Products scraped:  2,000 × $0.004 = $8.00
-------------------------------------------
Total                             = $8.02 (≈ $0.004 per product)

You are only charged for real products: stores that fail or return nothing are written to a separate, non-billed dataset (skipped-stores).
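
The same arithmetic can be wrapped in a small estimator to budget a run before starting it. This is a sketch using the two event rates from the table; `estimate_cost` and `worst_case_cost` are hypothetical helper names, and `Decimal` is used so the cents add up exactly.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.02")   # charged once per run
PER_RESULT_USD = Decimal("0.004")   # charged per product written to the dataset


def estimate_cost(products_written: int, runs: int = 1) -> Decimal:
    """Cost in USD for the given number of billed products across `runs` runs."""
    if products_written < 0 or runs < 1:
        raise ValueError("products_written must be >= 0 and runs >= 1")
    return ACTOR_START_USD * runs + PER_RESULT_USD * products_written


def worst_case_cost(stores: int, max_products_per_store: int) -> Decimal:
    """Upper bound for one run: every store yields its full product cap."""
    return estimate_cost(stores * max_products_per_store)
```

Since skipped stores are not billed, the worst case is reached only when every store returns `maxProductsPerStore` products; the actual charge is usually lower.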


🔒 Responsible use

  • Public, non-personal product data only. This Actor does not log into stores, collect merchant/customer personal data, or access gated content. If a store requires authentication to view products, it is skipped.
  • You own compliance. You are responsible for complying with each target store's Terms of Service and applicable law in your jurisdiction.
  • Discovery is best-effort and user-directed. Mode 2 finds candidate public storefronts from your keywords and verifies them before crawling; you remain responsible for what you choose to crawl. It is hard-capped at 50 stores.
  • Politeness is built in: robots.txt is respected by default, requests are rate-limited per domain with jitter, and an honest User-Agent is sent.
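
For readers building something similar, the politeness rules above could look roughly like this. It is a sketch, not the Actor's implementation: `USER_AGENT`, `allowed_by_robots`, and `DomainThrottle` are hypothetical names, the delay values are placeholders, and the robots.txt text is passed in as a string so the example makes no network requests.

```python
import random
from typing import Dict, Optional
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-catalog-bot"  # hypothetical; use your crawler's real identity


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainThrottle:
    """Space out requests per domain by a minimum delay plus random jitter."""

    def __init__(self, min_delay: float = 1.0, jitter: float = 0.5,
                 rng: Optional[random.Random] = None) -> None:
        self.min_delay = min_delay
        self.jitter = jitter
        self._rng = rng or random.Random()
        self._next_allowed: Dict[str, float] = {}

    def wait_time(self, url: str, now: float) -> float:
        """Return how long to wait before requesting `url`, and reserve the slot."""
        domain = urlsplit(url).netloc
        start = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = (
            start + self.min_delay + self._rng.uniform(0, self.jitter)
        )
        return start - now
```

A caller would sleep for the returned number of seconds (for example with `time.sleep`) before issuing the request; different domains are tracked independently, so one slow store does not hold back the others.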

⚠️ Limitations

  • WooCommerce per-variant prices aren't exposed by the public Store API without extra calls, so Woo variants surface attribute combinations with price: null (product-level price is always populated).
  • Generic JSON-LD stores vary widely; fields present depend on what the store emits. Missing fields are null/[], never invented.
  • Shopify currency is read from the storefront; if a store hides it, currency may be null.
  • The Playwright fallback only triggers when cheap paths return nothing and the page looks JS-rendered; escalations are capped per run for cost.
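
One way a consumer might cope with the WooCommerce variant limitation is to fall back to the product-level price when a variant's price is null. The helper below is a hypothetical sketch; treating the product price as a stand-in is an approximation, since individual variants can be priced differently.

```python
from typing import Any, Dict, Optional


def effective_variant_price(
    product: Dict[str, Any], variant: Dict[str, Any]
) -> Optional[float]:
    """Use the variant's own price when present, else the product-level price."""
    price = variant.get("price")
    if price is not None:
        return price
    # Approximation: the product-level price may differ from the true variant price.
    return product.get("price")
```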

🧩 Extending

Platform modules are cleanly separated (src/shopify.js, src/woocommerce.js, src/generic.js) and all funnel through src/normalize.js. Adding BigCommerce or Squarespace is a drop-in: add a detector fingerprint + an extractor that emits into the shared normalizer.
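
The drop-in pattern described above (a detector fingerprint plus an extractor feeding a shared normalizer) can be sketched as follows. The Actor's own modules are JavaScript; this Python version only illustrates the shape. Every name here (`register_platform`, `detect_platform`, `normalize`, `extract_store`) and the "example-platform-cdn" fingerprint are hypothetical, and the extractor returns canned data instead of fetching anything.

```python
from typing import Any, Callable, Dict, List

Detector = Callable[[str], bool]                    # page HTML -> fingerprint matched?
Extractor = Callable[[str], List[Dict[str, Any]]]   # store URL -> raw product dicts

PLATFORMS: Dict[str, Dict[str, Callable]] = {}


def register_platform(name: str, detect: Detector, extract: Extractor) -> None:
    """Add a platform module: one detector and one extractor."""
    PLATFORMS[name] = {"detect": detect, "extract": extract}


def detect_platform(html: str, force: str = "auto") -> str:
    """Return the forced platform, the first fingerprint match, or 'generic'."""
    if force != "auto":
        return force
    for name, module in PLATFORMS.items():
        if module["detect"](html):
            return name
    return "generic"


def normalize(raw: Dict[str, Any], store_url: str, platform: str) -> Dict[str, Any]:
    """Map a raw product into (a small subset of) the shared record shape."""
    price = raw.get("price")
    return {
        "storeUrl": store_url,
        "platform": platform,
        "title": raw.get("name"),
        "price": float(price) if price is not None else None,
    }


def extract_store(store_url: str, html: str, force: str = "auto") -> List[Dict[str, Any]]:
    platform = detect_platform(html, force)
    module = PLATFORMS.get(platform)
    if module is None:
        return []  # no extractor registered for this platform in the sketch
    return [normalize(raw, store_url, platform) for raw in module["extract"](store_url)]


# Registering a made-up platform with a made-up fingerprint and canned products.
register_platform(
    "exampleplatform",
    detect=lambda html: "example-platform-cdn" in html,
    extract=lambda url: [{"name": "Mug", "price": "12.50"}],
)
```

Keeping normalization in one place is what lets each new platform module stay small: it only has to recognize its storefront and hand raw products to the shared normalizer.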