Shopify Store Scraper | Metadata & Catalog Extractor
Pricing
Pay per event
Shopify store scraper that pulls public storefront metadata, product catalogs, collections, and vendor data directly from JSON endpoints. No browser, no auth. Returns structured tables ready for competitive catalog research.
Developer
太郎 山田
Maintained by Community
Shopify Store Leads & Catalog Intelligence
After this run
Turn this Actor's output into a capped paid report with Ad Landing Page Offer Intelligence & CRO Gap Report. Use it when paid media, CRO, and agency teams need to decide which public landing-page offer gaps to fix before increasing ad spend.
- First report: $3 / `landing_offer_report`; set `maxChargeUsd` to $3.
- Deeper report: $15 / `cro_gap_report_pack`; use only when the first result needs competitor or action depth.
- This is an internal Apify flow aid. It is not revenue proof until accounted paid usage appears.
Next report-style Actors
If you already have data from this Actor, these follow-on Actors turn public or user-provided inputs into decision-ready reports. They are optional, capped by maxChargeUsd, and do not make business outcome claims.
- ATS Hiring Signal Report - turn target-company public hiring pages into expansion and account-priority signals.
- SaaS Pricing Page Monitor - monitor competitor public pricing pages after store intelligence.
- Ad Landing Page Offer Intelligence - audit public landing pages for offer, proof, CTA, and friction.
- CSV Local Business List Scoring - score exported business lists before SEO cleanup.
Runtime: Node.js 20+.
Extract analyst-ready Shopify storefront intelligence from public merchant endpoints: normalized domain, store identity, currency, price range, sampled products, collections, merch rollups, endpoint warnings, and explicit pay-per-event billing fields.
This actor is built for ecommerce analysts, growth teams, marketplace operators, technical SEO teams, data engineers, and competitive intelligence workflows that need repeatable storefront facts without running a browser. It reads public Shopify surfaces such as the homepage, /meta.json, /products.json, /collections.json, and optional /pages.json / /blogs.json. It is not a search engine or discovery actor: provide the store URLs you want inspected.
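As a rough illustration of the surfaces named above, the sketch below builds the public URLs for one storefront. The helper name `public_endpoints` and its `include_content` parameter are hypothetical and not part of the actor; only the endpoint paths come from the description above.

```python
from urllib.parse import urlsplit


def public_endpoints(store_url: str, include_content: bool = False) -> dict:
    """Build the public Shopify URLs described above for one storefront.

    Hypothetical helper for illustration; the actor's internals may differ.
    """
    parts = urlsplit(store_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an HTTP(S) storefront URL: {store_url!r}")
    base = f"{parts.scheme}://{parts.hostname}"
    endpoints = {
        "homepage": f"{base}/",
        "meta": f"{base}/meta.json",
        "products": f"{base}/products.json",
        "collections": f"{base}/collections.json",
    }
    if include_content:
        # Optional content surfaces, only when page/blog metadata is wanted.
        endpoints["pages"] = f"{base}/pages.json"
        endpoints["blogs"] = f"{base}/blogs.json"
    return endpoints
```

Any path on the input URL is discarded, so a collection or product link resolves to the same storefront-level endpoints.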
Store Quickstart
Start with dataset delivery so analysts can inspect rows before wiring automation:
- Quickstart Baseline (2 Stores -> Catalog + Merch Signals): two public storefronts, low sampling limits, and the core analyst fields: `status`, `chargedEvent`, `isShopify`, `normalizedDomain`, `storeName`, `currency`, `priceRange`, `productCount`, `productsSample`, `signals`, and `errors`.
- Recurring Baseline (Multi-Store Catalog Watch): schedule the same watchlist weekly or daily to compare catalog size, price range, vendor/tag rollups, endpoint availability, and warning counts over time.
- Webhook Routed Check (Daily Store Updates): use only after dataset rows match your downstream BI, CRM, Slack, Make, n8n, or warehouse schema.
- Content Expansion (Pages / Blogs When Public): enable when page/blog metadata matters for SEO, launch monitoring, policy copy checks, or content inventory.
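For the recurring baseline, one possible way to compare two scheduled runs is to diff their rows by `normalizedDomain`. This is a minimal sketch that assumes both runs were exported as lists of dataset rows; the function name `diff_snapshots`, the choice of tracked fields, and the shape of the change report are illustrative assumptions, not actor output.

```python
def diff_snapshots(previous: list, current: list) -> list:
    """Report per-store changes between two runs, keyed by normalizedDomain."""
    tracked = ("status", "productCount", "priceRange")
    before = {
        row["normalizedDomain"]: row
        for row in previous
        if row.get("normalizedDomain")
    }
    changes = []
    for row in current:
        domain = row.get("normalizedDomain")
        old = before.get(domain)
        if domain is None or old is None:
            continue  # new or unnormalized rows have nothing to compare against
        changed = {
            field: {"before": old.get(field), "after": row.get(field)}
            for field in tracked
            if old.get(field) != row.get(field)
        }
        if changed:
            changes.append({"normalizedDomain": domain, "changed": changed})
    return changes
```

Because `productCount` and `priceRange` are sample-based, keep the sampling limits identical between runs so that a reported change reflects the storefront rather than the configuration.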
The included store-input.example.json is the lowest-friction Store proof. sample-output.example.json shows the published result contract, including one charged Shopify row and one no-charge blocked row.
Analyst Workflow
- Provide known Shopify or ecommerce storefront URLs. The actor does not discover stores from search terms.
- Run in `delivery: "dataset"` with modest sampling limits.
- Review `status`, `chargedEvent`, `signals`, `warnings`, and `errors` before routing rows downstream.
- Filter charged Shopify rows with `chargedEvent` equal to `store_enriched` or `store_partial` for analyst review.
- Keep no-charge rows such as `invalid_input`, `blocked`, and `not_store` as watchlist cleanup tasks.
- Add webhook delivery only after analysts trust the dataset shape.
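The filtering steps above can be sketched as a small partition over exported dataset rows. The function name `split_for_review` is hypothetical; the event names are the ones listed in the workflow.

```python
CHARGED_SHOPIFY_EVENTS = {"store_enriched", "store_partial"}


def split_for_review(rows: list) -> tuple:
    """Return (charged Shopify rows for analysts, no-charge cleanup rows)."""
    review = [r for r in rows if r.get("chargedEvent") in CHARGED_SHOPIFY_EVENTS]
    cleanup = [r for r in rows if r.get("chargedEvent") is None]
    return review, cleanup
```

Rows charged under any other event fall into neither list in this sketch, so handle them separately if they matter to your workflow.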
Key Features
- Multi-store inspection for up to 50 storefront URLs per run.
- Public Shopify signal detection from homepage, meta, products, and collections endpoints.
- Analyst summary fields for domain, store name, currency, price range, product count, sampled products, status, billing event, signals, and errors.
- Catalog and collection samples from public Shopify JSON endpoints.
- Vendor, tag, and product-type rollups derived from sampled products.
- Restriction-aware output for blocked, non-JSON, unavailable, timeout, invalid, and non-store cases.
- Optional pages and blogs metadata when public endpoints expose it.
- Dataset-first and webhook-after delivery modes.
Use Cases
| Who | Workflow | Value |
|---|---|---|
| Ecommerce analysts | Track competitor catalog, price bands, and merchandising structure | Repeatable store summary rows for comparison over time |
| Data teams | Pipe normalized storefront fields into warehouses or dashboards | Stable row keys such as normalizedDomain, storeName, and currency |
| Technical SEO teams | Inspect public collections, products, pages, and blog metadata | Fast endpoint-level visibility without browser automation |
| Marketplace operators | Validate merchant storefronts and detect public Shopify evidence | Clear isShopify, status, and no-charge cleanup rows |
| RevOps / growth teams | Feed merchant intelligence into CRM or account scoring | Sampled products, signals, and errors ready for routing |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| storeUrls | string[] | required | Known storefront URLs to inspect. Custom domains and *.myshopify.com domains both work. Maximum 50 stores per run. |
| productSampleLimit | integer | 25 | Maximum public products to fetch per store from /products.json. Keep low for quickstarts and recurring monitoring. |
| collectionSampleLimit | integer | 25 | Maximum public collections to fetch per store from /collections.json. |
| includeContentMetadata | boolean | false | When true, also attempts /pages.json and /blogs.json. Leave false until content metadata is worth the extra sampling. |
| contentSampleLimit | integer | 10 | Maximum public pages or blogs to sample when content metadata is enabled. |
| timeoutMs | integer | 15000 | Per-request timeout in milliseconds. |
| delivery | string | "dataset" | `dataset` writes durable rows for review. `webhook` writes dataset rows first, then sends the full payload to webhookUrl. |
| webhookUrl | string | empty | Required only when delivery is "webhook". |
| dryRun | boolean | false | Development mode. Skips dataset writes and webhook delivery, but still writes local output/result.json. |
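Before starting a run, it can help to check an input object against the constraints in the table. The sketch below is a client-side pre-check only; `validate_input` is a hypothetical helper, and the actor performs its own validation that may differ.

```python
MAX_STORES = 50  # per-run limit stated in the input table


def validate_input(run_input: dict) -> list:
    """Return a list of human-readable problems; empty means the input looks usable."""
    problems = []
    urls = run_input.get("storeUrls")
    if not isinstance(urls, list) or not urls:
        problems.append("storeUrls is required and must be a non-empty list")
    elif len(urls) > MAX_STORES:
        problems.append(f"storeUrls has {len(urls)} entries; the maximum is {MAX_STORES}")
    delivery = run_input.get("delivery", "dataset")
    if delivery not in ("dataset", "webhook"):
        problems.append(f"unknown delivery mode: {delivery!r}")
    if delivery == "webhook" and not run_input.get("webhookUrl"):
        problems.append("webhookUrl is required when delivery is 'webhook'")
    return problems
```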
Input Example
```json
{
  "storeUrls": ["https://colourpop.com", "https://allbirds.com"],
  "productSampleLimit": 10,
  "collectionSampleLimit": 6,
  "includeContentMetadata": false,
  "contentSampleLimit": 5,
  "timeoutMs": 15000,
  "delivery": "dataset",
  "dryRun": false
}
```
Input Examples
The examples below use the field names from the input table. Vendor, tag, and product-type rollups are derived from the sampled products, so the input table lists no separate rollup flag.

Example: Single store catalog snapshot

```json
{
  "storeUrls": ["https://allbirds.com"],
  "productSampleLimit": 250,
  "collectionSampleLimit": 25
}
```

Example: Competitor catalog comparison

```json
{
  "storeUrls": ["https://brand1.myshopify.com", "https://brand2.myshopify.com"],
  "collectionSampleLimit": 25
}
```

Example: Vendor / tag rollup audit

```json
{
  "storeUrls": ["https://multi-brand-store.com"],
  "productSampleLimit": 500
}
```
Output
The Apify dataset receives one row per input storefront after normalization and deduplication. Local output/result.json wraps the same rows in { "meta": ..., "results": [...] }.
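To make "normalization and deduplication" concrete, here is a minimal sketch of how a watchlist could be collapsed to one entry per storefront before a run. It rests on assumptions: that two inputs count as the same store when their lowercased hostnames match, and that inputs without an HTTP(S) scheme are kept so they still surface as diagnostic rows. The helpers `normalize_domain` and `dedupe_stores` are hypothetical, and the actor's own rules may differ.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> Optional[str]:
    """Lowercased hostname of an HTTP(S) URL, or None when it cannot be normalized."""
    parts = urlsplit(raw.strip())
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    return parts.hostname  # urlsplit already lowercases the hostname


def dedupe_stores(store_urls: list) -> list:
    """Keep the first input seen for each normalized domain."""
    seen = set()
    unique = []
    for raw in store_urls:
        # Assumption: invalid inputs are keyed by their raw text so they are kept once.
        key = normalize_domain(raw) or raw
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```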
| Field | Type | Analyst meaning |
|---|---|---|
| status | string | Result classification: success, partial, not_shopify, blocked, invalid_input, not_store, timeout, or error. Use this before routing rows to analysts. |
| chargedEvent | string \| null | Pay-per-event name charged for this row (store_enriched, store_partial, or non_shopify_store_detected); null for no-charge rows. |
| isShopify | boolean | True when Shopify evidence was detected from metadata, Shopify endpoints, theme scripts, or public Shopify JSON. |
| normalizedDomain | string \| null | Normalized storefront domain; null when the input could not be normalized. |
| storeName | string \| null | Store name when detected; null otherwise. |
| currency | string \| null | Store currency code (for example USD) when detected; null otherwise. |
| priceRange | object | Minimum and maximum prices observed in sampled public products. Null values mean no public product prices were sampled. |
| productCount | integer | Number of public products sampled in this row. This is sample count, not full catalog size. |
| productsSample | object[] | CSV/API-friendly alias for sampled public product records. Same source as productSamples. |
| signals | string[] | Evidence used for classification and charging, such as shopify_detected, shopify_products_json, shopify_collections_json, or ecommerce_cart_or_checkout. |
| errors | object[] | Structured endpoint or run problems important for automation. Includes type, endpoint, HTTP status, and message. |
| inputUrl | string | Original URL provided by the user. |
| normalizedUrl | string \| null | Normalized form of inputUrl; null when the input is invalid. |
| hostname | string | Hostname from normalizedUrl. |
| store | object | Store profile fields such as name, canonical URL, myshopify domain, theme, locale, country, and Shopify detection. |
| summary | object | Counts, sample basis, and endpoint status map for homepage, meta, products, collections, pages, and blogs. |
| collections | object[] | Sampled public collections. |
| productSamples | object[] | Sampled public products with vendor, type, tags, variant count, availability, images, and product-level price range. |
| rollups | object | Vendor, tag, and product-type counts derived from sampled products only. |
| content | object | Optional page and blog samples when includeContentMetadata is enabled and endpoints are public. |
| warnings | object[] | Endpoint restrictions, non-JSON responses, unavailable endpoints, and sample truncation notices. |
| error | string \| null | Error message for the row, or null when there is none. |
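Since `priceRange`, `productCount`, and `rollups` are all derived from the sampled products, the relationship can be sketched as a small aggregation over `productsSample`. This is an illustration under assumptions: each product carries `vendor` and a product-level `priceRange` with `min` and `max`, and the return shape of the hypothetical `summarize_sample` helper is invented for the example rather than taken from the actor.

```python
from collections import Counter


def summarize_sample(products: list) -> dict:
    """Derive a store-level price range and vendor counts from sampled products."""
    ranges = [p.get("priceRange") or {} for p in products]
    mins = [r["min"] for r in ranges if r.get("min") is not None]
    maxs = [r["max"] for r in ranges if r.get("max") is not None]
    return {
        # Null bounds mean no public product prices were sampled.
        "priceRange": {
            "min": min(mins) if mins else None,
            "max": max(maxs) if maxs else None,
        },
        "productCount": len(products),  # sample count, not catalog size
        "vendors": dict(Counter(p["vendor"] for p in products if p.get("vendor"))),
    }
```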
Output Example
```json
{
  "inputUrl": "https://example-shop.com",
  "normalizedUrl": "https://example-shop.com",
  "hostname": "example-shop.com",
  "status": "success",
  "chargedEvent": "store_enriched",
  "isShopify": true,
  "normalizedDomain": "example-shop.com",
  "storeName": "Example Shop",
  "currency": "USD",
  "priceRange": { "min": 12.5, "max": 49.99 },
  "productCount": 2,
  "productsSample": [
    {
      "title": "Sample Product",
      "url": "https://example-shop.com/products/sample-product",
      "vendor": "Example Shop",
      "productType": "Accessory",
      "priceRange": { "min": 12.5, "max": 19.99 }
    }
  ],
  "signals": ["shopify_detected", "shopify_products_json", "shopify_collections_json"],
  "errors": [],
  "store": {
    "name": "Example Shop",
    "currency": "USD",
    "myshopifyDomain": "example-shop.myshopify.com",
    "canonicalUrl": "https://example-shop.com/",
    "themeName": "Dawn",
    "shopifyDetected": true
  },
  "summary": {
    "productSampleCount": 2,
    "collectionSampleCount": 1,
    "vendorCount": 1,
    "tagCount": 3,
    "endpointStatuses": {
      "homepage": "ok",
      "meta": "ok",
      "products": "ok",
      "collections": "ok",
      "pages": "skipped",
      "blogs": "skipped"
    }
  },
  "warnings": [],
  "error": null
}
```
No-charge diagnostic rows keep analyst queues honest:
```json
{
  "inputUrl": "not a url",
  "normalizedUrl": null,
  "hostname": "",
  "status": "invalid_input",
  "chargedEvent": null,
  "isShopify": false,
  "normalizedDomain": null,
  "storeName": null,
  "currency": null,
  "priceRange": { "min": null, "max": null },
  "productCount": 0,
  "productsSample": [],
  "signals": [],
  "errors": [
    {
      "type": "invalid_input",
      "endpoint": null,
      "status": null,
      "message": "Unsupported protocol for store URL."
    }
  ]
}
```
PPE Events And No-Charge Rules
This actor uses explicit pay-per-event row charging. Production runtime passes the event name from chargedEvent when a row should be charged.
| Result status | PPE event | Charged? | Meaning |
|---|---|---|---|
| success | store_enriched | Yes | Shopify evidence plus useful public catalog or store metadata were captured, and primary catalog endpoints were available. |
| partial | store_partial | Yes | Shopify evidence was found, but one or more important endpoints were restricted, unavailable, or incomplete. |
| not_shopify with ecommerce evidence | non_shopify_store_detected | Yes | The site looks like an ecommerce store but does not expose Shopify evidence. Useful for merchant classification. |
| invalid_input | null | No | The input could not be normalized into an HTTP(S) storefront URL. |
| blocked | null | No | Public endpoints were blocked, restricted, password-like, or non-JSON across primary surfaces. |
| not_store | null | No | The homepage loaded but no Shopify or ecommerce storefront evidence was found. |
| timeout | null | No | Endpoint requests timed out. |
| error | null | No | Unexpected run or fetch failure. |
Use chargedEvent rather than status alone for billing audits. invalid_input, blocked, not_store, timeout, and error rows are no-charge diagnostics and should be retained for watchlist cleanup.
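A billing audit along these lines could count charged events and flag rows whose `chargedEvent` does not match the table for their `status`. This is a sketch, not the actor's billing logic: `audit_billing` is a hypothetical helper, and treating `not_shopify` as valid with either `non_shopify_store_detected` or null is an assumption drawn from the "with ecommerce evidence" wording.

```python
from collections import Counter

# Expected event per status, taken from the table above.
EXPECTED_EVENT = {
    "success": "store_enriched",
    "partial": "store_partial",
    "invalid_input": None,
    "blocked": None,
    "not_store": None,
    "timeout": None,
    "error": None,
}


def audit_billing(rows: list) -> tuple:
    """Return (charged event counts, rows whose event contradicts their status)."""
    counts = Counter()
    mismatches = []
    for row in rows:
        status, event = row.get("status"), row.get("chargedEvent")
        if event is not None:
            counts[event] += 1
        if status == "not_shopify":
            # Assumption: charged only when ecommerce evidence exists.
            consistent = event in (None, "non_shopify_store_detected")
        else:
            consistent = status in EXPECTED_EVENT and EXPECTED_EVENT[status] == event
        if not consistent:
            mismatches.append(row)
    return dict(counts), mismatches
```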
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console -> Settings -> Integrations.
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~shopify-store-intelligence/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "storeUrls": ["https://colourpop.com", "https://allbirds.com"],
    "productSampleLimit": 10,
    "collectionSampleLimit": 6,
    "includeContentMetadata": false,
    "contentSampleLimit": 5,
    "timeoutMs": 15000,
    "delivery": "dataset",
    "dryRun": false
  }'
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("taroyamada/shopify-store-intelligence").call(run_input={
    "storeUrls": ["https://colourpop.com", "https://allbirds.com"],
    "productSampleLimit": 10,
    "collectionSampleLimit": 6,
    "includeContentMetadata": False,
    "contentSampleLimit": 5,
    "timeoutMs": 15000,
    "delivery": "dataset",
    "dryRun": False,
})

# Print the core analyst fields for each dataset row.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["normalizedDomain"], item["status"], item["chargedEvent"], item["productCount"])
```
JavaScript / Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('taroyamada/shopify-store-intelligence').call({
  storeUrls: ['https://colourpop.com', 'https://allbirds.com'],
  productSampleLimit: 10,
  collectionSampleLimit: 6,
  includeContentMetadata: false,
  contentSampleLimit: 5,
  timeoutMs: 15000,
  delivery: 'dataset',
  dryRun: false,
});

// Log the core analyst fields for each dataset row.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items.map((row) => ({
  normalizedDomain: row.normalizedDomain,
  status: row.status,
  chargedEvent: row.chargedEvent,
  productCount: row.productCount,
})));
```
Tips And Limitations
- This is not a search or discovery actor; provide known storefront URLs.
- `productCount`, `priceRange`, `rollups`, and `productsSample` are based on sampled public products, not the full catalog.
- Some Shopify stores restrict public JSON endpoints; these become `partial`, `blocked`, or no-charge diagnostic rows depending on available evidence.
- `not_shopify` can be charged only when useful ecommerce evidence exists, because it helps classify non-Shopify merchant URLs.
- Use `delivery: "dataset"` first. Move to webhooks only after downstream tools accept the row shape.
- Use `dryRun: true` for local development or shape checks where dataset writes and webhooks should be skipped.
FAQ
Does this fetch every product and collection?
No. The actor samples public /products.json and /collections.json results up to your configured limits.
What happens on restricted stores?
The actor emits explicit warnings and errors, and uses partial, blocked, or another diagnostic status depending on whether useful Shopify evidence was still available.
Can non-Shopify stores be useful?
Yes. If the homepage contains ecommerce evidence such as cart, checkout, product structured data, or platform hints, the row is classified as not_shopify and charged as non_shopify_store_detected.
Can I route results to another tool?
Yes. Keep dataset mode for inspection, then use webhook mode for Slack, Make, n8n, BI ingestion, CRM enrichment, or internal monitoring.
Related Actors
- Website Content Extractor for cleaned text from policy, FAQ, pricing, help-center, or landing pages.
- Contact Details Extractor for public support, sales, or partnership contacts from the same merchant domain.
- Domain Security Audit API for SSL, DMARC, expiry, and security-header checks.
- AI Visibility Monitor for brand visibility checks beside storefront monitoring.
Was this helpful?
If this actor saved you time, please leave a rating on Apify Store. Bug reports and feature requests belong on the actor Issues tab.
Premium Report Pack
Use these premium report actors when a raw dataset is ready to become a buyer-facing audit, watch summary, or agency deliverable. All three keep sourceDatasetId as advanced-only; first runs should use pasted input, URLs, demo mode, and reportTier.
- CSV Local Business List Scoring & SEO Gap Report - Score pasted local business CSV lists and produce agency-ready lead/SEO gap reports.
- SaaS Pricing Page Monitor & Competitor Price Change Alerts - Turn public pricing pages into snapshots, competitor reports, and weekly pricing watch summaries.
- Ad Landing Page Offer Intelligence & CRO Gap Report - Analyze user-provided landing pages and pasted ad copy for offer, CTA, proof, and CRO gaps.
Recommended flow from this actor: run the current extraction/check first, export the useful dataset or copy the relevant URLs, then choose entry, premium, or bundle in the report actor with maxChargeUsd as the safety cap.
Related report Actors
Use these follow-on Actors when you want a capped, decision-ready report instead of more raw rows. They use public or user-provided inputs, respect maxChargeUsd, and do not promise rankings, revenue, conversion lifts, or sales outcomes.
- SaaS Pricing Page Monitor - watch public competitor pricing and packaging pages after store research.
- Ad Landing Page Offer Intelligence - audit public landing pages for offer, proof, CTA, and friction.
- CSV Local Business List Scoring - score exported or user-provided lead lists before cleanup or outreach.
Related paid report workflows
If this Actor gave you raw rows or source context, these follow-on report Actors are designed for a small capped paid run. They help make a decision, not just collect more data.
- SaaS Pricing Page Monitor & Competitor Price Change Alerts - decide whether a public competitor pricing page changed in a way that affects packaging or sales messaging. Entry $3 / `pricing_snapshot_report`; premium $15 / `competitor_pricing_report`.
- Ad Landing Page Offer Intelligence & CRO Gap Report - decide which public landing-page offer gaps to fix before increasing ad spend. Entry $3 / `landing_offer_report`; premium $15 / `cro_gap_report_pack`.
- CSV Local Business List Scoring & SEO Gap Report - prioritize which businesses in a list deserve outreach, cleanup, or SEO follow-up. Entry $3 / `lead_scoring_report`; premium $15 / `agency_lead_gap_report`.
Keep maxChargeUsd equal to the selected tier. Internal links are traffic aids only; real proof requires accounted paid usage.