Shopify Metadata Extractor
Pricing
Pay per event
Extract raw Shopify storefront data directly from public JSON endpoints. Fetch collections, product schemas, vendor metadata, and content structure fast.
Developer
太郎 山田
Shopify Store Leads & Catalog Intelligence
Runtime: Node.js 20+.
Extract analyst-ready Shopify storefront intelligence from public merchant endpoints: normalized domain, store identity, currency, price range, sampled products, collections, merch rollups, endpoint warnings, and explicit pay-per-event billing fields.
This actor is built for ecommerce analysts, growth teams, marketplace operators, technical SEO teams, data engineers, and competitive intelligence workflows that need repeatable storefront facts without running a browser. It reads public Shopify surfaces such as the homepage, /meta.json, /products.json, /collections.json, and optional /pages.json / /blogs.json. It is not a search engine or discovery actor: provide the store URLs you want inspected.
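As a rough illustration of the surfaces listed above, the sketch below builds the public URLs that would be inspected for one storefront. The helper name `endpoint_urls` and its normalization rules (forcing `https`, dropping the path) are assumptions made for this example, not the actor's internal logic.

```python
from urllib.parse import urlsplit

# Public surfaces named in this README. Everything else here is hypothetical.
CORE_PATHS = ["/", "/meta.json", "/products.json", "/collections.json"]
CONTENT_PATHS = ["/pages.json", "/blogs.json"]


def endpoint_urls(store_url: str, include_content: bool = False) -> list[str]:
    """Return the public URLs to inspect for one storefront (illustrative only)."""
    # Accept bare domains by assuming https when no scheme is given.
    candidate = store_url if "://" in store_url else f"https://{store_url}"
    parts = urlsplit(candidate)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"Not an HTTP(S) storefront URL: {store_url!r}")
    # Assumption: inspect the site root over https, ignoring any input path.
    base = f"https://{parts.hostname}"
    paths = CORE_PATHS + (CONTENT_PATHS if include_content else [])
    return [base + path for path in paths]


if __name__ == "__main__":
    for url in endpoint_urls("https://example-shop.com", include_content=True):
        print(url)
```

The content paths are only added when `include_content` is true, mirroring the optional nature of `/pages.json` and `/blogs.json`.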
Store Quickstart
Start with dataset delivery so analysts can inspect rows before wiring automation:
- Quickstart Baseline (2 Stores -> Catalog + Merch Signals): two public storefronts, low sampling limits, and the core analyst fields: `status`, `chargedEvent`, `isShopify`, `normalizedDomain`, `storeName`, `currency`, `priceRange`, `productCount`, `productsSample`, `signals`, and `errors`.
- Recurring Baseline (Multi-Store Catalog Watch): schedule the same watchlist weekly or daily to compare catalog size, price range, vendor/tag rollups, endpoint availability, and warning counts over time.
- Webhook Routed Check (Daily Store Updates): use only after dataset rows match your downstream BI, CRM, Slack, Make, n8n, or warehouse schema.
- Content Expansion (Pages / Blogs When Public): enable when page/blog metadata matters for SEO, launch monitoring, policy copy checks, or content inventory.
The included store-input.example.json is the lowest-friction Store proof. sample-output.example.json shows the published result contract, including one charged Shopify row and one no-charge blocked row.
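For the recurring baseline, one possible way to compare two scheduled runs is to key rows by `normalizedDomain` and report which analyst fields changed. This is a minimal sketch: `diff_snapshots` is a hypothetical helper, and the set of compared fields is an assumption you would adapt to your own watchlist.

```python
# Fields compared between runs; chosen for illustration, adjust as needed.
WATCHED_FIELDS = ("status", "productCount", "priceRange")


def diff_snapshots(previous: list[dict], current: list[dict]) -> list[dict]:
    """Report per-domain field changes between two runs of the same watchlist."""
    prev_by_domain = {
        row["normalizedDomain"]: row
        for row in previous
        if row.get("normalizedDomain")
    }
    changes = []
    for row in current:
        domain = row.get("normalizedDomain")
        old = prev_by_domain.get(domain)
        if domain is None or old is None:
            continue  # new or unnormalized rows have nothing to compare against
        changed = {
            field: (old.get(field), row.get(field))
            for field in WATCHED_FIELDS
            if old.get(field) != row.get(field)
        }
        if changed:
            changes.append({"normalizedDomain": domain, "changed": changed})
    return changes
```

Because `productCount` and `priceRange` describe sampled products only, a change here signals a difference in the sample, which may or may not reflect the full catalog.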
Analyst Workflow
- Provide known Shopify or ecommerce storefront URLs. The actor does not discover stores from search terms.
- Run in `delivery: "dataset"` with modest sampling limits.
- Review `status`, `chargedEvent`, `signals`, `warnings`, and `errors` before routing rows downstream.
- Filter charged Shopify rows with `chargedEvent` equal to `store_enriched` or `store_partial` for analyst review.
- Keep no-charge rows such as `invalid_input`, `blocked`, and `not_store` as watchlist cleanup tasks.
- Add webhook delivery only after analysts trust the dataset shape.
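The filtering steps above could be sketched as a small triage function over dataset rows. The function name `triage` and the three-bucket split are assumptions for illustration; the field names and values come from the output contract in this README.

```python
CHARGED_SHOPIFY_EVENTS = {"store_enriched", "store_partial"}
CLEANUP_STATUSES = {"invalid_input", "blocked", "not_store"}


def triage(rows: list[dict]) -> tuple[list[dict], list[dict], list[dict]]:
    """Split rows into analyst review, watchlist cleanup, and everything else."""
    review, cleanup, other = [], [], []
    for row in rows:
        event = row.get("chargedEvent")
        if event in CHARGED_SHOPIFY_EVENTS:
            review.append(row)  # charged Shopify rows go to analysts
        elif event is None and row.get("status") in CLEANUP_STATUSES:
            cleanup.append(row)  # no-charge rows become cleanup tasks
        else:
            other.append(row)  # e.g. timeouts or non-Shopify detections
    return review, cleanup, other
```

Keeping a third bucket avoids silently dropping rows such as `timeout` or `not_shopify`, which may deserve a retry or a separate queue.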
Key Features
- Multi-store inspection for up to 50 storefront URLs per run.
- Public Shopify signal detection from homepage, meta, products, and collections endpoints.
- Analyst summary fields for domain, store name, currency, price range, product count, sampled products, status, billing event, signals, and errors.
- Catalog and collection samples from public Shopify JSON endpoints.
- Vendor, tag, and product-type rollups derived from sampled products.
- Restriction-aware output for blocked, non-JSON, unavailable, timeout, invalid, and non-store cases.
- Optional pages and blogs metadata when public endpoints expose it.
- Dataset-first and webhook-after delivery modes.
Use Cases
| Who | Workflow | Value |
|---|---|---|
| Ecommerce analysts | Track competitor catalog, price bands, and merchandising structure | Repeatable store summary rows for comparison over time |
| Data teams | Pipe normalized storefront fields into warehouses or dashboards | Stable row keys such as normalizedDomain, storeName, and currency |
| Technical SEO teams | Inspect public collections, products, pages, and blog metadata | Fast endpoint-level visibility without browser automation |
| Marketplace operators | Validate merchant storefronts and detect public Shopify evidence | Clear isShopify, status, and no-charge cleanup rows |
| RevOps / growth teams | Feed merchant intelligence into CRM or account scoring | Sampled products, signals, and errors ready for routing |
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `storeUrls` | string[] | required | Known storefront URLs to inspect. Custom domains and `*.myshopify.com` domains both work. Maximum 50 stores per run. |
| `productSampleLimit` | integer | 25 | Maximum public products to fetch per store from `/products.json`. Keep low for quickstarts and recurring monitoring. |
| `collectionSampleLimit` | integer | 25 | Maximum public collections to fetch per store from `/collections.json`. |
| `includeContentMetadata` | boolean | false | When true, also attempts `/pages.json` and `/blogs.json`. Leave false until content metadata is worth the extra sampling. |
| `contentSampleLimit` | integer | 10 | Maximum public pages or blogs to sample when content metadata is enabled. |
| `timeoutMs` | integer | 15000 | Per-request timeout in milliseconds. |
| `delivery` | string | "dataset" | `dataset` writes durable rows for review. `webhook` writes dataset rows first, then sends the full payload to `webhookUrl`. |
| `webhookUrl` | string | empty | Required only when `delivery` is "webhook". |
| `dryRun` | boolean | false | Development mode. Skips dataset writes and webhook delivery, but still writes local `output/result.json`. |
Input Example
    {
      "storeUrls": ["https://colourpop.com", "https://allbirds.com"],
      "productSampleLimit": 10,
      "collectionSampleLimit": 6,
      "includeContentMetadata": false,
      "contentSampleLimit": 5,
      "timeoutMs": 15000,
      "delivery": "dataset",
      "dryRun": false
    }
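Before starting a run, it can help to check an input object against the constraints stated in the input table. The following is a minimal sketch, not the actor's own validation: `validate_input` is a hypothetical helper, and treating an empty `storeUrls` list as invalid is an assumption.

```python
MAX_STORES = 50  # documented per-run maximum


def validate_input(run_input: dict) -> list[str]:
    """Return human-readable problems; an empty list means the input looks usable."""
    problems = []
    urls = run_input.get("storeUrls")
    if not isinstance(urls, list) or not urls:
        # Assumption: an empty list is treated like a missing required field.
        problems.append("storeUrls is required and must be a non-empty list")
    elif len(urls) > MAX_STORES:
        problems.append(f"storeUrls has {len(urls)} entries; maximum is {MAX_STORES}")

    delivery = run_input.get("delivery", "dataset")
    if delivery not in ("dataset", "webhook"):
        problems.append(f"delivery must be 'dataset' or 'webhook', got {delivery!r}")
    elif delivery == "webhook" and not run_input.get("webhookUrl"):
        problems.append("webhookUrl is required when delivery is 'webhook'")
    return problems
```

For example, `validate_input({"storeUrls": ["https://allbirds.com"], "delivery": "webhook"})` reports the missing `webhookUrl`.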
Output
The Apify dataset receives one row per input storefront after normalization and deduplication. Local output/result.json wraps the same rows in { "meta": ..., "results": [...] }.
| Field | Type | Analyst meaning |
|---|---|---|
| `status` | string | Result classification: success, partial, not_shopify, blocked, invalid_input, not_store, timeout, or error. Use this before routing rows to analysts. |
| `chargedEvent` | string \| null | PPE event name when the row is charged (store_enriched, store_partial, or non_shopify_store_detected); null for no-charge rows. |
| `isShopify` | boolean | True when Shopify evidence was detected from metadata, Shopify endpoints, theme scripts, or public Shopify JSON. |
| `normalizedDomain` | string \| null | Normalized storefront domain; null when the input could not be normalized. |
| `storeName` | string \| null | Store name; null when none was captured. |
| `currency` | string \| null | Store currency, such as USD; null when none was captured. |
| `priceRange` | object | Minimum and maximum prices observed in sampled public products. Null values mean no public product prices were sampled. |
| `productCount` | integer | Number of public products sampled in this row. This is sample count, not full catalog size. |
| `productsSample` | object[] | CSV/API-friendly alias for sampled public product records. Same source as `productSamples`. |
| `signals` | string[] | Evidence used for classification and charging, such as shopify_detected, shopify_products_json, shopify_collections_json, or ecommerce_cart_or_checkout. |
| `errors` | object[] | Structured endpoint or run problems important for automation. Includes type, endpoint, HTTP status, and message. |
| `inputUrl` | string | Original URL provided by the user. |
| `normalizedUrl` | string \| null | Normalized form of `inputUrl`; null when the input could not be normalized. |
| `hostname` | string | Hostname from `normalizedUrl`. |
| `store` | object | Store profile fields such as name, canonical URL, myshopify domain, theme, locale, country, and Shopify detection. |
| `summary` | object | Counts, sample basis, and endpoint status map for homepage, meta, products, collections, pages, and blogs. |
| `collections` | object[] | Sampled public collections. |
| `productSamples` | object[] | Sampled public products with vendor, type, tags, variant count, availability, images, and product-level price range. |
| `rollups` | object | Vendor, tag, and product-type counts derived from sampled products only. |
| `content` | object | Optional page and blog samples when `includeContentMetadata` is enabled and endpoints are public. |
| `warnings` | object[] | Endpoint restrictions, non-JSON responses, unavailable endpoints, and sample truncation notices. |
| `error` | string \| null | |
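To make the "sampled products only" caveat concrete, the sketch below recomputes vendor, product-type, and tag counts plus an overall price range from a list of sampled product records. The helper `sample_rollups` is hypothetical, and the key names (`vendor`, `productType`, `tags`, `priceRange`) follow the published output example where available; `tags` as a list of strings is an assumption.

```python
from collections import Counter


def sample_rollups(products: list[dict]) -> dict:
    """Derive counts and a price range from sampled product records only."""
    vendors = Counter(p["vendor"] for p in products if p.get("vendor"))
    types = Counter(p["productType"] for p in products if p.get("productType"))
    tags = Counter(tag for p in products for tag in (p.get("tags") or []))

    # Collect per-product bounds, skipping products without price data.
    ranges = [p.get("priceRange") or {} for p in products]
    mins = [r["min"] for r in ranges if r.get("min") is not None]
    maxs = [r["max"] for r in ranges if r.get("max") is not None]

    return {
        "vendors": dict(vendors),
        "productTypes": dict(types),
        "tags": dict(tags),
        "priceRange": {
            "min": min(mins) if mins else None,
            "max": max(maxs) if maxs else None,
        },
    }
```

With an empty sample the price range comes back as null values, matching the documented meaning that no public product prices were sampled.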
Output Example
    {
      "inputUrl": "https://example-shop.com",
      "normalizedUrl": "https://example-shop.com",
      "hostname": "example-shop.com",
      "status": "success",
      "chargedEvent": "store_enriched",
      "isShopify": true,
      "normalizedDomain": "example-shop.com",
      "storeName": "Example Shop",
      "currency": "USD",
      "priceRange": { "min": 12.5, "max": 49.99 },
      "productCount": 2,
      "productsSample": [
        {
          "title": "Sample Product",
          "url": "https://example-shop.com/products/sample-product",
          "vendor": "Example Shop",
          "productType": "Accessory",
          "priceRange": { "min": 12.5, "max": 19.99 }
        }
      ],
      "signals": ["shopify_detected", "shopify_products_json", "shopify_collections_json"],
      "errors": [],
      "store": {
        "name": "Example Shop",
        "currency": "USD",
        "myshopifyDomain": "example-shop.myshopify.com",
        "canonicalUrl": "https://example-shop.com/",
        "themeName": "Dawn",
        "shopifyDetected": true
      },
      "summary": {
        "productSampleCount": 2,
        "collectionSampleCount": 1,
        "vendorCount": 1,
        "tagCount": 3,
        "endpointStatuses": {
          "homepage": "ok",
          "meta": "ok",
          "products": "ok",
          "collections": "ok",
          "pages": "skipped",
          "blogs": "skipped"
        }
      },
      "warnings": [],
      "error": null
    }
No-charge diagnostic rows keep analyst queues honest:
    {
      "inputUrl": "not a url",
      "normalizedUrl": null,
      "hostname": "",
      "status": "invalid_input",
      "chargedEvent": null,
      "isShopify": false,
      "normalizedDomain": null,
      "storeName": null,
      "currency": null,
      "priceRange": { "min": null, "max": null },
      "productCount": 0,
      "productsSample": [],
      "signals": [],
      "errors": [
        {
          "type": "invalid_input",
          "endpoint": null,
          "status": null,
          "message": "Unsupported protocol for store URL."
        }
      ]
    }
PPE Events And No-Charge Rules
This actor uses explicit pay-per-event row charging. Production runtime passes the event name from chargedEvent when a row should be charged.
| Result status | PPE event | Charged? | Meaning |
|---|---|---|---|
| success | store_enriched | Yes | Shopify evidence plus useful public catalog or store metadata were captured, and primary catalog endpoints were available. |
| partial | store_partial | Yes | Shopify evidence was found, but one or more important endpoints were restricted, unavailable, or incomplete. |
| not_shopify with ecommerce evidence | non_shopify_store_detected | Yes | The site looks like an ecommerce store but does not expose Shopify evidence. Useful for merchant classification. |
| invalid_input | null | No | The input could not be normalized into an HTTP(S) storefront URL. |
| blocked | null | No | Public endpoints were blocked, restricted, password-like, or non-JSON across primary surfaces. |
| not_store | null | No | The homepage loaded but no Shopify or ecommerce storefront evidence was found. |
| timeout | null | No | Endpoint requests timed out. |
| error | null | No | Unexpected run or fetch failure. |
Use chargedEvent rather than status alone for billing audits. invalid_input, blocked, not_store, timeout, and error rows are no-charge diagnostics and should be retained for watchlist cleanup.
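A billing audit along these lines might count charged events and flag rows whose `status` and `chargedEvent` disagree with the table above. This is a sketch under stated assumptions: `audit_billing` is a hypothetical helper, and it accepts `not_shopify` rows with either the `non_shopify_store_detected` event or no event, since the charge depends on ecommerce evidence.

```python
from collections import Counter

# Expected event per status, taken from the PPE table in this README.
EXPECTED_EVENT = {
    "success": "store_enriched",
    "partial": "store_partial",
    "invalid_input": None,
    "blocked": None,
    "not_store": None,
    "timeout": None,
    "error": None,
}


def audit_billing(rows: list[dict]) -> tuple[dict, list[dict]]:
    """Return (charged event counts, rows whose status and event disagree)."""
    counts = Counter(row["chargedEvent"] for row in rows if row.get("chargedEvent"))
    mismatches = []
    for row in rows:
        status, event = row.get("status"), row.get("chargedEvent")
        if status == "not_shopify":
            # Charged only when ecommerce evidence exists, so both are plausible.
            consistent = event in ("non_shopify_store_detected", None)
        else:
            consistent = status in EXPECTED_EVENT and EXPECTED_EVENT[status] == event
        if not consistent:
            mismatches.append(row)
    return dict(counts), mismatches
```

Counting by `chargedEvent` rather than `status` keeps the totals aligned with what was actually billed, while the mismatch list surfaces rows worth a closer look.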
API Usage
Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console -> Settings -> Integrations.
cURL
    curl -X POST "https://api.apify.com/v2/acts/taroyamada~shopify-store-intelligence/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "storeUrls": ["https://colourpop.com", "https://allbirds.com"],
        "productSampleLimit": 10,
        "collectionSampleLimit": 6,
        "includeContentMetadata": false,
        "contentSampleLimit": 5,
        "timeoutMs": 15000,
        "delivery": "dataset",
        "dryRun": false
      }'
Python
    from apify_client import ApifyClient

    client = ApifyClient("YOUR_API_TOKEN")

    # Start the actor and wait for the run to finish.
    run = client.actor("taroyamada/shopify-store-intelligence").call(run_input={
        "storeUrls": ["https://colourpop.com", "https://allbirds.com"],
        "productSampleLimit": 10,
        "collectionSampleLimit": 6,
        "includeContentMetadata": False,
        "contentSampleLimit": 5,
        "timeoutMs": 15000,
        "delivery": "dataset",
        "dryRun": False,
    })

    # Print the core analyst fields for each dataset row.
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item["normalizedDomain"], item["status"], item["chargedEvent"], item["productCount"])
JavaScript / Node.js
    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

    // Start the actor and wait for the run to finish.
    const run = await client.actor('taroyamada/shopify-store-intelligence').call({
      storeUrls: ['https://colourpop.com', 'https://allbirds.com'],
      productSampleLimit: 10,
      collectionSampleLimit: 6,
      includeContentMetadata: false,
      contentSampleLimit: 5,
      timeoutMs: 15000,
      delivery: 'dataset',
      dryRun: false,
    });

    // Log the core analyst fields for each dataset row.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items.map((row) => ({
      normalizedDomain: row.normalizedDomain,
      status: row.status,
      chargedEvent: row.chargedEvent,
      productCount: row.productCount,
    })));
Tips And Limitations
- This is not a search or discovery actor; provide known storefront URLs.
- `productCount`, `priceRange`, `rollups`, and `productsSample` are based on sampled public products, not the full catalog.
- Some Shopify stores restrict public JSON endpoints; these become `partial`, `blocked`, or no-charge diagnostic rows depending on available evidence.
- `not_shopify` can be charged only when useful ecommerce evidence exists, because it helps classify non-Shopify merchant URLs.
- Use `delivery: "dataset"` first. Move to webhooks only after downstream tools accept the row shape.
- Use `dryRun: true` for local development or shape checks where dataset writes and webhooks should be skipped.
FAQ
Does this fetch every product and collection?
No. The actor samples public /products.json and /collections.json results up to your configured limits.
What happens on restricted stores?
The actor emits explicit warnings and errors, and uses partial, blocked, or another diagnostic status depending on whether useful Shopify evidence was still available.
Can non-Shopify stores be useful?
Yes. If the homepage contains ecommerce evidence such as cart, checkout, product structured data, or platform hints, the row is classified as not_shopify and charged as non_shopify_store_detected.
Can I route results to another tool?
Yes. Keep dataset mode for inspection, then use webhook mode for Slack, Make, n8n, BI ingestion, CRM enrichment, or internal monitoring.
Related Actors
- Website Content Extractor for cleaned text from policy, FAQ, pricing, help-center, or landing pages.
- Contact Details Extractor for public support, sales, or partnership contacts from the same merchant domain.
- Domain Security Audit API for SSL, DMARC, expiry, and security-header checks.
- AI Visibility Monitor for brand visibility checks beside storefront monitoring.
Was this helpful?
If this actor saved you time, please leave a rating on Apify Store. Bug reports and feature requests belong on the actor Issues tab.