Shopify Metadata Extractor

Extract raw Shopify storefront data directly from public JSON endpoints. Fetch collections, product schemas, vendor metadata, and content structure fast.

Pricing

Pay per event

Developer

太郎 山田

Maintained by Community

Shopify Store Leads & Catalog Intelligence

Runtime: Node.js 20+.

Extract analyst-ready Shopify storefront intelligence from public merchant endpoints: normalized domain, store identity, currency, price range, sampled products, collections, merch rollups, endpoint warnings, and explicit pay-per-event billing fields.

This actor is built for ecommerce analysts, growth teams, marketplace operators, technical SEO teams, data engineers, and competitive intelligence workflows that need repeatable storefront facts without running a browser. It reads public Shopify surfaces such as the homepage, /meta.json, /products.json, and /collections.json, plus /pages.json and /blogs.json when content metadata is enabled. It is not a search engine or discovery actor: you must provide the store URLs you want inspected.

Store Quickstart

Start with dataset delivery so analysts can inspect rows before wiring automation:

  • Quickstart Baseline (2 Stores -> Catalog + Merch Signals): two public storefronts, low sampling limits, and the core analyst fields: status, chargedEvent, isShopify, normalizedDomain, storeName, currency, priceRange, productCount, productsSample, signals, and errors.
  • Recurring Baseline (Multi-Store Catalog Watch): schedule the same watchlist weekly or daily to compare catalog size, price range, vendor/tag rollups, endpoint availability, and warning counts over time.
  • Webhook Routed Check (Daily Store Updates): use only after dataset rows match your downstream BI, CRM, Slack, Make, n8n, or warehouse schema.
  • Content Expansion (Pages / Blogs When Public): enable when page/blog metadata matters for SEO, launch monitoring, policy copy checks, or content inventory.
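For the recurring baseline, a comparison between two scheduled runs can be sketched as follows. This is a minimal illustration, not part of the actor: `diff_snapshots` and the choice of compared fields are assumptions, and the rows are plain dictionaries as returned by the dataset. Keep in mind that productCount and priceRange reflect sampled products only, so a change may come from sampling rather than from the catalog itself.

```python
from typing import Any

# Fields compared between runs; this selection is an assumption for illustration.
WATCHED_FIELDS = ("status", "productCount", "priceRange")


def diff_snapshots(
    previous: list[dict[str, Any]], current: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Return per-store field changes between two runs, keyed on normalizedDomain."""
    prev_by_domain = {
        row["normalizedDomain"]: row for row in previous if row.get("normalizedDomain")
    }
    changes = []
    for row in current:
        domain = row.get("normalizedDomain")
        old = prev_by_domain.get(domain)
        if domain is None or old is None:
            continue  # no baseline row to compare against
        delta = {
            field: {"before": old.get(field), "after": row.get(field)}
            for field in WATCHED_FIELDS
            if old.get(field) != row.get(field)
        }
        if delta:
            changes.append({"normalizedDomain": domain, "changes": delta})
    return changes


last_week = [{"normalizedDomain": "example-shop.com", "status": "success", "productCount": 10}]
this_week = [{"normalizedDomain": "example-shop.com", "status": "partial", "productCount": 10}]
print(diff_snapshots(last_week, this_week))
```

Stores that appear only in the newer run are skipped here; a real watchlist might report them separately as new entries.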

The included store-input.example.json is the quickest input to try from the Store. sample-output.example.json shows the published output shape, including one charged Shopify row and one no-charge blocked row.

Analyst Workflow

  1. Provide known Shopify or ecommerce storefront URLs. The actor does not discover stores from search terms.
  2. Run with delivery: "dataset" and modest sampling limits.
  3. Review status, chargedEvent, signals, warnings, and errors before routing rows downstream.
  4. Filter charged Shopify rows with chargedEvent equal to store_enriched or store_partial for analyst review.
  5. Keep no-charge rows such as invalid_input, blocked, and not_store as watchlist cleanup tasks.
  6. Add webhook delivery only after analysts trust the dataset shape.
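Steps 4 and 5 above can be sketched as a small routing helper. The function name `split_rows` and the two-queue layout are assumptions for illustration; the event and status names come from this README.

```python
from typing import Any

# Charged Shopify events that go to analyst review (step 4).
REVIEW_EVENTS = {"store_enriched", "store_partial"}
# No-charge statuses kept as watchlist cleanup tasks (step 5).
CLEANUP_STATUSES = {"invalid_input", "blocked", "not_store"}


def split_rows(rows: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split dataset rows into an analyst review queue and a cleanup queue."""
    review = [row for row in rows if row.get("chargedEvent") in REVIEW_EVENTS]
    cleanup = [
        row
        for row in rows
        if row.get("chargedEvent") is None and row.get("status") in CLEANUP_STATUSES
    ]
    return review, cleanup


rows = [
    {"normalizedDomain": "example-shop.com", "status": "success", "chargedEvent": "store_enriched"},
    {"normalizedDomain": None, "status": "invalid_input", "chargedEvent": None},
]
review, cleanup = split_rows(rows)
print(len(review), "rows for review,", len(cleanup), "rows for cleanup")
```

Rows that fall into neither queue (for example timeout, error, or charged not_shopify rows) are left out on purpose so they can be handled by a separate rule.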

Key Features

  • Multi-store inspection for up to 50 storefront URLs per run.
  • Public Shopify signal detection from homepage, meta, products, and collections endpoints.
  • Analyst summary fields for domain, store name, currency, price range, product count, sampled products, status, billing event, signals, and errors.
  • Catalog and collection samples from public Shopify JSON endpoints.
  • Vendor, tag, and product-type rollups derived from sampled products.
  • Restriction-aware output for blocked, non-JSON, unavailable, timeout, invalid, and non-store cases.
  • Optional pages and blogs metadata when public endpoints expose it.
  • Dataset-first and webhook-after delivery modes.

Use Cases

| Who | Workflow | Value |
| --- | --- | --- |
| Ecommerce analysts | Track competitor catalog, price bands, and merchandising structure | Repeatable store summary rows for comparison over time |
| Data teams | Pipe normalized storefront fields into warehouses or dashboards | Stable row keys such as normalizedDomain, storeName, and currency |
| Technical SEO teams | Inspect public collections, products, pages, and blog metadata | Fast endpoint-level visibility without browser automation |
| Marketplace operators | Validate merchant storefronts and detect public Shopify evidence | Clear isShopify, status, and no-charge cleanup rows |
| RevOps / growth teams | Feed merchant intelligence into CRM or account scoring | Sampled products, signals, and errors ready for routing |

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| storeUrls | string[] | required | Known storefront URLs to inspect. Custom domains and *.myshopify.com domains both work. Maximum 50 stores per run. |
| productSampleLimit | integer | 25 | Maximum public products to fetch per store from /products.json. Keep low for quickstarts and recurring monitoring. |
| collectionSampleLimit | integer | 25 | Maximum public collections to fetch per store from /collections.json. |
| includeContentMetadata | boolean | false | When true, also attempts /pages.json and /blogs.json. Leave false until content metadata is worth the extra sampling. |
| contentSampleLimit | integer | 10 | Maximum public pages or blogs to sample when content metadata is enabled. |
| timeoutMs | integer | 15000 | Per-request timeout in milliseconds. |
| delivery | string | "dataset" | dataset writes durable rows for review. webhook writes dataset rows first, then sends the full payload to webhookUrl. |
| webhookUrl | string | empty | Required only when delivery is "webhook". |
| dryRun | boolean | false | Development mode. Skips dataset writes and webhook delivery, but still writes local output/result.json. |
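The constraints in the input table (at most 50 stores per run, webhookUrl required for webhook delivery) can be checked client-side before a run is started. The following is a hedged sketch: `build_run_inputs` is a hypothetical helper, and how the actor itself reacts to more than 50 URLs is not stated here, so the sketch simply splits long watchlists into several run inputs.

```python
from typing import Any, Optional

MAX_STORES_PER_RUN = 50  # limit stated in the input table


def build_run_inputs(
    store_urls: list[str],
    delivery: str = "dataset",
    webhook_url: Optional[str] = None,
    **options: Any,
) -> list[dict[str, Any]]:
    """Validate delivery settings and chunk a watchlist into run inputs of up to 50 URLs."""
    if delivery not in ("dataset", "webhook"):
        raise ValueError('delivery must be "dataset" or "webhook"')
    if delivery == "webhook" and not webhook_url:
        raise ValueError('webhookUrl is required when delivery is "webhook"')

    # Drop blanks and exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(u.strip() for u in store_urls if u.strip()))

    inputs = []
    for start in range(0, len(unique), MAX_STORES_PER_RUN):
        run_input = {
            "storeUrls": unique[start:start + MAX_STORES_PER_RUN],
            "delivery": delivery,
            **options,
        }
        if delivery == "webhook":
            run_input["webhookUrl"] = webhook_url
        inputs.append(run_input)
    return inputs


batches = build_run_inputs(
    ["https://colourpop.com", "https://allbirds.com"],
    productSampleLimit=10,
    collectionSampleLimit=6,
)
print(len(batches), "run input(s)")
```

Each returned dictionary can be passed as the run input of one actor run.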

Input Example

{
  "storeUrls": [
    "https://colourpop.com",
    "https://allbirds.com"
  ],
  "productSampleLimit": 10,
  "collectionSampleLimit": 6,
  "includeContentMetadata": false,
  "contentSampleLimit": 5,
  "timeoutMs": 15000,
  "delivery": "dataset",
  "dryRun": false
}

Output

The Apify dataset receives one row per input storefront after normalization and deduplication. Local output/result.json wraps the same rows in { "meta": ..., "results": [...] }.
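For local development, the wrapped file can be unpacked with a few lines of Python. This is a sketch under the stated file layout only: the helper names are hypothetical, and the contents of meta are not documented here, so the code returns it untouched.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any


def load_results(path: str = "output/result.json") -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Read the local result file and return its meta object and result rows."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return payload["meta"], payload["results"]


def status_counts(rows: list[dict[str, Any]]) -> Counter:
    """Count rows per status as a quick sanity check on a run."""
    return Counter(row.get("status", "unknown") for row in rows)
```

After a run with dryRun: true, calling `load_results()` followed by `status_counts(rows)` gives a quick view of how many stores ended up in each status.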

| Field | Type | Analyst meaning |
| --- | --- | --- |
| status | string | Result classification: success, partial, not_shopify, blocked, invalid_input, not_store, timeout, or error. Use this before routing rows to analysts. |
| chargedEvent | string or null | PPE event name when the row is charged (store_enriched, store_partial, or non_shopify_store_detected); null for no-charge rows. |
| isShopify | boolean | True when Shopify evidence was detected from metadata, Shopify endpoints, theme scripts, or public Shopify JSON. |
| normalizedDomain | string or null | Normalized store domain; null when the input could not be normalized. |
| storeName | string or null | Store name; null when not available. |
| currency | string or null | Store currency code, such as USD; null when not available. |
| priceRange | object | Minimum and maximum prices observed in sampled public products. Null values mean no public product prices were sampled. |
| productCount | integer | Number of public products sampled in this row. This is the sample count, not the full catalog size. |
| productsSample | object[] | CSV/API-friendly alias for sampled public product records. Same source as productSamples. |
| signals | string[] | Evidence used for classification and charging, such as shopify_detected, shopify_products_json, shopify_collections_json, or ecommerce_cart_or_checkout. |
| errors | object[] | Structured endpoint or run problems important for automation. Includes type, endpoint, HTTP status, and message. |
| inputUrl | string | Original URL provided by the user. |
| normalizedUrl | string or null | Normalized storefront URL; null when the input could not be normalized. |
| hostname | string | Hostname from normalizedUrl. |
| store | object | Store profile fields such as name, canonical URL, myshopify domain, theme, locale, country, and Shopify detection. |
| summary | object | Counts, sample basis, and endpoint status map for homepage, meta, products, collections, pages, and blogs. |
| collections | object[] | Sampled public collections. |
| productSamples | object[] | Sampled public products with vendor, type, tags, variant count, availability, images, and product-level price range. |
| rollups | object | Vendor, tag, and product-type counts derived from sampled products only. |
| content | object | Optional page and blog samples when includeContentMetadata is enabled and endpoints are public. |
| warnings | object[] | Endpoint restrictions, non-JSON responses, unavailable endpoints, and sample truncation notices. |
| error | string or null | Top-level error message; null when the row has none. |

Output Example

{
  "inputUrl": "https://example-shop.com",
  "normalizedUrl": "https://example-shop.com",
  "hostname": "example-shop.com",
  "status": "success",
  "chargedEvent": "store_enriched",
  "isShopify": true,
  "normalizedDomain": "example-shop.com",
  "storeName": "Example Shop",
  "currency": "USD",
  "priceRange": { "min": 12.5, "max": 49.99 },
  "productCount": 2,
  "productsSample": [
    {
      "title": "Sample Product",
      "url": "https://example-shop.com/products/sample-product",
      "vendor": "Example Shop",
      "productType": "Accessory",
      "priceRange": { "min": 12.5, "max": 19.99 }
    }
  ],
  "signals": ["shopify_detected", "shopify_products_json", "shopify_collections_json"],
  "errors": [],
  "store": {
    "name": "Example Shop",
    "currency": "USD",
    "myshopifyDomain": "example-shop.myshopify.com",
    "canonicalUrl": "https://example-shop.com/",
    "themeName": "Dawn",
    "shopifyDetected": true
  },
  "summary": {
    "productSampleCount": 2,
    "collectionSampleCount": 1,
    "vendorCount": 1,
    "tagCount": 3,
    "endpointStatuses": {
      "homepage": "ok",
      "meta": "ok",
      "products": "ok",
      "collections": "ok",
      "pages": "skipped",
      "blogs": "skipped"
    }
  },
  "warnings": [],
  "error": null
}

No-charge diagnostic rows keep analyst queues honest:

{
  "inputUrl": "not a url",
  "normalizedUrl": null,
  "hostname": "",
  "status": "invalid_input",
  "chargedEvent": null,
  "isShopify": false,
  "normalizedDomain": null,
  "storeName": null,
  "currency": null,
  "priceRange": { "min": null, "max": null },
  "productCount": 0,
  "productsSample": [],
  "signals": [],
  "errors": [
    {
      "type": "invalid_input",
      "endpoint": null,
      "status": null,
      "message": "Unsupported protocol for store URL."
    }
  ]
}

PPE Events And No-Charge Rules

This actor uses explicit pay-per-event row charging. Production runtime passes the event name from chargedEvent when a row should be charged.

| Result status | PPE event | Charged? | Meaning |
| --- | --- | --- | --- |
| success | store_enriched | Yes | Shopify evidence plus useful public catalog or store metadata were captured, and primary catalog endpoints were available. |
| partial | store_partial | Yes | Shopify evidence was found, but one or more important endpoints were restricted, unavailable, or incomplete. |
| not_shopify with ecommerce evidence | non_shopify_store_detected | Yes | The site looks like an ecommerce store but does not expose Shopify evidence. Useful for merchant classification. |
| invalid_input | null | No | The input could not be normalized into an HTTP(S) storefront URL. |
| blocked | null | No | Public endpoints were blocked, restricted, password-like, or non-JSON across primary surfaces. |
| not_store | null | No | The homepage loaded but no Shopify or ecommerce storefront evidence was found. |
| timeout | null | No | Endpoint requests timed out. |
| error | null | No | Unexpected run or fetch failure. |

Use chargedEvent rather than status alone for billing audits. invalid_input, blocked, not_store, timeout, and error rows are no-charge diagnostics and should be retained for watchlist cleanup.
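A billing audit along these lines can be sketched by tallying chargedEvent values and flagging rows whose status and event do not match the table above. The helper `audit_billing` is hypothetical. Allowing a not_shopify row with a null event is an assumption based on the note that not_shopify is charged only when ecommerce evidence exists.

```python
from collections import Counter
from typing import Any, Optional

# Allowed chargedEvent values per status, following the PPE table.
EXPECTED_EVENTS: dict[str, set[Optional[str]]] = {
    "success": {"store_enriched"},
    "partial": {"store_partial"},
    "not_shopify": {"non_shopify_store_detected", None},  # None is an assumption
    "invalid_input": {None},
    "blocked": {None},
    "not_store": {None},
    "timeout": {None},
    "error": {None},
}


def audit_billing(rows: list[dict[str, Any]]) -> tuple[Counter, list[dict[str, Any]]]:
    """Count charged events and collect rows whose status/event pair is unexpected."""
    charged: Counter = Counter()
    mismatches = []
    for row in rows:
        status = row.get("status")
        event = row.get("chargedEvent")
        if event is not None:
            charged[event] += 1
        # Unknown statuses map to an empty set, so they are always flagged.
        if event not in EXPECTED_EVENTS.get(status, set()):
            mismatches.append(
                {"normalizedDomain": row.get("normalizedDomain"), "status": status, "chargedEvent": event}
            )
    return charged, mismatches


rows = [
    {"normalizedDomain": "example-shop.com", "status": "success", "chargedEvent": "store_enriched"},
    {"normalizedDomain": "blocked.example.com", "status": "blocked", "chargedEvent": None},
]
charged, mismatches = audit_billing(rows)
print(dict(charged), mismatches)
```

The event counts can then be compared against the charges reported for the run.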

API Usage

Run this actor programmatically using the Apify API. Replace YOUR_API_TOKEN with your token from Apify Console -> Settings -> Integrations.

cURL

curl -X POST "https://api.apify.com/v2/acts/taroyamada~shopify-store-intelligence/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "storeUrls": ["https://colourpop.com", "https://allbirds.com"], "productSampleLimit": 10, "collectionSampleLimit": 6, "includeContentMetadata": false, "contentSampleLimit": 5, "timeoutMs": 15000, "delivery": "dataset", "dryRun": false }'

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("taroyamada/shopify-store-intelligence").call(run_input={
    "storeUrls": ["https://colourpop.com", "https://allbirds.com"],
    "productSampleLimit": 10,
    "collectionSampleLimit": 6,
    "includeContentMetadata": False,
    "contentSampleLimit": 5,
    "timeoutMs": 15000,
    "delivery": "dataset",
    "dryRun": False,
})

# Print the core analyst fields for each dataset row.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["normalizedDomain"], item["status"], item["chargedEvent"], item["productCount"])

JavaScript / Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('taroyamada/shopify-store-intelligence').call({
  storeUrls: ['https://colourpop.com', 'https://allbirds.com'],
  productSampleLimit: 10,
  collectionSampleLimit: 6,
  includeContentMetadata: false,
  contentSampleLimit: 5,
  timeoutMs: 15000,
  delivery: 'dataset',
  dryRun: false,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items.map((row) => ({
  normalizedDomain: row.normalizedDomain,
  status: row.status,
  chargedEvent: row.chargedEvent,
  productCount: row.productCount,
})));

Tips And Limitations

  • This is not a search or discovery actor; provide known storefront URLs.
  • productCount, priceRange, rollups, and productsSample are based on sampled public products, not the full catalog.
  • Some Shopify stores restrict public JSON endpoints; these become partial, blocked, or no-charge diagnostic rows depending on available evidence.
  • not_shopify can be charged only when useful ecommerce evidence exists, because it helps classify non-Shopify merchant URLs.
  • Use delivery: "dataset" first. Move to webhooks only after downstream tools accept the row shape.
  • Use dryRun: true for local development or shape checks where dataset writes and webhooks should be skipped.

FAQ

Does this fetch every product and collection?

No. The actor samples public /products.json and /collections.json results up to your configured limits.

What happens on restricted stores?

The actor emits explicit warnings and errors, and uses partial, blocked, or another diagnostic status depending on whether useful Shopify evidence was still available.
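To see which endpoints cause trouble across a watchlist, the structured errors can be grouped by endpoint. This sketch relies only on the error fields shown in the output example (type, endpoint, status, message); the helper name `errors_by_endpoint` and the "(no endpoint)" label are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any


def errors_by_endpoint(rows: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Group each row's errors by endpoint so restricted surfaces stand out."""
    grouped: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for row in rows:
        for err in row.get("errors") or []:
            # Input-level errors carry a null endpoint in the example row.
            key = err.get("endpoint") or "(no endpoint)"
            grouped[key].append(
                {
                    "normalizedDomain": row.get("normalizedDomain"),
                    "type": err.get("type"),
                    "status": err.get("status"),
                }
            )
    return dict(grouped)


rows = [
    {
        "normalizedDomain": None,
        "errors": [
            {
                "type": "invalid_input",
                "endpoint": None,
                "status": None,
                "message": "Unsupported protocol for store URL.",
            }
        ],
    }
]
print(errors_by_endpoint(rows))
```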

Can non-Shopify stores be useful?

Yes. If the homepage contains ecommerce evidence such as cart, checkout, product structured data, or platform hints, the row is classified as not_shopify and charged as non_shopify_store_detected.

Can I route results to another tool?

Yes. Keep dataset mode for inspection, then use webhook mode for Slack, Make, n8n, BI ingestion, CRM enrichment, or internal monitoring.

Was this helpful?

If this actor saved you time, please leave a rating on Apify Store. Bug reports and feature requests belong on the actor Issues tab.