Shopify Scraper (GraphQL)
An Apify actor that crawls Shopify stores via `sitemap.xml` and fetches product data using the Storefront GraphQL API. Optimized for speed and cost with per-host batching, incremental processing, and buffered dataset writes.
Last modified
2 months ago
Features
- Reads `sitemap.xml`, filters product URLs (`/products/<handle>`)
- Batches GraphQL requests per store using aliases (fewer round-trips)
- Optional incremental runs (skips already processed product IDs)
- Optional lastmod cutoff to skip old products
- Outputs a single record per product; all variants are available under `additional.variants`
- Extensible via `extendScraperFunction` and `extendOutputFunction`
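Alias batching can be sketched as follows. This is a minimal illustration, not the actor's actual query: the helper names `buildBatchQuery` and `unpackBatch` and the selected fields are hypothetical, and it assumes the Storefront API's `product(handle: ...)` query field is available in the chosen API version.

```javascript
// Sketch: build one GraphQL query that fetches several products by handle,
// giving each product its own alias (p0, p1, ...).
// ASSUMPTION: helper names and the selection set are illustrative only.
function buildBatchQuery(handles) {
  const fields = handles.map((handle, i) =>
    // JSON.stringify yields a double-quoted, escaped string literal,
    // which is also a valid GraphQL string for typical handles.
    `  p${i}: product(handle: ${JSON.stringify(handle)}) { id handle title }`
  );
  return `query {\n${fields.join("\n")}\n}`;
}

// Map the aliased response data back to the requested handles.
// A product that was not found comes back as null under its alias.
function unpackBatch(handles, data) {
  return handles.map((handle, i) => ({ handle, product: data[`p${i}`] ?? null }));
}

const exampleHandles = ["red-shirt", "blue-hat"];
console.log(buildBatchQuery(exampleHandles));
```

One request then carries a whole batch of products, which is where the "fewer round-trips" saving comes from.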
Input parameters (core)
- `startUrls`: array of `sitemap.xml` URLs
- `storefrontApiVersion`: Storefront API version (e.g., `2024-07`)
- `storefrontAccessToken`: your Storefront access token
- `maxRequestsPerCrawl`, `maxConcurrency`, `maxRequestRetries`, `proxyConfig`, `debugLog`
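To show how these inputs could fit together, here is a minimal sketch of a Storefront GraphQL request built from them. The helper `buildStorefrontRequest` is hypothetical and not part of the actor. It assumes the standard Storefront endpoint path `/api/<version>/graphql.json` and the `X-Shopify-Storefront-Access-Token` header, and that the store serves the API on the same host as its sitemap.

```javascript
// Sketch: derive the Storefront GraphQL endpoint and request options
// from the actor inputs. ASSUMPTION: helper name and shape are hypothetical.
function buildStorefrontRequest({ sitemapUrl, storefrontApiVersion, storefrontAccessToken, query }) {
  // Keep only scheme + host of the sitemap URL, then append the API path.
  const url = `${new URL(sitemapUrl).origin}/api/${storefrontApiVersion}/graphql.json`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Shopify-Storefront-Access-Token": storefrontAccessToken,
      },
      body: JSON.stringify({ query }),
    },
  };
}

const req = buildStorefrontRequest({
  sitemapUrl: "https://example.com/sitemap.xml",
  storefrontApiVersion: "2024-07",
  storefrontAccessToken: "<YOUR_STOREFRONT_TOKEN>",
  query: "{ shop { name } }",
});
console.log(req.url); // https://example.com/api/2024-07/graphql.json
```

With Node 18 or later, the result could be sent with `fetch(req.url, req.options)`.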
Performance inputs
- `updatedSince`: ISO date; skips products with `<lastmod>` older than this
- `batchSize`: product handles per GraphQL request (default 10)
- `flushIntervalMs`: max delay in milliseconds before sending a partial batch (default 300)
- `perHostConcurrency`: parallel GraphQL requests per store (default 2)
- `bufferWrites`: buffer dataset writes (default true)
- `bufferSize`: items per dataset push (default 100)
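The interaction of `updatedSince` and `batchSize` can be sketched like this. It is an illustration under assumptions, not the actor's code: `selectProductHandles` and `chunk` are hypothetical names, the sitemap is parsed with regular expressions for brevity (a real crawler would more likely use an XML parser), and entries without `<lastmod>` are kept because their age is unknown.

```javascript
// Sketch: pick product handles from sitemap XML, skipping entries whose
// <lastmod> is older than updatedSince. ASSUMPTION: regex parsing is a
// simplification and the function names are hypothetical.
function selectProductHandles(sitemapXml, updatedSince) {
  const cutoff = updatedSince ? Date.parse(updatedSince) : null;
  const handles = [];
  for (const [, entry] of sitemapXml.matchAll(/<url>([\s\S]*?)<\/url>/g)) {
    const loc = entry.match(/<loc>\s*([^<\s]+)\s*<\/loc>/)?.[1];
    if (!loc) continue;
    // Only URLs ending in /products/<handle> count as product pages.
    const handle = new URL(loc).pathname.match(/\/products\/([^/]+)\/?$/)?.[1];
    if (!handle) continue;
    const lastmod = entry.match(/<lastmod>\s*([^<\s]+)\s*<\/lastmod>/)?.[1];
    if (cutoff !== null && lastmod && Date.parse(lastmod) < cutoff) continue;
    handles.push(handle);
  }
  return handles;
}

// Split handles into groups of batchSize, one group per GraphQL request.
function chunk(items, batchSize = 10) {
  const batches = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

const sampleXml = `
<urlset>
  <url><loc>https://example.com/products/red-shirt</loc><lastmod>2024-08-01T00:00:00Z</lastmod></url>
  <url><loc>https://example.com/products/old-hat</loc><lastmod>2023-01-01T00:00:00Z</lastmod></url>
  <url><loc>https://example.com/pages/about</loc></url>
</urlset>`;
console.log(selectProductHandles(sampleXml, "2024-01-01")); // [ 'red-shirt' ]
```

Filtering before batching means old products never reach the GraphQL API, so the cutoff reduces both request count and dataset writes.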
Run locally
- Install dependencies:
    $ npm install
- Create local input at `apify_storage/key_value_stores/default/INPUT.json`, for example:
    {
      "startUrls": [{ "url": "https://example.com/sitemap.xml" }],
      "storefrontApiVersion": "2024-07",
      "storefrontAccessToken": "<YOUR_STOREFRONT_TOKEN>",
      "maxRequestsPerCrawl": 50,
      "maxConcurrency": 10,
      "debugLog": true
    }
- Start the actor:
    $ npm start
  Or run in development mode with auto-restart:
    $ npm run dev
GitHub integration
Workflows in `.github/workflows/`:
- `ci.yml`: install, lint, and syntax check on push/PR to `main`.
- `codeql.yml`: CodeQL security analysis on push/PR and weekly.
Docker quick start
    make init   # creates .env and INPUT.json from templates
    make run    # docker compose up --build actor
Outputs will be in `apify_storage/datasets/default`.
Extensibility
- `extendScraperFunction`: lifecycle hooks (`SETUP`, `FILTER_SITEMAP_URL`, `PRENAVIGATION`, `POSTNAVIGATION`, `RUN`, `FINISHED`)
- `extendOutputFunction`: transform/filter final records before they are saved to the Dataset
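As a rough idea of what an output hook might look like, the sketch below filters and transforms a record. The hook signature is an assumption: this README does not specify it, so the argument shape (`{ item }`), the convention of returning `null` to drop a record, and the helper `transformRecord` are all hypothetical. Check the actor's input schema for the real contract.

```javascript
// Sketch: drop products without variants and add a derived field.
// ASSUMPTION: transformRecord is a hypothetical helper; only the
// additional.variants path comes from the record shape described above.
function transformRecord(item) {
  const variants = item.additional?.variants ?? [];
  if (variants.length === 0) return null; // filter: skip this record
  return { ...item, variantCount: variants.length }; // transform: add a field
}

// ASSUMPTION: the hook is called with an object that carries the record as `item`.
const extendOutputFunction = async ({ item }) => transformRecord(item);

console.log(transformRecord({ title: "Red shirt", additional: { variants: [{}, {}] } }));
```

Keeping the logic in a plain synchronous helper makes it easy to unit test separately from the actor run.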
License
This project is licensed under the Apache License 2.0. See LICENSE and NOTICE.
