Shopify Scraper (GraphQL)
Pricing
Pay per usage
An Apify actor that crawls Shopify stores via `sitemap.xml` and fetches product data using the Storefront GraphQL API. Optimized for speed and cost with per-host batching, incremental processing, and buffered dataset writes.
Developer: Alex
Features
- Reads `sitemap.xml` and filters product URLs (`/products/<handle>`)
- Batches GraphQL requests per store using aliases (fewer round-trips)
- Optional incremental runs (skips already processed product IDs)
- Optional lastmod cutoff to skip old products
- Outputs a single record per product; all variants are available under `additional.variants`
- Extensible via `extendScraperFunction` and `extendOutputFunction`
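The first two features can be illustrated with a minimal sketch. It is not the actor's actual code: the function names `productHandleFromUrl` and `buildAliasedQuery` are hypothetical, and the selected fields (`id`, `handle`, `title`) are only an illustrative subset. It assumes the Storefront API `product(handle:)` query field and uses GraphQL variables so that handles never need to be escaped into the query string.

```javascript
// Hypothetical helper: return the product handle for a /products/<handle> URL,
// or null for any other URL (collections, pages, malformed input).
function productHandleFromUrl(url) {
  try {
    const { pathname } = new URL(url);
    const match = pathname.match(/\/products\/([^/]+)\/?$/);
    return match ? decodeURIComponent(match[1]) : null;
  } catch {
    return null; // invalid URL or malformed percent-encoding
  }
}

// Hypothetical helper: build one GraphQL document that fetches several products
// at once. Each product gets its own alias (p0, p1, ...) and its own variable
// ($h0, $h1, ...), so a single round-trip returns the whole batch.
function buildAliasedQuery(handles) {
  const variableDefs = handles.map((_, i) => `$h${i}: String!`).join(", ");
  const fields = handles
    .map((_, i) => `  p${i}: product(handle: $h${i}) { id handle title }`)
    .join("\n");
  const variables = Object.fromEntries(handles.map((h, i) => [`h${i}`, h]));
  return {
    query: `query Products(${variableDefs}) {\n${fields}\n}`,
    variables,
  };
}
```

In the response, `data.p0`, `data.p1`, and so on line up with the order of the input handles. An alias whose value is `null` means that handle did not resolve to a product.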
Input parameters (core)
- `startUrls`: array of `sitemap.xml` URLs
- `storefrontApiVersion`: Storefront API version (e.g., `2024-07`)
- `storefrontAccessToken`: your Storefront access token
- `maxRequestsPerCrawl`, `maxConcurrency`, `maxRequestRetries`, `proxyConfig`, `debugLog`
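To show how these inputs relate to a Storefront API call, here is a hedged sketch that derives the GraphQL endpoint from a sitemap URL. The helper name `buildStorefrontRequest` is hypothetical. It assumes the store serves the standard `/api/<version>/graphql.json` endpoint on the same origin as its sitemap, and it sends the token in the `X-Shopify-Storefront-Access-Token` header. The actor's internal request code may differ.

```javascript
// Hypothetical helper: turn actor-style inputs into a URL plus fetch() options.
function buildStorefrontRequest({
  sitemapUrl,
  storefrontApiVersion,
  storefrontAccessToken,
  query,
  variables = {},
}) {
  // Assumption: the GraphQL endpoint lives on the same origin as sitemap.xml.
  const { origin } = new URL(sitemapUrl);
  return {
    url: `${origin}/api/${storefrontApiVersion}/graphql.json`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Shopify-Storefront-Access-Token": storefrontAccessToken,
      },
      body: JSON.stringify({ query, variables }),
    },
  };
}
```

The result can be passed straight to `fetch(url, options)` on a Node.js version that ships a global `fetch`.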
Performance inputs
- `updatedSince`: ISO date; skips products with `<lastmod>` older than this
- `batchSize`: product handles per GraphQL request (default 10)
- `flushIntervalMs`: max delay before sending a partial batch (default 300)
- `perHostConcurrency`: parallel GraphQL requests per store (default 2)
- `bufferWrites`: buffer dataset writes (default true)
- `bufferSize`: items per dataset push (default 100)
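The interplay of `batchSize` and `flushIntervalMs` can be sketched as a small batcher: a batch is sent as soon as it is full, and a timer ensures a partial batch never waits longer than the flush interval. The class name `HandleBatcher` and the `onFlush` callback are hypothetical. This is one plausible reading of the two options, not the actor's implementation.

```javascript
// Hypothetical sketch: collect product handles and hand them off in batches.
class HandleBatcher {
  constructor({ batchSize = 10, flushIntervalMs = 300, onFlush }) {
    this.batchSize = batchSize;
    this.flushIntervalMs = flushIntervalMs;
    this.onFlush = onFlush; // called with an array of handles
    this.pending = [];
    this.timer = null;
  }

  add(handle) {
    this.pending.push(handle);
    if (this.pending.length >= this.batchSize) {
      // Full batch: send immediately.
      this.flush();
    } else if (this.timer === null) {
      // First item of a new partial batch: start the maximum-delay timer.
      this.timer = setTimeout(() => this.flush(), this.flushIntervalMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.onFlush(batch);
  }
}
```

Keeping one batcher per store host would keep each batch bound to a single GraphQL endpoint, which fits the per-host batching described above. Calling `flush()` once the sitemap has been fully read sends any remaining handles.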
Run locally
- Install dependencies:
    $ npm install
- Create local input at `apify_storage/key_value_stores/default/INPUT.json`, for example:
    {
      "startUrls": [{ "url": "https://example.com/sitemap.xml" }],
      "storefrontApiVersion": "2024-07",
      "storefrontAccessToken": "<YOUR_STOREFRONT_TOKEN>",
      "maxRequestsPerCrawl": 50,
      "maxConcurrency": 10,
      "debugLog": true
    }
- Start the actor:
    $ npm start
Or run in development mode with auto-restart:
    $ npm run dev
GitHub integration
Workflows in .github/workflows/:
- `ci.yml`: install, lint, and syntax check on push/PR to `main`.
- `codeql.yml`: CodeQL security analysis on push/PR and weekly.
Docker quick start
    make init   # creates .env and INPUT.json from templates
    make run    # docker compose up --build actor
Outputs will be in apify_storage/datasets/default.
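To inspect a finished run, the local dataset can be loaded with a few lines of Node.js. This sketch assumes that each record is stored as a separate numbered `.json` file in the dataset directory, as Apify's local storage emulation usually does. The function names `readLocalDataset` and `countVariants` are hypothetical.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: load every JSON record from a local dataset directory,
// in file-name order (assumes one record per numbered .json file).
function readLocalDataset(datasetDir = "apify_storage/datasets/default") {
  return fs
    .readdirSync(datasetDir)
    .filter((name) => name.endsWith(".json"))
    .sort()
    .map((name) =>
      JSON.parse(fs.readFileSync(path.join(datasetDir, name), "utf8"))
    );
}

// Hypothetical helper: total number of variants across all product records,
// reading them from additional.variants as described under Features.
function countVariants(items) {
  return items.reduce(
    (total, item) => total + (item.additional?.variants?.length ?? 0),
    0
  );
}
```

For example, `countVariants(readLocalDataset())` gives a quick sanity check of how many variants a run collected.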
Extensibility
- `extendScraperFunction`: lifecycle hooks (`SETUP`, `FILTER_SITEMAP_URL`, `PRENAVIGATION`, `POSTNAVIGATION`, `RUN`, `FINISHED`)
- `extendOutputFunction`: transform/filter final records before they are saved to the Dataset
License
This project is licensed under the Apache License 2.0. See LICENSE and NOTICE.