Pinterest Easy Scraper

Discover unlimited public data on Pinterest with our Easy Scraper. Dive into profiles and "pins" with more depth than ever. Your free, go-to tool for seamless Pinterest insights.

Pricing: $19.99/month + usage
Developer: codemaster devops
Pinterest Scraper — Profile + Pin Data Extractor
Scrape Pinterest profiles and pins into a clean, structured Apify dataset via a 4-tier reliability cascade: public widget API → SSR HTML bootstrap → internal resource API → headless browser fallback.
Keywords: Pinterest scraper, Pinterest API scraper, Pinterest profile data, Pinterest pin metadata, Apify actor, web scraping Pinterest.
Version 2.0.x — what changed
- 4-tier cascade replaces the single-endpoint scraper. Small runs resolve via the unauthenticated widget endpoint; large runs bootstrap session state from SSR HTML and paginate via the internal API; unreachable profiles optionally escalate to a real headless browser.
- Residential Apify proxy is mandatory for `useApifyProxy: true`. Pinterest aggressively blocks datacenter IPs; the actor fails fast at input validation so you never waste a compute unit on a guaranteed-to-fail run.
- `legacyKvKeys` removed (breaking change). Profiles are now stored under a single collision-safe key: `profile-<sanitizedUsername>-<hash>`. See Migration from 1.x below.
- Pinterest-aware headers (`X-Requested-With`, `X-APP-VERSION`, `X-Pinterest-AppState`, `X-Pinterest-PWS-Handler`, `X-CSRFToken`, `Referer`) are now emitted on every internal-API request. got-scraping's auto header generator is disabled so the Pinterest-specific headers survive.
- Session hardening — `useSessionPool: true`, `persistCookiesPerSession: true`, soft-block detection via content-type + body regex, and deliberate `session.retire()` / `session.markBad()` based on failure class.
- Automated tests + nightly canary on the committed fixtures, plus a real widget-endpoint shape check for `nasa`/`natgeo`.
Features
- Profile extraction — usernames, bios, follower counts, website, profile image, verified status.
- Pin extraction — id, title, description, outbound link, domain, board, image, aggregated engagement stats.
- Curated output by default — compact, CSV-friendly schema; opt into the full raw payload with `includeRaw: true`.
- Deduplication — pins are deduped by `id` across pagination pages.
- Bookmark-based pagination with safety guard — cannot get stuck when Pinterest returns the same cursor twice.
- Rate limiting — configurable jitter (`minDelayMs`/`maxDelayMs`) + session pool rotation.
- Per-tier retry budgets — widget: 2, html: 3, resource: 5, browser: 2.
- Fail-fast validation — invalid input, non-residential proxy, or non-Pinterest URLs error loudly.
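The dedup and bookmark-guard behaviour can be sketched roughly like this. This is a simplified illustration, not the actor's actual code; `fetchPinsPage` is a hypothetical stand-in for the resource-tier pager:

```javascript
// Sketch of id-based dedup plus the bookmark loop guard.
// fetchPinsPage(bookmark) is assumed to return { pins: [...], bookmark }.
function collectPins(fetchPinsPage, maxPins) {
  const seen = new Set();
  const pins = [];
  let bookmark = null;
  let lastBookmark = null;

  while (pins.length < maxPins) {
    const page = fetchPinsPage(bookmark);
    for (const pin of page.pins) {
      if (!seen.has(pin.id)) { // dedupe across pagination pages
        seen.add(pin.id);
        pins.push(pin);
      }
    }
    // Safety guard: a repeated (or missing) cursor means Pinterest is
    // looping or exhausted — terminate instead of spinning forever.
    if (!page.bookmark || page.bookmark === lastBookmark) break;
    lastBookmark = bookmark = page.bookmark;
  }
  return pins.slice(0, maxPins);
}
```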
Quick start (Apify Console)
- Open the Pinterest Scraper actor in the Apify Console.
- Click Try for free / Start.
- The default input already contains two public profiles (`nasa`, `natgeo`) and Apify Proxy with the residential group — just press Start.
- When the run finishes, open the Dataset tab and export as JSON, CSV, Excel, RSS, or HTML.
Sample input
```json
{
  "startUrls": ["https://www.pinterest.com/nasa/", "natgeo"],
  "maxPinsCnt": 50,
  "includeRaw": false,
  "widgetFirst": true,
  "fallbackToBrowser": false,
  "proxyConfig": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] },
  "minConcurrency": 1,
  "maxConcurrency": 5,
  "maxRequestRetries": 5,
  "requestHandlerTimeoutSecs": 30,
  "minDelayMs": 500,
  "maxDelayMs": 2000
}
```
Input schema
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `startUrls` | `string[]` or `{url: string}[]` | yes | — | Profile URLs (`https://www.pinterest.com/nasa/`) or bare usernames (`nasa`). Locale segments (`/en/`, `/de/`, etc.) are stripped. |
| `maxPinsCnt` | integer | no | `50` | Max pins per profile (1–10000). `<= 25` uses the widget tier exclusively; larger values paginate via the resource tier. |
| `includeRaw` | boolean | no | `false` | Attach the full raw Pinterest payload (minus known noise fields) to each pin/profile record. |
| `widgetFirst` | boolean | no | `true` | Start small runs (`maxPinsCnt <= 25`) on the public widget endpoint. Disable to always bootstrap from HTML. |
| `fallbackToBrowser` | boolean | no | `false` | Escalate to a real headless browser when widget + HTML + resource tiers all fail for a profile. Adds significant memory + runtime. |
| `proxyConfig` | object | yes | `{useApifyProxy: true, apifyProxyGroups: ["RESIDENTIAL"]}` | Apify Proxy (residential required) or custom `{useApifyProxy: false, proxyUrls: [...]}`. |
| `minConcurrency` | integer | no | `1` | Lower bound of parallel requests. |
| `maxConcurrency` | integer | no | `5` | Upper bound. Keep low (2–5); Pinterest rate-limits aggressively. |
| `maxRequestRetries` | integer | no | `5` | Global retry ceiling. Per-tier caps apply (widget: 2, html: 3, resource: 5, browser: 2). |
| `requestHandlerTimeoutSecs` | integer | no | `30` | Per-request processing timeout. |
| `minDelayMs` / `maxDelayMs` | integer | no | `500` / `2000` | Jittered pre-request delay. Raise on 403/429. |
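The jittered pre-request delay is conceptually just a uniform random sleep in the `[minDelayMs, maxDelayMs]` range. An illustrative sketch (not the actor's actual helper):

```javascript
// Pick a uniform random delay between the configured bounds (inclusive).
function jitterMs(minDelayMs, maxDelayMs) {
  return minDelayMs + Math.floor(Math.random() * (maxDelayMs - minDelayMs + 1));
}

// Awaited before each request, e.g.:
// await new Promise((resolve) => setTimeout(resolve, jitterMs(500, 2000)));
```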
Residential proxy is mandatory
The actor throws at input validation if `useApifyProxy: true` is combined with any proxy group other than (or missing) `RESIDENTIAL`. If you prefer to bring your own proxy, set `useApifyProxy: false` and supply residential `proxyUrls`. Datacenter IPs will not work — Pinterest blocks them at the TLS layer.
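For example, a bring-your-own-proxy input would look like this (the proxy URL is a placeholder; the endpoints you supply must still be residential):

```json
{
  "startUrls": ["https://www.pinterest.com/nasa/"],
  "proxyConfig": {
    "useApifyProxy": false,
    "proxyUrls": ["http://user:pass@your-residential-proxy:8000"]
  }
}
```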
Output
- Dataset (`Actor.pushData`) — one record per profile (`recordType: "profile"`) plus one per pin (`recordType: "pin"`). Iterate with `Dataset.forEach` or export to any supported format.
- Key-value store (`Actor.setValue`) — each profile is additionally stored under `profile-<sanitizedUsername>-<hash>` for direct lookup via the Apify API. Keys are collision-safe across case variants and unsafe characters.
- Debug snapshots — set env `DEBUG_SNAPSHOT=1` to capture the first profile payload, first pin payload, and per-tier hydration samples into the KV store under `debug-*` keys for offline inspection.
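Because profile and pin records share one dataset, downstream code typically splits them on `recordType`. A minimal sketch (the `items` array would come from your dataset export or the Apify client):

```javascript
// Split a mixed dataset export into profile and pin records by recordType.
function splitRecords(items) {
  const profiles = [];
  const pins = [];
  for (const record of items) {
    if (record.recordType === 'profile') profiles.push(record);
    else if (record.recordType === 'pin') pins.push(record);
  }
  return { profiles, pins };
}
```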
Curated profile shape
```json
{
  "recordType": "profile",
  "username": "nasa",
  "fullName": "NASA",
  "about": "...",
  "websiteUrl": "https://www.nasa.gov",
  "profileUrl": "https://www.pinterest.com/nasa/",
  "imageLargeUrl": "...",
  "pinCount": 123,
  "boardCount": 45,
  "followerCount": 678901,
  "followingCount": 12,
  "country": "US",
  "locale": "en-US",
  "verified": true
}
```
Curated pin shape
```json
{
  "recordType": "pin",
  "id": "123456789012345678",
  "profile": "nasa",
  "sourceUrl": "https://www.pinterest.com/nasa/",
  "pinUrl": "https://www.pinterest.com/pin/123456789012345678/",
  "title": "...",
  "description": "...",
  "link": "https://example.com/article",
  "domain": "example.com",
  "createdAt": "Tue, 01 Jan 2024 ...",
  "image": { "url": "...", "width": 736, "height": 1104 },
  "board": { "id": "...", "name": "...", "url": "..." },
  "commentCount": 0,
  "richMetadata": { /* opengraph-like */ },
  "aggregatedStats": { "saves": 123, "done": 0 }
}
```
Architecture — the 4-tier cascade
Every profile enters at the highest-confidence tier and downgrades only on terminal failure. `failedRequestHandler` in `main.js` computes the next tier via `nextTierRequest(label, userData, { fallbackToBrowser })`.
| # | Label | Endpoint | When selected | Downgrade target |
|---|---|---|---|---|
| 1 | `profile-widget` | `https://widgets.pinterest.com/v3/pidgets/users/<u>/pins/` | `widgetFirst && maxPinsCnt <= 25` | `profile-html` |
| 2 | `profile-html` | `https://www.pinterest.com/<u>/` (parse `__PWS_DATA__`) | Default seed for `maxPinsCnt > 25`; also reached via widget downgrade. Hydrates `session.userData.hydration` with `appVersion` + `csrfToken` + cookies for tier 3. | `profile-resource-start` |
| 3 | `profile-resource-start` → `profile-resource-pins` | `/resource/UserResource/get/` then `/resource/UserPinsResource/get/` (bookmark-paginated) | Tier 2 downgrade, or direct entry when HTML fails. | `profile-browser` (only if `fallbackToBrowser: true`) |
| 4 | `profile-browser` | `PlaywrightCrawler` navigating the public profile page; intercepts the XHRs above via `page.on('response')` and reuses the same curated builders. | All HTTP tiers have failed for a profile AND `fallbackToBrowser: true`. | Terminal. |
The browser tier runs as a second crawler after the HTTP crawler finishes. Failed HTTP requests that qualify for escalation are batched into `browserQueue`; `runBrowserTier` is invoked once with the full batch, so the Chromium instance is paid for only when strictly necessary.
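The downgrade order in the table can be expressed as a simple label map. This is an illustration of the cascade, not the actual `nextTierRequest` implementation:

```javascript
// Downgrade target per tier label, mirroring the cascade table above.
// profile-browser is only reachable when fallbackToBrowser is enabled.
const DOWNGRADE = {
  'profile-widget': 'profile-html',
  'profile-html': 'profile-resource-start',
  'profile-resource-start': 'profile-browser',
  'profile-resource-pins': 'profile-browser',
};

// Returns the next tier label, or null when the failure is terminal.
function nextTierLabel(label, { fallbackToBrowser }) {
  const next = DOWNGRADE[label] ?? null;
  if (next === 'profile-browser' && !fallbackToBrowser) return null;
  return next;
}
```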
Reliability layers
- Tier-aware headers — `src/headers.js` emits Pinterest-expected headers: `Accept`, `Accept-Language`, `User-Agent` (one of four modern desktop UAs, sticky per session), `Referer`, `X-Requested-With: XMLHttpRequest`, `X-APP-VERSION`, `X-Pinterest-AppState: active`, `X-Pinterest-PWS-Handler: www/[username].js`, `X-CSRFToken` (from hydration), and `Origin`. The widget tier skips Pinterest-specific headers since the endpoint is public.
- Session lifecycle — `src/session.js` classifies every response as `ok | soft-block | rate-limit | transient | client-error | not-found` using status code, content-type, and body-regex heuristics ("captcha", "unusual traffic", "rate limit", etc.): `retire()` on rate-limit / soft-block, `markBad()` on transient 5xx, no action on `ok`/`not-found`.
- Cookie persistence — `applyCookies(session, setCookieHeader)` extracts `csrftoken` into `session.userData.hydration.csrfToken` and keeps a concatenated `Cookie` header; `persistCookiesPerSession: true` ties cookies to session lifetime.
- Retries budgeted per tier — `TIER_MAX_RETRIES` in `src/headers.js` caps retries per label so a stuck tier cannot exhaust the global budget.
- Bookmark loop guard — `lastBookmark` is compared to the newly returned cursor; a repeat terminates pagination instead of looping.
- Schema drift detection — `createSchemaState` carries `profileWarned`/`pinWarned` flags. The first empty curated record emits a one-shot warning with a sample of top-level keys. A nightly GitHub Action (`.github/workflows/canary.yml`) hits the widget endpoint for `nasa` + `natgeo` and asserts required keys so drift surfaces before users notice.
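A classifier of the kind described above could look roughly like this. It is a simplified sketch of the heuristics, not `src/session.js` itself; the exact regexes and precedence are assumptions:

```javascript
// Rough sketch: map status code + body text to a failure class.
function classifyResponse(statusCode, body = '') {
  if (statusCode === 429 || /rate limit/i.test(body)) return 'rate-limit';
  if (statusCode === 404) return 'not-found';
  if (statusCode >= 500) return 'transient';
  // Soft blocks often arrive as 403s or as 200s with a challenge page.
  if (statusCode === 403 || /captcha|unusual traffic/i.test(body)) return 'soft-block';
  if (statusCode >= 400) return 'client-error';
  return 'ok';
}
```

The caller would then call `session.retire()` for `rate-limit`/`soft-block` and `session.markBad()` for `transient`, as described above.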
Migration from 1.x
Breaking change: `legacyKvKeys` is removed.
1.x wrote each profile to the key-value store twice — under the raw username (`nasa`) and under the safe key (`profile-nasa-<hash>`). 2.0 writes only the safe key.
If a downstream consumer reads profiles by raw username:
- Update it to read the new key format: `profile-<sanitizedUsername>-<hash>`. Obtain the sanitized key via `safeKvKey(username)` exported from `src/input.js`, or resolve it from the profile record (which carries `username`) and compute the SHA-1 prefix yourself.
- If a migration window is required, consume from the dataset (`recordType: "profile"` records) instead of the KV store — the dataset is unchanged.
There is no compat shim in 2.0. If you cannot migrate, stay on 1.x.
Running locally
```bash
npm ci

# optional (one-time): install Chromium for the browser tier
npx playwright install --with-deps chromium

echo '{"startUrls":["https://www.pinterest.com/nasa/"],"maxPinsCnt":25,"proxyConfig":{"useApifyProxy":false,"proxyUrls":["http://your-residential-proxy:port"]}}' > INPUT.json

APIFY_LOCAL_STORAGE_DIR=./apify_storage node main.js
```
For Apify Console runs, the Dockerfile uses the `apify/actor-node-playwright-chrome:20` base image, so Chromium ships with the actor — no separate install step.
Testing
- `npm test` runs the full `node:test` suite under `test/`. No external dependencies, no browser launched.
- `npm run canary` runs the live widget-endpoint schema canary locally (requires outbound network). The canary retries once on 403/503/network errors and skips with a warning when the runner IP is block-flagged by Pinterest — real shape drift still hard-fails.
- `npm run smoke` is the manual release gate before `apify push`. It spawns the full actor against a user-supplied residential proxy and asserts the dataset contains ≥1 profile + ≥1 pin record.
  - Required env: `SMOKE_PROXY_URL=http://user:pass@residential-host:port`
  - Optional env: `SMOKE_USER` (default `nasa`), `SMOKE_MAX_PINS` (default `10`)
  - Storage is retained in a tmpdir for inspection on failure. Proxy credentials are redacted from all log output.
- CI (`.github/workflows/ci.yml`) runs tests on every push to `main`, `master`, or any `claude/**` branch with `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1` so CI stays fast.
- Canary workflow (`.github/workflows/canary.yml`) runs the live schema check daily at 08:00 UTC and on-demand via `workflow_dispatch`.
Dependency discipline
- The Dockerfile pins the Apify Playwright base image by tag suffix (`apify/actor-node-playwright-chrome:20-1.59.1`) so a future base-image refresh cannot silently swap the Chromium binary underneath us. Bump this tag deliberately — keep the devDep `playwright` range compatible.
- `playwright` lives in `devDependencies`. Production images use the base image's pre-installed copy (guaranteeing a matched Chromium ↔ Playwright pair). Local dev + CI get it via `npm install`.
- Direct `lodash` and `omit-deep-lodash` dependencies were removed in 2.0. The ~20-line vanilla replacements live in `src/curate.js` (`omit`, `omitDeep`, `lastValue`).
- `package.json` carries an npm override forcing `file-type >= 21.3.4` to patch the Crawlee transitive decompression-bomb advisory. `npm audit` (full tree) reports 0 vulnerabilities as of 2.0. Re-run before every release.
Known limitations
- Cookie jar is process-local. Session restarts lose hydration; HTML tier re-bootstraps from scratch. This is a deliberate trade-off — the code is stateless across runs.
- Widget endpoint region-blocks. Some regions return 404 for the widget endpoint. The actor treats 404 as terminal for that tier and falls through to HTML automatically.
- Residential proxy is ~8× the cost of datacenter. Non-negotiable for Pinterest — see Apify proxy docs for pricing.
- Browser tier is slow and memory-hungry. Reserve `fallbackToBrowser: true` for profiles where HTTP tiers consistently fail; do not enable it as a prophylactic.
Project layout
```
.
├── main.js                      # actor entry: CheerioCrawler + optional PlaywrightCrawler stage
├── src/
│   ├── input.js                 # validateInput, parseStartUrl, safeKvKey, jitter helpers
│   ├── headers.js               # buildHeaders(request, session), TIER_MAX_RETRIES, UA pool
│   ├── session.js               # classifyResponse, handleBadResponse, applyCookies
│   ├── curate.js                # buildCuratedProfile / buildCuratedPin / buildRawPin / KV payload
│   ├── routes.js                # createRouter, LABELS, DOWNGRADE, nextTierRequest
│   └── tiers/
│       ├── widget.js            # public widget endpoint handler
│       ├── html.js              # __PWS_DATA__ parser + session hydration
│       ├── resource.js          # internal resource API (start + paginated pins)
│       └── browser.js           # PlaywrightCrawler + XHR capture (tier 4)
├── test/
│   ├── fixtures/                # widget/html/resource-userpins JSON + HTML fixtures
│   ├── helpers/actor-stub.js    # Module._load shim for apify stub
│   ├── input.test.js            # parseStartUrl, validateInput, safeKvKey
│   ├── headers.test.js          # per-tier header assertions
│   ├── session.test.js          # classification truth table + cookie parsing
│   ├── router.test.js           # LABELS, DOWNGRADE, nextTierRequest
│   ├── tiers-widget.test.js
│   ├── tiers-html.test.js
│   ├── tiers-resource.test.js
│   ├── tiers-browser.test.js    # pure-logic tests (no browser launch)
│   └── canary.js                # nightly widget-shape check
├── .actor/actor.json
├── INPUT_SCHEMA.json
├── Dockerfile                   # apify/actor-node-playwright-chrome:20
├── .github/workflows/
│   ├── ci.yml
│   └── canary.yml
└── package.json                 # "version": "2.0.0"
```
License
ISC.