Full pivot to HTTP-only scraping. v0.1 was 100% Cloudflare-blocked in production (every cloud run returned cloudflare_block for every slug); fixed by dropping Playwright entirely and switching to Crawlee's CheerioCrawler (got-scraping with Chrome TLS impersonation).
- Scraping path: Playwright → got-scraping (no browser). PH's WAF serves a Cloudflare managed challenge to Chromium fingerprints regardless of fingerprint-injector tuning; got-scraping clears the WAF cleanly via Chrome TLS impersonation (minimal crawler sketch after this list). Verified 2026-04-29 against /products/lovable, /@chrismessina, and /@antonosika.
- URL pattern: /posts/<slug> → /products/<slug>. PH retired /posts/ for product detail pages in 2025; legacy /posts/<slug> URLs are still accepted on input.
- Data source: DOM scraping → embedded Apollo SSR JSON. PH ships hunter + maker data inline as window[Symbol.for("ApolloSSRDataTransport")].push({...}). The parser walks the embedded JSON with a brace-counting helper (see the extractor sketch after this list); no DOM traversal, no JS execution.
- Hunter key: "hunter" → "user". PH's internal Post field for the submitter is user, not hunter. The parser uses adjacency to "makers":[...] to disambiguate from unrelated "user" occurrences (viewer, comments, etc.).
- Docker base: apify/actor-node-playwright-chrome:24 → apify/actor-node:24 (Alpine, no browser). Image footprint and cold start drop ~10x.
- Memory: 4096 MB → 1024 MB. No browser process; HTTP fetches are <100 ms each.
- Default maxConcurrency: 2 → 5; cap raised from 8 to 30. Concurrency was previously bottlenecked by Chromium memory.
- Tests: parser tests rewritten against body fixtures (product-page.test.ts, profile-page.test.ts, json-walk.test.ts).
- websiteUrl, linkedinUrl: never populated. PH removed those fields from public profile pages in 2025; they remain in the output schema for back-compat but are always null. Callers must enrich downstream.
- Removed: playwright dependency.
- Removed: sanitizeSessionId helper; no longer needed, since Crawlee's CheerioCrawler doesn't pass session ids to a subprocess.
- Removed: stripTrackingParams call site. The helper file remains for any future use.
- Added: cheerio dependency (peer of CheerioCrawler).
- Added: src/parsers/json-walk.ts, a generic brace-counting JSON value extractor.
- Added: test/json-walk.test.ts and test/product-page.test.ts.
- README documents the permanent null for websiteUrl / linkedinUrl and the field-availability notes.
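
A minimal sketch of the new CheerioCrawler path, assuming the defaults listed above; the handler body and the example URL are illustrative, not the actor's actual router:

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

// Sketch only: plain-HTTP crawl of a Product Hunt product page.
// CheerioCrawler uses got-scraping under the hood, which sends browser-like
// headers and a Chrome-like TLS fingerprint, so no browser process is launched.
const crawler = new CheerioCrawler({
    maxConcurrency: 5,      // new default (cap raised to 30)
    useSessionPool: true,   // rotate sessions when a block or rate limit is hit
    async requestHandler({ request, body }) {
        // The real parser works on the raw HTML body (embedded Apollo JSON),
        // not on the cheerio DOM; see the extractor sketch below.
        const html = body.toString();
        await Dataset.pushData({ url: request.url, htmlBytes: html.length });
    },
});

await crawler.run(['https://www.producthunt.com/products/lovable']);
```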
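
An approximate shape of the brace-counting walk and the makers-adjacency heuristic; this is a sketch, not the actual src/parsers/json-walk.ts, and the assumption that the submitter's "user" object sits just before "makers" is illustrative:

```ts
// Given the page HTML and the index of an opening '{' or '[', return the
// balanced JSON slice starting there. String/escape handling prevents braces
// inside string values from miscounting.
function extractBalanced(html: string, start: number): string {
    const open = html[start];
    const close = open === '{' ? '}' : ']';
    let depth = 0;
    let inString = false;
    for (let i = start; i < html.length; i++) {
        const ch = html[i];
        if (inString) {
            if (ch === '\\') i++;              // skip the escaped character
            else if (ch === '"') inString = false;
        } else if (ch === '"') {
            inString = true;
        } else if (ch === open) {
            depth++;
        } else if (ch === close) {
            depth--;
            if (depth === 0) return html.slice(start, i + 1);
        }
    }
    throw new Error('Unbalanced JSON in embedded Apollo payload');
}

// Disambiguate the submitter: locate "makers":[...] first, then take the
// "user" object adjacent to it rather than any other "user" occurrence.
function extractHunterAndMakers(html: string) {
    const makersKey = html.indexOf('"makers":[');
    if (makersKey === -1) return null;
    const makers = JSON.parse(extractBalanced(html, makersKey + '"makers":'.length));
    const userKey = html.lastIndexOf('"user":{', makersKey);   // adjacency assumption
    const hunter = userKey === -1
        ? null
        : JSON.parse(extractBalanced(html, userKey + '"user":'.length));
    return { hunter, makers };
}
```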
Initial private release of the Product Hunt maker/hunter extractor.
- PlaywrightCrawler-based scraper for https://www.producthunt.com/posts/<slug> and /@<username> profile pages.
- Hunter + makers extraction: name, phUsername, phProfileUrl, xHandle, websiteUrl, linkedinUrl, headline.
- Strict slug + username validation (/^[a-zA-Z0-9_-]{1,80}$/ and /^[a-zA-Z0-9._-]{1,80}$/); URLs are reconstructed from validated values before navigation (see the validation sketch below).
- Hostname allowlist (producthunt.com, www.producthunt.com); cross-origin navigations are aborted via context.route() + page.on('framenavigated') (see the allowlist sketch below).
- Session pool with rotation on Cloudflare / rate-limit detection (useSessionPool: true, maxPoolSize: 5, maxUsageCount: 50). Memory-optimized via useIncognitoPages: true.
- Single same-run retry per slug per label on cloudflare_block / rate_limited (slug-keyed retry flag, NOT request.userData).
- Per-slug accumulator (in-memory); completion is triggered by counted child requests. Run-end flushPending() guarantees pending slugs surface; the abort handler flushes too (see the accumulator sketch below).
- Per-slug status enum + per-record errors[]. Run-summary line at end of run with a consecutiveBlocksAtEnd operational signal.
- expectedMakerCounts input cross-check; a mismatch logs a maker_count_mismatch warning.
- README documents the caller's GDPR/legal responsibility and operational guidance on treating the Apify token as a PII credential.
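
Validation sketch, using the regexes listed above; the function names and URL shapes are illustrative (v0.1 still navigated to /posts/):

```ts
// Sketch: reject anything that doesn't match the strict patterns, then rebuild
// the URL from the validated value so raw input never reaches navigation.
const SLUG_RE = /^[a-zA-Z0-9_-]{1,80}$/;
const USERNAME_RE = /^[a-zA-Z0-9._-]{1,80}$/;

function productUrl(slug: string): string {
    if (!SLUG_RE.test(slug)) throw new Error(`Invalid slug: ${slug}`);
    return `https://www.producthunt.com/posts/${slug}`;
}

function profileUrl(username: string): string {
    if (!USERNAME_RE.test(username)) throw new Error(`Invalid username: ${username}`);
    return `https://www.producthunt.com/@${username}`;
}
```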
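
Allowlist sketch: a rough, standalone Playwright illustration of the cross-origin guard; the actor wired equivalent logic into PlaywrightCrawler hooks, and the handler details here are assumptions:

```ts
import { chromium } from 'playwright';

// Only navigations to these hosts are allowed; everything else is aborted.
const ALLOWED_HOSTS = new Set(['producthunt.com', 'www.producthunt.com']);

const browser = await chromium.launch();
const context = await browser.newContext();

// Network-level guard: abort cross-origin navigation requests.
await context.route('**/*', (route) => {
    const req = route.request();
    if (!req.isNavigationRequest()) return route.continue();
    const host = new URL(req.url()).hostname;
    return ALLOWED_HOSTS.has(host) ? route.continue() : route.abort();
});

// Frame-level guard: if the main frame still ends up off the allowlist, bail out.
const page = await context.newPage();
page.on('framenavigated', (frame) => {
    const host = new URL(frame.url()).hostname;
    if (frame === page.mainFrame() && host && !ALLOWED_HOSTS.has(host)) {
        void page.close();
    }
});
```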
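
Accumulator sketch: an approximate shape of the per-slug accumulator and flushPending(); the record fields and function names are illustrative only:

```ts
import { Dataset } from 'crawlee';

// Each slug tracks how many child requests (e.g. maker profile pages) are still
// outstanding; the record is pushed once the count reaches zero, and
// flushPending() pushes whatever is left at run end or on abort.
interface SlugRecord {
    slug: string;
    status: string;            // per-slug status enum in the real actor
    errors: string[];
    makers: unknown[];
    pendingChildren: number;
}

const pending = new Map<string, SlugRecord>();

async function childCompleted(slug: string, maker?: unknown, error?: string) {
    const rec = pending.get(slug);
    if (!rec) return;
    if (maker) rec.makers.push(maker);
    if (error) rec.errors.push(error);
    rec.pendingChildren -= 1;
    if (rec.pendingChildren <= 0) {
        pending.delete(slug);
        await Dataset.pushData(rec);
    }
}

async function flushPending() {
    // Guarantee that partially complete slugs still surface in the dataset.
    for (const rec of pending.values()) {
        rec.errors.push('incomplete_at_run_end');
        await Dataset.pushData(rec);
    }
    pending.clear();
}
```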