All notable changes to this actor will be documented in this file.
[0.3.0] — 2026-05-16
Security
SSRF defense on proxyConfiguration.proxyUrls — entries pointing at private / loopback / link-local hosts (127.0.0.0/8, 10.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12, 169.254.0.0/16, localhost, ::1) are now rejected at input parsing with InputValidationError. Unsupported schemes (anything other than http / https / socks variants) are also rejected. Prior versions forwarded proxyUrls verbatim, leaving an SSRF surface.
websites[] scheme guard — extractor now rejects javascript: / data: / file: / vbscript: URLs that a malicious page could place as its first external anchor; these no longer land in the dataset.
/pages/<display>/<numeric-id> display-name validation — display-name segment is now checked against the slug regex before canonical-URL reconstruction; path-traversal characters are rejected.
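For illustration, the proxyUrls check described above could look roughly like the sketch below. Only InputValidationError is named in this changelog; the function name, the helper, and the exact list of accepted socks schemes are assumptions. The sketch inspects the literal host only and does not resolve DNS, so a public hostname that resolves to a private address would still pass it.

```typescript
// Sketch only: assertSafeProxyUrl, isPrivateIPv4 and ALLOWED_SCHEMES are hypothetical names.
class InputValidationError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'InputValidationError';
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Assumption: "socks variants" is taken to mean these four schemes.
const ALLOWED_SCHEMES = new Set(['http:', 'https:', 'socks4:', 'socks4a:', 'socks5:', 'socks5h:']);

function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 127 ||                          // 127.0.0.0/8 loopback
    a === 10 ||                           // 10.0.0.0/8
    (a === 192 && b === 168) ||           // 192.168.0.0/16
    (a === 172 && b >= 16 && b <= 31) ||  // 172.16.0.0/12
    (a === 169 && b === 254)              // 169.254.0.0/16 link-local
  );
}

function assertSafeProxyUrl(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new InputValidationError('proxyUrls entry is not a valid URL');
  }
  if (!ALLOWED_SCHEMES.has(url.protocol)) {
    throw new InputValidationError(`Unsupported proxy scheme: ${url.protocol}`);
  }
  const host = url.hostname.toLowerCase();
  if (host === 'localhost' || host === '[::1]' || isPrivateIPv4(host)) {
    // The message carries the host only, never the full URL, which may hold credentials.
    throw new InputValidationError(`Proxy host not allowed: ${host}`);
  }
  return url;
}
```

Rejecting at input parsing, as the entry describes, means a bad proxy entry fails the run before any request is sent through it.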
Billing safety
pushData before chargePageScraped — order swapped in the request handler. A push failure now prevents the charge instead of leaving a billed-but-missing record on Crawlee retry.
Block-title denylist — isBillableRecord() predicate now rejects known Facebook interstitials (Page not found, Sorry, this content isn't available, Log in to Facebook, Create new account, etc.) before the charge fires, alongside the existing literal-Facebook check.
Minimum-signal gate — a record with a title but zero positive signals (no facebookId, no categories, no likes, no followers) is treated as a non-content response and not billed. Closes the "empty Comet store" overbilling case.
Cost-ceiling enforcement — when the per-event charge limit fires, crawler.autoscaledPool.abort() is now actually called. Previous versions only set a flag that was read after crawler.run() returned, allowing many in-flight requests to bill past the cap.
HTML response size cap — responses larger than 5 MB are skipped without charging (defends against OOM on adversarial pages).
Consecutive-failure escalation in Actor.charge: three consecutive billing failures are now re-thrown instead of being silently swallowed, so an extended Apify billing outage surfaces to the operator instead of producing silent under-billing.
Aborting drain extended to 5 s — Actor.on('aborting', …) now awaits in-flight charge calls (bounded by 5 s) before Actor.exit(), so mid-flight billing events aren't lost on user abort.
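The push-before-charge ordering, the block-title denylist, and the minimum-signal gate from the entries above can be sketched together as follows. This is an assumption-laden illustration, not the actor's code: the record shape, the exact-match title comparison, and the injected pushData / charge callbacks are all hypothetical, and the denylist holds only the titles quoted in this changelog.

```typescript
// Hypothetical record shape; only the field names come from the changelog.
interface PageRecord {
  title?: string;
  facebookId?: string;
  categories?: string[];
  likes?: number;
  followers?: number;
}

// Titles quoted in the changelog, plus the pre-existing literal "Facebook" check.
const BLOCK_TITLES = [
  'facebook',
  'page not found',
  "sorry, this content isn't available",
  'log in to facebook',
  'create new account',
];

function isBillableRecord(record: PageRecord): boolean {
  const title = (record.title ?? '').trim().toLowerCase();
  // Assumption: exact match on the normalized title.
  if (BLOCK_TITLES.indexOf(title) !== -1) return false;
  // Minimum-signal gate: at least one positive signal must be present.
  return (
    Boolean(record.facebookId) ||
    (record.categories?.length ?? 0) > 0 ||
    record.likes != null ||
    record.followers != null
  );
}

async function pushThenCharge(
  record: PageRecord,
  pushData: (r: PageRecord) => Promise<void>,
  charge: () => Promise<void>,
): Promise<boolean> {
  // Assumption: non-billable records are neither pushed nor charged here.
  if (!isBillableRecord(record)) return false;
  await pushData(record); // if this throws, charge() below never runs
  await charge();
  return true;
}
```

Because the charge is awaited only after the push resolves, a retry after a failed push starts from an unbilled state.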
Extractor robustness
extractCometRenderers rewritten as a single forward pass with bracket-stack — replaces the previous regex with chained [^]*? spans. The new parser is order-of-keys agnostic ({"title": …, "__typename": "X"} and {"__typename": "X", "title": …} both extract correctly) and ReDoS-immune (O(N) with bounded per-renderer window). Closes the silent-zero-data risk if Facebook ever reorders inline JSON keys.
extractAdStatus no longer defaults to "not currently running ads" when the is_business_page_active boolean is absent from the page HTML. The field is now omitted from pageAdLibrary instead of emitting a confidently wrong false, and ad_status is omitted accordingly.
Likes / followers / followings regex tightened — character class no longer includes \s, so adjacent numeric tokens like 4.7 stars · 1948 likes can't bridge into an inflated count via the digit-only fallback.
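To illustrate the bracket-stack idea behind the extractCometRenderers rewrite, here is a simplified sketch. It is not the actor's parser: the function name, the window size, and the use of JSON.parse on each closed object are assumptions. Unlike the parser described above, this version re-parses nested objects, so it is not strictly O(N); the window cap only bounds the cost of each individual parse.

```typescript
// Assumed bound on how large a single candidate object may be.
const MAX_WINDOW = 50000;

// Hypothetical helper: collects every JSON object whose __typename equals `typename`,
// regardless of where the __typename key sits inside the object.
function extractByTypename(text: string, typename: string): Record<string, unknown>[] {
  const found: Record<string, unknown>[] = [];
  const stack: number[] = []; // indices of currently open '{'
  let inString = false;
  let escaped = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      // Inside a JSON string: braces are data, only an unescaped quote ends it.
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') {
      // Only track strings while inside an object, so HTML attribute quotes are ignored.
      if (stack.length > 0) inString = true;
    } else if (ch === '{') {
      stack.push(i);
    } else if (ch === '}' && stack.length > 0) {
      const start = stack.pop() as number;
      if (i - start >= MAX_WINDOW) continue; // skip oversized windows
      try {
        const parsed: unknown = JSON.parse(text.slice(start, i + 1));
        if (
          typeof parsed === 'object' && parsed !== null &&
          (parsed as Record<string, unknown>).__typename === typename
        ) {
          found.push(parsed as Record<string, unknown>);
        }
      } catch {
        // Not valid JSON (for example a CSS or JS block); ignore and keep scanning.
      }
    }
  }
  return found;
}
```

Since matching happens on the parsed object rather than on key order in the text, both key orderings quoted in the entry are handled the same way.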
Schema change (soft-breaking)
OpeningHoursEntry split. The shape { day: 'current', open, close } is replaced by a dedicated LiveHoursStatus shape { status: 'open' | 'closed' | 'always_open' }. Per-day hours stay in record.openingHours: PerDayHoursEntry[] (always per-day, possibly empty). Live IntroCard status moves to a new optional sibling field record.hoursStatus?: LiveHoursStatus.
Migration: consumers reading record.openingHours[0].day === 'current' should switch to record.hoursStatus. Consumers using openingHours[] for per-day grids are unaffected.
OpeningHoursEntry remains exported as a @deprecated type alias of PerDayHoursEntry for one minor version.
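A consumer-side sketch of the migration, assuming open and close are strings (this changelog does not state their types) and using hypothetical helper names. It tolerates datasets that mix 0.2.x and 0.3.0 records by dropping any legacy day: 'current' entry from the per-day grid.

```typescript
// Shapes as described in the entry above; the string types for open/close are an assumption.
interface PerDayHoursEntry {
  day: string;
  open: string;
  close: string;
}

interface LiveHoursStatus {
  status: 'open' | 'closed' | 'always_open';
}

interface HoursFields {
  openingHours?: PerDayHoursEntry[];
  hoursStatus?: LiveHoursStatus;
}

// Hypothetical helper: per-day grid that also works on 0.2.x records,
// where a single { day: 'current', ... } entry could appear in openingHours.
function perDayGrid(record: HoursFields): PerDayHoursEntry[] {
  return (record.openingHours ?? []).filter((entry) => entry.day !== 'current');
}

// Hypothetical helper: live status from the new sibling field, if present.
function liveStatus(record: HoursFields): LiveHoursStatus['status'] | undefined {
  return record.hoursStatus?.status;
}
```

For example, liveStatus(record) replaces a check on record.openingHours[0].day === 'current', while code that renders a weekly grid can keep iterating over perDayGrid(record).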
Refactor
extractCometRenderers, CometRenderers, decodeAddressFromUrl, and the renderer-window logic moved from src/extract/page.ts to a new src/extract/comet.ts module. page.ts shrinks from ~670 to ~500 lines.
Docs
README gains standalone ## Input and ## Output sections per AGENTS.md spec.
README adds a known-limitation note on the followers field rendering intermittently across proxy sessions.
[0.2.0] — 2026-05-16
Fixed (real-Facebook extraction gaps surfaced by live smoke testing)
categories now extracted from FB Comet InfluencerCategoryContextItemRenderer instead of og:type (which always returns "video.other" on real Facebook). The "Page · " prefix is stripped; multi-category pages emit one entry per category.
creation_date: documented limitation — Facebook gates page-creation date behind the "Page Transparency" sub-page, which requires a click. Real-FB extraction now omits this field rather than emitting the misleading inline "creation_time" (which refers to post/photo/video timestamps, not page creation). Synthetic JSON-LD / text-pattern fallbacks remain for pages that do expose it.
ad_status now derived from the is_business_page_active boolean — yields upstream's exact strings ("This Page is currently running ads." / "This Page is not currently running ads.").
address, phone, website, rating now extracted from FB Comet AddressContextItemRenderer / WebsiteContextItemRenderer / RatingContextItemRenderer / TimelineContextItemWrapper blocks. Restaurant pages now return structured address and phone.
services now extracted from BusinessServicesContextItemRenderer (one entry per service).
openingHours now extracts the live "Open now" / "Closed now" / "Open 24 hours" status as a single { day: "current", open, close } entry when Facebook does not render the per-day grid (which it usually doesn't). Per-day entries are still emitted when JSON-LD openingHoursSpecification or a [data-hours] block is present.
messenger: the over-eager "page mentions messenger word" fallback (which made every record's messenger always null) is removed. Now returns either an actual m.me / messenger.com URL or is omitted entirely.
profile.php?id=NNN URLs now produce a canonical URL that preserves ?id=NNN (previously dropped the query, producing a /profile.php/NNN URL that Facebook 404s).
pageAdLibrary now emitted whenever is_business_page_active is present in the page state, even when the ad-library ID is unavailable (with id: null).
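The profile.php fix above amounts to keeping the id query parameter when rebuilding the canonical URL. A minimal sketch, assuming the canonical host is www.facebook.com, that the id is purely numeric, and that other query parameters are dropped; the function name is hypothetical.

```typescript
// Hypothetical helper: returns the canonical profile URL, or null if the input
// is not a profile.php URL with a numeric id.
function canonicalProfileUrl(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null;
  }
  if (url.pathname !== '/profile.php') return null;
  const id = url.searchParams.get('id');
  if (id === null || !/^\d+$/.test(id)) return null;
  // Keep ?id=NNN in the query string; moving it into the path yields a URL Facebook 404s.
  return `https://www.facebook.com/profile.php?id=${id}`;
}
```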
Added
OpeningHoursEntry.day now accepts "current" for live-status entries.
Three real-Facebook HTML fixtures (test/fixtures/real/*.html) captured from production runs, locking the extractor against actual FB DOM structure.
New test suite test/extract-real-fb.test.ts (20 tests).
scripts/capture-fb-fixtures.ts — standalone Playwright + Apify-proxy script to refresh real-FB fixtures when Facebook redesigns.
[0.1.0] — 2026-05-16
Added
Initial release. Drop-in compatible with apify/facebook-pages-scraper input/output schema.
Pay-per-event pricing: actor-start $5.00 / 1,000 + page-scraped $7.50 / 1,000 (~37% cheaper than upstream's $12 / 1,000 flat rate at typical run sizes).
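The savings figure can be checked with a few lines of arithmetic. The sketch below assumes one actor-start event per run and reads each "/ 1,000" price as a per-event price; the function names are illustrative.

```typescript
// Prices from the entry above, converted to USD per single event.
const ACTOR_START_USD = 5.0 / 1000;
const PAGE_SCRAPED_USD = 7.5 / 1000;
const UPSTREAM_PER_PAGE_USD = 12 / 1000;

// Assumption: exactly one actor-start event is charged per run.
function runCostUsd(pages: number): number {
  return ACTOR_START_USD + pages * PAGE_SCRAPED_USD;
}

// Fraction saved relative to the upstream flat rate (negative means more expensive).
function savingsVsUpstream(pages: number): number {
  return 1 - runCostUsd(pages) / (pages * UPSTREAM_PER_PAGE_USD);
}
```

Under these assumptions a 1,000-page run costs about $7.51 against $12.00 upstream, a saving of roughly 37.5%. The fixed start fee makes a single-page run slightly more expensive than upstream, and the saving approaches 37.5% as the run grows.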