Changed — Phase 2 full rewrite: HTTP-only reviews (no more per-hotel browser)
Single browser launch at startup (Puppeteer) to obtain session cookies; the browser then closes immediately
All review pages fetched via direct HTTP requests to reviewlist.fr.html — no browser per hotel
Run time reduced by ~50%; cloud reliability dramatically improved (no more navigation timeouts)
PuppeteerCrawler removed from review phase; replaced by bounded Promise.all batches
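The "bounded Promise.all batches" mentioned above could look roughly like the sketch below. It is illustrative only: runInBatches, its parameters, and the per-item error wrapping are hypothetical names and choices, not the actor's actual code.

```javascript
// Minimal sketch (hypothetical helper, not the actor's real implementation).
// Runs `worker` over `items` in fixed-size batches. Each batch is awaited with
// Promise.all before the next one starts, so at most `batchSize` calls are in
// flight at any time.
async function runInBatches(items, batchSize, worker) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const results = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    const settled = await Promise.all(
      batch.map((item) =>
        // Wrap each call so one failed item does not reject the whole batch.
        Promise.resolve()
          .then(() => worker(item))
          .then(
            (value) => ({ item, ok: true, value }),
            (error) => ({ item, ok: false, error }),
          ),
      ),
    );
    results.push(...settled);
  }
  return results;
}
```

Results come back in input order, and failures are reported as { ok: false, error } entries, so a caller can log or retry individual hotels without losing the rest of the batch.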
src/session-factory.js — single Puppeteer session for cookie acquisition
src/reviewlist-client.js — HTTP pagination of reviewlist.fr.html with session cookies and proxy rotation
src/reviewlist-parser.js — Cheerio-based HTML parser for review blocks
reviewLang input field — filter reviews by language (English, French, German, Spanish, etc.)
stay_nights and traveler_type correctly extracted from HTML (previously always null and raw stay text, respectively)
review_date normalized to ISO format YYYY-MM (was localized text e.g. "Mars 2026")
extractContactInfoFromUrl — HTTP-based contact enrichment (no browser needed)
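As an illustration of the review_date normalization listed above, the sketch below turns a localized French label such as "Mars 2026" into YYYY-MM. The function name, the month table, and the choice to return null for unrecognized input are assumptions; the actual parser may handle more locales and formats.

```javascript
// Minimal sketch (hypothetical helper): French "Month YYYY" label to "YYYY-MM".
// Month names are stored without accents; input is accent-folded before lookup.
const FRENCH_MONTHS = [
  'janvier', 'fevrier', 'mars', 'avril', 'mai', 'juin',
  'juillet', 'aout', 'septembre', 'octobre', 'novembre', 'decembre',
];

function stripAccents(text) {
  // Decompose accented characters, then drop the combining marks.
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

function normalizeReviewDate(localized) {
  if (typeof localized !== 'string') return null;
  const folded = stripAccents(localized.trim().toLowerCase());
  const match = folded.match(/^([a-z]+)\s+(\d{4})$/);
  if (!match) return null;
  const monthIndex = FRENCH_MONTHS.indexOf(match[1]);
  if (monthIndex === -1) return null; // unknown month name: leave unparsed
  return `${match[2]}-${String(monthIndex + 1).padStart(2, '0')}`;
}
```

Returning null for unrecognized input, rather than guessing, keeps bad dates visible downstream.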
stay_nights always null — now extracted from .c-review-block__stay-date
traveler_type contained raw stay text (e.g. "1 nuit · Avril 2026") — now clean ("Couple", "Solo", etc.)
Keyword matching false positives — switched to whole-word boundary matching (regex \b-equivalent)
"lit" no longer matches "accessibilité" or "buffet"
"usé" no longer matches "isolation" via "cause"
partner_reply now stripped of Booking.com prefix ("Réponse de l'hôtel :") and truncation artifact ("Lire la suite")
hotel_registration_number returning "chrome" — browserName key matched rna substring; now uses exact key matching
hotel_company_name returning "Booking.com缤客" — cn_b_company_name matched company substring; now uses exact key matching
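The whole-word keyword matching described in the fixes above can be sketched as follows. JavaScript's \b only treats ASCII letters as word characters, so this sketch uses Unicode-aware lookarounds as the "\b-equivalent". The function names and the accent-folding step are assumptions made for illustration, not a description of the actor's exact implementation.

```javascript
// Minimal sketch (hypothetical helpers): accent-insensitive whole-word matching.
function foldText(text) {
  // Lowercase and strip combining accents so "usé" and "use" compare equal.
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
}

function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function matchesWholeWord(text, keyword) {
  const needle = foldText(keyword).trim();
  if (needle === '') return false;
  // The keyword must not be preceded or followed by a letter or digit.
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}])${escapeRegExp(needle)}(?![\\p{L}\\p{N}])`,
    'u',
  );
  return pattern.test(foldText(text));
}
```

With plain substring matching, "lit" is found inside "accessibilité" and an accent-folded "usé" is found inside "cause". With the boundary check, both are rejected while standalone occurrences still match.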
checkin_date / checkout_date removed from output (never populated by reviewlist.fr.html)
Dataset view "Full Export" removed — Signal View is the default and covers all fields
Pay-per-event pricing: review-collected ($0.002/review) and contact-enriched ($0.01/review when contact enrichment is enabled)
Spending limit respected: run stops gracefully when ACTOR_MAX_TOTAL_CHARGE_USD is reached
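A rough sketch of how a run might track the two event prices against ACTOR_MAX_TOTAL_CHARGE_USD is shown below. It only does local bookkeeping and does not call any platform charging API. createBudget and tryCharge are hypothetical names, and treating a missing limit as "unlimited" is an assumption.

```javascript
// Minimal sketch (hypothetical helper): local budget bookkeeping only.
// Prices are kept in integer micro-dollars to avoid floating-point drift.
const PRICE_MICRO_USD = {
  'review-collected': 2000, // $0.002
  'contact-enriched': 10000, // $0.01
};

function createBudget(maxTotalChargeUsd) {
  // Assumption: a missing or non-numeric limit means no cap.
  const limit = Number.isFinite(maxTotalChargeUsd)
    ? Math.round(maxTotalChargeUsd * 1e6)
    : Infinity;
  let spent = 0;
  return {
    // Returns true and records the charge if it fits, false otherwise.
    tryCharge(eventName) {
      const price = PRICE_MICRO_USD[eventName];
      if (price === undefined) throw new Error(`Unknown event: ${eventName}`);
      if (spent + price > limit) return false;
      spent += price;
      return true;
    },
    spentUsd() {
      return spent / 1e6;
    },
  };
}
```

A caller could build the budget with createBudget(Number.parseFloat(process.env.ACTOR_MAX_TOTAL_CHARGE_USD)) and stop fetching further reviews as soon as tryCharge returns false, which corresponds to the graceful stop described above.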
Output schema refactored: columns reordered by signal relevance (hotel → signal → review content → reviewer → meta)
reviewer_type removed (duplicate of traveler_type)
review_url added — direct link to each review on Booking.com
Contact columns (hotel_email, hotel_phone, etc.) now omitted entirely from output when contact enrichment is disabled
Two dataset views: Signal View (keyword analysis) and Full Export
Proxy field removed from Console UI — datacenter proxy applied automatically by default
Changed — complete architecture rewrite
Phase 1 (hotel discovery) — HTTP-only GraphQL
Replaced Puppeteer-based hotel search with direct HTTP GraphQL requests to searchQueries.search
Hotel list extraction from search URLs is now ~10× faster (no browser needed)
Country code and review score from Phase 1 are passed to Phase 2
Phase 2 (review extraction) — browser session interception
Load each hotel page once with Puppeteer (#tab-reviews fragment triggers native XHR)
Intercept the ReviewList GraphQL request made by the page itself — captures JWT CSRF token, et-state, and full query body
Paginate additional review pages by reusing the intercepted session (same browser context, credentials: include)
This approach bypasses Booking.com's AWS WAF without needing to extract or transport cookies manually
src/session-extractor.js — extracts hotel metadata (name, country code) from page DOM
src/reviews-client.js — paginates reviews via page.evaluate with the intercepted JWT CSRF token
src/graphql-extractor.js — HTTP-only hotel list extraction using FullSearch GraphQL query
src/search-url-parser.js — parses Booking.com search URL query parameters
Keyword filter now searches positive points, negative points, and title (previously negative only)
keyword_matched_in field in output indicates which field(s) matched
themes field: auto-detected topics from review text (room, breakfast, service, noise, etc.)
hotel_avg_score and hotel_total_reviews fields in output
hotel_stars, hotel_city, hotel_address, hotel_distance_center_km from Phase 1 GraphQL
Contact fields: hotel_company_name, hotel_email, hotel_phone, hotel_registration_number, hotel_trade_register, hotel_contact_address, hotel_contact_city, hotel_contact_postal_code
Dynamic concurrency based on available memory (getMaxConcurrencyByMemory)
ETA progress logging via updateStatus
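For the memory-based concurrency item above, one plausible shape is sketched here. Only the name getMaxConcurrencyByMemory comes from this changelog. The signature, the per-task memory estimate, and the min/max bounds are assumptions chosen for illustration.

```javascript
// Minimal sketch: signature and defaults are assumptions, not the actor's code.
// Derives a concurrency level from available memory (in MB), clamped to a range.
function getMaxConcurrencyByMemory(
  availableMb,
  { perTaskMb = 256, min = 1, max = 20 } = {},
) {
  // Fall back to the minimum when memory is unknown or invalid.
  if (!Number.isFinite(availableMb) || availableMb <= 0) return min;
  const byMemory = Math.floor(availableMb / perTaskMb);
  return Math.min(max, Math.max(min, byMemory));
}
```

With these assumed defaults, 1024 MB yields a concurrency of 4, and very small or very large memory values are clamped to 1 and 20 respectively.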
src/dom-fallback.js — DOM-based hotel extraction (replaced by HTTP GraphQL)
src/apollo-extractor.js — legacy Apollo cache parser (no longer needed)
src/review-extractor.js — DOM-based review extraction (replaced by GraphQL interception)
Puppeteer usage removed from Phase 1 entirely
WAF 202 challenge on hotel pages — resolved by loading #tab-reviews with networkidle0
VALIDATION_INVALID_TYPE_VARIABLE GraphQL errors — caused by null hotelScore; fixed by using the intercepted body directly
JWT CSRF token vs b_csrf_token confusion — the page's native XHR uses a JWT, not the short inline token; fixed by intercepting request headers
[1.0.0] - Initial release
Puppeteer-based hotel and review extraction
Basic keyword filtering on negative points
CSV and JSON export