All notable changes to this Actor will be documented here.
[1.4] — 2026-05-23 — Slug-fallback for Thunderbit failures
Added — Recovery from Thunderbit pool throttling
After a dedicated probe confirmed that direct Yelp scraping is not possible
on Apify infrastructure (the DataDome enterprise WAF returns HTTP 403 to
datacenter IPs, US/DE residential IPs, mobile user agents, Googlebot, and
Bingbot, and the Wayback Machine has no Yelp snapshots), the actor now
implements a graceful fallback that keeps it useful when Yelp throttles
Thunderbit's IP pool.
When Thunderbit fails, the actor parses the Yelp URL slug into a business
name, a city hint, and a branch index:
the-french-laundry-yountville-2 → "The French Laundry" + cityHint
"Yountville" + branchIndex 2
It then runs the entire website-discovery → enrichment chain (DDG search,
tech stack, emails, phones, JSON-LD, SEO, action links, lead score,
outreach pitch, outreach links).
Recovers ~60% of previously-failed runs into useful partial records:
dataSource: "slug_fallback" (vs "thunderbit")
success: true only when website data was actually recovered
partial: true when name was derived but no enrichment succeeded
30+ multi-word US cities in the slug → name lookup table (San Francisco,
New York, Los Angeles, Las Vegas, Fort Worth, El Paso, etc.)
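The slug parsing described above can be sketched as follows. This is a minimal illustration, not the actor's actual code: the function name `parse_yelp_slug`, the rule "the city is the last token unless the last two tokens form a known multi-word city", and the six-entry table are all assumptions made for the example.

```python
# Hypothetical subset of the multi-word city table (the changelog mentions 30+ entries).
MULTI_WORD_CITIES = {
    ("san", "francisco"), ("new", "york"), ("los", "angeles"),
    ("las", "vegas"), ("fort", "worth"), ("el", "paso"),
}


def _title(parts):
    """Join slug tokens into a display string, capitalizing each token."""
    return " ".join(p.capitalize() for p in parts)


def parse_yelp_slug(slug: str) -> dict:
    """Split a Yelp slug into a display name, a city hint, and a branch index."""
    tokens = [t for t in slug.lower().split("-") if t]

    # A trailing number is treated as the branch index.
    branch_index = None
    if len(tokens) > 1 and tokens[-1].isdigit():
        branch_index = int(tokens.pop())

    # Assumption: the city is the last token, or the last two tokens
    # when they form a known multi-word city.
    city_len = 2 if tuple(tokens[-2:]) in MULTI_WORD_CITIES else 1
    if len(tokens) <= city_len:
        # Too short to separate a name from a city; keep everything as the name.
        name_tokens, city_tokens = tokens, []
    else:
        name_tokens, city_tokens = tokens[:-city_len], tokens[-city_len:]

    return {
        "name": _title(name_tokens),
        "cityHint": _title(city_tokens) or None,
        "branchIndex": branch_index,
    }
```

Under these assumptions, `parse_yelp_slug("the-french-laundry-yountville-2")` yields the name "The French Laundry", the city hint "Yountville", and branch index 2. A single-token city that is really part of the business name would be misclassified, which is one reason the result is only a hint.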
Improved
The Wayback listing-age lookup is now skipped when Thunderbit fails (saves
2-20s of latency on dead URLs that wouldn't have a Yelp snapshot anyway).
Partial records (name only, no website) are now logged at ⚠ warning level
instead of being silently filed as failures.
Why direct Yelp scraping isn't viable
Probe-actor _probe_yelp_direct ran 8 fetch strategies on Apify infra:
Datacenter / Residential US / Residential DE / Googlebot / Bingbot / Mobile UA
All 8 returned HTTP 403 with DataDome JS challenge body (var dd={...})
Wayback Machine has no Yelp snapshots (Yelp opted out of the archive)
A headless browser would solve the challenge but adds ~$0.05/run in compute
(about $50 per 1K runs, far more than the $3/1K price)
Thunderbit remains the only viable path because they likely maintain a
whitelisted IP pool or run a headless solver on their side. The new
slug-fallback gives users value even when their pool is throttled.
Compatibility
All v1.3 fields preserved.
New dataSource field marks the origin (thunderbit or slug_fallback).
New partial boolean for the rare case where slug is parseable but no
website was discovered.
slugFallbackOnFail can be set to false to restore strict v1.3 behavior.
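A downstream consumer might use the new fields to route records. The sketch below is one possible reading of the flags described above; the function name `classify_record`, the bucket labels, and the assumption that records without `dataSource` came from Thunderbit are illustrative, not part of the actor.

```python
def classify_record(item: dict) -> str:
    """Bucket a dataset item using the v1.4 dataSource / success / partial fields."""
    # Assumption: records without dataSource predate v1.4 and came from Thunderbit.
    source = item.get("dataSource", "thunderbit")
    if item.get("partial"):
        return "partial"      # name derived from the slug, no enrichment succeeded
    if not item.get("success"):
        return "failed"
    return "recovered" if source == "slug_fallback" else "full"
```

Records in the "partial" bucket carry only a derived name, so they are better suited to manual review than to automated outreach.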
[1.3] — 2026-05-23 — Sales-ops automation kit
Added — 9 new feature layers
This release focuses on end-to-end sales workflow support: better data
recovery (when Yelp throttles), structured CRM-friendly output, and
one-click outreach automation.
Thunderbit retry on transient block (thunderbitRetries, default 2):
Yelp periodically blocks Thunderbit's IP pool (5-30s windows), causing
"Failed to fetch website content" errors. The actor now retries with linear
backoff (8s, 16s) before giving up. Empirically, this lifts the success rate
from ~50% to ~75%+ on diverse listings.
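The retry behaviour can be illustrated with a small helper. This is a sketch under stated assumptions: `fetch_with_retries` and its parameters are hypothetical names, and the real actor may distinguish transient errors from permanent ones rather than retrying on any exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    retries: int = 2,            # mirrors the thunderbitRetries default
    base_delay: float = 8.0,     # seconds; linear backoff gives 8s, then 16s
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying with linearly growing delays on failure."""
    attempt = 0
    while True:
        try:
            return fetch()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted; surface the last error
            attempt += 1
            sleep(base_delay * attempt)
```

With the defaults, the two waits add up to 24s, which covers most of the 5-30s block window mentioned above. Passing `sleep` as a parameter keeps the helper testable without real delays.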
JSON-LD schema.org parsing on the discovered website:
schema_same_as[] — real, brand-published social URLs (often more
accurate than what we extract from <a> tags). Auto-merged into
social_profiles{} so downstream consumers don't need to dedupe.
schema_telephones[], schema_emails[] — phones/emails the brand
exposes via Schema.org markup. Merged into phoneE164 if not already set.
schema_address{street, city, state, zipCode, country} — structured
address from PostalAddress block.
schema_latitude / schema_longitude — auto-promoted to latitude /
longitude (free geocoding, no Nominatim call needed).
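A rough sketch of the JSON-LD extraction follows, using only the standard library. The function name `extract_schema_fields` and the regex-based script lookup are assumptions for illustration; the sketch reads top-level nodes only and does not walk `@graph` containers, which a production parser would likely need to handle.

```python
import json
import re

LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def extract_schema_fields(html: str) -> dict:
    """Collect sameAs, telephone, email, address, and geo from JSON-LD blocks."""
    out = {
        "schema_same_as": [], "schema_telephones": [], "schema_emails": [],
        "schema_address": None, "schema_latitude": None, "schema_longitude": None,
    }
    for raw in LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed blocks are skipped
        for node in (data if isinstance(data, list) else [data]):
            if not isinstance(node, dict):
                continue
            same_as = node.get("sameAs")
            if isinstance(same_as, str):
                out["schema_same_as"].append(same_as)
            elif isinstance(same_as, list):
                out["schema_same_as"].extend(s for s in same_as if isinstance(s, str))
            if isinstance(node.get("telephone"), str):
                out["schema_telephones"].append(node["telephone"])
            if isinstance(node.get("email"), str):
                out["schema_emails"].append(node["email"])
            addr = node.get("address")
            if isinstance(addr, dict) and out["schema_address"] is None:
                out["schema_address"] = {
                    "street": addr.get("streetAddress"),
                    "city": addr.get("addressLocality"),
                    "state": addr.get("addressRegion"),
                    "zipCode": addr.get("postalCode"),
                    "country": addr.get("addressCountry"),
                }
            geo = node.get("geo")
            if isinstance(geo, dict) and out["schema_latitude"] is None:
                try:
                    lat, lon = float(geo["latitude"]), float(geo["longitude"])
                except (KeyError, TypeError, ValueError):
                    continue  # incomplete or non-numeric coordinates
                out["schema_latitude"], out["schema_longitude"] = lat, lon
    return out
```

Merging into `social_profiles{}`, `phoneE164`, and `latitude` / `longitude` would happen after this step and is not shown.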
When the website has no extractable email, generate likely contact
addresses for the discovered domain: info@, hello@, contact@,
support@, sales@, team@, office@, admin@, inquiries@,
reservations@, bookings@. Returned as emails_guessed[] with
emails_guessed_warning flag — speculative, verify before sending.
Lifts cold-outreach hit rate ~3-5x on small businesses without
obvious public emails.
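The guessing step amounts to pairing the listed prefixes with the discovered domain. A minimal sketch, assuming a hypothetical helper `guess_emails` that takes the website URL and strips a leading `www.`:

```python
from urllib.parse import urlparse

# Prefixes taken from the list above.
GUESS_PREFIXES = [
    "info", "hello", "contact", "support", "sales", "team",
    "office", "admin", "inquiries", "reservations", "bookings",
]


def guess_emails(website_url: str) -> dict:
    """Build speculative contact addresses for the website's domain."""
    host = (urlparse(website_url).hostname or "").lower()
    domain = host[4:] if host.startswith("www.") else host
    if not domain:
        return {"emails_guessed": [], "emails_guessed_warning": False}
    return {
        "emails_guessed": [f"{prefix}@{domain}" for prefix in GUESS_PREFIXES],
        # The flag signals that none of these addresses has been verified.
        "emails_guessed_warning": True,
    }
```

None of these addresses is checked for deliverability here, which is why the warning flag accompanies them.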
Address parser (extractAddressParts, default on):
parsedAddress: {street, city, state, zipCode, country, formattedAddress}
— handles all observed Yelp formats (with/without commas, newlines).
timezone: derived from US state code → IANA timezone (covering the
PST/EST/CST/MST/AKST/HST zones). Replaces the previous UTC-based
is_open_now with a timezone-aware variant that returns correct results for
businesses in Pacific/Mountain/Central time.
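One way to approach the address parsing is to anchor on the trailing "STATE ZIP" and then split street from city. The sketch below is an assumption-laden illustration: `parse_us_address`, the street-suffix list, the hard-coded country, and the six-state timezone table are all hypothetical, and a city such as "St. Louis" would confuse the suffix heuristic.

```python
import re

TAIL_RE = re.compile(
    r"^(?P<head>.+?),?\s+(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)
# Hypothetical list of street suffixes used when no comma separates street and city.
SUFFIX_RE = re.compile(r"\b(?:St|Ave|Blvd|Rd|Dr|Ln|Way|Hwy|Pkwy|Ct|Pl)\b\.?", re.IGNORECASE)
# Hypothetical subset of a state -> IANA timezone table (multi-zone states ignored).
STATE_TZ = {
    "CA": "America/Los_Angeles", "NY": "America/New_York", "IL": "America/Chicago",
    "CO": "America/Denver", "AK": "America/Anchorage", "HI": "Pacific/Honolulu",
}


def parse_us_address(raw: str):
    """Parse 'street[,] city, ST 12345' (commas or newlines optional)."""
    text = re.sub(r"\s*,?\s*\n\s*", ", ", raw.strip())  # newlines act as commas
    text = re.sub(r"[ \t]+", " ", text)
    m = TAIL_RE.match(text)
    if not m:
        return None
    head = m.group("head").strip()
    if "," in head:
        street, city = (part.strip() for part in head.rsplit(",", 1))
    else:
        hits = list(SUFFIX_RE.finditer(head))
        if not hits:
            return None
        cut = hits[-1].end()  # split after the last street suffix
        street, city = head[:cut].strip(), head[cut:].strip()
    state, zip_code = m.group("state"), m.group("zip")
    return {
        "street": street, "city": city, "state": state, "zipCode": zip_code,
        "country": "US",  # assumption: only US addresses reach this parser
        "formattedAddress": f"{street}, {city}, {state} {zip_code}",
        "timezone": STATE_TZ.get(state),
    }
```

The timezone lookup returns an IANA identifier such as `America/Los_Angeles`, which can be handed to a timezone library when evaluating opening hours.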
Domain age via crt.sh (extractDomainAge, default off — slow):
earliest_ssl_date, website_domain_age_years — earliest certificate
issuance for the discovered website's domain. Strong legitimacy signal
for B2B prospecting (vs Wayback which can be flaky on Apify IP).
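The domain-age calculation can be sketched in two parts: fetching certificate rows and reducing them to the earliest issuance date. The helper names below are hypothetical, and the sketch assumes crt.sh's JSON output (`?q=<domain>&output=json`) returns rows with an ISO-formatted `not_before` timestamp; rows without one are ignored.

```python
import json
import urllib.parse
import urllib.request
from datetime import date, datetime


def fetch_crtsh_records(domain: str, timeout: float = 30.0) -> list:
    """Query crt.sh for certificates issued for the domain (slow; network call)."""
    url = "https://crt.sh/?" + urllib.parse.urlencode({"q": domain, "output": "json"})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def domain_age_from_records(records: list, today: date) -> dict:
    """Reduce certificate rows to the earliest issuance date and an age in years."""
    dates = []
    for rec in records:
        try:
            dates.append(datetime.fromisoformat(rec["not_before"]).date())
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable timestamp
    if not dates:
        return {"earliest_ssl_date": None, "website_domain_age_years": None}
    earliest = min(dates)
    return {
        "earliest_ssl_date": earliest.isoformat(),
        "website_domain_age_years": round((today - earliest).days / 365.25, 1),
    }
```

Keeping the reduction separate from the network call makes it easy to test and to cache responses. Certificate history only reaches back as far as the domain has used TLS, so the result is a lower bound on the real domain age.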
mailto_url + mailto_url_with_pitch (subject + outreach pitch as body)
tel_url, sms_url, whatsapp_url (with auto-pasted pitch up to 240 chars)
linkedin_search_url — pre-filtered LinkedIn people search for
<business> owner OR founder OR CEO
google_search_url — "<business name>" <city>
yelp_competitors_url — Yelp search for the same primary category in
the same neighborhood (territory/competitor research)
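The outreach links listed above are mostly a matter of URL encoding. A sketch, assuming a hypothetical `build_outreach_links` helper, a placeholder default subject line, and the public `wa.me`, LinkedIn people-search, and Google search URL patterns (the actor's exact URL shapes may differ):

```python
from urllib.parse import quote, urlencode


def build_outreach_links(business: str, city: str, phone_e164: str,
                         email: str, pitch: str,
                         subject: str = "Quick question") -> dict:
    """Build one-click contact and research URLs for a single business."""
    short_pitch = pitch[:240]               # WhatsApp text is capped at 240 chars
    wa_number = phone_e164.lstrip("+")      # wa.me expects digits without '+'
    people_query = f"{business} owner OR founder OR CEO"
    return {
        "mailto_url": f"mailto:{email}",
        "mailto_url_with_pitch":
            f"mailto:{email}?subject={quote(subject)}&body={quote(pitch)}",
        "tel_url": f"tel:{phone_e164}",
        "sms_url": f"sms:{phone_e164}",
        "whatsapp_url": f"https://wa.me/{wa_number}?text={quote(short_pitch)}",
        "linkedin_search_url":
            "https://www.linkedin.com/search/results/people/?"
            + urlencode({"keywords": people_query}),
        "google_search_url":
            "https://www.google.com/search?"
            + urlencode({"q": f'"{business}" {city}'}),
    }
```

`quote` is used for the mailto and WhatsApp payloads because spaces there should be encoded as `%20`, while `urlencode` suits ordinary query strings.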
TOP_LEADS sorted KV record (free, written automatically on bulk runs):
Top 20 prospects sorted by leadScore, with the most-actionable fields
flattened (businessName, city/state, leadScore, phoneE164, primaryEmail,
bestContact, mailtoUrl, whatsappUrl, linkedinSearchUrl).
Read it via:
client.key_value_store(run["defaultKeyValueStoreId"]).get_record("TOP_LEADS")
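For reference, a record of this shape could be produced roughly as follows. This is a sketch, not the actor's code: `build_top_leads` is a hypothetical name, only some of the flattened fields are shown, and the input key names are assumed to match the output names.

```python
def build_top_leads(items: list, limit: int = 20) -> list:
    """Return the highest-scoring records with a few flattened contact fields."""
    ranked = sorted(items, key=lambda it: it.get("leadScore") or 0, reverse=True)
    return [
        {
            "businessName": it.get("businessName"),
            "leadScore": it.get("leadScore"),
            "phoneE164": it.get("phoneE164"),
            "primaryEmail": it.get("primaryEmail"),
            "mailtoUrl": it.get("mailtoUrl"),
        }
        for it in ranked[:limit]
    ]
```

Records with a missing `leadScore` sort last because the key falls back to 0.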
CSV export gained 11 new columns: street, city, state, zipCode,
timezone, website_domain_age_years, emails_guessed_first,
mailto_url, whatsapp_url, linkedin_search_url,
yelp_competitors_url.
Phones from Schema.org are normalized to E.164 if no other phone is set.
Address parser handles "73 Geary St San Francisco, CA 94108" (no comma
between street and city).
Compatibility
All v1.2 fields preserved with the same names.
New layers are independently toggleable via input flags.
Pricing unchanged: $3/1K. The retry layer adds latency but no per-run
cost (Thunderbit is free); the JSON-LD/email-guess/address-parse
layers reuse already-fetched bytes.
[1.2] — 2026-05-23 — Lead-gen powerhouse
Added — 8 new feature layers (no cost increase)
The actor now reuses the single website-fetch from v1.1 to extract a much
richer feature set. Cost stays the same; per-business value goes up
significantly.
outreachPitch — ready-to-paste cold-outreach message tailored to the
Yelp category. 15 industry templates: restaurants, salons/spas,
dentists/medical, auto, home services, law, real estate, fitness, hotels,
retail, pets, events, cleaning, education, finance. Generic fallback for
uncategorized businesses.
Geocoding (extractGeocoding, default off, opt-in):
latitude, longitude via Nominatim (OpenStreetMap, free, rate-limited).
Off by default because Nominatim's usage policy allows at most 1 request
per second.
Aggregate summary (writeSummary, default on):
On bulk runs (>1 business), a SUMMARY record is written to the run's
key-value store (free — doesn't trigger pay-per-event). Contains:
avg_rating, median_reviews, avg_popularity_score, avg_lead_score,
by_customer_segment, by_quality_tier, by_lead_tier,
with_website_alive_pct, with_emails_pct, with_social_profiles_pct,
top_categories[], top_tech_detected[], chains_detected[],
open_now_count. Read it with
client.key_value_store(run["defaultKeyValueStoreId"]).get_record("SUMMARY").
CSV export (exportFormat: "csv"):
Flattens the rich JSON into a sales-ready 30-column CSV with one row
per business. Designed for direct import into Google Sheets / HubSpot /
Pipedrive / Salesforce. Sample columns: businessName, leadScore,
leadTier, customer_segment, phoneE164, emails_from_website_first,
instagram, facebook, tech_stack_summary, outreachPitch.
Chain filtering (excludeChains, default off):
When set, skips businesses with chain_likelihood_score ≥ 50 (Starbucks,
McDonald's, etc.). Useful for SMB lead-gen.
Identifier:
yelpBusinessId — the slug from yelp.com/biz/{slug} URL, useful for
deduplication.
Single website fetch shared by all enrichment — tech stack, emails,
phones, socials, SEO, action links all extracted from one HTTP GET (with
body truncated to 600KB). No additional bandwidth/cost vs v1.1.
online_presence_score rebalanced to credit social profiles, recovered
emails, and SEO hygiene.
chain_likelihood_score keyword list extended (Wing Stop, Raising Cane's,
Popeye's, Jersey Mike's, Jimmy John's, Qdoba, Jamba Juice, etc.).
Emails are filtered against an extended CDN blacklist (Wix, Shopify,
Cloudflare, GTM, etc.) and a TLD whitelist (~120 valid TLDs) to drop
noise like loc@ion.reload from JS fragments.
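The email filtering described above can be sketched as a regex pass followed by two checks. The sets below are small hypothetical stand-ins for the real TLD whitelist (~120 entries) and CDN blacklist, and `filter_emails` is an illustrative name.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical subsets; the actor's lists are longer.
VALID_TLDS = {"com", "net", "org", "io", "co", "us", "biz", "info"}
BLOCKED_DOMAINS = {"wix.com", "wixpress.com", "shopify.com",
                   "cloudflare.com", "googletagmanager.com"}


def filter_emails(text: str) -> list:
    """Extract email-like strings, dropping JS noise and platform addresses."""
    kept = []
    for candidate in EMAIL_RE.findall(text):
        email = candidate.lower()
        domain = email.rsplit("@", 1)[1]
        tld = domain.rsplit(".", 1)[-1]
        if tld not in VALID_TLDS:
            continue  # e.g. "loc@ion.reload" from a JS fragment
        if any(domain == b or domain.endswith("." + b) for b in BLOCKED_DOMAINS):
            continue  # platform/CDN address, not the business's own
        if email not in kept:
            kept.append(email)
    return kept
```

The TLD check is what removes `loc@ion.reload`: the regex accepts it as email-shaped, but `reload` is not a valid TLD.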
Compatibility
All v1.1 fields preserved with the same names.
New layers can be toggled off independently via input flags.
Existing integrations using only the old fields continue to work unchanged.
Pricing unchanged: $3/1K. The single website fetch is the same as v1.1
— we just read more out of the same bytes.
[1.1] — 2026-05-22
Removed
Direct Yelp HTML / JSON-LD layer (extractSchema) — Yelp aggressively
blocks both datacenter and residential IPs for direct page fetches. After
testing on Apify infrastructure (with and without residential proxy), this
layer returned no usable data. Removing it keeps the actor fast and reliable.
Geo coordinates, payment_accepted, serves_cuisine, schema_*, photos_count,
is_claimed, is_verified, schema_same_as fields are no longer attempted.
Improved
Hours intelligence parser now handles both colon-delimited and
whitespace-delimited Thunderbit hours formats:
Mon-Sun: 7:30 AM - 6:00 PM
Mon 7:30 AM - 6:00 PM (Thunderbit's newer output format)
Mon-Fri: 9 AM - 5 PM, Sat: 10 AM - 4 PM, Sun: Closed
Monday: 7:00 AM - 9:00 PM\nTuesday: ...
"Closed" / "Off" days correctly skipped
Website detection now skips Yelp's own yelp.com/biz/... URLs (Thunderbit
occasionally returns the listing URL itself instead of the external website)
and tries to recover an external URL from amenities/address fields
chain_likelihood_score is now punctuation-tolerant: it detects "In 'N Out
Burger", "In-N-Out", and "In N Out" as the same chain
online_presence_score rebalanced: it gives more weight to website + phone +
emails and no longer scores fields that depended on the dropped Schema layer
service_offerings_count now correctly counts amenities split by ; as well as ,
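The punctuation-tolerant chain matching noted above can be illustrated by comparing names after stripping everything except letters and digits. This is a sketch under assumptions: `match_chain`, the four-entry keyword list, and prefix matching are illustrative choices, not the actor's scoring logic.

```python
import re

# Hypothetical subset of the chain keyword list.
CHAIN_KEYWORDS = ["In-N-Out", "Starbucks", "McDonald's", "Jersey Mike's"]


def _normalize(name: str) -> str:
    """Lowercase and drop spaces, hyphens, apostrophes, and other punctuation."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


CHAIN_KEYS = {_normalize(keyword): keyword for keyword in CHAIN_KEYWORDS}


def match_chain(business_name: str):
    """Return the matching chain keyword, or None if the name matches no chain."""
    key = _normalize(business_name)
    for chain_key, label in CHAIN_KEYS.items():
        # Prefix match so trailing words like "Burger" do not prevent a hit.
        if key.startswith(chain_key):
            return label
    return None
```

All three spellings from the changelog normalize to a string beginning with `innout`, so they resolve to the same keyword. Prefix matching can produce false positives for independent businesses whose names start with a chain name, which is presumably why the actor reports a likelihood score rather than a yes/no flag.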
Verified on production
Verified on Apify infrastructure with a real Tartine Bakery URL: the
WEBSITE and AGE layers degrade gracefully when their upstream sources
(Thunderbit website extraction, Wayback Machine availability) are unavailable.
[1.0] — 2026-05-22
Added — Major intelligence expansion
The actor was rebranded from Yelp Business Scraper to Yelp Business Analyzer
to reflect the depth of analysis. The original Thunderbit scraping behaviour is
preserved — all previous fields remain available, untouched, on the same
extractCore flag. New layers can be toggled off independently.
New data sources:
🕐 Hours intelligence — full structured weekly schedule parsed from
Thunderbit's free-text hours field