2026-02-15 — v0.3: Data Cleanup & Deduplication
Deduplicated and flattened the raw enrichment output. Previously the price appeared in 12+ places, IDs in 5+ objects, promos 4 times, and the image list mixed product photos, UGC, avatars, and videos. The output is now a clean, consumer-friendly structure.
Output: ~1,850 lines/product → ~280 lines (85% reduction), zero data loss
Price: 12+ locations → 4 flat fields (price, priceYaBank, priceWithoutVat, currency)
Rating: 3 sources → 4 fields (rating, ratingCount, reviewCount, ratingDistribution)
Seller: scattered → 6 flat fields (sellerName, sellerLogo, sellerRating, etc.)
Images: 60+ mixed items → 3 typed arrays (images, ugcImages, videos) with dedup
Delivery: 4 overlapping structures → delivery + deliveryAlternatives
Promos: 4 copies → 1 array (richest version with descriptions, all entries kept)
IDs: scattered across 5+ objects → single top-level set
Specs: dropped redundant "Артикул Маркета" (Market article number) and "Бренд" (Brand) entries
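The image deduplication and typing described above can be sketched roughly as follows. This is a minimal illustration and not the actual clean_product() code: the `url` and `kind` keys and the helper names `dedupe_by_url` and `split_media` are hypothetical, and it assumes avatars are simply not routed into any of the three arrays.

```python
from typing import Any


def dedupe_by_url(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first occurrence of each URL, preserving input order."""
    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    for item in items:
        url = item.get("url")
        if not url or url in seen:
            continue  # skip entries without a URL and repeated URLs
        seen.add(url)
        unique.append(item)
    return unique


def split_media(items: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Route mixed media entries into images, ugcImages, and videos."""
    # Hypothetical "kind" labels; the real raw payload may tag media differently.
    kind_to_bucket = {"product": "images", "ugc": "ugcImages", "video": "videos"}
    buckets: dict[str, list[str]] = {"images": [], "ugcImages": [], "videos": []}
    for item in dedupe_by_url(items):
        bucket = kind_to_bucket.get(item.get("kind"))
        if bucket is not None:  # unknown kinds (e.g. avatars) are not routed
            buckets[bucket].append(item["url"])
    return buckets
```

Deduplicating before bucketing means a URL that shows up under two kinds is kept only under the first kind seen, which may or may not match the real behaviour.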
Removed (dropped blobs — all data preserved in flat fields)
offerSnippet (237 lines of pure duplication)
productPayload (duplicate prices/rating/gallery)
buyOption (unique fields extracted to top level)
compose (slugs extracted, IDs deduplicated)
shopInfo (flattened to seller* fields)
deliveryOptions, deliveryConsole (UI config noise)
jsonldOffer, jsonldDescription, jsonldImage, jsonldUrl (redundant)
aggregateRating (merged into flat rating fields)
Search tracking fields: showUid, pp, meta, isReferral, hasVendorUrl, signals, snippetType, type
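Dropping the blobs listed above amounts to filtering keys once the flat fields have been extracted. The sketch below uses the key names from this entry, but the function name `drop_redundant_keys` is hypothetical, and it assumes every removed key sits at the top level of the product dict.

```python
# Key names are taken from the list above; their grouping here is illustrative.
DROPPED_BLOBS = frozenset({
    "offerSnippet", "productPayload", "buyOption", "compose", "shopInfo",
    "deliveryOptions", "deliveryConsole",
    "jsonldOffer", "jsonldDescription", "jsonldImage", "jsonldUrl",
    "aggregateRating",
})
TRACKING_FIELDS = frozenset({
    "showUid", "pp", "meta", "isReferral", "hasVendorUrl",
    "signals", "snippetType", "type",
})


def drop_redundant_keys(product: dict) -> dict:
    """Return a copy of the product without dropped blobs or tracking fields."""
    removed = DROPPED_BLOBS | TRACKING_FIELDS
    return {key: value for key, value in product.items() if key not in removed}
```

Returning a new dict rather than mutating the input keeps the raw record available in case a later step still needs it.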
clean_product() in extractors.py — final transform step before dataset push
New top-level fields: modelName, categoryName, productSlug, oskuSlug, articleNumber, businessId, feedId, navnodeId, departmentId, priceYaBank, priceWithoutVat, stockCount, maxOrderQuantity, placementType, paymentType, paymentMethods, isBnpl, offerFlags, benefitBadge, feedOfferId, warehouseId, searchPosition
isCrossBorder type fix: string "false" → boolean
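The isCrossBorder fix is a reminder that `bool("false")` is `True` in Python, so the string has to be parsed rather than cast. A minimal sketch of such a coercion, with the hypothetical helper name `to_bool` and the assumption that only the literal "true" (any casing) should map to `True`:

```python
def to_bool(value: object) -> bool:
    """Coerce a boolean-like value to bool, parsing "true"/"false" strings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        # Any string other than "true" (case-insensitive) is treated as False.
        return value.strip().lower() == "true"
    return bool(value)
```

For example, `to_bool("false")` yields `False`, whereas a plain `bool("false")` would yield `True`.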
2026-02-14 — v0.2: Render-Lazy Enrichment Migration
Migrated product enrichment from full SSR page fetch (2.4MB/product) to render-lazy widget API (6 POSTs, ~95KB/product). 96% bandwidth reduction with richer structured data.
Enrichment: 6 render-lazy widget POSTs replace 1 full-page GET per product
ProductPageMeta (2.6KB) — JSON-LD: brand, price, availability
ProductDescription (14.5KB) — description + specs (same HTML format)
DefaultOfferUnified (45KB) — buyOption, stock, images, compose IDs
ShopInfoBlock (9.8KB) — seller name/logo/rating as structured JSON
Breadcrumbs (6.3KB) — JSON-LD BreadcrumbList
DeliveryConsole (17KB) — delivery dates/prices/types
Retries: 3 attempts (down from 5), wrapping the entire 6-widget batch
Enrichability: Only needs modelId + marketSku (no slug extraction needed)
Session: Exposes csp_nonce from window.state for widget requests
New fields: sellerRating (float), sellerLogo (URL), sellerRatingCount, stock, compose (master ID object), buyOption (exact price + currency)
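Retrying the whole batch, rather than each widget on its own, might look like the sketch below. The widget names and the limit of 3 attempts come from this entry. Everything else is an assumption: the injected `fetch_widget` callable stands in for the real POST, and the linear backoff is illustrative only.

```python
import time
from typing import Any, Callable

WIDGETS = (
    "ProductPageMeta", "ProductDescription", "DefaultOfferUnified",
    "ShopInfoBlock", "Breadcrumbs", "DeliveryConsole",
)
MAX_ATTEMPTS = 3


def fetch_batch(
    fetch_widget: Callable[[str, str, str], Any],
    model_id: str,
    market_sku: str,
    backoff_seconds: float = 1.0,  # assumed; the changelog does not specify a delay
) -> dict[str, Any]:
    """Fetch all six widgets; if any one fails, retry the entire batch."""
    last_error: Exception | None = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return {name: fetch_widget(name, model_id, market_sku) for name in WIDGETS}
        except Exception as exc:  # any widget failure discards the partial batch
            last_error = exc
            if attempt < MAX_ATTEMPTS:
                time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"widget batch failed after {MAX_ATTEMPTS} attempts") from last_error
```

The trade-off of a batch-level retry is that one flaky widget costs up to six extra requests per attempt, but the resulting record never mixes widgets from different attempts.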
includeReviews input (default false) — fetches full product page for review text (separate from widget enrichment)
tests/test_review_widgets.py — pre-implementation test for review/seller widgets
aggregateRating — not available in render-lazy ProductPageMeta (was on full page)
numGrades, numReviews, numOffers (zone-data, not available in widgets). Reviews fall back to a full-page fetch if includeReviews is enabled
Slug extraction dependency — render-lazy uses dummy slug x
2026-02-14 — v0.1: Actor Implementation
First working Apify actor. All scraping logic ported from 9 test files into production structure.
Search: Resolve API with resolvePoorRemoteSearch, 48 products/page, backendState pagination up to 1500 products
Extraction: 35+ fields per product from zone-data (title, price, SKU, modelId, shopId, vendorId, delivery, etc.)
Filters: sortBy, priceFrom/priceTo, categoryId (hid), passed as resolve API params
Region: Cookie-based region override (lr + yandex_gid), defaults to Moscow (213)
Enrichment (on by default): Fetches /product--{slug}/{modelId}?sku={skuId} for each product
Brand, description, 10-15 specs, seller name, aggregateRating, 6-level breadcrumbs
37-43 original-quality images, up to 5 reviews with author/date/pros/cons/comment/grade
numGrades vs numReviews distinction, numOffers (seller count)
Specs extraction: CSS-class-free, using the specLink zone-data text field with a bare <span> fallback
Session: curl_cffi with Chrome TLS fingerprint, automatic sk/version extraction from window.state
Error handling: CAPTCHA detection, session expiry auto-recovery, user-friendly error messages
Proxy: Input proxyUrl > PROXY_URL env > Apify proxy configuration
PPE: Single product_scraped event per product
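The proxy precedence in the list above (input, then environment, then Apify) can be expressed as a small resolver. This is a sketch only: the function name `resolve_proxy` and the returned `(source, url)` tuple are hypothetical, and the Apify fallback is left as a marker because the actual SDK call is not shown in this changelog.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_proxy(
    actor_input: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> tuple[str, Optional[str]]:
    """Pick the proxy source by precedence: input > PROXY_URL env > Apify."""
    env = os.environ if env is None else env
    input_url = actor_input.get("proxyUrl")
    if isinstance(input_url, str) and input_url:
        return "input", input_url
    env_url = env.get("PROXY_URL")
    if env_url:
        return "env", env_url
    # No explicit URL: the caller would fall back to Apify proxy configuration.
    return "apify", None
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.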
55 fields per enriched product (25 from search + 11 from enrichment + 19 bonus zone-data fields).
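For the Search entry above, 48 products per page with a 1500-product cap works out to 32 pages, the last one contributing only 12 products (31 × 48 = 1488). The sketch below makes that arithmetic explicit. The function name `page_plan` is hypothetical, and trimming the final page to the cap is an assumption about how the limit is applied.

```python
import math

PAGE_SIZE = 48       # products per resolve API page (from the Search entry)
MAX_PRODUCTS = 1500  # pagination cap (from the Search entry)


def page_plan(requested: int) -> list[int]:
    """Return how many products to take from each page for a requested total."""
    target = max(0, min(requested, MAX_PRODUCTS))
    pages = math.ceil(target / PAGE_SIZE)
    return [min(PAGE_SIZE, target - index * PAGE_SIZE) for index in range(pages)]
```

A request for 100 products would therefore need three pages, taking 48, 48, and 4 products respectively.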