AliExpress Product Search Scraper (Crawlee, Proxy-Ready)

Scrape AliExpress search results by keywords with proxy support, session rotation, and cost guardrails. Outputs a clean, typed dataset (type: product) plus optional suggestion/unknown entries. Also available as a Pay-per-event version (better for automation/MCP).

Pricing

$19.00/month + usage

Rating

5.0 (1)

Developer

TrendMatch (Maintained by Community)
Actor stats

Bookmarked: 1
Total users: 1
Monthly active users: 1
Last modified: 15 days ago

AliExpress Product Scraper (PRO)

Scrape AliExpress search results at scale using Crawlee PlaywrightCrawler with automatic session rotation, proxy support, and anti-bot handling.

Built for the Apify Store as a Rental Actor.

What it does

For each search query you provide, the Actor:

  1. Opens AliExpress search pages (SEO-friendly URLs with automatic fallback).
  2. Extracts product cards: title, price, image, rating, orders, store name, and URL.
  3. Deduplicates results across queries by product URL.
  4. (Optional) Visits individual product detail pages for enriched data (multi-strategy: window.runParams, ld+json, DOM fallback).
  5. Pushes all items to the Apify Dataset and saves run metadata to the Key-Value store.
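The deduplication in step 3 can be sketched as a simple URL-keyed set. This is an illustrative sketch with a hypothetical helper name, not the Actor's actual code:

```javascript
// Sketch of cross-query deduplication by product URL (hypothetical helper).
// The query string is stripped so tracking parameters don't defeat dedup.
function dedupeByUrl(items) {
  const seen = new Set();
  const unique = [];
  for (const item of items) {
    const key = item.url.split('?')[0];
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}
```

Because the first occurrence wins, an item found under an earlier query keeps that query's attribution.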

Quick start

Minimal input:

{
  "queries": ["LED strip lights", "wireless earbuds"]
}

Recommended input for reliable, cost-efficient scraping:

{
  "queries": ["LED strip lights", "wireless earbuds", "phone case iPhone 15"],
  "maxPages": 3,
  "maxItemsTotal": 300,
  "maxConcurrency": 2,
  "throttleMs": 2500,
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

With enrichment:

{
  "queries": ["portable blender"],
  "maxPages": 2,
  "maxItemsTotal": 100,
  "enrich": true,
  "enrichLimit": 30,
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Input parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| queries | string[] | required | Search keywords to scrape |
| maxQueries | int | 10 | Max queries to process (cap: 200) |
| maxPages | int | 3 | Pages per query (cap: 20) |
| maxItemsTotal | int | 200 | Total items hard cap (cap: 5000) |
| maxConcurrency | int | 2 | Parallel browser pages (cap: 5) |
| throttleMs | int | 2500 | Min delay between requests in ms |
| maxRequestRetries | int | 3 | Retries per failed request |
| headless | bool | true | Run browser headless |
| enrich | bool | false | Visit product detail pages |
| enrichLimit | int | 20 | Max products to enrich |
| includeSuggestions | bool | false | Include related-search suggestions in dataset (typed as type: "suggestion") |
| includeUnknown | bool | false | Include items with a valid URL but no commerce signals (typed as type: "unknown") |
| proxy | object | { useApifyProxy: true, apifyProxyGroups: ["AUTO"] } | Proxy configuration (Apify proxy editor) |
| proxy.useApifyProxy | bool | true | Use Apify Proxy |
| proxy.apifyProxyGroups | string[] | ["AUTO"] | Proxy group(s) |
| proxy.proxyUrls | string[] | [] | Custom proxy URLs |
| debugLog | bool | false | Verbose logging |
| saveHtmlOnError | bool | false | Save blocked page HTML to the KV store |
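The caps listed above can be thought of as a clamp applied on top of the defaults. The sketch below is a hypothetical illustration of that behavior; the Actor's real validation logic may differ:

```javascript
// Hypothetical sketch of clamping numeric inputs to the documented hard caps.
// Missing or non-numeric values fall back to the documented defaults.
const CAPS = { maxQueries: 200, maxPages: 20, maxItemsTotal: 5000, maxConcurrency: 5 };
const DEFAULTS = { maxQueries: 10, maxPages: 3, maxItemsTotal: 200, maxConcurrency: 2 };

function applyCaps(input) {
  const out = { ...input };
  for (const key of Object.keys(CAPS)) {
    const value = Number.isFinite(out[key]) ? out[key] : DEFAULTS[key];
    out[key] = Math.min(Math.max(1, value), CAPS[key]);
  }
  return out;
}
```

So a request for `maxPages: 50` would effectively run with 20, the documented cap.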

Proxy groups: AUTO vs RESIDENTIAL

AliExpress uses aggressive bot protection. Proxies are the biggest factor in reliability.

| Option | When to use | Pros | Cons |
|---|---|---|---|
| apifyProxyGroups: ["AUTO"] (default) | Start here | Cheapest/simplest default | May hit WAF/CAPTCHA more often |
| apifyProxyGroups: ["RESIDENTIAL"] | If you see frequent blocks | Much more reliable on protected pages | Higher proxy cost for the user |

Disabling proxy: Set proxy.useApifyProxy: false and leave proxy.proxyUrls empty to run without any proxy. This is useful for local testing but will almost certainly get blocked on AliExpress in production.

Custom proxies: Set proxy.useApifyProxy: false and provide your own URLs in proxy.proxyUrls (format: http://user:pass@host:port). The Actor will rotate through them automatically.
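The rotation behavior described above amounts to round-robin over the provided URLs. A minimal sketch (not the Actor's internal code, which uses Crawlee's ProxyConfiguration):

```javascript
// Minimal round-robin rotation over custom proxy URLs, as a sketch of the
// behavior described above. Hypothetical helper name.
function makeProxyRotator(proxyUrls) {
  let i = 0;
  return () => {
    const url = proxyUrls[i % proxyUrls.length];
    i += 1;
    return url;
  };
}
```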

Practical guidance:

  • Start with AUTO.
  • If results are empty or blocked often, switch to RESIDENTIAL and keep maxConcurrency low (1-2) with throttleMs >= 2500.

Output dataset fields

Each item in the output dataset has these fields:

Product items (type: "product")

| Field | Type | Description |
|---|---|---|
| type | string | Always "product" |
| query | string | The search query that produced this item |
| rank | int | Position on the search results page |
| productId | string | AliExpress product ID |
| title | string | Product title |
| url | string | Canonical product URL |
| image | string \| null | Product image URL; null if AliExpress did not provide a valid CDN image |
| imageValid | bool | true if the image is from a known AliExpress CDN and not a tiny UI asset |
| price | number | Sale price (USD); may be null if not extracted |
| originalPrice | number | Original price before discount |
| currency | string | Currency code (typically USD) |
| discount | number | Discount percentage |
| rating | number | Star rating (0-5); may be null |
| orders | int | Approximate orders/sold count; may be null |
| storeName | string | Seller store name; may be null |
| scrapedAt | string | ISO 8601 timestamp |
| source | string | Always "aliexpress" |
| enriched | bool | Whether detail-page enrichment succeeded (only if enrich=true) |
| enrichSources | string | Which data sources were used for enrichment |

Note: price, rating, orders, storeName, and image depend on what AliExpress renders and may be null. They are not used as hard filters — a product is included if it has a valid /item/ URL and an extractable numeric productId. Image validity is a soft check: imageValid indicates whether the image comes from a known AliExpress CDN host.
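Because these fields are nullable, downstream consumers should guard before using them. A small hypothetical post-processing sketch that keeps only products with extracted prices:

```javascript
// Hypothetical downstream filter: keep only product items whose price was
// actually extracted. Nullable fields must be checked, never assumed.
function withPrices(items) {
  return items.filter((it) => it.type === 'product' && typeof it.price === 'number');
}
```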

Suggestion items (type: "suggestion") — only with includeSuggestions: true

| Field | Type | Description |
|---|---|---|
| type | string | Always "suggestion" |
| suggestion | string | The related search term |
| query | string | The original query that produced this suggestion |
| rank | int | Order within the suggestion block |
| scrapedAt | string | ISO 8601 timestamp |

Product filtering

The Actor applies three layers of filtering to ensure a clean, store-ready dataset:

Layer 1 — Grid scoping (DOM)

The page.evaluate scopes its search to the main product-results grid using known container selectors (SearchProductFeed, search-item-card-wrapper, manhattan--container, list--gallery--). If none match, it falls back to document.body. Within that scope, each a[href*="/item/"] link must live inside a card-like container ([class*="card"], [class*="product"], [class*="item"], [class*="gallery"]) and that container must contain an <img> element. Bare text anchors and cards without images are skipped at the DOM level.

Layer 2 — Hard URL filter

A card is kept only if both conditions are met:

  1. URL contains /item/ — standard AliExpress product URL pattern.
  2. Numeric productId is extractable from the URL (e.g. /item/1234567890.html).

Cards that fail either check are silently dropped (nonProductDropped).
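The two conditions above can be expressed in one regular expression: the URL must contain `/item/` followed by a numeric ID. This sketch uses a hypothetical helper name; the Actor's real extraction may differ:

```javascript
// Sketch of the hard URL filter: match /item/<digits>, optionally followed
// by .html, and return the numeric productId or null.
function extractProductId(url) {
  const m = url.match(/\/item\/(\d+)(?:\.html)?/);
  return m ? m[1] : null;
}
```

A non-null return means the card passes both Layer 2 checks.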

Layer 3 — Commerce signal check

After extraction, items with no commerce signals (price, orders, rating, storeName all null) and no valid CDN image are classified as type: "unknown" and dropped by default (unknownDropped). Set includeUnknown: true to keep them in the dataset.
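The classification rule can be sketched as follows (hypothetical helper name; the Actor's internal logic may differ in detail):

```javascript
// Sketch of the Layer 3 commerce signal check: an item with no price,
// orders, rating, or storeName, and no valid CDN image, is typed "unknown".
function classifyItem(item) {
  const hasSignal = [item.price, item.orders, item.rating, item.storeName]
    .some((v) => v !== null && v !== undefined);
  return hasSignal || item.imageValid ? 'product' : 'unknown';
}
```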

Image validation (soft)

Image validation is a soft check — products with missing or invalid images are still included with image: null and imageValid: false. The missingImage metric tracks how many products had no valid CDN image. Valid images must be from a known AliExpress CDN (*.alicdn.com, *.aliexpress-media.com, *.aliyuncs.com) and not a tiny UI asset.
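The CDN host check can be sketched as a suffix allow-list over the hostnames listed above. This is an assumption about the shape of the check, not the Actor's actual isValidProductImage implementation:

```javascript
// Sketch of the CDN allow-list check: the image hostname must end with one
// of the known AliExpress CDN suffixes. Malformed URLs fail the check.
const CDN_SUFFIXES = ['.alicdn.com', '.aliexpress-media.com', '.aliyuncs.com'];

function isAliexpressCdnImage(imageUrl) {
  try {
    const host = new URL(imageUrl).hostname;
    return CDN_SUFFIXES.some((s) => host.endsWith(s));
  } catch {
    return false; // not a parseable URL
  }
}
```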

Run metadata

After each run, the Actor stores a metadata record in the Key-Value store with:

  • runSummary: queriesTotal, searchRequestsProcessed, uniqueQueriesProcessed, pagesFetched, itemsPushed, itemsSkipped, productsFound, productsPushed, nonProductDropped, missingImage, unknownDropped, suggestionsCaptured, blockedCount, failedCount, sessionsRotated, durationSecs
  • failedQueries: array of { query, page, reason }
  • effectiveConfig: the actual config used (with proxy credentials redacted)

Access it via the API: GET /v2/key-value-stores/{storeId}/records/metadata
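For programmatic access, the full record URL for the endpoint above can be assembled like this; the store ID and token here are placeholders you supply:

```javascript
// Build the Apify API URL for the run-metadata record. storeId and token
// are caller-supplied placeholders; the token is URL-encoded if present.
function metadataRecordUrl(storeId, token) {
  const base = `https://api.apify.com/v2/key-value-stores/${storeId}/records/metadata`;
  return token ? `${base}?token=${encodeURIComponent(token)}` : base;
}
```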

Cost estimates

| Scenario | maxConcurrency | throttleMs | Proxy | Estimated CU |
|---|---|---|---|---|
| Light (5 queries, 3 pages) | 1 | 3000 | AUTO | ~0.05-0.1 |
| Medium (20 queries, 3 pages) | 2 | 2500 | AUTO | ~0.2-0.5 |
| Heavy (50 queries, 5 pages) | 2 | 2500 | RESIDENTIAL | ~1-3 |
| With enrichment (+50 products) | 1 | 3000 | RESIDENTIAL | +0.5-1 |

Key cost-control lever: maxItemsTotal caps total output regardless of queries/pages.

Anti-bot notes

AliExpress uses several anti-bot protections:

  • CAPTCHA/WAF/Punish pages: The Actor detects these automatically, retires the session, and retries with a fresh session/proxy.
  • SEO alphabet redirects: Detected and handled via fallback SearchText URLs.
  • Locale redirects: US-market cookies are injected to force USD pricing.
  • Rate limiting: The throttleMs delay (with random jitter) helps avoid triggering rate limits.
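One plausible shape for throttle-with-jitter is a base delay plus a bounded random component. The exact jitter formula is not documented here, so this is an assumption for illustration:

```javascript
// Sketch of throttle-with-jitter: base delay plus up to 50% random jitter.
// The rand parameter is injectable so the behavior is testable.
function jitteredDelay(throttleMs, rand = Math.random) {
  return Math.round(throttleMs * (1 + 0.5 * rand()));
}
```

Randomizing the delay avoids the fixed-interval request pattern that rate limiters detect easily.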

Best practices:

  • Use RESIDENTIAL proxy group for the most reliable results.
  • Keep maxConcurrency at 1-2.
  • Keep throttleMs at 2500ms or higher.
  • If you see many blocks in the logs, increase throttle and switch to RESIDENTIAL.

Troubleshooting

| Problem | Solution |
|---|---|
| All requests blocked | Switch to the RESIDENTIAL proxy group; increase throttleMs to 3000+ |
| Empty results | Check whether your query returns results on aliexpress.us manually; enable saveHtmlOnError to inspect blocked pages |
| Wrong currency / prices | The Actor forces USD via cookies, but some redirects can override this; use the RESIDENTIAL proxy group for US-based IPs |
| Enrichment low success rate | AliExpress detail pages are heavily protected; consider increasing throttleMs for enrichment |
| Run times out | Reduce maxQueries, maxPages, or maxItemsTotal; lower enrichLimit |
| "No proxy configured" warning | Enable proxy.useApifyProxy or provide proxy.proxyUrls; without a proxy, AliExpress will block almost immediately |

Local testing

Run the self-contained smoke test (creates storage, writes INPUT, runs the Actor):

$ npm test

Manual equivalent (reuses the smoke-test storage):

$ CRAWLEE_STORAGE_DIR="$(pwd)/apify_storage_smoke" node src/main.js

Note: Apify SDK v3 uses CRAWLEE_STORAGE_DIR (not APIFY_LOCAL_STORAGE_DIR) to resolve the local key-value store where INPUT.json lives.

Monetization

This Actor is designed for Rental monetization on the Apify Store. Users rent access and pay for their own platform usage (compute units, proxy traffic) on their Apify account. There are no hidden costs or external API dependencies.

Changelog

v1.0.5 (2026-02-13)

  • Anti-false-positive filtering: grid scoping + card structure gates + commerce signal check.
  • page.evaluate now scopes to the main results grid (SearchProductFeed, search-item-card-wrapper, manhattan--container, list--gallery--) with document.body fallback.
  • Card detection requires a meaningful container class (card, product, item, gallery) — removed overly broad div[class] / parentElement fallback.
  • Each card must contain an <img> element; bare text anchors are skipped at DOM level.
  • Commerce signal check: items with no price, orders, rating, storeName, and no valid image are classified as type: "unknown" and dropped by default.
  • New includeUnknown input (default false): when enabled, unknown items appear in the dataset with type: "unknown".
  • New unknownDropped metric in run metadata.

v1.0.4 (2026-02-13)

  • Image validation is now soft: products with missing or invalid images are no longer discarded. Instead they get image: null and imageValid: false.
  • Hard filters remain: URL must contain /item/ and numeric productId must be extractable.
  • New imageValid boolean field on every product item.
  • New missingImage metric in run metadata (counts products with invalid/absent image).

v1.0.3 (2026-02-13)

  • Store-ready dataset: Non-product cards (suggestions, promos, banners) are now filtered out by default.
  • Product validation: URL must contain /item/, numeric productId must be extractable, image must be from a known AliExpress CDN host.
  • New type field on every dataset item ("product" or "suggestion").
  • New includeSuggestions input (default false): when enabled, related-search suggestions are captured with type: "suggestion".
  • New filtering metrics in run metadata: productsFound, productsPushed, nonProductDropped, suggestionsCaptured.
  • Image URL normalization (protocol-less URLs get https: prefix).
  • New helpers: isProductUrl, isValidProductImage, normalizeImageUrl, classifyCard, parseSuggestions.

v1.0.0 (2026-02-12)

  • Initial release.
  • Crawlee PlaywrightCrawler with SessionPool and ProxyConfiguration.
  • Search result parsing with SEO URL + fallback strategy.
  • Optional detail-page enrichment (runParams / ld+json / DOM).
  • Input validation with hard caps and conservative defaults.
  • CAPTCHA/WAF/block detection with automatic session rotation.
  • Run metadata with summary, failed queries, and effective config.