AliExpress Product Search Scraper (Crawlee, Proxy-Ready)
Pricing
$19.00/month + usage
Scrape AliExpress search results by keyword with proxy support, session rotation, and cost guardrails. Outputs a clean, typed dataset (type: product) plus optional suggestion/unknown entries. Also available as a pay-per-event version (better for automation/MCP).
Developer: TrendMatch
AliExpress Product Scraper (PRO)
Scrape AliExpress search results at scale using Crawlee PlaywrightCrawler with automatic session rotation, proxy support, and anti-bot handling.
Built for the Apify Store as a Rental Actor.
What it does
For each search query you provide, the Actor:
- Opens AliExpress search pages (SEO-friendly URLs with automatic fallback).
- Extracts product cards: title, price, image, rating, orders, store name, and URL.
- Deduplicates results across queries by product URL.
- (Optional) Visits individual product detail pages for enriched data (multi-strategy: window.runParams, ld+json, DOM fallback).
- Pushes all items to the Apify Dataset and saves run metadata to the Key-Value store.
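The deduplication step could look roughly like the sketch below. It is not the Actor's actual code: `dedupeByUrl` is a hypothetical helper, and stripping the query string and fragment before comparing is an assumption about what counts as "the same" product URL.

```javascript
// Hypothetical helper: keeps the first product item seen for each product URL.
function dedupeByUrl(items) {
  const seen = new Set();
  const unique = [];
  for (const item of items) {
    // Assumption: query strings and fragments are tracking noise, so ignore them when comparing.
    const key = item.url.split(/[?#]/)[0];
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(item);
  }
  return unique;
}
```

Because the Set lives outside any single query, an item found by two different queries is kept only once, under the query that produced it first.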
Quick start
Minimal input:
{"queries": ["LED strip lights", "wireless earbuds"]}
Recommended input for reliable, cost-efficient scraping:
{"queries": ["LED strip lights", "wireless earbuds", "phone case iPhone 15"],"maxPages": 3,"maxItemsTotal": 300,"maxConcurrency": 2,"throttleMs": 2500,"proxy": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}}
With enrichment:
{"queries": ["portable blender"],"maxPages": 2,"maxItemsTotal": 100,"enrich": true,"enrichLimit": 30,"proxy": {"useApifyProxy": true,"apifyProxyGroups": ["RESIDENTIAL"]}}
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| queries | string[] | required | Search keywords to scrape |
| maxQueries | int | 10 | Max queries to process (cap: 200) |
| maxPages | int | 3 | Pages per query (cap: 20) |
| maxItemsTotal | int | 200 | Total items hard cap (cap: 5000) |
| maxConcurrency | int | 2 | Parallel browser pages (cap: 5) |
| throttleMs | int | 2500 | Min delay between requests in ms |
| maxRequestRetries | int | 3 | Retries per failed request |
| headless | bool | true | Run browser headless |
| enrich | bool | false | Visit product detail pages |
| enrichLimit | int | 20 | Max products to enrich |
| includeSuggestions | bool | false | Include related-search suggestions in dataset (typed as type: "suggestion") |
| includeUnknown | bool | false | Include items with valid URL but no commerce signals (typed as type: "unknown") |
| proxy | object | { useApifyProxy: true, apifyProxyGroups: ["AUTO"] } | Proxy configuration (Apify proxy editor) |
| proxy.useApifyProxy | bool | true | Use Apify Proxy |
| proxy.apifyProxyGroups | string[] | ["AUTO"] | Proxy group(s) |
| proxy.proxyUrls | string[] | [] | Custom proxy URLs |
| debugLog | bool | false | Verbose logging |
| saveHtmlOnError | bool | false | Save blocked page HTML to KV store |
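The defaults and caps in the table could be enforced at startup along these lines. This is a minimal sketch, not the Actor's actual validation code: `normalizeInput` and `LIMITS` are hypothetical names, and raising values below 1 up to 1 is an assumption, since the table documents only upper caps.

```javascript
// Defaults and caps copied from the table above; the function itself is a hypothetical sketch.
const LIMITS = {
  maxQueries: { def: 10, cap: 200 },
  maxPages: { def: 3, cap: 20 },
  maxItemsTotal: { def: 200, cap: 5000 },
  maxConcurrency: { def: 2, cap: 5 },
};

function normalizeInput(input) {
  if (!Array.isArray(input.queries) || input.queries.length === 0) {
    throw new Error('"queries" is required and must be a non-empty array');
  }
  const out = { ...input };
  for (const [name, { def, cap }] of Object.entries(LIMITS)) {
    // Fall back to the documented default when the value is missing or not an integer.
    const value = Number.isInteger(input[name]) ? input[name] : def;
    // Assumption: values below 1 are raised to 1; values above the cap are lowered to it.
    out[name] = Math.min(Math.max(value, 1), cap);
  }
  // Only the first maxQueries queries are processed.
  out.queries = input.queries.slice(0, out.maxQueries);
  return out;
}
```

For example, `normalizeInput({ queries: ['LED strip lights'], maxPages: 50 })` would yield `maxPages: 20` and fill in the remaining numeric defaults.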
Proxy groups: AUTO vs RESIDENTIAL
AliExpress uses aggressive bot protection. Proxies are the biggest factor in reliability.
| Option | When to use | Pros | Cons |
|---|---|---|---|
| apifyProxyGroups: ["AUTO"] (default) | Start here | Cheapest/simplest default | May hit WAF/CAPTCHA more often |
| apifyProxyGroups: ["RESIDENTIAL"] | If you see frequent blocks | Much more reliable on protected pages | Higher proxy cost for the user |
Disabling proxy: Set proxy.useApifyProxy: false and leave proxy.proxyUrls empty to run without any proxy. This is useful for local testing but will almost certainly get blocked on AliExpress in production.
Custom proxies: Set proxy.useApifyProxy: false and provide your own URLs in proxy.proxyUrls (format: http://user:pass@host:port). The Actor will rotate through them automatically.
Practical guidance:
- Start with AUTO.
- If results are empty or blocked often, switch to RESIDENTIAL and keep maxConcurrency low (1-2) with throttleMs >= 2500.
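The three proxy modes described in this section (Apify Proxy, custom URLs, no proxy) can be summarized as a small decision function. This is an illustrative sketch only: `resolveProxyMode` is a hypothetical helper, and the Actor's real wiring into Crawlee's proxy configuration is not shown in this README.

```javascript
// Hypothetical helper: reports which proxy mode a given `proxy` input selects.
// The default argument mirrors the documented default input.
function resolveProxyMode(proxy = { useApifyProxy: true, apifyProxyGroups: ['AUTO'] }) {
  if (proxy.useApifyProxy) {
    // Apify Proxy with the requested groups, falling back to AUTO.
    return { mode: 'apify', groups: proxy.apifyProxyGroups ?? ['AUTO'] };
  }
  const urls = proxy.proxyUrls ?? [];
  if (urls.length > 0) {
    // Custom proxies in http://user:pass@host:port format.
    return { mode: 'custom', proxyUrls: urls };
  }
  // No proxy at all: suitable for local testing only.
  return { mode: 'none' };
}
```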
Output dataset fields
Each item in the output dataset has these fields:
Product items (type: "product")
| Field | Type | Description |
|---|---|---|
| type | string | Always "product" |
| query | string | The search query that produced this item |
| rank | int | Position on the search results page |
| productId | string | AliExpress product ID |
| title | string | Product title |
| url | string | Canonical product URL |
| image | string\|null | Product image URL; null if AliExpress did not provide a valid CDN image |
| imageValid | bool | true if image is from a known AliExpress CDN and not a tiny UI asset |
| price | number | Sale price (USD); may be null if not extracted |
| originalPrice | number | Original price before discount |
| currency | string | Currency code (typically USD) |
| discount | number | Discount percentage |
| rating | number | Star rating (0-5); may be null |
| orders | int | Approximate orders/sold count; may be null |
| storeName | string | Seller store name; may be null |
| scrapedAt | string | ISO 8601 timestamp |
| source | string | Always aliexpress |
| enriched | bool | Whether detail-page enrichment succeeded (only if enrich=true) |
| enrichSources | string | Which data sources were used for enrichment |
Note: price, rating, orders, storeName, and image depend on what AliExpress renders and may be null. They are not used as hard filters: a product is included if it has a valid /item/ URL and an extractable numeric productId. Image validity is a soft check: imageValid indicates whether the image comes from a known AliExpress CDN host.
Suggestion items (type: "suggestion") — only with includeSuggestions: true
| Field | Type | Description |
|---|---|---|
| type | string | Always "suggestion" |
| suggestion | string | The related search term |
| query | string | The original query that produced this suggestion |
| rank | int | Order within the suggestion block |
| scrapedAt | string | ISO 8601 timestamp |
Product filtering
The Actor applies three layers of filtering to ensure a clean, store-ready dataset:
Layer 1 — Grid scoping (DOM)
The page.evaluate scopes its search to the main product-results grid using known container selectors (SearchProductFeed, search-item-card-wrapper, manhattan--container, list--gallery--). If none match, it falls back to document.body. Within that scope, each a[href*="/item/"] link must live inside a card-like container ([class*="card"], [class*="product"], [class*="item"], [class*="gallery"]) and that container must contain an <img> element. Bare text anchors and cards without images are skipped at the DOM level.
Layer 2 — Hard URL filter
A card is kept only if both conditions are met:
- URL contains /item/ (the standard AliExpress product URL pattern).
- Numeric productId is extractable from the URL (e.g. /item/1234567890.html).
Cards that fail either check are silently dropped (nonProductDropped).
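A minimal sketch of the two Layer 2 checks follows. The name `isProductUrl` appears in the changelog, but its real signature and body are not documented here, so both functions below are assumptions. The regular expression covers only the `/item/<digits>.html` shape given as the example above.

```javascript
// Assumed implementation: returns the numeric product ID from an item URL, or null.
function extractProductId(url) {
  const match = /\/item\/(\d+)\.html/.exec(url);
  return match ? match[1] : null;
}

// A card passes the hard filter only if an ID can be extracted,
// which also implies the URL contains /item/.
function isProductUrl(url) {
  return extractProductId(url) !== null;
}
```

The ID is returned as a string, matching the productId type in the output dataset table.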
Layer 3 — Commerce signal check
After extraction, items with no commerce signals (price, orders, rating, storeName all null) and no valid CDN image are classified as type: "unknown" and dropped by default (unknownDropped). Set includeUnknown: true to keep them in the dataset.
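The Layer 3 rule can be expressed as a small classifier. The sketch below follows the rule as worded above; `classifyItem` and `applyUnknownFilter` are hypothetical names rather than the Actor's own helpers, and the items are assumed to already carry the `imageValid` flag.

```javascript
// An item is "unknown" only when every commerce signal is missing AND the image is not valid.
function classifyItem(item) {
  const signals = [item.price, item.orders, item.rating, item.storeName];
  const hasCommerceSignal = signals.some((value) => value !== null && value !== undefined);
  return hasCommerceSignal || item.imageValid === true ? 'product' : 'unknown';
}

// Drops unknown items unless includeUnknown is set, and counts the drops.
function applyUnknownFilter(items, includeUnknown = false) {
  let unknownDropped = 0;
  const kept = [];
  for (const item of items) {
    const type = classifyItem(item);
    if (type === 'unknown' && !includeUnknown) {
      unknownDropped += 1;
      continue;
    }
    kept.push({ ...item, type });
  }
  return { kept, unknownDropped };
}
```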
Image validation (soft)
Image validation is a soft check — products with missing or invalid images are still included with image: null and imageValid: false. The missingImage metric tracks how many products had no valid CDN image. Valid images must be from a known AliExpress CDN (*.alicdn.com, *.aliexpress-media.com, *.aliyuncs.com) and not a tiny UI asset.
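The CDN host check might be sketched as follows. The name `normalizeImageUrl` comes from the changelog, but the bodies here are assumptions, and `isKnownCdnImage` is a hypothetical name. The "tiny UI asset" test is omitted because its criteria are not documented in this README.

```javascript
// Host suffixes taken from the wildcard list above.
const CDN_SUFFIXES = ['alicdn.com', 'aliexpress-media.com', 'aliyuncs.com'];

// Protocol-less URLs (//host/path) get an https: prefix; empty or non-string input yields null.
function normalizeImageUrl(src) {
  if (typeof src !== 'string' || src.length === 0) return null;
  return src.startsWith('//') ? `https:${src}` : src;
}

// True only when the hostname is a subdomain of one of the known CDN domains.
function isKnownCdnImage(src) {
  const normalized = normalizeImageUrl(src);
  if (normalized === null) return false;
  let host;
  try {
    host = new URL(normalized).hostname;
  } catch {
    return false; // not a parseable absolute URL
  }
  return CDN_SUFFIXES.some((suffix) => host.endsWith(`.${suffix}`));
}
```

Matching on `.${suffix}` rather than the bare suffix avoids accepting look-alike hosts such as `evil-alicdn.com`.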
Run metadata
After each run, the Actor stores a metadata record in the Key-Value store with:
- runSummary: queriesTotal, searchRequestsProcessed, uniqueQueriesProcessed, pagesFetched, itemsPushed, itemsSkipped, productsFound, productsPushed, nonProductDropped, missingImage, unknownDropped, suggestionsCaptured, blockedCount, failedCount, sessionsRotated, durationSecs
- failedQueries: array of { query, page, reason }
- effectiveConfig: the actual config used (with proxy credentials redacted)
Access it via the API: GET /v2/key-value-stores/{storeId}/records/metadata
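As an illustration, the record could be fetched from Node.js 18+ (which provides a global `fetch`) roughly as below. `metadataUrl` and `fetchRunMetadata` are hypothetical helper names; the sketch assumes the standard Apify API base URL and bearer-token authentication, and that you supply the run's default key-value store ID and your own API token.

```javascript
const API_BASE = 'https://api.apify.com';

// Builds the endpoint shown above for a given key-value store ID.
function metadataUrl(storeId) {
  return `${API_BASE}/v2/key-value-stores/${encodeURIComponent(storeId)}/records/metadata`;
}

// Fetches and parses the metadata record; throws on any non-2xx response.
async function fetchRunMetadata(storeId, token) {
  const res = await fetch(metadataUrl(storeId), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Metadata request failed: HTTP ${res.status}`);
  return res.json();
}
```

A caller would then read fields such as `runSummary.blockedCount` or `failedQueries` from the returned object.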
Recommended settings for reliability and cost
| Scenario | maxConcurrency | throttleMs | Proxy | Estimated CU |
|---|---|---|---|---|
| Light (5 queries, 3 pages) | 1 | 3000 | AUTO | ~0.05-0.1 |
| Medium (20 queries, 3 pages) | 2 | 2500 | AUTO | ~0.2-0.5 |
| Heavy (50 queries, 5 pages) | 2 | 2500 | RESIDENTIAL | ~1-3 |
| With enrichment (+50 products) | 1 | 3000 | RESIDENTIAL | +0.5-1 |
Key cost-control lever: maxItemsTotal caps total output regardless of queries/pages.
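As a rough planning aid, the number of page loads a run can attempt follows from the input alone. The sketch below is an assumption-based upper bound, with `estimateRequestBudget` as a hypothetical helper. It ignores retries and the early stop triggered by maxItemsTotal, and it does not predict compute units.

```javascript
// Upper bound on page loads implied by the input (defaults match the input table).
function estimateRequestBudget({ queries, maxQueries = 10, maxPages = 3, enrich = false, enrichLimit = 20 }) {
  // Each processed query can open at most maxPages search pages.
  const searchPages = Math.min(queries.length, maxQueries) * maxPages;
  // Enrichment adds at most enrichLimit detail-page visits.
  const detailPages = enrich ? enrichLimit : 0;
  return { searchPages, detailPages, total: searchPages + detailPages };
}
```

For the "Heavy" scenario in the table (50 queries, 5 pages, maxQueries raised to 50), this gives at most 250 search pages before any retries.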
Anti-bot notes
AliExpress uses several anti-bot protections:
- CAPTCHA/WAF/Punish pages: The Actor detects these automatically, retires the session, and retries with a fresh session/proxy.
- SEO alphabet redirects: Detected and handled via fallback SearchText URLs.
- Locale redirects: US-market cookies are injected to force USD pricing.
- Rate limiting: The throttleMs delay (with random jitter) helps avoid triggering rate limits.
Best practices:
- Use the RESIDENTIAL proxy group for the most reliable results.
- Keep maxConcurrency at 1-2.
- Keep throttleMs at 2500 ms or higher.
- If you see many blocks in the logs, increase throttleMs and switch to RESIDENTIAL.
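The throttle-with-jitter idea mentioned above can be sketched like this. The 30% jitter ratio and the names `throttleDelay` and `sleep` are assumptions for illustration; the README does not state how much jitter the Actor actually applies.

```javascript
// Returns a delay of at least throttleMs; jitter only ever adds time,
// so throttleMs remains a true minimum between requests.
// `random` is injectable to make the function deterministic in tests.
function throttleDelay(throttleMs, jitterRatio = 0.3, random = Math.random) {
  return Math.round(throttleMs + random() * throttleMs * jitterRatio);
}

// Promise-based sleep for use between requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
```

Inside an async request handler this would be used as `await sleep(throttleDelay(2500))`, giving each request a slightly different pause so the traffic pattern is less regular.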
Troubleshooting
| Problem | Solution |
|---|---|
| All requests blocked | Switch to RESIDENTIAL proxy group, increase throttleMs to 3000+ |
| Empty results | Check if your query returns results on aliexpress.us manually. Enable saveHtmlOnError to inspect blocked pages |
| Wrong currency / prices | The Actor forces USD via cookies, but some redirects can override this. Use RESIDENTIAL proxy for US-based IPs |
| Enrichment low success rate | AliExpress detail pages are heavily protected. Consider increasing throttleMs for enrichment |
| Run times out | Reduce maxQueries, maxPages, or maxItemsTotal. Lower enrichLimit |
| "No proxy configured" warning | Enable proxy.useApifyProxy or provide proxy.proxyUrls. Without proxy, AliExpress will block almost immediately |
Local testing
Run the self-contained smoke test (creates storage, writes INPUT, runs the Actor):
$ npm test
Manual equivalent (reuses the smoke-test storage):
$ CRAWLEE_STORAGE_DIR="$(pwd)/apify_storage_smoke" node src/main.js
Note: Apify SDK v3 uses CRAWLEE_STORAGE_DIR (not APIFY_LOCAL_STORAGE_DIR) to resolve the local key-value store where INPUT.json lives.
Monetization
This Actor is designed for Rental monetization on the Apify Store. Users rent access and pay for their own platform usage (compute units, proxy traffic) on their Apify account. There are no hidden costs or external API dependencies.
Changelog
v1.0.5 (2026-02-13)
- Anti-false-positive filtering: grid scoping + card structure gates + commerce signal check.
- page.evaluate now scopes to the main results grid (SearchProductFeed, search-item-card-wrapper, manhattan--container, list--gallery--) with document.body fallback.
- Card detection requires a meaningful container class (card, product, item, gallery); removed overly broad div[class]/parentElement fallback.
- Each card must contain an <img> element; bare text anchors are skipped at DOM level.
- Commerce signal check: items with no price, orders, rating, storeName, and no valid image are classified as type: "unknown" and dropped by default.
- New includeUnknown input (default false): when enabled, unknown items appear in the dataset with type: "unknown".
- New unknownDropped metric in run metadata.
v1.0.4 (2026-02-13)
- Image validation is now soft: products with missing or invalid images are no longer discarded. Instead they get image: null and imageValid: false.
- Hard filters remain: URL must contain /item/ and numeric productId must be extractable.
- New imageValid boolean field on every product item.
- New missingImage metric in run metadata (counts products with invalid/absent image).
v1.0.3 (2026-02-13)
- Store-ready dataset: Non-product cards (suggestions, promos, banners) are now filtered out by default.
- Product validation: URL must contain /item/, numeric productId must be extractable, image must be from a known AliExpress CDN host.
- New type field on every dataset item ("product" or "suggestion").
- New includeSuggestions input (default false): when enabled, related-search suggestions are captured with type: "suggestion".
- New filtering metrics in run metadata: productsFound, productsPushed, nonProductDropped, suggestionsCaptured.
- Image URL normalization (protocol-less URLs get https: prefix).
- New helpers: isProductUrl, isValidProductImage, normalizeImageUrl, classifyCard, parseSuggestions.
v1.0.0 (2026-02-12)
- Initial release.
- Crawlee PlaywrightCrawler with SessionPool and ProxyConfiguration.
- Search result parsing with SEO URL + fallback strategy.
- Optional detail-page enrichment (runParams / ld+json / DOM).
- Input validation with hard caps and conservative defaults.
- CAPTCHA/WAF/block detection with automatic session rotation.
- Run metadata with summary, failed queries, and effective config.
