The only RedNote scraper you need. Search posts, extract user content, download videos, scrape comments, and get profile data — all in one reliable Actor. Clean JSON output, fast, and actually works.
All notable changes to the RedNote (Xiaohongshu) Scraper Actor.
[2.1] — 2026-05-14 (rc2)
Speed-focused release. Cuts in-actor scrape time for multi-URL modes by 50-80%; cuts single-URL search wall-clock by ~46%. No schema changes; one new optional input field added (networkCapture).
Added
networkCapture input (default true). Listens to xiaohongshu.com API responses during page load and pulls JSON data out of the wire directly, instead of waiting for Vue to hydrate and parsing state. Falls back to v2.0 state/DOM extraction automatically when capture finds nothing. Disable only for debugging.
src/response_capture.py. ResponseCapture class — a per-page response listener that buckets JSON payloads by shape (search feed / user notes / note detail / comments / user profile) without hard-coding endpoint paths in user-facing code.
Phase B URL-chunked workers in post_details, comments, video. Instead of one page per URL, N workers (N = concurrency) each hold one page for an entire chunk of URLs and navigate between them via page.goto(). Saves the ~500-800ms page-create overhead per URL after the first one in each chunk.
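The chunked-worker idea above can be sketched roughly as follows. This is a minimal illustration, not the Actor's code: `chunk_urls`, `FakePage`, `worker`, and `run_chunked` are hypothetical names, the contiguous near-equal split is an assumption, and `FakePage` stands in for a real Playwright page so the example runs without a browser.

```python
import asyncio
from typing import List


def chunk_urls(urls: List[str], n_workers: int) -> List[List[str]]:
    """Split urls into at most n_workers contiguous, near-equal chunks."""
    n = max(1, min(n_workers, len(urls)))
    size, extra = divmod(len(urls), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(urls[start:end])
        start = end
    return [c for c in chunks if c]  # drop the empty chunk when urls is empty


class FakePage:
    """Hypothetical stand-in for a browser page; counts page creations."""
    created = 0

    def __init__(self) -> None:
        FakePage.created += 1

    async def goto(self, url: str) -> str:
        await asyncio.sleep(0)  # placeholder for real navigation
        return f"loaded:{url}"


async def worker(chunk: List[str]) -> List[str]:
    page = FakePage()  # one page per chunk, reused for every URL in it
    return [await page.goto(url) for url in chunk]


async def run_chunked(urls: List[str], concurrency: int) -> List[str]:
    per_chunk = await asyncio.gather(
        *(worker(c) for c in chunk_urls(urls, concurrency))
    )
    return [item for chunk_result in per_chunk for item in chunk_result]
```

With 7 URLs and a concurrency of 3, this sketch creates 3 pages instead of 7, which is where the per-URL page-create saving would come from.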
Changed
BrowserPool.page() now yields (page, capture) instead of page. Single-page modes receive both; multi-URL modes get the pool and acquire their own pages internally.
Tightened waits (Phase C). wait_for_selector reduced from 15000ms → 8000ms in navigate_and_wait and per-mode waiters. random_delay magnitudes halved across most call sites. wait_for_function for state hydration reduced from 8000ms → 5000ms in post_details, from 10000ms → 6000ms in user_posts and profile.
Mode runners (search, user_posts, post_details, comments, video, profile) now consult ResponseCapture first and fall back to Vue state / DOM only when capture finds nothing. The dual-path design preserves v2.0 reliability while making the common case faster.
Diagnostic logging. Each mode emits a fast_path=N, slow_path=M line so we can audit which path is winning in production.
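A rough sketch of the capture-first, fall-back-second flow and its diagnostic counters might look like this. `PathStats` and `extract_items` are hypothetical names, and the fallback is modelled as a plain callable rather than real Vue state or DOM parsing.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Item = Dict[str, Any]


@dataclass
class PathStats:
    """Counts how often each extraction path produced the result."""
    fast_path: int = 0
    slow_path: int = 0

    def line(self, items: int) -> str:
        return f"fast_path={self.fast_path}, slow_path={self.slow_path}, items={items}"


def extract_items(
    captured: List[Item],
    fallback: Callable[[], List[Item]],
    stats: PathStats,
) -> List[Item]:
    """Use captured network data when present, else run the slower fallback."""
    if captured:
        stats.fast_path += 1
        return captured
    stats.slow_path += 1
    return fallback()  # only invoked when capture found nothing
```

Passing the fallback as a callable keeps the slow path lazy, so its cost is only paid when the capture is empty.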
Phase D (parallel pagination) — not shipped
RedNote search uses a single endless-scroll feed; pagination is JS-driven via signed API calls. Constructing parallel &page=N URLs against /search_result returns page 1 every time — the param is ignored. Replicating the JS pagination would require the same signed-request reverse engineering that v2.x forbids. The practical concurrency win for high-volume search workloads is multiple distinct queries in separate runs. Note documented in src/modes/search.py docstring.
Phase E (JS disable for liteMode) — not shipped
The in-page JS is what (a) populates window.__INITIAL_STATE__, (b) triggers the data-bearing API requests our v2.1 capture listens for, and (c) responds to anti-bot fingerprint probes. Blocking resource_type == "script" leaves a static HTML shell with no extractable data through any path. Classifying "essential vs non-essential" scripts ahead of time requires the reverse engineering (RE) that v2.x forbids. liteMode already saves on the post-extraction formatting work and skips the comment scroll loop; that's the realistic ceiling without breaking extraction. Note documented in src/resource_blocker.py.
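As an illustration of the constraint above, a blocking decision keyed on resource type could look like the sketch below. `should_block` and `BLOCKED_TYPES` are hypothetical names, not the contents of src/resource_blocker.py. The type strings follow the values Playwright reports in `request.resource_type`, and tracker-domain filtering is deliberately left out.

```python
BLOCKED_TYPES = frozenset({"image", "font", "media"})


def should_block(resource_type: str, allow_media: bool = False) -> bool:
    """Decide whether a request of this type would be aborted (sketch only)."""
    if resource_type == "script":
        # Scripts populate page state and trigger the API calls, so keep them.
        return False
    if allow_media and resource_type == "media":
        # Video mode opts in to media requests.
        return False
    return resource_type in BLOCKED_TYPES
```

Under these assumptions, `should_block("image")` is true, while `should_block("script")` and `should_block("media", allow_media=True)` are both false.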
Performance measured (rc2 smoke runs vs v2.0 rc1 baseline)
| Mode | v1.0.28 | v2.0 rc1 | v2.1 rc2 | Δ vs v2.0 | Target | Hit? |
|---|---|---|---|---|---|---|
| search 10 items duration | 39.1s | 58.9s | 31.9s | −46% | ≤ 20s | ❌ (closer, but Apify container startup sets a ~12-15s floor) |
| post_details per URL @ conc=3 | 60s | 28.7s | 5.7s | −80% | ≤ 18s | ✅ beat by 3× |
| search 10 items cost | $0.0132 | $0.0115 | $0.0108 | −6% | ≤ $0.008 | ❌ near-flat (the 30-50% target is gated by container time, not in-actor work) |
| post_details cost per URL @ conc=3 | $0.020 | $0.0097 | $0.0019 | −80% | ≥ 30% cheaper | ✅ beat by 2.7× |
| Search fast-path firing? | n/a | n/a | ✅ fast_path=1, slow_path=0, items=10 | new | n/a | ✅ |
The single-URL targets (search ≤20s, search cost ≤$0.008) were not hit because ~12-15s of the 31.9s wall-clock is Apify's container pull plus Python and Playwright init, which the Actor code cannot influence. In-actor search time is now roughly 17s (31.9s minus ~15s of container startup) versus roughly 44s for v2.0 (58.9s minus the same ~15s), a ~60% reduction in actual scraping work. The multi-URL targets (post_details per URL) were beaten by a wide margin because chunking, same-page navigation, and the fast path stack their wins.
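The arithmetic behind those figures can be reproduced with a few lines. This is only a worked check of the numbers quoted above; `pct_change` is a hypothetical helper, and the 15s container figure is the upper end of the ~12-15s estimate.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


CONTAINER_S = 15.0                      # assumed container startup share
v20_total, v21_total = 58.9, 31.9       # search wall-clock, seconds

v20_in_actor = v20_total - CONTAINER_S  # time spent in Actor code, v2.0
v21_in_actor = v21_total - CONTAINER_S  # time spent in Actor code, v2.1

wall_clock_delta = pct_change(v20_total, v21_total)
in_actor_delta = pct_change(v20_in_actor, v21_in_actor)
```

Here `wall_clock_delta` comes out at about -46% and `in_actor_delta` at about -61.5%, matching the table and the "~60%" figure.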
IP-rotation variance — same input on parallel runs got 5 items elsewhere; unchanged from v2.0 known behavior
| Mode | Run | Status | Notes |
|---|---|---|---|
| post_details 3 URLs conc=3 | iVtvQL5zFftKRwD3a | ✅ 3/3 real records, titles populated | Phase B chunking |
| comments | m9rwajgaugRe8ZCgk | ✅ Full-schema record, errorCode=None | - |
| video | H1tzaPVvh6R8sjw0i | ✅ videoUrl extracted, hasVideo=true | - |
| profile | GI14TJS0hkMxRCCBA | ✅ 18-field diagnostic, errorCode=login_required | Schema parity with v2.0; underlying gating unchanged |
| user_posts | kVz88U5mywmw1otoq | ⚠ 0 items | Parity with v2.0 known limitation for gated users |
5 of 6 modes returning useful data on the rc2 baseline. 1 mode (user_posts) consistent with v1.0.28/v2.0 limitation. 0 regressions detected.
Rollback path
latest tag — 1.0.28 (production buyers, untouched)
rc1 tag — 2.0.1 (v2.0 baseline, available as rollback target if v2.1 ever flips to latest)
rc2 tag — 2.1.2 (this release)
Promotion to latest requires explicit greenlight after rc2 bake period.
[2.0] — 2026-05-13 (rc1)
Performance-focused release. Cuts per-item runtime and bandwidth at no schema cost. Output schema unchanged for existing fields; three new optional input fields added (concurrency, blockResources, liteMode).
Added
concurrency input (default 4, max 10). post_details, comments, and video modes now process input URLs in parallel via a semaphore-bounded asyncio task group. Single-page modes (search, user_posts, profile) ignore the setting since they're inherently single-page.
blockResources input (default true). Network-layer filter that aborts requests for images, fonts, media, and known tracker/analytics domains. Extracted data is unchanged — media URLs are still emitted in the output, only the bytes are skipped on the wire. Roughly halves bandwidth per page.
liteMode input (default false). Returns a minimal field set per item: for posts {mode, postId, postUrl, xsecToken, type, title, likes, scrapedAt}; for profiles a similarly trimmed subset. The xsec_token in postUrl is preserved so search → post_details pipelines still work on lite output. Skips the scroll-load-more loop in comments mode and the DOM-walk fallbacks in video mode.
BrowserPool (src/browser_pool.py). One browser + one shared context per run; pages acquired through an async with pool.page() context manager bounded by the concurrency semaphore. Pages are closed automatically after each unit of work.
Resource blocker (src/resource_blocker.py). Mode-aware: video mode opts into allow_media=True since some player implementations require the media network request to expose <video>.src.
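The semaphore-bounded pool described in the concurrency and BrowserPool entries above can be approximated as follows. This is a sketch under assumptions: `PoolSketch`, `FakePage`, `scrape_one`, and `scrape_all` are hypothetical names, and `FakePage` replaces a real Playwright page so the example runs standalone.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List, Tuple


class FakePage:
    """Hypothetical stand-in for a browser page."""
    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class PoolSketch:
    def __init__(self, concurrency: int) -> None:
        self._sem = asyncio.Semaphore(concurrency)
        self.open_pages = 0
        self.peak_pages = 0  # highest number of pages open at once

    @asynccontextmanager
    async def page(self) -> AsyncIterator[FakePage]:
        async with self._sem:  # at most `concurrency` pages at a time
            page = FakePage()
            self.open_pages += 1
            self.peak_pages = max(self.peak_pages, self.open_pages)
            try:
                yield page
            finally:
                self.open_pages -= 1
                await page.close()  # closed after each unit of work


async def scrape_one(pool: PoolSketch, url: str) -> str:
    async with pool.page():
        await asyncio.sleep(0.01)  # placeholder for navigate + extract
        return f"done:{url}"


async def scrape_all(urls: List[str], concurrency: int) -> Tuple[List[str], int]:
    pool = PoolSketch(concurrency)
    results = await asyncio.gather(*(scrape_one(pool, u) for u in urls))
    return list(results), pool.peak_pages
```

Because `asyncio.gather` preserves input order, results line up with the input URLs even though the tasks finish at different times.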
Changed
main.py now drives all modes through the BrowserPool rather than allocating a single shared page.
post_details.py, comments.py, video.py rewritten around asyncio.gather over per-URL tasks. Each task acquires its own page from the pool, processes one URL, releases. No shared state between tasks.
format_post_data / format_profile_data / format_comment_data accept a lite_mode kwarg.
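One plausible shape for the lite_mode kwarg is a simple field projection. The lite field list below is taken from the liteMode entry above. `format_post_data_sketch` is a hypothetical name, and the real formatters may normalise fields in ways this sketch does not.

```python
from typing import Any, Dict

# Minimal post field set listed for liteMode in this changelog.
LITE_POST_FIELDS = (
    "mode", "postId", "postUrl", "xsecToken",
    "type", "title", "likes", "scrapedAt",
)


def format_post_data_sketch(raw: Dict[str, Any], lite_mode: bool = False) -> Dict[str, Any]:
    """Return the full record, or only the lite fields when lite_mode is set."""
    record = dict(raw)  # copy so the caller's dict is not mutated
    if not lite_mode:
        return record
    # Missing lite fields are emitted as None to keep the key set stable.
    return {key: record.get(key) for key in LITE_POST_FIELDS}
```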
Not changed (intentional)
Output schema for full mode. Every field present in v1.0.28 output is still present in v2.0 full-mode output.
cookieString input behavior. Premium-mode extraction still routes through context.add_cookies.
xsec_token propagation through search/user_posts → post_details/comments/video (the v1.0.28 fix).
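To illustrate why a preserved token lets a downstream mode reuse the URL directly, here is a small sketch using only the standard library. It assumes the token travels as a query parameter named xsec_token. The helper names and the example.com URL are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def get_xsec_token(post_url: str) -> Optional[str]:
    """Read the xsec_token query parameter, if the URL carries one."""
    values = parse_qs(urlparse(post_url).query).get("xsec_token")
    return values[0] if values else None


def with_xsec_token(post_url: str, token: str) -> str:
    """Return post_url with xsec_token set, keeping other query parameters."""
    parts = urlparse(post_url)
    query = parse_qs(parts.query)
    query["xsec_token"] = [token]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))
```

A search record whose postUrl already carries the token can then be fed to post_details unchanged, while a tokenless URL returns None from `get_xsec_token` and would need the search-then-resolve fallback.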
Phase 4 (Hybrid Playwright + HTTP) — not shipped
Original v2.0 spec included a stretch goal to use httpx for paginated requests after Playwright bootstrapped the session, targeting 70-90% cost reduction on search and user_posts. This phase was not pursued in v2.0 for two reasons:
Xiaohongshu's web API requires per-request signed parameters (x-s, x-t, signed payload) generated by client-side JS. Producing those signatures from httpx requires either (a) running the signing JS inside a Playwright context per request — no cost savings, just relocates the work — or (b) replicating the signature algorithm in Python.
Option (b) crosses the v2.0 anti-reverse-engineering constraint (no RE content in code, comments, or docs), so it's off-limits regardless of feasibility.
Net: Phase 4 is structurally infeasible under the v2.0 constraints. Reconsider if RedNote ever offers an unsigned public API endpoint.
Performance measured (rc1 smoke runs vs v1.0.28 baseline, single-sample)
Search "护肤教程" — 10 items (same input both runs)
| Metric | v1.0.28 (V3QBqmrWFRCnK0adv) | v2.0 (kArYMFGWMjA1OfkFL) | Δ |
|---|---|---|---|
| Memory peak | 1215 MB | 562 MB | −54% |
| Memory avg | 891 MB | 281 MB | −68% |
| Network RX | 6.51 MB | 4.42 MB | −32% |
| Run cost | $0.0132 | $0.0115 | −13% |
| Duration | 39.1s | 58.9s | +51% (within single-sample noise) |
| Items returned | 10 | 10 | - |
post_details — 3 URLs (v2.0 in parallel with concurrency=3 vs v1.0.28 baseline of 1 URL)
| Metric | v1.0.28 (PmvIfvCopAMp2fdKp, 1 URL) | v2.0 (h2lhMwzsHtSLg5xKP, 3 URLs) | Δ per URL |
|---|---|---|---|
| Duration | 60s / 1 URL = 60s/URL | 86s / 3 URLs = 28.7s/URL | −52% time/URL |
| Cost | $0.020 / URL | $0.029 / 3 = $0.0097 / URL | −52% cost/URL |
| Items with full schema | 1 | 3 | same |
The concurrency win is the big delta — multi-URL modes scale near-linearly with concurrency up to the proxy/rate-limit ceiling. Memory ceiling is also dramatically lower, leaving headroom for higher concurrency on the same memory bucket.
Success rate (rc1 smoke pass)
| Mode | Run | Status |
|---|---|---|
| search | kArYMFGWMjA1OfkFL | ✅ 10/10 real records, full schema |
| user_posts | zvIDhr46DM8f0BFdB | ⚠ 0 records: known limitation for gated users (unchanged from v1.0.28) |
| post_details | h2lhMwzsHtSLg5xKP | ✅ 3/3 real records, full schema |
| comments | eJzf2pti7p0QMnzUR | ✅ Real comments extracted |
| video | AkQdF9SqjpM3D8hwl | ✅ videoUrl extracted, hasVideo=true |
| profile | aj7eGyqD8l4srUJuM | ✅ Diagnostic record with errorCode (anonymous gated, unchanged) |
5 of 6 modes returning useful data, 1 returning diagnostic record consistent with v1.0.28. No regression.
Rollback path
v1.0.28 remains tagged latest until v2.0 (currently on rc1) is promoted. Rollback from v2.0 = re-tag 1.0.28 as latest via Apify console build selector. No dual code paths to maintain.
ERR_ABORTED / ERR_CONNECTION_* now classified as errorCode: "login_required" (RedNote's anti-bot serves a TCP-level abort instead of redirecting to a login page; the buyer-actionable fix is the same).
README "App-shared URLs need cookies" callout added.
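The error mapping described above could be expressed as a small classifier. This is a sketch: `classify_navigation_error` is a hypothetical name, and matching on substrings of the error message is an assumption about how the navigation failure is surfaced.

```python
from typing import Optional


def classify_navigation_error(message: str) -> Optional[str]:
    """Map a navigation error message to an errorCode, or None if unrecognised."""
    if "ERR_ABORTED" in message or "ERR_CONNECTION_" in message:
        # A TCP-level abort is treated like a login wall: the fix is cookies.
        return "login_required"
    return None
```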
[1.0.27] — 2026-05-13
Critical bug fixes for post_details / comments / profile modes.
Fix #1: xsec_token now propagated through search and user_posts output (postUrl + dedicated xsecToken field). Previously stripped despite being captured in Vue state. Unblocks the canonical search → post_details pipeline.
post_details search-then-resolve: best-effort token recovery for tokenless input URLs.
comments search-then-resolve: same fallback before navigating to comment page.
profile false-positive fix: success check now requires actual content (nickname/redId/desc), not just the URL-derived userId — so the DOM fallback genuinely runs when state is empty.
Multi-path profile state extractor: tries userPageData, userInfo, profile, otherInfo, currentUser in order.
cookieString input (new): optional Cookie header for premium extraction of gated pages.
Honest error records: full-schema diagnostic with errorCode instead of empty rows; PPE never charged on failures.
Honest status messages: "Scraped X items, Y failed" instead of misleading totals.
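The multi-path profile extractor and the stricter success check from the entries above can be sketched together. The key order and content fields come from this changelog. Treating the keys as top-level entries of a state dict is an assumption, and the function names are hypothetical.

```python
from typing import Any, Dict, Optional

# Keys tried in order, as listed in the 1.0.27 notes.
STATE_PATHS = ("userPageData", "userInfo", "profile", "otherInfo", "currentUser")
# A record only counts as a success if one of these is non-empty.
CONTENT_FIELDS = ("nickname", "redId", "desc")


def has_profile_content(record: Dict[str, Any]) -> bool:
    """True when the record has real content, not just a URL-derived userId."""
    return any(record.get(name) for name in CONTENT_FIELDS)


def extract_profile(state: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the first candidate with real content, else None."""
    for key in STATE_PATHS:
        candidate = state.get(key)
        if isinstance(candidate, dict) and has_profile_content(candidate):
            return candidate
    return None  # caller would then run the DOM fallback
```

Returning None for a record that only holds a userId is what lets the DOM fallback run instead of reporting a false success.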
[≤ 1.0.26]
See Apify console build history for earlier changes.