Douban Reviews Scraper
Pricing
from $25.00 / 1,000 long reviews (full body + sentiment)
Scrape Douban (豆瓣) ratings, reviews & comments with sentiment tags for movies, TV, books, music & groups. Clean JSON for NLP/LLM training & analysis.
Developer
Tony
Maintained by Community
Scrape public Douban (豆瓣) ratings, long-form reviews and short comments — each tagged with a positive / neutral / negative sentiment label — for movies, TV, books, music and groups. Built for media & entertainment researchers, publishers, recommendation-engine teams and AI-training-data buyers who need structured Chinese cultural-opinion data.
What you get
Paste one or more Douban subject URLs and get a clean dataset with three record types:
- subject: the title, year, overall rating, rating count and genres.
- comment: short user comments (high volume) with star rating + sentiment.
- review: long-form reviews with full body text, star rating + sentiment.
Group URLs produce group_topic records (discussion topics with reply counts).
Sentiment is derived from the author's own Douban star rating — no guesswork, no ML black box: 5–4★ = positive, 3★ = neutral, 2–1★ = negative, unrated = null.
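The star-to-sentiment rule above is simple enough to reproduce on your side, for example to re-tag records fetched with sentiment tagging switched off. The following is a minimal sketch of that rule as stated in this listing, not the actor's own code; the function name `sentiment_from_stars` is hypothetical.

```python
from typing import Optional


def sentiment_from_stars(rating_stars: Optional[int]) -> Optional[str]:
    """Map a 1-5 star rating to a sentiment label; unrated (None) stays None.

    Hypothetical helper mirroring the rule described in this listing:
    5-4 stars = positive, 3 stars = neutral, 2-1 stars = negative.
    """
    if rating_stars is None:
        return None
    if rating_stars not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected a star rating from 1 to 5, got {rating_stars!r}")
    if rating_stars >= 4:
        return "positive"
    if rating_stars == 3:
        return "neutral"
    return "negative"
```

Because the label is a pure function of the star rating, it carries no more information than `rating_stars` itself; keep the raw rating if you plan to train your own classifier.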
Supported URLs
| Type | Example |
|---|---|
| Movie / TV | https://movie.douban.com/subject/1292052/ |
| Book | https://book.douban.com/subject/1084336/ |
| Music | https://music.douban.com/subject/1407217/ |
| Group | https://www.douban.com/group/beethoven/ |
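If you assemble start URLs programmatically, it can help to check them against the four patterns in the table before starting a run. The sketch below is an assumption-laden illustration: the helper name `classify_douban_url` and the returned type labels are hypothetical, and the actor may accept URL variants that this check rejects.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hosts taken from the Supported URLs table; the labels are illustrative only.
_SUBJECT_HOSTS = {
    "movie.douban.com": "movie",
    "book.douban.com": "book",
    "music.douban.com": "music",
}


def classify_douban_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (type, identifier) for a supported-looking URL, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host in _SUBJECT_HOSTS:
        # Subject pages look like /subject/<numeric id>/
        match = re.match(r"^/subject/(\d+)/?$", parsed.path)
        return (_SUBJECT_HOSTS[host], match.group(1)) if match else None
    if host == "www.douban.com":
        # Group pages look like /group/<slug>/
        match = re.match(r"^/group/([^/]+)/?$", parsed.path)
        return ("group", match.group(1)) if match else None
    return None
```

For example, `classify_douban_url("https://movie.douban.com/subject/1292052/")` returns `("movie", "1292052")`, and an unrelated URL returns `None`.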
Example input
    {
      "startUrls": [{ "url": "https://movie.douban.com/subject/1292052/" }],
      "scrapeShortComments": true,
      "scrapeLongReviews": true,
      "maxCommentsPerSubject": 200,
      "maxReviewsPerSubject": 50,
      "fetchFullReviewText": true,
      "tagSentiment": true,
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
      }
    }
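When the list of subjects comes from your own data, the input can be generated rather than written by hand. This is a minimal sketch that only builds the JSON document; the helper name `build_run_input` is hypothetical, the field names are the ones shown in the example input, and the defaults simply copy that example.

```python
import json
from typing import Any, Dict, List


def build_run_input(
    urls: List[str],
    max_comments: int = 200,
    max_reviews: int = 50,
) -> Dict[str, Any]:
    """Build an actor input dict using the field names from the example input."""
    return {
        "startUrls": [{"url": url} for url in urls],
        "scrapeShortComments": True,
        "scrapeLongReviews": True,
        "maxCommentsPerSubject": max_comments,
        "maxReviewsPerSubject": max_reviews,
        "fetchFullReviewText": True,
        "tagSentiment": True,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }


run_input = build_run_input(["https://movie.douban.com/subject/1292052/"])
# ensure_ascii=False keeps any non-ASCII characters readable in the output.
print(json.dumps(run_input, ensure_ascii=False, indent=2))
```

The printed JSON can be pasted into the actor's input editor or passed to whichever client you use to start runs.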
Example output
A short comment record:
    {
      "record_type": "comment",
      "id": "comment:1234567890",
      "subject_id": "1292052",
      "subject_type": "movie",
      "subject_title": "肖申克的救赎 The Shawshank Redemption",
      "author": "影迷小王",
      "author_url": "https://www.douban.com/people/12345/",
      "rating_stars": 5,
      "rating_label": "力荐",
      "sentiment": "positive",
      "content": "希望让人自由。每看一次都有新的感动。",
      "useful_count": 1842,
      "created_at": "2021-03-14T21:05:00+08:00",
      "source_url": "https://movie.douban.com/subject/1292052/comments?status=P&start=0&limit=20&sort=new_score",
      "scraped_at": "2026-06-15T09:12:00.000Z"
    }
Field notes:
- id is stable across runs (built from Douban's comment/review id); use it to deduplicate on your side.
- rating_stars / sentiment are null when the user left a comment without a rating.
- created_at is the original Douban timestamp in China Standard Time (UTC+8); scraped_at is ISO-8601 UTC.
- For long reviews, content_truncated: false means the full essay body was captured (fetchFullReviewText enabled).
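Two of the field notes above translate directly into post-processing steps: deduplicating on `id` and putting both timestamps on a common clock. The sketch below assumes records are plain dicts shaped like the example output; the helper names `dedupe_by_id` and `parse_timestamp` are hypothetical, and keeping the first occurrence of each `id` is a choice made here, not something the listing specifies.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List


def parse_timestamp(value: str) -> datetime:
    """Parse created_at (+08:00) or scraped_at (trailing Z) into an aware UTC datetime."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so normalise it to an explicit offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value).astimezone(timezone.utc)


def dedupe_by_id(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first record seen for each id, preserving input order."""
    seen: Dict[str, Dict[str, Any]] = {}
    for record in records:
        seen.setdefault(record["id"], record)
    return list(seen.values())
```

With the example record, `parse_timestamp("2021-03-14T21:05:00+08:00")` yields 13:05 UTC on the same day, eight hours behind the China Standard Time value.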
Pricing
Pay-per-event — you pay per item actually extracted, so cost scales with value:
| Event | Price |
|---|---|
| Long review (full body + sentiment) | $0.025 |
| Short comment | $0.005 |
| Subject info | $0.006 |
| Group topic | $0.012 |
| Actor start | $0.00005 |
(Final prices are shown on the Apify Store listing.)
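To budget a run, multiply the expected number of each event by its price in the table. The sketch below hard-codes the prices listed above (which may differ from the final Store prices) and uses `Decimal` to avoid floating-point rounding on small amounts; the names `PRICES_USD` and `estimate_cost` are hypothetical.

```python
from decimal import Decimal
from typing import Dict

# Per-event prices in USD, copied from the pricing table in this listing.
PRICES_USD = {
    "review": Decimal("0.025"),
    "comment": Decimal("0.005"),
    "subject": Decimal("0.006"),
    "group_topic": Decimal("0.012"),
    "actor_start": Decimal("0.00005"),
}


def estimate_cost(counts: Dict[str, int]) -> Decimal:
    """Sum price * count over the given events; unknown event names raise KeyError."""
    total = Decimal("0")
    for event, count in counts.items():
        total += PRICES_USD[event] * count
    return total


# One movie with the limits from the example input: 200 comments, 50 reviews.
example = estimate_cost({"actor_start": 1, "subject": 1, "comment": 200, "review": 50})
print(example)
```

For that example the estimate is $2.25605: $1.00 for comments, $1.25 for long reviews, $0.006 for the subject record and $0.00005 for the actor start.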
Limitations
- Douban serves a JavaScript proof-of-work anti-bot challenge, so this actor runs a real headless browser (Playwright + Chromium) to clear it. Recommended run settings: 4 GB memory and residential Apify Proxy. The browser solves the challenge automatically and retries on a fresh session if it doesn't clear.
- Douban limits how deep short-comment pagination goes for logged-out access (typically the first few hundred). Set maxCommentsPerSubject realistically.
- A minority of titles (often sensitive ones) gate their short comments behind login entirely for logged-out visitors; for those, the actor still returns the subject info and long reviews, but short comments come back empty. Long reviews and ratings are not gated.
- Keep maxConcurrency modest (default 3). Under heavy concurrency Douban occasionally soft-throttles a page, which can make a single comment page come back empty; lower concurrency avoids this.
- Some music/book subject pages expose fewer fields (e.g. no genres); those come back null.
- Public data only: the actor never logs in or scrapes login-walled content.
FAQ
Which content should I scrape? Toggle scrapeShortComments, scrapeLongReviews and scrapeSubjectInfo independently. Short comments are cheapest and highest-volume; long reviews are richer for sentiment / NLP work.
Can I run this on a schedule? Yes — use Apify Schedules. Reviews are evergreen, so weekly is usually plenty.
How do I export to my DB / Google Sheets? Use Apify Integrations or the Dataset API — every field above is available via /items?format=json|csv|xlsx. The dataset also ships pre-built Short comments and Long reviews table views.
Why is sentiment sometimes null? The user left no star rating, so there is no star signal to map. The raw content is still captured.
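For the export question above, the Dataset API route comes down to requesting the dataset's items endpoint with a `format` parameter. The sketch below only builds the URL and makes no request; it assumes the public Apify API base `https://api.apify.com/v2`, the dataset id is a placeholder you must replace, and the helper name `dataset_items_url` is hypothetical. Authentication for non-public datasets is left out.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
# Formats named in this listing; the API may support others.
SUPPORTED_FORMATS = {"json", "csv", "xlsx"}


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the items URL for a dataset in one of the listed export formats."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode({'format': fmt})}"


# "DATASET_ID" is a placeholder, not a real dataset.
print(dataset_items_url("DATASET_ID", "csv"))
```

The resulting URL can be handed to any HTTP client, or to a spreadsheet import function that accepts a CSV URL.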