Douban Reviews Scraper

Pricing

from $25.00 / 1,000 long reviews (full body + sentiment)

Scrape Douban (豆瓣) ratings, reviews & comments with sentiment tags for movies, TV, books, music & groups. Clean JSON for NLP/LLM training & analysis.

Rating: 0.0 (0)

Developer: Tony (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified a day ago

Scrape public Douban (豆瓣) ratings, long-form reviews and short comments — each tagged with a positive / neutral / negative sentiment label — for movies, TV, books, music and groups. Built for media & entertainment researchers, publishers, recommendation-engine teams and AI-training-data buyers who need structured Chinese cultural-opinion data.

What you get

Paste one or more Douban subject URLs and get a clean dataset with three record types:

  • subject — the title, year, overall rating, rating count and genres.
  • comment — short user comments (high volume) with star rating + sentiment.
  • review — long-form reviews with full body text, star rating + sentiment.

Group URLs produce group_topic records (discussion topics with reply counts).

Sentiment is derived from the author's own Douban star rating, with no guesswork and no ML black box: 4–5★ = positive, 3★ = neutral, 1–2★ = negative, unrated = null.
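
The star-to-sentiment rule above is simple enough to reproduce on your side, for example to re-label records scraped with tagSentiment turned off. The following is a minimal sketch of that rule, not the actor's own code; the function name is hypothetical.

```python
from typing import Optional


def sentiment_from_stars(stars: Optional[int]) -> Optional[str]:
    """Map a Douban star rating (1-5) to the sentiment label described above."""
    if stars is None:
        return None  # unrated: there is no star signal to map
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating out of range: {stars!r}")
    if stars >= 4:
        return "positive"
    if stars == 3:
        return "neutral"
    return "negative"  # 1 or 2 stars
```

Applied to the rating_stars field of a comment or review record, this should reproduce the sentiment field, assuming the actor follows the rule exactly as stated.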

Supported URLs

Type        Example
Movie / TV  https://movie.douban.com/subject/1292052/
Book        https://book.douban.com/subject/1084336/
Music       https://music.douban.com/subject/1407217/
Group       https://www.douban.com/group/beethoven/
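
Before queuing a large batch it can help to check that every URL matches one of the shapes listed above. The sketch below is an assumption-laden pre-flight check, not the actor's own validation: the function name, the type labels and the exact path patterns are inferred from the four example URLs only.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Hostname -> type label, inferred from the example URLs (labels are assumptions).
# Movies and TV shows share movie.douban.com, so both are labelled "movie" here.
_SUBJECT_HOSTS = {
    "movie.douban.com": "movie",
    "book.douban.com": "book",
    "music.douban.com": "music",
}


def classify_douban_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (type, id) for a URL shaped like the examples, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    subject = re.match(r"^/subject/(\d+)/?$", parsed.path)
    if host in _SUBJECT_HOSTS and subject:
        return _SUBJECT_HOSTS[host], subject.group(1)
    group = re.match(r"^/group/([^/]+)/?$", parsed.path)
    if host == "www.douban.com" and group:
        return "group", group.group(1)
    return None
```

URLs that return None are not necessarily unsupported by the actor; they simply do not match the documented examples and are worth checking by hand.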

Example input

{
  "startUrls": [
    { "url": "https://movie.douban.com/subject/1292052/" }
  ],
  "scrapeShortComments": true,
  "scrapeLongReviews": true,
  "maxCommentsPerSubject": 200,
  "maxReviewsPerSubject": 50,
  "fetchFullReviewText": true,
  "tagSentiment": true,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
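
When the list of subjects comes from code rather than the input editor, the same object can be assembled programmatically. This is a minimal sketch: build_run_input is a hypothetical helper, and only the keys shown in the example above are used.

```python
import json
from typing import Iterable


def build_run_input(
    urls: Iterable[str],
    max_comments: int = 200,
    max_reviews: int = 50,
) -> dict:
    """Assemble an input object with the same keys as the example above."""
    start_urls = [{"url": u} for u in urls]
    if not start_urls:
        raise ValueError("at least one Douban URL is required")
    if max_comments < 0 or max_reviews < 0:
        raise ValueError("limits must be non-negative")
    return {
        "startUrls": start_urls,
        "scrapeShortComments": True,
        "scrapeLongReviews": True,
        "maxCommentsPerSubject": max_comments,
        "maxReviewsPerSubject": max_reviews,
        "fetchFullReviewText": True,
        "tagSentiment": True,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }


run_input = build_run_input(["https://movie.douban.com/subject/1292052/"])
print(json.dumps(run_input, indent=2))
```

The printed JSON can be pasted into the actor's input editor. With the Apify Python client, the same dict would typically be passed as run_input when calling the actor; the actor ID is not stated in this listing, so that call is left out here.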

Example output

A short comment record:

{
  "record_type": "comment",
  "id": "comment:1234567890",
  "subject_id": "1292052",
  "subject_type": "movie",
  "subject_title": "肖申克的救赎 The Shawshank Redemption",
  "author": "影迷小王",
  "author_url": "https://www.douban.com/people/12345/",
  "rating_stars": 5,
  "rating_label": "力荐",
  "sentiment": "positive",
  "content": "希望让人自由。每看一次都有新的感动。",
  "useful_count": 1842,
  "created_at": "2021-03-14T21:05:00+08:00",
  "source_url": "https://movie.douban.com/subject/1292052/comments?status=P&start=0&limit=20&sort=new_score",
  "scraped_at": "2026-06-15T09:12:00.000Z"
}

Field notes:

  • id is stable across runs (built from Douban's comment/review id) — use it to deduplicate on your side.
  • rating_stars / sentiment are null when the user left a comment without a rating.
  • created_at is the original Douban timestamp in China Standard Time (UTC+8); scraped_at is ISO-8601 UTC.
  • For long reviews, content_truncated: false means the full essay body was captured (fetchFullReviewText enabled).
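
Two of the notes above translate directly into post-processing: deduplicating on id and normalizing created_at from UTC+8 to UTC. The helpers below are a hypothetical sketch that assumes records are plain dicts shaped like the example output.

```python
from datetime import datetime, timezone
from typing import Iterable, List


def dedupe_by_id(records: Iterable[dict]) -> List[dict]:
    """Keep the first record seen for each id, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] in seen:
            continue  # same Douban comment/review from an earlier run
        seen.add(record["id"])
        unique.append(record)
    return unique


def created_at_utc(record: dict) -> datetime:
    """Parse the UTC+8 created_at string and convert it to UTC."""
    local = datetime.fromisoformat(record["created_at"])
    return local.astimezone(timezone.utc)


rows = [
    {"id": "comment:1", "created_at": "2021-03-14T21:05:00+08:00"},
    {"id": "comment:1", "created_at": "2021-03-14T21:05:00+08:00"},
    {"id": "comment:2", "created_at": "2021-03-15T08:00:00+08:00"},
]
unique_rows = dedupe_by_id(rows)
```

Keeping the first occurrence is a design choice; if later runs may carry updated useful_count values, keeping the last occurrence instead may suit you better.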

Pricing

Pay-per-event: you pay per item actually extracted, so cost scales with the number of records returned.

Event                                Price (USD per event)
Long review (full body + sentiment)  $0.025
Short comment                        $0.005
Subject info                         $0.006
Group topic                          $0.012
Actor start                          $0.00005

(Final prices are shown on the Apify Store listing.)
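
To budget a run, the per-event prices can be multiplied out ahead of time. The sketch below copies the prices listed above and assumes, without confirmation from the listing, that each record maps to exactly one event and that the actor-start fee is charged once per run. The names are hypothetical.

```python
from decimal import Decimal

# Per-event prices in USD, copied from the table above.
# The Apify Store listing remains the authoritative source.
PRICES = {
    "review": Decimal("0.025"),
    "comment": Decimal("0.005"),
    "subject": Decimal("0.006"),
    "group_topic": Decimal("0.012"),
}
ACTOR_START = Decimal("0.00005")


def estimate_cost(counts: dict, starts: int = 1) -> Decimal:
    """Estimate run cost from record counts keyed by record type."""
    total = ACTOR_START * starts
    for record_type, n in counts.items():
        total += PRICES[record_type] * n
    return total


# One movie with the limits from the example input: 200 comments, 50 reviews.
cost = estimate_cost({"subject": 1, "comment": 200, "review": 50})
print(f"${cost}")  # prints $2.25605
```

Decimal is used instead of float so that small per-event prices add up without rounding drift.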

Limitations

  • Douban serves a JavaScript proof-of-work anti-bot challenge, so this actor runs a real headless browser (Playwright + Chromium) to clear it. Recommended run settings: 4 GB memory and residential Apify Proxy. The browser solves the challenge automatically and retries on a fresh session if it doesn't clear.
  • Douban limits how deep short-comment pagination goes for logged-out access (typically the first few hundred comments). Set maxCommentsPerSubject realistically.
  • A minority of titles (often sensitive ones) put their short comments entirely behind login. For those, the actor still returns the subject info and long reviews, but short comments come back empty. Long reviews and ratings are not gated.
  • Keep maxConcurrency modest (default 3). Under heavy concurrency Douban occasionally soft-throttles a page, which can make a single comment page come back empty; lower concurrency avoids this.
  • Some music/book subject pages expose fewer fields (e.g. no genres); those come back null.
  • Public data only — the actor never logs in or scrapes login-walled content.

FAQ

Which content should I scrape? Toggle scrapeShortComments, scrapeLongReviews and scrapeSubjectInfo independently. Short comments are cheapest and highest-volume; long reviews are richer for sentiment / NLP work.

Can I run this on a schedule? Yes — use Apify Schedules. Reviews are evergreen, so weekly is usually plenty.

How do I export to my DB / Google Sheets? Use Apify Integrations or the Dataset API — every field above is available via /items?format=json|csv|xlsx. The dataset also ships pre-built Short comments and Long reviews table views.
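
For scripted exports, the download URL can be built from the dataset ID of a finished run. This sketch assumes the standard Apify API v2 dataset-items endpoint; the helper name and the dataset ID "abc123" are placeholders, and only the three formats mentioned above are allowed.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
FORMATS = {"json", "csv", "xlsx"}  # the formats named in this FAQ


def dataset_items_url(dataset_id: str, fmt: str = "json", **params: str) -> str:
    """Build a dataset-items URL for the given export format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    query = urlencode({"format": fmt, **params})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# "abc123" is a placeholder; use the dataset ID shown on your run's page.
print(dataset_items_url("abc123", "csv"))
```

Private datasets generally need an API token with the request; how you supply it depends on your setup, so it is not included in the URL here.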

Why is sentiment sometimes null? The user left no star rating, so there's no star signal to map. The raw content is still captured.