Pricing

from $4.99 / 1,000 results

Goodreads Review Scraper: Rating Distribution & Book Analytics

📚 Goodreads Review Scraper extracts book reviews at scale — ratings, review text, reviewer profiles, dates & shelves. ⚡ Clean, structured data for sentiment analysis, market research & author insights. 🔄 Export CSV/JSON. 🎯 Ideal for authors, publishers, marketers & data teams.

Developer

API Empire

Last modified: 4 days ago

Goodreads Review Scraper — Reviews, Reviewers & Book Analytics

Goodreads Review Scraper extracts public Goodreads review text, reviewer profiles, and a per-book rating-distribution analytics row from any Goodreads book page. Every run returns two linked record types in one dataset: a type="book" analytics row carrying the 1★–5★ histogram, polarization score and full book metadata, then type="review" rows carrying review text, star rating, shelving tags and the reviewer's public profile. Output is typed, normalized JSON: no HTML, no CSS selectors, no parsing step. By the end of this page you will know every input field, every output key, and exactly how each analytics number is derived.

What is the Goodreads Review Scraper?

Goodreads Review Scraper is an Apify Actor that turns a list of Goodreads book URLs into a structured review-and-analytics dataset. It targets Goodreads' public AWS AppSync GraphQL API rather than book page HTML, resolving book and work identifiers, the site-wide rating histogram and the review list without rendering a browser (src/helper.py, src/main.py).

No Goodreads account, login, cookie or personal API key is required. The Actor reuses the public AppSync endpoint and public API key that Goodreads ships in its own JavaScript bundle, and falls back to re-reading a fresh key from that bundle if the hard-coded one ever stops working (src/helper.py:26-30, src/helper.py:164-183). Nothing you supply is a credential — the only required input is a list of book URLs.

Scrape book analytics — one type="book" row per input URL with the 1★–5★ histogram, derived polarization score, positive/neutral/negative share and rating stats at both edition and work level.
Scrape reviews — type="review" rows with full review text, star rating, spoiler status, like and comment counts, plus epoch and ISO-8601 timestamps.
Scrape reviewer profiles — a nested creator object per review: display name, profile URL, avatar, follower count, text-review count and Goodreads-author flag.
Scrape book metadata and contributors — genres, page count, ISBN, ISBN-13, ASIN, format, language, publisher, publication date, series placement, primary author and secondary contributors.
Export as JSON, CSV, Excel, XML, HTML or RSS — no proxy management and no parsing on your side.

📚 What data does the Goodreads Review Scraper collect?

The Actor writes two row types into a single dataset: book analytics rows and review rows. Each review row nests the reviewer's profile and the shelving/tag object that records which shelf the reader filed the book under; each book row nests the primary author record. Every field name below is copied from the row-building functions in src/book_analytics.py and src/review_extractor.py.

Data Type	Key Fields	JSON Field Names
Book analytics row (`type: "book"`)	Rating histogram, polarization, headline stats, identifiers	`ratingsCountDist`, `ratingDistribution`, `polarizationScore`, `positivePct`, `neutralPct`, `negativePct`, `averageRating`, `ratingsCount`, `textReviewsCount`, `stats`, `bookId`, `bookResourceId`, `workId`, `workResourceId`, `inputUrl`, `bookUrl`, `scrapedAt`
Book metadata (same row, when `includeBookDetails` is on)	Genres, edition details, series, contributors	`genres`, `bookDetails.numPages`, `bookDetails.isbn`, `bookDetails.isbn13`, `bookDetails.asin`, `bookDetails.format`, `bookDetails.language`, `bookDetails.publisher`, `bookDetails.publicationDateIso`, `series[].title`, `series[].placement`, `secondaryContributors[].role`, `titleComplete`, `description`, `imageUrl`
Review row (`type: "review"`)	Review text, star rating, engagement, timestamps, parent link	`id`, `text`, `rating`, `spoilerStatus`, `recommendFor`, `likeCount`, `commentCount`, `createdAt`, `updatedAt`, `lastRevisionAt`, `createdAtIso`, `updatedAtIso`, `lastRevisionAtIso`, `parentBookId`, `bookTitle`, `isChild`, `scrapedAt`
Reviewer profile (nested in each review row)	Reader identity and reach	`creator.id`, `creator.name`, `creator.webUrl`, `creator.imageUrlSquare`, `creator.followersCount`, `creator.textReviewsCount`, `creator.isAuthor`, `creator.contributor.works.totalCount`
Shelving and reader tags (nested in each review row)	Which shelf, and which reader-applied tags	`shelving.shelf.name`, `shelving.shelf.webUrl`, `shelving.taggings[].tag.name`, `shelving.webUrl`
Author record (nested in the book row)	Primary contributor identity	`author.name`, `author.legacyId`, `author.webUrl`, `author.profileImageUrl`, `author.isGoodreadsAuthor`, `author.role`, `authorName`

Rows stream into the dataset as they are found, so a title's analytics row appears before its reviews and you can start reading results while the run is still going (src/main.py:465, src/main.py:519).

Need more review and rating data?

If you are building a cross-platform reputation dataset, the same review-analytics pattern exists elsewhere in the API Empire catalogue. The Apple App Store Review Scraper with Version Quality Trends and the Google Play Store Review Scraper by Country & Language apply the same "rows plus derived rating breakdown" shape to mobile app reviews, and the Fragrantica Scraper with Reviews & Sentiment Analysis covers a consumer product vertical. For the retail side of book data, the Amazon Product Scraper and the Amazon ASIN Scraper (Sizes, Colors, Variants) pair naturally with Goodreads output, since bookDetails.asin and bookDetails.isbn13 give you direct join keys.

🧭 How the rating distribution and polarization score are computed

Every analytics number in the book row is derived deterministically from one array — Goodreads' own work.stats.ratingsCountDist, a five-integer list ordered [1★, 2★, 3★, 4★, 5★]. There is no model, no sampling and no AI involved (src/book_analytics.py:115-168).

This is the single most important thing to understand about the output: the histogram describes Goodreads' site-wide rating totals for the entire work, not the reviews this run happened to collect. Set maxItems: 20 on a book with 1.8 million ratings and the review rows are a 20-item sample, but ratingDistribution still covers all 1.8 million ratings. The two are independent sources joined by parentBookId. If you want a distribution of the sampled reviews, aggregate the rating field across the review rows yourself.

ratingsCountDist is also a ratings histogram, not a shelvings histogram. A reader who shelved the book without leaving a star rating contributes to neither the buckets nor the total. Likewise, a review row whose rating is 0 or null (text-only review, no stars) exists in your dataset but is invisible to the histogram.
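The do-it-yourself aggregation over sampled reviews can be sketched as follows. This is a minimal illustration, assuming the dataset has already been loaded into a list of dicts; the function name and the demo rows are hypothetical, while the `type` and `rating` keys are the ones documented on this page.

```python
from collections import Counter


def sampled_rating_histogram(rows):
    """Count star ratings across the review rows of one run.

    Returns a five-integer list ordered [1-star, ..., 5-star], the same
    layout as ratingsCountDist, plus the number of reviews without stars.
    """
    counts = Counter()
    unrated = 0
    for row in rows:
        if row.get("type") != "review":
            continue  # book analytics rows carry no per-review rating
        rating = row.get("rating")
        if isinstance(rating, int) and 1 <= rating <= 5:
            counts[rating] += 1
        else:
            unrated += 1  # rating is 0, null or missing: text-only review
    return [counts[star] for star in range(1, 6)], unrated


# Hypothetical rows, shaped like the dataset described on this page.
demo_rows = [
    {"type": "book", "bookId": 1},
    {"type": "review", "rating": 5},
    {"type": "review", "rating": 4},
    {"type": "review", "rating": 0},
    {"type": "review", "rating": None},
]
histogram, unrated = sampled_rating_histogram(demo_rows)
print(histogram, unrated)
```

Reporting the unrated count separately keeps the text-only reviews visible instead of folding them into a star bucket.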

The exact formulas

Let counts = [c1, c2, c3, c4, c5] and total = c1 + c2 + c3 + c4 + c5, with props[i] = counts[i] / total.

Output key	Formula (verbatim from `src/book_analytics.py`)
`ratingDistribution.total`	`sum(counts)`
`ratingDistribution.oneStar` … `fiveStar`	the raw integers from `ratingsCountDist`, unchanged
`oneStarPct` … `fiveStarPct`	`round(props[i] * 100.0, 2)`
`negativePct`	`round((props[0] + props[1]) * 100.0, 2)` — the 1★ + 2★ share
`neutralPct`	`round(props[2] * 100.0, 2)` — the 3★ share
`positivePct`	`round((props[3] + props[4]) * 100.0, 2)` — the 4★ + 5★ share
`weightedAverage`	`round(Σ (i+1) × props[i], 4)` for i in 0…4 — the mean star rating recomputed at full precision
`ratingStdDev`	`round(sqrt(Σ props[i] × ((i+1) − mean)²), 4)` — the population standard deviation of the star ratings
`polarizationScore`	`round(ratingStdDev / 2.0, 4)`
`extremesPct`	`round((props[0] + props[4]) * 100.0, 2)` — the 1★ + 5★ share

The divisor 2.0 in polarizationScore is the theoretical maximum population standard deviation on a 1–5 scale, reached when exactly half the mass sits on 1★ and half on 5★. That normalizes the score to 0–1: 0 means every rater agreed on the same star value, 1 means the book split its readership perfectly between love and hate. extremesPct is the blunter second signal — it ignores where the middle sits and reports only how much of the vote went to the two endpoints.

weightedAverage will not always equal averageRating: the latter is the figure Goodreads publishes (already rounded), the former is recomputed from the histogram to four decimals. Small divergences are expected and are not an error.
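The formulas in the table can be reproduced in a few lines of Python, which is handy for checking a row or recomputing the numbers on a histogram from another source. This is a sketch written from the table above, not the Actor's code: the function name is hypothetical, only a subset of keys is emitted, and it assumes polarizationScore is derived from the unrounded standard deviation.

```python
import math


def _pct(share):
    """Convert a 0-1 share to a percentage rounded to two decimals."""
    return round(share * 100.0, 2)


def rating_distribution(counts):
    """Derive the analytics above from [1-star, ..., 5-star] counts.

    Returns None for anything that is not a list of five integers.
    """
    if (not isinstance(counts, list) or len(counts) != 5
            or not all(isinstance(c, int) and not isinstance(c, bool)
                       for c in counts)):
        return None
    total = sum(counts)
    props = [c / total for c in counts] if total else [0.0] * 5
    result = {
        "total": total,
        "negativePct": _pct(props[0] + props[1]),
        "neutralPct": _pct(props[2]),
        "positivePct": _pct(props[3] + props[4]),
        "extremesPct": _pct(props[0] + props[4]),
        "weightedAverage": None,   # stays None when there are no ratings
        "ratingStdDev": None,
        "polarizationScore": None,
    }
    if total:
        mean = sum((i + 1) * p for i, p in enumerate(props))
        variance = sum(p * ((i + 1) - mean) ** 2 for i, p in enumerate(props))
        std = math.sqrt(variance)  # population standard deviation
        result["weightedAverage"] = round(mean, 4)
        result["ratingStdDev"] = round(std, 4)
        # Assumption: the score divides the unrounded deviation by 2.0.
        result["polarizationScore"] = round(std / 2.0, 4)
    return result


# A perfectly split readership: half 1-star, half 5-star.
print(rating_distribution([50, 0, 0, 0, 50])["polarizationScore"])  # 1.0
```

The split example hits the theoretical maximum: the mean is 3.0, every rating sits two stars away from it, so the standard deviation is 2.0 and the normalized score is 1.0.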

Edge cases you should handle in your pipeline

Situation	What the Actor writes
`ratingsCountDist` is missing, `null`, not a 5-element list, or holds a non-integer	`ratingsCountDist: null`, `ratingDistribution: null`, and `polarizationScore`, `positivePct`, `neutralPct`, `negativePct` all `null` (`src/book_analytics.py:119-124`, `src/book_analytics.py:289-303`)
`total == 0` (the work has zero star ratings)	All count and percentage keys are present and set to `0` / `0.0`, but `weightedAverage`, `ratingStdDev` and `polarizationScore` are `null`, not `0` (`src/book_analytics.py:132-146`). Filter on `polarizationScore != null` before averaging, so these books are excluded explicitly instead of breaking the aggregate or being counted as `0`
`includeRatingDistribution` is `false`	`ratingsCountDist`, `ratingDistribution`, `polarizationScore`, `positivePct`, `neutralPct`, `negativePct` are absent from the row entirely — not `null`. Use key-existence checks, not null checks
`includeBookDetails` is `false`	`genres`, `bookDetails`, `series` and `secondaryContributors` are absent entirely
The book has no `details` block, or no primary contributor edge	`bookDetails: null` rather than an object of nulls; `author: null` and `authorName: null` (`src/book_analytics.py:181-184`, `src/book_analytics.py:217-221`)
A series entry has no title	It is skipped rather than emitted with a null title (`src/book_analytics.py:205-206`)

One more caveat: averageRating, ratingsCount and textReviewsCount are resolved with a falsy fallback from work-level to edition-level stats (src/book_analytics.py:275-277), so for a work with genuinely zero ratings, ratingsCount can arrive as null rather than 0. The unreduced numbers are always in the stats object, which keeps the two levels separate: stats.ratingsCount versus stats.workRatingsCount, stats.ratingsSum versus stats.workRatingsSum, and so on.
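A consumer-side guard for the absent-versus-null distinction might look like the sketch below. The helper names and state labels are hypothetical; the only facts taken from this page are that the key is missing when the toggle is off and `null` when the score cannot be computed.

```python
def read_polarization(book_row):
    """Return (state, value) for a book row's polarizationScore.

    state is "disabled" when the key is absent (toggle was off),
    "unavailable" when it is null (no usable histogram), "ok" otherwise.
    """
    if "polarizationScore" not in book_row:
        return ("disabled", None)
    value = book_row["polarizationScore"]
    if value is None:
        return ("unavailable", None)
    return ("ok", float(value))


def mean_polarization(book_rows):
    """Average the score over rows that actually carry one."""
    values = [value
              for state, value in map(read_polarization, book_rows)
              if state == "ok"]
    return sum(values) / len(values) if values else None


# Hypothetical rows: two scored books, one with zero ratings, one run
# with includeRatingDistribution switched off.
demo_books = [
    {"polarizationScore": 0.4},
    {"polarizationScore": None},
    {},
    {"polarizationScore": 0.6},
]
print(mean_polarization(demo_books))  # 0.5
```

Keeping the three states apart lets a pipeline report how many books were skipped and why, instead of mixing a configuration choice with a data gap.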

Why not build a Goodreads review scraper yourself?

Goodreads does not offer a public developer API you can sign up for today, which is why every Goodreads scraper on the Apify Store — including this one — reads public web endpoints instead. The competing listing makework36/goodreads-scraper states that "Amazon shut down the legacy Goodreads API in late 2020 and never published a replacement" (checked on the Apify Store 2026-07-25 — not independently verified here). With no official API in the picture, the real comparison is "your scraper versus a maintained one," and three things make the DIY version more expensive than it looks.

The book HTML pages are behind an AWS WAF JavaScript challenge. The Actor's own source documents this: Goodreads book pages return HTTP 202 with an x-amzn-waf-action: challenge header to server IPs, so the traditional approach of regexing kca://work/... identifiers out of the page HTML no longer works (src/helper.py:3-7). A naive requests.get() against a book page returns a challenge stub, not the book. Solving that with a headless browser means paying for browser compute on every single book.

The reliable path requires knowing which endpoint is not WAF-protected, and how to keep its key alive. This Actor posts to Goodreads' public AppSync GraphQL endpoint with the stable public key Goodreads ships in its own JavaScript bundle (src/helper.py:26-30). When that key rotates, it re-fetches the book page, enumerates every <script src="*.js"> tag, downloads the bundles in parallel and re-extracts a fresh apiKey/endpoint pair (src/helper.py:135-183) — a fallback chain you would otherwise have to discover, write and maintain yourself.

Blocking is a routing problem, not a retry problem. Retrying a blocked request from the same IP just burns time. This Actor treats HTTP 202, 403, 429, 502 and 503 — plus any response body containing awswafintegration — as a block, retries up to three times with a growing backoff, and if that fails escalates its route from direct to Apify datacenter proxy to residential proxy, locking to residential for the rest of the run (src/main.py:259-322, src/main.py:374-379).
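The escalation pattern described above can be illustrated with a small, network-free sketch. It is not the Actor's implementation: the function names, the injected `fetch` callable, the linear backoff and the reading of "three times" as three attempts per route are all assumptions made for the example. Only the block signals and the route order come from this page.

```python
import time

BLOCK_STATUSES = {202, 403, 429, 502, 503}
ROUTES = ["direct", "datacenter", "residential"]


def looks_blocked(status, body):
    """Treat the listed status codes or a WAF marker in the body as a block."""
    return status in BLOCK_STATUSES or "awswafintegration" in body


def fetch_with_escalation(fetch, start_route="direct", attempts=3,
                          base_delay=1.0, sleep=time.sleep):
    """Try each route in order until one returns an unblocked response.

    fetch(route) must return (status, body). Returns (route_used, body)
    so the caller can keep using the route that finally worked.
    """
    for route in ROUTES[ROUTES.index(start_route):]:
        for attempt in range(attempts):
            status, body = fetch(route)
            if not looks_blocked(status, body):
                return route, body
            sleep(base_delay * (attempt + 1))  # growing backoff (assumed linear)
    raise RuntimeError("request was blocked on every route")
```

Passing the returned route back in as `start_route` on the next call gives the "lock to the working route for the rest of the run" behaviour without any global state.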

Building it yourself makes sense if you need something this Actor does not emit — Goodreads lists, author bibliographies or quote pages. If what you need is review text, reviewer profiles and a rating breakdown per book, the maintenance is already paid for here.

Why do developers and teams scrape Goodreads?

Goodreads is the largest public corpus of long-form reader opinion attached to a stable book identifier, which makes it useful to four very different audiences.

For AI engineers and agent builders

Goodreads review text is unusually good RAG material: long-form, opinionated, and already labelled with a numeric rating. A typical pipeline pulls text, rating, spoilerStatus and createdAtIso per review, filters on spoilerStatus before indexing so your retrieval layer never leaks a plot twist into an answer, and embeds each review with parentBookId and bookTitle as metadata. The book row then supplies the grounding facts an agent needs to avoid hallucinating: genres, bookDetails.numPages, bookDetails.isbn13 and the ratingDistribution object. Because every field is typed JSON, the dataset drops straight into a vector store or an agent tool response with no parsing step in the loop.
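The pipeline described above, sketched as a pre-indexing step. The function name and the output document shape are hypothetical and not tied to any particular vector store; the sketch also assumes that a truthy `spoilerStatus` marks a review as containing spoilers.

```python
def to_rag_documents(rows, include_spoilers=False):
    """Turn review rows into {id, text, metadata} documents for embedding."""
    documents = []
    for row in rows:
        if row.get("type") != "review" or not row.get("text"):
            continue  # skip book rows and reviews with no text
        if row.get("spoilerStatus") and not include_spoilers:
            continue  # keep plot twists out of the retrieval layer
        documents.append({
            "id": row.get("id"),
            "text": row["text"],
            "metadata": {
                "parentBookId": row.get("parentBookId"),
                "bookTitle": row.get("bookTitle"),
                "rating": row.get("rating"),
                "createdAtIso": row.get("createdAtIso"),
            },
        })
    return documents


# Hypothetical rows, shaped like the dataset described on this page.
demo_rows = [
    {"type": "book", "bookId": 26032825, "title": "The Cruel Prince"},
    {"type": "review", "id": "r1", "text": "Loved the court intrigue.",
     "rating": 5, "spoilerStatus": False, "parentBookId": 26032825,
     "bookTitle": "The Cruel Prince"},
    {"type": "review", "id": "r2", "text": "The ending reveals...",
     "rating": 3, "spoilerStatus": True, "parentBookId": 26032825,
     "bookTitle": "The Cruel Prince"},
]
print([doc["id"] for doc in to_rag_documents(demo_rows)])  # ['r1']
```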

For publishers, authors and book marketers

polarizationScore is the field that changes decisions. Two books can both average 4.0 and behave completely differently in market: one with polarizationScore near 0.6 and extremesPct above 40 has a divided readership and needs positioning that names its audience, while one with a high positivePct and a low standard deviation is a safe recommendation for a broad list. Comparing your title's ratingDistribution against comparable titles — same genres, same bookDetails.format — tells you whether a mixed reception is category-normal or specific to your book, and the text of the 1★ and 2★ reviews tells you why.

For academic and market researchers

Reception studies, genre evolution and reading-behaviour research need rating distributions rather than averages, because the average hides exactly the disagreement the research is about. This Actor returns only publicly visible data, and ratingDistribution gives you the full histogram, the population standard deviation and the extremes share for each work without any per-user collection at all. If your study design does not need reviewer identity, build the entire corpus from type: "book" rows and never touch a personal field. When you do need reader-level analysis, shelving.taggings[].tag.name exposes reader-applied genre tags — the folksonomy that often diverges from the publisher's official category.

For developers building data products

Book discovery apps, reading-list tools and recommendation engines need a repeatable ingest, not a one-off export. bookId, workId, bookDetails.isbn13 and bookDetails.asin give you four stable join keys against your own catalogue and against retail data. Schedule the Actor on Apify with a fixed urls list, keep maxItems low for the recurring delta and high for the initial backfill, and use scrapedAt to version your snapshots. Because review rows and the book row land in the same dataset, one run gives you both the aggregate you display and the review text you show underneath it.

How to scrape Goodreads reviews (step by step)

The Actor runs on the Apify platform. You start it from the Apify Console UI or by calling the Apify API with your Apify API token — there is no separate signup, no vendor key, and no Goodreads credential anywhere in the flow.

Open the Actor on its Apify Store listing and click Try for free, or open it from your Apify Console if you have already added it.
Fill in urls — the one required parameter. Paste one or more Goodreads book links, one per line. A bare numeric identifier works too: anything that does not begin with http is expanded to https://www.goodreads.com/book/show/<value> (src/main.py:621-623).
Set maxItems and the toggles — maxItems is the review cap per book, defaulting to 20; includeRatingDistribution and includeBookDetails both default to on. Open Filters & sorting to change sortBy, languageCode and reviewEdition.
Click Start. The live log prints a histogram summary for each book as soon as its analytics row is saved, then one line per review, and updates the run status message with a running row count.
Export the dataset as JSON, CSV, Excel, XML, HTML or RSS from the Storage tab, or read it through the Apify API. Filter on type to split the two record shapes apart.
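The same steps can be driven from code through the Apify API. The sketch below uses the `apify-client` Python package (installed separately with pip); the helper names are hypothetical, and the Actor ID and token are placeholders you take from the Store listing and your Apify account. It is an illustration of the flow, not an official snippet.

```python
def build_run_input(urls, max_items=20, sort_by="popular",
                    language_code="all", review_edition="ALL"):
    """Assemble an input object using the parameter names from this page."""
    return {
        "urls": list(urls),
        "maxItems": max_items,
        "includeRatingDistribution": True,
        "includeBookDetails": True,
        "filtersAndOptions": {
            "sortBy": sort_by,
            "languageCode": language_code,
            "reviewEdition": review_edition,
        },
    }


def run_and_fetch(actor_id, run_input, token):
    """Start the Actor, wait for it to finish and return all dataset rows."""
    # Imported lazily so the module loads even without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example call (placeholders, not real values):
# rows = run_and_fetch("<username>/<actor-name>",
#                      build_run_input(["26032825"], max_items=50),
#                      token="<your Apify API token>")
# books = [r for r in rows if r["type"] == "book"]
# reviews = [r for r in rows if r["type"] == "review"]
```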

What to do when Goodreads changes its structure

The Actor is maintained, and the output contract is designed to be stable: field names and types stay the same on your end even when the upstream shape shifts. Both the GraphQL queries and the row-building functions are versioned in the source, so a Goodreads-side change is fixed inside the Actor rather than inside your pipeline. Scheduled runs and integrations keep working against the same keys.

What changed in Goodreads scraping recently?

The most significant change is that Goodreads' book HTML pages sit behind an AWS WAF JavaScript challenge — they return HTTP 202 with x-amzn-waf-action: challenge to server-side requests, which broke the long-standing technique of scraping kca://work/... identifiers out of the page markup. This is documented directly in this Actor's source (src/helper.py:3-7) and is the reason the Actor no longer parses book page HTML on its primary path at all.

What changed technically. Identifier resolution moved off the HTML page and onto Goodreads' public AppSync GraphQL endpoint, which is not behind the same WAF and accepts a stable public API key (src/helper.py:8-13). The review list, the rating histogram and the book metadata now all arrive over GraphQL.
What this means for DIY scrapers. HTML-selector scrapers against book pages fail at the fetch stage, not the parse stage, so failures look like empty pages rather than broken selectors. A headless browser fixes it but multiplies compute cost per book.
What this means for users of this Actor. Nothing — no action required. The GraphQL path, the key-refresh fallback and the proxy escalation ladder are all inside the Actor.
What remains publicly accessible. Review text, star ratings, reviewer display names and public profile URLs, shelvings and reader tags, the site-wide rating histogram and full edition metadata are all still returned. Anything that depends on being signed in is not: viewerHasLiked and viewerRelationshipStatus reflect an unauthenticated viewer, so treat them as constants rather than signal.

Maintenance on the extraction path is ongoing, and the key-rotation fallback exists specifically so a routine upstream key change does not require a code release.

⬇️ Input

Every parameter below is reproduced exactly from .actor/actor.json. Only urls is required; everything else has a working default, so the minimum viable input is a single book link.

Parameter	Required	Type	Description	Example Value
`urls`	Yes	`array` of strings	Goodreads book links, one per entry. `editor: stringList`. A bare numeric identifier is accepted and expanded to `https://www.goodreads.com/book/show/<value>`. No `minItems`/`maxItems` on the array itself. Prefilled with `["https://www.goodreads.com/book/show/26032825"]`.	`["https://www.goodreads.com/book/show/26032825", "2767052"]`
`maxItems`	No	`integer`	Reviews to collect per book. The analytics row is emitted in addition and does not count against it. `default: 20`, `minimum: 1`, `maximum: 10000`. The min/max are enforced by the Console form; the Actor code itself only rejects values below zero.	`50`
`includeRatingDistribution`	No	`boolean`	Adds the 1★–5★ histogram with counts and percentages, positive/neutral/negative share, `weightedAverage`, `ratingStdDev` and the 0–1 `polarizationScore` to each book row. When `false`, those keys are omitted entirely. `default: true`.	`true`
`includeBookDetails`	No	`boolean`	Adds `genres`, `bookDetails` (page count, ISBN, ISBN-13, ASIN, format, language, publisher, publication date), `series` and `secondaryContributors`. When `false`, those keys are omitted entirely. `default: true`.	`true`
`filtersAndOptions`	No	`object`	Container for review sort order, language and edition scope. `editor: schemaBased`. Omit it to accept all three defaults.	see sub-fields below
`filtersAndOptions.sortBy`	No	`string`	Review sort order. `editor: select`, `enum: ["popular", "newest", "oldest"]`, `default: "popular"`. Internally `popular` sends no sort argument at all (Goodreads' own default ordering), `newest` maps to `NEWEST`, `oldest` to `OLDEST` (`src/main.py:146-154`). Any other value raises an error.	`"newest"`
`filtersAndOptions.languageCode`	No	`string`	Restrict reviews to one language. `editor: select`, `default: "all"`. `enum`: `all`, `en`, `bn`, `fr`, `de`, `es`, `it`, `pt`, `ru`, `ja`, `ko`, `zh`, `ar`, `hi`, `nl`, `pl`, `tr`, `vi`, `id`, `th`. The value `all` (and an empty string) drops the language argument from the query (`src/main.py:166-172`).	`"en"`
`filtersAndOptions.reviewEdition`	No	`string`	Which edition's reviews to read. `editor: select`, `enum: ["ALL", "only_this_book"]`, `default: "ALL"`. `ALL` maps to `resourceType: "WORK"` — every edition of the work. `only_this_book` maps to `resourceType: "BOOK"` — the specific edition in the URL only (`src/main.py:157-163`).	`"ALL"`
`proxyConfiguration`	No	`object`	`editor: proxy`, prefilled with `{ "useApifyProxy": false }`. Accepted by the schema but not read by the run loop — routing is decided by the built-in escalation ladder, which starts direct and switches to Apify datacenter then residential proxies only when a request is blocked (`src/main.py:588-651` never reads this key). Setting it does no harm; it has no effect.	`{ "useApifyProxy": false }`
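The mapping from `filtersAndOptions` to query arguments described in the table can be sketched like this. The enum values and the `resourceType` targets come from the table; the function name and the argument keys `sort` and `languageCode` are hypothetical, and rejecting an unknown `reviewEdition` is a choice made for this example.

```python
SORT_MAP = {"popular": None, "newest": "NEWEST", "oldest": "OLDEST"}
EDITION_MAP = {"ALL": "WORK", "only_this_book": "BOOK"}


def build_review_filters(options=None):
    """Translate filtersAndOptions into a dict of query arguments."""
    options = options or {}
    sort_by = options.get("sortBy", "popular")
    if sort_by not in SORT_MAP:
        raise ValueError(f"unsupported sortBy: {sort_by!r}")
    edition = options.get("reviewEdition", "ALL")
    if edition not in EDITION_MAP:
        raise ValueError(f"unsupported reviewEdition: {edition!r}")

    args = {"resourceType": EDITION_MAP[edition]}
    if SORT_MAP[sort_by] is not None:
        args["sort"] = SORT_MAP[sort_by]      # hypothetical argument name
    language = options.get("languageCode", "all")
    if language not in ("", "all"):
        args["languageCode"] = language       # hypothetical argument name
    return args


print(build_review_filters())  # {'resourceType': 'WORK'}
print(build_review_filters({"sortBy": "newest", "languageCode": "en",
                            "reviewEdition": "only_this_book"}))
```

With every option left at its default, only the edition scope is sent, which matches the table's note that `popular` and `all` each drop their argument.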

Example JSON input

{
  "urls": [
    "https://www.goodreads.com/book/show/26032825",
    "https://www.goodreads.com/book/show/2767052",
    "6148028"
  ],
  "maxItems": 50,
  "includeRatingDistribution": true,
  "includeBookDetails": true,
  "filtersAndOptions": {
    "sortBy": "newest",
    "languageCode": "en",
    "reviewEdition": "ALL"
  },
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

That input produces up to 3 book-analytics rows plus up to 150 review rows — 3 books × 50 reviews each.

The most common input mistake is reading maxItems as a run total. It is a per-book cap: ten URLs with maxItems: 200 asks for up to 2,010 rows, not 200. The second most common is passing a Goodreads work, author, list or series URL instead of a book URL — the Actor matches /book/show/(\d+) or a bare all-digits string, and anything else fails with Could not read a numeric Goodreads book id from URL (src/helper.py:49-58). A failure on one URL ends the whole run, since the URL loop has no per-book error isolation (src/main.py:620-639), so validate your list before a large batch.
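Pre-flight validation of a URL list can be sketched as below. The pattern mirrors the matching rule quoted above (`/book/show/` followed by digits, or a bare all-digits string); the function names are hypothetical and the check is only an approximation of what the Actor accepts.

```python
import re

BOOK_ID_RE = re.compile(r"/book/show/([0-9]+)")


def extract_book_id(value):
    """Return the numeric book id, or None if the value would be rejected."""
    value = value.strip()
    if re.fullmatch(r"[0-9]+", value):
        return int(value)  # bare numeric identifier
    match = BOOK_ID_RE.search(value)
    return int(match.group(1)) if match else None


def split_valid_invalid(urls):
    """Partition an input list so bad entries can be fixed before the run."""
    valid, invalid = [], []
    for url in urls:
        (valid if extract_book_id(url) is not None else invalid).append(url)
    return valid, invalid


def max_rows(url_count, max_items):
    """Upper bound on dataset rows: one book row plus max_items reviews each."""
    return url_count * (1 + max_items)


candidates = [
    "https://www.goodreads.com/book/show/26032825-the-cruel-prince",
    "6148028",
    "https://www.goodreads.com/author/show/8720.Holly_Black",
]
valid, invalid = split_valid_invalid(candidates)
print(len(valid), invalid)
print(max_rows(10, 200))  # 2010
```

Because one bad entry stops the whole run, dropping or correcting everything in `invalid` before starting is cheaper than discovering it midway through a large batch.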

⬆️ Output

Results land in the Actor's default Apify dataset as typed, normalized JSON with stable key names, exportable as JSON, CSV, Excel, XML, HTML or RSS. Two structurally different record types share the dataset, distinguished by type ("book" or "review") and by isChild (false on book rows, true on review rows). Split them with type == "book" and type == "review", or join reviews to their parent with parentBookId == bookId.

The Actor never writes error rows, status rows or accounting rows. A failure raises and terminates the run rather than pushing a marker row, so every row you see is a real book or a real review and no exclusion filter is needed to clean the output.
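Splitting the two record types and joining reviews back to their parent can be done in one pass. A minimal sketch, assuming the dataset is already loaded as a list of dicts; the function name and the output shape are hypothetical.

```python
from collections import defaultdict


def group_reviews_by_book(rows):
    """Return one {"book": ..., "reviews": [...]} entry per book row."""
    books = {}
    reviews = defaultdict(list)
    for row in rows:
        if row.get("type") == "book":
            books[row.get("bookId")] = row
        elif row.get("type") == "review":
            # Reviews link to their parent through parentBookId.
            reviews[row.get("parentBookId")].append(row)
    return [{"book": book, "reviews": reviews.get(book_id, [])}
            for book_id, book in books.items()]


# Hypothetical rows, shaped like the dataset described on this page.
demo_rows = [
    {"type": "book", "bookId": 1, "title": "A"},
    {"type": "review", "parentBookId": 1, "rating": 5},
    {"type": "review", "parentBookId": 1, "rating": 2},
    {"type": "book", "bookId": 2, "title": "B"},
]
for entry in group_reviews_by_book(demo_rows):
    print(entry["book"]["title"], len(entry["reviews"]))
```

The grouping does not depend on row order, so it works the same on a streamed read and on a full export.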

Scraped book analytics (`type: "book"`)

Values below are illustrative and self-consistent — they demonstrate the shape and the arithmetic, not a captured run.

{
  "type": "book",
  "isChild": false,
  "bookUrl": "https://www.goodreads.com/book/show/26032825-the-cruel-prince",
  "inputUrl": "https://www.goodreads.com/book/show/26032825",
  "bookId": 26032825,
  "bookResourceId": "kca://book/amzn1.gr.book.v1.p3jVQZBOw5RzM0ZKr5aRSA",
  "workId": 46032825,
  "workResourceId": "kca://work/amzn1.gr.work.v1.SsGCwHFtHfKrSs2sQaVdUw",
  "title": "The Cruel Prince",
  "titleComplete": "The Cruel Prince (The Folk of the Air, #1)",
  "description": "Of course I want to be like them. They're beautiful as blades forged in some divine fire…",
  "imageUrl": "https://images.gr-assets.com/books/1517834720i/26032825.jpg",
  "author": {
    "name": "Holly Black", "legacyId": 8720,
    "webUrl": "https://www.goodreads.com/author/show/8720.Holly_Black",
    "profileImageUrl": "https://images.gr-assets.com/authors/1620322270p5/8720.jpg",
    "isGoodreadsAuthor": true, "role": "Author"
  },
  "authorName": "Holly Black",
  "averageRating": 4.0,
  "ratingsCount": 1834520,
  "textReviewsCount": 172431,
  "stats": {
    "averageRating": 4.0, "ratingsCount": 1834520, "ratingsSum": 7344936, "textReviewsCount": 172431,
    "workAverageRating": 4.0, "workRatingsCount": 1834520, "workRatingsSum": 7344936, "workTextReviewsCount": 172431
  },
  "ratingsCountDist": [39817, 95182, 356231, 670388, 672902],
  "ratingDistribution": {
    "total": 1834520,
    "oneStar": 39817, "twoStar": 95182, "threeStar": 356231, "fourStar": 670388, "fiveStar": 672902,
    "oneStarPct": 2.17, "twoStarPct": 5.19, "threeStarPct": 19.42, "fourStarPct": 36.54, "fiveStarPct": 36.68,
    "negativePct": 7.36, "neutralPct": 19.42, "positivePct": 73.22,
    "weightedAverage": 4.0037, "ratingStdDev": 0.9818,
    "polarizationScore": 0.4909, "extremesPct": 38.85
  },
  "polarizationScore": 0.4909,
  "positivePct": 73.22,
  "neutralPct": 19.42,
  "negativePct": 7.36,
  "genres": ["Fantasy", "Young Adult", "Romance", "Fiction", "Faeries", "Magic"],
  "bookDetails": {
    "numPages": 370, "isbn": "0316310271", "isbn13": "9780316310277", "asin": "0316310271",
    "format": "Hardcover", "language": "English",
    "publisher": "Little, Brown Books for Young Readers",
    "publicationTime": 1515456000000, "publicationDateIso": "2018-01-09T00:00:00Z"
  },
  "series": [
    { "title": "The Folk of the Air", "webUrl": "https://www.goodreads.com/series/209448-the-folk-of-the-air", "placement": "1" }
  ],
  "secondaryContributors": [
    { "name": "Caitlin Kelly", "role": "Narrator", "webUrl": "https://www.goodreads.com/author/show/6541002.Caitlin_Kelly" }
  ],
  "scrapedAt": "2026-07-25T09:14:02.517394Z"
}

The four flattened convenience keys — polarizationScore, positivePct, neutralPct, negativePct — are copies of the same values inside ratingDistribution, lifted to the top level so they survive a CSV export without nested-object flattening.

Scraped reviews (`type: "review"`)

{
  "type": "review",
  "isChild": true,
  "bookUrl": "https://www.goodreads.com/book/show/26032825",
  "parentBookId": 26032825,
  "bookTitle": "The Cruel Prince",
  "__typename": "Review",
  "id": "kca://review/amzn1.gr.review.v1.4qHPnyk2CkX0nEQqUcbYRw",
  "creator": {
    "id": 4312837,
    "imageUrlSquare": "https://images.gr-assets.com/users/1508374928p2/4312837.jpg",
    "isAuthor": false,
    "viewerRelationshipStatus": { "isFollowing": false, "isFriend": false, "isBlockedByViewer": false },
    "followersCount": 12840,
    "__typename": "User",
    "textReviewsCount": 1136,
    "name": "Petrik",
    "webUrl": "https://www.goodreads.com/user/show/4312837-petrik",
    "contributor": null
  },
  "recommendFor": "readers who like morally grey faerie politics",
  "updatedAt": 1749208842000,
  "createdAt": 1748905311000,
  "spoilerStatus": false,
  "lastRevisionAt": 1749208842000,
  "text": "This was a pleasant surprise. The court intrigue is the real engine here, and Jude's ruthlessness is earned rather than posed...",
  "rating": 4,
  "shelving": {
    "shelf": { "name": "read", "webUrl": "https://www.goodreads.com/review/list/4312837?shelf=read", "__typename": "Shelf" },
    "taggings": [
      { "tag": { "name": "fantasy", "webUrl": "https://www.goodreads.com/review/list/4312837?shelf=fantasy", "__typename": "Tag" }, "__typename": "Tagging" }
    ],
    "webUrl": "https://www.goodreads.com/review/show/2418765430",
    "__typename": "Shelving"
  },
  "likeCount": 2841,
  "viewerHasLiked": false,
  "commentCount": 137,
  "createdAtIso": "2025-06-02T23:01:51Z",
  "updatedAtIso": "2025-06-06T11:20:42Z",
  "lastRevisionAtIso": "2025-06-06T11:20:42Z",
  "scrapedAt": "2026-07-25T09:14:03.882101Z"
}

Notes on the review shape, all from src/review_extractor.py:

createdAt, updatedAt and lastRevisionAt are Goodreads' raw epoch milliseconds, cast to integers. Each has an ISO-8601 UTC twin — createdAtIso, updatedAtIso, lastRevisionAtIso — so you never divide by 1000 in your pipeline. An unparseable value yields null on the ISO key while the raw key passes through unchanged.
creator.id is Goodreads' numeric legacy user id, aliased from legacyId. creator.contributor is null for ordinary readers and an object with id and works.totalCount when the reviewer is a published Goodreads author — combine it with creator.isAuthor to separate author reviews from reader reviews.
creator itself is null if the review has no attached user; shelving is null if the reviewer did not shelve the book. parentBookId and bookTitle are written only when the parent book row resolved a bookId and title — otherwise those keys are absent rather than null (src/review_extractor.py:100-103).
viewerHasLiked and viewerRelationshipStatus always reflect an anonymous viewer. scrapedAt is stamped per row at push time, so it differs by milliseconds across rows in the same run.
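Should you need to check or reproduce the relationship between a raw timestamp and its ISO twin, the conversion is a one-liner around the standard library. This is a sketch with a hypothetical function name; the exact string format the Actor emits is assumed from the samples on this page (seconds precision, `Z` suffix).

```python
from datetime import datetime, timezone


def epoch_ms_to_iso(value):
    """Convert epoch milliseconds to an ISO-8601 UTC string, or None."""
    try:
        seconds = int(value) / 1000
        moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return None  # unparseable input yields null, as described above
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


print(epoch_ms_to_iso(1515456000000))  # 2018-01-09T00:00:00Z
print(epoch_ms_to_iso("not a number"))  # None
```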

How does this Goodreads scraper compare to other Goodreads scrapers?

All competitor statements below are quoted from the competitors' own Apify Store listings as checked on 2026-07-25, and are not measured or verified here.

Feature	Goodreads Review Scraper (this Actor)	Metadata-only Goodreads scrapers
Individual review text	One `type="review"` row per review with full `text`, `rating`, `spoilerStatus`, `likeCount`, `commentCount`	`makework36/goodreads-scraper` states "Does the scraper extract full review text? Not in v1"; `klondikeking/goodreads-book-scraper` and `lulzasaur/goodreads-books-scraper` document only an aggregate `reviewCount`
Rating distribution histogram	`ratingsCountDist` plus a `ratingDistribution` object with per-star counts, per-star percentages, positive/neutral/negative share and `extremesPct`	None of the three listings documents a per-star breakdown; all three publish an average rating and a ratings count only
Derived analytics	`weightedAverage`, `ratingStdDev`, `polarizationScore` (0–1), `extremesPct` — computed in-Actor with a published formula	Not offered on any of the three listings
Reviewer profile data	Nested `creator` with `name`, `webUrl`, `followersCount`, `textReviewsCount`, `isAuthor`, plus `shelving.taggings[].tag.name` reader tags	Not offered; `lulzasaur` returns book `authors[]` but no reviewer identity
Input flexibility	Book URL or bare numeric id, plus `sortBy`, `languageCode` (19 languages + all) and `reviewEdition` (whole work vs single edition)	`klondikeking` states "Requires direct Goodreads book URLs (search/discovery is not supported)"; `lulzasaur` accepts `searchQueries` and `bookUrls`; `makework36` accepts book, author and list URLs
Anti-bot handling	Public GraphQL endpoint (bypasses the book-page WAF), key-rotation fallback from the live JS bundle, 3 retries with backoff, direct → datacenter → residential escalation	`klondikeking` states "Goodreads may rate-limit requests; proxy usage is recommended for large batches"; `makework36` states "No anti-bot encountered on book, author and list pages"
Entity coverage	Books, reviews, reviewers, authors, series, shelvings/tags	`makework36` covers books, authors and curated lists — a wider page-type footprint, but no review-level or reviewer-level records

If you are building an AI agent or a RAG pipeline, the output format is the decision-maker: this Actor returns typed JSON review rows, whereas parsing HTML inside an agent loop is a reliability failure mode, not a feature. If you need book discovery by keyword or author bibliographies, a metadata-oriented scraper is the better fit; this Actor starts from a book URL you already have and goes deep on that book instead.

How many results can you scrape with the Goodreads Review Scraper?

maxItems caps reviews per book at a default of 20, with a schema range of 1 to 10,000, and there is no cap at all on how many URLs you put in urls — so total dataset size is roughly len(urls) × (1 + maxItems), one analytics row plus up to maxItems review rows per book.
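The sizing rule above can be checked before a run. The helper below is a minimal sketch; the function name is illustrative and not part of the Actor. It returns an upper bound, because a book with fewer reviews than maxItems yields fewer rows.

```python
def max_dataset_rows(urls: list[str], max_items: int = 20) -> int:
    """Upper bound on dataset rows: one analytics row plus up to max_items reviews per URL."""
    if not 1 <= max_items <= 10_000:
        raise ValueError("maxItems must be between 1 and 10,000")
    return len(urls) * (1 + max_items)


# Three books at maxItems=50: 3 analytics rows + up to 150 review rows.
print(max_dataset_rows(["book-a", "book-b", "book-c"], max_items=50))  # 153
```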

Pagination is a cursor loop over Goodreads' getReviews GraphQL connection. The Actor requests min(30, maxItems - collected) reviews at a time, so the last page is trimmed to land exactly on your cap rather than overshooting (src/main.py:28, src/main.py:493). After each page it reads pageInfo.nextPageToken and passes it as the after cursor for the next request, pausing a randomized 0.2–0.8 seconds between pages (src/main.py:250-256, src/main.py:540). A book's loop ends when the cap is reached, a page returns zero review nodes, or Goodreads returns no nextPageToken. Books are processed one after another, not in parallel.
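The loop described above can be sketched as follows. This is an illustration of the stopping rules, not the Actor's source: `fetch_page` is a hypothetical callable standing in for the getReviews request, the `nodes` key is an assumed response shape, and the randomized pause between pages is left out so the sketch runs instantly.

```python
from typing import Callable, Optional

PAGE_SIZE = 30  # per-request page size described above


def collect_reviews(
    fetch_page: Callable[[int, Optional[str]], dict],
    max_items: int,
) -> list[dict]:
    """Cursor loop: stop at the cap, on an empty page, or when no token comes back."""
    collected: list[dict] = []
    after: Optional[str] = None
    while len(collected) < max_items:
        # Trim the final page so the total lands exactly on the cap.
        limit = min(PAGE_SIZE, max_items - len(collected))
        page = fetch_page(limit, after)
        nodes = page.get("nodes") or []
        if not nodes:
            break  # empty page: nothing more to read
        collected.extend(nodes[:limit])
        after = (page.get("pageInfo") or {}).get("nextPageToken")
        if not after:
            break  # no cursor: end of the review list
    return collected


def fake_fetch(limit: int, after: Optional[str]) -> dict:
    """Stand-in for the GraphQL call: serves 70 fake reviews in cursor order."""
    start = int(after or 0)
    end = min(start + limit, 70)
    nodes = [{"id": i} for i in range(start, end)]
    token = str(end) if end < 70 else None
    return {"nodes": nodes, "pageInfo": {"nextPageToken": token}}


print(len(collect_reviews(fake_fetch, 50)))   # cap reached first
print(len(collect_reviews(fake_fetch, 100)))  # book runs out first
```

With the fake source holding 70 reviews, a cap of 50 ends on the trimmed second page, while a cap of 100 ends when the token disappears.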

Two platform-side ceilings are worth knowing. A book with fewer than maxItems reviews simply runs out — you get what exists, and the log says "Reached the end of the review list for this book." And filtersAndOptions.languageCode narrows the pool before pagination, so a high maxItems combined with a low-volume language often terminates early. Setting reviewEdition to only_this_book narrows it further, scoping the query to one edition (resourceType: "BOOK") instead of the whole work.

💸 How this Goodreads scraper is billed

The Actor uses Apify's pay-per-event model with a single event: row_result, charged once per dataset row pushed — the per-book analytics row and every individual review row bill identically (src/main.py:397-413). A run over 3 books with maxItems: 50 therefore bills up to 153 row_result events, not 150. The current per-event price is on the Actor's Apify Store listing. Three behaviours are worth planning around:

There are no free rows in a normally metered run. Every push goes through the same _push_row helper, so there is no error-row or accounting-row category to filter out of your billing; the only unmetered rows are those pushed after the charging fallback described in the next point. To separate the two row types for cost accounting after the fact, filter on type == "book" versus type == "review", or equivalently isChild == false versus isChild == true.
Charging degrades gracefully rather than failing the run. If the account is not configured for pay-per-event, or the charge call errors for any reason, the Actor logs a notice, sets an internal flag and pushes every subsequent row without metering so your data is still saved (src/main.py:403-413). Once that flag flips it stays flipped for the rest of the run.
A mid-book network failure can re-bill a book. When a request fails in a way the Actor considers escalation-worthy, it switches proxy tier and restarts that book from the beginning — pushing and charging a second analytics row and re-pushing reviews it had already collected (src/main.py:562-585 re-enters the function that pushes the book row at src/main.py:465). Deduplicate on id for reviews and bookId for book rows if exact counts matter.
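The deduplication suggested in the last point can be done after download. A minimal sketch, assuming rows are plain dicts as returned by the dataset API; the function name is illustrative. It keeps the first copy of each row and leaves rows without a key untouched.

```python
def dedupe_rows(rows: list[dict]) -> list[dict]:
    """Drop repeated rows: book rows keyed on bookId, review rows keyed on id."""
    seen: set[tuple] = set()
    unique: list[dict] = []
    for row in rows:
        row_type = row.get("type")
        key_value = row.get("bookId") if row_type == "book" else row.get("id")
        key = (row_type, key_value)
        if key_value is not None and key in seen:
            continue  # already kept an earlier copy of this row
        seen.add(key)
        unique.append(row)
    return unique


rows = [
    {"type": "book", "bookId": "1"},
    {"type": "review", "id": "a"},
    {"type": "book", "bookId": "1"},   # book restarted after a proxy switch
    {"type": "review", "id": "a"},     # re-pushed review
    {"type": "review", "id": "b"},
]
print(len(dedupe_rows(rows)))  # 3
```

This cleans the dataset only; it does not change what the run was billed.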

To keep costs predictable, set maxItems deliberately and deduplicate your urls list — the Actor does not dedupe it, so the same book listed twice is scraped and billed twice. Note that the two include* toggles do not change the row count or the bill: the extra data rides the same single GraphQL call that resolves the book identifiers (src/helper.py:104-128), so turning them off saves nothing but field width.
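Deduplicating the urls list before a run is a few lines of Python. The sketch below makes one assumption that should be checked against your own inputs: that a book URL carries its numeric id right after `/book/show/`, so that a URL with a title slug and a bare numeric id for the same book count as duplicates. The helper names and the slug in the example are made up for illustration.

```python
import re

# Assumed URL shape: .../book/show/<digits>[-or-.optional-slug]
_BOOK_ID = re.compile(r"/book/show/(\d+)")


def book_key(entry: str) -> str:
    """Reduce a URL or bare id to a comparable key (falls back to the raw string)."""
    entry = entry.strip()
    if entry.isdigit():
        return entry
    match = _BOOK_ID.search(entry)
    return match.group(1) if match else entry


def dedupe_urls(urls: list[str]) -> list[str]:
    """Keep the first entry per book, preserving input order."""
    seen: set[str] = set()
    unique: list[str] = []
    for entry in urls:
        key = book_key(entry)
        if key not in seen:
            seen.add(key)
            unique.append(entry.strip())
    return unique


urls = [
    "https://www.goodreads.com/book/show/26032825",
    "26032825",
    "https://www.goodreads.com/book/show/26032825-example-slug",
]
print(dedupe_urls(urls))  # ['https://www.goodreads.com/book/show/26032825']
```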

Integrate the Goodreads Review Scraper and automate your workflow

Goodreads Review Scraper works with any language or tool that can send an HTTP request to the Apify API. There is no separate service to sign up for and no product-specific key — authentication is your Apify API token, and the Actor is addressed by its <username>/<actor-name> slug.

REST API integration

from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_API_TOKEN>")

run = client.actor("<YOUR_USERNAME>/goodreads-review-scraper-rating-distribution-book-analytics").call(
    run_input={
        "urls": ["https://www.goodreads.com/book/show/26032825"],
        "maxItems": 50,
        "includeRatingDistribution": True,
        "includeBookDetails": True,
        "filtersAndOptions": {"sortBy": "newest", "languageCode": "en", "reviewEdition": "ALL"},
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["type"] == "book":
        d = item.get("ratingDistribution") or {}
        print(item["title"], "polarization:", d.get("polarizationScore"), "positive%:", d.get("positivePct"))
    else:
        print(" ", item["rating"], "★", (item.get("creator") or {}).get("name"), "-", (item.get("text") or "")[:80])

Works from Python, Node.js, Go, Ruby and cURL: the Apify API exposes the same run-and-fetch-dataset endpoint pair to any HTTP client, and the official Python and JavaScript Apify clients wrap those endpoints for you.

Automation platforms (n8n, Make, LangChain)

n8n ships an official Apify node. Authenticate with your Apify API token, choose the Run Actor operation, select this Actor and paste the run input JSON. Chain the Get dataset items operation after it, then route book rows and review rows down separate branches with an IF node on type — book rows into your analytics warehouse, review rows into your text pipeline.

Make offers an Apify app with a Run an Actor module. Configure it with the Actor slug and your run input, then follow it with Get dataset items. A common pattern is a scheduled scenario that runs a fixed urls list weekly and appends the book rows to a Google Sheet, giving you a time series of polarizationScore and positivePct per title with no code.

LangChain integrates through the langchain-apify package, whose ApifyDatasetLoader reads an Apify dataset into Document objects using a mapping function you control. Map text to page_content and rating, spoilerStatus, parentBookId, bookTitle and createdAtIso into metadata, and the reviews are ready for chunking and retrieval — with the book row's genres and ratingDistribution available as structured grounding alongside the free text.
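The mapping described above can be sketched without any LangChain dependency. The function below only builds the two fields a document needs; in a real pipeline you would wrap them in a LangChain `Document` inside the loader's mapping function. The helper name is illustrative, and missing fields are assumed to map to `None` or an empty string.

```python
def review_to_document_fields(item: dict) -> dict:
    """Split a review row into retrievable text and filterable metadata."""
    return {
        "page_content": item.get("text") or "",
        "metadata": {
            "rating": item.get("rating"),
            "spoilerStatus": item.get("spoilerStatus"),
            "parentBookId": item.get("parentBookId"),
            "bookTitle": item.get("bookTitle"),
            "createdAtIso": item.get("createdAtIso"),
        },
    }


def review_rows(items: list[dict]) -> list[dict]:
    """Keep only review rows; book rows are handled separately as structured context."""
    return [item for item in items if item.get("type") == "review"]


sample = [
    {"type": "book", "title": "Example Book"},
    {"type": "review", "text": "Loved it.", "rating": 5, "parentBookId": "1"},
]
docs = [review_to_document_fields(row) for row in review_rows(sample)]
print(docs[0]["page_content"], docs[0]["metadata"]["rating"])
```

Keeping the mapping as a plain function makes it easy to unit-test before it is plugged into a loader.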

Is it legal to scrape Goodreads?

Scraping publicly accessible Goodreads pages is generally lawful in most jurisdictions, but how you store and use the results is a separate question with a real answer. Goodreads Review Scraper collects only data that Goodreads shows to any signed-out visitor — it never signs in, never accesses private shelves, and never touches content behind an account wall.

Because review rows include reviewer display names, public profile URLs, avatar images and follower counts, the output contains personal data under GDPR and CCPA. You need a lawful basis before you store it, a defined retention period, and a plan for deletion requests. Practical mitigation: if your use case is aggregate — genre trends, rating distributions, reception analysis — build it from type: "book" rows only, which contain no personal data at all, or drop the creator object at ingest and keep the anonymous text and rating. Review text remains the copyrighted expression of its author, so republishing reviews verbatim raises a copyright question separate from data protection.
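Both mitigations mentioned above are small ingest-time filters. A minimal sketch, assuming rows are plain dicts and that `creator` is the only reviewer-identifying object you want to drop; the function names are illustrative, and whether this is sufficient for your use case is a question for your own compliance review.

```python
def drop_creator(row: dict) -> dict:
    """Return a copy of a row without the nested reviewer profile."""
    return {key: value for key, value in row.items() if key != "creator"}


def prepare_for_storage(rows: list[dict], aggregate_only: bool = False) -> list[dict]:
    """Keep only book rows for aggregate work, or strip reviewer profiles from all rows."""
    if aggregate_only:
        return [row for row in rows if row.get("type") == "book"]
    return [drop_creator(row) for row in rows]


rows = [
    {"type": "book", "bookId": "1"},
    {"type": "review", "id": "a", "rating": 4, "text": "Solid.", "creator": {"name": "Reader"}},
]
print(prepare_for_storage(rows))
print(prepare_for_storage(rows, aggregate_only=True))
```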

Consult legal counsel for commercial use cases involving bulk personal data.

Frequently asked questions

Does the Goodreads Review Scraper work without a Goodreads account?

Yes. No Goodreads account, login, cookie or password is ever used. The Actor reads Goodreads' public AppSync GraphQL endpoint using the stable public API key that Goodreads itself ships in its browser JavaScript bundle (src/helper.py:26-30). The only credential in the whole workflow is your Apify API token, and that authenticates you to Apify, not to Goodreads.

How often is the scraped Goodreads data updated?

Every run fetches live. There is no cache layer anywhere in the Actor — each book triggers a fresh getBookByLegacyId GraphQL call for the histogram and metadata, then fresh getReviews calls for the review pages. scrapedAt records the exact UTC moment each row was built. To track how a book's polarizationScore or positivePct moves over time, schedule the Actor on Apify and keep the historical rows.

What happens if a book has no reviews, or the ratings histogram is missing?

You still get the book-analytics row; you just get fewer review rows. On an empty review page the Actor logs "No more reviews to show for this book" and moves on, so a book with zero reviews produces exactly one row. If work.stats.ratingsCountDist is absent or malformed, ratingDistribution and the four flattened analytics keys are all null. If the work exists but has zero ratings, counts and percentages are 0 while weightedAverage, ratingStdDev and polarizationScore are null — guard against that before averaging across books.
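The guard recommended above can look like this. A minimal sketch, assuming book rows are plain dicts with `polarizationScore` nested under `ratingDistribution`; the function name is illustrative. Rows with a missing histogram or a null score are skipped rather than counted as zero.

```python
from statistics import fmean
from typing import Optional


def mean_polarization(book_rows: list[dict]) -> Optional[float]:
    """Average polarizationScore across books, ignoring rows where it is null."""
    scores = []
    for row in book_rows:
        dist = row.get("ratingDistribution") or {}  # histogram may be null
        score = dist.get("polarizationScore")
        if score is not None:                        # zero-rating works yield null
            scores.append(score)
    return fmean(scores) if scores else None


books = [
    {"type": "book", "ratingDistribution": {"polarizationScore": 0.25}},
    {"type": "book", "ratingDistribution": None},
    {"type": "book", "ratingDistribution": {"polarizationScore": None}},
    {"type": "book", "ratingDistribution": {"polarizationScore": 0.75}},
]
print(mean_polarization(books))  # 0.5
```

Returning `None` for an empty set keeps "no data" distinguishable from a genuine score of 0.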

Can I scrape private shelves, deleted reviews or login-gated Goodreads content?

No. Only publicly accessible content is returned. Private shelves, friends-only activity and anything requiring a signed-in session are out of scope by construction, because the Actor has no login path. Deleted reviews are gone from Goodreads' API and never appear. viewerHasLiked and viewerRelationshipStatus exist in the output only because they belong to the upstream schema; as an anonymous client the Actor always sees them in their default state.

How do I run it, and what do I need to start?

Open the Actor on its Apify Store listing and click Try for free, or call it from the Apify API with your Apify API token. Access and pricing terms are shown on the listing itself — the Actor bills through Apify's pay-per-event model on the row_result event. There is no separate account, no vendor-specific key and no Goodreads credential to obtain.

Does the Goodreads Review Scraper work for AI agent workflows and LLM pipelines?

Yes. The Actor is callable as an HTTP endpoint by any agent framework — LangChain, LlamaIndex, CrewAI, a custom tool-calling loop, or a plain requests call — through the Apify API's run-and-fetch-dataset endpoints. Every response is typed JSON with stable key names, so there is no parsing step before you pass it to an LLM. A practical split: put review text into your vector store with rating and parentBookId as metadata, and hand the book row's ratingDistribution to the model as structured context so it reasons over the real histogram instead of guessing from an average.

How does the Actor handle Goodreads' anti-bot system?

By routing around it first and escalating second. Goodreads protects its book HTML pages with an AWS WAF JavaScript challenge that returns HTTP 202 with an x-amzn-waf-action: challenge header, so the Actor does not scrape that HTML on its primary path at all — it uses the public GraphQL endpoint, which is not behind that WAF (src/helper.py:3-13). It treats HTTP 202, 403, 429, 502 and 503, plus any body containing awswafintegration, as a block. On a block it retries up to 3 times with a growing backoff (1.2 s, then 2.4 s), and if the retries are exhausted it escalates its route: direct → Apify datacenter proxy → Apify residential proxy, locking to residential for the rest of the run (src/main.py:25-26, src/main.py:273-322, src/main.py:374-379). Separately, if the public API key rotates, it re-reads a fresh key and endpoint from Goodreads' live JavaScript bundles and retries (src/helper.py:164-183).
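The block rule and the escalation ladder described above can be sketched as follows. This is an illustration, not the Actor's source: `send` is a hypothetical callable that performs one request over a given route, the reading of "3 retries" as three attempts per route with two waits in between is an assumption, and key refresh and the lock to residential for the rest of the run are not modelled.

```python
import time
from typing import Callable

BLOCK_STATUSES = {202, 403, 429, 502, 503}
ROUTES = ["direct", "datacenter", "residential"]
BACKOFFS = [1.2, 2.4]  # seconds to wait after the 1st and 2nd blocked attempt
ATTEMPTS_PER_ROUTE = 3


def is_blocked(status: int, body: str) -> bool:
    """A response counts as a block by status code or by the WAF marker in the body."""
    return status in BLOCK_STATUSES or "awswafintegration" in body


def fetch_with_escalation(
    send: Callable[[str], tuple[int, str]],
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each route in order; return (route, body) for the first unblocked response."""
    for route in ROUTES:
        for attempt in range(ATTEMPTS_PER_ROUTE):
            status, body = send(route)
            if not is_blocked(status, body):
                return route, body
            if attempt < len(BACKOFFS):
                sleep(BACKOFFS[attempt])  # growing backoff before the next attempt
    raise RuntimeError("blocked on every route")


def demo_send(route: str) -> tuple[int, str]:
    """Fake transport: only the residential route gets through."""
    return (200, '{"data": {}}') if route == "residential" else (403, "")


route, body = fetch_with_escalation(demo_send, sleep=lambda seconds: None)
print(route)  # residential
```

Injecting `sleep` keeps the sketch testable without real waiting.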

Does it return data in a format LLMs can use directly?

Yes. Typed, normalized JSON with stable field names — no HTML, no CSS selectors, no parsing. Pass a book row straight into an LLM context window, index review text into a vector store, or route the whole dataset through an agent tool. Numeric fields arrive as numbers (rating, likeCount, polarizationScore), booleans as booleans (isChild, spoilerStatus), and every timestamp has both an epoch-milliseconds form and an ISO-8601 UTC form.

Can I use it without managing proxies?

Yes, and by default you do not configure proxies at all. The Actor starts every run direct and only reaches for Apify Proxy when a request is actually blocked, at which point it moves to the datacenter pool and then, if needed, to the residential pool — automatically, mid-run, without restarting the job (src/main.py:273-322). No proxy list, no rotation logic, no session management on your side. Note that the proxyConfiguration input is accepted by the schema but is not read by the run loop, so the escalation ladder is what actually decides routing.

What happens when Goodreads changes its structure or blocks the scraper?

The Actor is maintained and the output schema stays stable — field names and types do not change on your end. The GraphQL queries, block-detection rules and row-building functions all live inside the Actor, so a Goodreads-side change is repaired there rather than in your pipeline. Some resilience is already built in: the API key refreshes itself from Goodreads' live JavaScript bundles if the public one rotates, and the proxy ladder handles IP-level blocking without intervention.

Your feedback

Found a bug, hit a book that parses oddly, or missing a Goodreads field? We want to know — a concrete report with the exact urls input that reproduced the problem is the fastest route to a fix. Open an issue on the Issues tab of the Actor's Apify Store listing; feature requests are welcome there too. If a field exists in Goodreads' public GraphQL response and is not in the output yet, that is usually a quick addition.

