Internet Archive Book Reviews Scraper

Pricing

from $1.00 / 1,000 results

Extract public Archive.org book metadata, ISBNs, ratings, and user reviews from public Internet Archive endpoints. Start from URLs, identifiers, ISBNs, creators, collections, subjects, or search queries. Output is always one dataset row per public review. No API key required.

Developer

Inus Grobler

Maintained by Community

At a glance: the Actor extracts public Archive.org book reviews and item metadata. Inputs can be Archive.org URLs, identifiers, ISBNs, creators, collections, or subjects. Output is review rows with book metadata. Typical use cases are library research and catalog enrichment. Limitations, troubleshooting, and pricing notes are covered below.

Internet Archive Book Reviews Scraper extracts public Archive.org book reviews, ratings, ISBNs, and item metadata for researchers, publishers, librarians, analysts, and data teams that need structured review data from Internet Archive pages.

No Internet Archive API key is required for public item URLs, public identifiers, public metadata, or public reviews.

What You Can Use It For

  • Build datasets of public Archive.org book reviews and star ratings
  • Enrich book records with title, creator, ISBN, publisher, subject, language, collection, and cover URL
  • Monitor public reviews for selected Internet Archive books
  • Research library, public-domain, and scanned-book collections
  • Export review data to spreadsheets, dashboards, databases, or AI workflows
  • Find reviewed books by ISBN, creator, collection, subject, or Archive.org search query

What It Extracts

Each dataset row is one public review enriched with source item metadata.

Field group        Examples
Item identity      identifier, itemUrl, metadataUrl, coverUrl
Book metadata      title, creators, publisher, publishedDate, language, subjects, collections, mediatype
Book identifiers   isbn10, isbn13
Review data        reviewTitle, reviewText, stars, rating, ratingScale, reviewerName, createdAt, reviewUpdatedAt
Run metadata       reviewSource, reviewHash, scrapedAt
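
The item identity fields appear to follow fixed URL patterns, judging by the example output on this page. The sketch below rebuilds them from an identifier. The helper name item_urls is hypothetical, and the patterns are inferred from that single example rather than confirmed by the Actor's source.

```python
from urllib.parse import quote

ARCHIVE_BASE = "https://archive.org"


def item_urls(identifier):
    """Build item, metadata, and cover URLs for one Archive.org identifier.

    Assumption: the URL patterns match those seen in the example output.
    """
    safe_id = quote(identifier, safe="")
    return {
        "itemUrl": f"{ARCHIVE_BASE}/details/{safe_id}",
        "metadataUrl": f"{ARCHIVE_BASE}/metadata/{safe_id}",
        "coverUrl": f"{ARCHIVE_BASE}/services/img/{safe_id}",
    }


urls = item_urls("goodytwoshoes00newyiala")
print(urls["metadataUrl"])
```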

Simple Input

For the easiest run, add known Archive.org item URLs, item identifiers, or ISBNs to sources.

{
  "sources": ["https://archive.org/details/goodytwoshoes00newyiala"],
  "maxItems": 1,
  "maxReviewsPerItem": 10
}

You can also paste multiple values into sources, one per line:

https://archive.org/details/goodytwoshoes00newyiala
goodytwoshoes00newyiala
9780140449136
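
If you assemble sources from mixed data, it can help to check what each value looks like before starting a run. The following is a minimal sketch of one way to tell URLs, ISBNs, and identifiers apart using standard ISBN checksums. The function names are hypothetical, and this is not necessarily how the Actor itself classifies values.

```python
def is_valid_isbn13(value):
    """Check the ISBN-13 checksum (weights alternate 1 and 3)."""
    if len(value) != 13 or not (value.isascii() and value.isdigit()):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(value))
    return total % 10 == 0


def is_valid_isbn10(value):
    """Check the ISBN-10 checksum (weights 10 down to 1, 'X' means 10)."""
    if len(value) != 10 or not (value.isascii() and value[:9].isdigit()):
        return False
    if value[9] not in "0123456789Xx":
        return False
    digits = [int(c) for c in value[:9]]
    digits.append(10 if value[9] in "Xx" else int(value[9]))
    total = sum(d * w for d, w in zip(digits, range(10, 0, -1)))
    return total % 11 == 0


def classify_source(value):
    """Return 'url', 'isbn', or 'identifier' for one sources entry."""
    text = value.strip()
    if text.lower().startswith(("http://", "https://")):
        return "url"
    compact = text.replace("-", "").replace(" ", "")
    if is_valid_isbn13(compact) or is_valid_isbn10(compact):
        return "isbn"
    # Anything else is treated as an Archive.org item identifier.
    return "identifier"


for entry in ["https://archive.org/details/goodytwoshoes00newyiala",
              "goodytwoshoes00newyiala", "9780140449136"]:
    print(entry, "->", classify_source(entry))
```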

Discovery Inputs

Use these when you want the Actor to find matching Archive.org text items:

  • isbns: ISBN-10 or ISBN-13 values
  • creators: author or creator names
  • collections: Archive.org collection identifiers, such as internetarchivebooks
  • subjects: subject terms, such as fiction
  • searchQueries: raw Archive.org search queries

For review-focused discovery, use:

{
  "searchQueries": ["mediatype:texts AND reviewdate:*"],
  "maxItems": 25,
  "maxReviewsPerItem": 20
}

General searches can match books that have no public review rows. Keep onlyItemsWithReviews enabled if you only want review output.
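
To combine the review-focused query with a creator, collection, or subject, you can build the raw query string yourself and pass it in searchQueries. This sketch assumes the usual Archive.org search field names (creator, collection, subject). The helper build_review_query is hypothetical and is not part of the Actor.

```python
def quote_term(term):
    """Wrap a search term in double quotes, escaping embedded quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_review_query(creator=None, collection=None, subject=None):
    """Build a raw query limited to text items that have reviews.

    Assumption: creator, collection, and subject are valid search fields.
    """
    clauses = ["mediatype:texts", "reviewdate:*"]
    if creator:
        clauses.append(f"creator:{quote_term(creator)}")
    if collection:
        clauses.append(f"collection:{quote_term(collection)}")
    if subject:
        clauses.append(f"subject:{quote_term(subject)}")
    return " AND ".join(clauses)


run_input = {
    "searchQueries": [build_review_query(subject="fiction")],
    "maxItems": 25,
    "maxReviewsPerItem": 20,
}
print(run_input["searchQueries"][0])
```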

Advanced Options

Most users do not need advanced settings. They are available for large or filtered runs:

  • minStars and maxStars: keep reviews within a rating range
  • reviewTextContains: keep reviews whose title or body contains text
  • languageFilter: keep items in selected languages
  • mediatypes: defaults to texts
  • globalConcurrency, perHostConcurrency, requestDelayMs, requestTimeoutSecs, maxRetries: HTTP reliability controls
  • includeRawMetadata and includeRawReviews: debugging fields for advanced users
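
The rating and text filters can also be reproduced on rows you have already downloaded. This sketch shows one plausible reading of minStars, maxStars, and reviewTextContains. It assumes matching is case-insensitive and that reviews without stars are dropped whenever a star bound is set. The Actor's actual behavior may differ, and keep_review is a hypothetical helper.

```python
def keep_review(row, min_stars=None, max_stars=None, text_contains=None):
    """Return True if a review row passes the given filters."""
    stars = row.get("stars")
    if min_stars is not None or max_stars is not None:
        # Assumption: unrated reviews cannot satisfy a star range.
        if stars is None:
            return False
        if min_stars is not None and stars < min_stars:
            return False
        if max_stars is not None and stars > max_stars:
            return False
    if text_contains:
        title = row.get("reviewTitle") or ""
        body = row.get("reviewText") or ""
        if text_contains.lower() not in f"{title} {body}".lower():
            return False
    return True


rows = [
    {"reviewTitle": "Fun", "reviewText": "This is an enjoyable read.", "stars": 4},
    {"reviewTitle": None, "reviewText": "Unrated comment.", "stars": None},
]
kept = [r for r in rows if keep_review(r, min_stars=4, text_contains="enjoyable")]
print(len(kept))
```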

Backward-compatible fields such as startUrls, identifiers, includeMetadata, includeReviews, includeFiles, and outputMode are still accepted. The dataset output remains one row per public review.

Example Output

{
  "entityType": "review",
  "source": "internet_archive",
  "identifier": "goodytwoshoes00newyiala",
  "itemUrl": "https://archive.org/details/goodytwoshoes00newyiala",
  "metadataUrl": "https://archive.org/metadata/goodytwoshoes00newyiala",
  "title": "Goody Two-Shoes",
  "creators": [],
  "isbn10": null,
  "isbn13": null,
  "publisher": "New-York : McLoughlin Bro's",
  "publishedDate": "c1888",
  "language": ["eng"],
  "subjects": ["Brothers and sisters", "Orphans", "Conduct of life", "Education"],
  "collections": ["cdl", "yrlsc", "iacl", "americana"],
  "mediatype": "texts",
  "coverUrl": "https://archive.org/services/img/goodytwoshoes00newyiala",
  "reviewTitle": "Fun",
  "reviewText": "This is an enjoyable read.",
  "stars": 4,
  "rating": 4,
  "ratingScale": 5,
  "reviewerName": "ErniePye",
  "reviewerItemName": null,
  "createdAt": "2007-09-06 03:26:39",
  "reviewUpdatedAt": "2007-09-06 03:26:39",
  "reviewSource": "metadata_reviews_branch",
  "reviewHash": "ce6e1ad425a071ae6e6cccc8ec08d8e73c216f5ee1d3de825be0b5e82c550dfd",
  "scrapedAt": "2026-06-14T18:01:23.349Z"
}
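
Because every row repeats the item metadata, per-book statistics need a grouping step. The sketch below groups rows by identifier and computes a review count, an average star rating, and the latest review date. It assumes createdAt always uses the "YYYY-MM-DD HH:MM:SS" layout seen in the example, and it skips missing stars or timestamps. The function name summarize_reviews is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed from the example row


def summarize_reviews(rows):
    """Group review rows by item identifier and compute simple statistics."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["identifier"]].append(row)

    summary = {}
    for identifier, reviews in grouped.items():
        stars = [r["stars"] for r in reviews if r.get("stars") is not None]
        dates = [
            datetime.strptime(r["createdAt"], TIMESTAMP_FORMAT)
            for r in reviews
            if r.get("createdAt")
        ]
        summary[identifier] = {
            "reviewCount": len(reviews),
            "averageStars": sum(stars) / len(stars) if stars else None,
            "latestReview": max(dates) if dates else None,
        }
    return summary


sample = [
    {"identifier": "goodytwoshoes00newyiala", "stars": 4,
     "createdAt": "2007-09-06 03:26:39"},
]
print(summarize_reviews(sample))
```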

How To Run On Apify

  1. Open the Actor on Apify.
  2. Add one or more values to sources, or use discovery fields such as creators, collections, or searchQueries.
  3. Set maxItems and maxReviewsPerItem to control run size.
  4. Click Start.
  5. Download results from the Dataset tab as JSON, CSV, Excel, XML, or HTML.

Python API Example

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "sources": ["https://archive.org/details/goodytwoshoes00newyiala"],
    "maxItems": 1,
    "maxReviewsPerItem": 10,
}

# Start the Actor and wait for the run to finish.
run = client.actor("TheScrapeLab/internet-archive-book-reviews-scraper").call(run_input=run_input)

# Each dataset item is one public review; stars and reviewText may be missing.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["identifier"], item.get("stars"), item.get("reviewText"))
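
For the monitoring use case, repeated runs return reviews you have already stored. The sketch below keeps only rows whose reviewHash has not been seen before. It assumes reviewHash stays the same for an unchanged review across runs, which this page does not state explicitly. The helper new_reviews is hypothetical.

```python
def new_reviews(rows, seen_hashes):
    """Return rows with an unseen reviewHash and record their hashes.

    Assumption: reviewHash is stable across runs for the same review.
    """
    fresh = []
    for row in rows:
        review_hash = row.get("reviewHash")
        if review_hash is None or review_hash in seen_hashes:
            continue
        seen_hashes.add(review_hash)
        fresh.append(row)
    return fresh


seen = set()
first_run = [{"reviewHash": "abc", "reviewText": "This is an enjoyable read."}]
second_run = first_run + [{"reviewHash": "def", "reviewText": "Another review."}]

print(len(new_reviews(first_run, seen)))
print(len(new_reviews(second_run, seen)))
```

In practice you would persist the set of hashes between runs, for example in a file or database.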

Limits And Caveats

  • The Actor extracts public Archive.org metadata and public review objects only.
  • Items without public reviews do not produce dataset rows when onlyItemsWithReviews is enabled.
  • Some public review objects may omit stars, reviewer names, or timestamps.
  • Archive.org endpoint latency and availability can vary.
  • The Actor is read-only. It does not log in, modify Archive.org data, solve captchas, or bypass access controls.
  • Optional Internet Archive credentials are accepted for compatibility, but they are not required or used for public review scraping.

Troubleshooting

No rows in the dataset: The item may have no public reviews, or your filters may have removed all reviews.

Search finds books but no reviews: Use mediatype:texts AND reviewdate:* in searchQueries to focus on reviewed text items.

Invalid URL error: Use an Archive.org /details/{identifier}, /metadata/{identifier}, or /metadata/{identifier}/reviews URL.
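
To catch unsupported URLs before a run, you can check them against the three accepted shapes. This is a minimal sketch with a hypothetical helper, extract_identifier. It is stricter than the Actor may be: it assumes the host is archive.org or www.archive.org and rejects any extra path segments.

```python
import re
from urllib.parse import urlparse

# Matches /details/{identifier}, /metadata/{identifier},
# and /metadata/{identifier}/reviews (optional trailing slash).
_ITEM_PATH = re.compile(r"^/(details|metadata)/([^/]+)(/reviews)?/?$")


def extract_identifier(url):
    """Return the item identifier from a supported URL, or None."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return None
    host = (parsed.hostname or "").lower()
    if host not in ("archive.org", "www.archive.org"):
        return None
    match = _ITEM_PATH.match(parsed.path)
    if match is None:
        return None
    kind, identifier, reviews_suffix = match.groups()
    # /details/{identifier}/reviews is not one of the listed shapes.
    if kind == "details" and reviews_suffix:
        return None
    return identifier


print(extract_identifier("https://archive.org/metadata/goodytwoshoes00newyiala/reviews"))
```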

Missing identifier: The item may be unavailable through the public metadata endpoint. The Actor records a warning and continues when possible.

Slow broad searches: Reduce maxItems, keep maxReviewsPerItem close to what you need, and use specific creators, collections, subjects, or queries.

Pricing

Recommended pricing model: pay per result.

Each useful dataset row is one public review enriched with item metadata. A simple per-result price is easiest for users to understand and keeps small tests affordable. Based on measured 256 MB runs, the recommended starting price is $0.001 per dataset item, with platform usage paid by the user. Large-volume users can be offered lower private pricing after enough production usage is measured.
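
As a rough worked example of the per-result price, the sketch below computes an upper bound on result charges for a run from maxItems and maxReviewsPerItem. It assumes the $0.001 starting price quoted above and excludes platform usage. Real runs usually cost less because many items have fewer reviews than the cap. The function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_RESULT_USD = Decimal("0.001")  # starting price quoted above


def estimate_max_result_cost(max_items, max_reviews_per_item,
                             price=PRICE_PER_RESULT_USD):
    """Upper bound on per-result charges, excluding platform usage."""
    max_rows = max_items * max_reviews_per_item
    return max_rows * price


# 25 items with up to 20 reviews each is at most 500 rows.
print(estimate_max_result_cost(25, 20))
```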

FAQ

Can I scrape Internet Archive book reviews by URL?

Yes. Add Archive.org item URLs to sources, such as https://archive.org/details/goodytwoshoes00newyiala.

Can I search Internet Archive reviews by author?

Yes. Add author names to creators, or use a raw Archive.org query in searchQueries.

Does this download books or files?

No. It extracts public metadata and public reviews. It does not download book files.

Does it need an Internet Archive account?

No. Public item metadata and reviews do not require an Internet Archive account.

Why do some books have no output?

The dataset is review-focused. If a book has no public reviews, it normally produces no dataset rows.

Can I export to CSV or Excel?

Yes. Use the Dataset tab in Apify Console and choose CSV, Excel, JSON, XML, or HTML.
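
If you prefer to build the CSV yourself from rows fetched through the API, list-valued fields such as subjects and collections need flattening first. A minimal sketch, assuming a "; " separator and a hypothetical helper named rows_to_csv:

```python
import csv
import io


def rows_to_csv(rows, fields):
    """Serialize review rows to CSV text, joining list values with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        flat = {}
        for field in fields:
            value = row.get(field)
            if isinstance(value, list):
                value = "; ".join(str(v) for v in value)
            flat[field] = value  # None is written as an empty cell
        writer.writerow(flat)
    return buffer.getvalue()


sample = [{"identifier": "goodytwoshoes00newyiala",
           "subjects": ["Orphans", "Education"], "stars": 4}]
print(rows_to_csv(sample, ["identifier", "subjects", "stars"]))
```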