Internet Archive Reviews & Metadata Scraper

Pricing

Pay per usage

Extract public Archive.org book metadata, ISBNs, ratings, and user reviews from public Internet Archive endpoints. Start from URLs, identifiers, ISBNs, creators, collections, subjects, or search queries. Output is always one dataset row per public review. No API key required.

Developer

Inus Grobler

Maintained by Community

Archive.org scraper for extracting public Internet Archive book metadata, item details, ISBNs, ratings, and public user reviews. Add Archive.org item URLs, identifiers, ISBNs, creators, collections, subjects, or search queries and get one dataset row per public review.

No Internet Archive API key is required for public item URLs, public identifiers, public metadata, or public reviews.

Quick Start

The default input is intentionally small and fast. If you click Start without changing anything, the Actor scrapes up to 10 public reviews from one known Archive.org item:

{
  "sources": ["https://archive.org/details/goodytwoshoes00newyiala"],
  "maxItems": 1,
  "maxReviewsPerItem": 10,
  "onlyItemsWithReviews": true
}

Use that default run to confirm the output format, then replace sources or use the discovery fields for your own books, authors, collections, ISBNs, subjects, or search queries.

Why Use This Actor

  • Scrape Archive.org book reviews without browser automation
  • Extract Internet Archive item metadata from public API-style endpoints
  • Enrich review records with title, creators, ISBNs, publisher, language, subjects, collections, and cover URL
  • Search by ISBN, creator, collection, subject, or Archive.org query
  • Export results as JSON, CSV, Excel, XML, or via the Apify API
  • Run as a one-off scrape, scheduled research workflow, or data pipeline input

What It Extracts

Book and item metadata

  • Archive.org identifier, title, creators/authors, description, publisher, publication date, year, language, subjects, collections, and media type
  • ISBN-10, ISBN-13, all ISBN values, Open Library IDs, OCLC, LCCN, contributors, sponsors, scanners, uploader, rights, license URL, and external identifiers
  • Cover thumbnail URL, item URL, metadata URL, file count, item size, and update timestamps

Public reviews

  • Review title and body text
  • Star rating, normalized rating, and rating scale
  • Reviewer name and reviewer item/profile reference when available
  • Created date, updated/review date, source item identifier, source item title, source item URL, stable review hash, and scrape timestamp

Computed review statistics

  • Review count
  • Non-zero rating count and zero-rating count
  • Rating distribution
  • First review date and latest review date
  • computedAverageRating

computedAverageRating is calculated by this Actor from fetched public review star values. It is not an official Internet Archive average rating.
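
The statistics listed above can be approximated from the review rows themselves. The following is a minimal sketch, not the Actor's actual implementation. It assumes each row carries the stars and createdAt fields shown in the example output, that missing stars are counted together with zero-star records, and that both are excluded from the average. All result keys other than computedAverageRating are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Optional


def summarize_reviews(reviews: list[dict]) -> dict:
    """Compute simple review statistics from rows with 'stars' and 'createdAt'."""
    stars = [review.get("stars") for review in reviews]
    # Assumption: only positive star values count as ratings; zero or
    # missing values are tallied separately and left out of the average.
    rated = [s for s in stars if isinstance(s, (int, float)) and s > 0]
    # Timestamps like "2020-01-01 00:00:00" sort correctly as plain strings.
    dates = sorted(r["createdAt"] for r in reviews if r.get("createdAt"))
    average: Optional[float] = (
        round(sum(rated) / len(rated), 2) if rated else None
    )
    return {
        "reviewCount": len(reviews),
        "nonZeroRatingCount": len(rated),
        "zeroRatingCount": len(stars) - len(rated),
        "ratingDistribution": dict(Counter(rated)),
        "firstReviewDate": dates[0] if dates else None,
        "latestReviewDate": dates[-1] if dates else None,
        "computedAverageRating": average,
    }
```

Leaving zero-star records out of the average keeps unrated reviews from lowering the score. Whether the Actor makes the same choice is not stated here, so compare against a real run before relying on it.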

Supported Inputs

Use plain fields. No prefixes are needed.

For known Archive.org items, add values to the sources list, one value per entry:

https://archive.org/details/goodytwoshoes00newyiala
goodytwoshoes00newyiala
9780140449136

The sources list accepts:

  • Archive.org item URLs, for example https://archive.org/details/adventuresoftomsa00twai
  • Archive.org metadata URLs, for example https://archive.org/metadata/adventuresoftomsa00twai
  • Archive.org review URLs, for example https://archive.org/metadata/adventuresoftomsa00twai/reviews
  • Direct Archive.org identifiers
  • ISBN values, for example 9780140449136

For discovery, use the dedicated list fields:

  • isbns: ISBN-10 or ISBN-13 values
  • creators: creator or author names
  • collections: Archive.org collection identifiers
  • subjects: subject terms
  • searchQueries: raw Archive.org search queries
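
The different value forms above (URLs, identifiers, ISBNs) can be told apart with a small helper. This is a hedged sketch of one way to do it, not the Actor's own parsing logic. The function name classify_source and the ISBN pattern are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumption: an ISBN is 13 digits, or 9 digits plus a digit or "X" check character.
ISBN_RE = re.compile(r"^(?:\d{9}[\dXx]|\d{13})$")


def classify_source(value: str) -> tuple[str, str]:
    """Return (kind, normalized_value), where kind is 'identifier' or 'isbn'."""
    text = value.strip()
    if text.startswith(("http://", "https://")):
        parts = [part for part in urlparse(text).path.split("/") if part]
        # /details/{identifier} and /metadata/{identifier}[/reviews] both
        # carry the identifier as the second path segment.
        if len(parts) >= 2 and parts[0] in ("details", "metadata"):
            return ("identifier", parts[1])
        raise ValueError(f"Unrecognized Archive.org URL: {value}")
    compact = text.replace("-", "").replace(" ", "")
    if ISBN_RE.match(compact):
        return ("isbn", compact.upper())
    # Anything else is treated as a direct Archive.org identifier.
    return ("identifier", text)
```

One limitation of this sketch is that an identifier made up of exactly 10 or 13 digits would be read as an ISBN.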

For review-focused discovery, use searchQueries with mediatype:texts AND reviewdate:*. General searches can match many books that do not expose public review objects.

Example Inputs

Scrape reviews from an Archive.org item URL

{
  "sources": ["https://archive.org/details/goodytwoshoes00newyiala"],
  "maxItems": 1,
  "maxReviewsPerItem": 10,
  "onlyItemsWithReviews": true
}

Scrape a known Archive.org identifier

{
  "sources": ["goodytwoshoes00newyiala"],
  "maxReviewsPerItem": 10
}

Find reviewed books by creator

{
  "creators": ["Mark Twain"],
  "maxItems": 10,
  "maxReviewsPerItem": 10,
  "onlyItemsWithReviews": true
}

Search reviewed text items

{
  "searchQueries": ["mediatype:texts AND reviewdate:*"],
  "maxItems": 10,
  "maxReviewsPerItem": 10
}

Dataset Output

The Actor always writes one dataset item per public review. Each review row includes key item metadata, so the output is ready for review analysis, sentiment workflows, rating exports, and spreadsheets.

Example Output

{
  "entityType": "review",
  "source": "internet_archive",
  "identifier": "goodytwoshoes00newyiala",
  "itemUrl": "https://archive.org/details/goodytwoshoes00newyiala",
  "metadataUrl": "https://archive.org/metadata/goodytwoshoes00newyiala",
  "title": "Goody Two-Shoes",
  "creators": [],
  "isbn13": null,
  "publisher": null,
  "publishedDate": "1900",
  "language": ["eng"],
  "subjects": ["fiction"],
  "collections": ["internetarchivebooks"],
  "mediatype": "texts",
  "coverUrl": "https://archive.org/services/img/goodytwoshoes00newyiala",
  "reviewTitle": "Great book",
  "reviewText": "Review text here",
  "stars": 5,
  "rating": 5,
  "ratingScale": 5,
  "reviewerName": "Reviewer",
  "reviewerItemName": "@reviewer",
  "createdAt": "2020-01-01 00:00:00",
  "reviewUpdatedAt": "2020-01-02 00:00:00",
  "reviewSource": "metadata_reviews_branch",
  "reviewHash": "stable_hash",
  "scrapedAt": "2026-05-28T00:00:00.000Z"
}
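
Because each row carries a reviewHash, repeated or scheduled runs can be reduced to the reviews not seen before. The sketch below shows one possible way to do that. The function name select_new_reviews is hypothetical, and the sketch assumes reviewHash stays the same for the same review across runs, as the field description suggests.

```python
def select_new_reviews(rows: list[dict], seen_hashes: set[str]) -> list[dict]:
    """Return rows whose reviewHash is not yet in seen_hashes, updating the set."""
    fresh = []
    for row in rows:
        review_hash = row.get("reviewHash")
        # Rows without a hash cannot be deduplicated, so they are skipped here.
        if review_hash is None or review_hash in seen_hashes:
            continue
        seen_hashes.add(review_hash)
        fresh.append(row)
    return fresh
```

Persisting seen_hashes between runs (for example in a file or key-value store) is left to the caller.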

How It Works

The Actor uses public Internet Archive API-style endpoints:

  • https://archive.org/metadata/{identifier}
  • https://archive.org/metadata/{identifier}/reviews
  • https://archive.org/services/search/v1/scrape for discovery
  • https://archive.org/advancedsearch.php as a fallback search endpoint

It does not use Playwright, Puppeteer, browser automation, login sessions, captcha solving, or private APIs.
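
To illustrate how plain HTTP access to these endpoints might look, here is a minimal sketch using only the Python standard library. It is not the Actor's source code. The helper names, the User-Agent string, and the choice of advancedsearch.php parameters (q, fl[], rows, output) are assumptions made for this example, and the shape of the returned JSON is deliberately left uninterpreted.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://archive.org"


def metadata_url(identifier: str) -> str:
    """Build the public metadata URL for an item identifier."""
    return f"{BASE_URL}/metadata/{quote(identifier, safe='')}"


def reviews_url(identifier: str) -> str:
    """Build the public reviews URL for an item identifier."""
    return f"{metadata_url(identifier)}/reviews"


def search_url(query: str, rows: int = 10) -> str:
    """Build a fallback search URL that asks for identifiers as JSON."""
    params = [("q", query), ("fl[]", "identifier"),
              ("rows", str(rows)), ("output", "json")]
    return f"{BASE_URL}/advancedsearch.php?{urlencode(params)}"


def fetch_json(url: str, timeout: float = 30.0) -> object:
    """Fetch a URL and parse the body as JSON (performs a network request)."""
    # Hypothetical User-Agent; replace with one that identifies your client.
    headers = {"User-Agent": "example-archive-client/0.1"}
    with urlopen(Request(url, headers=headers), timeout=timeout) as response:
        return json.load(response)
```

For example, fetch_json(reviews_url("goodytwoshoes00newyiala")) would request the reviews for the default item. Inspect the response before assuming any particular structure.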

Read-Only And Compliance

This Actor is read-only. It does not write, update, upload, or delete content, submit ratings or reviews, or modify anything on Internet Archive. It extracts public metadata and public reviews only. It does not scrape private data, bypass login walls, bypass access controls, solve captchas, or attempt to evade rate limits.

Optional Internet Archive credentials can be provided for advanced or future reliability needs, but they are not required for the core scraper.

Filtering

onlyItemsWithReviews defaults to true, so items with no public reviews are skipped. The dataset is review-only, so no-review items do not produce dataset rows.

You can also filter by:

  • Minimum star rating
  • Maximum star rating
  • Text contained in the review title or body
  • Item language
  • Archive.org media type
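
The same kinds of filters can also be applied to downloaded rows on the client side. The following sketch assumes the row fields from the example output (stars, reviewTitle, reviewText, language, mediatype). The function and its parameter names are hypothetical and are not the Actor's input field names.

```python
from typing import Iterable, Optional


def filter_reviews(
    rows: Iterable[dict],
    min_stars: Optional[float] = None,
    max_stars: Optional[float] = None,
    text_contains: Optional[str] = None,
    language: Optional[str] = None,
    mediatype: Optional[str] = None,
) -> list[dict]:
    """Keep only rows that satisfy every filter that was provided."""
    needle = text_contains.lower() if text_contains else None
    kept = []
    for row in rows:
        stars = row.get("stars")
        # Rows with missing stars fail any star-based filter.
        if min_stars is not None and (stars is None or stars < min_stars):
            continue
        if max_stars is not None and (stars is None or stars > max_stars):
            continue
        if needle is not None:
            title = row.get("reviewTitle") or ""
            body = row.get("reviewText") or ""
            if needle not in f"{title} {body}".lower():
                continue
        # 'language' is a list in the example output, e.g. ["eng"].
        if language is not None and language not in (row.get("language") or []):
            continue
        if mediatype is not None and row.get("mediatype") != mediatype:
            continue
        kept.append(row)
    return kept
```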

API Usage

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")
actor_client = client.actor("TheScrapeLab/internet-archive-book-reviews-scraper")

run_input = {
    "sources": ["https://archive.org/details/goodytwoshoes00newyiala"],
    "maxItems": 1,
    "maxReviewsPerItem": 10,
    "onlyItemsWithReviews": True,
}

# Start the Actor and wait for the run to finish.
run = actor_client.call(run_input=run_input)

# Each dataset item is one public review.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

Common Use Cases

  • Build a dataset of Archive.org book reviews and ratings
  • Enrich book records with Internet Archive metadata, ISBNs, subjects, and collection data
  • Monitor public reviews for selected Archive.org items
  • Research public-domain and library collections
  • Prepare structured review data for analytics, dashboards, LLM workflows, or data warehouses
  • Find reviewed books by creator, collection, subject, or ISBN

Rate Limits And Polite Usage

The Actor uses a descriptive User-Agent, request timeouts, retries with exponential backoff, 429 handling, concurrency limits, and configurable request delays. For broad discovery jobs, keep concurrency modest and limit maxItems to the amount of data you actually need.
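
As an illustration of retries with exponential backoff and 429 handling, here is a minimal sketch. It is not the Actor's implementation. The function name, the default attempt count and delays, and the decision to honor a numeric Retry-After header are all assumptions made for this example.

```python
import time
from typing import Callable
from urllib.error import HTTPError


def fetch_with_retries(
    fetch: Callable[[str], object],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> object:
    """Call fetch(url), retrying on HTTP 429 and 5xx with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except HTTPError as error:
            retryable = error.code == 429 or 500 <= error.code < 600
            if not retryable or attempt == max_attempts - 1:
                raise
            # Prefer a numeric Retry-After header when the server sends one.
            retry_after = error.headers.get("Retry-After") if error.headers else None
            if retry_after and retry_after.isdigit():
                delay = float(retry_after)
            else:
                delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            sleep(delay)
    return None  # Only reached when max_attempts is zero or negative.
```

Passing fetch and sleep as parameters keeps the retry logic separate from the HTTP client, so it can be exercised without network access.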

Troubleshooting

No reviews found: The item may not have public reviews, or the public reviews endpoint may not expose reviews for that identifier. Because output is one row per review, no-review items produce no dataset rows.

Identifier not found: Use the Archive.org item identifier from /details/{identifier}.

Search returns items but no reviews: Search can match books that have no public reviews. Put mediatype:texts AND reviewdate:* in searchQueries for review-focused discovery.

Review stars missing or zero: Internet Archive review objects may omit stars or include zero-star records. The Actor preserves those values and reports zero-rating counts separately.

Rate limited or slow runs: Lower concurrency, increase request delay, or reduce maxItems. Archive.org endpoint latency can vary.

Metadata endpoint returned an error: The item may be unavailable, removed, or temporarily inaccessible through the public metadata endpoint.

Search API unavailable: The Actor logs a warning and continues with any already known direct URLs or identifiers.

Credentials are optional: Direct public item scraping does not require Internet Archive credentials.

Pricing Suggestion

Recommended pricing model: pay per event.

  • review-scraped for each public review successfully written
  • item-enriched for each item whose metadata is successfully enriched
  • search-result-processed for each search result considered during discovery

Charge mainly per review scraped. Optionally charge a small amount per enriched item. Do not charge for failed items.