Internet Archive Reviews & Metadata Scraper
Pricing
Pay per usage
Internet Archive Reviews & Metadata Scraper
Extract public Archive.org book metadata, ISBNs, ratings, and user reviews from public Internet Archive endpoints. Start from URLs, identifiers, ISBNs, creators, collections, subjects, or search queries. Output is always one dataset row per public review. No API key required.
Pricing
Pay per usage
Rating
0.0
(0)
Developer
Inus Grobler
Maintained by CommunityActor stats
0
Bookmarked
2
Total users
1
Monthly active users
a day ago
Last modified
Categories
Share
Archive.org scraper for extracting public Internet Archive book metadata, item details, ISBNs, ratings, and public user reviews. Add Archive.org item URLs, identifiers, ISBNs, creators, collections, subjects, or search queries and get one dataset row per public review.
No Internet Archive API key is required for public item URLs, public identifiers, public metadata, or public reviews.
Quick Start
The default input is intentionally small and fast. If you click Start without changing anything, the Actor scrapes up to 10 public reviews from one known Archive.org item:
{"sources": ["https://archive.org/details/goodytwoshoes00newyiala"],"maxItems": 1,"maxReviewsPerItem": 10,"onlyItemsWithReviews": true}
Use that default run to confirm the output format, then replace sources or use the discovery fields for your own books, authors, collections, ISBNs, subjects, or search queries.
Why Use This Actor
- Scrape Archive.org book reviews without browser automation
- Extract Internet Archive item metadata from public API-style endpoints
- Enrich review records with title, creators, ISBNs, publisher, language, subjects, collections, and cover URL
- Search by ISBN, creator, collection, subject, or Archive.org query
- Export results as JSON, CSV, Excel, XML, or via the Apify API
- Run as a one-off scrape, scheduled research workflow, or data pipeline input
What It Extracts
Book and item metadata
- Archive.org identifier, title, creators/authors, description, publisher, publication date, year, language, subjects, collections, and media type
- ISBN-10, ISBN-13, all ISBN values, Open Library IDs, OCLC, LCCN, contributors, sponsors, scanners, uploader, rights, license URL, and external identifiers
- Cover thumbnail URL, item URL, metadata URL, file count, item size, and update timestamps
Public reviews
- Review title and body text
- Star rating, normalized rating, and rating scale
- Reviewer name and reviewer item/profile reference when available
- Created date, updated/review date, source item identifier, source item title, source item URL, stable review hash, and scrape timestamp
Computed review statistics
- Review count
- Non-zero rating count and zero-rating count
- Rating distribution
- First review date and latest review date
computedAverageRating
computedAverageRating is calculated by this Actor from fetched public review star values. It is not an official Internet Archive average rating.
Supported Inputs
Use plain fields. No prefixes are needed.
For known Archive.org items, add values to the sources list:
https://archive.org/details/goodytwoshoes00newyialagoodytwoshoes00newyiala9780140449136
For discovery, use the dedicated list fields:
- Archive.org item URLs, for example
https://archive.org/details/adventuresoftomsa00twai - Archive.org metadata URLs, for example
https://archive.org/metadata/adventuresoftomsa00twai - Archive.org review URLs, for example
https://archive.org/metadata/adventuresoftomsa00twai/reviews - Direct Archive.org identifiers
isbns: ISBN-10 or ISBN-13 valuescreators: creator or author namescollections: Archive.org collection identifierssubjects: subject termssearchQueries: raw Archive.org search queries
For review-focused discovery, use searchQueries with mediatype:texts AND reviewdate:*. General searches can match many books that do not expose public review objects.
Example Inputs
Scrape reviews from an Archive.org item URL
{"sources": ["https://archive.org/details/goodytwoshoes00newyiala"],"maxItems": 1,"maxReviewsPerItem": 10,"onlyItemsWithReviews": true}
Scrape a known Archive.org identifier
{"sources": ["goodytwoshoes00newyiala"],"maxReviewsPerItem": 10}
Find reviewed books by creator
{"creators": ["Mark Twain"],"maxItems": 10,"maxReviewsPerItem": 10,"onlyItemsWithReviews": true}
Search reviewed text items
{"searchQueries": ["mediatype:texts AND reviewdate:*"],"maxItems": 10,"maxReviewsPerItem": 10}
Dataset Output
The Actor always writes one dataset item per public review. Each review row includes key item metadata, so the output is ready for review analysis, sentiment workflows, rating exports, and spreadsheets.
Example Output
{"entityType": "review","source": "internet_archive","identifier": "goodytwoshoes00newyiala","itemUrl": "https://archive.org/details/goodytwoshoes00newyiala","metadataUrl": "https://archive.org/metadata/goodytwoshoes00newyiala","title": "Goody Two-Shoes","creators": [],"isbn13": null,"publisher": null,"publishedDate": "1900","language": ["eng"],"subjects": ["fiction"],"collections": ["internetarchivebooks"],"mediatype": "texts","coverUrl": "https://archive.org/services/img/goodytwoshoes00newyiala","reviewTitle": "Great book","reviewText": "Review text here","stars": 5,"rating": 5,"ratingScale": 5,"reviewerName": "Reviewer","reviewerItemName": "@reviewer","createdAt": "2020-01-01 00:00:00","reviewUpdatedAt": "2020-01-02 00:00:00","reviewSource": "metadata_reviews_branch","reviewHash": "stable_hash","scrapedAt": "2026-05-28T00:00:00.000Z"}
How It Works
The Actor uses public Internet Archive API-style endpoints:
https://archive.org/metadata/{identifier}https://archive.org/metadata/{identifier}/reviewshttps://archive.org/services/search/v1/scrapefor discoveryhttps://archive.org/advancedsearch.phpas a fallback search endpoint
It does not use Playwright, Puppeteer, browser automation, login sessions, captcha solving, or private APIs.
Read-Only And Compliance
This Actor is read-only. It does not write, update, upload, delete, rate reviews, or modify anything on Internet Archive. It extracts public metadata and public reviews only. It does not scrape private data, bypass login walls, bypass access controls, solve captchas, or attempt to evade rate limits.
Optional Internet Archive credentials can be provided for advanced or future reliability needs, but they are not required for the core scraper.
Filtering
onlyItemsWithReviews defaults to true, so items with no public reviews are skipped. The dataset is review-only, so no-review items do not produce dataset rows.
You can also filter by:
- Minimum star rating
- Maximum star rating
- Text contained in the review title or body
- Item language
- Archive.org media type
API Usage
from apify_client import ApifyClientclient = ApifyClient("YOUR_APIFY_TOKEN")actor_client = client.actor("TheScrapeLab/internet-archive-book-reviews-scraper")run_input = {"sources": ["https://archive.org/details/goodytwoshoes00newyiala"],"maxItems": 1,"maxReviewsPerItem": 10,"onlyItemsWithReviews": True,}run = actor_client.call(run_input=run_input)for item in client.dataset(run["defaultDatasetId"]).iterate_items():print(item)
Common Use Cases
- Build a dataset of Archive.org book reviews and ratings
- Enrich book records with Internet Archive metadata, ISBNs, subjects, and collection data
- Monitor public reviews for selected Archive.org items
- Research public-domain and library collections
- Prepare structured review data for analytics, dashboards, LLM workflows, or data warehouses
- Find reviewed books by creator, collection, subject, or ISBN
Rate Limits And Polite Usage
The Actor uses a descriptive User-Agent, request timeouts, retries with exponential backoff, 429 handling, concurrency limits, and configurable request delays. For broad discovery jobs, keep concurrency modest and limit maxItems to the amount of data you actually need.
Troubleshooting
No reviews found: The item may not have public reviews, or the public reviews endpoint may not expose reviews for that identifier. Because output is one row per review, no-review items produce no dataset rows.
Identifier not found: Use the Archive.org item identifier from /details/{identifier}.
Search returns items but no reviews: Search can match books that have no public reviews. Put mediatype:texts AND reviewdate:* in searchQueries for review-focused discovery.
Review stars missing or zero: Internet Archive review objects may omit stars or include zero-star records. The Actor preserves those values and reports zero-rating counts separately.
Rate limited or slow runs: Lower concurrency, increase request delay, or reduce maxItems. Archive.org endpoint latency can vary.
Metadata endpoint returned an error: The item may be unavailable, removed, or temporarily inaccessible through the public metadata endpoint.
Search API unavailable: The Actor logs a warning and continues with any already known direct URLs or identifiers.
Credentials are optional: Direct public item scraping does not require Internet Archive credentials.
Pricing Suggestion
Recommended pricing model: pay per event.
review-scrapedfor each public review successfully writtenitem-enrichedfor each item whose metadata is successfully enrichedsearch-result-processedfor each search result considered during discovery
Charge mainly per review scraped. Optionally charge a small amount per enriched item. Do not charge for failed items.