Internet Archive Scraper
Pricing
$2.00 / 1,000 items returned
Searches the Internet Archive (archive.org) by keyword and returns structured items (title, creator, year, downloads, subjects, item URL); filter by media type and sort by downloads or upload date.
Developer
Dami's Studio
Maintained by Community
Search the Internet Archive (archive.org) by keyword and get back clean, structured items — title, creator, year, downloads, subjects, description and the item URL. No API key, no login.
Built on the public advancedsearch.php JSON API. Filter by media type (texts, audio, movies, software, image, …), sort by downloads, date, or relevance, and paginate transparently up to your item limit.
What you get per item
identifier, title, creator, year, date, mediaType, downloads, subjects (array), description (first ~500 chars), publicdate, and url (https://archive.org/details/{identifier}).
Fields that can be null
- `title`, `creator`, `year`, `date`, `description`, `publicdate`: null when archive.org's metadata doesn't include that field for an item.
- `subjects`: empty array when the item has no subject tags.
- `downloads`: `0` when not reported.
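The null-handling rules above can be sketched as a small mapping function. This is an illustration only, not the actor's actual code: `to_row` and `DESCRIPTION_LIMIT` are hypothetical names, the raw field names are the ones listed under Notes, and treating "first ~500 chars" as a hard cut is an assumption.

```python
from typing import Any

DESCRIPTION_LIMIT = 500  # assumption: "first ~500 chars" treated as a hard cut


def to_row(doc: dict[str, Any]) -> dict[str, Any]:
    """Map one raw search doc to an output row (hypothetical helper)."""
    # archive.org may return a multi-valued field as a string or a list.
    subjects = doc.get("subject") or []
    if isinstance(subjects, str):
        subjects = [subjects]

    description = doc.get("description")
    if isinstance(description, list):
        description = " ".join(str(part) for part in description)
    if description is not None:
        description = description[:DESCRIPTION_LIMIT]

    identifier = doc["identifier"]
    return {
        "identifier": identifier,
        "title": doc.get("title"),            # None when metadata lacks it
        "creator": doc.get("creator"),
        "year": doc.get("year"),
        "date": doc.get("date"),
        "mediaType": doc.get("mediatype"),
        "downloads": int(doc.get("downloads") or 0),  # 0 when not reported
        "subjects": list(subjects),           # empty list when no subject tags
        "description": description,
        "publicdate": doc.get("publicdate"),
        "url": f"https://archive.org/details/{identifier}",
    }
```

A consumer of the dataset can rely on the same shape: check for `None` on the nullable fields, and treat `subjects` as always being a list.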
Input
| Field | Notes |
|---|---|
| `query` | Required. Keywords, e.g. nasa apollo, jazz. Supports archive.org Lucene operators, e.g. `title:(grateful dead) AND year:[1977 TO 1980]`. |
| `mediaType` | Restrict to one type: texts, audio, movies, software, image, web, data, collection. Empty = any. |
| `sort` | downloads (default), date, publicdate, or relevance. |
| `maxItems` | Max unique items to return (default 100). Paginates 100 per request until reached or exhausted. |
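As a rough sketch of how the input rules in the table could be checked before a run, the snippet below normalizes an input object and applies the documented defaults. `normalize_input` is a hypothetical helper, the shape of the diagnostic row is assumed from the `ok` and `errorCode` fields mentioned under Output, and falling back to `downloads` for an unknown `sort` value is an assumption (the listing does not say how that case is handled).

```python
from typing import Any

ALLOWED_MEDIA_TYPES = {
    "texts", "audio", "movies", "software", "image", "web", "data", "collection",
}
ALLOWED_SORTS = {"downloads", "date", "publicdate", "relevance"}


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Return normalized input, or a BAD_INPUT diagnostic row (hypothetical)."""
    query = str(raw.get("query") or "").strip()
    media_type = str(raw.get("mediaType") or "").strip()

    # Empty query or unknown mediaType is reported as BAD_INPUT.
    if not query or (media_type and media_type not in ALLOWED_MEDIA_TYPES):
        return {"ok": False, "errorCode": "BAD_INPUT"}

    sort = raw.get("sort") or "downloads"
    if sort not in ALLOWED_SORTS:
        sort = "downloads"  # assumption: unknown sort falls back to the default

    return {
        "ok": True,
        "query": query,
        "mediaType": media_type,  # empty string means "any"
        "sort": sort,
        "maxItems": int(raw.get("maxItems") or 100),
    }
```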
Output
One dataset row per item. Pricing is pay-per-result: you are only charged for genuine item rows (`ok: true`). Diagnostic rows are never charged. These include:
- empty or invalid input (`errorCode: "BAD_INPUT"`: empty query or an unknown `mediaType`),
- no results for the query (`NO_RESULTS`),
- rate limits or network errors (`RATE_LIMITED` / `NETWORK` / `SERVER_ERROR`).
Results are de-duplicated by identifier.
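To make the billing and de-duplication rules concrete, here is a minimal sketch that counts chargeable rows in a mixed list of item and diagnostic rows and estimates the cost at the listed $2.00 per 1,000 items. `chargeable_rows` and `estimate_cost` are hypothetical helpers, and the assumption is that every item row carries `ok: true` and an `identifier`.

```python
from typing import Any

PRICE_PER_1000 = 2.00  # USD, from the listed pricing


def chargeable_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep genuine item rows, de-duplicated by identifier, in original order."""
    seen: set[str] = set()
    kept: list[dict[str, Any]] = []
    for row in rows:
        if not row.get("ok"):
            continue  # diagnostic rows (BAD_INPUT, NO_RESULTS, ...) are free
        identifier = row["identifier"]
        if identifier in seen:
            continue  # same item seen on an earlier page
        seen.add(identifier)
        kept.append(row)
    return kept


def estimate_cost(rows: list[dict[str, Any]]) -> float:
    """Estimated charge in USD for a list of dataset rows."""
    return len(chargeable_rows(rows)) * PRICE_PER_1000 / 1000
```

Under these assumptions, a run that returns the default 100 unique items would cost about $0.20.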
Proxy
The archive.org advancedsearch API is a public, no-auth JSON endpoint with no anti-bot protection, so no proxy is required and the actor runs without one by default (saving proxy credits). Only enable Apify Proxy if you hit IP rate limits at very high volume.
Troubleshooting
- Getting a `BAD_INPUT` row? Provide a non-empty `query`, and if you set `mediaType` make sure it's one of the allowed values.
- `NO_RESULTS`? The query matched nothing on archive.org. Broaden the keywords or remove the media-type filter.
- Want fewer/more results? Adjust `maxItems`. The archive can return very large result sets for broad queries.
Example
{ "query": "jazz", "mediaType": "audio", "sort": "downloads", "maxItems": 50 }
Notes
The actor calls advancedsearch.php with output=json, requesting identifier, title, creator, year, date, mediatype, downloads, description, subject, and publicdate, then maps each doc to a clean row. Pagination uses page with 100 rows per request until your maxItems is reached or the numFound total is exhausted.
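The request and pagination scheme described above might look roughly like the following. It is a sketch under stated assumptions rather than the actor's implementation: `build_url` and `pages_to_fetch` are hypothetical helpers, combining the media-type filter into the query as `mediatype:<type>` is an assumption, and so is mapping each non-relevance sort to descending order.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://archive.org/advancedsearch.php"
FIELDS = [
    "identifier", "title", "creator", "year", "date",
    "mediatype", "downloads", "description", "subject", "publicdate",
]
ROWS_PER_PAGE = 100


def build_url(query: str, page: int, media_type: str = "",
              sort: str = "downloads") -> str:
    """Build one advancedsearch.php request URL (hypothetical helper)."""
    # Assumption: the media-type filter is ANDed onto the user's query.
    q = f"({query}) AND mediatype:{media_type}" if media_type else query
    params = [("q", q), ("rows", ROWS_PER_PAGE), ("page", page), ("output", "json")]
    params += [("fl[]", field) for field in FIELDS]
    if sort != "relevance":
        # Assumption: non-relevance sorts are sent in descending order.
        params.append(("sort[]", f"{sort} desc"))
    return f"{BASE_URL}?{urlencode(params)}"


def pages_to_fetch(num_found: int, max_items: int) -> int:
    """Number of 100-row pages needed to reach maxItems or exhaust numFound."""
    wanted = min(num_found, max_items)
    return math.ceil(wanted / ROWS_PER_PAGE)
```

For example, with `numFound` of 1234 and `maxItems` of 250, three pages are requested and the last one is trimmed to the remaining 50 items.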