Internet Archive Scraper

Pricing

$2.00 / 1,000 items returned

Searches the Internet Archive (archive.org) by keyword and returns structured items (title, creator, year, downloads, subjects, item URL); filter by media type and sort by downloads or upload date.

Developer

Dami's Studio

Maintained by Community

Search the Internet Archive (archive.org) by keyword and get back clean, structured items — title, creator, year, downloads, subjects, description and the item URL. No API key, no login.

Built on the public advancedsearch.php JSON API. Filter by media type (texts, audio, movies, software, image, …), sort by downloads, date, or relevance, and paginate transparently up to your item limit.
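As a rough illustration of the request described above, the following sketch builds an advancedsearch.php URL from the actor's inputs. The function name build_search_url, the SORT_CLAUSES mapping, and the way the media type is folded into the query are assumptions made for this example, not the actor's confirmed implementation.

```python
from urllib.parse import urlencode

API_URL = "https://archive.org/advancedsearch.php"

# Fields requested from the API, as listed in the Notes section.
FIELDS = ["identifier", "title", "creator", "year", "date", "mediatype",
          "downloads", "description", "subject", "publicdate"]

# Assumption: how each sort option might translate to a sort[] clause.
SORT_CLAUSES = {
    "downloads": "downloads desc",
    "date": "date desc",
    "publicdate": "publicdate desc",
    "relevance": None,  # assumed: omit sort[] and keep the API's default order
}


def build_search_url(query, media_type="", sort="downloads", page=1, rows=100):
    """Return one advancedsearch.php URL for the given page (hypothetical helper)."""
    # Assumption: the media-type filter is expressed as a Lucene clause.
    q = f"({query}) AND mediatype:{media_type}" if media_type else query
    params = [("q", q)]
    params += [("fl[]", field) for field in FIELDS]
    clause = SORT_CLAUSES[sort]
    if clause:
        params.append(("sort[]", clause))
    params += [("rows", rows), ("page", page), ("output", "json")]
    return f"{API_URL}?{urlencode(params)}"


print(build_search_url("jazz", media_type="audio", sort="downloads"))
```

Keeping the URL construction in one pure function makes it easy to inspect exactly what is sent to archive.org before any request is made.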

What you get per item

identifier, title, creator, year, date, mediaType, downloads, subjects (array), description (first ~500 chars), publicdate, and url (https://archive.org/details/{identifier}).

Fields that can be null

  • title, creator, year, date, description, publicdate: null when archive.org's metadata doesn't include that field for an item.
  • subjects: empty array when the item has no subject tags.
  • downloads: 0 when not reported.
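The fallback rules above could be applied when mapping one raw search doc to an output row, roughly as sketched below. The helper name to_row is hypothetical, and the handling of fields that arrive as either a string or a list is an assumption about the raw data rather than documented actor behavior.

```python
def to_row(doc):
    """Map one raw advancedsearch doc to an output row (hypothetical helper)."""
    # Assumption: "subject" may be missing, a single string, or a list.
    subject = doc.get("subject")
    if subject is None:
        subjects = []
    elif isinstance(subject, list):
        subjects = subject
    else:
        subjects = [subject]

    # Assumption: "description" may also arrive as a list of strings.
    description = doc.get("description")
    if isinstance(description, list):
        description = " ".join(description)

    identifier = doc["identifier"]
    return {
        "identifier": identifier,
        "title": doc.get("title"),        # None when absent
        "creator": doc.get("creator"),    # None when absent
        "year": doc.get("year"),
        "date": doc.get("date"),
        "mediaType": doc.get("mediatype"),
        "downloads": doc.get("downloads") or 0,  # 0 when not reported
        "subjects": subjects,                     # [] when untagged
        "description": description[:500] if description else None,
        "publicdate": doc.get("publicdate"),
        "url": f"https://archive.org/details/{identifier}",
    }


print(to_row({"identifier": "example-item", "subject": "jazz"}))
```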

Input

Field       Notes
query       Required. Keywords, e.g. nasa apollo, jazz. Supports archive.org Lucene operators, e.g. title:(grateful dead) AND year:[1977 TO 1980].
mediaType   Restrict to one type: texts, audio, movies, software, image, web, data, collection. Empty = any.
sort        downloads (default), date, publicdate, or relevance.
maxItems    Max unique items to return (default 100). Paginates 100 per request until reached or exhausted.
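To make the input rules concrete, here is a minimal validation sketch. The function name normalize_input, the message field, the ok: False flag on the diagnostic row, and the fallback to downloads for an unrecognized sort value are all assumptions for illustration; only the field names, defaults, allowed values, and the BAD_INPUT code come from this page.

```python
ALLOWED_MEDIA_TYPES = {"texts", "audio", "movies", "software",
                       "image", "web", "data", "collection"}
ALLOWED_SORTS = {"downloads", "date", "publicdate", "relevance"}


def bad_input(message):
    """Build a diagnostic row (the shape beyond errorCode is assumed)."""
    return {"ok": False, "errorCode": "BAD_INPUT", "message": message}


def normalize_input(raw):
    """Return (config, None) on success or (None, diagnostic_row) on bad input."""
    query = (raw.get("query") or "").strip()
    if not query:
        return None, bad_input("query must be a non-empty string")

    media_type = (raw.get("mediaType") or "").strip()
    if media_type and media_type not in ALLOWED_MEDIA_TYPES:
        return None, bad_input(f"unknown mediaType: {media_type}")

    sort = raw.get("sort") or "downloads"
    if sort not in ALLOWED_SORTS:
        sort = "downloads"  # assumption: fall back to the documented default

    max_items = raw.get("maxItems") or 100  # documented default
    return {"query": query, "mediaType": media_type,
            "sort": sort, "maxItems": max_items}, None


print(normalize_input({"query": "jazz", "mediaType": "audio", "maxItems": 50}))
```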

Output

One dataset row per item. Pricing is pay-per-result: you are only charged for genuine item rows (ok: true). Diagnostic rows are never charged — this includes:

  • empty/invalid input (errorCode: "BAD_INPUT" — empty query or an unknown mediaType),
  • no results for the query (NO_RESULTS),
  • rate limits or network errors (RATE_LIMITED / NETWORK / SERVER_ERROR).

Results are de-duplicated by identifier.
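On the consumer side, the billing and de-duplication rules can be checked against a downloaded dataset with a sketch like the one below. The helper names are hypothetical, and the price constant simply restates the $2.00 per 1,000 items listed on this page; the real charge is whatever Apify bills for the run.

```python
def dedupe_by_identifier(rows):
    """Keep the first row for each identifier; rows without one pass through."""
    seen = set()
    unique = []
    for row in rows:
        ident = row.get("identifier")
        if ident is not None:
            if ident in seen:
                continue
            seen.add(ident)
        unique.append(row)
    return unique


def estimate_cost(rows, price_per_1000=2.00):
    """Estimate the charge: only rows with ok == True are counted."""
    billable = sum(1 for row in rows if row.get("ok") is True)
    return billable * price_per_1000 / 1000


rows = [
    {"ok": True, "identifier": "a"},
    {"ok": True, "identifier": "a"},           # duplicate, dropped
    {"ok": True, "identifier": "b"},
    {"ok": False, "errorCode": "NO_RESULTS"},  # diagnostic row, not counted
]
unique = dedupe_by_identifier(rows)
print(len(unique), estimate_cost(unique))
```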

Proxy

The archive.org advancedsearch API is a public, no-auth JSON endpoint with no anti-bot protection, so no proxy is required and the actor runs without one by default (saving proxy credits). Only enable Apify Proxy if you hit IP rate limits at very high volume.

Troubleshooting

  • Getting a BAD_INPUT row? Provide a non-empty query, and if you set mediaType make sure it's one of the allowed values.
  • NO_RESULTS? The query matched nothing on archive.org — broaden the keywords or remove the media-type filter.
  • Want fewer/more results? Adjust maxItems. The archive can return very large result sets for broad queries.

Example

{ "query": "jazz", "mediaType": "audio", "sort": "downloads", "maxItems": 50 }

Notes

The actor calls advancedsearch.php with output=json, requesting identifier, title, creator, year, date, mediatype, downloads, description, subject, and publicdate, then maps each doc to a clean row. Pagination uses the page parameter with 100 rows per request until your maxItems is reached or the numFound total is exhausted.
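The pagination loop described here might look roughly like the following sketch. The names collect_docs and fake_fetch are hypothetical; fake_fetch is an offline stand-in for the HTTP call and mimics only the response.numFound and response.docs parts of the JSON body, so the example runs without network access.

```python
def collect_docs(fetch_page, max_items=100, rows_per_page=100):
    """Page through results until max_items is reached or numFound is exhausted.

    fetch_page(page, rows) must return the parsed JSON body of one request.
    """
    docs, seen = [], set()
    page = 1
    while len(docs) < max_items:
        response = fetch_page(page, rows_per_page)["response"]
        batch = response.get("docs", [])
        if not batch:
            break  # nothing more to read
        for doc in batch:
            ident = doc["identifier"]
            if ident in seen or len(docs) >= max_items:
                continue
            seen.add(ident)
            docs.append(doc)
        if page * rows_per_page >= response.get("numFound", 0):
            break  # last page already consumed
        page += 1
    return docs


def fake_fetch(page, rows, total=250):
    """Offline stand-in for the real request: serves `total` synthetic docs."""
    start = (page - 1) * rows
    ids = range(start, min(start + rows, total))
    return {"response": {"numFound": total,
                         "docs": [{"identifier": f"item-{i}"} for i in ids]}}


print(len(collect_docs(fake_fetch, max_items=150)))
```

Passing the fetch function in as a parameter keeps the loop testable; a real run would swap fake_fetch for a function that performs the HTTP request and parses the JSON.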