Archive.org Scraper

Pricing

from $5.00 / 1,000 results

Scrape the Internet Archive (archive.org). Search 50M+ texts, 13M+ audio, 16M+ movies, and 1.3M+ software items. Get metadata, download counts, file lists, and more via public APIs.

Rating: 0.0 (0)

Developer: lulz bot (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 3 days ago

Scrape the Internet Archive (archive.org) to search and extract metadata from one of the world's largest digital libraries. Access 50M+ texts, 13M+ audio recordings, 16M+ movies, 1.3M+ software titles, and millions of images -- all via Archive.org's public APIs.

What does Archive.org Scraper do?

This actor searches Archive.org's massive collection and extracts structured metadata for each item. It uses the official public Advanced Search API and Metadata API -- no authentication needed.

Key capabilities:

  • Search across all media types or filter by texts, audio, movies, software, or images
  • Sort by downloads, date, or title
  • Extract up to 10,000 items per run
  • Optionally fetch full item details including file lists, formats, ratings, and reviews
  • Respectful rate limiting (500ms between requests)
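
The search flow described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual code: it assumes the public `advancedsearch.php` endpoint with JSON output, and the field list, function names, and timeout are choices made for this example.

```python
import json
import time
import urllib.parse
import urllib.request

SEARCH_URL = "https://archive.org/advancedsearch.php"
# Illustrative subset of fields to request; the actor may ask for more.
FIELDS = ["identifier", "title", "creator", "date", "mediatype", "downloads"]


def build_search_url(query, page, rows=100, sort="downloads desc"):
    """Build one Advanced Search request URL with JSON output."""
    params = [("q", query), ("rows", rows), ("page", page),
              ("output", "json"), ("sort[]", sort)]
    params += [("fl[]", field) for field in FIELDS]
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def search(query, max_results=200, delay=0.5):
    """Yield search docs page by page, pausing `delay` seconds between requests."""
    page, seen = 1, 0
    while seen < max_results:
        url = build_search_url(query, page)
        with urllib.request.urlopen(url, timeout=30) as resp:
            docs = json.load(resp)["response"]["docs"]
        if not docs:
            break  # no more results for this query
        for doc in docs[: max_results - seen]:
            yield doc
        seen += len(docs)
        page += 1
        time.sleep(delay)  # the 500 ms pause mentioned above
```

Calling `list(search("creator:NASA", max_results=150))` would issue two page requests under these assumptions.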

Input

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| searchQuery | string | Search query (supports field queries like creator:NASA) | required |
| mediaType | string | Filter: all, texts, audio, movies, software, image | all |
| sortBy | string | Sort order: downloads, newest, oldest, or title | downloads desc |
| maxResults | integer | Max items to return (1-10,000) | 200 |
| fetchDetails | boolean | Fetch full metadata per item (slower, adds files/formats) | false |
| proxyConfiguration | object | Optional proxy configuration | none |

Example input

{
  "searchQuery": "public domain classical music",
  "mediaType": "audio",
  "sortBy": "downloads desc",
  "maxResults": 100,
  "fetchDetails": false
}
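
Before starting a run, it can help to check an input object against the documented fields. The sketch below applies the defaults and ranges from the input table; `normalize_input` is a hypothetical helper for illustration, not part of the actor.

```python
# Values taken from the input table; the helper itself is hypothetical.
ALLOWED_MEDIA_TYPES = {"all", "texts", "audio", "movies", "software", "image"}

DEFAULTS = {
    "mediaType": "all",
    "sortBy": "downloads desc",
    "maxResults": 200,
    "fetchDetails": False,
}


def normalize_input(raw):
    """Apply the documented defaults and check the documented ranges."""
    query = raw.get("searchQuery")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("searchQuery is required")
    run_input = {**DEFAULTS, **raw}  # caller-supplied values override defaults
    if run_input["mediaType"] not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unsupported mediaType: {run_input['mediaType']!r}")
    max_results = run_input["maxResults"]
    if not isinstance(max_results, int) or not 1 <= max_results <= 10_000:
        raise ValueError("maxResults must be an integer from 1 to 10,000")
    return run_input
```

Passing the example input through `normalize_input` returns it unchanged, since every optional field is already set.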

Advanced query examples

  • creator:"Grateful Dead" -- items by a specific creator
  • subject:jazz AND date:[1950-01-01 TO 1959-12-31] -- jazz items from the 1950s
  • collection:prelinger -- items from the Prelinger Archives
  • title:"machine learning" AND mediatype:texts -- texts with "machine learning" in the title
  • licenseurl:creativecommons.org -- Creative Commons licensed items
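
Queries like these can also be assembled programmatically. The helpers below are hypothetical names used only for this sketch; they quote values containing whitespace and join clauses with AND, following the syntax of the examples above.

```python
def field_clause(field, value):
    """Return field:value, quoting values that contain whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text.replace('"', '\\"') + '"'
    return f"{field}:{text}"


def date_range(start, end):
    """Return a date range clause; square brackets make both ends inclusive."""
    return f"date:[{start} TO {end}]"


def all_of(*clauses):
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


query = all_of(field_clause("title", "machine learning"),
               field_clause("mediatype", "texts"))
# query == 'title:"machine learning" AND mediatype:texts'
```

The resulting string can be passed as searchQuery.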

Output

Each item in the dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| identifier | string | Unique Archive.org identifier |
| title | string | Item title |
| creator | string | Creator/author |
| date | string | Publication or creation date |
| description | string | Item description |
| mediatype | string | Media type (texts, audio, movies, etc.) |
| collection | string | Collection(s) the item belongs to |
| downloads | number | Total download count |
| subject | string | Subject tags |
| language | string | Language |
| source | string | Original source |
| archiveUrl | string | Direct link to the item on archive.org |
| scrapedAt | string | ISO timestamp of when the data was collected |

Additional fields when fetchDetails is enabled:

| Field | Type | Description |
| --- | --- | --- |
| files | array | List of files with name, format, and size |
| format | string | Available formats (e.g., "PDF; EPUB; Text") |
| licenseurl | string | License URL |
| runtime | string | Runtime/duration for audio/video |
| addeddate | string | Date item was added to Archive.org |
| publicdate | string | Date item was made public |
| uploader | string | Uploader username |
| num_reviews | number | Number of reviews |
| avg_rating | number | Average rating |
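
As a rough sketch of where the detail fields could come from, the code below reads one item from the public Metadata API (`https://archive.org/metadata/<identifier>`) and reduces the response to the fields in the table. The function names are hypothetical, the actor's real mapping may differ, and `num_reviews` and `avg_rating` are left out because their source is not described here.

```python
import json
import urllib.parse
import urllib.request


def fetch_metadata(identifier):
    """Download the raw Metadata API response for one item."""
    url = "https://archive.org/metadata/" + urllib.parse.quote(identifier)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def extract_details(payload):
    """Reduce a Metadata API response to the detail fields listed above."""
    files = [
        {"name": f.get("name"), "format": f.get("format"), "size": f.get("size")}
        for f in payload.get("files", [])
    ]
    # Assumption: "format" is the set of distinct file formats joined with "; ".
    formats = sorted({f["format"] for f in files if f["format"]})
    meta = payload.get("metadata", {})
    return {
        "files": files,
        "format": "; ".join(formats),
        "licenseurl": meta.get("licenseurl"),
        "runtime": meta.get("runtime"),
        "addeddate": meta.get("addeddate"),
        "publicdate": meta.get("publicdate"),
        "uploader": meta.get("uploader"),
    }
```

Each item needs its own request, which is why enabling fetchDetails is slower than a plain search.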

Example output

{
  "identifier": "greatgatsby_1201",
  "title": "The Great Gatsby",
  "creator": "F. Scott Fitzgerald",
  "date": "1925",
  "description": "The Great Gatsby is a 1925 novel by American writer F. Scott Fitzgerald...",
  "mediatype": "texts",
  "collection": "opensource; americana",
  "downloads": 1523847,
  "subject": "fiction; american literature; jazz age",
  "language": "English",
  "source": null,
  "archiveUrl": "https://archive.org/details/greatgatsby_1201",
  "scrapedAt": "2026-04-25T12:00:00.000Z"
}
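
A record of this shape could be produced from a raw search result along the following lines. This is a sketch under assumptions: `to_record` is a hypothetical helper, and joining list-valued fields with "; " is inferred from the example output rather than confirmed behaviour.

```python
from datetime import datetime, timezone

TEXT_FIELDS = ("title", "creator", "date", "description", "mediatype",
               "collection", "subject", "language", "source")


def join_values(value):
    """Flatten a list-valued field into a '; '-separated string; keep None as None."""
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        return "; ".join(str(v) for v in value)
    return str(value)


def to_record(doc, now=None):
    """Map one raw search doc to the flat output shape shown above."""
    now = now or datetime.now(timezone.utc)
    identifier = doc["identifier"]
    record = {"identifier": identifier}
    for key in TEXT_FIELDS:
        record[key] = join_values(doc.get(key))
    record["downloads"] = int(doc.get("downloads") or 0)
    record["archiveUrl"] = f"https://archive.org/details/{identifier}"
    # ISO 8601 in UTC with millisecond precision and a trailing "Z".
    stamp = now.astimezone(timezone.utc).isoformat(timespec="milliseconds")
    record["scrapedAt"] = stamp.replace("+00:00", "Z")
    return record
```

Missing fields come out as null, matching the `"source": null` entry in the example.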

Use cases

  • Research: Search and catalog public domain texts, audio, and video
  • Digital preservation: Track download counts and availability of archived content
  • Content discovery: Find Creative Commons media for projects
  • Data analysis: Analyze trends across Archive.org's collections
  • Library science: Build catalogs of specific collections or media types
  • Machine learning: Discover training datasets from public domain sources

Performance and costs

  • Without details: ~200 items/minute (paginated search, 500ms delay)
  • With details: ~2 items/second (individual metadata API calls)
  • The Archive.org API returns up to 100 items per page
  • Proxy is optional but recommended for large scrapes (10,000+ items)
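
The figures above allow a back-of-the-envelope estimate of run time and cost. The sketch below simply applies the quoted throughput and the listed starting price of $5.00 per 1,000 results; real runs will vary, and both helper names are hypothetical.

```python
def estimate_minutes(items, fetch_details=False):
    """Rough run time from the quoted throughput figures."""
    if fetch_details:
        return items / 2 / 60   # ~2 items/second with details
    return items / 200          # ~200 items/minute without details


def estimate_cost_usd(items, usd_per_1000=5.0):
    """Lower-bound cost at the listed 'from' price per 1,000 results."""
    return items / 1000 * usd_per_1000


print(estimate_minutes(10_000))                      # 50.0
print(round(estimate_minutes(10_000, True), 1))      # 83.3
print(estimate_cost_usd(10_000))                     # 50.0
```

Under these figures, a full 10,000-item run takes about 50 minutes without details and about 83 minutes with them.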

Limitations

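  • The Advanced Search API caps at ~10,000 results per query. To collect a larger dataset, split the search into several narrower queries (for example by date range or collection) and run each one separately.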
  • Some items may have incomplete metadata (missing creator, date, etc.)
  • File lists are only available when fetchDetails is enabled
  • Rate limited to 500ms between requests to be respectful to Archive.org servers
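
One way to stay under the ~10,000-result cap is to slice a broad query by year and run each slice separately. This is only a sketch: `yearly_slices` is a hypothetical helper, it assumes items carry a usable date field, and a single busy year can still exceed the cap.

```python
def yearly_slices(base_query, first_year, last_year):
    """Return one query per calendar year, each with an inclusive date range."""
    return [
        f"({base_query}) AND date:[{year}-01-01 TO {year}-12-31]"
        for year in range(first_year, last_year + 1)
    ]


for query in yearly_slices("subject:jazz", 1950, 1952):
    print(query)
# (subject:jazz) AND date:[1950-01-01 TO 1950-12-31]
# (subject:jazz) AND date:[1951-01-01 TO 1951-12-31]
# (subject:jazz) AND date:[1952-01-01 TO 1952-12-31]
```

Each slice can be passed as searchQuery in its own run. Items without a date will not match any slice, and results should be de-duplicated by identifier when merged.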