Archive.org Scraper
Pricing
from $5.00 / 1,000 results
Scrape the Internet Archive (archive.org). Search 50M+ texts, 13M+ audio, 16M+ movies, and 1.3M+ software items. Get metadata, download counts, file lists, and more via public APIs.
Developer
lulz bot
Scrape the Internet Archive (archive.org) to search and extract metadata from one of the world's largest digital libraries. Access 50M+ texts, 13M+ audio recordings, 16M+ movies, 1.3M+ software titles, and millions of images -- all via Archive.org's public APIs.
What does Archive.org Scraper do?
This actor searches Archive.org's massive collection and extracts structured metadata for each item. It uses the official public Advanced Search API and Metadata API -- no authentication needed.
Key capabilities:
- Search across all media types or filter by texts, audio, movies, software, or images
- Sort by downloads, date, or title
- Extract up to 10,000 items per run
- Optionally fetch full item details including file lists, formats, ratings, and reviews
- Respectful rate limiting (500ms between requests)
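As a rough illustration of the two public endpoints mentioned above, the sketch below builds an Advanced Search URL and a Metadata API URL. The helper names, the `FIELDS` list, and the way the media type filter is folded into the query are assumptions made for this example, not the actor's actual implementation.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"
METADATA_ENDPOINT = "https://archive.org/metadata"

# Assumed subset of fields to request; the actor's real field list is not documented here.
FIELDS = ["identifier", "title", "creator", "date", "mediatype", "downloads"]


def build_search_url(query, media_type="all", sort="downloads desc", rows=100, page=1):
    """Return an Advanced Search URL for one page of JSON results."""
    # Assumption: a media type filter is expressed as an extra mediatype clause.
    q = query if media_type == "all" else f"({query}) AND mediatype:{media_type}"
    params = [("q", q)]
    params += [("fl[]", name) for name in FIELDS]
    params += [("sort[]", sort), ("rows", rows), ("page", page), ("output", "json")]
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def build_metadata_url(identifier):
    """Return the Metadata API URL for a single item."""
    return f"{METADATA_ENDPOINT}/{identifier}"


if __name__ == "__main__":
    print(build_search_url("jazz", media_type="audio"))
    print(build_metadata_url("greatgatsby_1201"))
```

Both URLs can be opened in a browser or fetched with any HTTP client, since neither endpoint requires authentication.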
Input
| Field | Type | Description | Default |
|---|---|---|---|
| searchQuery | string | Search query (supports field queries like creator:NASA) | required |
| mediaType | string | Filter: all, texts, audio, movies, software, image | all |
| sortBy | string | Sort order: downloads, newest, oldest, or title | downloads desc |
| maxResults | integer | Max items to return (1-10,000) | 200 |
| fetchDetails | boolean | Fetch full metadata per item (slower, adds files/formats) | false |
| proxyConfiguration | object | Optional proxy configuration | none |
Example input
    {
        "searchQuery": "public domain classical music",
        "mediaType": "audio",
        "sortBy": "downloads desc",
        "maxResults": 100,
        "fetchDetails": false
    }
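To make the documented defaults and bounds concrete, here is a minimal sketch of how an input object could be checked before a run. `normalize_input` is a hypothetical helper written for this example; the actor's own validation may differ.

```python
ALLOWED_MEDIA_TYPES = {"all", "texts", "audio", "movies", "software", "image"}

# Defaults copied from the input table above.
DEFAULTS = {
    "mediaType": "all",
    "sortBy": "downloads desc",
    "maxResults": 200,
    "fetchDetails": False,
}


def normalize_input(raw):
    """Apply the documented defaults and bounds to a raw input dict."""
    query = str(raw.get("searchQuery", "")).strip()
    if not query:
        raise ValueError("searchQuery is required")
    merged = {**DEFAULTS, **raw, "searchQuery": query}
    if merged["mediaType"] not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unsupported mediaType: {merged['mediaType']!r}")
    # Clamp to the documented range of 1 to 10,000 items.
    merged["maxResults"] = max(1, min(10_000, int(merged["maxResults"])))
    merged["fetchDetails"] = bool(merged["fetchDetails"])
    return merged


if __name__ == "__main__":
    print(normalize_input({"searchQuery": "public domain classical music", "mediaType": "audio"}))
```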
Advanced query examples
- creator:"Grateful Dead" -- items by a specific creator
- subject:jazz AND date:[1950-01-01 TO 1960-01-01] -- jazz items from the 1950s
- collection:prelinger -- items from the Prelinger Archives
- title:"machine learning" AND mediatype:texts -- texts about machine learning
- licenseurl:creativecommons.org -- Creative Commons licensed items
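Queries like these can also be assembled programmatically. The helpers below (`field`, `date_range`, `all_of`) are hypothetical names used only for this sketch; they produce plain strings suitable for the searchQuery input.

```python
def field(name, value):
    """Return a field clause, quoting the value if it contains whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text.replace('"', '\\"') + '"'
    return f"{name}:{text}"


def date_range(start, end, name="date"):
    """Return an inclusive range clause such as date:[1950-01-01 TO 1960-01-01]."""
    return f"{name}:[{start} TO {end}]"


def all_of(*clauses):
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(field("creator", "Grateful Dead"))
    print(all_of(field("subject", "jazz"), date_range("1950-01-01", "1960-01-01")))
```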
Output
Each item in the dataset contains:
| Field | Type | Description |
|---|---|---|
| identifier | string | Unique Archive.org identifier |
| title | string | Item title |
| creator | string | Creator/author |
| date | string | Publication or creation date |
| description | string | Item description |
| mediatype | string | Media type (texts, audio, movies, etc.) |
| collection | string | Collection(s) the item belongs to |
| downloads | number | Total download count |
| subject | string | Subject tags |
| language | string | Language |
| source | string | Original source |
| archiveUrl | string | Direct link to the item on archive.org |
| scrapedAt | string | ISO timestamp of when the data was collected |
Additional fields when fetchDetails is enabled:
| Field | Type | Description |
|---|---|---|
| files | array | List of files with name, format, and size |
| format | string | Available formats (e.g., "PDF; EPUB; Text") |
| licenseurl | string | License URL |
| runtime | string | Runtime/duration for audio/video |
| addeddate | string | Date item was added to Archive.org |
| publicdate | string | Date item was made public |
| uploader | string | Uploader username |
| num_reviews | number | Number of reviews |
| avg_rating | number | Average rating |
Example output
    {
        "identifier": "greatgatsby_1201",
        "title": "The Great Gatsby",
        "creator": "F. Scott Fitzgerald",
        "date": "1925",
        "description": "The Great Gatsby is a 1925 novel by American writer F. Scott Fitzgerald...",
        "mediatype": "texts",
        "collection": "opensource; americana",
        "downloads": 1523847,
        "subject": "fiction; american literature; jazz age",
        "language": "English",
        "source": null,
        "archiveUrl": "https://archive.org/details/greatgatsby_1201",
        "scrapedAt": "2026-04-25T12:00:00.000Z"
    }
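The sketch below shows one way a raw search document could be turned into a record of this shape. It assumes that multi-valued fields arrive as lists and are joined with "; " (as the example output suggests) and that missing fields become null; `to_record` and `join_values` are hypothetical names, not the actor's code.

```python
from datetime import datetime, timezone

# Search-level output fields listed in the table above.
OUTPUT_FIELDS = [
    "identifier", "title", "creator", "date", "description", "mediatype",
    "collection", "downloads", "subject", "language", "source",
]


def join_values(value):
    """Flatten a list into a '; '-separated string; pass other values through."""
    if isinstance(value, (list, tuple)):
        return "; ".join(str(v) for v in value)
    return value


def to_record(doc, now=None):
    """Map a raw search document to the documented output shape."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {name: join_values(doc.get(name)) for name in OUTPUT_FIELDS}
    record["archiveUrl"] = f"https://archive.org/details/{doc['identifier']}"
    # ISO 8601 with millisecond precision and a trailing Z for UTC.
    record["scrapedAt"] = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return record


if __name__ == "__main__":
    doc = {"identifier": "greatgatsby_1201", "collection": ["opensource", "americana"]}
    print(to_record(doc))
```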
Use cases
- Research: Search and catalog public domain texts, audio, and video
- Digital preservation: Track download counts and availability of archived content
- Content discovery: Find Creative Commons media for projects
- Data analysis: Analyze trends across Archive.org's collections
- Library science: Build catalogs of specific collections or media types
- Machine learning: Discover training datasets from public domain sources
Performance and costs
- Without details: ~200 items/minute (paginated search, 500ms delay)
- With details: ~2 items/second (individual metadata API calls)
- The Archive.org API returns up to 100 items per page
- Proxy is optional but recommended for large scrapes (10,000+ items)
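For a sense of how the page size and delay interact, the following sketch plans the search pages needed for a given maxResults. It assumes the API derives the result offset from the page number and the rows value, so rows stays fixed at 100 and the last page is trimmed on the client side. `plan_pages` and `minimum_delay_seconds` are hypothetical helpers.

```python
import math

PAGE_SIZE = 100      # documented maximum items per page
DELAY_SECONDS = 0.5  # documented pause between requests


def plan_pages(max_results, page_size=PAGE_SIZE):
    """Return (page_number, items_to_keep) pairs covering max_results items."""
    pages = math.ceil(max_results / page_size)
    plan = []
    for page in range(1, pages + 1):
        remaining = max_results - (page - 1) * page_size
        plan.append((page, min(page_size, remaining)))
    return plan


def minimum_delay_seconds(request_count, delay=DELAY_SECONDS):
    """Lower bound on time spent waiting: one pause between consecutive requests."""
    return max(0, request_count - 1) * delay


if __name__ == "__main__":
    plan = plan_pages(250)
    print(plan)
    print(minimum_delay_seconds(len(plan)))
```

The delay figure is only a floor on run time; network latency and response size add to it, which is consistent with the lower observed throughput quoted above.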
Limitations
- The Advanced Search API caps at ~10,000 results per query. Use more specific queries for larger datasets.
- Some items may have incomplete metadata (missing creator, date, etc.)
- File lists are only available when fetchDetails is enabled
- Rate limited to 500ms between requests to be respectful to Archive.org servers
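One possible way to work within the ~10,000 result cap is to split a broad query into narrower ones, for example one per calendar year, and run each separately. The sketch below is an illustration under that assumption; `split_by_year` is a hypothetical helper, each sub-query is still subject to the cap, and results should be de-duplicated by identifier afterwards.

```python
def split_by_year(query, first_year, last_year, field="date"):
    """Return one sub-query per calendar year in [first_year, last_year]."""
    if first_year > last_year:
        raise ValueError("first_year must not be greater than last_year")
    return [
        f"({query}) AND {field}:[{year}-01-01 TO {year}-12-31]"
        for year in range(first_year, last_year + 1)
    ]


if __name__ == "__main__":
    for sub_query in split_by_year("subject:jazz", 1950, 1952):
        print(sub_query)
```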