Internet Archive Scraper
Developer: Stas Persiianenko
Search and extract metadata from the Internet Archive — the world's largest digital library with 40M+ items. Find books, videos, audio, software, images, and web archives.
What does Internet Archive Scraper do?
Internet Archive Scraper searches the Internet Archive's vast collection and extracts structured metadata for each item. Get titles, creators, descriptions, download counts, file formats, subjects, and direct links. Supports filtering by media type (books, movies, audio, software, etc.) and sorting by popularity, date, or title.
The Internet Archive hosts over 40 million items including 28M+ books, 14M+ audio recordings, 7M+ videos, and millions of software titles, images, and web pages.
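For orientation, the Internet Archive exposes a public search endpoint, `advancedsearch.php`, that takes a query, a field list, a sort order, and a row count. The sketch below only builds such a URL and makes no network request. It illustrates the kind of request involved and is not a description of how this actor is implemented; the helper name `build_search_url` and the chosen field list are assumptions.

```python
from urllib.parse import urlencode


def build_search_url(query, media_type=None, sort=None, rows=50):
    """Build an archive.org advancedsearch URL (hypothetical helper, not part of the actor)."""
    # Restrict by media type with a `mediatype:` clause when one is given.
    q = query if media_type is None else f"({query}) AND mediatype:{media_type}"
    params = [
        ("q", q),
        ("fl[]", "identifier"),  # fields to return; this selection is an assumption
        ("fl[]", "title"),
        ("fl[]", "downloads"),
        ("rows", rows),
        ("page", 1),
        ("output", "json"),
    ]
    if sort:
        params.append(("sort[]", sort))  # e.g. "downloads desc"
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("machine learning", media_type="texts", sort="downloads desc", rows=25))
```

The sort strings accepted by the actor (`downloads desc`, `titleSorter asc`) have the same shape as the sort values this endpoint accepts, which is why they are passed through unchanged here.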
Why use Internet Archive Scraper?
- 40M+ items — access the largest free digital library in the world
- All media types — books, movies, audio, software, images, data, web archives
- Download stats — see how popular each item is with download counts
- Multiple formats — items often have PDF, EPUB, MOBI, MP3, MP4, and more
- Pagination — extract up to 500 items per search query
- Sorting — sort by relevance, most downloaded, newest, oldest, or title
Use cases
- Research — find public domain books, papers, and historical documents
- Media analysis — track download trends for audio, video, and software
- Content curation — discover popular public domain content for projects
- Digital preservation — catalog archived websites and historical software
- Education — find open educational resources and textbooks
- Historical research — access vintage software, old magazines, and rare recordings
How to use Internet Archive Scraper
- Go to the Internet Archive Scraper input page.
- Add search terms to the Search queries list.
- Optionally filter by Media type and choose a Sort by order.
- Click Start and wait for the run to finish.
- Download your data in JSON, CSV, or Excel format.
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| searchQueries | array | Yes | — | Search terms to find items |
| mediaType | string | No | all | Filter: texts, movies, audio, software, image, data, web, collection, etree |
| sortBy | string | No | relevance | Sort: downloads desc, date desc, date asc, titleSorter asc/desc |
| maxResults | integer | No | 50 | Max results per query (1–500) |
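The constraints in the table can be checked locally before a run is started. The following is a minimal sketch, assuming the allowed values are exactly those listed above; the helper `build_run_input` is hypothetical and not part of the actor or the Apify client. It leaves out `mediaType` and `sortBy` when they equal the documented defaults, since whether the literal strings `all` and `relevance` are accepted as input values is an assumption.

```python
ALLOWED_MEDIA_TYPES = {
    "all", "texts", "movies", "audio", "software",
    "image", "data", "web", "collection", "etree",
}
ALLOWED_SORTS = {
    "relevance", "downloads desc", "date desc", "date asc",
    "titleSorter asc", "titleSorter desc",
}


def build_run_input(search_queries, media_type="all", sort_by="relevance", max_results=50):
    """Validate parameters against the documented limits and return an input dict."""
    queries = [q.strip() for q in search_queries if q and q.strip()]
    if not queries:
        raise ValueError("searchQueries must contain at least one non-empty term")
    if media_type not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unsupported mediaType: {media_type!r}")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sortBy: {sort_by!r}")
    if not 1 <= max_results <= 500:
        raise ValueError("maxResults must be between 1 and 500")

    run_input = {"searchQueries": queries, "maxResults": max_results}
    # Only send optional filters when they differ from the documented defaults.
    if media_type != "all":
        run_input["mediaType"] = media_type
    if sort_by != "relevance":
        run_input["sortBy"] = sort_by
    return run_input


if __name__ == "__main__":
    print(build_run_input(["machine learning", "public domain films"], sort_by="downloads desc"))
```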
Example input
{
  "searchQueries": ["machine learning", "public domain films"],
  "sortBy": "downloads desc",
  "maxResults": 50
}
Output example
Each item returns structured metadata:
{
  "identifier": "deep-learning-collection-pdf",
  "title": "Deep Learning Collection PDF",
  "creator": "",
  "description": "A collection of deep learning resources...",
  "mediaType": "texts",
  "collection": "opensource",
  "date": "2019-01-15",
  "year": "2019",
  "language": "English",
  "subject": ["deep learning", "machine learning", "neural networks"],
  "downloads": 89271,
  "itemSize": 524288000,
  "filesCount": 12,
  "format": ["Archive BitTorrent", "PDF", "Text"],
  "licenseUrl": "",
  "detailsUrl": "https://archive.org/details/deep-learning-collection-pdf",
  "downloadUrl": "https://archive.org/download/deep-learning-collection-pdf",
  "thumbnailUrl": "https://archive.org/services/img/deep-learning-collection-pdf",
  "searchQuery": "machine learning",
  "scrapedAt": "2026-03-03T05:42:00.000Z"
}
Output fields
| Field | Type | Description |
|---|---|---|
| identifier | string | Unique Archive.org item identifier |
| title | string | Item title |
| creator | string | Author, artist, or uploader |
| description | string | Item description |
| mediaType | string | Media type (texts, movies, audio, software, etc.) |
| collection | string | Collection(s) the item belongs to |
| date | string | Publication or upload date |
| year | string | Year of publication |
| language | string | Content language |
| subject | array | Subject tags and categories |
| downloads | number | Total download count |
| itemSize | number | Total size in bytes |
| filesCount | number | Number of files in the item |
| format | array | Available file formats |
| licenseUrl | string | License URL if specified |
| detailsUrl | string | Link to item details page |
| downloadUrl | string | Direct download link |
| thumbnailUrl | string | Thumbnail image URL |
| searchQuery | string | The search query that found this item |
| scrapedAt | string | ISO 8601 timestamp of extraction |
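Because `subject` and `format` are arrays, they need flattening before the items fit a flat table. The sketch below shows one way to do that with the standard library, assuming items shaped like the output example; the helper `items_to_csv` and the `"; "` separator are illustrative choices, not something the actor prescribes.

```python
import csv
import io

# Array-valued fields from the output schema that need joining into one cell.
LIST_FIELDS = ("subject", "format")


def items_to_csv(items, columns):
    """Serialize dataset items to CSV text, keeping only the requested columns."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops keys that are not in `columns`;
    # missing keys are written as empty cells.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        row = dict(item)
        for field in LIST_FIELDS:
            if isinstance(row.get(field), list):
                row[field] = "; ".join(row[field])
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = {
        "identifier": "deep-learning-collection-pdf",
        "title": "Deep Learning Collection PDF",
        "subject": ["deep learning", "machine learning", "neural networks"],
        "downloads": 89271,
    }
    print(items_to_csv([sample], ["identifier", "title", "subject", "downloads"]))
```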
Pricing
Internet Archive Scraper uses pay-per-event pricing:
| Event | Price |
|---|---|
| Run started | $0.001 |
| Item extracted | $0.001 per item |
Cost examples
| Items | Cost |
|---|---|
| 50 items (1 search) | $0.051 |
| 200 items (2 searches) | $0.201 |
| 500 items (5 searches) | $0.501 |
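The figures in the table follow from the two event prices: one run-start fee plus a per-item fee. A minimal sketch of that arithmetic is below, assuming all searches in an example are submitted in a single run (which is what the table's totals imply); the function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

RUN_START_PRICE = Decimal("0.001")  # charged once per run
ITEM_PRICE = Decimal("0.001")       # charged per extracted item


def estimate_cost(total_items, runs=1):
    """Estimate the charge in USD for a given number of items and runs."""
    return RUN_START_PRICE * runs + ITEM_PRICE * total_items


if __name__ == "__main__":
    for items in (50, 200, 500):
        print(items, "items ->", f"${estimate_cost(items)}")
```

`Decimal` is used instead of floats so that the sums of thousandths of a dollar stay exact. Splitting the same searches across several runs adds one run-start fee per extra run.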
API usage
Python
from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish
run = client.actor("YOUR_USERNAME/internet-archive-scraper").call(run_input={
    "searchQueries": ["artificial intelligence"],
    "mediaType": "texts",
    "sortBy": "downloads desc",
    "maxResults": 25,
})

# Read the results from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} — {item['downloads']:,} downloads — {item['detailsUrl']}")
Node.js
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('YOUR_USERNAME/internet-archive-scraper').call({
    searchQueries: ['artificial intelligence'],
    mediaType: 'texts',
    sortBy: 'downloads desc',
    maxResults: 25,
});

// Read the results from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`${item.title} — ${item.downloads.toLocaleString()} downloads`);
});
Integrations
Connect Internet Archive Scraper to your workflow with Apify integrations:
- Webhooks — trigger actions when extraction completes
- Google Sheets — export archive data to spreadsheets
- Slack — get notified about new uploads matching your criteria
- Zapier / Make — connect to 5,000+ apps and services
- REST API — call the actor programmatically from any language
Tips and best practices
- Use specific search terms for focused results — broad queries return millions of items
- Filter by `mediaType` to narrow results (e.g., "texts" for books, "software" for games)
- Sort by "downloads desc" to find the most popular items
- The `downloadUrl` provides direct access to download all files for an item
- `thumbnailUrl` works for most items and is useful for building visual catalogs
- Use `subject` tags for categorization; they're user-contributed and can be very detailed
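As an illustration of the last tip, subject tags can be tallied across a result set to see which categories dominate. This is a sketch under the assumption that `subject` is normally a list of strings but may occasionally be empty or missing; the helper `top_subjects` is hypothetical. Tags are lowercased before counting because user-contributed tags vary in capitalization.

```python
from collections import Counter


def top_subjects(items, n=10):
    """Return the n most frequent subject tags across items as (tag, count) pairs."""
    counts = Counter()
    for item in items:
        subjects = item.get("subject") or []
        if isinstance(subjects, str):
            # Tolerate a single tag stored as a plain string.
            subjects = [subjects]
        counts.update(s.strip().lower() for s in subjects if s.strip())
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        {"subject": ["Deep Learning", "machine learning"]},
        {"subject": ["machine learning", "neural networks"]},
    ]
    print(top_subjects(sample, n=3))
```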
Changelog
- v0.1 — Initial release with full-text search, media type filtering, and sorting