Internet Archive Scraper
Pricing
Pay per event
Developer
Stas Persiianenko
Search and extract metadata from the Internet Archive — the world's largest digital library with 40M+ items. Find books, videos, audio, software, images, and web archives.
What does Internet Archive Scraper do?
Internet Archive Scraper searches the Internet Archive's vast collection and extracts structured metadata for each item. Get titles, creators, descriptions, download counts, file formats, subjects, and direct links. Supports filtering by media type (books, movies, audio, software, etc.) and sorting by popularity, date, or title.
The Internet Archive hosts over 40 million items including 28M+ books, 14M+ audio recordings, 7M+ videos, and millions of software titles, images, and web pages.
Why use Internet Archive Scraper?
- 40M+ items — access the largest free digital library in the world
- All media types — books, movies, audio, software, images, data, web archives
- Download stats — see how popular each item is with download counts
- Multiple formats — items often have PDF, EPUB, MOBI, MP3, MP4, and more
- Pagination — extract up to 500 items per search query
- Sorting — sort by relevance, most downloaded, newest, oldest, or title
Use cases
- Research — find public domain books, papers, and historical documents
- Media analysis — track download trends for audio, video, and software
- Content curation — discover popular public domain content for projects
- Digital preservation — catalog archived websites and historical software
- Education — find open educational resources and textbooks
- Historical research — access vintage software, old magazines, and rare recordings
How to scrape the Internet Archive
- Go to the Internet Archive Scraper page on Apify Store.
- Click Try for free to open the actor configuration.
- Add search terms to the Search queries list.
- Optionally filter by Media type and choose a Sort by order.
- Click Start and wait for the run to finish.
- Download your data as JSON, CSV, or Excel, or connect via the Apify API.
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `searchQueries` | array | Yes | (none) | Search terms to find items |
| `mediaType` | string | No | all | Filter: texts, movies, audio, software, image, data, web, collection, etree |
| `sortBy` | string | No | relevance | Sort: downloads desc, date desc, date asc, titleSorter asc/desc |
| `maxResults` | integer | No | 50 | Max results per query (1–500) |
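The constraints in the parameter table can be checked before starting a run. The following is a minimal sketch, not part of the actor: `build_run_input` is a hypothetical helper, and it assumes that "titleSorter asc/desc" expands to the two strings `titleSorter asc` and `titleSorter desc`. Keys left at their documented defaults (`all`, `relevance`) are omitted so the actor applies its own default.

```python
# Allowed values copied from the parameter table; the expanded
# "titleSorter asc" / "titleSorter desc" strings are an assumption.
ALLOWED_MEDIA_TYPES = {
    "all", "texts", "movies", "audio", "software",
    "image", "data", "web", "collection", "etree",
}
ALLOWED_SORTS = {
    "relevance", "downloads desc", "date desc", "date asc",
    "titleSorter asc", "titleSorter desc",
}


def build_run_input(search_queries, media_type="all", sort_by="relevance", max_results=50):
    """Validate values against the documented limits and return a run-input dict."""
    # Drop empty or whitespace-only search terms.
    queries = [q.strip() for q in search_queries if q and q.strip()]
    if not queries:
        raise ValueError("searchQueries needs at least one non-empty term")
    if media_type not in ALLOWED_MEDIA_TYPES:
        raise ValueError(f"unknown mediaType: {media_type!r}")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"unknown sortBy: {sort_by!r}")
    if not 1 <= max_results <= 500:
        raise ValueError("maxResults must be between 1 and 500")

    run_input = {"searchQueries": queries, "maxResults": max_results}
    # Only send optional keys that differ from the documented defaults.
    if media_type != "all":
        run_input["mediaType"] = media_type
    if sort_by != "relevance":
        run_input["sortBy"] = sort_by
    return run_input
```

The returned dictionary can be passed as the run input when calling the actor through the Apify client.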
Example input
{
  "searchQueries": ["machine learning", "public domain films"],
  "sortBy": "downloads desc",
  "maxResults": 50
}
Output example
Each item returns structured metadata:
{
  "identifier": "deep-learning-collection-pdf",
  "title": "Deep Learning Collection PDF",
  "creator": "",
  "description": "A collection of deep learning resources...",
  "mediaType": "texts",
  "collection": "opensource",
  "date": "2019-01-15",
  "year": "2019",
  "language": "English",
  "subject": ["deep learning", "machine learning", "neural networks"],
  "downloads": 89271,
  "itemSize": 524288000,
  "filesCount": 12,
  "format": ["Archive BitTorrent", "PDF", "Text"],
  "licenseUrl": "",
  "detailsUrl": "https://archive.org/details/deep-learning-collection-pdf",
  "downloadUrl": "https://archive.org/download/deep-learning-collection-pdf",
  "thumbnailUrl": "https://archive.org/services/img/deep-learning-collection-pdf",
  "searchQuery": "machine learning",
  "scrapedAt": "2026-03-03T05:42:00.000Z"
}
Output fields
| Field | Type | Description |
|---|---|---|
| `identifier` | string | Unique Archive.org item identifier |
| `title` | string | Item title |
| `creator` | string | Author, artist, or uploader |
| `description` | string | Item description |
| `mediaType` | string | Media type (texts, movies, audio, software, etc.) |
| `collection` | string | Collection(s) the item belongs to |
| `date` | string | Publication or upload date |
| `year` | string | Year of publication |
| `language` | string | Content language |
| `subject` | array | Subject tags and categories |
| `downloads` | number | Total download count |
| `itemSize` | number | Total size in bytes |
| `filesCount` | number | Number of files in the item |
| `format` | array | Available file formats |
| `licenseUrl` | string | License URL if specified |
| `detailsUrl` | string | Link to item details page |
| `downloadUrl` | string | Direct download link |
| `thumbnailUrl` | string | Thumbnail image URL |
| `searchQuery` | string | The search query that found this item |
| `scrapedAt` | string | ISO 8601 timestamp of extraction |
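Because `subject` and `format` are arrays, items need flattening before they fit a spreadsheet-style row. The sketch below shows one way to do that with the standard library. The column selection, the `"; "` separator, and the helper names `item_to_row` and `items_to_csv` are illustrative choices, not part of the actor.

```python
import csv
import io

# A hypothetical subset of the output fields, chosen for illustration.
CSV_COLUMNS = [
    "identifier", "title", "mediaType", "year", "downloads",
    "itemSize", "subject", "format", "detailsUrl",
]


def item_to_row(item):
    """Flatten one dataset item into a dict of scalar values."""
    row = {}
    for column in CSV_COLUMNS:
        value = item.get(column, "")
        if value is None:
            value = ""
        elif isinstance(value, list):
            # Array fields such as subject and format become one joined string.
            value = "; ".join(str(v) for v in value)
        row[column] = value
    return row


def items_to_csv(items):
    """Return CSV text (with a header row) for a list of dataset items."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for item in items:
        writer.writerow(item_to_row(item))
    return buffer.getvalue()
```

Missing fields become empty cells, so items with sparse metadata still produce a complete row.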
How much does it cost to scrape the Internet Archive?
Internet Archive Scraper uses pay-per-event pricing:
| Event | Price |
|---|---|
| Run started | $0.001 |
| Item extracted | $0.001 per item |
Cost examples
| Items | Cost |
|---|---|
| 50 items (1 search) | $0.051 |
| 200 items (2 searches) | $0.201 |
| 500 items (5 searches) | $0.501 |
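The cost examples follow from the two event prices: one run-start charge plus a per-item charge. A small sketch of that arithmetic, assuming all searches are submitted in a single run (as the table's figures imply); `estimate_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding in the totals.

```python
from decimal import Decimal

# Event prices taken from the pricing table.
RUN_START_PRICE = Decimal("0.001")
ITEM_PRICE = Decimal("0.001")


def estimate_cost(total_items, runs=1):
    """Estimate the charge in USD for a number of extracted items and runs."""
    if total_items < 0 or runs < 1:
        raise ValueError("total_items must be >= 0 and runs must be >= 1")
    return RUN_START_PRICE * runs + ITEM_PRICE * total_items


print(estimate_cost(50))   # 0.051
print(estimate_cost(200))  # 0.201
print(estimate_cost(500))  # 0.501
```

Splitting the same searches across several runs adds one run-start charge per extra run.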
API usage
Python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("automation-lab/internet-archive-scraper").call(run_input={
    "searchQueries": ["artificial intelligence"],
    "mediaType": "texts",
    "sortBy": "downloads desc",
    "maxResults": 25
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} — {item['downloads']:,} downloads — {item['detailsUrl']}")
Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/internet-archive-scraper').call({
    searchQueries: ['artificial intelligence'],
    mediaType: 'texts',
    sortBy: 'downloads desc',
    maxResults: 25,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`${item.title} — ${item.downloads.toLocaleString()} downloads`);
});
cURL
curl -X POST "https://api.apify.com/v2/acts/automation-lab~internet-archive-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"searchQueries":["artificial intelligence"],"mediaType":"texts","sortBy":"downloads desc","maxResults":25}'
Use with AI agents via MCP
Internet Archive Scraper is available as a tool for AI assistants via the Model Context Protocol (MCP).
Setup for Claude Code
claude mcp add --transport http apify "https://mcp.apify.com"
Setup for Claude Desktop, Cursor, or VS Code
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com"
    }
  }
}
Example prompts
- "Search the Internet Archive for 'artificial intelligence' books"
- "Find archived copies of old websites"
- "Get the most downloaded public domain films from archive.org"
Learn more in the Apify MCP documentation.
Legality
Scraping publicly available data is generally considered legal in the United States, following the Ninth Circuit Court of Appeals ruling in hiQ Labs v. LinkedIn. This actor only accesses publicly available information and does not require authentication. Always review and comply with the target website's Terms of Service before scraping. For personal data, ensure compliance with GDPR, CCPA, and other applicable privacy regulations.
FAQ
Why does my search return fewer results than expected?
The maxResults parameter caps results per query at 500. The Internet Archive's search API also limits results for very broad queries. Use more specific search terms or add a mediaType filter to get more targeted results.
Can I download the actual files, not just metadata?
This scraper extracts metadata only. Each result includes a downloadUrl field that links directly to the item's download page on archive.org. You can use that URL to download files programmatically or in your browser.
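One way to get from a scraped `identifier` to individual files is sketched below. It does not use the actor: it assumes the Internet Archive's own public metadata endpoint (`https://archive.org/metadata/<identifier>`) and the `https://archive.org/download/<identifier>/<filename>` URL pattern, which matches the `downloadUrl` shown in the output example. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def file_url(identifier, filename):
    """Build a direct URL for one file inside an archive.org item."""
    return "https://archive.org/download/{}/{}".format(
        urllib.parse.quote(identifier, safe=""),
        urllib.parse.quote(filename),  # keeps "/" so sub-directories survive
    )


def list_item_files(identifier, timeout=30):
    """Fetch the file names of an item from archive.org (network call)."""
    url = "https://archive.org/metadata/" + urllib.parse.quote(identifier, safe="")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        metadata = json.load(response)
    # Each entry in "files" is expected to carry a "name" key.
    return [entry["name"] for entry in metadata.get("files", []) if "name" in entry]


# Example with a made-up file name:
print(file_url("deep-learning-collection-pdf", "chapter 1.pdf"))
# https://archive.org/download/deep-learning-collection-pdf/chapter%201.pdf
```

Calling `list_item_files` for each identifier and passing the names to `file_url` yields one URL per file. Keep request rates modest, since these calls go to archive.org directly.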
Integrations
Connect Internet Archive Scraper to your workflow with Apify integrations:
- Webhooks — trigger actions when extraction completes
- Google Sheets — export archive data to spreadsheets
- Slack — get notified about new uploads matching your criteria
- Zapier / Make — connect to 5,000+ apps and services
- REST API — call the actor programmatically from any language
Tips and best practices
- Use specific search terms for focused results — broad queries return millions of items
- Filter by `mediaType` to narrow results (e.g., "texts" for books, "software" for games)
- Sort by "downloads desc" to find the most popular items
- The `downloadUrl` provides direct access to download all files for an item
- `thumbnailUrl` works for most items and is useful for building visual catalogs
- Use `subject` tags for categorization; they're user-contributed and can be very detailed
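As an illustration of the last tip, subject tags can be tallied across a result set to see which categories dominate. This is a minimal sketch over items shaped like the output example. `top_subjects` is a hypothetical helper, and lowercasing the tags is an assumption made here to merge variants such as "Deep Learning" and "deep learning".

```python
from collections import Counter


def top_subjects(items, limit=10):
    """Return the most common subject tags as (tag, item_count) pairs."""
    counts = Counter()
    for item in items:
        subjects = item.get("subject") or []
        if isinstance(subjects, str):
            # Tolerate a single tag delivered as a plain string.
            subjects = [subjects]
        # A set counts each tag at most once per item.
        counts.update({s.strip().lower() for s in subjects if s.strip()})
    return counts.most_common(limit)


items = [
    {"subject": ["Deep Learning", "machine learning"]},
    {"subject": ["machine learning", "neural networks"]},
    {"subject": []},
]
print(top_subjects(items, limit=1))  # [('machine learning', 2)]
```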
Other scrapers
- iTunes Search Scraper — Search and extract media metadata from the Apple iTunes catalog
- Wikipedia Scraper — Extract articles and data from Wikipedia
- IMDB Scraper — Scrape movie and TV show data from IMDB
Changelog
- v0.1 — Initial release with full-text search, media type filtering, and sorting