Internet Archive Scraper

Search and extract metadata from the Internet Archive. Find books, videos, audio, software, and more from 40M+ items.

Developer: Stas Persiianenko

Last modified: 2 months ago

What does Internet Archive Scraper do?

Internet Archive Scraper searches the Internet Archive's vast collection and extracts structured metadata for each item. Get titles, creators, descriptions, download counts, file formats, subjects, and direct links. Supports filtering by media type (books, movies, audio, software, etc.) and sorting by popularity, date, or title.

The Internet Archive hosts over 40 million items including 28M+ books, 14M+ audio recordings, 7M+ videos, and millions of software titles, images, and web pages.

Why use Internet Archive Scraper?

40M+ items — access the largest free digital library in the world
All media types — books, movies, audio, software, images, data, web archives
Download stats — see how popular each item is with download counts
Multiple formats — items often have PDF, EPUB, MOBI, MP3, MP4, and more
Pagination — extract up to 500 items per search query
Sorting — sort by relevance, most downloaded, newest, oldest, or title

Use cases

Research — find public domain books, papers, and historical documents
Media analysis — track download trends for audio, video, and software
Content curation — discover popular public domain content for projects
Digital preservation — catalog archived websites and historical software
Education — find open educational resources and textbooks
Historical research — access vintage software, old magazines, and rare recordings

How to scrape the Internet Archive

1. Go to the Internet Archive Scraper page on Apify Store.
2. Click Try for free to open the actor configuration.
3. Add search terms to the Search queries list.
4. Optionally filter by Media type and choose a Sort by order.
5. Click Start and wait for the run to finish.
6. Download your data as JSON, CSV, or Excel, or connect via the Apify API.

Input parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `searchQueries` | array | Yes | none | Search terms to find items |
| `mediaType` | string | No | all | Filter: texts, movies, audio, software, image, data, web, collection, etree |
| `sortBy` | string | No | relevance | Sort: downloads desc, date desc, date asc, titleSorter asc/desc |
| `maxResults` | integer | No | 50 | Max results per query (1–500) |
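The constraints in the table can be checked on the client before a run is started. The following is a minimal sketch, not part of the actor: `build_run_input` is a hypothetical helper, the allowed values are copied from the table above, and optional keys are simply omitted so that the actor applies its own defaults.

```python
from typing import Any, Optional

# Allowed values copied from the parameter table; the actor may accept
# more or fewer values than this sketch assumes.
MEDIA_TYPES = {"texts", "movies", "audio", "software", "image",
               "data", "web", "collection", "etree"}
SORT_ORDERS = {"downloads desc", "date desc", "date asc",
               "titleSorter asc", "titleSorter desc"}


def build_run_input(search_queries: list[str],
                    media_type: Optional[str] = None,
                    sort_by: Optional[str] = None,
                    max_results: int = 50) -> dict[str, Any]:
    """Hypothetical helper: validate values and return the actor input dict."""
    # Drop empty or whitespace-only search terms.
    queries = [q.strip() for q in search_queries if q and q.strip()]
    if not queries:
        raise ValueError("searchQueries needs at least one non-empty term")
    if not 1 <= max_results <= 500:
        raise ValueError("maxResults must be between 1 and 500")

    run_input: dict[str, Any] = {"searchQueries": queries,
                                 "maxResults": max_results}
    # Optional keys are left out when unset, so the actor's defaults apply.
    if media_type is not None:
        if media_type not in MEDIA_TYPES:
            raise ValueError(f"unknown mediaType: {media_type!r}")
        run_input["mediaType"] = media_type
    if sort_by is not None:
        if sort_by not in SORT_ORDERS:
            raise ValueError(f"unknown sortBy: {sort_by!r}")
        run_input["sortBy"] = sort_by
    return run_input
```

Failing early on a typo such as `"download desc"` avoids paying the run-start fee for a run that would return nothing useful.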

Example input

{
    "searchQueries": ["machine learning", "public domain films"],
    "sortBy": "downloads desc",
    "maxResults": 50
}

Output example

Each item returns structured metadata:

{
    "identifier": "deep-learning-collection-pdf",
    "title": "Deep Learning Collection PDF",
    "creator": "",
    "description": "A collection of deep learning resources...",
    "mediaType": "texts",
    "collection": "opensource",
    "date": "2019-01-15",
    "year": "2019",
    "language": "English",
    "subject": ["deep learning", "machine learning", "neural networks"],
    "downloads": 89271,
    "itemSize": 524288000,
    "filesCount": 12,
    "format": ["Archive BitTorrent", "PDF", "Text"],
    "licenseUrl": "",
    "detailsUrl": "https://archive.org/details/deep-learning-collection-pdf",
    "downloadUrl": "https://archive.org/download/deep-learning-collection-pdf",
    "thumbnailUrl": "https://archive.org/services/img/deep-learning-collection-pdf",
    "searchQuery": "machine learning",
    "scrapedAt": "2026-03-03T05:42:00.000Z"
}

Output fields

| Field | Type | Description |
| --- | --- | --- |
| `identifier` | string | Unique Archive.org item identifier |
| `title` | string | Item title |
| `creator` | string | Author, artist, or uploader |
| `description` | string | Item description |
| `mediaType` | string | Media type (texts, movies, audio, software, etc.) |
| `collection` | string | Collection(s) the item belongs to |
| `date` | string | Publication or upload date |
| `year` | string | Year of publication |
| `language` | string | Content language |
| `subject` | array | Subject tags and categories |
| `downloads` | number | Total download count |
| `itemSize` | number | Total size in bytes |
| `filesCount` | number | Number of files in the item |
| `format` | array | Available file formats |
| `licenseUrl` | string | License URL if specified |
| `detailsUrl` | string | Link to item details page |
| `downloadUrl` | string | Direct download link |
| `thumbnailUrl` | string | Thumbnail image URL |
| `searchQuery` | string | The search query that found this item |
| `scrapedAt` | string | ISO 8601 timestamp of extraction |
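For downstream processing it can help to convert raw records into typed objects. This is a minimal sketch under assumptions: `ArchiveItem` and `parse_item` are illustrative names, only a subset of the fields is mapped, and empty or missing values are assumed to be safe to treat as zero or an empty list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ArchiveItem:
    """Illustrative typed view over a subset of the output fields."""
    identifier: str
    title: str
    media_type: str
    downloads: int
    item_size_mb: float
    formats: list[str]
    scraped_at: datetime


def parse_item(raw: dict) -> ArchiveItem:
    """Convert one dataset record (a dict) into an ArchiveItem."""
    return ArchiveItem(
        identifier=raw["identifier"],
        title=raw.get("title") or "",
        media_type=raw.get("mediaType") or "",
        downloads=int(raw.get("downloads") or 0),
        # itemSize is documented as bytes; convert to mebibytes.
        item_size_mb=(raw.get("itemSize") or 0) / (1024 * 1024),
        formats=list(raw.get("format") or []),
        # Replace the trailing "Z" so older Python versions can parse it.
        scraped_at=datetime.fromisoformat(raw["scrapedAt"].replace("Z", "+00:00")),
    )
```

Applied to the output example above, `item_size_mb` comes out as 500.0, since 524288000 bytes is exactly 500 × 1024 × 1024.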

How much does it cost to scrape the Internet Archive?

Internet Archive Scraper uses pay-per-event pricing:

| Event | Price |
| --- | --- |
| Run started | $0.001 |
| Item extracted | $0.001 per item |

Cost examples

| Items | Cost |
| --- | --- |
| 50 items (1 search) | $0.051 |
| 200 items (2 searches) | $0.201 |
| 500 items (5 searches) | $0.501 |
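The figures in the table are consistent with one run-start fee plus the per-item fee, with all searches sent in a single run. The sketch below reproduces that arithmetic; `estimate_cost` is a hypothetical helper, and the single-run reading of the table is an assumption.

```python
from decimal import Decimal

# Prices taken from the pricing table above.
RUN_START_FEE = Decimal("0.001")
PER_ITEM_FEE = Decimal("0.001")


def estimate_cost(items: int, runs: int = 1) -> Decimal:
    """Estimate the charge in USD for `items` results spread over `runs` runs."""
    if items < 0 or runs < 1:
        raise ValueError("items must be >= 0 and runs must be >= 1")
    # Decimal avoids binary floating-point rounding on small currency amounts.
    return RUN_START_FEE * runs + PER_ITEM_FEE * items


for n in (50, 200, 500):
    print(f"{n} items: ${estimate_cost(n)}")
```

Splitting the same queries across several runs adds one run-start fee per extra run, so batching queries into one `searchQueries` list is slightly cheaper.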

API usage

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("automation-lab/internet-archive-scraper").call(
    run_input={
        "searchQueries": ["artificial intelligence"],
        "mediaType": "texts",
        "sortBy": "downloads desc",
        "maxResults": 25
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} — {item['downloads']:,} downloads — {item['detailsUrl']}")

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/internet-archive-scraper').call({
    searchQueries: ['artificial intelligence'],
    mediaType: 'texts',
    sortBy: 'downloads desc',
    maxResults: 25,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`${item.title} — ${item.downloads.toLocaleString()} downloads`);
});

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~internet-archive-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"searchQueries":["artificial intelligence"],"mediaType":"texts","sortBy":"downloads desc","maxResults":25}'
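Unlike the client libraries, the cURL call above only starts a run; the items are read from the run's default dataset once the run has finished. The following sketch builds the URL for that second request. It assumes the JSON response wraps the run object in a `data` key that carries `defaultDatasetId`, and `dataset_items_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(run_response: dict, token: str, fmt: str = "json") -> str:
    """Build the dataset items URL from the parsed JSON of the run request."""
    # Assumption: the run object sits under "data" in the REST response.
    dataset_id = run_response["data"]["defaultDatasetId"]
    query = urlencode({"format": fmt, "token": token})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"
```

Passing `fmt="csv"` requests the same items as CSV instead of JSON, which matches the export options listed in the how-to steps.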

Use with AI agents via MCP

Internet Archive Scraper is available as a tool for AI assistants via the Model Context Protocol (MCP).

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/internet-archive-scraper"

Setup for Claude Desktop, Cursor, or VS Code

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com?tools=automation-lab/internet-archive-scraper"
        }
    }
}

Example prompts

"Search the Internet Archive for 'artificial intelligence' books"
"Find archived copies of old websites"
"Get the most downloaded public domain films from archive.org"

Learn more in the Apify MCP documentation.

Legality

Scraping publicly available data is generally considered legal in the United States, following the Ninth Circuit Court of Appeals ruling in hiQ Labs v. LinkedIn. This actor only accesses publicly available information and does not require authentication. Always review and comply with the target website's Terms of Service before scraping. For personal data, ensure compliance with GDPR, CCPA, and other applicable privacy regulations.

FAQ

Why does my search return fewer results than expected?
The maxResults parameter caps the results per query (default 50, maximum 500). The Internet Archive's search API also limits results for very broad queries. Use more specific search terms or add a mediaType filter to get more targeted results.

Can I download the actual files, not just metadata?
This scraper extracts metadata only. Each result includes a downloadUrl field that links to the item's download page on archive.org. You can use that URL to download files programmatically or in your browser.
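To go from an item to individual file links, one option outside this actor is the Internet Archive's own public metadata endpoint, `https://archive.org/metadata/<identifier>`, whose JSON includes a `files` list. The sketch below turns such a response into per-file download URLs. `file_urls` is a hypothetical helper, and the `format` labels used for filtering are assumptions that vary by item.

```python
from typing import Optional
from urllib.parse import quote


def file_urls(identifier: str, metadata: dict,
              wanted_format: Optional[str] = None) -> list[str]:
    """List download URLs for an item's files, optionally filtered by format.

    `metadata` is the parsed JSON from archive.org/metadata/<identifier>;
    fetching it is left to the caller.
    """
    urls = []
    for entry in metadata.get("files", []):
        if wanted_format and entry.get("format") != wanted_format:
            continue
        # quote() escapes spaces and other unsafe characters but keeps "/".
        urls.append(
            f"https://archive.org/download/{quote(identifier)}/{quote(entry['name'])}"
        )
    return urls
```

Keeping the network request outside the helper makes it easy to test and lets you reuse whatever HTTP client and rate limiting you already have.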

Integrations

Connect Internet Archive Scraper to your workflow with Apify integrations:

Webhooks — trigger actions when extraction completes
Google Sheets — export archive data to spreadsheets
Slack — get notified about new uploads matching your criteria
Zapier / Make — connect to 5,000+ apps and services
REST API — call the actor programmatically from any language

Tips and best practices

Use specific search terms for focused results — broad queries return millions of items
Filter by mediaType to narrow results (e.g., "texts" for books, "software" for games)
Sort by "downloads desc" to find the most popular items
The downloadUrl provides direct access to download all files for an item
thumbnailUrl works for most items and is useful for building visual catalogs
Use subject tags for categorization — they're user-contributed and can be very detailed
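As a sketch of the last tip, the snippet below counts how often each subject tag appears across a list of scraped items. `top_subjects` is a hypothetical helper. Because the tags are user-contributed, it normalizes case and whitespace, and it tolerates a `subject` value that is a single string rather than an array, which is an assumption about possible edge cases.

```python
from collections import Counter


def top_subjects(items: list[dict], n: int = 5) -> list[tuple[str, int]]:
    """Return the n most common subject tags, counted once per item."""
    counts: Counter = Counter()
    for item in items:
        subjects = item.get("subject") or []
        if isinstance(subjects, str):  # tolerate a lone string value
            subjects = [subjects]
        # A set ensures a tag repeated within one item is counted once.
        counts.update({s.strip().lower() for s in subjects if s.strip()})
    return counts.most_common(n)
```

The resulting list of `(tag, count)` pairs can feed a simple category index or guide follow-up search queries.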

Other scrapers

iTunes Search Scraper — Search and extract media metadata from the Apple iTunes catalog
Wikipedia Scraper — Extract articles and data from Wikipedia
IMDB Scraper — Scrape movie and TV show data from IMDB

Changelog

v0.1 — Initial release with full-text search, media type filtering, and sorting
