Pricing

from $10.00 / 1,000 results

Archive.org Scraper

Scrape the Internet Archive (archive.org). Search 50M+ texts, 13M+ audio, 16M+ movies, and 1.3M+ software items. Get metadata, download counts, file lists, and more via public APIs.

Rating

0.0

(0)

Developer

lulz bot

Last modified: 3 months ago

What does Archive.org Scraper do?

This actor searches the Archive.org collection and extracts structured metadata for each item. It uses the official public Advanced Search API and Metadata API, so no authentication is needed.

Key capabilities:

Search across all media types or filter by texts, audio, movies, software, or images
Sort by downloads, date, or title
Extract up to 10,000 items per run
Optionally fetch full item details including file lists, formats, ratings, and reviews
Respectful rate limiting (500ms between requests)
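
The actor's source code is not shown on this page, so the following is only a minimal sketch of the request pattern described above: paged calls to the public Advanced Search endpoint with a fixed pause between them. The endpoint URLs and query parameters are Archive.org's public ones; the function names, the field list, and the injected `fetch` callable are assumptions made for illustration.

```python
import time
from typing import Callable, Iterator
from urllib.parse import urlencode

SEARCH_URL = "https://archive.org/advancedsearch.php"
METADATA_URL = "https://archive.org/metadata/"  # + identifier, for per-item details
PAGE_SIZE = 100       # page size quoted under "Performance and costs"
DELAY_SECONDS = 0.5   # the 500 ms pause between requests


def build_search_url(query: str, page: int, sort: str = "downloads desc") -> str:
    """Build one Advanced Search request URL (hypothetical helper)."""
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("fl[]", "downloads"),
        ("sort[]", sort),
        ("rows", PAGE_SIZE),
        ("page", page),
        ("output", "json"),
    ]
    return SEARCH_URL + "?" + urlencode(params)


def iter_items(
    query: str,
    max_results: int,
    fetch: Callable[[str], list],
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield up to max_results docs, pausing between page requests.

    `fetch` is assumed to take a URL and return the list of result docs
    for that page (an empty list once the results run out).
    """
    page, yielded = 1, 0
    while yielded < max_results:
        docs = fetch(build_search_url(query, page))
        if not docs:
            return
        for doc in docs:
            yield doc
            yielded += 1
            if yielded >= max_results:
                return
        page += 1
        sleep(DELAY_SECONDS)
```

Passing `fetch` and `sleep` in as arguments keeps the paging logic testable without network access; a real client would plug in an HTTP call that returns the `response.docs` list from the JSON body.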

Input

Field	Type	Description	Default
`searchQuery`	string	Search query (supports field queries like `creator:NASA`)	required
`mediaType`	string	Filter: all, texts, audio, movies, software, image	all
`sortBy`	string	Sort order: by downloads, newest, oldest, or title (the default takes the form `downloads desc`)	downloads desc
`maxResults`	integer	Max items to return (1-10,000)	200
`fetchDetails`	boolean	Fetch full metadata per item (slower, adds files/formats)	false
`proxyConfiguration`	object	Optional proxy configuration	none

Example input

{
    "searchQuery": "public domain classical music",
    "mediaType": "audio",
    "sortBy": "downloads desc",
    "maxResults": 100,
    "fetchDetails": false
}
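
Before starting a run, it can help to check an input object against the table above. The sketch below is a hypothetical client-side validator, not part of the actor: the function name and error messages are invented, and the allowed values and defaults are copied from the input table. `sortBy` is passed through unchecked because its accepted values are not fully spelled out here.

```python
from typing import Any

# Allowed values and defaults as listed in the input table.
MEDIA_TYPES = {"all", "texts", "audio", "movies", "software", "image"}
DEFAULTS = {
    "mediaType": "all",
    "sortBy": "downloads desc",
    "maxResults": 200,
    "fetchDetails": False,
}


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Apply table defaults and reject values outside the documented ranges."""
    query = str(raw.get("searchQuery", "")).strip()
    if not query:
        raise ValueError("searchQuery is required")

    merged = {**DEFAULTS, **raw, "searchQuery": query}

    if merged["mediaType"] not in MEDIA_TYPES:
        raise ValueError(f"mediaType must be one of {sorted(MEDIA_TYPES)}")

    limit = merged["maxResults"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int) or not 1 <= limit <= 10_000:
        raise ValueError("maxResults must be an integer from 1 to 10,000")

    if not isinstance(merged["fetchDetails"], bool):
        raise ValueError("fetchDetails must be true or false")

    return merged
```

For the example input above, `normalize_input` returns the same values unchanged, since every optional field is already set.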

Advanced query examples

creator:"Grateful Dead" -- items by a specific creator
subject:jazz AND date:[1950-01-01 TO 1960-01-01] -- jazz items from the 1950s
collection:prelinger -- items from the Prelinger Archives
title:"machine learning" AND mediatype:texts -- texts about machine learning
licenseurl:creativecommons.org -- Creative Commons licensed items
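
Queries like these can also be assembled programmatically. The helpers below are a small illustrative sketch (the names `field`, `date_range`, and `all_of` are invented here); they only reproduce the syntax shown in the examples above and do not escape special characters inside values.

```python
def field(name: str, value: str) -> str:
    """Return a name:value clause, quoting values that contain whitespace."""
    if any(ch.isspace() for ch in value):
        value = f'"{value}"'
    return f"{name}:{value}"


def date_range(start: str, end: str) -> str:
    """Return a date clause for dates written as YYYY-MM-DD."""
    return f"date:[{start} TO {end}]"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


jazz_1950s = all_of(field("subject", "jazz"), date_range("1950-01-01", "1960-01-01"))
print(jazz_1950s)  # subject:jazz AND date:[1950-01-01 TO 1960-01-01]
```

The resulting string can be used directly as the `searchQuery` input.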

Output

Each item in the dataset contains:

Field	Type	Description
`identifier`	string	Unique Archive.org identifier
`title`	string	Item title
`creator`	string	Creator/author
`date`	string	Publication or creation date
`description`	string	Item description
`mediatype`	string	Media type (texts, audio, movies, etc.)
`collection`	string	Collection(s) the item belongs to
`downloads`	number	Total download count
`subject`	string	Subject tags
`language`	string	Language
`source`	string	Original source
`archiveUrl`	string	Direct link to the item on archive.org
`scrapedAt`	string	ISO timestamp of when the data was collected

Additional fields when `fetchDetails` is enabled:

Field	Type	Description
`files`	array	List of files with name, format, and size
`format`	string	Available formats (e.g., "PDF; EPUB; Text")
`licenseurl`	string	License URL
`runtime`	string	Runtime/duration for audio/video
`addeddate`	string	Date item was added to Archive.org
`publicdate`	string	Date item was made public
`uploader`	string	Uploader username
`num_reviews`	number	Number of reviews
`avg_rating`	number	Average rating
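
As a rough illustration of working with the `files` array, the sketch below filters files by format and totals their size. It assumes each file entry is an object with `name`, `format`, and `size` keys, as the table above suggests, and that `size` may arrive as either a number or a numeric string. The sample item and both helper names are invented for the example.

```python
def files_by_format(item: dict, wanted: str) -> list[dict]:
    """Return the file entries of an item whose format matches exactly."""
    return [f for f in item.get("files") or [] if f.get("format") == wanted]


def total_size(files: list[dict]) -> int:
    """Sum file sizes in bytes, treating missing sizes as zero."""
    return sum(int(f.get("size") or 0) for f in files)


# Invented sample record, shaped like the fields described above.
sample = {
    "identifier": "example-item",
    "files": [
        {"name": "book.pdf", "format": "PDF", "size": "1048576"},
        {"name": "book.epub", "format": "EPUB", "size": 524288},
        {"name": "scan.pdf", "format": "PDF", "size": "2097152"},
    ],
}

pdfs = files_by_format(sample, "PDF")
print(len(pdfs), total_size(pdfs))  # 2 3145728
```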

Example output

{
    "identifier": "greatgatsby_1201",
    "title": "The Great Gatsby",
    "creator": "F. Scott Fitzgerald",
    "date": "1925",
    "description": "The Great Gatsby is a 1925 novel by American writer F. Scott Fitzgerald...",
    "mediatype": "texts",
    "collection": "opensource; americana",
    "downloads": 1523847,
    "subject": "fiction; american literature; jazz age",
    "language": "English",
    "source": null,
    "archiveUrl": "https://archive.org/details/greatgatsby_1201",
    "scrapedAt": "2026-04-25T12:00:00.000Z"
}
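
In the example output, multi-valued fields such as `collection` and `subject` appear as a single string joined with "; ". Assuming that convention holds for every item (this is inferred from the example only), a consumer might split them into lists as sketched below; `split_multi` and `MULTI_VALUE_FIELDS` are hypothetical names.

```python
import json

# Fields that appear joined with "; " in the example output (an assumption).
MULTI_VALUE_FIELDS = ("collection", "subject")


def split_multi(item: dict) -> dict:
    """Return a copy of the item with "; "-joined fields split into lists."""
    result = dict(item)
    for key in MULTI_VALUE_FIELDS:
        value = result.get(key)
        if isinstance(value, str):
            result[key] = [part.strip() for part in value.split(";") if part.strip()]
    return result


record = json.loads("""
{
    "identifier": "greatgatsby_1201",
    "collection": "opensource; americana",
    "subject": "fiction; american literature; jazz age",
    "source": null
}
""")

parsed = split_multi(record)
print(parsed["collection"])  # ['opensource', 'americana']
```

Fields that are `null` or absent are left untouched, which matters because some items have incomplete metadata.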

Use cases

Research: Search and catalog public domain texts, audio, and video
Digital preservation: Track download counts and availability of archived content
Content discovery: Find Creative Commons media for projects
Data analysis: Analyze trends across Archive.org's collections
Library science: Build catalogs of specific collections or media types
Machine learning: Discover training datasets from public domain sources

Performance and costs

Without details: ~200 items/minute (paginated search, 500ms delay)
With details: ~2 items/second, or ~120 items/minute (individual metadata API calls)
The Archive.org API returns up to 100 items per page
Proxy is optional but recommended for large scrapes (10,000+ items)
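
The figures above allow a back-of-the-envelope estimate of run cost and duration. The sketch below simply applies the listed rates; the function name is invented, and because the price is quoted as "from $10.00 / 1,000 results", the cost should be read as a lower bound rather than a quote.

```python
PRICE_PER_1000_USD = 10.00           # "from $10.00 / 1,000 results"
ITEMS_PER_MINUTE_SEARCH_ONLY = 200   # "~200 items/minute" without details
ITEMS_PER_SECOND_WITH_DETAILS = 2    # "~2 items/second" with details


def estimate_run(items: int, fetch_details: bool = False) -> tuple[float, float]:
    """Return (minimum cost in USD, approximate duration in minutes)."""
    cost = items / 1000 * PRICE_PER_1000_USD
    if fetch_details:
        minutes = items / ITEMS_PER_SECOND_WITH_DETAILS / 60
    else:
        minutes = items / ITEMS_PER_MINUTE_SEARCH_ONLY
    return cost, minutes


cost, minutes = estimate_run(10_000)
print(f"${cost:.2f}, about {minutes:.0f} minutes")  # $100.00, about 50 minutes
```

By the same arithmetic, a full 10,000-item run with `fetchDetails` enabled would take roughly 83 minutes.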

Limitations

The Advanced Search API caps at ~10,000 results per query. For larger datasets, split the search into several narrower queries (for example by date range or collection).
Some items may have incomplete metadata (missing creator, date, etc.)
File lists are only available when `fetchDetails` is enabled
Requests are spaced 500ms apart to be respectful to Archive.org servers
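
One way to work within the ~10,000-result cap is to slice a broad query by year and run each slice separately. The sketch below is an illustration under assumptions, not a feature of the actor: `yearly_slices` and `dedupe` are invented names, each slice must itself stay under the cap, and items with no `date` value will not match any slice.

```python
def yearly_slices(base_query: str, first_year: int, last_year: int) -> list[str]:
    """Return one query per calendar year, each narrowing the base query."""
    return [
        f"({base_query}) AND date:[{year}-01-01 TO {year}-12-31]"
        for year in range(first_year, last_year + 1)
    ]


def dedupe(items: list[dict]) -> list[dict]:
    """Drop repeated items across runs, keeping the first of each identifier."""
    seen: set[str] = set()
    unique = []
    for item in items:
        ident = item["identifier"]
        if ident not in seen:
            seen.add(ident)
            unique.append(item)
    return unique


for query in yearly_slices("subject:jazz", 1950, 1952):
    print(query)
# (subject:jazz) AND date:[1950-01-01 TO 1950-12-31]
# (subject:jazz) AND date:[1951-01-01 TO 1951-12-31]
# (subject:jazz) AND date:[1952-01-01 TO 1952-12-31]
```

Each generated string can be used as the `searchQuery` of a separate run, with the combined datasets passed through `dedupe` afterwards.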

Internet Archive Scraper

truenorth/internet-archive-scraper

Search archive.org and export books, audio, video, software, images, metadata, and file lists as structured JSON or CSV.

TrueNorth

Internet Archive Items Scraper - archive.org Search by Query

gio21/archive-org-items-scraper

Search Internet Archive (archive.org) items: books, movies, audio, software, images, web archives, data. Returns title, creator, date, description, downloads, identifier, URLs. Free, no key. For research, content discovery, digital preservation.

Gio

Internet Archive Search Scraper

parseforge/internet-archive-search-scraper

Search the Internet Archive's 50M+ item catalog of texts, audio, movies, software, web pages, and images. Filter by collection, media type, creator, and date. Pull identifiers, titles, descriptions, downloads, and rich metadata.

ParseForge

Internet Archive Search Scraper

crawlergang/internet-archive-search-scraper

Searches and retrieves items from the Internet Archive (archive.org) - 44M+ books, videos, audio, software, and web archives. Free, no API key required.

Crawler Gang

5.0

Internet Archive Search Scraper

crawlerbros/internet-archive-search-scraper

Searches and retrieves items from the Internet Archive (archive.org) - 44M+ books, videos, audio, software, and web archives. Free, no API key required.

Crawler Bros

Internet Archive & Wayback Machine Scraper

mangudai/internet-archive-scraper

Search the Internet Archive's 40M+ items, pull full item metadata and file lists, and query the Wayback Machine for URL snapshots. Books, audio, video, software, and archived pages on official archive.org APIs. No API key.

Mangudäi

Internet Archive Metadata Scraper — Bulk archive.org Export

logiover/internet-archive-metadata-scraper

Bulk-export item metadata from the Internet Archive (archive.org) by full-text query, collection, media type, creator, subject and date range. Extract identifier, title, creator, date, downloads, format, subject and more. Millions of items. No API key, no login.

Logiover

Internet Archive (archive.org) Items Scraper

scrapers_lat/archive-org-scraper

Scrape Internet Archive items by keyword, media type or identifier. Get title, creator, downloads, subjects, collections, dates, item size and downloadable files as JSON, CSV or Excel.

Scrapers Lat

Internet Archive Digital Library (archive.org) - Data Scraper

gettingtechnicl/internet-archive

Extract record data from Internet Archive Digital Library (archive.org) via its official public JSON API. Search by search query, collection, format, media type and export 19 structured fields per record as JSON, CSV or Excel - reliable, with no fragile HTML scraping.

Terry Gluff

Internet Archive Search — Wayback Machine Advanced Query Tool

maged120/archive-org-advanced-search

Search the Internet Archive (archive.org) with full advanced filter support — date range, media type, language, subject, and more. Returns metadata from archived web pages, books, audio, and video.