Internet Archive Search Scraper
Pricing
from $16.00 / 1,000 result items
Search the Internet Archive's 50M+ item catalog of texts, audio, movies, software, web pages, and images. Filter by collection, media type, creator, and date. Pull identifiers, titles, descriptions, downloads, and rich metadata.
Developer
ParseForge
Maintained by Community

📚 Internet Archive Search Scraper
🚀 Export the world's largest open library in seconds. Search 50M+ items across texts, audio, movies, software, web, images, and data. No login, no manual paging, no Lucene crash courses required.
🕒 Last updated: 2026-05-22 · 📊 21 fields per record · 📚 50M+ items · 🎬 8 media types · 🌐 archive.org corpus
The Internet Archive Search Scraper exports records from the Internet Archive catalog and returns 21 fields per record, including identifier, title, full description, creator, language, subject tags, collection memberships, publish date, download counts (lifetime, weekly, and monthly), file inventories, total byte size, license URL, and direct links to the item details page and metadata feed. The underlying source is the world's largest publicly accessible digital library, maintained since 1996.
The catalog covers 50 million+ items across eight media types (texts, audio, movies, software, web captures, images, datasets, and collections). This Actor lets you slice the corpus with Lucene-style queries plus structured filters for collection, media type, creator, and date range, then download the result as CSV, Excel, JSON, or XML in under five minutes.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Librarians, digital archivists, journalists, OSINT researchers, academic historians, ML dataset curators, documentary filmmakers | Citation discovery, training-corpus assembly, historical media research, source verification, public-domain media sourcing, archival preservation audits |
📋 What the Archive Search Scraper does
A single configurable workflow with five filter layers plus an optional per-item metadata fetch:
- 🔎 Lucene query. Free-text or fielded queries like `subject:photography AND mediatype:image`.
- 📦 Collection filter. Restrict to one Internet Archive collection like `nasa` or `librivoxaudio`.
- 🎬 Media-type filter. Texts, audio, movies, software, web, image, data, or collection.
- 👤 Creator filter. Restrict to one creator name like NASA or Library of Congress.
- 📅 Date range. Filter by item publish date with `dateFrom` and `dateTo` (YYYY-MM-DD).
- 📚 Per-item metadata. Optional deep fetch returns the full file list, rich subject tags, and license URL.
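How the Actor combines these layers internally is not documented here. As a rough mental model, the structured filters can be read as extra clauses AND-ed onto the Lucene query. The sketch below assumes exactly that: `build_query` is a hypothetical helper, not part of the Actor, and the field names (`collection`, `mediatype`, `creator`, `date`) follow archive.org's fielded search syntax.

```python
def build_query(
    search_query: str = "",
    collection: str = "",
    media_type: str = "",
    creator: str = "",
    date_from: str = "",
    date_to: str = "",
) -> str:
    """Combine a free-text query and structured filters into one Lucene string.

    Hypothetical helper for illustration; the Actor's real logic may differ.
    """
    clauses = []
    if search_query:
        # Parenthesize so OR terms inside the query do not leak into the filters.
        clauses.append(f"({search_query})")
    if collection:
        clauses.append(f"collection:{collection}")
    if media_type:
        clauses.append(f"mediatype:{media_type}")
    if creator:
        # Quote the value so multi-word names stay one phrase.
        clauses.append(f'creator:"{creator}"')
    if date_from or date_to:
        # Lucene range syntax; '*' as an open end is assumed to be accepted.
        clauses.append(f"date:[{date_from or '*'} TO {date_to or '*'}]")
    return " AND ".join(clauses)


print(build_query(creator="NASA", media_type="image",
                  date_from="2020-01-01", date_to="2025-12-31"))
# → mediatype:image AND creator:"NASA" AND date:[2020-01-01 TO 2025-12-31]
```

Reading the filters this way also explains why a fielded clause typed into `searchQuery` and the matching structured filter should give comparable results.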
Each record bundles identifiers (Archive ID, details URL, metadata URL), descriptive metadata (title, creator, language, description, subject tags), classification (media type, collection memberships), engagement (lifetime, weekly, and monthly download counts), file inventory (count and total byte size), and licensing.
💡 Why it matters: the Archive is the largest public corpus of cultural and reference material on Earth, but its native search interface assumes you already know Lucene. This Actor exposes that same query layer with structured filters and clean records, ready for analysis, ingestion, or archival back-up.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `maxItems` | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| `searchQuery` | string | "mars rover" | Lucene-style query. Plain words OK. Field syntax like `creator:NASA` supported. |
| `collection` | string | "" | Restrict to one collection slug like `nasa`, `opensource_movies`, or `librivoxaudio`. |
| `mediaType` | enum | "" | One of 8 media types: texts, audio, movies, software, web, image, data, collection. |
| `creator` | string | "" | Filter by creator name like NASA or Library of Congress. |
| `dateFrom` | string | "" | Earliest publish date in YYYY-MM-DD. |
| `dateTo` | string | "" | Latest publish date in YYYY-MM-DD. |
| `fetchDetails` | boolean | true | Fetch full per-item metadata. Slower but richer. |
Example: NASA photo collection from 2020 to 2025.
{"maxItems": 100, "creator": "NASA", "mediaType": "image", "dateFrom": "2020-01-01", "dateTo": "2025-12-31"}
Example: classic LibriVox audiobooks.
{"maxItems": 50, "collection": "librivoxaudio", "mediaType": "audio"}
⚠️ Good to Know: Lucene field names are case-sensitive (`subject:` not `Subject:`). Enable `fetchDetails` for file inventories and the full subject-tag list, otherwise records contain index-level metadata only. For very large dumps (100,000+ items), schedule the run during off-peak hours to be a good steward of the public catalog.
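A quick local check can catch input mistakes before a run is started. The sketch below is a hypothetical pre-flight helper, not something the Actor ships: `check_input` only applies the constraints stated in the Input table (the eight media types and the YYYY-MM-DD date format) and assumes nothing else about how the Actor validates input.

```python
import re
from datetime import date

# The eight values listed for mediaType in the Input table.
MEDIA_TYPES = {"texts", "audio", "movies", "software",
               "web", "image", "data", "collection"}
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_input(run_input: dict) -> list[str]:
    """Return a list of problems found in an input object (empty if none)."""
    problems = []

    media_type = run_input.get("mediaType", "")
    if media_type and media_type not in MEDIA_TYPES:
        problems.append(f"unknown mediaType: {media_type!r}")

    parsed = {}
    for key in ("dateFrom", "dateTo"):
        value = run_input.get(key, "")
        if not value:
            continue  # empty string means "no bound"
        try:
            if not DATE_PATTERN.fullmatch(value):
                raise ValueError(value)
            parsed[key] = date.fromisoformat(value)  # rejects e.g. month 13
        except ValueError:
            problems.append(f"{key} is not YYYY-MM-DD: {value!r}")

    if len(parsed) == 2 and parsed["dateFrom"] > parsed["dateTo"]:
        problems.append("dateFrom is later than dateTo")
    return problems


print(check_input({"mediaType": "images", "dateFrom": "2020-01-01"}))
# → ["unknown mediaType: 'images'"]
```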
📊 Output
Each record contains 21 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🖼️ `thumbnailUrl` | string | "https://archive.org/services/img/PIA23499" |
| 🆔 `identifier` | string | "PIA23499" |
| 📝 `title` | string | "Mars 2020 Rover Selfie" |
| 🎬 `mediaType` | string | "image" |
| 👤 `creator` | array | ["NASA/JPL-Caltech"] |
| 📅 `date` | string | "2021-04-06" |
| 📅 `publishDate` | ISO 8601 | "2021-04-06T00:00:00.000Z" |
| 📖 `description` | string | "NASA's Perseverance rover took this selfie..." |
| 📦 `collection` | array | ["nasa", "image"] |
| 🏷️ `subject` | array | ["mars", "rover", "perseverance"] |
| 🌐 `language` | array | ["English"] |
| ⬇️ `downloads` | number | 12450 |
| 📊 `week` | number | 87 |
| 📊 `month` | number | 342 |
| 📁 `filesCount` | number | 8 |
| 💾 `totalSizeBytes` | number | 52428800 |
| 📜 `licenseUrl` | string \| null | "https://creativecommons.org/publicdomain/mark/1.0/" |
| 🔗 `detailsUrl` | string | "https://archive.org/details/PIA23499" |
| 🧾 `metadataUrl` | string | "https://archive.org/metadata/PIA23499" |
| 🕒 `scrapedAt` | ISO 8601 | "2026-05-22T00:00:00.000Z" |
| ⚠️ `error` | string \| null | null |
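Once a dataset is exported as JSON, the schema fields can be consumed directly. The sketch below is illustrative only: `summarize` is a hypothetical helper, the sample record reuses the example values from the schema table, and it assumes that `error` is `null` on records that were fetched successfully.

```python
import json

# One record shaped like the schema, using the table's example values.
sample_json = """
[
  {
    "identifier": "PIA23499",
    "title": "Mars 2020 Rover Selfie",
    "mediaType": "image",
    "downloads": 12450,
    "filesCount": 8,
    "totalSizeBytes": 52428800,
    "licenseUrl": "https://creativecommons.org/publicdomain/mark/1.0/",
    "error": null
  }
]
"""


def summarize(records: list[dict]) -> dict:
    """Count usable records and total their byte size in MiB."""
    ok = [r for r in records if r.get("error") is None]
    # totalSizeBytes may be missing when fetchDetails is off, so default to 0.
    total_bytes = sum(r.get("totalSizeBytes") or 0 for r in ok)
    return {
        "records": len(ok),
        "failed": len(records) - len(ok),
        "totalMiB": round(total_bytes / 1024**2, 1),
    }


print(summarize(json.loads(sample_json)))
# → {'records': 1, 'failed': 0, 'totalMiB': 50.0}
```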
✨ Why choose this Actor
| | Capability |
|---|---|
| 📚 | Massive corpus. Access 50 million+ items across texts, audio, movies, software, web captures, and images. |
| 🔎 | Lucene-grade search. Free text or fielded queries, combined with structured filters for collection, media type, creator, and date. |
| 📦 | Rich per-item metadata. Optional deep fetch returns file lists, byte sizes, license URLs, and full subject tags. |
| 📊 | Engagement signals. Lifetime, weekly, and monthly download counts surface what people actually use. |
| 🌐 | All media types. One Actor for texts, audio, movies, software, and data. No source switching. |
| 🔁 | Always fresh. Every run reads the live catalog so newly uploaded items flow through. |
| 🚫 | No authentication. Public open library. No login or token. |
📊 The Archive is the closest thing we have to a public memory of the digital age. Querying it well is a superpower for librarians, journalists, and ML teams.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ Archive Search Scraper (this Actor) | $5 free credit, then pay-per-use | 50M+ items | Live per run | query, collection, media, creator, date | ⚡ 2 min |
| Manual archive.org search | Free | Full | Live | Few | 🐢 Hours |
| Commercial library aggregators | $$$/year | Smaller, curated | Daily | Many | ⏳ Days |
| Bulk torrent dumps | Free | Partial, stale | Rarely | None | 🕒 Variable |
Pick this Actor when you want structured Lucene-grade search results, rich metadata, and zero parsing work.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the Internet Archive Search Scraper page on the Apify Store.
- 🎯 Set input. Type a search query, optionally restrict to a collection or media type, and set `maxItems`.
- 🚀 Run it. Click Start and let the Actor collect your dataset.
- 📥 Download. Grab results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating Archive Search Scraper
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. Run weekly to track newly added items in a watched collection, or daily during a research sprint.
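For a programmatic run, a minimal Python sketch with the `apify-client` package might look like the following. The Actor ID shown is a placeholder assumption (copy the real one from the Store page), `make_run_input` and `run_actor` are hypothetical helper names, and the token is assumed to live in an `APIFY_TOKEN` environment variable.

```python
import os

# Placeholder, not confirmed by this page: replace with the real Actor ID.
ACTOR_ID = "parseforge/internet-archive-search-scraper"


def make_run_input(query: str, max_items: int = 10, **filters: str) -> dict:
    """Build an input object using the field names from the Input table."""
    run_input = {"searchQuery": query, "maxItems": max_items}
    # Skip empty filters so the Actor falls back to its defaults.
    run_input.update({key: value for key, value in filters.items() if value})
    return run_input


def run_actor(run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported here so make_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example (needs a valid token and network access):
# items = run_actor(make_run_input("mars rover", 25, collection="nasa"))
# print(len(items), "records")
```

Keeping input construction separate from the network call makes the same input object reusable for scheduled runs.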
🌟 Beyond business use cases
Archive data powers more than commercial workflows. The same records support research, education, civic projects, and personal initiatives.
❓ Frequently Asked Questions
🧩 How does it work?
Type a search query, optionally add collection, media-type, creator, or date filters, click Start, and the Actor pulls structured records from the live catalog. With fetchDetails enabled, each record is enriched with the full per-item metadata feed.
📏 How accurate is the data?
Identifiers and detail URLs are stable. Subject tags and creator names are crowd-sourced and may include duplicates or typos. Download counts update continuously and reflect the catalog state at run time.
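Duplicate tags are easy to tidy up after export. The sketch below is one possible clean-up pass, with `normalize_tags` as a hypothetical helper name: it trims whitespace and drops case-insensitive duplicates, but makes no attempt to correct typos.

```python
def normalize_tags(tags: list[str]) -> list[str]:
    """Trim tags and drop case-insensitive duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for tag in tags:
        cleaned = " ".join(tag.split())  # strip ends, collapse inner whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


print(normalize_tags(["mars", "Mars ", "rover", "  mars", "Perseverance  rover"]))
# → ['mars', 'rover', 'Perseverance rover']
```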
🔁 How often is the catalog refreshed?
The Archive accepts new uploads every minute of the day. Every run of this Actor reads the live catalog, so freshly uploaded items appear without waiting for a daily cron.
🎬 What media types can I query?
Eight: texts, audio, movies, software, web captures, images, datasets, and collections. Combine with a Lucene query for fine-grained slicing.
🔎 Do I need to know Lucene?
No. Plain-word queries work fine. Fielded syntax like creator:NASA or subject:photography AND mediatype:image is supported when you need precision.
📜 Are these items free to use?
Most are public domain or open-licensed. Always check the licenseUrl field on each record before commercial use. Some collections carry restrictive licenses despite being publicly viewable.
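One way to act on this is to filter records by `licenseUrl` before any commercial use. The sketch below is a conservative example under stated assumptions: the prefix allowlist covers only Creative Commons public-domain statements, `is_public_domain` is a hypothetical helper, and records with a missing license are treated as unknown and excluded.

```python
# Assumed allowlist: Creative Commons public-domain mark and CC0 URLs.
PUBLIC_DOMAIN_PREFIXES = (
    "https://creativecommons.org/publicdomain/",
    "http://creativecommons.org/publicdomain/",
)


def is_public_domain(record: dict) -> bool:
    """True only when licenseUrl points at a CC public-domain statement."""
    url = record.get("licenseUrl") or ""  # null or missing becomes ""
    return url.startswith(PUBLIC_DOMAIN_PREFIXES)


records = [
    {"identifier": "PIA23499",
     "licenseUrl": "https://creativecommons.org/publicdomain/mark/1.0/"},
    {"identifier": "example-item", "licenseUrl": None},  # made-up record
]
print([r["identifier"] for r in records if is_public_domain(r)])
# → ['PIA23499']
```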
📁 Do I get the actual file downloads?
This Actor returns metadata, file counts, and total byte sizes. To pull individual files, follow the detailsUrl and use the per-file download links the Archive exposes there.
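As an alternative to clicking through the details page, the `metadataUrl` feed can be read directly. The sketch below assumes the feed is a JSON object with a `files` array whose entries carry a `name`, and that files are served under `https://archive.org/download/<identifier>/<name>`. The helper names and the sample file names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_metadata(metadata_url: str) -> dict:
    """Download and parse the metadata feed for one item."""
    with urlopen(metadata_url, timeout=30) as response:
        return json.load(response)


def file_urls(identifier: str, metadata: dict) -> list[str]:
    """Build one download URL per file listed in the metadata feed."""
    base = f"https://archive.org/download/{quote(identifier)}/"
    # quote() keeps '/' so files in subdirectories still resolve.
    return [base + quote(entry["name"])
            for entry in metadata.get("files", [])
            if "name" in entry]


# Offline example with made-up file names:
sample = {"files": [{"name": "PIA23499.jpg"}, {"name": "full res/PIA23499.tif"}]}
for url in file_urls("PIA23499", sample):
    print(url)
```

For a live item, pass the record's `metadataUrl` to `fetch_metadata` and feed the result to `file_urls`.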
💼 Can I use this data commercially?
Yes for metadata. For the actual media, you must respect each item's license. The licenseUrl field surfaces the relevant statement on every record.
💳 Do I need a paid Apify plan to use this Actor?
No. The free Apify plan is enough for testing and small pulls (10 records per run). A paid plan lifts the limit and gives you access to scheduling, higher concurrency, and larger datasets.
🔁 What happens if a run fails or gets interrupted?
Apify automatically retries transient errors. If a run still fails, inspect the log in the Runs tab, fix the input, and re-run. Partial datasets from failed runs are preserved so you never lose progress.
🆘 What if I need help?
Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.
🔌 Integrate with any app
Internet Archive Search Scraper connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe archive records into your warehouse
- GitHub - Trigger runs from commits and releases
- Google Drive - Export datasets straight to Sheets
You can also use webhooks to trigger downstream actions when a run finishes. Push fresh archival metadata into your knowledge base, or alert a research team in Slack.
🔗 Recommended Actors
- 📚 arXiv Scraper - Preprint papers across physics, math, and CS
- 📜 RFC Editor Index Scraper - IETF Internet standards catalog
- 🏛️ Met Museum Scraper - Metropolitan Museum of Art open-access objects
- 🔬 ClinicalTrials.gov Scraper - Registered medical trials with outcomes
- 🌍 REST Countries Info Scraper - 250+ countries with population, currencies, languages
💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the Internet Archive or any of its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open archival data is collected.