Project Gutenberg Books Scraper
Pricing
from $13.00 / 1,000 result items
Search 75,000+ free public-domain books from Project Gutenberg. Returns title, author with birth/death years, cover image, plain-text and EPUB download URLs, Kindle and HTML formats, subjects, bookshelves, language, copyright status, summaries and download counts. Filter by author or language.
Developer
ParseForge

📚 Project Gutenberg Books Scraper
🚀 Search 75,000+ free public-domain books from Project Gutenberg.
🕒 Last updated: 2026-05-06 · 📊 28 fields per record · 75,000+ books · public-domain catalog · plain-text, EPUB, Kindle, HTML, PDF download URLs
The Project Gutenberg Books Scraper searches the Project Gutenberg catalog and returns structured records for any free public-domain ebook. Output includes title, author with birth/death years, cover image, plain-text and EPUB download URLs, Kindle and HTML formats, subjects, bookshelves, language, copyright status, summaries, and download counts.
Project Gutenberg has been digitizing public-domain texts since 1971 and now hosts 75,000+ books across 60+ languages. Filters run server-side, so a single run can isolate every Shakespeare play, all 19th-century French novels, or the most-downloaded books of all time.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Researchers, NLP/ML teams, librarians, educators, content creators, ebook app developers | Building text corpora, NLP training datasets, public-domain ebook libraries, literary research, citation generation |
📋 What the Project Gutenberg Books Scraper does
Five filtering workflows in a single run:
- 🔍 Free-text search. Match by title, author, or general keywords.
- 👤 Author filter. Restrict to one author across all their works.
- 🏷️ Topic filter. Filter by subject (history, philosophy, science, fiction).
- 🌐 Language filter. ISO 639 language codes (en, fr, de, es, zh, ja).
- 📅 Author year filter. Filter authors by birth/death year for period studies.
💡 Why it matters: clean, server-side filtering removes the parser-and-pagination work from your team and keeps your dataset fresh on every run.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan up to 1,000,000. |
| query | string | "shakespeare" | Free-text keyword search. |
| language | string | "" | ISO 639 language code. |
| topic | string | "" | Subject filter. |
| authorYearStart | integer | null | Author born after this year. |
| authorYearEnd | integer | null | Author died before this year. |
| copyrightStatus | string | "" | `true`=copyrighted, `false`=public domain, empty=any. |
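Example: up to 100 books matching "shakespeare" in the free-text search.
{"maxItems": 100, "query": "shakespeare"}
Example: up to 200 French-language books by authors who lived in the 19th century. Add a topic filter to narrow the results to novels.
{"maxItems": 200, "language": "fr", "authorYearStart": 1800, "authorYearEnd": 1900}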
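The same input objects can be assembled in code. The following is a minimal Python sketch, assuming only the field names from the input table above. `build_run_input` is a hypothetical helper, not part of the Actor or of any Apify library, and dropping filters left at their empty defaults is an assumption about what the Actor accepts.

```python
from typing import Any, Optional


def build_run_input(
    max_items: int = 10,
    query: str = "",
    language: str = "",
    topic: str = "",
    author_year_start: Optional[int] = None,
    author_year_end: Optional[int] = None,
) -> dict[str, Any]:
    """Build an input object using the field names from the input table."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    if (
        author_year_start is not None
        and author_year_end is not None
        and author_year_start > author_year_end
    ):
        raise ValueError("author_year_start must not exceed author_year_end")

    candidate = {
        "maxItems": max_items,
        "query": query,
        "language": language,
        "topic": topic,
        "authorYearStart": author_year_start,
        "authorYearEnd": author_year_end,
    }
    # Drop filters left at their empty defaults ("" or None).
    return {key: value for key, value in candidate.items() if value not in ("", None)}


# Rebuilds the second example above from keyword arguments.
french_19th = build_run_input(
    max_items=200, language="fr", author_year_start=1800, author_year_end=1900
)
```

Validating the year range locally catches an inverted range before a run is started.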
📊 Output
Each record contains 28 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🖼️ coverUrl | string | null |
| 🆔 gutenbergId | string | "100" |
| 📛 title | string | "The Complete Works of William Shakespeare" |
| 👤 authorsText | string | "Shakespeare, William" |
| 👤 authors | array | [ { name, birthYear, deathYear } ] |
| 🏷️ subjects | array | ["Drama","English drama"] |
| 📁 bookshelves | array | ["Plays"] |
| 🌐 languages | array | ["en"] |
| 📋 copyright | boolean | false |
| 📥 downloadCount | number | 45230 |
| 📄 plainTextUrl | string | "https://www.gutenberg.org/files/100/100-0.txt" |
| 📕 epubUrl | string | "https://www.gutenberg.org/ebooks/100.epub3.images" |
| 📖 kindleUrl | string | "https://www.gutenberg.org/ebooks/100.kf8.images" |
| 🌐 htmlUrl | string | "https://www.gutenberg.org/files/100/100-h/100-h.htm" |
| 🔗 gutenbergUrl | string | "https://www.gutenberg.org/ebooks/100" |
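Once a dataset is exported as JSON, the schema fields above are enough for simple post-processing. The sketch below assumes records are plain dictionaries keyed by the field names in the schema table. The helper names and the format preference order are illustrative choices, not something the Actor defines.

```python
from typing import Any, Optional

# Preference order for download formats; field names come from the schema table.
FORMAT_FIELDS = ("plainTextUrl", "epubUrl", "kindleUrl", "htmlUrl")


def best_download_url(record: dict[str, Any]) -> Optional[str]:
    """Return the first non-empty download URL in preference order, if any."""
    for field in FORMAT_FIELDS:
        url = record.get(field)
        if url:
            return url
    return None


def top_public_domain(
    records: list[dict[str, Any]], limit: int = 10
) -> list[dict[str, Any]]:
    """Keep records whose copyright flag is False, sorted by downloadCount (highest first)."""
    free = [r for r in records if r.get("copyright") is False]
    ranked = sorted(free, key=lambda r: r.get("downloadCount") or 0, reverse=True)
    return ranked[:limit]
```

Treating a missing `downloadCount` as 0 keeps the sort stable when a record lacks the field.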
✨ Why choose this Actor
| | Capability |
|---|---|
| 📚 | 75,000+ books. Every public-domain text Project Gutenberg has digitized since 1971. |
| 🌐 | 60+ languages. English dominates, but you can find French, German, Spanish, Chinese, and more. |
| 📄 | Multi-format URLs. Plain-text, EPUB, Kindle, HTML, and PDF when available. |
| 📥 | Download counts. Filter and rank by reader popularity. |
| ⚖️ | Public domain. Use commercially without restrictions in most jurisdictions. |
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ This Actor | $5 free credit | 75,000+ books | Live per run | query, author, lang, topic, year | ⚡ 2 min |
| Manual Gutenberg browsing | Free | Manual | Live | Web filters only | 🕒 Manual |
| Standard Ebooks | Free | Curated subset | Slow | Limited | 🐢 Account |
| Internet Archive Texts | Free | Massive | Variable | Bulk only | 🐢 ETL |
Pick this Actor when you want broad coverage, server-side filtering, and no pipeline maintenance.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the Project Gutenberg Books Scraper page on the Apify Store.
- 🎯 Set input. Pick your filters and `maxItems`.
- 🚀 Run it. Click Start and let the Actor collect your data.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating Project Gutenberg Books Scraper
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. Hourly, daily, or weekly refreshes keep downstream databases in sync automatically.
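For the Python route, a run can be started and its results read back in a few lines. The following is a sketch under stated assumptions. `ACTOR_ID` is a placeholder to replace with the ID shown on the Actor's Store page, `APIFY_TOKEN` is an assumed environment variable name, and `fetch_books` is a hypothetical wrapper. It uses the `apify-client` calls `actor(...).call(run_input=...)` and `dataset(...).iterate_items()`.

```python
import os
from typing import Any, Iterator

# Placeholder: copy the real Actor ID from the Actor's page on the Apify Store.
ACTOR_ID = "<username>/<actor-name>"


def fetch_books(client: Any, run_input: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Run the Actor, wait for it to finish, and yield each dataset item."""
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The Actor run did not return a run object")
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()


def main() -> None:
    # Imported here so fetch_books stays importable without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed environment variable
    for book in fetch_books(client, {"maxItems": 10, "query": "shakespeare"}):
        print(book.get("gutenbergId"), book.get("title"))
```

Call `main()` with the token set in the environment. Passing the client in as a parameter keeps `fetch_books` easy to exercise with a stub in tests.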
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
❓ Frequently Asked Questions
🧩 How does it work?
Provide a query, author, language, or topic filter. The Actor queries the Project Gutenberg catalog and emits one record per book.
📥 Can I download the actual book contents?
The Actor returns metadata and direct download URLs for plain-text, EPUB, Kindle, HTML, and PDF formats. Use those URLs to fetch the actual contents.
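Fetching the contents is then an ordinary HTTP download. The sketch below uses only the Python standard library and assumes a record dictionary with the `plainTextUrl` and `gutenbergId` fields from the schema. `save_plain_text`, its `opener` parameter, and the `<gutenbergId>.txt` file naming are illustrative assumptions.

```python
import urllib.request
from pathlib import Path
from typing import Any, Callable


def save_plain_text(
    record: dict[str, Any],
    dest_dir: Path,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> Path:
    """Download the record's plainTextUrl into dest_dir and return the file path."""
    url = record.get("plainTextUrl")
    if not url:
        raise ValueError("record has no plainTextUrl")
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"{record['gutenbergId']}.txt"
    # The opener is injectable so the function can be tested without network access.
    with opener(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target
```

Writing raw bytes avoids guessing the text encoding at download time; decode when the file is read.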
⚖️ Is everything truly public domain?
Yes for the vast majority. The copyright field flags the rare exceptions still under copyright in some jurisdictions.
📊 How many fields per record?
28, including title, authors with birth/death years, cover, all download URLs, subjects, bookshelves, language, and download counts.
🔁 Can I schedule runs?
Yes. New books and translations are added regularly. Schedule weekly to capture additions.
🌐 Which languages are supported?
60+, with strongest coverage in English, French, German, Spanish, Italian, Dutch, Portuguese, and Chinese.
👤 Does it include author biographies?
No, but it returns author birth/death years for period research.
💳 Do I need a paid Apify plan?
No. The free plan covers preview runs. A paid plan unlocks higher item counts and scheduling.
🆘 What if a run fails?
Apify retries transient errors. Partial datasets are preserved.
🎙️ Can I generate audiobooks from this?
Yes. Pull plain-text URLs and pipe through any text-to-speech engine.
🔌 Integrate with any app
Project Gutenberg Books Scraper connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe data into your warehouse
- GitHub - Trigger runs from commits and releases
- Google Drive - Export datasets straight to Sheets
You can also use webhooks to trigger downstream actions when a run finishes.
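A webhook receiver usually only needs to find out which dataset to read. The sketch below assumes the default Apify webhook payload, with a top-level `eventType` and a `resource` object carrying the run's `defaultDatasetId`. A custom payload template would change these names, so check them against your webhook configuration. `dataset_id_from_webhook` is a hypothetical helper.

```python
from typing import Any, Optional


def dataset_id_from_webhook(payload: dict[str, Any]) -> Optional[str]:
    """Return the default dataset ID for a succeeded run, or None otherwise."""
    # Assumed event name for a successful run in the default payload.
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")
```

The returned ID can then be used to read the run's dataset.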
🔗 Recommended Actors
- 📖 Open Library Books - 30M+ books and editions
- 🌐 Wikidata Entity Search - 100M+ open knowledge-graph entities
- 🎨 Openverse Media - 800M+ openly licensed images and audio
- 🎓 arXiv Scraper - Academic preprints
- 🎬 TVMaze TV Shows - TV show metadata
💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by Project Gutenberg, the Gutendex project, or any contributing volunteers. All trademarks mentioned are the property of their respective owners. Only publicly available open data is collected.