Project Gutenberg Books Scraper

Pricing: from $13.00 / 1,000 result items

Search 75,000+ free public-domain books from Project Gutenberg. Returns title, author with birth/death years, cover image, plain-text and EPUB download URLs, Kindle and HTML formats, subjects, bookshelves, language, copyright status, summaries and download counts. Filter by author or language.

Developer: ParseForge · Maintained by Community

Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified a day ago

📚 Project Gutenberg Books Scraper

🚀 Search 75,000+ free public-domain books from Project Gutenberg.

🕒 Last updated: 2026-05-06 · 📊 28 fields per record · 75,000+ books · public-domain catalog · plain-text, EPUB, Kindle, HTML, PDF download URLs

The Project Gutenberg Books Scraper searches the Project Gutenberg catalog and returns structured records for any free public-domain ebook. Output includes title, author with birth/death years, cover image, plain-text and EPUB download URLs, Kindle and HTML formats, subjects, bookshelves, language, copyright status, summaries, and download counts.

Project Gutenberg has been digitizing public-domain texts since 1971 and now hosts 75,000+ books across 60+ languages. Filters run server-side, so a single run can isolate every Shakespeare play, all 19th-century French novels, or the most-downloaded books of all time.

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| Researchers, NLP/ML teams, librarians, educators, content creators, ebook app developers | Building text corpora, NLP training datasets, public-domain ebook libraries, literary research, citation generation |

📋 What the Project Gutenberg Books Scraper does

Five filtering workflows in a single run:

  • 🔍 Free-text search. Match by title, author, or general keywords.
  • 👤 Author filter. Restrict to one author across all their works.
  • 🏷️ Topic filter. Filter by subject (history, philosophy, science, fiction).
  • 🌐 Language filter. Two-letter ISO 639-1 language codes (en, fr, de, es, zh, ja).
  • 📅 Author year filter. Filter authors by birth/death year for period studies.

💡 Why it matters: clean, server-side filtering removes the parser-and-pagination work from your team and keeps your dataset fresh on every run.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan up to 1,000,000. |
| query | string | "shakespeare" | Free-text keyword search. |
| language | string | "" | ISO 639 language code. |
| topic | string | "" | Subject filter. |
| authorYearStart | integer | null | Author born after this year. |
| authorYearEnd | integer | null | Author died before this year. |
| copyrightStatus | string | "" | `true`=copyrighted, `false`=public domain, empty=any. |

Example: every Shakespeare work.

{
  "maxItems": 100,
  "query": "shakespeare"
}

Example: French-language works by 19th-century authors.

{
  "maxItems": 200,
  "language": "fr",
  "authorYearStart": 1800,
  "authorYearEnd": 1900
}
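
The input objects above can also be assembled in code. The following is a minimal Python sketch, not part of the Actor itself: the function name `build_run_input` and its validation rules are illustrative assumptions, and only the JSON field names come from the input table. It sends every field explicitly so that no default (such as the "shakespeare" query) is applied silently; whether the Actor treats an empty query as "no keyword filter" is also an assumption.

```python
from typing import Any, Optional


def build_run_input(
    max_items: int = 10,
    query: str = "",
    language: str = "",
    topic: str = "",
    author_year_start: Optional[int] = None,
    author_year_end: Optional[int] = None,
    copyright_status: str = "",
) -> dict[str, Any]:
    """Assemble an input object using the field names from the input table."""
    # These checks are illustrative; the Actor may validate differently.
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")
    if copyright_status not in ("", "true", "false"):
        raise ValueError('copyrightStatus must be "", "true", or "false"')
    if (
        author_year_start is not None
        and author_year_end is not None
        and author_year_start > author_year_end
    ):
        raise ValueError("authorYearStart must not exceed authorYearEnd")

    # Every field is sent explicitly; None serializes to JSON null.
    return {
        "maxItems": max_items,
        "query": query,
        "language": language,
        "topic": topic,
        "authorYearStart": author_year_start,
        "authorYearEnd": author_year_end,
        "copyrightStatus": copyright_status,
    }


# Mirrors the second example above (French-language works, authors 1800-1900).
french_input = build_run_input(
    max_items=200, language="fr", author_year_start=1800, author_year_end=1900
)
```

Serializing `french_input` with `json.dumps` yields an object that can be pasted into the Actor's JSON input editor.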

📊 Output

Each record contains 28 fields; the schema below lists 15 of them. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🖼️ coverUrl | string | null |
| 🆔 gutenbergId | string | "100" |
| 📛 title | string | "The Complete Works of William Shakespeare" |
| 👤 authorsText | string | "Shakespeare, William" |
| 👤 authors | array | [ { name, birthYear, deathYear } ] |
| 🏷️ subjects | array | ["Drama","English drama"] |
| 📁 bookshelves | array | ["Plays"] |
| 🌐 languages | array | ["en"] |
| 📋 copyright | boolean | false |
| 📥 downloadCount | number | 45230 |
| 📄 plainTextUrl | string | "https://www.gutenberg.org/files/100/100-0.txt" |
| 📕 epubUrl | string | "https://www.gutenberg.org/ebooks/100.epub3.images" |
| 📖 kindleUrl | string | "https://www.gutenberg.org/ebooks/100.kf8.images" |
| 🌐 htmlUrl | string | "https://www.gutenberg.org/files/100/100-h/100-h.htm" |
| 🔗 gutenbergUrl | string | "https://www.gutenberg.org/ebooks/100" |
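
To illustrate how records with this shape might be consumed, here is a small Python sketch. The helper names (`pick_download_url`, `author_label`, `top_by_downloads`) and the format priority are assumptions made for this example; only the field names and the sample values come from the schema. The birth and death years in the sample are Shakespeare's actual dates, filled in because the schema shows only the key names.

```python
from typing import Any, Optional

# Preferred order of download formats; purely an illustrative choice.
FORMAT_PRIORITY = ("plainTextUrl", "epubUrl", "htmlUrl", "kindleUrl")


def pick_download_url(
    record: dict[str, Any], priority: tuple[str, ...] = FORMAT_PRIORITY
) -> Optional[str]:
    """Return the first non-empty download URL, following the given priority."""
    for field in priority:
        url = record.get(field)
        if url:
            return url
    return None


def author_label(record: dict[str, Any]) -> str:
    """Format authors as 'Name (birth-death)', using '?' for unknown years."""
    labels = []
    for author in record.get("authors") or []:
        name = author.get("name", "Unknown")
        birth = author.get("birthYear")
        death = author.get("deathYear")
        if birth is None and death is None:
            labels.append(name)
        else:
            birth_text = "?" if birth is None else str(birth)
            death_text = "?" if death is None else str(death)
            labels.append(f"{name} ({birth_text}-{death_text})")
    return "; ".join(labels)


def top_by_downloads(records: list[dict[str, Any]], n: int) -> list[dict[str, Any]]:
    """Return the n records with the highest downloadCount (missing counts as 0)."""
    ranked = sorted(records, key=lambda r: r.get("downloadCount") or 0, reverse=True)
    return ranked[:n]


# Sample built from the example values in the schema table.
sample = {
    "gutenbergId": "100",
    "title": "The Complete Works of William Shakespeare",
    "authors": [{"name": "Shakespeare, William", "birthYear": 1564, "deathYear": 1616}],
    "downloadCount": 45230,
    "plainTextUrl": "https://www.gutenberg.org/files/100/100-0.txt",
    "epubUrl": "https://www.gutenberg.org/ebooks/100.epub3.images",
}

print(author_label(sample))       # Shakespeare, William (1564-1616)
print(pick_download_url(sample))  # https://www.gutenberg.org/files/100/100-0.txt
```

Sorting client-side with `top_by_downloads` is one way to rank by popularity when the input offers no sort option.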


✨ Why choose this Actor

  • 📚 75,000+ books. Every public-domain text Project Gutenberg has digitized since 1971.
  • 🌐 60+ languages. English dominates, but you can find French, German, Spanish, Chinese, and more.
  • 📄 Multi-format URLs. Plain-text, EPUB, Kindle, HTML, and PDF when available.
  • 📥 Download counts. Filter and rank by reader popularity.
  • ⚖️ Public domain. Use commercially without restrictions in most jurisdictions.

📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ This Actor | $5 free credit | 75,000+ books | Live per run | query, author, lang, topic, year | ⚡ 2 min |
| Manual Gutenberg browsing | Free | Manual | Live | Web filters only | 🕒 Manual |
| Standard Ebooks | Free | Curated subset | Slow | Limited | 🐢 Account |
| Internet Archive Texts | Free | Massive | Variable | Bulk only | 🐢 ETL |

Pick this Actor when you want broad coverage, server-side filtering, and no pipeline maintenance.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Project Gutenberg Books Scraper page on the Apify Store.
  3. 🎯 Set input. Pick your filters and maxItems.
  4. 🚀 Run it. Click Start and let the Actor collect your data.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
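
For readers who do want to script the download step, the dataset export can also be requested over the Apify API. This is a minimal sketch: the helper name `dataset_export_url` is hypothetical, and the dataset ID shown is a placeholder. It relies on the public "get dataset items" endpoint, where Excel exports use the format name `xlsx`.

```python
from typing import Optional
from urllib.parse import urlencode

# The export formats named on this page, using the API's format identifiers.
EXPORT_FORMATS = {"csv", "xlsx", "json", "xml"}


def dataset_export_url(dataset_id: str, fmt: str = "json", limit: Optional[int] = None) -> str:
    """Build the URL that returns a dataset's items in the requested format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    params: dict[str, object] = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"


# "abc123" is a placeholder; the real ID appears on the run's Dataset tab.
print(dataset_export_url("abc123", "csv"))
```

Datasets that are not public require an API token, which can be passed in an `Authorization: Bearer <token>` header when fetching the URL.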


💼 Business use cases

🤖 NLP & ML

  • Build training corpora for language models
  • Authorship-attribution datasets
  • Style-transfer corpora
  • Multilingual training data

📚 Libraries & Education

  • Build classroom ebook collections
  • Curriculum-aligned reading lists
  • Free supplementary materials for K-12
  • Library catalog enrichment

📰 Content & Publishing

  • Republish public-domain works
  • Generate audiobook scripts
  • Create curated newsletters
  • Build literary discovery apps

🔬 Research & Academia

  • Citation generation
  • Distant-reading studies
  • Genre evolution analysis
  • Translation corpora

🔌 Automating Project Gutenberg Books Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Hourly, daily, or weekly refreshes keep downstream databases in sync automatically.
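
A programmatic run with the Python client might look like the sketch below. Several things here are assumptions: the Actor ID is a placeholder to be copied from the Actor's page, the token is expected in an `APIFY_TOKEN` environment variable, and `call()` is assumed to return the run as a dictionary with a `defaultDatasetId` key, as it does in the apify-client versions this sketch was written against. The helper names `make_client` and `collect_items` are illustrative.

```python
import os
from typing import Any

# Placeholder: replace with the ID shown on the Actor's page in the Apify Store.
ACTOR_ID = "<username>/<actor-name>"


def make_client(token: str) -> Any:
    """Create an API client (requires: pip install apify-client)."""
    from apify_client import ApifyClient  # imported lazily so the module loads without it

    return ApifyClient(token)


def collect_items(client: Any, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return all dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    # Each finished run writes its records to a default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main() -> None:
    """Example entry point; needs network access and a valid token."""
    client = make_client(os.environ["APIFY_TOKEN"])
    books = collect_items(client, ACTOR_ID, {"maxItems": 100, "query": "shakespeare"})
    for book in books:
        print(book.get("gutenbergId"), book.get("title"))
```

Calling `main()` from a scheduled job is one way to wire this into a pipeline. Passing the client in as a parameter keeps `collect_items` easy to test with a stub.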


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Reproducible literary corpora
  • Versioned text snapshots
  • Computational linguistics studies
  • Course material with primary sources

🎨 Personal and creative

  • Personal ebook collections
  • Indie reading-app side projects
  • Newsletter on classic literature
  • Hobbyist literary databases

🤝 Non-profit and civic

  • Library digitization projects
  • Reading-list contributions
  • Cultural-preservation outreach
  • Multilingual literacy programs

🧪 Experimentation

  • Train tokenizers on diverse text
  • Test text-mining pipelines
  • Prototype text-recommendation engines
  • Build literary-analysis dashboards


❓ Frequently Asked Questions

🧩 How does it work?

Provide a query, author, language, or topic filter. The Actor queries the Project Gutenberg catalog and emits one record per book.

📥 Can I download the actual book contents?

The Actor returns metadata and direct download URLs for plain-text, EPUB, Kindle, HTML, and PDF formats. Use those URLs to fetch the actual contents.
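
As a rough illustration of that second step, the Python sketch below fetches a plain-text URL and trims the license header and footer that Project Gutenberg wraps around each text. The function names and the User-Agent string are made up for this example, and the marker patterns reflect the common "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" convention; older files may use other wording, so treat the stripping as best effort.

```python
import re
import urllib.request

# Marker lines that usually delimit the book body in Gutenberg plain-text files.
_START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
_END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers, if they are present."""
    start = _START.search(text)
    end = _END.search(text)
    body_start = start.end() if start else 0
    body_end = end.start() if end else len(text)
    return text[body_start:body_end].strip()


def fetch_plain_text(url: str, timeout: float = 30.0) -> str:
    """Download a plainTextUrl and decode it as UTF-8 (undecodable bytes are replaced)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "gutenberg-example/0.1"}  # hypothetical identifier
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration with a tiny synthetic file.
demo = (
    "Title page and license notice\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK HAMLET ***\n\n"
    "To be, or not to be.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK HAMLET ***\n"
    "License text"
)
print(strip_gutenberg_boilerplate(demo))  # To be, or not to be.
```

When downloading many books, pausing between requests keeps the load on the source site low.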

⚖️ Is everything truly public domain?

Yes for the vast majority. The copyright field flags the rare exceptions still under copyright in some jurisdictions.

📊 How many fields per record?

28, including title, authors with birth/death years, cover, all download URLs, subjects, bookshelves, language, and download counts.

🔁 Can I schedule runs?

Yes. New books and translations are added regularly. Schedule weekly to capture additions.

🌐 Which languages are supported?

60+, with strongest coverage in English, French, German, Spanish, Italian, Dutch, Portuguese, and Chinese.

👤 Does it include author biographies?

No, but it returns author birth/death years for period research.

💳 Do I need a paid Apify plan?

No. The free plan covers preview runs. A paid plan unlocks higher item counts and scheduling.

🆘 What if a run fails?

Apify retries transient errors. Partial datasets are preserved.

🎙️ Can I generate audiobooks from this?

Yes. Pull plain-text URLs and pipe through any text-to-speech engine.


🔌 Integrate with any app

Project Gutenberg Books Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe data into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes.
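
A receiving service typically needs the dataset ID of the finished run. The sketch below parses a webhook body under the assumption that the webhook uses Apify's default payload template, in which `eventType` names the event and `resource` holds the run object; a custom payload template would need different parsing. The function name and the sample body are illustrative.

```python
import json
from typing import Optional, Union


def dataset_id_from_webhook(body: Union[str, bytes]) -> Optional[str]:
    """Return the run's default dataset ID for successful runs, else None."""
    payload = json.loads(body)
    # Ignore failures, timeouts, and other event types.
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")


# Hand-written sample body; the IDs are placeholders, not real values.
sample_body = json.dumps(
    {
        "eventType": "ACTOR.RUN.SUCCEEDED",
        "eventData": {"actorId": "actor-placeholder", "actorRunId": "run-placeholder"},
        "resource": {"id": "run-placeholder", "defaultDatasetId": "dataset-placeholder"},
    }
)
print(dataset_id_from_webhook(sample_body))  # dataset-placeholder
```

The returned ID can then be used to fetch the records and load them into a downstream system.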


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by Project Gutenberg, the Gutendex project, or any contributing volunteers. All trademarks mentioned are the property of their respective owners. Only publicly available open data is collected.