Pricing

from $10.00 / 1,000 result items

Tatoeba Sentence Corpus Scraper

Extract the Tatoeba sentence corpus, with millions of bilingual example sentences. Capture sentence ID, language, text, owner, audio URL, translations, tags, and license. Export to JSON, CSV, or Excel for language learning, NLP training data, translation memory, and linguistic research.

Developer

ParseForge

Last modified

6 days ago

🗣️ Tatoeba Sentence Corpus Scraper

🚀 Export the world's largest open multilingual sentence corpus in seconds. Pull 12,000,000+ example sentences across 400+ languages with translations, audio, contributor info, and CC-BY licence metadata. No login, no manual CSV stitching.

The Tatoeba Sentence Corpus Scraper taps into the Tatoeba community catalog and returns 14 structured fields per sentence, including the original text, language code and name, translation list, audio links, contributor handle, correctness score, and licence. Tatoeba has been collaboratively edited by linguists, polyglots, and language learners since 2006, and ships under a permissive Creative Commons licence.

The catalog covers every major living language family plus dozens of constructed, classical, and minority languages, from Mandarin and Spanish down to Latin, Esperanto, and revived regional tongues. This Actor turns that catalog into a clean CSV, Excel, JSON, or XML dataset in under five minutes. All filtering is done server-side, so you skip the parsing entirely.

🎯 Target Audience	💡 Primary Use Cases
Linguists, language-learning app builders, translation researchers, NLP engineers, lexicographers, ESL teachers, audio dataset curators	Parallel corpus mining, flashcard sourcing, translation memory seeding, speech model training, idiom and proverb research, classroom example banks

📋 What the Tatoeba Scraper does

Four sentence-mining workflows in a single run:

🔎 Keyword search. Find every sentence containing a target word or phrase.
🌐 Source language filter. Pick a single source language out of 400+ (Tatoeba uses ISO 639-3 codes).
↔️ Target translation filter. Restrict to sentences that have a translation into a chosen language.
🏷️ Tag filter. Pull only sentences tagged with concepts like proverb, idiom, greeting, or any community label.

Each record includes the sentence ID, raw text, language code and human-readable name, every linked translation (with that translation's language), audio file URLs when available, contributor handle, correctness score, licence, and the canonical Tatoeba page link.
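As a rough illustration of how such a record could feed a parallel corpus, the sketch below flattens one record into source/target sentence pairs. It assumes each entry in translations is an object with language and text keys. The listing only says each translation carries its language, so that shape, the helper name parallel_pairs, and the sample record are all hypothetical.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

# Made-up sample record for illustration only; the shape of each entry in
# "translations" (keys "language" and "text") is an assumption.
SAMPLE_RECORD: Dict[str, Any] = {
    "sentenceId": 1,
    "language": "eng",
    "text": "Hello.",
    "translations": [
        {"language": "fra", "text": "Bonjour."},
        {"language": "deu", "text": "Hallo."},
    ],
}


def parallel_pairs(
    record: Dict[str, Any], target_language: Optional[str] = None
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (source_lang, source_text, target_lang, target_text) tuples.

    If target_language is given, only translations in that language are kept.
    Records with a missing or empty translations list yield nothing.
    """
    source_lang = record["language"]
    source_text = record["text"]
    for translation in record.get("translations") or []:
        lang = translation.get("language")
        if target_language is not None and lang != target_language:
            continue
        yield (source_lang, source_text, lang, translation.get("text", ""))


if __name__ == "__main__":
    for pair in parallel_pairs(SAMPLE_RECORD, target_language="fra"):
        print(pair)
```

Yielding tuples rather than building a list keeps memory flat when the same helper is mapped over a large exported dataset.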

💡 Why it matters: clean, licence-clear parallel sentences are the raw material of every translation memory, language-learning flashcard, and speech model training set. Building your own pipeline against the Tatoeba site means writing fragile HTML parsers and respecting rate limits by hand. This Actor delivers the same data structured and ready to import.

📊 Data fields

Each record includes: audioUrls, contributor, correctness, direction, hasAudio, language, languageName, license, scrapedAt, sentenceId, text, translationCount, translations, url. These field names come straight from the actor's dataset schema, so what you see here is what lands in your dataset.
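A quick sanity check on an exported dataset might look like the sketch below. The 14 field names are copied from the list above; the helper name missing_fields is hypothetical, and nothing here assumes anything about the value types.

```python
from typing import Any, Dict, List

# The 14 field names listed in the Data fields section above.
EXPECTED_FIELDS = frozenset({
    "audioUrls", "contributor", "correctness", "direction", "hasAudio",
    "language", "languageName", "license", "scrapedAt", "sentenceId",
    "text", "translationCount", "translations", "url",
})


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected field names absent from a record, sorted by name."""
    return sorted(EXPECTED_FIELDS - set(record))


if __name__ == "__main__":
    partial = {"sentenceId": 1, "text": "Hello.", "language": "eng"}
    print(missing_fields(partial))
```

Running this over the first few rows of a download is a cheap way to spot a truncated or re-mapped export before loading it into a larger pipeline.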

🚀 How to use

📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
🌐 Open the Actor. Go to the Tatoeba Sentence Corpus Scraper page on the Apify Store.
🎯 Set input. Pick a source language, optional keyword, optional target language and tags, set maxItems.
🚀 Run it. Click Start and let the Actor collect your sentences.
📥 Download. Grab your results from the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded corpus: 3-5 minutes. No coding required.
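For anyone who would rather prepare the input as JSON than fill in the form, the sketch below builds an input object from the options named in the steps above. Only maxItems appears in this listing; the keys sourceLanguage, keyword, targetLanguage, and tags are assumed names, so check the Actor's input schema before relying on them. The language check only tests the shape of an ISO 639-3 code (three lowercase letters), not whether the code exists.

```python
import json
import re
from typing import Any, Dict, List, Optional

# Shape check only: ISO 639-3 codes are three lowercase ASCII letters.
_ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")


def build_run_input(
    source_language: str,
    keyword: Optional[str] = None,
    target_language: Optional[str] = None,
    tags: Optional[List[str]] = None,
    max_items: int = 100,
) -> Dict[str, Any]:
    """Build an input dict; every key except maxItems is an assumed name."""
    for code in (source_language, target_language):
        if code is not None and not _ISO_639_3_SHAPE.fullmatch(code):
            raise ValueError(f"not a three-letter language code: {code!r}")
    if max_items < 1:
        raise ValueError("max_items must be at least 1")

    run_input: Dict[str, Any] = {
        "sourceLanguage": source_language,  # assumed key name
        "maxItems": max_items,              # named in the steps above
    }
    # Optional filters are only included when set.
    if keyword:
        run_input["keyword"] = keyword                # assumed key name
    if target_language:
        run_input["targetLanguage"] = target_language  # assumed key name
    if tags:
        run_input["tags"] = list(tags)                # assumed key name
    return run_input


if __name__ == "__main__":
    payload = build_run_input("eng", keyword="rain", target_language="jpn",
                              tags=["proverb"], max_items=50)
    print(json.dumps(payload, indent=2))
```

Leaving unset filters out of the payload, instead of sending empty strings, avoids guessing how the Actor treats blank values.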

🔗 Recommended Actors

🌐 MyMemory Translation Scraper - Translate text across 70+ language pairs
📚 LibriVox Audiobooks Scraper - Public-domain audiobooks with reader credits
🏛️ Library of Congress Scraper - 170M+ digitized cultural records
📰 ArXiv Scraper - Academic preprints with metadata
📖 Figshare Scraper - Open research datasets and figures

💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.

⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the Tatoeba Project or its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open corpus data is collected, under the project's Creative Commons licence.

🆘 Need Help?

If you hit a bug, have questions about setup, or need a scraper we haven't built yet, open our contact form or write to parseforge@protonmail.com. We also take on paid custom data projects.

For faster answers, join our Discord. It's the best place to get support and suggest new actors.
