Tatoeba Sentence Corpus Scraper
Pricing
from $10.00 / 1,000 result items
Extract the Tatoeba sentence corpus, with millions of bilingual example sentences. Capture sentence ID, language, text, owner, audio URL, translations, tags, and license. Export to JSON, CSV, or Excel for language learning, NLP training data, translation memory, and linguistic research.
Developer
ParseForge
Maintained by Community

Tatoeba Sentence Corpus Scraper
Export the world's largest open multilingual sentence corpus in seconds. Pull 12,000,000+ example sentences across 400+ languages with translations, audio, contributor info, and CC-BY licence metadata. No login, no manual CSV stitching.
Last updated: 2026-05-23 · 14 fields per record · 12M+ sentences · 400+ languages · Audio + translations
The Tatoeba Sentence Corpus Scraper taps into the Tatoeba community catalog and returns 14 structured fields per sentence, including the original text, language code and name, translation list, audio links, contributor handle, correctness score, and licence. Tatoeba has been collaboratively edited by linguists, polyglots, and language learners since 2006, and ships under a permissive Creative Commons licence.
The catalog covers every major living language family plus dozens of constructed, classical, and minority languages, from Mandarin and Spanish down to Latin, Esperanto, and revived regional tongues. This Actor turns that into a clean CSV, Excel, JSON, or XML dataset in under five minutes, with all filtering done server-side so you skip the parsing entirely.
| Target Audience | Primary Use Cases |
|---|---|
| Linguists, language-learning app builders, translation researchers, NLP engineers, lexicographers, ESL teachers, audio dataset curators | Parallel corpus mining, flashcard sourcing, translation memory seeding, speech model training, idiom and proverb research, classroom example banks |
What the Tatoeba Scraper does
Four sentence-mining workflows in a single run:
- Keyword search. Find every sentence containing a target word or phrase.
- Source language filter. Pick a single source language out of 400+ (Tatoeba uses ISO 639-3 codes).
- Target translation filter. Restrict to sentences that have a translation into a chosen language.
- Tag filter. Pull only sentences tagged with concepts like `proverb`, `idiom`, `greeting`, or any community label.
Each record includes the sentence ID, raw text, language code and human-readable name, every linked translation (with that translation's language), audio file URLs when available, contributor handle, correctness score, licence, and the canonical Tatoeba page link.
Why it matters: clean, licence-clear parallel sentences are the raw material of every translation memory, language-learning flashcard, and speech model training set. Building your own pipeline against the Tatoeba site means writing fragile HTML parsers and respecting rate limits by hand. This Actor delivers the same data structured and ready to import.
Full Demo
Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded sentence corpus.
Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| maxItems | integer | 10 | Sentences to return. Free plan caps at 10, paid plan at 1,000,000. |
| query | string | "hello" | Keyword search. Empty browses the chosen language without a filter. |
| fromLanguage | string | "eng" | Source language (ISO 639-3). 50 most common languages exposed. |
| toLanguage | string | "" | Target translation language (ISO 639-3). Empty returns all translations. |
| tags | array | [] | Filter by community tag names like proverb, idiom. |
Example: 50 English sentences containing "morning" with Spanish translations.
{"maxItems": 50, "query": "morning", "fromLanguage": "eng", "toLanguage": "spa"}
Example: 100 Japanese proverbs with translations into any language.
{"maxItems": 100, "query": "", "fromLanguage": "jpn", "tags": ["proverb"]}
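The two inputs above can also be assembled in code before a run. The following is a minimal sketch: the field names and defaults come from the Input table, while the helper name `build_run_input` and its checks are assumptions added for illustration. The three-letter check only tests the format of an ISO 639-3 code and does not confirm that the Actor exposes that language.

```python
from typing import Any, Dict, List, Optional


def build_run_input(
    from_language: str = "eng",
    query: str = "hello",
    to_language: str = "",
    tags: Optional[List[str]] = None,
    max_items: int = 10,
) -> Dict[str, Any]:
    """Assemble an input object using the field names from the Input table."""
    # Format check only: ISO 639-3 codes are three lowercase letters.
    # An empty string is allowed and means "no filter" for that field.
    for code in (from_language, to_language):
        if code and not (len(code) == 3 and code.isalpha() and code.islower()):
            raise ValueError(f"expected a three-letter ISO 639-3 code, got {code!r}")
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")
    return {
        "maxItems": max_items,
        "query": query,
        "fromLanguage": from_language,
        "toLanguage": to_language,
        "tags": list(tags or []),
    }


# The two documented examples, rebuilt with the helper.
morning = build_run_input(from_language="eng", query="morning", to_language="spa", max_items=50)
proverbs = build_run_input(from_language="jpn", query="", tags=["proverb"], max_items=100)
```

Rejecting a malformed code locally is cheaper than discovering it after a run has started.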
Good to Know: Tatoeba is a community-edited corpus. Correctness scores reflect peer review, but expect occasional informal or regional phrasings. For production translation memory, weight by `correctness` and prefer sentences with multiple contributor confirmations.
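One way to act on that advice is to filter and rank records after download. This sketch assumes that a higher `correctness` value means a better-reviewed sentence and takes its default threshold of 1 from the schema example; the scale itself is not documented here, so treat `min_correctness` as a value to tune. The output has no confirmation count, so `translationCount` stands in as a rough proxy for community attention. The function name and the sample rows are made up for illustration.

```python
from typing import Any, Dict, Iterable, List


def rank_by_correctness(
    records: Iterable[Dict[str, Any]],
    min_correctness: float = 1,
) -> List[Dict[str, Any]]:
    """Keep records at or above a threshold, best-reviewed first."""
    # Records without a correctness value are treated as 0 (an assumption).
    kept = [r for r in records if r.get("correctness", 0) >= min_correctness]
    # Ties on correctness are broken by translationCount, highest first.
    return sorted(
        kept,
        key=lambda r: (r.get("correctness", 0), r.get("translationCount", 0)),
        reverse=True,
    )


# Illustrative rows only; these are not real Tatoeba records.
sample = [
    {"sentenceId": 1, "correctness": 1, "translationCount": 2},
    {"sentenceId": 2, "correctness": 0, "translationCount": 9},
    {"sentenceId": 3, "correctness": 1, "translationCount": 12},
]
ranked_ids = [r["sentenceId"] for r in rank_by_correctness(sample)]
```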
Output
Each sentence record contains 14 fields. Download the dataset as CSV, Excel, JSON, or XML.
Schema
| Field | Type | Example |
|---|---|---|
| sentenceId | number | 1276 |
| text | string | "Let's try something." |
| language | string | "eng" |
| languageName | string | "English" |
| direction | string | "source" |
| correctness | number | 1 |
| license | string | "CC BY 2.0 FR" |
| contributor | string | "CK" |
| hasAudio | boolean | true |
| audioUrls | array | ["https://tatoeba.org/audio/download/1276"] |
| translationCount | number | 12 |
| translations | array | [{"id":1277,"text":"Probemos algo.","language":"spa"}] |
| url | string | "https://tatoeba.org/eng/sentences/show/1276" |
| scrapedAt | string (ISO 8601) | "2026-05-23T00:00:00.000Z" |
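For parallel corpus mining, the nested `translations` array usually needs flattening into sentence pairs. The sketch below assumes records shaped like the schema table. The `example_record` is assembled from the table's Example column and is not captured Actor output, and `iter_pairs` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

# Built from the Example column of the schema table, for illustration only.
# translationCount is 12 there, but the table shows a single translation.
example_record: Dict[str, Any] = {
    "sentenceId": 1276,
    "text": "Let's try something.",
    "language": "eng",
    "languageName": "English",
    "direction": "source",
    "correctness": 1,
    "license": "CC BY 2.0 FR",
    "contributor": "CK",
    "hasAudio": True,
    "audioUrls": ["https://tatoeba.org/audio/download/1276"],
    "translationCount": 12,
    "translations": [{"id": 1277, "text": "Probemos algo.", "language": "spa"}],
    "url": "https://tatoeba.org/eng/sentences/show/1276",
    "scrapedAt": "2026-05-23T00:00:00.000Z",
}


def iter_pairs(
    record: Dict[str, Any],
    target_language: Optional[str] = None,
) -> Iterator[Tuple[str, str, str]]:
    """Yield (source_text, target_language, target_text) per translation."""
    for translation in record.get("translations") or []:
        # With no target language given, every linked translation is kept.
        if target_language is None or translation.get("language") == target_language:
            yield record["text"], translation["language"], translation["text"]


pairs = list(iter_pairs(example_record, "spa"))
```

Each tuple can be written straight to a two-column TSV or loaded into a translation memory.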
Sample records
Why choose this Actor
| Capability | Details |
|---|---|
| 400+ language coverage | Major world languages, classical languages, constructed languages, and revived minority tongues. |
| Combined filters | Source language, target language, keyword, and tag filters apply together in a single run. |
| Audio links included | Native-speaker recordings are flagged and linked when the contributor uploaded one. |
| Clear licensing | Every record carries its Creative Commons licence string. |
| Fast | 10 sentences in under 5 seconds, 10,000 records in under 2 minutes. |
| Always fresh | Every run hits the live catalog so new community submissions are picked up. |
| No authentication | Public corpus, no API key required. |
Parallel sentence corpora are the backbone of every translation memory, flashcard deck, and language-learning curriculum on the market.
How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| Tatoeba Scraper (this Actor) | $5 free credit, then pay-per-use | 12M+ sentences, 400+ languages | Live per run | language, tag, keyword | 2 min |
| Commercial translation memories | $500+/month | Domain-specific, limited languages | Quarterly | Industry slice | Days |
| Custom site scraper | Free, plus engineering time | Manual | Depends on cron | Hand-built | Weeks |
| Static corpus dumps | Free | Full but stale | Quarterly tarball | None | Hours of parsing |
Pick this Actor when you want fresh community data, built-in filters, and a clean tabular result with zero parser maintenance.
How to use
- Sign up. Create a free account with $5 credit (takes 2 minutes).
- Open the Actor. Go to the Tatoeba Sentence Corpus Scraper page on the Apify Store.
- Set input. Pick a source language, optional keyword, optional target language and tags, and set `maxItems`.
- Run it. Click Start and let the Actor collect your sentences.
- Download. Grab your results from the Dataset tab as CSV, Excel, JSON, or XML.
Total time from signup to downloaded corpus: 3-5 minutes. No coding required.
Business use cases
Automating Tatoeba Scraper
Control the scraper programmatically for scheduled runs and pipeline integrations:
- Node.js. Install the `apify-client` NPM package.
- Python. Use the `apify-client` PyPI package.
- See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. Weekly refreshes keep your translation memory and flashcard banks in sync with the latest community additions.
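A programmatic run with the Python `apify-client` package might look like the sketch below. It assumes the client interface in which `ActorClient.call` waits for the run to finish and returns a dict containing `defaultDatasetId`. The Actor ID is a placeholder, not the real identifier, so copy the actual one from the Actor's page. `run_scraper` and `RUN_INPUT` are names chosen for this example.

```python
from typing import Any, Dict, List

# Placeholder, not the real ID: replace with the ID shown on the Actor's page.
ACTOR_ID = "<username>/<actor-name>"

# Same input as the documented "morning" example.
RUN_INPUT: Dict[str, Any] = {
    "maxItems": 50,
    "query": "morning",
    "fromLanguage": "eng",
    "toLanguage": "spa",
}


def run_scraper(
    token: str,
    run_input: Dict[str, Any],
    actor_id: str = ACTOR_ID,
) -> List[Dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported here so this module still loads where apify-client is absent.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return run details")
    # Results live in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

To try it, call `run_scraper(os.environ["APIFY_TOKEN"], RUN_INPUT)` after importing `os` and setting a real Actor ID. Keeping the token in an environment variable avoids committing it to source control.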
Beyond business use cases
Data like this powers more than commercial workflows. The same structured sentences support research, education, civic projects, and personal initiatives.
Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge actor in the AI of your choice:
- ChatGPT
- Claude
- Perplexity
- Copilot
Frequently Asked Questions
How does it work?
Choose a source language, optional keyword, optional target language, and optional tags. Click Start and the Actor pulls matching sentences with translations, audio links, and licence info attached to each record.
How accurate are the translations?
Tatoeba sentences are peer-reviewed by the community. The correctness field reflects that review. Native-speaker contributions are common in major languages, while smaller languages may have a smaller pool of confirmations.
How often is the corpus refreshed?
The Tatoeba project accepts new sentences and edits continuously. Every Actor run hits the live catalog, so fresh contributions appear in your dataset right away.
Which languages are supported?
The corpus spans 400+ languages. The input form exposes the 50 most common languages by ISO 639-3 code (English, Spanish, Mandarin, Japanese, Arabic, German, French, and more). Less common languages can still be reached via translation links.
Does every sentence have audio?
No. Audio is optional and depends on community uploads. The `hasAudio` flag tells you per record, and `audioUrls` carries the file links when present.
Can I schedule regular runs?
Yes. Use Apify Schedules to trigger this Actor on any cron interval (daily, weekly, monthly) and keep your downstream corpus in sync.
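Is this data legal to use?
Yes, subject to each record's licence. Most Tatoeba sentences are published under a Creative Commons CC BY licence, so check the `license` field on each record instead of assuming one licence for the whole corpus. Attribute the corpus and the individual contributor handles where applicable, and you can use CC BY sentences commercially or non-commercially. Audio recordings can carry licence terms of their own, so confirm those on Tatoeba before redistributing audio.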
Can I use this data commercially?
Yes. CC BY allows commercial reuse with attribution. Bundle the licence string and contributor names in your downstream product, and you are good to go.
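Bundling the licence string and contributor name can be as simple as generating one credit line per record. This is a minimal sketch assuming records shaped like the output schema. The wording of the credit line is an illustrative choice, not a format prescribed by Tatoeba or by the licence, and `attribution` is a hypothetical helper name.

```python
from typing import Any, Dict


def attribution(record: Dict[str, Any]) -> str:
    """Format one credit line from the text, contributor, URL, and licence."""
    # Fall back to explicit placeholders so a gap is visible, never silent.
    contributor = record.get("contributor") or "unknown contributor"
    licence = record.get("license") or "licence not stated"
    return f'"{record["text"]}" by {contributor}, via Tatoeba ({record["url"]}), {licence}'


# Values taken from the Example column of the schema table.
credit = attribution(
    {
        "text": "Let's try something.",
        "contributor": "CK",
        "license": "CC BY 2.0 FR",
        "url": "https://tatoeba.org/eng/sentences/show/1276",
    }
)
# credit == '"Let\'s try something." by CK, via Tatoeba (https://tatoeba.org/eng/sentences/show/1276), CC BY 2.0 FR'
```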
Do I need a paid Apify plan to use this Actor?
No. The free Apify plan covers testing and small runs (10 records per run). A paid plan lifts the cap and unlocks scheduling, larger datasets, and higher concurrency.
What happens if a run fails or gets interrupted?
Apify retries transient errors automatically. If a run still fails, inspect the log in the Runs tab, fix the input, and restart. Partial datasets are preserved so you never lose progress.
What if I need help?
Our support team is here for you. Use the Apify platform messaging or the Tally form linked below.
Integrate with any app
Tatoeba Sentence Corpus Scraper connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe sentence data into your warehouse
- GitHub - Trigger runs from commits and releases
- Google Drive - Export datasets straight to Sheets
You can also use webhooks to trigger downstream actions when a run finishes. Push fresh sentences into your translation memory or alert your team in Slack.
Recommended Actors
- MyMemory Translation Scraper - Translate text across 70+ language pairs
- LibriVox Audiobooks Scraper - Public-domain audiobooks with reader credits
- Library of Congress Scraper - 170M+ digitized cultural records
- ArXiv Scraper - Academic preprints with metadata
- Figshare Scraper - Open research datasets and figures
Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.
Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the Tatoeba Project or its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open corpus data is collected, under the project's Creative Commons licence.