Tatoeba Sentence Corpus Scraper

Pricing

from $10.00 / 1,000 result items

Extract the Tatoeba sentence corpus, with millions of bilingual example sentences. Capture sentence ID, language, text, owner, audio URL, translations, tags, and license. Export to JSON, CSV, or Excel for language learning, NLP training data, translation memory, and linguistic research.

Developer

ParseForge

Maintained by Community

🗣️ Tatoeba Sentence Corpus Scraper

🚀 Export the world's largest open multilingual sentence corpus in seconds. Pull 12,000,000+ example sentences across 400+ languages with translations, audio, contributor info, and CC-BY licence metadata. No login, no manual CSV stitching.

🕒 Last updated: 2026-05-23 · 📊 14 fields per record · 🗣️ 12M+ sentences · 🌍 400+ languages · 🔊 Audio + translations

The Tatoeba Sentence Corpus Scraper taps into the Tatoeba community catalog and returns 14 structured fields per sentence, including the original text, language code and name, translation list, audio links, contributor handle, correctness score, and licence. Tatoeba has been collaboratively edited by linguists, polyglots, and language learners since 2006, and ships under a permissive Creative Commons licence.

The catalog covers every major living language family plus dozens of constructed, classical, and minority languages, from Mandarin and Spanish down to Latin, Esperanto, and revived regional tongues. This Actor turns that into a clean CSV, Excel, JSON, or XML dataset in under five minutes, with all filtering done server-side so you skip the parsing entirely.

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| Linguists, language-learning app builders, translation researchers, NLP engineers, lexicographers, ESL teachers, audio dataset curators | Parallel corpus mining, flashcard sourcing, translation memory seeding, speech model training, idiom and proverb research, classroom example banks |

📋 What the Tatoeba Scraper does

Four sentence-mining workflows in a single run:

  • 🔎 Keyword search. Find every sentence containing a target word or phrase.
  • 🌐 Source language filter. Pick a single source language out of 400+ (Tatoeba uses ISO 639-3 codes).
  • ↔️ Target translation filter. Restrict to sentences that have a translation into a chosen language.
  • 🏷️ Tag filter. Pull only sentences tagged with concepts like proverb, idiom, greeting, or any community label.

Each record includes the sentence ID, raw text, language code and human-readable name, every linked translation (with that translation's language), audio file URLs when available, contributor handle, correctness score, licence, and the canonical Tatoeba page link.

💡 Why it matters: clean, licence-clear parallel sentences are the raw material of every translation memory, language-learning flashcard, and speech model training set. Building your own pipeline against the Tatoeba site means writing fragile HTML parsers and respecting rate limits by hand. This Actor delivers the same data structured and ready to import.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded sentence corpus.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| maxItems | integer | 10 | Sentences to return. Free plan caps at 10, paid plan at 1,000,000. |
| query | string | "hello" | Keyword search. Empty browses the chosen language without a filter. |
| fromLanguage | string | "eng" | Source language (ISO 639-3). 50 most common languages exposed. |
| toLanguage | string | "" | Target translation language (ISO 639-3). Empty returns all translations. |
| tags | array | [] | Filter by community tag names like proverb, idiom. |

Example: 50 English sentences containing "morning" with Spanish translations.

{
  "maxItems": 50,
  "query": "morning",
  "fromLanguage": "eng",
  "toLanguage": "spa"
}

Example: 100 Japanese proverbs with translations into any language.

{
  "maxItems": 100,
  "query": "",
  "fromLanguage": "jpn",
  "tags": ["proverb"]
}
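
The two inputs above can also be assembled in code. This is a minimal Python sketch that assumes the field names and defaults from the input table. The `build_input` helper and its validation rules (three lowercase letters for a language code, the 1,000,000 ceiling quoted for the paid plan) are illustrative assumptions and are not part of the Actor itself.

```python
import re

# Assumption: language codes are three lowercase letters (ISO 639-3 style).
ISO_639_3 = re.compile(r"^[a-z]{3}$")


def build_input(query="hello", from_language="eng", to_language="",
                tags=None, max_items=10):
    """Return a run-input dict shaped like the JSON examples above."""
    if not ISO_639_3.match(from_language):
        raise ValueError(f"fromLanguage must be a 3-letter code, got {from_language!r}")
    if to_language and not ISO_639_3.match(to_language):
        raise ValueError(f"toLanguage must be a 3-letter code, got {to_language!r}")
    if not 1 <= max_items <= 1_000_000:
        raise ValueError("maxItems must be between 1 and 1,000,000")

    run_input = {"maxItems": max_items, "query": query, "fromLanguage": from_language}
    # Optional filters are omitted when empty, as in the examples above.
    if to_language:
        run_input["toLanguage"] = to_language
    if tags:
        run_input["tags"] = list(tags)
    return run_input


# The two example inputs from this section.
morning = build_input(query="morning", to_language="spa", max_items=50)
proverbs = build_input(query="", from_language="jpn", tags=["proverb"], max_items=100)
```

Validating before the run starts avoids spending a run on a mistyped language code.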

⚠️ Good to Know: Tatoeba is a community-edited corpus. Correctness scores reflect peer review, but expect occasional informal or regional phrasings. For production translation memory, weight by correctness and prefer sentences with multiple contributor confirmations.
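
One way to apply that advice after a run, sketched in Python. It assumes that a higher `correctness` value means a better-reviewed sentence (the schema example shows `1`), and the second and third sample records are made up for the demonstration. `filter_by_correctness` is a hypothetical helper. The output schema has no confirmation count, so only the score is used here.

```python
def filter_by_correctness(records, min_correctness=1):
    """Keep records whose correctness score meets the threshold, best first."""
    kept = [r for r in records if r.get("correctness", 0) >= min_correctness]
    return sorted(kept, key=lambda r: r.get("correctness", 0), reverse=True)


# Illustrative records: the first uses the schema example values,
# the other two are invented to show the filter dropping them.
sample = [
    {"sentenceId": 1276, "text": "Let's try something.", "correctness": 1},
    {"sentenceId": 2, "text": "Hypothetical unreviewed sentence.", "correctness": 0},
    {"sentenceId": 3, "text": "Hypothetical flagged sentence.", "correctness": -1},
]

trusted = filter_by_correctness(sample)
trusted_ids = [r["sentenceId"] for r in trusted]
```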


📊 Output

Each sentence record contains 14 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🆔 sentenceId | number | 1276 |
| 💬 text | string | "Let's try something." |
| 🌐 language | string | "eng" |
| 🗺️ languageName | string | "English" |
| ↔️ direction | string | "source" |
| ✅ correctness | number | 1 |
| 📜 license | string | "CC BY 2.0 FR" |
| 👤 contributor | string | "CK" |
| 🔊 hasAudio | boolean | true |
| 🎧 audioUrls | array | ["https://tatoeba.org/audio/download/1276"] |
| 🔢 translationCount | number | 12 |
| 🌍 translations | array | [{"id":1277,"text":"Probemos algo.","language":"spa"}] |
| 🔗 url | string | "https://tatoeba.org/eng/sentences/show/1276" |
| 🕒 scrapedAt | ISO 8601 | "2026-05-23T00:00:00.000Z" |
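
To show how a record with this schema might be consumed, here is a small Python sketch that turns one record into source/target sentence pairs for a parallel corpus. The record is assembled from the example column of the schema table and is not a real run output. `to_pairs` is a hypothetical helper name.

```python
# One record built from the example values in the schema table.
EXAMPLE_RECORD = {
    "sentenceId": 1276,
    "text": "Let's try something.",
    "language": "eng",
    "languageName": "English",
    "direction": "source",
    "correctness": 1,
    "license": "CC BY 2.0 FR",
    "contributor": "CK",
    "hasAudio": True,
    "audioUrls": ["https://tatoeba.org/audio/download/1276"],
    "translationCount": 12,
    "translations": [{"id": 1277, "text": "Probemos algo.", "language": "spa"}],
    "url": "https://tatoeba.org/eng/sentences/show/1276",
    "scrapedAt": "2026-05-23T00:00:00.000Z",
}


def to_pairs(record, target_language=None):
    """Return (source_lang, source_text, target_lang, target_text) tuples."""
    pairs = []
    for translation in record.get("translations", []):
        # Optionally keep only translations into one target language.
        if target_language and translation["language"] != target_language:
            continue
        pairs.append((record["language"], record["text"],
                      translation["language"], translation["text"]))
    return pairs


pairs = to_pairs(EXAMPLE_RECORD, target_language="spa")
```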


✨ Why choose this Actor

| | Capability |
| --- | --- |
| 🌍 | 400+ language coverage. Major world languages, classical languages, constructed languages, and revived minority tongues. |
| 🎯 | Combined filters. Source language, target language, keyword, and tag filters apply together in a single run. |
| 🔊 | Audio links included. Native-speaker recordings are flagged and linked when the contributor uploaded one. |
| 📜 | Clear licensing. Every record carries its Creative Commons licence string. |
| ⚡ | Fast. 10 sentences in under 5 seconds, 10,000 records in under 2 minutes. |
| 🔁 | Always fresh. Every run hits the live catalog so new community submissions are picked up. |
| 🚫 | No authentication. Public corpus, no API key required. |

📊 Parallel sentence corpora are the backbone of every translation memory, flashcard deck, and language-learning curriculum on the market.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ Tatoeba Scraper (this Actor) | $5 free credit, then pay-per-use | 12M+ sentences, 400+ languages | Live per run | language, tag, keyword | ⚡ 2 min |
| Commercial translation memories | $500+/month | Domain-specific, limited languages | Quarterly | Industry slice | 🐢 Days |
| Custom site scraper | Free engineering | Manual | Depends on cron | Hand-built | ⏳ Weeks |
| Static corpus dumps | Free | Full but stale | Quarterly tarball | None | 🕒 Hours of parsing |

Pick this Actor when you want fresh community data, built-in filters, and a clean tabular result with zero parser maintenance.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Tatoeba Sentence Corpus Scraper page on the Apify Store.
  3. 🎯 Set input. Pick a source language, optional keyword, optional target language and tags, set maxItems.
  4. 🚀 Run it. Click Start and let the Actor collect your sentences.
  5. 📥 Download. Grab your results from the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded corpus: 3-5 minutes. No coding required.


💼 Business use cases

📱 Language-Learning Apps

  • Daily-phrase decks for streak-based apps
  • CEFR-style example banks per skill level
  • Idiom and proverb add-on packs
  • Audio prompts for pronunciation drills

🤖 NLP & Machine Translation

  • Seed parallel corpora for transformer fine-tuning
  • Build evaluation sets for translation quality
  • Augment domain corpora with everyday phrasing
  • Train sentence embedding models

🎓 Linguistic Research

  • Comparative syntax studies across language families
  • Lexicographic exemplar collection
  • Sociolinguistic surveys of register and dialect
  • Reproducible corpus pulls with versioned licence info

🎙️ Speech & Audio Pipelines

  • Voice-acted line banks for text-to-speech eval
  • Pronunciation dictionaries with native audio
  • Low-resource language audio collection
  • Forced-alignment training material

🔌 Automating Tatoeba Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Weekly refreshes keep your translation memory and flashcard banks in sync with the latest community additions.
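
A programmatic run with the Python `apify-client` package could look like the sketch below. The Actor ID and the `APIFY_TOKEN` environment variable name are assumptions for illustration, so copy the real ID from the Actor page and supply your own token. `fetch_sentences` is a hypothetical wrapper and not something this Actor ships.

```python
import os

# Hypothetical Actor ID: copy the real "username/actor-name" from the Actor page.
ACTOR_ID = "parseforge/tatoeba-sentence-corpus-scraper"


def fetch_sentences(run_input, token=None):
    """Run the Actor once, wait for it to finish, and return its dataset items."""
    # Imported lazily so this module loads even without the package installed
    # (pip install apify-client).
    from apify_client import ApifyClient

    # APIFY_TOKEN is an assumed environment variable holding your API token.
    client = ApifyClient(token or os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    # Each finished run writes its records to a default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `fetch_sentences({"maxItems": 50, "query": "morning", "fromLanguage": "eng", "toLanguage": "spa"})` would return the records for the first input example as a list of dictionaries.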


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured sentences support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Comparative linguistics theses and papers
  • Open-data exercises for NLP coursework
  • Sociolinguistic survey corpora
  • Reproducible studies citing exact dataset pulls

🎨 Personal and creative

  • Polyglot vocabulary journals and Anki decks
  • Multilingual quote walls and printables
  • Bilingual children's book drafts
  • Hobbyist phrasebook apps for travel

🤝 Non-profit and civic

  • Language-revitalization materials for minority tongues
  • Refugee-resettlement phrasebooks and trainings
  • Free ESL classroom example banks
  • Open-source translation projects for NGOs

🧪 Experimentation

  • Train sentence-similarity models
  • Prototype voice assistants in low-resource languages
  • Benchmark embedding models across language pairs
  • Test bilingual interface copy with real example data

❓ Frequently Asked Questions

🧩 How does it work?

Choose a source language, optional keyword, optional target language, and optional tags. Click Start and the Actor pulls matching sentences with translations, audio links, and licence info attached to each record.

📝 How accurate are the translations?

Tatoeba sentences are peer-reviewed by the community. The correctness field reflects that review. Native-speaker contributions are common in major languages, while smaller languages may have a smaller pool of confirmations.

🔁 How often is the corpus refreshed?

The Tatoeba project accepts new sentences and edits continuously. Every Actor run hits the live catalog, so fresh contributions appear in your dataset right away.

🌐 Which languages are supported?

The corpus spans 400+ languages. The input form exposes the 50 most common languages, identified by ISO 639-3 code (English, Spanish, Mandarin, Japanese, Arabic, German, French, and more). Less common languages can still be reached via translation links.

🔊 Does every sentence have audio?

No. Audio is optional and depends on community uploads. The hasAudio flag tells you per record, and audioUrls carries the file links when present.
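
A short Python sketch of how those two fields might be used together to build a download list. The first record uses the example values from the schema table and the second is made up to show a sentence without audio. `audio_manifest` is a hypothetical helper.

```python
def audio_manifest(records):
    """Map sentenceId to its audio URLs, skipping records without audio."""
    return {
        r["sentenceId"]: r["audioUrls"]
        for r in records
        if r.get("hasAudio") and r.get("audioUrls")
    }


records = [
    {"sentenceId": 1276, "hasAudio": True,
     "audioUrls": ["https://tatoeba.org/audio/download/1276"]},
    # Invented record: a sentence nobody has recorded yet.
    {"sentenceId": 2, "hasAudio": False, "audioUrls": []},
]

manifest = audio_manifest(records)
```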

⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to trigger this Actor on any cron interval (daily, weekly, monthly) and keep your downstream corpus in sync.

📜 Is the data openly licensed?

Yes. Tatoeba sentences are published under a Creative Commons CC BY licence. Attribute the corpus and the individual contributor handles where applicable, and you can use the data commercially or non-commercially.

💼 Can I use this data commercially?

Yes. CC BY allows commercial reuse with attribution. Bundle the licence string and contributor names in your downstream product, and you are good to go.
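
As an illustration of bundling the licence string and contributor name, the Python sketch below builds one credit line per record. The wording of the credit line is an assumption, so adapt it to your own attribution policy. `attribution_line` is a hypothetical helper, and the record reuses the example values from the schema table.

```python
def attribution_line(record):
    """Build a one-line credit from a record's text, contributor, URL, and licence."""
    return (
        f'"{record["text"]}" by {record["contributor"]} '
        f'(Tatoeba, {record["url"]}), licensed under {record["license"]}'
    )


record = {
    "text": "Let's try something.",
    "contributor": "CK",
    "url": "https://tatoeba.org/eng/sentences/show/1276",
    "license": "CC BY 2.0 FR",
}

credit = attribution_line(record)
```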

💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan covers testing and small runs (10 records per run). A paid plan lifts the cap and unlocks scheduling, larger datasets, and higher concurrency.

🔁 What happens if a run fails or gets interrupted?

Apify retries transient errors automatically. If a run still fails, inspect the log in the Runs tab, fix the input, and restart. Partial datasets are preserved so you never lose progress.

🆘 What if I need help?

Our support team is here for you. Use the Apify platform messaging or the Tally form linked below.


🔌 Integrate with any app

Tatoeba Sentence Corpus Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe sentence data into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh sentences into your translation memory or alert your team in Slack.
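
On the receiving end, a webhook handler mostly needs to find the dataset that the finished run produced. The Python sketch below assumes the default Apify webhook payload, with an `eventType` field and the run object under `resource`. Check the payload template configured on your own webhook, since it can be customised. The IDs in the sample payload are made up.

```python
import json


def dataset_id_from_webhook(body):
    """Return the run's default dataset ID for a succeeded run, else None."""
    payload = json.loads(body)
    # Ignore failed, aborted, or timed-out runs.
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    return payload.get("resource", {}).get("defaultDatasetId")


# Made-up payload in the assumed default shape.
sample_body = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "eventData": {"actorId": "hypothetical-actor-id", "actorRunId": "hypothetical-run-id"},
    "resource": {"id": "hypothetical-run-id", "defaultDatasetId": "hypothetical-dataset-id"},
})

dataset_id = dataset_id_from_webhook(sample_body)
```

The returned ID can then be used to download the fresh sentences and push them into a translation memory.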


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the Tatoeba Project or its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open corpus data is collected, under the project's Creative Commons licence.