Wikidata Lexemes Scraper

Pricing

from $10.00 / 1,000 result items

Search and extract Wikidata Lexemes (L-namespace). Returns lemma, language QID, lexical category, senses, glosses, statements, and optional inflected forms for each lexeme. Distinct from Q-entities.

Developer

ParseForge

Maintained by Community

🧬 Wikidata Lexemes Scraper

🚀 Export structured lexicographic data in seconds. Pull lemmas, lexical categories, grammatical forms, and senses from the Wikidata Lexeme namespace across hundreds of languages. No API key, no registration, no SPARQL skills required.

🕒 Last updated: 2026-05-22 · 📊 14 fields per record · 🧬 1.4M+ lexemes · 🌐 1,500+ languages · 🔤 30+ lexical categories

The Wikidata Lexemes Scraper queries the L-namespace on wikidata.org and returns 14 fields per record, including lexeme ID, lemma, language QID, lexical category QID, every documented sense, every inflected form, statement metadata, and a link back to the canonical Wikidata Lexeme page. The L-namespace is a structured, machine-readable companion to Wiktionary and powers downstream dictionaries, linguistic research, and language documentation.

The dataset covers more than 1.4 million lexemes spanning over 1,500 languages, from major world languages down to documented endangered and historical languages. This Actor turns the namespace into downloadable CSV, Excel, JSON, or XML in under five minutes. Lemma search, language filter, and lexical category filter all run from the same input form.

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Linguists, NLP engineers, dictionary builders, language-documentation teams, computational lexicographers, knowledge-graph engineers | Multilingual lemma dictionaries, training data for morphology models, inflection tables for language apps, structured glosses for translation pipelines |

📋 What the Wikidata Lexemes Scraper does

Three lookup workflows in a single run:

  • 🔍 Lemma search. Query the L-namespace by any lemma string in any UI language.
  • 🌐 Language filter. Restrict results to a single language QID such as Q1860 English or Q150 French.
  • 🔤 Lexical category filter. Limit to a part of speech via QID, for example Q1084 noun or Q24905 verb.

Each record includes the lexeme ID, lemma, lemma language code, language QID, lexical category QID, a short description, every documented sense with multilingual glosses, every inflected form with grammatical feature QIDs, statement count and properties, last-modified timestamp, the canonical Lexeme URL, and the scrape timestamp.

💡 Why it matters: structured lexicographic data powers morphological analyzers, inflection tables, multilingual search, and machine translation. Building your own pipeline means writing SPARQL queries, handling pagination across the L-namespace, and joining sense and form data by hand. This Actor skips all of that and refreshes on every run.
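
For a sense of what the hand-written route involves, here is a minimal sketch of a query builder for the public Wikidata Query Service. The function name `build_lexeme_query` is hypothetical, and the sketch illustrates the do-it-yourself alternative, not how this Actor works internally. It selects only lexeme IDs and lemmas; senses and forms would need further joins and pagination.

```python
import re

QID_PATTERN = re.compile(r"^Q[1-9]\d*$")


def build_lexeme_query(language_qid: str, category_qid: str, limit: int = 50) -> str:
    """Return a SPARQL query listing lexemes for one language and lexical category."""
    for qid in (language_qid, category_qid):
        if not QID_PATTERN.match(qid):
            raise ValueError(f"not a valid QID: {qid!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    # The wd:, wikibase:, dct:, and ontolex: prefixes are predefined
    # on query.wikidata.org, so the query does not declare them.
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        "  ?lexeme a ontolex:LexicalEntry ;\n"
        f"          dct:language wd:{language_qid} ;\n"
        f"          wikibase:lexicalCategory wd:{category_qid} ;\n"
        "          wikibase:lemma ?lemma .\n"
        "}\n"
        f"LIMIT {limit}"
    )


# English (Q1860) verbs (Q24905), as in the filters described above.
print(build_lexeme_query("Q1860", "Q24905", limit=10))
```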


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.


⚙️ Input

| Input | Type | Default | Behavior |
|---|---|---|---|
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| searchQuery | string | "run" | Lemma string to search for in the Lexeme namespace. |
| searchLanguage | string | "en" | UI language code used for the search index. |
| lexicalCategoryQid | string | "" | Restrict to a part-of-speech QID. Q1084 noun, Q24905 verb, Q34698 adjective. |
| languageQid | string | "" | Restrict to a single language QID. Q1860 English, Q150 French, Q1321 Spanish. |
| includeForms | boolean | false | Embed the inflected forms list on each lexeme record. |

Example: up to 50 English verbs matching "run".

```json
{
  "maxItems": 50,
  "searchQuery": "run",
  "searchLanguage": "en",
  "lexicalCategoryQid": "Q24905",
  "languageQid": "Q1860",
  "includeForms": true
}
```

Example: up to 25 French nouns matching "maison".

```json
{
  "maxItems": 25,
  "searchQuery": "maison",
  "searchLanguage": "fr",
  "lexicalCategoryQid": "Q1084",
  "languageQid": "Q150"
}
```
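
If you assemble inputs in code, a small helper can catch malformed QIDs before a run is started. This is a sketch under assumptions: `build_input` is a hypothetical name, the defaults mirror the input fields documented above, and the checks run locally. They are not the Actor's own validation.

```python
import re
from typing import Any

QID_PATTERN = re.compile(r"^Q[1-9]\d*$")
MAX_ITEMS_CEILING = 1_000_000  # paid-plan cap quoted in the input description


def build_input(
    search_query: str,
    search_language: str = "en",
    lexical_category_qid: str = "",
    language_qid: str = "",
    max_items: int = 10,
    include_forms: bool = False,
) -> dict[str, Any]:
    """Check values locally and return a dict shaped like the Actor input."""
    if not 1 <= max_items <= MAX_ITEMS_CEILING:
        raise ValueError(f"maxItems must be between 1 and {MAX_ITEMS_CEILING}")
    for name, qid in (
        ("lexicalCategoryQid", lexical_category_qid),
        ("languageQid", language_qid),
    ):
        # An empty string means "no filter"; anything else must look like a QID.
        if qid and not QID_PATTERN.match(qid):
            raise ValueError(f"{name} is not a valid QID: {qid!r}")
    return {
        "maxItems": max_items,
        "searchQuery": search_query,
        "searchLanguage": search_language,
        "lexicalCategoryQid": lexical_category_qid,
        "languageQid": language_qid,
        "includeForms": include_forms,
    }


# Reproduces the French noun example, with includeForms left at its default.
french_nouns = build_input("maison", "fr", "Q1084", "Q150", max_items=25)
print(french_nouns)
```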

⚠️ Good to Know: Wikidata coverage of the L-namespace is uneven across languages. English, French, German, and Russian are well populated. Endangered and minor languages may have just a handful of lexemes. Always sanity-check counts before downstream use.
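
One way to do that sanity check is to tally records per language QID once a run has finished. The helper below is a hypothetical sketch, and the three records are made-up placeholders carrying only the `languageQid` field.

```python
from collections import Counter
from typing import Any, Iterable


def count_by_language(records: Iterable[dict[str, Any]]) -> Counter:
    """Tally records per languageQid; records without one count as 'unknown'."""
    return Counter(record.get("languageQid") or "unknown" for record in records)


# Placeholder records for illustration only, not real scraper output.
records = [
    {"lexemeId": "L1", "languageQid": "Q1860"},
    {"lexemeId": "L2", "languageQid": "Q1860"},
    {"lexemeId": "L3", "languageQid": "Q150"},
]
counts = count_by_language(records)
for qid, total in counts.most_common():
    print(qid, total)
```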


📊 Output

Each lexeme record contains 14 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🆔 lexemeId | string | "L1234" |
| 🔤 lemma | string | "run" |
| 🌐 lemmaLanguage | string | "en" |
| 🏷️ languageQid | string | "Q1860" |
| 🔢 lexicalCategoryQid | string | "Q24905" |
| 📝 description | string or null | "English verb" |
| 🔢 senseCount | number | 3 |
| 📖 senses | object[] | [{"id":"L1234-S1","glosses":{"en":"to move quickly..."}}] |
| 🔢 formCount | number | 5 |
| 🔁 forms | object[] | [{"id":"L1234-F1","representations":{"en":"running"},"grammaticalFeatures":["Q1230649"]}] |
| 🧾 statementCount | number | 12 |
| 🏷️ statementProperties | string[] | ["P5402","P5238","P5187"] |
| 🕒 lastModified | ISO 8601 | "2025-11-04T08:22:17Z" |
| 🔗 lexemeUrl | string | "https://www.wikidata.org/wiki/Lexeme:L1234" |
| 🕒 scrapedAt | ISO 8601 | "2026-05-22T10:00:00.000Z" |
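
The nested `senses` and `forms` arrays usually need flattening before they are useful downstream. The sketch below assumes records shaped like the schema examples; the helper names `glosses` and `inflection_table` are hypothetical, and the sample record is trimmed from the example values in the schema.

```python
from typing import Any


def glosses(record: dict[str, Any], language: str = "en") -> list[str]:
    """Collect the gloss of every sense that has one in the given language."""
    return [
        sense["glosses"][language]
        for sense in record.get("senses") or []
        if language in sense.get("glosses", {})
    ]


def inflection_table(
    record: dict[str, Any], language: str = "en"
) -> list[tuple[str, tuple[str, ...]]]:
    """Pair each form's written representation with its grammatical feature QIDs."""
    rows = []
    # "forms" may be missing or empty when includeForms is false.
    for form in record.get("forms") or []:
        text = form.get("representations", {}).get(language)
        if text is not None:
            rows.append((text, tuple(form.get("grammaticalFeatures", []))))
    return rows


# Sample record assembled from the schema's example values.
sample = {
    "lexemeId": "L1234",
    "lemma": "run",
    "senses": [{"id": "L1234-S1", "glosses": {"en": "to move quickly..."}}],
    "forms": [
        {
            "id": "L1234-F1",
            "representations": {"en": "running"},
            "grammaticalFeatures": ["Q1230649"],
        }
    ],
}
print(glosses(sample))           # ['to move quickly...']
print(inflection_table(sample))  # [('running', ('Q1230649',))]
```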


✨ Why choose this Actor

| Capability |
|---|
| 🧬 Structured lexicography. Lemma, language QID, lexical category, senses, forms, and statements per record. |
| 🌐 Multilingual. Covers 1,500+ languages, with cross-language glosses on every sense. |
| 🔁 Inflection-ready. Forms array carries grammatical feature QIDs for plurals, tenses, declensions, and more. |
| Fast. 10 lexemes in under 5 seconds, 1,000 lexemes in under three minutes. |
| 🏷️ Knowledge-graph native. Every QID joins directly with Q-entities and Wikidata properties. |
| 🔁 Always fresh. Each run hits the live Wikidata L-namespace, so the dataset reflects current contributions. |
| 🚫 No authentication. Works against the public Wikidata index. No login or API key needed. |

📊 Structured lexicographic data is the foundation of morphological analyzers, translation pipelines, and language documentation projects.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Setup |
|---|---|---|---|---|
| ⭐ Wikidata Lexemes Scraper (this Actor) | $5 free credit, then pay-per-use | 1.4M+ lexemes, 1,500+ languages | Live per run | ⚡ 2 min |
| Hand-written SPARQL on the public endpoint | Free + engineering | Full | Build it yourself | 🛠️ Days |
| Wikidata JSON dumps | Free | Full, stale by weeks | Monthly | 🐢 Hours |
| Commercial lexicographic APIs | $99+/month | Curated subset | Daily | ⏳ Hours |

Pick this Actor when you want structured lexicographic records without writing SPARQL or hosting your own SPARQL engine.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Wikidata Lexemes Scraper page on the Apify Store.
  3. 🎯 Set input. Enter a lemma search, pick a language QID and lexical category QID, and set maxItems.
  4. 🚀 Run it. Click Start and let the Actor collect your data.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

🤖 NLP & Translation

  • Morphological analyzers from forms arrays
  • Lemma dictionaries for tokenizers
  • Training data for translation alignment
  • Cross-language gloss lookups

📱 Language Apps

  • Verb conjugation tables from forms
  • Plural and gender data for nouns
  • Lemma autocomplete with QID joins
  • Multilingual definition popovers

📚 Lexicographers

  • Coverage gap analyses by language
  • Comparative entries across senses
  • Reference corpus for academic papers
  • Source dataset for derived dictionaries

🧠 Knowledge-Graph Engineers

  • Lexeme-to-entity links for entity linking
  • Property-rich nodes for semantic search
  • Wiki-anchored multilingual glossaries
  • Structured nodes for AI knowledge stores

🔌 Automating Wikidata Lexemes Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Weekly or monthly pulls keep downstream lexicographic stores in sync automatically.
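
A minimal sketch of a programmatic run with the `apify-client` Python package follows. `ACTOR_ID` is a placeholder, not the real identifier: copy the actual ID from the Actor's Store page. The sketch also assumes the API token is available in an `APIFY_TOKEN` environment variable and that the client returns the run as a dict, as in the package's documented examples.

```python
import os
from typing import Any

# Placeholder, not the real identifier: copy the Actor ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"

RUN_INPUT: dict[str, Any] = {
    "maxItems": 50,
    "searchQuery": "run",
    "searchLanguage": "en",
    "lexicalCategoryQid": "Q24905",
    "languageQid": "Q1860",
    "includeForms": True,
}


def fetch_lexemes(token: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily so this module loads even where apify-client is not installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    token = os.environ.get("APIFY_TOKEN")
    if token:
        for record in fetch_lexemes(token, RUN_INPUT):
            print(record.get("lexemeId"), record.get("lemma"))
    else:
        print("Set APIFY_TOKEN to run this sketch against your account.")
```

The same function can be called from a cron job or an orchestration task if you prefer to schedule outside the Apify platform.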


🌟 Beyond business use cases

Lexicographic data powers more than commercial workflows. The same structured records support research, education, civic projects, and language documentation.

🎓 Research and academia

  • Computational linguistics dissertations
  • Cross-language morphology studies
  • Reproducible datasets cited in papers
  • Open-data exercises on language coverage

🎨 Personal and creative

  • Indie language-learning side projects
  • Conjugation game prototypes
  • Writer reference tools and glossaries
  • Hobbyist lexicographic databases

🤝 Non-profit and civic

  • Endangered-language documentation
  • Community translation projects
  • Educational glossaries for schools
  • Civic literacy programs

🧪 Experimentation

  • Train morphological generators
  • Prototype agents that inflect words
  • Build glossary chrome extensions
  • Test language-app UX with real data

❓ Frequently Asked Questions

🧩 How does it work?

Enter a lemma search, pick optional language and lexical category QIDs, click Start, and the Actor queries the Wikidata L-namespace and emits a clean structured record per matching lexeme. No browser automation, no SPARQL, no captchas.

🆔 What are lexemes and how are they different from Q-entities?

Wikidata has three main entity types: Items (Q prefix) for things and concepts, Properties (P prefix) for predicates, and Lexemes (L prefix) for words. Unlike Items, each Lexeme carries a lemma, a language, a lexical category, senses, and forms.
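
The prefixes make it easy to tell entity types apart in code. The sketch below uses a hypothetical helper, `entity_kind`, and also recognizes the sense and form ID shapes that appear in this Actor's output examples (`L1234-S1`, `L1234-F1`).

```python
import re

# Checked in order; sense and form IDs extend the ID of their parent lexeme.
_PATTERNS = {
    "item": re.compile(r"^Q[1-9]\d*$"),
    "property": re.compile(r"^P[1-9]\d*$"),
    "lexeme": re.compile(r"^L[1-9]\d*$"),
    "sense": re.compile(r"^L[1-9]\d*-S[1-9]\d*$"),
    "form": re.compile(r"^L[1-9]\d*-F[1-9]\d*$"),
}


def entity_kind(entity_id: str) -> str:
    """Classify a Wikidata ID by its prefix and shape."""
    for kind, pattern in _PATTERNS.items():
        if pattern.match(entity_id):
            return kind
    raise ValueError(f"unrecognized Wikidata ID: {entity_id!r}")


for example in ("Q1860", "P5402", "L1234", "L1234-S1", "L1234-F1"):
    print(example, entity_kind(example))
```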

🔁 How often is the dataset refreshed?

Wikidata contributors edit lexemes continuously. Every run of this Actor pulls live data, so the dataset reflects current contributions as of run time.

🌐 Which languages are supported?

Over 1,500 languages have at least one lexeme. English, French, German, Russian, and several other major languages are well populated. Coverage of endangered and minor languages varies.

⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to run this Actor on any cron interval (daily, weekly, monthly) and keep a downstream lexicographic store in sync.

⚖️ What license covers the data?

Wikidata's structured data, including lexemes, is released under the Creative Commons CC0 public-domain dedication. Attribution is not required, though citing Wikidata is encouraged.

💼 Can I use this data commercially?

Yes. CC0 permits commercial use with no restrictions. You are responsible for downstream compliance with any specific use-case requirements.

💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan is enough for testing and small searches (10 records per run). A paid plan lifts the limit and gives you access to scheduling, higher concurrency, and larger searches.

🔁 What happens if a search returns no matches?

A diagnostic record is pushed with an error field. Try a broader lemma string or remove the language/lexical-category filter.
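
Downstream code should separate such diagnostic records from real lexemes before processing. The sketch below assumes the diagnostic field is a top-level key named `error`; check an actual empty run to confirm the exact key. The helper name and the sample message text are hypothetical.

```python
from typing import Any


def split_results(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (lexeme records, diagnostic records) based on an 'error' key."""
    lexemes = [item for item in items if "error" not in item]
    diagnostics = [item for item in items if "error" in item]
    return lexemes, diagnostics


# Illustrative items: one lexeme-shaped record and one assumed diagnostic record.
items = [
    {"lexemeId": "L1234", "lemma": "run"},
    {"error": "no lexemes matched the query"},  # hypothetical message text
]
lexemes, diagnostics = split_results(items)
print(len(lexemes), len(diagnostics))  # 1 1
```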

🔤 Does it include phonology or audio?

This Actor returns lemmas, senses, forms, and statement metadata. For audio pronunciation or IPA, reach out via the contact form below to request a companion pronunciation scraper.

🆘 What if I need help?

Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.


🔌 Integrate with any app

Wikidata Lexemes Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe lexeme data into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh lexeme records into your morphological analyzer, or alert your team in Slack.
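
On the receiving end, a webhook handler mostly needs the finished run's dataset ID. The sketch below assumes Apify's default webhook payload template, with top-level `eventType` and `resource` fields; a custom payload template would need different parsing. The helper name and the IDs in the sample payload are made up.

```python
from typing import Any, Optional


def dataset_id_from_webhook(payload: dict[str, Any]) -> Optional[str]:
    """Return the default dataset ID for a succeeded run, otherwise None."""
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")


# Trimmed sample payload with placeholder IDs, for illustration only.
payload = {
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "eventData": {"actorId": "actor-id-placeholder", "actorRunId": "run-id-placeholder"},
    "resource": {"id": "run-id-placeholder", "defaultDatasetId": "dataset-id-placeholder"},
}
print(dataset_id_from_webhook(payload))  # dataset-id-placeholder
```

From there, the dataset ID can be passed to the Apify API or client to download the fresh records.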


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by Wikidata, the Wikimedia Foundation, or any of its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open lexicographic data is collected.