📚 Wikipedia Scraper — Structured Articles in 300+ Languages

Clean plain-text + structured sections + metadata from any Wikipedia article or search query. $0.003 per article.

Scrape Wikipedia — the world's largest encyclopedia, covering 60M+ articles across 300+ languages — for full text, section structure, categories, internal links, and images. No wiki markup, no HTML noise, ready to drop into a RAG pipeline, knowledge base, or LLM training corpus.

Built on the official MediaWiki API — no scraping tricks, no blocks, no auth required.

🚀 What does this Actor do?

Wikipedia is the single best open knowledge source on the internet — but the raw dumps are huge, the HTML is messy, and the MediaWiki API takes a dozen roundtrips per article to get what you actually want. This Actor does it in one call and gives you back what a human actually reads:

  • By title — Fetch specific articles you already know (e.g. Artificial intelligence, Python (programming language), Transformer (deep learning architecture)).
  • By search — Run a keyword query against Wikipedia and pull the top N matching articles automatically.
  • Multi-language — Any of the 300+ Wikipedia language editions: en, es, de, fr, ru, ja, zh, it, pt, ar, hi, and more. Perfect for multilingual RAG.
  • Structured sections — Every article comes back as plain text plus a section tree (Introduction / History / Applications / References) so you can chunk sensibly for embeddings.

No wiki markup. No citation junk ([1], [citation needed]). No HTML tags. Just clean text, ready for a vector DB.
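
If you drive the Actor from code, a minimal sketch with the Apify Python client looks roughly like this. The API token and the Actor ID ("username/wikipedia-scraper") are placeholders, not real values; take both from your Apify console and this store page.

from apify_client import ApifyClient

# Placeholder token; use your real API token from the Apify console.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "articleTitles": ["Artificial intelligence"],
    "language": "en",
    "includeFullText": True,
    "includeSections": True,
}

# "username/wikipedia-scraper" is a placeholder Actor ID, not the real one.
run = client.actor("username/wikipedia-scraper").call(run_input=run_input)

# Each dataset item is one scraped article (see the output example below).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["wordCount"])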

💡 Use Cases

1. RAG knowledge base — in any language

Seed a retrieval system with Wikipedia articles on your domain. Switch language for localized versions of the same topic.

{
  "articleTitles": ["Large language model", "Retrieval-augmented generation", "Transformer (deep learning architecture)"],
  "language": "en",
  "includeFullText": true,
  "includeSections": true
}
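
To make the chunking step concrete, here is an illustrative helper (not part of the Actor) that turns one scraped article into embedding-ready chunks, one per section. Field names follow the output example later in this README.

# Illustrative helper: one embedding chunk per section of a scraped article.
# Field names (title, url, language, sections) follow the output example below.
def chunk_article(item):
    chunks = []
    for section in item.get("sections", []):
        chunks.append({
            "id": item["title"] + "#" + section["heading"],
            "text": section["heading"] + "\n\n" + section["text"],
            "metadata": {"url": item["url"], "language": item["language"]},
        })
    return chunks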

2. Multilingual fact-checking pipeline

Fetch the same topic in multiple languages to cross-check claims.

{
  "articleTitles": ["Argentina"],
  "language": "es",
  "includeFullText": true,
  "includeCategories": true
}
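
A rough sketch of the fan-out: build one run input per language edition for the same topic, then key the returned summaries by language for side-by-side comparison. The summaries_by_language helper is illustrative, not part of the Actor.

# Illustrative: one run input per language edition of the same topic.
languages = ["en", "es", "de", "fr"]
run_inputs = [
    {
        "articleTitles": ["Argentina"],
        "language": lang,
        "includeFullText": True,
        "includeCategories": True,
    }
    for lang in languages
]

# Group scraped items by language for side-by-side claim checking.
def summaries_by_language(items):
    return {item["language"]: item["summary"] for item in items}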

3. Educational content ingestion

Build a flashcard app, a quiz bot, or a "daily topic" email by searching on a theme and pulling structured content.

{
  "searchQueries": ["ancient Rome", "Renaissance painters", "World War II battles"],
  "maxSearchResults": 20,
  "language": "en",
  "includeSections": true
}

4. LLM pretraining / fine-tuning corpus

Bulk-scrape a topic cluster and turn it into a clean text corpus.

{
  "searchQueries": ["machine learning", "deep learning", "natural language processing", "computer vision"],
  "maxSearchResults": 50,
  "language": "en",
  "includeFullText": true,
  "includeLinks": true
}
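
As a sketch of the final corpus step, assuming you have already downloaded the dataset items, you could serialize each article's fullText to a JSONL file:

import json

# Illustrative: write scraped articles into a JSONL training corpus.
def write_corpus(items, path="wikipedia_corpus.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            record = {"title": item["title"], "url": item["url"], "text": item["fullText"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")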

📊 Output Example

{
  "title": "Large language model",
  "url": "https://en.wikipedia.org/wiki/Large_language_model",
  "language": "en",
  "summary": "A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation...",
  "fullText": "A large language model (LLM) is a language model notable for...",
  "sections": [
    { "heading": "History", "text": "The first LLMs were..." },
    { "heading": "Architecture", "text": "Most modern LLMs use the Transformer..." },
    { "heading": "Training", "text": "LLMs are typically pre-trained on..." }
  ],
  "categories": ["Large language models", "Artificial intelligence", "Natural language processing"],
  "images": ["https://upload.wikimedia.org/.../example.png"],
  "wordCount": 4821,
  "lastModified": "2026-04-18T14:22:01Z"
}

⚙️ Input Parameters

  • articleTitles (array): Exact article titles to scrape (e.g. ["Artificial intelligence", "Python (programming language)"])
  • searchQueries (array): Keywords — returns top N matching articles per query
  • maxSearchResults (int): Articles per search query (default 10, max 50)
  • language (string): Wikipedia language code: en, es, de, fr, ru, ja, zh, it, pt, ar, hi, ko, nl, pl, sv, ... (default en)
  • includeFullText (bool): Full plain-text article (default true)
  • includeSections (bool): Break article into heading + text sections (default true)
  • includeLinks (bool): Extract all internal Wikipedia links (default false)
  • includeImages (bool): Extract image URLs (default false)
  • includeCategories (bool): Extract article categories (default true)
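
Putting the parameters together, a complete run input might look like this (shown as a Python dict for use with the Apify client; in raw JSON the booleans are lowercase). The values are purely illustrative.

# Illustrative run input mirroring the parameter list above.
run_input = {
    "searchQueries": ["machine learning"],
    "maxSearchResults": 10,       # default 10, max 50
    "language": "en",             # any Wikipedia language code
    "includeFullText": True,      # default true
    "includeSections": True,      # default true
    "includeLinks": False,        # default false
    "includeImages": False,       # default false
    "includeCategories": True,    # default true
}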

📤 Output Fields

  • title: Article title (canonical)
  • url: Full Wikipedia URL
  • language: Language code used for this article
  • summary: First-paragraph summary (always returned)
  • fullText: Clean plain-text article body — no wiki markup, no HTML
  • sections[]: { heading, text } pairs — ready for chunked embeddings
  • categories[]: Article categories
  • links[]: Internal Wikipedia links (if includeLinks)
  • images[]: Image URLs (if includeImages)
  • wordCount: Total words in the article
  • lastModified: ISO timestamp of the last edit

💰 Pricing & Performance

  • Pay-per-event: $0.003 per article.
  • Typical cost: $3 for 1000 articles — a whole domain for less than a coffee.
  • Speed: ~60–120 articles/minute depending on article size and options enabled.
  • No rate-limit surprises — uses the official MediaWiki API with proper pacing.
  • No auth required.

🔌 Integrations

  • LangChain / LlamaIndex — Wikipedia loader replacement with better chunking and zero markup cleanup.
  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector, Chroma) — embed sections[] directly; each section is a sensible chunk.
  • Zapier / Make / n8n — "daily topic" newsletter, auto-research, or Slack bot.
  • Neo4j / graph DBs — build a knowledge graph from links[] and categories[] (see the sketch after this list).
  • LLM fine-tuning — bulk-scrape a domain cluster for pretraining data.
  • Airbyte / Fivetran — drop structured JSON into a warehouse for analytics.
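
For the graph case, here is a minimal sketch using networkx (any graph store works the same way) that links each article to the pages and categories it references. It assumes links[] is a list of plain article titles and that includeLinks and includeCategories were enabled in the run input.

import networkx as nx

# Illustrative sketch: knowledge graph from scraped articles.
# Assumes links[] contains plain article titles; requires
# includeLinks: true and includeCategories: true in the run input.
def build_graph(items):
    g = nx.DiGraph()
    for item in items:
        g.add_node(item["title"], kind="article", url=item["url"])
        for linked_title in item.get("links", []):
            g.add_edge(item["title"], linked_title, kind="links_to")
        for category in item.get("categories", []):
            g.add_node(category, kind="category")
            g.add_edge(item["title"], category, kind="in_category")
    return g
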
🌍 Supported Languages

  • en — English (6.8M+ articles)
  • de — German (2.9M+)
  • fr — French (2.6M+)
  • es — Spanish (1.9M+)
  • ru — Russian (1.9M+)
  • ja — Japanese (1.4M+)
  • zh — Chinese (1.4M+)
  • pt, it, ar, pl, nl, sv, uk, vi, ko, hi, fa, id, tr — all supported

Full list: https://en.wikipedia.org/wiki/List_of_Wikipedias

❓ FAQ

Why not just use the MediaWiki API directly? You can — but you'll make 5–10 API calls per article to stitch together title + summary + full text + sections + categories + links, and you'll spend a day cleaning wiki markup. This Actor bundles all of that into one structured JSON per article.

Does it strip wiki markup and citations? Yes. No [[links]], no [1] reference markers, no {{templates}}. Just plain text a human or an LLM can read.

Can I use this for RAG? That's the primary use case. sections[] gives you pre-chunked text by heading — embed each section and you've got retrieval-ready data.

What if an article doesn't exist in the requested language? The Actor skips it and logs a warning. Partial results are always saved.

Do disambiguation pages work? Yes — they return as articles with links to the disambiguated entries. Use includeLinks: true to capture them.

Is Wikipedia content free to use? Article text is CC BY-SA 4.0. Attribute Wikipedia and share derivatives under the same license.

🔑 Keywords

Wikipedia scraper, Wikipedia API, MediaWiki scraper, Wikipedia RAG, Wikipedia knowledge base, Wikipedia full text, multilingual Wikipedia scraper, Wikipedia corpus builder, Wikipedia data extraction, encyclopedia scraper, RAG knowledge base, LLM training corpus, Wikipedia sections, Wikipedia categories, Wikipedia in Spanish, Wikipedia in German, Wikipedia in Russian, Wikipedia in Japanese, Wikipedia in Chinese, fact-checking data, Wikipedia bulk download, Wikipedia structured data.

📝 Changelog

  • v1.0 — Initial release. Title-based and search-based modes, 300+ language support, structured sections, clean plain text (no markup), categories, links, and images.