📚 Wikipedia Scraper — Structured Articles in 300+ Languages

Clean plain-text + structured sections + metadata from any Wikipedia article or search query. $0.003 per article.

Scrape Wikipedia — the world's largest encyclopedia, covering 60M+ articles across 300+ languages — for full text, section structure, categories, internal links, and images. No wiki markup, no HTML noise, ready to drop into a RAG pipeline, knowledge base, or LLM training corpus.

Built on the official MediaWiki API — no scraping tricks, no blocks, no auth required.

🚀 What does this Actor do?

Wikipedia is the single best open knowledge source on the internet — but the raw dumps are huge, the HTML is messy, and the MediaWiki API takes a dozen roundtrips per article to get what you actually want. This Actor does it in one call and gives you back what a human actually reads:

  • By title — Fetch specific articles you already know (e.g. Artificial intelligence, Python (programming language), Transformer (deep learning architecture)).
  • By search — Run a keyword query against Wikipedia and pull the top N matching articles automatically.
  • Multi-language — Any of the 300+ Wikipedia language editions: en, es, de, fr, ru, ja, zh, it, pt, ar, hi, and more. Perfect for multilingual RAG.
  • Structured sections — Every article comes back as plain text plus a section tree (Introduction / History / Applications / References) so you can chunk sensibly for embeddings.

No wiki markup. No citation junk ([1], [citation needed]). No HTML tags. Just clean text, ready for a vector DB.
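
If you drive the Actor from code, a minimal sketch with the Apify Python client looks roughly like this. The API token and the Actor ID ("username/wikipedia-scraper") are placeholders, not real values; take both from your Apify console and this store page.

from apify_client import ApifyClient

# Placeholder token; use your real API token from the Apify console.
client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "articleTitles": ["Artificial intelligence"],
    "language": "en",
    "includeFullText": True,
    "includeSections": True,
}

# "username/wikipedia-scraper" is a placeholder Actor ID, not the real one.
run = client.actor("username/wikipedia-scraper").call(run_input=run_input)

# Each dataset item is one scraped article (see the output example below).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["wordCount"])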

💡 Use Cases

1. RAG knowledge base — in any language

Seed a retrieval system with Wikipedia articles on your domain. Switch language for localized versions of the same topic.

{
  "articleTitles": ["Large language model", "Retrieval-augmented generation", "Transformer (deep learning architecture)"],
  "language": "en",
  "includeFullText": true,
  "includeSections": true
}
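
To make the chunking step concrete, here is an illustrative helper (not part of the Actor) that turns one scraped article into embedding-ready chunks, one per section. Field names follow the output example later in this README.

# Illustrative helper: one embedding chunk per section of a scraped article.
# Field names (title, url, language, sections) follow the output example below.
def chunk_article(item):
    chunks = []
    for section in item.get("sections", []):
        chunks.append({
            "id": item["title"] + "#" + section["heading"],
            "text": section["heading"] + "\n\n" + section["text"],
            "metadata": {"url": item["url"], "language": item["language"]},
        })
    return chunks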

2. Multilingual fact-checking pipeline

Fetch the same topic in multiple languages to cross-check claims.

{
  "articleTitles": ["Argentina"],
  "language": "es",
  "includeFullText": true,
  "includeCategories": true
}
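
A rough sketch of the fan-out: build one run input per language edition for the same topic, then key the returned summaries by language for side-by-side comparison. The summaries_by_language helper is illustrative, not part of the Actor.

# Illustrative: one run input per language edition of the same topic.
languages = ["en", "es", "de", "fr"]
run_inputs = [
    {
        "articleTitles": ["Argentina"],
        "language": lang,
        "includeFullText": True,
        "includeCategories": True,
    }
    for lang in languages
]

# Group scraped items by language for side-by-side claim checking.
def summaries_by_language(items):
    return {item["language"]: item["summary"] for item in items}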

3. Educational content ingestion

Build a flashcard app, a quiz bot, or a "daily topic" email by searching on a theme and pulling structured content.

{
  "searchQueries": ["ancient Rome", "Renaissance painters", "World War II battles"],
  "maxSearchResults": 20,
  "language": "en",
  "includeSections": true
}

4. LLM pretraining / fine-tuning corpus

Bulk-scrape a topic cluster and turn it into a clean text corpus.

{
  "searchQueries": ["machine learning", "deep learning", "natural language processing", "computer vision"],
  "maxSearchResults": 50,
  "language": "en",
  "includeFullText": true,
  "includeLinks": true
}
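
As a sketch of the final corpus step, assuming you have already downloaded the dataset items, you could serialize each article's fullText to a JSONL file:

import json

# Illustrative: write scraped articles into a JSONL training corpus.
def write_corpus(items, path="wikipedia_corpus.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            record = {"title": item["title"], "url": item["url"], "text": item["fullText"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")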

📊 Output Example

{
  "title": "Large language model",
  "url": "https://en.wikipedia.org/wiki/Large_language_model",
  "language": "en",
  "summary": "A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation...",
  "fullText": "A large language model (LLM) is a language model notable for...",
  "sections": [
    { "heading": "History", "text": "The first LLMs were..." },
    { "heading": "Architecture", "text": "Most modern LLMs use the Transformer..." },
    { "heading": "Training", "text": "LLMs are typically pre-trained on..." }
  ],
  "categories": ["Large language models", "Artificial intelligence", "Natural language processing"],
  "images": ["https://upload.wikimedia.org/.../example.png"],
  "wordCount": 4821,
  "lastModified": "2026-04-18T14:22:01Z"
}

⚙️ Input Parameters

  • articleTitles (array): Exact article titles to scrape (e.g. ["Artificial intelligence", "Python (programming language)"])
  • searchQueries (array): Keywords — returns top N matching articles per query
  • maxSearchResults (int): Articles per search query (default 10, max 50)
  • language (string): Wikipedia language code: en, es, de, fr, ru, ja, zh, it, pt, ar, hi, ko, nl, pl, sv, ... (default en)
  • includeFullText (bool): Full plain-text article (default true)
  • includeSections (bool): Break article into heading + text sections (default true)
  • includeLinks (bool): Extract all internal Wikipedia links (default false)
  • includeImages (bool): Extract image URLs (default false)
  • includeCategories (bool): Extract article categories (default true)
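
Putting the parameters together, a complete run input might look like this (shown as a Python dict for use with the Apify client; in raw JSON the booleans are lowercase). The values are purely illustrative.

# Illustrative run input mirroring the parameter list above.
run_input = {
    "searchQueries": ["machine learning"],
    "maxSearchResults": 10,       # default 10, max 50
    "language": "en",             # any Wikipedia language code
    "includeFullText": True,      # default true
    "includeSections": True,      # default true
    "includeLinks": False,        # default false
    "includeImages": False,       # default false
    "includeCategories": True,    # default true
}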

📤 Output Fields

  • title: Article title (canonical)
  • url: Full Wikipedia URL
  • language: Language code used for this article
  • summary: First-paragraph summary (always returned)
  • fullText: Clean plain-text article body — no wiki markup, no HTML
  • sections[]: { heading, text } pairs — ready for chunked embeddings
  • categories[]: Article categories
  • links[]: Internal Wikipedia links (if includeLinks)
  • images[]: Image URLs (if includeImages)
  • wordCount: Total words in the article
  • lastModified: ISO timestamp of the last edit

💰 Pricing & Performance

  • Pay-per-event: $0.003 per article.
  • Typical cost: $3 for 1000 articles — a whole domain for less than a coffee.
  • Speed: ~60–120 articles/minute depending on article size and options enabled.
  • No rate-limit surprises — uses the official MediaWiki API with proper pacing.
  • No auth required.

🔌 Integrations

  • LangChain / LlamaIndex — Wikipedia loader replacement with better chunking and zero markup cleanup.
  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector, Chroma) — embed sections[] directly; each section is a sensible chunk.
  • Zapier / Make / n8n — "daily topic" newsletter, auto-research, or Slack bot.
  • Neo4j / graph DBs — build a knowledge graph from links[] and categories[] (see the sketch after this list).
  • LLM fine-tuning — bulk-scrape a domain cluster for pretraining data.
  • Airbyte / Fivetran — drop structured JSON into a warehouse for analytics.
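
For the graph case, here is a minimal sketch using networkx (any graph store works the same way) that links each article to the pages and categories it references. It assumes links[] is a list of plain article titles and that includeLinks and includeCategories were enabled in the run input.

import networkx as nx

# Illustrative sketch: knowledge graph from scraped articles.
# Assumes links[] contains plain article titles; requires
# includeLinks: true and includeCategories: true in the run input.
def build_graph(items):
    g = nx.DiGraph()
    for item in items:
        g.add_node(item["title"], kind="article", url=item["url"])
        for linked_title in item.get("links", []):
            g.add_edge(item["title"], linked_title, kind="links_to")
        for category in item.get("categories", []):
            g.add_node(category, kind="category")
            g.add_edge(item["title"], category, kind="in_category")
    return g
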
🌍 Supported Languages

  • en — English (6.8M+ articles)
  • de — German (2.9M+)
  • fr — French (2.6M+)
  • es — Spanish (1.9M+)
  • ru — Russian (1.9M+)
  • ja — Japanese (1.4M+)
  • zh — Chinese (1.4M+)
  • pt, it, ar, pl, nl, sv, uk, vi, ko, hi, fa, id, tr — all supported

Full list: https://en.wikipedia.org/wiki/List_of_Wikipedias

❓ FAQ

Why not just use the MediaWiki API directly? You can — but you'll make 5–10 API calls per article to stitch together title + summary + full text + sections + categories + links, and you'll spend a day cleaning wiki markup. This Actor bundles all of that into one structured JSON per article.

Does it strip wiki markup and citations? Yes. No [[links]], no [1] reference markers, no {{templates}}. Just plain text a human or an LLM can read.

Can I use this for RAG? That's the primary use case. sections[] gives you pre-chunked text by heading — embed each section and you've got retrieval-ready data.

What if an article doesn't exist in the requested language? The Actor skips it and logs a warning. Partial results are always saved.

Do disambiguation pages work? Yes — they return as articles with links to the disambiguated entries. Use includeLinks: true to capture them.

Is Wikipedia content free to use? Article text is CC BY-SA 4.0. Attribute Wikipedia and share derivatives under the same license.

🔑 Keywords

Wikipedia scraper, Wikipedia API, MediaWiki scraper, Wikipedia RAG, Wikipedia knowledge base, Wikipedia full text, multilingual Wikipedia scraper, Wikipedia corpus builder, Wikipedia data extraction, encyclopedia scraper, RAG knowledge base, LLM training corpus, Wikipedia sections, Wikipedia categories, Wikipedia in Spanish, Wikipedia in German, Wikipedia in Russian, Wikipedia in Japanese, Wikipedia in Chinese, fact-checking data, Wikipedia bulk download, Wikipedia structured data.

📝 Changelog

  • v1.0 — Initial release. Title-based and search-based modes, 300+ language support, structured sections, clean plain text (no markup), categories, links, and images.