Wikipedia Scraper
Pricing: Pay per event
Search and extract Wikipedia articles — titles, summaries, URLs, word counts, and thumbnail images. Uses the free MediaWiki API.
Developer: Stas Persiianenko
Extract Wikipedia articles by keyword search. Get titles, full summaries, URLs, word counts, thumbnails, and last edit dates from any of Wikipedia's 300+ language editions.
What does Wikipedia Scraper do?
Wikipedia Scraper searches Wikipedia using the official MediaWiki API and extracts structured data from matching articles. For each search keyword, it returns article metadata including the introductory extract (summary), word count, page size, thumbnail image, and direct URL.
The scraper uses Wikipedia's built-in search API, so results match what you'd find searching on Wikipedia itself — ranked by relevance with support for all Wikipedia languages.
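The scraper's internal requests aren't published, but a standard MediaWiki search call can be sketched as follows. The endpoint and query parameters (`action=query`, `list=search`) are standard MediaWiki API; the helper name is ours:

```python
from urllib.parse import urlencode

def build_search_url(query: str, language: str = "en", limit: int = 50) -> str:
    """Build a MediaWiki search API URL of the kind the scraper presumably issues."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,   # the search keyword
        "srlimit": limit,    # max results per request
        "format": "json",
    }
    return f"https://{language}.wikipedia.org/w/api.php?{urlencode(params)}"

url = build_search_url("artificial intelligence", language="en", limit=20)
```

Because this is Wikipedia's own search endpoint, relevance ranking matches on-site search results.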
Who is it for?
- 🎓 Academic researchers — extracting structured knowledge from Wikipedia articles at scale
- 🤖 NLP engineers — building training datasets from Wikipedia text and metadata
- 📊 Data analysts — collecting factual data and statistics from Wikipedia pages
- 💻 App developers — enriching applications with Wikipedia content and summaries
- 📝 Content creators — gathering reference material and structured facts for writing
Why scrape Wikipedia?
Wikipedia is the world's largest free encyclopedia with over 60 million articles across 300+ languages. It's a primary source for:
- Knowledge base construction — build reference datasets for AI training, chatbots, or research databases
- LLM and RAG pipelines — feed clean, structured article text into retrieval-augmented generation systems, fine-tuning datasets, or AI agent knowledge bases
- Content enrichment — add Wikipedia summaries to product catalogs, educational platforms, or content management systems
- Research and analysis — analyze article coverage, word counts, and edit patterns across topics
- Multilingual data — gather information in any language Wikipedia supports
- SEO and content strategy — understand topic coverage and find content gaps
How much does it cost to scrape Wikipedia?
Wikipedia Scraper uses pay-per-event pricing:
| Event | Price |
|---|---|
| Run started | $0.001 |
| Article extracted | $0.001 per article |
Example costs:
- 10 articles on "machine learning": ~$0.011
- 100 articles on "history": ~$0.101
- 500 articles across 5 keywords (one run): ~$0.501
Platform costs are minimal — a typical run uses under $0.001 in compute. Wikipedia's API is fast and does not require proxies.
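The example costs follow from the two event prices: one Run started charge per run plus one Article extracted charge per article. A hypothetical helper (names are ours) to estimate a run's cost:

```python
# Event prices from the pricing table above
RUN_STARTED_USD = 0.001
PER_ARTICLE_USD = 0.001

def estimate_cost(total_articles: int, runs: int = 1) -> float:
    """Estimated event cost: one run-start fee per run plus a per-article fee."""
    return round(runs * RUN_STARTED_USD + total_articles * PER_ARTICLE_USD, 6)

print(estimate_cost(10))   # → 0.011
print(estimate_cost(100))  # → 0.101
print(estimate_cost(500))  # → 0.501  (500 articles in a single run)
```

Batching keywords into a single run, as suggested in the tips below, avoids paying the run-start fee multiple times.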
Input parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| searchQueries | string[] | Keywords to search on Wikipedia. Each keyword runs a separate search. | Required |
| language | string | Wikipedia language code (e.g., en, de, fr, es, ja, zh) | "en" |
| maxResultsPerSearch | integer | Maximum articles per keyword (1–500) | 50 |
Input example
```json
{
  "searchQueries": ["artificial intelligence", "quantum computing"],
  "language": "en",
  "maxResultsPerSearch": 20
}
```
Output example
Each article is returned as a JSON object:
```json
{
  "pageId": 1164,
  "title": "Artificial intelligence",
  "extract": "Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making...",
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "wordCount": 26473,
  "size": 266568,
  "lastEdited": "2026-03-02T11:28:15Z",
  "thumbnail": "https://upload.wikimedia.org/wikipedia/commons/thumb/...",
  "scrapedAt": "2026-03-03T04:08:23.785Z"
}
```
Output fields
| Field | Type | Description |
|---|---|---|
| pageId | number | Wikipedia internal page identifier |
| title | string | Article title |
| extract | string | Introductory summary (plain text, no HTML) |
| url | string | Direct link to the Wikipedia article |
| wordCount | number | Total word count of the article |
| size | number | Article size in bytes |
| lastEdited | string | ISO timestamp of the last edit |
| thumbnail | string | URL to article thumbnail image (if available) |
| scrapedAt | string | ISO timestamp when the data was extracted |
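The lastEdited and scrapedAt fields are ISO 8601 strings and parse directly with Python's standard library. A small sketch (the sample items are hypothetical) that sorts results by recency:

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    """Parse the scraper's ISO 8601 timestamps (trailing 'Z' means UTC)."""
    # datetime.fromisoformat accepts 'Z' directly on Python 3.11+;
    # the replace() keeps this working on older versions too.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

# Hypothetical items shaped like the output fields above
items = [
    {"title": "A", "lastEdited": "2026-03-02T11:28:15Z"},
    {"title": "B", "lastEdited": "2025-01-10T09:00:00Z"},
]
newest_first = sorted(items, key=lambda i: parse_ts(i["lastEdited"]), reverse=True)
```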
Supported languages
Wikipedia Scraper supports all 300+ Wikipedia language editions. Use the standard language code:
| Code | Language | Articles |
|---|---|---|
| en | English | 6.9M+ |
| de | German | 2.9M+ |
| fr | French | 2.6M+ |
| es | Spanish | 2.0M+ |
| ru | Russian | 1.9M+ |
| it | Italian | 1.8M+ |
| ja | Japanese | 1.4M+ |
| zh | Chinese | 1.4M+ |
| ar | Arabic | 1.2M+ |
| pt | Portuguese | 1.1M+ |
Any valid Wikipedia language code works — see the full list.
How to scrape Wikipedia articles
- Open Wikipedia Scraper on Apify.
- Enter one or more search keywords in the searchQueries field.
- Set the language code (e.g., en, de, fr) for the Wikipedia edition you want.
- Adjust maxResultsPerSearch to control how many articles per keyword (default: 50).
- Click Start and wait for the scrape to finish.
- Download articles as JSON, CSV, or Excel from the Dataset tab.
Using the Apify API
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("automation-lab/wikipedia-scraper").call(
    run_input={
        "searchQueries": ["climate change", "renewable energy"],
        "language": "en",
        "maxResultsPerSearch": 20,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['title']} — {item['wordCount']} words")
    print(f"  {item['url']}")
    print(f"  {item['extract'][:200]}...")
```
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/wikipedia-scraper').call({
    searchQueries: ['climate change', 'renewable energy'],
    language: 'en',
    maxResultsPerSearch: 20,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`${item.title} — ${item.wordCount} words`);
    console.log(`  ${item.url}`);
});
```
REST API
```shell
curl -X POST "https://api.apify.com/v2/acts/automation-lab/wikipedia-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "searchQueries": ["artificial intelligence"],
    "language": "en",
    "maxResultsPerSearch": 10
  }'
```
Integrations
Connect Wikipedia Scraper to hundreds of apps using built-in integrations:
- Google Sheets — export article data to spreadsheets
- Slack / Microsoft Teams — get notifications when scraping completes
- Zapier / Make — trigger workflows with scraped Wikipedia data
- Amazon S3 / Google Cloud Storage — store large datasets in cloud storage
- Webhook — send results to your own API endpoint
Tips and best practices
- Use specific keywords — more specific searches return more relevant results. "Quantum entanglement" is better than "quantum".
- Batch keywords efficiently — combine related keywords in one run to save on startup costs.
- Language parameter — set the language code to search non-English Wikipedias. Results, summaries, and URLs will all be in the selected language.
- Word count filtering — use the wordCount field to filter out stub articles (typically < 500 words).
- Rate limits — Wikipedia's API is generous but has rate limits. The scraper handles pagination and batching automatically.
- Extracts are summaries — the extract field contains only the article's introduction, not the full text. For full articles, follow the url link.
- Max 500 results per keyword — this is a Wikipedia API limit. For broader coverage, use multiple related keywords.
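The word-count tip is a one-line filter over the dataset items. A minimal sketch, with hypothetical sample data:

```python
# Hypothetical dataset items; only the wordCount field matters here
articles = [
    {"title": "Artificial intelligence", "wordCount": 26473},
    {"title": "Obscure stub", "wordCount": 120},
]

MIN_WORDS = 500  # stub threshold suggested in the tip above
substantive = [a for a in articles if a["wordCount"] >= MIN_WORDS]
```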
Legality
Scraping publicly available data is generally legal according to the US Court of Appeals ruling (HiQ Labs v. LinkedIn). This actor only accesses publicly available information and does not require authentication. Always review and comply with the target website's Terms of Service before scraping. For personal data, ensure compliance with GDPR, CCPA, and other applicable privacy regulations.
FAQ
Q: Does this scraper get the full article text?
A: The extract field contains the article's introductory section in plain text. For complete article content, you can use the url to access the full page.
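One way to go beyond the introduction without scraping HTML is Wikipedia's standard TextExtracts API (prop=extracts with explaintext). This is our suggestion, not part of the actor; the helper name is ours, and note that TextExtracts may truncate very long pages:

```python
from urllib.parse import urlencode

def full_text_url(title: str, language: str = "en") -> str:
    """Build a MediaWiki TextExtracts URL for an article's plain-text body."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,   # strip HTML, return plain text
        "titles": title,
        "format": "json",
    }
    return f"https://{language}.wikipedia.org/w/api.php?{urlencode(params)}"

url = full_text_url("Artificial intelligence")
```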
Q: How fast is it?
A: Very fast. Wikipedia's API is highly optimized. A typical run extracting 50 articles completes in under 5 seconds.
Q: Does it need proxies?
A: No. Wikipedia's API is open and does not block automated requests. The scraper identifies itself with a proper User-Agent header.
Q: Can I search in multiple languages at once?
A: Each run uses one language. To search multiple languages, run the scraper once per language.
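The once-per-language pattern can be scripted by generating one run input per edition. A minimal sketch (the query and language list are examples):

```python
# Build one actor run input per Wikipedia language edition
query = "machine learning"
languages = ["en", "de", "fr"]

run_inputs = [
    {"searchQueries": [query], "language": lang, "maxResultsPerSearch": 10}
    for lang in languages
]
# Each input would then be passed to a separate actor .call(...)
```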
Use with Claude AI (MCP)
This actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.
Setup for Claude Code
```shell
claude mcp add --transport http apify "https://mcp.apify.com"
```
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
```json
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com"
    }
  }
}
```
Example prompts
- "Search Wikipedia for articles about quantum computing and give me the summaries"
- "Fetch Wikipedia articles on these 5 historical events and compare their word counts"
- "Look up Wikipedia articles on machine learning in both English and German and extract the introductions"
Learn more in the Apify MCP documentation.
Troubleshooting
The extract is truncated or too short.
The extract field contains only the article's introductory section, not the full text. This is by design to keep responses fast and costs low. Use the url field to access the complete article.
I'm getting irrelevant results for my search query.
Wikipedia's search API ranks by relevance, which may include loosely related articles. Use more specific keywords (e.g., "quantum entanglement" instead of "quantum") and reduce maxResultsPerSearch to get only the top matches.
Other research and news scrapers on Apify
- ArXiv Scraper -- search and extract academic papers from ArXiv
- CrossRef Scraper -- extract scholarly article metadata from CrossRef
- OpenAlex Scraper -- search and extract academic research data from OpenAlex
