Wikipedia Scraper - Articles, Summaries, Metadata
Pricing
from $2.00 / 1,000 articles scraped
Extract Wikipedia articles including full content, summary, thumbnails, categories, external links, coordinates, and Wikidata IDs. Multi-language support for 12+ languages. Export data, run via API, schedule and monitor runs, or integrate with other tools.
Developer
Alessandro Santamaria
Scrape Wikipedia articles at scale — full content, summaries, images, categories, and Wikidata links.
Build AI training datasets, knowledge graphs, research corpora, or enrich your app with encyclopedic facts. Powered by the official MediaWiki REST API for clean, reliable, respectful data extraction.
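To illustrate what "powered by the MediaWiki REST API" means in practice, here is a minimal sketch (not the actor's internal code) that builds the public summary-endpoint URL for a given article title and language edition:

```python
from urllib.parse import quote

def summary_endpoint(title: str, language: str = "en") -> str:
    """Build the MediaWiki REST API summary URL for an article title.

    Spaces become underscores, and the title is percent-encoded so that
    slashes and non-ASCII characters survive as a single path segment.
    """
    slug = quote(title.replace(" ", "_"), safe="")
    return f"https://{language}.wikipedia.org/api/rest_v1/page/summary/{slug}"

print(summary_endpoint("Albert Einstein"))
# https://en.wikipedia.org/api/rest_v1/page/summary/Albert_Einstein
```

A plain GET on that URL returns the article's one-line description and first-paragraph extract as JSON, with no authentication required.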
Features
- 12+ Languages — English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Japanese, Chinese, Arabic
- Full content extraction — plain text, cleaned HTML, and per-section breakdown with titles and heading levels
- Summaries & descriptions — one-line descriptions and first-paragraph extracts
- Images — thumbnails, main image, and all article images
- Structured metadata — categories, external links, references/citations
- Wikidata linking — every article comes with its Q-ID for entity resolution
- Geo coordinates — lat/lng for places, landmarks, and geographic entities
- Pageviews — 30-day view counts from the Wikimedia pageviews API
- Disambiguation detection — flag ambiguous pages before ingesting
- Search — find articles by keyword, not just by title
- No auth, no anti-bot — uses the public MediaWiki API; no tokens, no captchas
Input
```json
{
  "titles": ["Berlin", "Albert_Einstein", "Machine_learning"],
  "searchQuery": "quantum physics",
  "urls": ["https://en.wikipedia.org/wiki/Quantum_computing"],
  "language": "en",
  "includeFullContent": true,
  "includeImages": true,
  "includeReferences": false,
  "maxSearchResults": 10
}
```
| Field | Type | Description |
|---|---|---|
| titles | array | Direct Wikipedia article titles |
| searchQuery | string | Keyword search (returns top N matches) |
| urls | array | Wikipedia URLs — title is auto-extracted |
| language | enum | Wiki edition: en, de, fr, es, it, pt, nl, pl, ru, ja, zh, ar |
| includeFullContent | bool | Fetch full article body + sections (default true) |
| includeImages | bool | Include all image URLs (default true) |
| includeReferences | bool | Include citations (default false) |
| maxSearchResults | int | Cap on search results (default 10) |
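The `urls` field above auto-extracts the article title. A minimal sketch of that extraction step (an assumption about the mechanics, not the actor's own code) looks like this:

```python
from urllib.parse import urlparse, unquote

def title_from_url(url: str) -> str:
    """Extract the article title from a Wikipedia article URL.

    Takes the path segment after '/wiki/' and percent-decodes it,
    e.g. '.../wiki/Quantum_computing' -> 'Quantum_computing'.
    """
    path = urlparse(url).path
    prefix = "/wiki/"
    if not path.startswith(prefix):
        raise ValueError(f"Not a Wikipedia article URL: {url}")
    return unquote(path[len(prefix):])

print(title_from_url("https://en.wikipedia.org/wiki/Quantum_computing"))
# Quantum_computing
```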
Output Example
Real output for Berlin (English Wikipedia):
```json
{
  "title": "Berlin",
  "url": "https://en.wikipedia.org/wiki/Berlin",
  "language": "en",
  "page_id": 3354,
  "revision_id": 1234567890,
  "extract": "Berlin is the capital and largest city of Germany by both area and population...",
  "description": "Capital and largest city of Germany",
  "content_full": "Berlin is the capital and largest city of Germany...",
  "content_html": "<section>...</section>",
  "thumbnail_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/.../Berlin.jpg",
  "main_image_url": "https://upload.wikimedia.org/wikipedia/commons/.../Berlin.jpg",
  "images": ["https://upload.wikimedia.org/..."],
  "sections": [
    { "title": "History", "level": 2, "text": "The earliest evidence of settlements..." },
    { "title": "Geography", "level": 2, "text": "Berlin is in northeastern Germany..." }
  ],
  "categories": ["Berlin", "Capitals in Europe", "Cities in Germany"],
  "external_links": ["https://www.berlin.de/", "..."],
  "coordinates": { "lat": 52.52, "lng": 13.405 },
  "wikidata_id": "Q64",
  "last_modified": "2026-04-01T12:34:56Z",
  "word_count": 15842,
  "view_count_30d": 1482391,
  "is_disambiguation": false,
  "scraped_at": "2026-04-07T10:00:00Z"
}
```
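For downstream use, each record already carries everything needed for entity resolution. A hedged sketch of a post-processing step (the `entity_ref` helper is hypothetical, not part of the actor) that keeps only the linking fields and skips disambiguation pages:

```python
import json

# Trimmed sample of one output record (see the full example above).
record = json.loads("""
{"title": "Berlin", "wikidata_id": "Q64",
 "coordinates": {"lat": 52.52, "lng": 13.405},
 "is_disambiguation": false}
""")

def entity_ref(rec: dict):
    """Reduce a scraped record to the fields needed for entity linking.

    Returns None for disambiguation pages so they never enter the graph.
    """
    if rec.get("is_disambiguation"):
        return None
    coords = rec.get("coordinates") or {}
    return {
        "label": rec["title"],
        "qid": rec["wikidata_id"],
        "lat": coords.get("lat"),
        "lng": coords.get("lng"),
    }

print(entity_ref(record))
# {'label': 'Berlin', 'qid': 'Q64', 'lat': 52.52, 'lng': 13.405}
```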
Use Cases
- AI/LLM training data — Build high-quality, well-structured datasets for fine-tuning language models. Wikipedia is the gold standard for encyclopedic corpora.
- Knowledge graphs — Link entities in your database to Wikidata Q-IDs. Every article comes with its canonical identifier, coordinates, and categories.
- Academic research — Extract literature review material, cross-reference citations, and build topic-specific corpora across languages.
- Content generation — Enrich articles, product pages, and blog posts with verified encyclopedia facts. Add "Did you know" boxes and related topic links.
- Fact-checking pipelines — Verify claims against Wikipedia extracts and last-modified timestamps. Flag disambiguation pages automatically.
- Travel content — Pull city, landmark, and attraction data with coordinates for travel blogs, booking sites, and map overlays.
- Biographies — Scrape person articles for journalism, CRM enrichment, or historical datasets. Link people to their Wikidata records.
Pricing
Pay-per-event: you only pay for articles you actually extract.
| Event | Price |
|---|---|
| enrichment-start | $0.001 |
| enrichment-result | $0.002 per article |
Example costs:
- 100 articles — ~$0.20
- 1,000 articles — ~$2.00
- 10,000 articles — ~$20.00
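The example costs above follow directly from the two event prices: one start fee per run plus a per-article result fee. A small estimator, assuming one run per batch:

```python
START_FEE = 0.001     # enrichment-start, charged once per run
PER_ARTICLE = 0.002   # enrichment-result, charged per extracted article

def estimated_cost(articles: int, runs: int = 1) -> float:
    """Estimate the pay-per-event cost in USD for a batch of articles."""
    return round(runs * START_FEE + articles * PER_ARTICLE, 3)

print(estimated_cost(1_000))   # 2.001, i.e. ~$2.00
```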
No proxy costs — Wikipedia is a public API.
Issues & Feedback
Found a bug or want a feature? Open an issue.
Related Actors
- HTML to Markdown — Convert scraped HTML into LLM-ready Markdown
- RSS Feed Reader — Bulk parse RSS, Atom and JSON feeds
- Website Content Crawler — Crawl full websites and extract text
- Google Maps Scraper — Business listings, reviews, and geo data