Wikipedia Scraper
Developer: SolidCode (maintained by Community)
Turn any Wikipedia article into clean, structured data: summaries, full article text, categories, 30-day pageviews, Wikidata descriptions, thumbnails, geo-coordinates, and more, across 300+ language editions. Search by keyword or fetch exact articles you already know by URL or title. Built for researchers, data teams, and content & SEO analysts who need ready-to-use Wikipedia datasets without copy-pasting articles by hand or stitching together the raw API.
Why This Scraper?
- 300+ language editions: the picker surfaces 40 of the largest editions (English, German, French, Spanish, Japanese, Chinese, Arabic, and more); paste any non-English Wikipedia URL and the actor scrapes that edition automatically.
- Two ways in, keyword search or exact fetch: run a keyword search across an edition, or pull specific articles by full URL (`https://en.wikipedia.org/wiki/Alan_Turing`) or bare title ("Alan Turing"). Mix both in a single run.
- 30-day pageview totals on every row: a real popularity signal, not a guess. The trailing ~30-day view count for each article is ideal for ranking topics by actual reader demand.
- Sort by relevance or popularity: order search hits by best textual match, or by most-referenced (incoming-link-weighted) to surface the canonical, authoritative article first.
- Full plain-text article body on demand: toggle `fullText` to capture the entire article as clean plain text (not just the intro), ready for NLP, summarization, or training corpora.
- Wikidata one-line descriptions: the short canonical descriptor (e.g. "English mathematician and computer scientist") pulled straight from Wikidata, perfect for tooltips and entity labels.
- Category taxonomy per article: the full list of categories each article belongs to ("British computer scientists", "1912 births") for topic mapping and classification.
- Geo-coordinates for places & landmarks: latitude/longitude on every article that has a location, so cities, monuments, and venues drop straight onto a map.
- Cross-edition reach signal: `langCount` tells you in how many language editions an article exists, a quick indicator of global notability.
Use Cases
Research & Academia
- Build structured corpora of articles on a topic for literature reviews and citation context
- Compare how the same subject is covered across language editions
- Track article size and last-edited dates to study how topics evolve
- Rank subjects by 30-day readership to find what audiences actually care about
SEO & Content
- Pull authoritative summaries and Wikidata descriptions for entity-rich content
- Identify high-traffic Wikipedia topics worth targeting in articles and FAQs
- Map category taxonomies to plan topic clusters and internal linking
- Surface the most-referenced canonical article for any keyword
Data Enrichment
- Enrich CRM and product records with one-line Wikidata descriptions
- Add geo-coordinates to place names for mapping and location intelligence
- Attach thumbnails and canonical URLs to people, companies, and landmarks
- Resolve ambiguous names to the most popular matching article
Machine Learning & NLP
- Build full-text training datasets with clean plain-text article bodies
- Generate summary/full-text pairs for summarization model fine-tuning
- Create multilingual datasets by pulling the same topics across editions
- Label corpora with category tags and word/byte-size metadata
Market & Competitive Intelligence
- Monitor pageview trends for brands, products, and public figures
- Track which companies and technologies are gaining reader attention
- Benchmark notability across markets using cross-edition coverage counts
Getting Started
Simple Keyword Search
Search one edition and return the best-matching articles:
{
  "searchQueries": ["artificial intelligence"],
  "maxResultsPerSearch": 50
}
Fetch Exact Articles
Pull specific articles by full URL or bare title, including non-English editions:
{
  "articleUrls": [
    "https://en.wikipedia.org/wiki/Alan_Turing",
    "Marie Curie",
    "https://de.wikipedia.org/wiki/Albert_Einstein"
  ],
  "includeCategories": true
}
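As a rough illustration of how a mixed `articleUrls` list can be interpreted, the sketch below splits each entry into an edition code and a title. It is not the actor's own code: `parse_article_ref` is a hypothetical helper, and it assumes article URLs follow the `https://<lang>.wikipedia.org/wiki/<Title>` pattern used in the example above.

```python
from typing import Tuple
from urllib.parse import unquote, urlparse


def parse_article_ref(ref: str, default_language: str = "en") -> Tuple[str, str]:
    """Return (language_code, title) for a full Wikipedia URL or a bare title.

    Hypothetical helper for illustration; the actor's real parsing may differ.
    """
    parsed = urlparse(ref)
    host = parsed.hostname or ""
    if parsed.scheme in ("http", "https") and host.endswith(".wikipedia.org"):
        # The first host label is the edition code, e.g. "de" in de.wikipedia.org.
        language = host.split(".")[0]
        if not parsed.path.startswith("/wiki/"):
            raise ValueError(f"not an article URL: {ref}")
        # Titles are percent-encoded in URLs and use underscores for spaces.
        title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
        return language, title
    # Anything else is treated as a bare title in the default edition.
    return default_language, ref.strip()


refs = [
    "https://en.wikipedia.org/wiki/Alan_Turing",
    "Marie Curie",
    "https://de.wikipedia.org/wiki/Albert_Einstein",
]
parsed_refs = [parse_article_ref(r) for r in refs]
```

Checking entries this way before a run is a cheap way to confirm which edition each article will come from.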
Advanced: Popularity-Sorted Full-Text Dataset
Search several topics, sort by popularity, and capture the full article body:
{
  "searchQueries": ["machine learning", "neural network", "deep learning"],
  "language": "en",
  "maxResultsPerSearch": 200,
  "sortBy": "popularity",
  "fullText": true,
  "includeCategories": true
}
Input Reference
What to Scrape
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQueries` | string[] | [] | Keywords to look up on Wikipedia (e.g. "climate change"). Each query runs its own search and returns the best-matching articles. Leave empty if you only want exact articles. |
| `articleUrls` | string[] | [] | Exact articles to fetch. Paste full Wikipedia URLs (e.g. `https://en.wikipedia.org/wiki/Alan_Turing`) or just a page title (e.g. "Alan Turing"). Full URLs set their own language automatically. |
| `language` | select | English | Which Wikipedia edition to use for searches and bare titles. 40 common editions are listed; full URLs override this. |
Results
| Parameter | Type | Default | Description |
|---|---|---|---|
| `maxResultsPerSearch` | integer | 50 | Maximum articles to return per search query (1 to 500). Articles fetched directly by URL or title are added on top. Recommended 50 to 200 for fast, affordable runs. |
| `sortBy` | select | Relevance (best match) | Order search results. Options: "Relevance (best match)" or "Popularity (most referenced)". |
Content
| Parameter | Type | Default | Description |
|---|---|---|---|
| `fullText` | boolean | false | Return the complete plain-text article body instead of just the intro summary. Richer data, larger dataset. |
| `includeCategories` | boolean | true | Include the list of categories each article belongs to (e.g. "1912 births"). Helpful for classification and topic mapping. |
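If you assemble inputs in code, a small builder can catch out-of-range values before a run starts. The following is a minimal sketch using only the parameter names and limits from the tables above. `build_run_input` is a hypothetical helper, the "at least one source" check is this sketch's own sanity rule, and `"relevance"` as a `sortBy` value is an assumption (only `"popularity"` appears in the examples).

```python
import json
from typing import Optional, Sequence


def build_run_input(
    search_queries: Sequence[str] = (),
    article_urls: Sequence[str] = (),
    language: str = "en",
    max_results_per_search: int = 50,
    sort_by: Optional[str] = None,
    full_text: bool = False,
    include_categories: bool = True,
) -> dict:
    """Assemble a run input dict keyed by the documented parameter names."""
    if not search_queries and not article_urls:
        # Sanity check of this sketch, not a documented rule of the actor.
        raise ValueError("provide at least one search query or article URL/title")
    if not 1 <= max_results_per_search <= 500:
        raise ValueError("maxResultsPerSearch must be between 1 and 500")
    # "popularity" is taken from the examples; "relevance" is an assumed value.
    if sort_by is not None and sort_by not in ("relevance", "popularity"):
        raise ValueError(f"unsupported sortBy value: {sort_by}")

    run_input = {
        "searchQueries": list(search_queries),
        "articleUrls": list(article_urls),
        "language": language,
        "maxResultsPerSearch": max_results_per_search,
        "fullText": full_text,
        "includeCategories": include_categories,
    }
    if sort_by is not None:
        # Omit sortBy entirely when unset so the actor's default applies.
        run_input["sortBy"] = sort_by
    return run_input


example_input = build_run_input(
    search_queries=["machine learning", "neural network"],
    max_results_per_search=200,
    sort_by="popularity",
    full_text=True,
)
print(json.dumps(example_input, indent=2))
```

The resulting dict can be pasted into the actor's JSON input or passed to whichever client you use to start runs.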
Output
Each article is one flat row. Here is a representative result:
{
  "pageId": 1208,
  "title": "Alan Turing",
  "language": "en",
  "url": "https://en.wikipedia.org/wiki/Alan_Turing",
  "summary": "Alan Mathison Turing was an English mathematician, computer scientist, logician, and cryptanalyst...",
  "fullText": "Alan Mathison Turing was an English mathematician... (full plain-text body when fullText is enabled)",
  "wikidataDescription": "English mathematician and computer scientist (1912–1954)",
  "categories": ["British computer scientists", "1912 births", "Alumni of King's College, Cambridge"],
  "thumbnail": "https://upload.wikimedia.org/wikipedia/commons/thumb/.../300px-Alan_Turing.jpg",
  "wordCount": 12840,
  "size": 198342,
  "pageviews": 415203,
  "langCount": 142,
  "coordinates": null,
  "lastEdited": "2026-06-10T08:14:32Z",
  "matchedQuery": "artificial intelligence",
  "scrapedAt": "2026-06-13T14:30:00Z"
}
Core Fields
| Field | Type | Description |
|---|---|---|
| `pageId` | integer | Wikipedia page identifier |
| `title` | string | Article title |
| `language` | string | Language code of the edition this article came from |
| `url` | string | Canonical article URL |
| `matchedQuery` | string or null | The search query that surfaced this article (null for direct URL/title fetches) |
| `scrapedAt` | string | ISO timestamp when the row was produced |
Content
| Field | Type | Description |
|---|---|---|
| `summary` | string | Intro extract as clean plain text |
| `fullText` | string or null | Full plain-text article body; populated only when `fullText` is enabled |
| `wikidataDescription` | string or null | One-line canonical description from Wikidata |
| `categories` | string[] | Categories the article belongs to; populated when `includeCategories` is on |
| `thumbnail` | string or null | Lead image URL (null when the article has no lead image) |
Popularity & Metadata
| Field | Type | Description |
|---|---|---|
| `pageviews` | integer or null | Total reader views over the trailing ~30 days |
| `langCount` | integer or null | Number of language editions this article exists in |
| `wordCount` | integer | Word count of the returned text |
| `size` | integer | Article size in bytes |
| `lastEdited` | string | ISO timestamp of the most recent revision |
Geo
| Field | Type | Description |
|---|---|---|
| `coordinates` | object or null | `{ "lat": …, "lon": … }` for articles with a location; null otherwise |
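Because `coordinates` is either a `lat`/`lon` object or null, rows can be turned into map-ready points with a small filter. The sketch below converts exported rows into a GeoJSON FeatureCollection. `rows_to_geojson` is a hypothetical helper, and the second sample row is made up for illustration rather than taken from a real run.

```python
import json
from typing import Iterable


def rows_to_geojson(rows: Iterable[dict]) -> dict:
    """Build a GeoJSON FeatureCollection from rows that carry coordinates."""
    features = []
    for row in rows:
        coords = row.get("coordinates")
        if not coords:
            # Articles without a location have coordinates set to null.
            continue
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON positions are ordered [longitude, latitude].
                "coordinates": [coords["lon"], coords["lat"]],
            },
            "properties": {
                "title": row["title"],
                "url": row["url"],
                "pageviews": row.get("pageviews"),
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Illustrative rows only; the place row is invented for this example.
sample_rows = [
    {"title": "Alan Turing", "url": "https://en.wikipedia.org/wiki/Alan_Turing",
     "pageviews": 415203, "coordinates": None},
    {"title": "Eiffel Tower", "url": "https://en.wikipedia.org/wiki/Eiffel_Tower",
     "pageviews": None, "coordinates": {"lat": 48.8584, "lon": 2.2945}},
]
feature_collection = rows_to_geojson(sample_rows)
print(json.dumps(feature_collection, indent=2))
```

Note the swapped order: the dataset stores `lat` then `lon`, while GeoJSON expects longitude first.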
Tips for Best Results
- Paste a non-English URL to reach any of the 300+ editions: the language picker lists 40 common editions, but pasting `https://ja.wikipedia.org/wiki/...` or `https://fi.wikipedia.org/wiki/...` into `articleUrls` scrapes that edition directly, no picker needed.
- Leave `fullText` off for fast, cheap summary runs: it pulls the entire article body and grows your dataset substantially. Turn it on only when you need full text for analysis or training.
- Sort by popularity to find the canonical article: for ambiguous keywords, "Popularity (most referenced)" puts the authoritative, most-linked article first, ahead of niche or disambiguation pages.
- Use `pageviews` to rank topics by real demand: it reflects actual 30-day readership, a far stronger signal than search rank for prioritizing content or research.
- Mix search and exact fetch in one run: combine broad `searchQueries` with a hand-picked list of `articleUrls` to cover both discovery and known must-have articles.
- Start with 50 results per query to test: confirm the data fits your needs, then raise `maxResultsPerSearch` (up to 500) for the full pull.
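Ranking by `pageviews` after a run is a one-liner, with one wrinkle: the field can be null. A minimal sketch, assuming rows have been loaded from the exported dataset as dicts; `rank_by_pageviews` is a hypothetical helper and the sample values other than the Alan Turing figure are invented.

```python
from typing import Iterable, List, Optional


def rank_by_pageviews(rows: Iterable[dict], top_n: Optional[int] = None) -> List[dict]:
    """Sort rows by 30-day pageviews, highest first; null pageviews go last."""
    ranked = sorted(
        rows,
        # Treat a missing or null count as -1 so it sorts below a real 0.
        key=lambda row: -1 if row.get("pageviews") is None else row["pageviews"],
        reverse=True,
    )
    # Slicing with None returns the whole list.
    return ranked[:top_n]


# Illustrative rows only.
sample_rows = [
    {"title": "Topic A", "pageviews": 1200},
    {"title": "Topic B", "pageviews": None},
    {"title": "Alan Turing", "pageviews": 415203},
    {"title": "Topic C", "pageviews": 0},
]
top_titles = [row["title"] for row in rank_by_pageviews(sample_rows)]
```

Keeping null rows at the bottom, rather than dropping them, makes it easier to spot articles for which no view count was returned.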
Pricing
From $0.60 per 1,000 results (the Gold rate; $0.72 without a discount), billed flat per result, matching the lowest tier in this category while shipping more fields per article. Bronze, Silver, and Gold subscribers pay progressively less; the table below shows the per-result cost at each discount tier, before the per-run start fee.
| Results | No discount | Bronze | Silver | Gold |
|---|---|---|---|---|
| 100 | $0.072 | $0.068 | $0.064 | $0.060 |
| 1,000 | $0.72 | $0.68 | $0.64 | $0.60 |
| 10,000 | $7.20 | $6.80 | $6.40 | $6.00 |
| 100,000 | $72.00 | $68.00 | $64.00 | $60.00 |
A "result" is any article row in the output dataset. No compute or time-based charges: you pay per result, plus a small fixed per-run start fee.
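To budget a run, the table reduces to a per-1,000 rate for each tier. The sketch below reproduces that arithmetic. `estimate_cost` is a hypothetical helper, and because the start fee amount is not stated here, it is left as a parameter that defaults to 0.

```python
# Per-1,000-result rates in USD, read off the pricing table above.
RATE_PER_1000 = {
    "none": 0.72,
    "bronze": 0.68,
    "silver": 0.64,
    "gold": 0.60,
}


def estimate_cost(results: int, tier: str = "none", start_fee: float = 0.0) -> float:
    """Estimate the cost of one run in USD.

    start_fee is an assumption: the fixed per-run fee is not quantified in
    this document, so pass the real value if you know it.
    """
    if results < 0:
        raise ValueError("results must be non-negative")
    if tier not in RATE_PER_1000:
        raise ValueError(f"unknown tier: {tier}")
    return round(results / 1000 * RATE_PER_1000[tier] + start_fee, 4)


# Three queries at maxResultsPerSearch=200 return at most 600 search results.
worst_case = estimate_cost(3 * 200, tier="none")
```

Since `maxResultsPerSearch` caps each query, multiplying it by the number of queries and adding any directly fetched articles gives an upper bound on billable results.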
Integrations
Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:
- Zapier / Make / n8n: Workflow automation
- Google Sheets: Direct spreadsheet export
- Slack / Email: Notifications on new results
- Webhooks: Trigger custom APIs on run completion
- Apify API: Full programmatic access
Legal & Ethical Use
This actor collects publicly available content from Wikipedia for legitimate research, analysis, and data enrichment. Wikipedia article text is published under the Creative Commons Attribution-ShareAlike (CC BY-SA) license. When you reuse or republish it, provide proper attribution and share derivative text under the same license. Users are responsible for complying with applicable laws and the Wikimedia Foundation's Terms of Use. Be respectful of the volunteer-run platform and use the data responsibly.