Wikipedia Scraper

Pricing

from $0.60 / 1,000 results

[💰 $0.60 / 1K] Search Wikipedia or fetch exact articles by URL or title, and extract clean structured data (summaries, full plain text, categories, 30-day pageviews, thumbnails, coordinates, and language counts) across 300+ language editions.

Developer

SolidCode

Maintained by Community

Last modified: 2 days ago

Turn any Wikipedia article into clean, structured data: summaries, full article text, categories, 30-day pageviews, Wikidata descriptions, thumbnails, geo-coordinates, and more, across 300+ language editions. Search by keyword or fetch exact articles you already know by URL or title. Built for researchers, data teams, and content & SEO analysts who need ready-to-use Wikipedia datasets without copy-pasting articles by hand or stitching together the raw API.

Why This Scraper?

  • 300+ language editions: the picker surfaces 40 of the largest editions (English, German, French, Spanish, Japanese, Chinese, Arabic, and more); paste any non-English Wikipedia URL and the actor scrapes that edition automatically.
  • Two ways in, keyword search or exact fetch: run a keyword search across an edition, or pull specific articles by full URL (https://en.wikipedia.org/wiki/Alan_Turing) or bare title ("Alan Turing"). Mix both in a single run.
  • 30-day pageview totals on every row: the trailing ~30-day view count for each article, a measured popularity signal suited to ranking topics by actual reader demand.
  • Sort by relevance or popularity: order search hits by best textual match, or by most-referenced (incoming-link-weighted) to surface the canonical, authoritative article first.
  • Full plain-text article body on demand: toggle fullText to capture the entire article as clean plain text (not just the intro), ready for NLP, summarization, or training corpora.
  • Wikidata one-line descriptions: the short canonical descriptor (e.g. "English mathematician and computer scientist") pulled straight from Wikidata, suited to tooltips and entity labels.
  • Category taxonomy per article: the full list of categories each article belongs to ("British computer scientists", "1912 births") for topic mapping and classification.
  • Geo-coordinates for places & landmarks: latitude/longitude on every article that has a location, so cities, monuments, and venues drop straight onto a map.
  • Cross-edition reach signal: langCount tells you how many language editions an article exists in, a quick indicator of global notability.

Use Cases

Research & Academia

  • Build structured corpora of articles on a topic for literature reviews and citation context
  • Compare how the same subject is covered across language editions
  • Track article size and last-edited dates to study how topics evolve
  • Rank subjects by 30-day readership to find what audiences actually care about

SEO & Content

  • Pull authoritative summaries and Wikidata descriptions for entity-rich content
  • Identify high-traffic Wikipedia topics worth targeting in articles and FAQs
  • Map category taxonomies to plan topic clusters and internal linking
  • Surface the most-referenced canonical article for any keyword

Data Enrichment

  • Enrich CRM and product records with one-line Wikidata descriptions
  • Add geo-coordinates to place names for mapping and location intelligence
  • Attach thumbnails and canonical URLs to people, companies, and landmarks
  • Resolve ambiguous names to the most popular matching article

Machine Learning & NLP

  • Build full-text training datasets with clean plain-text article bodies
  • Generate summary/full-text pairs for summarization model fine-tuning
  • Create multilingual datasets by pulling the same topics across editions
  • Label corpora with category tags and word/byte-size metadata

Market & Competitive Intelligence

  • Monitor pageview trends for brands, products, and public figures
  • Track which companies and technologies are gaining reader attention
  • Benchmark notability across markets using cross-edition coverage counts

Getting Started

Search one edition and return the best-matching articles:

{
  "searchQueries": ["artificial intelligence"],
  "maxResultsPerSearch": 50
}

Fetch Exact Articles

Pull specific articles by full URL or bare title, including non-English editions:

{
  "articleUrls": [
    "https://en.wikipedia.org/wiki/Alan_Turing",
    "Marie Curie",
    "https://de.wikipedia.org/wiki/Albert_Einstein"
  ],
  "includeCategories": true
}
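
How the actor tells a full URL from a bare title is not documented here. The sketch below is one plausible reading of that behaviour, useful for pre-checking a list before a run. The function name `resolve_article_ref` and the returned keys are hypothetical, not part of the actor.

```python
from urllib.parse import unquote, urlparse


def resolve_article_ref(ref: str, default_language: str = "en") -> dict:
    """Split an articleUrls entry into an edition language code and a page title.

    Assumption: a full URL looks like https://<lang>.wikipedia.org/wiki/<Title>;
    anything else is treated as a bare title in the default edition.
    """
    text = ref.strip()
    parsed = urlparse(text)
    host = parsed.hostname or ""
    is_wiki_url = (
        parsed.scheme in ("http", "https")
        and host.endswith(".wikipedia.org")
        and parsed.path.startswith("/wiki/")
    )
    if is_wiki_url:
        # The first host label is the edition code, e.g. "de" in de.wikipedia.org.
        language = host.split(".")[0]
        title = unquote(parsed.path[len("/wiki/"):]).replace("_", " ")
        return {"language": language, "title": title, "source": "url"}
    return {"language": default_language, "title": text, "source": "title"}


for entry in ["https://de.wikipedia.org/wiki/Albert_Einstein", "Marie Curie"]:
    print(resolve_article_ref(entry))
```

Running this over a hand-picked list before submitting it makes it easy to spot entries that would silently fall back to the default edition.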

Advanced: Popularity-Sorted Full-Text Dataset

Search several topics, sort by popularity, and capture the full article body:

{
  "searchQueries": ["machine learning", "neural network", "deep learning"],
  "language": "en",
  "maxResultsPerSearch": 200,
  "sortBy": "popularity",
  "fullText": true,
  "includeCategories": true
}

Input Reference

What to Scrape

| Parameter | Type | Default | Description |
|---|---|---|---|
| searchQueries | string[] | [] | Keywords to look up on Wikipedia (e.g. "climate change"). Each query runs its own search and returns the best-matching articles. Leave empty if you only want exact articles. |
| articleUrls | string[] | [] | Exact articles to fetch. Paste full Wikipedia URLs (e.g. https://en.wikipedia.org/wiki/Alan_Turing) or just a page title (e.g. "Alan Turing"). Full URLs set their own language automatically. |
| language | select | English | Which Wikipedia edition to use for searches and bare titles. 40 common editions are listed; full URLs override this. |

Results

| Parameter | Type | Default | Description |
|---|---|---|---|
| maxResultsPerSearch | integer | 50 | Maximum articles to return per search query (1 to 500). Articles fetched directly by URL or title are added on top. Recommended 50 to 200 for fast, affordable runs. |
| sortBy | select | Relevance (best match) | Order search results. Options: "Relevance (best match)" or "Popularity (most referenced)". |

Content

| Parameter | Type | Default | Description |
|---|---|---|---|
| fullText | boolean | false | Return the complete plain-text article body instead of just the intro summary. Richer data, larger dataset. |
| includeCategories | boolean | true | Include the list of categories each article belongs to (e.g. "1912 births"). Helpful for classification and topic mapping. |
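
The defaults and limits in the tables above can be checked locally before a run. This is a minimal sketch, not the actor's own validation: `build_input` is a hypothetical helper, the `"en"` and `"popularity"` values are taken from the examples earlier on this page, and `"relevance"` as the stored value for the default sort order is an assumption.

```python
def build_input(search_queries=None, article_urls=None, language="en",
                max_results_per_search=50, sort_by="relevance",
                full_text=False, include_categories=True) -> dict:
    """Assemble an input object using the documented defaults and limits."""
    search_queries = list(search_queries or [])
    article_urls = list(article_urls or [])
    if not search_queries and not article_urls:
        raise ValueError("provide at least one search query or article URL/title")
    # The input reference documents a range of 1 to 500 per search query.
    if not 1 <= max_results_per_search <= 500:
        raise ValueError("maxResultsPerSearch must be between 1 and 500")
    # "relevance" is an assumed value; only "popularity" appears in the examples.
    if sort_by not in ("relevance", "popularity"):
        raise ValueError("sortBy must be 'relevance' or 'popularity'")
    return {
        "searchQueries": search_queries,
        "articleUrls": article_urls,
        "language": language,
        "maxResultsPerSearch": max_results_per_search,
        "sortBy": sort_by,
        "fullText": full_text,
        "includeCategories": include_categories,
    }


print(build_input(search_queries=["climate change"], sort_by="popularity"))
```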

Output

Each article is one flat row. Here is a representative result:

{
  "pageId": 1208,
  "title": "Alan Turing",
  "language": "en",
  "url": "https://en.wikipedia.org/wiki/Alan_Turing",
  "summary": "Alan Mathison Turing was an English mathematician, computer scientist, logician, and cryptanalyst...",
  "fullText": "Alan Mathison Turing was an English mathematician... (full plain-text body when fullText is enabled)",
  "wikidataDescription": "English mathematician and computer scientist (1912–1954)",
  "categories": ["British computer scientists", "1912 births", "Alumni of King's College, Cambridge"],
  "thumbnail": "https://upload.wikimedia.org/wikipedia/commons/thumb/.../300px-Alan_Turing.jpg",
  "wordCount": 12840,
  "size": 198342,
  "pageviews": 415203,
  "langCount": 142,
  "coordinates": null,
  "lastEdited": "2026-06-10T08:14:32Z",
  "matchedQuery": "artificial intelligence",
  "scrapedAt": "2026-06-13T14:30:00Z"
}

Core Fields

| Field | Type | Description |
|---|---|---|
| pageId | integer | Wikipedia page identifier |
| title | string | Article title |
| language | string | Language code of the edition this article came from |
| url | string | Canonical article URL |
| matchedQuery | string or null | The search query that surfaced this article (null for direct URL/title fetches) |
| scrapedAt | string | ISO timestamp when the row was produced |

Content

| Field | Type | Description |
|---|---|---|
| summary | string | Intro extract as clean plain text |
| fullText | string or null | Full plain-text article body, populated only when fullText is enabled |
| wikidataDescription | string or null | One-line canonical description from Wikidata |
| categories | string[] | Categories the article belongs to, populated when includeCategories is on |
| thumbnail | string or null | Lead image URL (null when the article has no lead image) |
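
Because fullText is null unless the fullText option is on, code that builds summary/full-text pairs has to skip rows without it. A minimal sketch follows, with a hypothetical helper name and made-up sample rows:

```python
def summary_pairs(rows):
    """Pair each article's full text with its intro summary.

    Rows without a fullText value (null when the fullText option was off)
    or without a summary are skipped rather than emitted as empty pairs.
    """
    pairs = []
    for row in rows:
        document = row.get("fullText")
        summary = row.get("summary")
        if document and summary:
            pairs.append({"pageId": row["pageId"], "document": document, "summary": summary})
    return pairs


# Illustrative rows only; the text is placeholder content, not real actor output.
sample = [
    {"pageId": 1, "summary": "Short intro.", "fullText": "Short intro. Then the body."},
    {"pageId": 2, "summary": "Another intro.", "fullText": None},
]
print(summary_pairs(sample))
```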

Popularity & Metadata

| Field | Type | Description |
|---|---|---|
| pageviews | integer or null | Total reader views over the trailing ~30 days |
| langCount | integer or null | Number of language editions this article exists in |
| wordCount | integer | Word count of the returned text |
| size | integer | Article size in bytes |
| lastEdited | string | ISO timestamp of the most recent revision |
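
Since pageviews can be null, ranking rows by readership needs an explicit rule for missing values. The sketch below places such rows last instead of dropping them. `rank_by_pageviews` is a hypothetical helper and the sample counts are invented for illustration.

```python
def rank_by_pageviews(rows, top_n=None):
    """Sort rows by 30-day pageviews, highest first; rows with null pageviews go last."""
    ranked = sorted(
        rows,
        # False sorts before True, so rows that have a pageview count come first.
        key=lambda r: (r.get("pageviews") is None, -(r.get("pageviews") or 0)),
    )
    return ranked if top_n is None else ranked[:top_n]


# Illustrative rows only; the pageview numbers are made up.
sample = [
    {"title": "A", "pageviews": 1200},
    {"title": "B", "pageviews": None},
    {"title": "C", "pageviews": 98000},
]
print([row["title"] for row in rank_by_pageviews(sample)])
```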

Geo

| Field | Type | Description |
|---|---|---|
| coordinates | object or null | { "lat": …, "lon": … } for articles with a location; null otherwise |
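
To put located articles on a map, the coordinates object can be converted to GeoJSON, which orders each position as longitude then latitude. A minimal sketch, assuming rows shaped like the output above; `to_geojson` is a hypothetical helper and the sample coordinates are illustrative.

```python
import json


def to_geojson(rows):
    """Build a GeoJSON FeatureCollection from rows that carry coordinates."""
    features = []
    for row in rows:
        coords = row.get("coordinates")
        if not coords:
            continue  # coordinates is null for articles without a location
        features.append({
            "type": "Feature",
            # GeoJSON positions are [longitude, latitude], in that order.
            "geometry": {"type": "Point", "coordinates": [coords["lon"], coords["lat"]]},
            "properties": {"title": row.get("title"), "url": row.get("url")},
        })
    return {"type": "FeatureCollection", "features": features}


# Illustrative rows only; the coordinate values are approximate sample data.
sample = [
    {"title": "Eiffel Tower", "url": "https://en.wikipedia.org/wiki/Eiffel_Tower",
     "coordinates": {"lat": 48.8584, "lon": 2.2945}},
    {"title": "Alan Turing", "url": "https://en.wikipedia.org/wiki/Alan_Turing",
     "coordinates": None},
]
print(json.dumps(to_geojson(sample), indent=2))
```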

Tips for Best Results

  • Paste a non-English URL to reach any of the 300+ editions: the language picker lists 40 common editions, but pasting https://ja.wikipedia.org/wiki/... or https://fi.wikipedia.org/wiki/... into articleUrls scrapes that edition directly, no picker needed.
  • Leave fullText off for fast, cheap summary runs: it pulls the entire article body and grows your dataset substantially. Turn it on only when you need full text for analysis or training.
  • Sort by popularity to find the canonical article: for ambiguous keywords, "Popularity (most referenced)" puts the authoritative, most-linked article first, ahead of niche or disambiguation pages.
  • Use pageviews to rank topics by real demand: it reflects actual 30-day readership, a far stronger signal than search rank for prioritizing content or research.
  • Mix search and exact fetch in one run: combine broad searchQueries with a hand-picked list of articleUrls to cover both discovery and known must-have articles.
  • Start with 50 results per query to test: confirm the data fits your needs, then raise maxResultsPerSearch (up to 500) for the full pull.

Pricing

From $0.60 per 1,000 results (the Gold-tier rate; $0.72 without a discount). Pricing is pay-per-result, matching the lowest tier in this category while shipping more fields per article. Bronze, Silver, and Gold subscribers pay progressively less; the table below shows the total per-result cost at each discount tier.

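| Results | No discount | Bronze | Silver | Gold |
|---|---|---|---|---|
| 100 | $0.072 | $0.068 | $0.064 | $0.060 |
| 1,000 | $0.72 | $0.68 | $0.64 | $0.60 |
| 10,000 | $7.20 | $6.80 | $6.40 | $6.00 |
| 100,000 | $72.00 | $68.00 | $64.00 | $60.00 |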

A "result" is any article row in the output dataset. No compute or time-based charges: you pay per result, plus a small fixed per-run start fee.
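
The per-result part of a bill follows directly from the rates in the table. The sketch below reproduces that arithmetic; it leaves out the per-run start fee because its amount is not stated here. `estimate_cost` and the tier keys are hypothetical names.

```python
from decimal import Decimal

# Price per 1,000 results at each tier, copied from the pricing table.
RATE_PER_1000 = {
    "none": Decimal("0.72"),
    "bronze": Decimal("0.68"),
    "silver": Decimal("0.64"),
    "gold": Decimal("0.60"),
}


def estimate_cost(results: int, tier: str = "none") -> Decimal:
    """Per-result cost in USD for a run; the per-run start fee is NOT included."""
    if results < 0:
        raise ValueError("results must be non-negative")
    if tier not in RATE_PER_1000:
        raise ValueError(f"unknown tier: {tier}")
    return RATE_PER_1000[tier] * results / 1000


for tier in RATE_PER_1000:
    print(tier, estimate_cost(10_000, tier))
```

Decimal is used instead of float so the figures match the table exactly rather than picking up binary rounding noise.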

Integrations

Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:

  • Zapier / Make / n8n: workflow automation
  • Google Sheets: direct spreadsheet export
  • Slack / Email: notifications on new results
  • Webhooks: trigger custom APIs on run completion
  • Apify API: full programmatic access

This actor collects publicly available content from Wikipedia for legitimate research, analysis, and data enrichment. Wikipedia article text is published under the Creative Commons Attribution-ShareAlike (CC BY-SA) license; when you reuse or republish it, provide proper attribution and share derivative text under the same license. Users are responsible for complying with applicable laws and the Wikimedia Foundation's Terms of Use. Be respectful of the volunteer-run platform and use the data responsibly.