Wikipedia Data Scraper Pro
Pricing
$10.00/month + usage
An automated crawler that extracts textual content and metadata from Wikipedia pages for building knowledge bases.
Developer

Jamshaid Arif
Wikipedia Scraper
Extract structured data from Wikipedia at any scale — articles, sections, links, categories, and multilingual translations — without managing infrastructure.
What Does This Actor Do?
Wikipedia Scraper fetches public Wikipedia data through the official MediaWiki API. It supports three modes:
| Mode | Use Case |
|---|---|
| 📄 Article Pages | Scrape one or many articles by title |
| 📂 Category Crawl | Collect every article under a category (and its subcategories) |
| 🌐 Translation Comparison | Fetch the same article across multiple language editions |
Every result is pushed to the Apify Dataset as a structured record you can download as JSON, CSV, Excel, or XML.
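For context, the data behind these modes comes from the MediaWiki Action API. As a rough illustration of the kind of request involved (not the actor's actual internals), a query URL for an article's plain-text lead section can be built like this:

```python
from urllib.parse import urlencode

def build_extract_url(title: str, language: str = "en") -> str:
    """Build a MediaWiki Action API URL fetching an article's
    lead-section plain-text extract (illustrative sketch only)."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,       # lead section only
        "explaintext": 1,   # plain text instead of HTML
        "titles": title,
    }
    return f"https://{language}.wikipedia.org/w/api.php?" + urlencode(params)

print(build_extract_url("Alan Turing"))
```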
Output Fields
Each dataset item contains:
| Field | Type | Description |
|---|---|---|
| title | string | Wikipedia article title |
| language | string | Language code (en, de, fr, …) |
| pageId | integer | Wikipedia internal page ID |
| url | string | Full URL to the article |
| scrapedAt | ISO date | Timestamp of extraction |
| summary | string | Full lead section text |
| summaryPreview | string | First 200 characters of summary |
| sections | array | Nested section tree (title + text + subsections) |
| links | object | Outbound wiki links { title → url } |
| categories | object | Article categories { name → url } |
| translations | object | Other language editions { lang → { title, url } } |
| numSections | integer | Count of top-level sections |
| numLinks | integer | Count of outbound links returned |
| numCategories | integer | Count of categories returned |
| numTranslations | integer | Count of available language editions |
| status | string | ok, not_found, network_error, or error |
When Translation Comparison mode is used, items also include a `comparisonBaseTitle` field identifying the base English article.
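For orientation, a single dataset item might look like the following (all values are illustrative, not real output):

```json
{
  "title": "Alan Turing",
  "language": "en",
  "pageId": 1234,
  "url": "https://en.wikipedia.org/wiki/Alan_Turing",
  "scrapedAt": "2025-01-15T12:34:56.000Z",
  "summary": "Alan Mathison Turing was an English mathematician...",
  "summaryPreview": "Alan Mathison Turing was an English mathematician...",
  "sections": [],
  "links": { "Computer science": "https://en.wikipedia.org/wiki/Computer_science" },
  "categories": { "Category:1912 births": "https://en.wikipedia.org/wiki/Category:1912_births" },
  "translations": { "de": { "title": "Alan Turing", "url": "https://de.wikipedia.org/wiki/Alan_Turing" } },
  "numSections": 0,
  "numLinks": 1,
  "numCategories": 1,
  "numTranslations": 1,
  "status": "ok"
}
```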
Input Configuration
Mode: Article Pages
```json
{
  "mode": "page",
  "topics": ["Python (programming language)", "Alan Turing", "Machine learning"],
  "language": "en",
  "includeSections": true,
  "includeLinks": true,
  "includeCategories": true,
  "includeTranslations": false,
  "includeFullText": false
}
```
Mode: Category Crawl
```json
{
  "mode": "category",
  "categoryTitle": "Category:Machine learning",
  "language": "en",
  "categoryMaxDepth": 1,
  "maxPages": 50,
  "includeSections": true,
  "includeLinks": false
}
```
Mode: Translation Comparison
```json
{
  "mode": "translations",
  "comparisonTitle": "Artificial intelligence",
  "translationLanguages": ["en", "de", "fr", "ja", "ar", "es", "zh", "ru"]
}
```
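Once a translation-comparison run finishes, the per-language items can be grouped client-side for side-by-side analysis. A minimal sketch, assuming the output fields documented above (the `items` list stands in for a downloaded dataset):

```python
from collections import defaultdict

def group_by_base(items):
    """Group translation-comparison dataset items by their base English article."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.get("comparisonBaseTitle", item["title"])].append(item)
    return dict(grouped)

# Hypothetical items shaped like the actor's documented output
items = [
    {"title": "Artificial intelligence", "language": "en",
     "comparisonBaseTitle": "Artificial intelligence"},
    {"title": "Künstliche Intelligenz", "language": "de",
     "comparisonBaseTitle": "Artificial intelligence"},
]
for base, group in group_by_base(items).items():
    print(base, "→", [i["language"] for i in group])
```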
Usage Examples
Using the Apify API (Python)
```python
import apify_client

client = apify_client.ApifyClient("YOUR_API_TOKEN")
run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "mode": "page",
    "topics": ["Deep learning", "Neural network", "Transformer (machine learning model)"],
    "language": "en",
    "includeSections": True,
    "includeLinks": True,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "→", item["url"])
    print("  Summary:", item["summaryPreview"])
    print("  Sections:", item["numSections"])
```
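Although the platform can export datasets to CSV directly, items are sometimes post-processed locally. A small sketch using only the standard library, assuming the flat fields documented above (`fields` is an arbitrary selection for illustration):

```python
import csv

def items_to_csv(items, path, fields=("title", "language", "url", "numSections")):
    """Write selected flat fields of dataset items to a CSV file,
    silently ignoring nested fields like sections or links."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields), extrasaction="ignore")
        writer.writeheader()
        for item in items:
            writer.writerow({k: item.get(k, "") for k in fields})
```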
Using the Apify API (JavaScript/Node.js)
```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('YOUR_ACTOR_ID').call({
    mode: 'category',
    categoryTitle: 'Category:Physics',
    categoryMaxDepth: 1,
    maxPages: 30,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => console.log(item.title, item.url));
```
Using the Apify CLI
```bash
# Install the CLI
npm install -g apify-cli

# Run locally (requires an .actor/ directory)
apify run --input='{"mode":"page","topics":["Quantum computing"]}'

# Deploy to the Apify platform
apify push
```
Sections Structure Example
When `includeSections` is true, each article item contains a `sections` array:
```json
{
  "sections": [
    {
      "level": 1,
      "title": "History",
      "text": "Python was conceived in the late 1980s by Guido van Rossum...",
      "subsections": [
        {
          "level": 2,
          "title": "Early development",
          "text": "Python 0.9.0 was published to alt.sources in February 1991...",
          "subsections": []
        }
      ]
    },
    {
      "level": 1,
      "title": "Design philosophy",
      "text": "Python is a multi-paradigm programming language...",
      "subsections": []
    }
  ]
}
```
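For indexing or full-text search, the nested tree is often flattened into breadcrumb-titled chunks. A small recursive helper, assuming the structure shown above:

```python
def flatten_sections(sections, parent=""):
    """Depth-first flatten of the nested sections tree into
    (breadcrumb_title, text) pairs, e.g. ("History > Early development", ...)."""
    flat = []
    for sec in sections:
        path = f"{parent} > {sec['title']}" if parent else sec["title"]
        flat.append((path, sec.get("text", "")))
        flat.extend(flatten_sections(sec.get("subsections", []), path))
    return flat
```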
Rate Limiting & Politeness
This actor follows the Wikimedia User-Agent policy:
- Uses a descriptive `User-Agent` header identifying itself as `Wikipedia Scraper / Apify Actor`
- Introduces a configurable delay (default 0.5 s) between every API call
- Respects Wikipedia's public API — no login or authentication required
- Does not scrape HTML; uses the official MediaWiki REST API exclusively
If you encounter rate-limiting errors, increase the Request Delay setting to 1.0–2.0 seconds.
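Where a fixed delay is not enough, an exponential backoff between retries is a common remedy. A sketch of such a delay schedule (illustrative only, not an actor setting):

```python
def backoff_delays(base=0.5, factor=2.0, max_delay=8.0, attempts=5):
    """Exponential backoff schedule in seconds, capped at max_delay.
    Starting at the actor's default 0.5 s delay and doubling each retry."""
    return [min(base * factor ** i, max_delay) for i in range(attempts)]

print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```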
Performance & Memory
| Input size | Recommended memory |
|---|---|
| 1–20 articles | 256 MB |
| 20–100 articles | 512 MB |
| Category crawl (100+ pages) | 1024 MB |
Limitations
- Wikipedia's API caps some response sizes (links, categories). This actor returns up to 50 links and 50 categories per page.
- Some Wikipedia editions have incomplete `langlinks` metadata.
- Full-text extraction (`includeFullText: true`) significantly increases dataset size. Enable it only when needed.
- Wikipedia may throttle aggressive requests. Keep `requestDelay` ≥ 0.5 seconds.
Legal & Attribution
This actor accesses only publicly available Wikipedia content through the official MediaWiki API, in compliance with Wikipedia's Terms of Use and Creative Commons Attribution-ShareAlike 4.0 License.
All extracted content remains subject to Wikipedia's licensing. When republishing Wikipedia content, you must attribute Wikipedia and link to the original article.