Wikipedia Data Scraper Pro

An automated crawler that extracts textual content and metadata from Wikipedia pages for building knowledge bases.

- **Pricing:** $10.00/month + usage
- **Rating:** 0.0 (0)
- **Developer:** Jamshaid Arif (Maintained by Community)

Actor stats

- Bookmarked: 0
- Total users: 2
- Monthly active users: 1
- Last modified: 2 days ago

Wikipedia Scraper

Extract structured data from Wikipedia at any scale — articles, sections, links, categories, and multilingual translations — without managing infrastructure.


What Does This Actor Do?

Wikipedia Scraper fetches public Wikipedia data through the official MediaWiki API. It supports three modes:

| Mode | Use Case |
| --- | --- |
| 📄 Article Pages | Scrape one or many articles by title |
| 📂 Category Crawl | Collect every article under a category (and its subcategories) |
| 🌐 Translation Comparison | Fetch the same article across multiple language editions |

Every result is pushed to the Apify Dataset as a structured record you can download as JSON, CSV, Excel, or XML.
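Nested fields such as `sections`, `links`, and `translations` must be serialized when a record is written to a flat format like CSV. As an illustration, here is a minimal sketch of flattening a dataset item yourself before writing CSV — the `flatten_item` helper and the sample item are hypothetical, not part of the actor:

```python
import csv
import io
import json

def flatten_item(item):
    """Serialize nested fields to JSON strings so the row fits a flat CSV."""
    flat = dict(item)
    for key in ("sections", "links", "categories", "translations"):
        if key in flat:
            flat[key] = json.dumps(flat[key], ensure_ascii=False)
    return flat

# Hypothetical dataset item with one nested field
item = {
    "title": "Alan Turing",
    "language": "en",
    "links": {"Enigma machine": "https://en.wikipedia.org/wiki/Enigma_machine"},
}

row = flatten_item(item)
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=row.keys())
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

The built-in JSON/CSV/Excel/XML exports on the Apify platform handle this serialization for you; the sketch is only useful when post-processing downloaded items locally.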


Output Fields

Each dataset item contains:

| Field | Type | Description |
| --- | --- | --- |
| `title` | string | Wikipedia article title |
| `language` | string | Language code (`en`, `de`, `fr`, …) |
| `pageId` | integer | Wikipedia internal page ID |
| `url` | string | Full URL to the article |
| `scrapedAt` | ISO date | Timestamp of extraction |
| `summary` | string | Full lead section text |
| `summaryPreview` | string | First 200 characters of summary |
| `sections` | array | Nested section tree (title + text + subsections) |
| `links` | object | Outbound wiki links `{ title → url }` |
| `categories` | object | Article categories `{ name → url }` |
| `translations` | object | Other language editions `{ lang → { title, url } }` |
| `numSections` | integer | Count of top-level sections |
| `numLinks` | integer | Count of outbound links returned |
| `numCategories` | integer | Count of categories returned |
| `numTranslations` | integer | Count of available language editions |
| `status` | string | `ok`, `not_found`, `network_error`, or `error` |

When Translation Comparison mode is used, items also include a comparisonBaseTitle field identifying the base English article.
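Since `comparisonBaseTitle` is shared by every language edition of the same article, it is a natural grouping key when post-processing translation results. A small sketch, using hypothetical items shaped like the fields above:

```python
from collections import defaultdict

def group_by_base(items):
    """Group Translation Comparison items under their base English article."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get("comparisonBaseTitle", item["title"])].append(item)
    return dict(groups)

# Hypothetical items as the actor might emit them in translations mode
items = [
    {"title": "Artificial intelligence", "language": "en",
     "comparisonBaseTitle": "Artificial intelligence"},
    {"title": "Intelligence artificielle", "language": "fr",
     "comparisonBaseTitle": "Artificial intelligence"},
]

groups = group_by_base(items)
print({base: [i["language"] for i in members] for base, members in groups.items()})
# → {'Artificial intelligence': ['en', 'fr']}
```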


Input Configuration

Mode: Article Pages

```json
{
  "mode": "page",
  "topics": ["Python (programming language)", "Alan Turing", "Machine learning"],
  "language": "en",
  "includeSections": true,
  "includeLinks": true,
  "includeCategories": true,
  "includeTranslations": false,
  "includeFullText": false
}
```

Mode: Category Crawl

```json
{
  "mode": "category",
  "categoryTitle": "Category:Machine learning",
  "language": "en",
  "categoryMaxDepth": 1,
  "maxPages": 50,
  "includeSections": true,
  "includeLinks": false
}
```

Mode: Translation Comparison

```json
{
  "mode": "translations",
  "comparisonTitle": "Artificial intelligence",
  "translationLanguages": ["en", "de", "fr", "ja", "ar", "es", "zh", "ru"]
}
```

Usage Examples

Using the Apify API (Python)

```python
import apify_client

client = apify_client.ApifyClient("YOUR_API_TOKEN")

run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "mode": "page",
    "topics": ["Deep learning", "Neural network", "Transformer (machine learning model)"],
    "language": "en",
    "includeSections": True,
    "includeLinks": True,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], "→", item["url"])
    print("  Summary:", item["summaryPreview"])
    print("  Sections:", item["numSections"])
```

Using the Apify API (JavaScript/Node.js)

```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('YOUR_ACTOR_ID').call({
    mode: 'category',
    categoryTitle: 'Category:Physics',
    categoryMaxDepth: 1,
    maxPages: 30,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => console.log(item.title, item.url));
```

Using the Apify CLI

```bash
# Install the CLI
npm install -g apify-cli

# Run locally (requires an .actor/ directory)
apify run --input='{"mode":"page","topics":["Quantum computing"]}'

# Deploy to the Apify platform
apify push
```

Sections Structure Example

When includeSections is true, each article item contains a sections array:

```json
{
  "sections": [
    {
      "level": 1,
      "title": "History",
      "text": "Python was conceived in the late 1980s by Guido van Rossum...",
      "subsections": [
        {
          "level": 2,
          "title": "Early development",
          "text": "Python 0.9.0 was published to alt.sources in February 1991...",
          "subsections": []
        }
      ]
    },
    {
      "level": 1,
      "title": "Design philosophy",
      "text": "Python is a multi-paradigm programming language...",
      "subsections": []
    }
  ]
}
```
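Because `subsections` nests arbitrarily deep, a recursive walk is the simplest way to consume the tree. A minimal sketch (the `flatten_sections` helper is illustrative, not part of the actor):

```python
def flatten_sections(sections, depth=0):
    """Walk the nested section tree depth-first, yielding (depth, title) pairs."""
    for section in sections:
        yield depth, section["title"]
        yield from flatten_sections(section.get("subsections", []), depth + 1)

# The sections array from the example above, abbreviated
sections = [
    {"level": 1, "title": "History", "text": "...", "subsections": [
        {"level": 2, "title": "Early development", "text": "...", "subsections": []},
    ]},
    {"level": 1, "title": "Design philosophy", "text": "...", "subsections": []},
]

for depth, title in flatten_sections(sections):
    print("  " * depth + title)
# History
#   Early development
# Design philosophy
```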

Rate Limiting & Politeness

This actor follows the Wikimedia User-Agent policy:

  • Uses a descriptive User-Agent header identifying itself as Wikipedia Scraper / Apify Actor
  • Introduces a configurable delay (default 0.5 s) between every API call
  • Respects Wikipedia's public API — no login or authentication required
  • Does not scrape HTML; uses the official MediaWiki REST API exclusively

If you encounter rate-limiting errors, increase the Request Delay setting to 1.0–2.0 seconds.
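To illustrate what a polite MediaWiki API call looks like, here is a sketch of assembling the endpoint, a descriptive User-Agent, and an inter-request delay. The exact parameter set the actor sends is not documented here, so the `prop` list and the contact address are assumptions:

```python
import time

# Per the Wikimedia User-Agent policy: identify the tool and a contact (example address)
USER_AGENT = "Wikipedia Scraper / Apify Actor (contact: you@example.com)"

def build_query(title, language="en"):
    """Assemble the endpoint and parameters for a MediaWiki Action API page query."""
    endpoint = f"https://{language}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts|langlinks|categories",  # assumed prop list, for illustration
        "format": "json",
    }
    return endpoint, params

def polite_sleep(request_delay=0.5):
    """Wait between API calls; raise the delay if you hit rate limits."""
    time.sleep(request_delay)

endpoint, params = build_query("Quantum computing")
print(endpoint)  # https://en.wikipedia.org/w/api.php
```

A real request would pass `headers={"User-Agent": USER_AGENT}` along with these parameters and call `polite_sleep()` between pages.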


Performance & Memory

| Input size | Recommended memory |
| --- | --- |
| 1–20 articles | 256 MB |
| 20–100 articles | 512 MB |
| Category crawl (100+ pages) | 1024 MB |

Limitations

  • Wikipedia's API caps some response sizes (links, categories). This actor returns up to 50 links and 50 categories per page.
  • Some Wikipedia editions have incomplete langlinks metadata.
  • Full-text extraction (includeFullText: true) significantly increases dataset size. Enable only when needed.
  • Wikipedia may throttle aggressive requests. Keep requestDelay ≥ 0.5 seconds.
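Because the 50-link/50-category cap silently truncates long lists, it can be useful to flag items that may have hit it when post-processing a dataset. A small illustrative check (the helper is hypothetical):

```python
CAP = 50  # per-page cap on returned links and categories, per the limitation above

def maybe_truncated(item):
    """Flag items whose link or category lists may have hit the actor's cap."""
    return item.get("numLinks", 0) >= CAP or item.get("numCategories", 0) >= CAP

item = {"title": "Physics", "numLinks": 50, "numCategories": 12, "status": "ok"}
print(maybe_truncated(item))  # True
```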

This actor accesses only publicly available Wikipedia content through the official MediaWiki API, in compliance with Wikipedia's Terms of Use and Creative Commons Attribution-ShareAlike 4.0 License.

All extracted content remains subject to Wikipedia's licensing. When republishing Wikipedia content, you must attribute Wikipedia and link to the original article.