Wikipedia Scraper - Articles, Search & Recent Changes

Pricing

from $0.10 / 1,000 results


Scrape Wikipedia articles by title, run keyword searches, pull recent changes, or extract entire categories — across any of 300+ language editions. Returns clean text, summaries, references, links, and metadata. Built for AI/LLM training datasets, NLP research, and knowledge-graph building.


Rating: 0.0 (0 reviews)

Developer: NIJ KANANI

Maintained by Community

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 4 days ago


📚 Wikipedia Scraper

Scrape Wikipedia articles, search results, recent edits, and categories — across all 300+ language editions. Returns clean plain-text content, summaries, references, and rich metadata.

🎯 Built for AI/LLM training datasets, NLP research, knowledge-graph construction, journalism, and education.

Screenshots: sample dataset output · input form · run log (clean success)


✨ What you can do

  • 📄 Fetch articles by title — clean plain-text body, summary, sections, references
  • 🔎 Search — full-text search across an entire language edition
  • 📡 Recent changes — live feed of edits (title, user, comment, revid)
  • 📁 Pull entire categories — all members of Category:Machine_learning, etc.
  • 🌐 Any language — en, es, fr, de, ja, zh, hi, ar, etc.
  • 📦 Rich output: links (internal+external), categories, sections, last-modified

🚀 Quick start

```json
{
  "mode": "articles",
  "language": "en",
  "titles": ["Artificial intelligence", "Large language model"],
  "includeContent": true,
  "includeReferences": false
}
```
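If you run the Actor programmatically, this input can be POSTed to Apify's run-sync-get-dataset-items endpoint, which starts a run and returns its dataset items in one call. A minimal stdlib sketch; the actor ID below is a placeholder, not the real one:

```python
import json
from urllib.parse import quote, urlencode

# Placeholder actor ID; copy the real one from the Apify Store page.
ACTOR_ID = "username~wikipedia-scraper"

run_input = {
    "mode": "articles",
    "language": "en",
    "titles": ["Artificial intelligence", "Large language model"],
    "includeContent": True,
    "includeReferences": False,
}

def run_sync_url(actor_id: str, token: str) -> str:
    """Apify endpoint that runs an Actor and returns its dataset items directly.
    POST the JSON-encoded run_input as the request body."""
    return (
        f"https://api.apify.com/v2/acts/{quote(actor_id)}"
        f"/run-sync-get-dataset-items?{urlencode({'token': token})}"
    )

payload = json.dumps(run_input)  # request body, Content-Type: application/json
```

For long runs, the asynchronous runs endpoint plus polling is the safer pattern; the synchronous call above is convenient for small, targeted scrapes.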

📥 Input

| Field | Used in mode | Description |
| --- | --- | --- |
| mode | all | articles / search / recentchanges / category |
| language | all | Wiki edition code (en, de, ja...) |
| titles | articles | Article titles |
| searchQueries | search | Keywords or phrases |
| category | category | Category name without Category: prefix |
| maxItems | all | Cap per query |
| includeContent | articles, search, category | Full plain-text body |
| includeReferences | articles, search, category | External + internal links + sections |

📤 Output (per item)

```json
{
  "mode": "articles",
  "title": "Artificial intelligence",
  "language": "en",
  "pageId": 1164,
  "summary": "Artificial intelligence (AI) refers to...",
  "content": "Full article text...",
  "wordCount": 12873,
  "sections": ["Goals", "History", "Methods"],
  "externalLinks": ["https://..."],
  "internalLinks": ["Machine learning", "Neural network"],
  "categories": ["Artificial intelligence", "Cybernetics"],
  "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "lastModified": "2026-04-30T...",
  "scrapedAt": "2026-05-06T..."
}
```
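As an illustration of consuming these items, here is a small filter that assembles a plain-text corpus from scraped articles. The field names match the output shape above; the language and word-count thresholds are arbitrary:

```python
def build_corpus(items: list[dict], language: str = "en", min_words: int = 500) -> str:
    """Join article items into one training-ready text blob,
    skipping other languages and very short pages."""
    docs = [
        f"# {item['title']}\n\n{item.get('content', '')}"
        for item in items
        if item.get("language") == language
        and item.get("wordCount", 0) >= min_words
    ]
    return "\n\n".join(docs)
```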

🎯 Use cases

| Who | Why |
| --- | --- |
| 🤖 LLM teams | Pretraining + fine-tuning datasets across languages |
| 📚 NLP researchers | Multilingual corpora, named-entity benchmarks |
| 📰 Journalists | Topic deep-dives + fact-checking pipelines |
| 🎓 Educators | Auto-build study material from any topic |
| 🧠 Knowledge graphs | Wikipedia as an entity backbone |

⚙️ Tech notes

  • Uses MediaWiki's official Action API + REST Summary API
  • No login, no API key; anonymous rate limits are generous (stay within fair use)
  • Plain-text extraction via explaintext=1 — already cleaned, no HTML/wikitext
  • Recent-changes uses rctype=edit|new to skip log noise
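The two API calls named above are easy to reproduce directly. A stdlib sketch of the request URLs this kind of scraper builds (the exact parameter sets the Actor uses are an assumption; these are standard MediaWiki parameters):

```python
from urllib.parse import urlencode

def extract_url(lang: str, title: str) -> str:
    """Action API query returning clean plain text (explaintext) for one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
        "formatversion": 2,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

def recent_changes_url(lang: str, limit: int = 50) -> str:
    """Recent edits and page creations only; rctype=edit|new filters out log entries."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rctype": "edit|new",
        "rcprop": "title|user|comment|ids|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"
```

Swapping the language subdomain (en, de, ja, ...) is all it takes to target another edition, which is how the multi-language support works.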

❓ FAQ

Are full Wikipedia dumps better? For one-shot pre-training, yes (free at dumps.wikimedia.org). This Actor is for targeted scrapes — specific topics, ongoing freshness, multi-language slices, or recent-changes monitoring.

Can I schedule it? Yes. Recent-changes mode is a natural fit for hourly Apify Schedules.

Will I hit rate limits? Almost never. MediaWiki's anonymous limits are generous, and the Actor adds automatic retries with exponential backoff.
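The retry behaviour mentioned in that last answer can be sketched as a small policy function. The fetch callable is injected so the logic runs without the network; the names and the retryable status set are illustrative, not the Actor's actual internals:

```python
import time

RETRYABLE = (429, 500, 502, 503)  # too-many-requests and transient server errors

class HTTPStatusError(Exception):
    """Minimal stand-in for an HTTP error carrying a status code."""
    def __init__(self, code: int):
        super().__init__(f"HTTP {code}")
        self.code = code

def fetch_with_backoff(fetch, url: str, retries: int = 4,
                       base_delay: float = 1.0, sleep=time.sleep):
    """Call fetch(url), retrying retryable HTTP errors with exponential backoff:
    waits base_delay, 2*base_delay, 4*base_delay, ... between attempts."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except HTTPStatusError as err:
            # Give up immediately on non-retryable codes or on the last attempt.
            if err.code not in RETRYABLE or attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Injecting sleep as well makes the backoff schedule observable in tests without real waiting.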