Wikipedia Article Extractor

Extract Wikipedia articles via the MediaWiki API. Get full text, summaries, sections, categories, images, and links in 12+ languages. Perfect for AI/ML training data and RAG.

Developer: Glass Ventures
Pricing: Pay per usage
Extract structured data from any Wikipedia article using the official MediaWiki API. Get full article text, summaries, sections, categories, images, links, and metadata in any of 12+ languages.
What does Wikipedia Article Extractor do?
Wikipedia Article Extractor lets you pull clean, structured data from Wikipedia at scale. Unlike web scrapers that parse HTML, this actor uses the official MediaWiki API -- the same API that powers Wikipedia itself. This means you get reliable, well-formatted plain text without HTML artifacts, broken layouts, or anti-bot issues.
The actor supports four flexible input methods: direct article URLs, article titles, search queries, and category names. You can mix and match these to build exactly the dataset you need. Need every article in the "Machine learning" category? Just enter the category name. Want specific articles in Japanese? Switch the language and enter titles.
This is the ideal tool for AI/ML engineers building training datasets, researchers collecting knowledge bases, and developers building RAG (Retrieval-Augmented Generation) pipelines. The output is clean plain text optimized for LLM consumption.
Use Cases
- AI/ML engineers -- Build high-quality training datasets from Wikipedia's vast knowledge base. Clean plain text output is ready for tokenization.
- RAG pipeline developers -- Extract structured articles with sections for chunk-based retrieval in vector databases.
- Researchers -- Collect articles on specific topics or entire categories for academic analysis, NLP research, or corpus building.
- Content creators -- Research topics with comprehensive summaries, section breakdowns, and reference counts.
- SEO professionals -- Analyze Wikipedia content structure, internal linking patterns, and category relationships.
- Fact-checkers -- Quickly pull article text, reference counts, and last-modified dates for verification workflows.
- Knowledge base builders -- Create structured knowledge bases from Wikipedia categories with full metadata.
Features
- 4 input methods: URLs, article titles, search terms, category names -- or combine them all
- Official MediaWiki API: No scraping needed. Reliable, fast, and respects Wikipedia's infrastructure
- 12+ languages: English, Spanish, French, German, Japanese, Portuguese, Italian, Russian, Chinese, Korean, Arabic, Hindi
- AI-friendly output: Clean plain text perfect for LLM training data, RAG pipelines, and NLP tasks
- Rich metadata: Word count, reference count, last modified date, page ID, categories
- Structured sections: Article broken down by heading with hierarchy levels
- Batch processing: Extract hundreds of articles in a single run
- Category crawling: Automatically fetch all articles from a Wikipedia category
- No proxy required: Wikipedia API is public and generous with rate limits
- Exports to JSON, CSV, Excel, or connect via API
How much will it cost?
Wikipedia Article Extractor is free to use -- you only pay for Apify platform compute time, which is minimal since the actor uses the lightweight MediaWiki API (no browser needed).
| Articles | Estimated Cost | Time |
|---|---|---|
| 100 | ~$0.01 | ~1 min |
| 1,000 | ~$0.05 | ~5 min |
| 10,000 | ~$0.50 | ~30 min |
| Cost Component | Per 1,000 Articles |
|---|---|
| Platform compute (256 MB) | ~$0.05 |
| Proxy (optional) | $0.00 |
| Total | ~$0.05 |
How to use
- Go to the Wikipedia Article Extractor page on Apify Store
- Click "Start" or "Try for free"
- Enter article URLs, titles, search terms, or category names
- Select the Wikipedia language edition
- Choose what data to include (full text, sections, categories, etc.)
- Set the maximum number of articles
- Click "Start" and wait for the results
Multi-language examples
Extract articles in different languages:
- English: Enter title "Artificial intelligence" with language "en"
- Spanish: Enter title "Inteligencia artificial" with language "es"
- Japanese: Enter title "人工知能" with language "ja"
- German: Enter title "Künstliche Intelligenz" with language "de"
Or use URLs directly -- the language is auto-detected:
https://fr.wikipedia.org/wiki/Intelligence_artificielle
https://zh.wikipedia.org/wiki/人工智能
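Auto-detection works because the language edition is the first label of the hostname. A minimal sketch of that logic (the function name is ours, not the actor's):

```python
from urllib.parse import urlparse

def detect_language(article_url: str) -> str:
    """Return the Wikipedia language code from an article URL.

    The edition is the first hostname label, e.g. 'fr' in fr.wikipedia.org.
    """
    host = urlparse(article_url).hostname or ""
    return host.split(".")[0]

print(detect_language("https://fr.wikipedia.org/wiki/Intelligence_artificielle"))  # fr
print(detect_language("https://zh.wikipedia.org/wiki/人工智能"))  # zh
```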
Input parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| startUrls | array | Direct Wikipedia article URLs | - |
| articleTitles | array | Article titles (e.g., "Albert Einstein") | - |
| searchTerms | array | Search queries to find articles | - |
| categories | array | Category names to extract all articles from | - |
| language | string | Wikipedia language edition (en, es, fr, de, ja, etc.) | en |
| includeFullText | boolean | Extract complete article text | true |
| includeSections | boolean | Extract sections with headings | true |
| includeCategories | boolean | Extract article categories | true |
| includeLinks | boolean | Extract internal Wikipedia links | false |
| includeImages | boolean | Extract image URLs | false |
| maxItems | number | Maximum articles to extract (0 = unlimited) | 100 |
| proxyConfig | object | Optional proxy settings | - |
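Putting the parameters together, a run input that combines titles with a category might look like this (values are illustrative):

```json
{
  "articleTitles": ["Albert Einstein", "Marie Curie"],
  "categories": ["Machine learning"],
  "language": "en",
  "includeFullText": true,
  "includeSections": true,
  "maxItems": 100
}
```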
Output
The actor produces a dataset with the following fields:
```json
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping",
  "pageId": 2696619,
  "language": "en",
  "summary": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "fullText": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser...",
  "sections": [
    {
      "heading": "Introduction",
      "text": "Web scraping, web harvesting, or web data extraction...",
      "level": 1
    },
    {
      "heading": "Techniques",
      "text": "Web scraping is the process of automatically mining data...",
      "level": 2
    }
  ],
  "categories": ["Web scraping", "Data mining", "Web technology"],
  "links": ["Data scraping", "Website", "Hypertext Transfer Protocol"],
  "images": ["https://commons.wikimedia.org/wiki/Special:FilePath/Example.png"],
  "lastModified": "2024-12-01T15:30:00Z",
  "wordCount": 4523,
  "referencesCount": 87,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}
```
| Field | Type | Description |
|---|---|---|
| url | string | Wikipedia article URL |
| title | string | Article title |
| pageId | integer | Wikipedia internal page ID |
| language | string | Language code (en, es, fr, etc.) |
| summary | string | Article introduction/summary in plain text |
| fullText | string | Complete article text in plain text |
| sections | array | Sections with heading, text, and level |
| categories | array | Article categories |
| links | array | Internal Wikipedia links |
| images | array | Image URLs from Wikimedia Commons |
| lastModified | string | Last edit timestamp |
| wordCount | integer | Total word count |
| referencesCount | integer | Number of citations/references |
| scrapedAt | string | ISO 8601 extraction timestamp |
How it works -- MediaWiki API
This actor uses the official MediaWiki API, which is the same API that powers Wikipedia's own interface, mobile apps, and third-party tools. Key endpoints used:
- `action=query&prop=extracts` -- Retrieves article text as clean plain text (no HTML)
- `action=query&prop=categories|links|images` -- Fetches article metadata
- `action=parse&prop=sections|wikitext` -- Parses article structure and raw wikitext
- `action=query&list=search` -- Searches for articles by keyword
- `action=query&list=categorymembers` -- Lists all articles in a category
The MediaWiki API is public, free, and does not require authentication. It has generous rate limits and is the most reliable way to access Wikipedia data.
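You can reproduce the main extracts call yourself. A minimal sketch that builds the request URL (the helper name is ours; `explaintext=1` is what strips HTML from the response):

```python
from urllib.parse import urlencode

def build_extract_url(title: str, lang: str = "en") -> str:
    """Build a MediaWiki API request for a plain-text article extract.

    Mirrors the actor's main endpoint (action=query&prop=extracts);
    explaintext=1 asks the API for plain text instead of HTML.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)

print(build_extract_url("Web scraping"))
```

Fetching that URL with any HTTP client returns JSON whose `query.pages` object contains the extract.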
Integrations
Connect Wikipedia Article Extractor with other tools:
- Apify API -- REST API for programmatic access
- Webhooks -- Get notified when a run finishes
- Zapier / Make -- Connect to 5,000+ apps
- Google Sheets -- Export directly to spreadsheets
- Vector databases -- Feed extracted text into Pinecone, Weaviate, Qdrant for RAG
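For the vector-database path, the `sections` array in each output record maps naturally to retrieval chunks. A sketch under the output schema above (the chunking scheme and function name are ours):

```python
def sections_to_chunks(article: dict, max_words: int = 300) -> list[dict]:
    """Turn one actor output record into retrieval-ready chunks.

    Each chunk keeps the article title and section heading as context,
    which helps embedding quality and makes citations traceable.
    """
    chunks = []
    for section in article.get("sections", []):
        words = section["text"].split()
        for i in range(0, len(words), max_words):
            chunks.append({
                "id": f'{article["pageId"]}-{section["heading"]}-{i // max_words}',
                "text": f'{article["title"]} / {section["heading"]}: '
                        + " ".join(words[i:i + max_words]),
            })
    return chunks
```

Each chunk's `id` is stable across runs, so re-ingesting an updated article overwrites rather than duplicates vectors.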
API Example (Node.js)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('YOUR_USERNAME/wikipedia-article-extractor').call({
  articleTitles: ['Artificial intelligence', 'Machine learning', 'Deep learning'],
  language: 'en',
  includeFullText: true,
  maxItems: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} articles`);
```
API Example (Python)
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('YOUR_USERNAME/wikipedia-article-extractor').call(run_input={
    'articleTitles': ['Artificial intelligence', 'Machine learning', 'Deep learning'],
    'language': 'en',
    'includeFullText': True,
    'maxItems': 100,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Extracted {len(items)} articles')
```
API Example (cURL)
```bash
curl "https://api.apify.com/v2/acts/YOUR_USERNAME~wikipedia-article-extractor/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{
    "articleTitles": ["Artificial intelligence", "Machine learning"],
    "language": "en",
    "includeFullText": true,
    "maxItems": 100
  }'
```
Tips and tricks
- Start with a small `maxItems` (5-10) to test your configuration before running large extractions
- Use article titles for best reliability -- they map directly to the API with no ambiguity
- Category extraction is powerful: a single category like "Machine learning" can yield hundreds of articles
- Combine input methods: search for a topic, then extract entire categories found in the results
- For AI training data, enable `includeFullText` and disable `includeLinks` and `includeImages` for clean text output
- For RAG pipelines, enable `includeSections` to get pre-chunked content with headings
- Wikipedia URLs auto-detect language, so you can mix English and French URLs in the same run
- No proxy needed for most use cases -- the MediaWiki API is public and generous with rate limits
FAQ
Q: Does this actor require login credentials? A: No. The MediaWiki API is completely public and free to use. No authentication needed.
Q: How fast is the extraction? A: Approximately 100-200 articles per minute depending on article size and data options selected. The actor makes multiple API calls per article (text, metadata, sections).
Q: Can I extract articles in any language?
A: The UI offers 12 popular languages, but you can use any Wikipedia language by providing URLs directly (e.g., https://sv.wikipedia.org/wiki/... for Swedish).
Q: What about rate limits? A: Wikipedia's API has generous rate limits. For very large extractions (10,000+ articles), the actor automatically paces requests. You can optionally configure a proxy to distribute requests.
Q: Can I extract talk pages or user pages? A: This actor is optimized for article (main namespace) pages. Talk pages and other namespaces may work via direct URLs but are not officially supported.
Q: Is the output suitable for LLM training? A: Yes. The plain text output is clean, well-structured, and free of HTML artifacts. It is ideal for tokenization and training.
Is it legal to extract data from Wikipedia?
Wikipedia content is released under the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. This means you are free to share and adapt Wikipedia content, even for commercial purposes, as long as you provide attribution and share derivatives under the same license.
The MediaWiki API is the officially supported way to programmatically access Wikipedia data. Wikipedia actively encourages bulk data access through its API and database dumps. For more information, see Apify's blog on web scraping legality.
Limitations
- Article text is plain text only (no HTML formatting, tables, or mathematical formulas)
- Infobox data is not extracted as structured key-value pairs (raw wikitext can be complex)
- Maximum of ~500 category members per category in a single pagination cycle
- Very large articles (100,000+ words) may take longer to process
- Search results are limited to 50 per query (Wikipedia API limit)
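The ~500-member ceiling per page is worked around by following the API's `cmcontinue` continuation token. A sketch of that pagination loop (the fetch function is injected here purely so the logic is testable offline):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def category_members(fetch_json, category: str):
    """Yield every page in a category, following cmcontinue tokens.

    `fetch_json(url)` is any callable returning the decoded JSON
    response. Each API response carries at most 500 members (the
    cmlimit ceiling), so large categories span several requests.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": 500,
        "format": "json",
    }
    while True:
        data = fetch_json(API + "?" + urlencode(params))
        yield from data["query"]["categorymembers"]
        cont = data.get("continue", {}).get("cmcontinue")
        if cont is None:
            return
        params["cmcontinue"] = cont
```

In real use, `fetch_json` would wrap an HTTP GET and `json.loads`; the actor applies the same continuation pattern internally.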
Changelog
- v0.1 (2026-04-23) -- Initial release with URL, title, search, and category input methods. Multi-language support. Full text, sections, categories, links, images, and metadata extraction.