Wikipedia Article Extractor

Extract Wikipedia articles via the MediaWiki API. Get full text, summaries, sections, categories, images, and links in 12+ languages. Perfect for AI/ML training data and RAG.

Developer: Glass Ventures
Pricing: Pay per usage
Extract structured data from any Wikipedia article using the official MediaWiki API. Get full article text, summaries, sections, categories, images, links, and metadata in any of 12+ languages.
What does Wikipedia Article Extractor do?
Wikipedia Article Extractor lets you pull clean, structured data from Wikipedia at scale. Unlike web scrapers that parse HTML, this actor uses the official MediaWiki API -- the same API that powers Wikipedia itself. This means you get reliable, well-formatted plain text without HTML artifacts, broken layouts, or anti-bot issues.
The actor supports four flexible input methods: direct article URLs, article titles, search queries, and category names. You can mix and match these to build exactly the dataset you need. Need every article in the "Machine learning" category? Just enter the category name. Want specific articles in Japanese? Switch the language and enter titles.
This is the ideal tool for AI/ML engineers building training datasets, researchers collecting knowledge bases, and developers building RAG (Retrieval-Augmented Generation) pipelines. The output is clean plain text optimized for LLM consumption.
Use Cases
- AI/ML engineers -- Build high-quality training datasets from Wikipedia's vast knowledge base. Clean plain text output is ready for tokenization.
- RAG pipeline developers -- Extract structured articles with sections for chunk-based retrieval in vector databases.
- Researchers -- Collect articles on specific topics or entire categories for academic analysis, NLP research, or corpus building.
- Content creators -- Research topics with comprehensive summaries, section breakdowns, and reference counts.
- SEO professionals -- Analyze Wikipedia content structure, internal linking patterns, and category relationships.
- Fact-checkers -- Quickly pull article text, reference counts, and last-modified dates for verification workflows.
- Knowledge base builders -- Create structured knowledge bases from Wikipedia categories with full metadata.
Features
- 4 input methods: URLs, article titles, search terms, category names -- or combine them all
- Official MediaWiki API: No scraping needed. Reliable, fast, and respects Wikipedia's infrastructure
- 12+ languages: English, Spanish, French, German, Japanese, Portuguese, Italian, Russian, Chinese, Korean, Arabic, Hindi
- AI-friendly output: Clean plain text perfect for LLM training data, RAG pipelines, and NLP tasks
- Rich metadata: Word count, reference count, last modified date, page ID, categories
- Structured sections: Article broken down by heading with hierarchy levels
- Batch processing: Extract hundreds of articles in a single run
- Category crawling: Automatically fetch all articles from a Wikipedia category
- No proxy required: Wikipedia API is public and generous with rate limits
- Exports to JSON, CSV, Excel, or connect via API
How much will it cost?
Wikipedia Article Extractor is free to use -- you only pay for Apify platform compute time, which is minimal since the actor uses the lightweight MediaWiki API (no browser needed).
| Articles | Estimated Cost | Time |
|---|---|---|
| 100 | ~$0.01 | ~1 min |
| 1,000 | ~$0.05 | ~5 min |
| 10,000 | ~$0.50 | ~30 min |
| Cost Component | Per 1,000 Articles |
|---|---|
| Platform compute (256 MB) | ~$0.05 |
| Proxy (optional) | $0.00 |
| Total | ~$0.05 |
How to use
- Go to the Wikipedia Article Extractor page on Apify Store
- Click "Start" or "Try for free"
- Enter article URLs, titles, search terms, or category names
- Select the Wikipedia language edition
- Choose what data to include (full text, sections, categories, etc.)
- Set the maximum number of articles
- Click "Start" and wait for the results
Multi-language examples
Extract articles in different languages:
- English: Enter title "Artificial intelligence" with language "en"
- Spanish: Enter title "Inteligencia artificial" with language "es"
- Japanese: Enter title "人工知能" with language "ja"
- German: Enter title "Künstliche Intelligenz" with language "de"
Or use URLs directly -- the language is auto-detected:
https://fr.wikipedia.org/wiki/Intelligence_artificielle
https://zh.wikipedia.org/wiki/人工智能
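Auto-detection works because the language edition is the first label of the hostname. A minimal sketch of that logic (the function name is ours, not the actor's):

```python
from urllib.parse import urlparse

def detect_language(article_url: str) -> str:
    """Return the Wikipedia language code from an article URL.

    The edition is the first hostname label, e.g. 'fr' in fr.wikipedia.org.
    """
    host = urlparse(article_url).hostname or ""
    return host.split(".")[0]

print(detect_language("https://fr.wikipedia.org/wiki/Intelligence_artificielle"))  # fr
print(detect_language("https://zh.wikipedia.org/wiki/人工智能"))  # zh
```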
Input parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| startUrls | array | Direct Wikipedia article URLs | - |
| articleTitles | array | Article titles (e.g., "Albert Einstein") | - |
| searchTerms | array | Search queries to find articles | - |
| categories | array | Category names to extract all articles from | - |
| language | string | Wikipedia language edition (en, es, fr, de, ja, etc.) | en |
| includeFullText | boolean | Extract complete article text | true |
| includeSections | boolean | Extract sections with headings | true |
| includeCategories | boolean | Extract article categories | true |
| includeLinks | boolean | Extract internal Wikipedia links | false |
| includeImages | boolean | Extract image URLs | false |
| maxItems | number | Maximum articles to extract (0 = unlimited) | 100 |
| proxyConfig | object | Optional proxy settings | - |
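Putting the parameters together, a run input that combines titles with a category might look like this (values are illustrative):

```json
{
  "articleTitles": ["Albert Einstein", "Marie Curie"],
  "categories": ["Machine learning"],
  "language": "en",
  "includeFullText": true,
  "includeSections": true,
  "maxItems": 100
}
```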
Output
The actor produces a dataset with the following fields:
```json
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping",
  "pageId": 2696619,
  "language": "en",
  "summary": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites...",
  "fullText": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser...",
  "sections": [
    {
      "heading": "Introduction",
      "text": "Web scraping, web harvesting, or web data extraction...",
      "level": 1
    },
    {
      "heading": "Techniques",
      "text": "Web scraping is the process of automatically mining data...",
      "level": 2
    }
  ],
  "categories": ["Web scraping", "Data mining", "Web technology"],
  "links": ["Data scraping", "Website", "Hypertext Transfer Protocol"],
  "images": ["https://commons.wikimedia.org/wiki/Special:FilePath/Example.png"],
  "lastModified": "2024-12-01T15:30:00Z",
  "wordCount": 4523,
  "referencesCount": 87,
  "scrapedAt": "2025-01-15T10:30:00.000Z"
}
```
| Field | Type | Description |
|---|---|---|
| url | string | Wikipedia article URL |
| title | string | Article title |
| pageId | integer | Wikipedia internal page ID |
| language | string | Language code (en, es, fr, etc.) |
| summary | string | Article introduction/summary in plain text |
| fullText | string | Complete article text in plain text |
| sections | array | Sections with heading, text, and level |
| categories | array | Article categories |
| links | array | Internal Wikipedia links |
| images | array | Image URLs from Wikimedia Commons |
| lastModified | string | Last edit timestamp |
| wordCount | integer | Total word count |
| referencesCount | integer | Number of citations/references |
| scrapedAt | string | ISO 8601 extraction timestamp |
How it works -- MediaWiki API
This actor uses the official MediaWiki API, which is the same API that powers Wikipedia's own interface, mobile apps, and third-party tools. Key endpoints used:
- `action=query&prop=extracts` -- Retrieves article text as clean plain text (no HTML)
- `action=query&prop=categories|links|images` -- Fetches article metadata
- `action=parse&prop=sections|wikitext` -- Parses article structure and raw wikitext
- `action=query&list=search` -- Searches for articles by keyword
- `action=query&list=categorymembers` -- Lists all articles in a category
The MediaWiki API is public, free, and does not require authentication. It has generous rate limits and is the most reliable way to access Wikipedia data.
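You can reproduce the main extracts call yourself. A minimal sketch that builds the request URL (the helper name is ours; `explaintext=1` is what strips HTML from the response):

```python
from urllib.parse import urlencode

def build_extract_url(title: str, lang: str = "en") -> str:
    """Build a MediaWiki API request for a plain-text article extract.

    Mirrors the actor's main endpoint (action=query&prop=extracts);
    explaintext=1 asks the API for plain text instead of HTML.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)

print(build_extract_url("Web scraping"))
```

Fetching that URL with any HTTP client returns JSON whose `query.pages` object contains the extract.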
Integrations
Connect Wikipedia Article Extractor with other tools:
- Apify API -- REST API for programmatic access
- Webhooks -- Get notified when a run finishes
- Zapier / Make -- Connect to 5,000+ apps
- Google Sheets -- Export directly to spreadsheets
- Vector databases -- Feed extracted text into Pinecone, Weaviate, Qdrant for RAG
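For the vector-database path, the `sections` array in each output record maps naturally to retrieval chunks. A sketch under the output schema above (the chunking scheme and function name are ours):

```python
def sections_to_chunks(article: dict, max_words: int = 300) -> list[dict]:
    """Turn one actor output record into retrieval-ready chunks.

    Each chunk keeps the article title and section heading as context,
    which helps embedding quality and makes citations traceable.
    """
    chunks = []
    for section in article.get("sections", []):
        words = section["text"].split()
        for i in range(0, len(words), max_words):
            chunks.append({
                "id": f'{article["pageId"]}-{section["heading"]}-{i // max_words}',
                "text": f'{article["title"]} / {section["heading"]}: '
                        + " ".join(words[i:i + max_words]),
            })
    return chunks
```

Each chunk's `id` is stable across runs, so re-ingesting an updated article overwrites rather than duplicates vectors.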
API Example (Node.js)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('YOUR_USERNAME/wikipedia-article-extractor').call({
  articleTitles: ['Artificial intelligence', 'Machine learning', 'Deep learning'],
  language: 'en',
  includeFullText: true,
  maxItems: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} articles`);
```
API Example (Python)
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('YOUR_USERNAME/wikipedia-article-extractor').call(run_input={
    'articleTitles': ['Artificial intelligence', 'Machine learning', 'Deep learning'],
    'language': 'en',
    'includeFullText': True,
    'maxItems': 100,
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Extracted {len(items)} articles')
```
API Example (cURL)
```bash
curl "https://api.apify.com/v2/acts/YOUR_USERNAME~wikipedia-article-extractor/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{
    "articleTitles": ["Artificial intelligence", "Machine learning"],
    "language": "en",
    "includeFullText": true,
    "maxItems": 100
  }'
```
Tips and tricks
- Start with a small `maxItems` (5-10) to test your configuration before running large extractions
- Use article titles for best reliability -- they map directly to the API with no ambiguity
- Category extraction is powerful: a single category like "Machine learning" can yield hundreds of articles
- Combine input methods: search for a topic, then extract entire categories found in the results
- For AI training data, enable `includeFullText` and disable `includeLinks` and `includeImages` for clean text output
- For RAG pipelines, enable `includeSections` to get pre-chunked content with headings
- Wikipedia URLs auto-detect language, so you can mix English and French URLs in the same run
- No proxy needed for most use cases -- the MediaWiki API is public and generous with rate limits
FAQ
Q: Does this actor require login credentials? A: No. The MediaWiki API is completely public and free to use. No authentication needed.
Q: How fast is the extraction? A: Approximately 100-200 articles per minute depending on article size and data options selected. The actor makes multiple API calls per article (text, metadata, sections).
Q: Can I extract articles in any language?
A: The UI offers 12 popular languages, but you can use any Wikipedia language by providing URLs directly (e.g., https://sv.wikipedia.org/wiki/... for Swedish).
Q: What about rate limits? A: Wikipedia's API has generous rate limits. For very large extractions (10,000+ articles), the actor automatically paces requests. You can optionally configure a proxy to distribute requests.
Q: Can I extract talk pages or user pages? A: This actor is optimized for article (main namespace) pages. Talk pages and other namespaces may work via direct URLs but are not officially supported.
Q: Is the output suitable for LLM training? A: Yes. The plain text output is clean, well-structured, and free of HTML artifacts. It is ideal for tokenization and training.
Is it legal to extract data from Wikipedia?
Wikipedia content is released under the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License. This means you are free to share and adapt Wikipedia content, even for commercial purposes, as long as you provide attribution and share derivatives under the same license.
The MediaWiki API is the officially supported way to programmatically access Wikipedia data. Wikipedia actively encourages bulk data access through its API and database dumps. For more information, see Apify's blog on web scraping legality.
Limitations
- Article text is plain text only (no HTML formatting, tables, or mathematical formulas)
- Infobox data is not extracted as structured key-value pairs (raw wikitext can be complex)
- Maximum of ~500 category members per category in a single pagination cycle
- Very large articles (100,000+ words) may take longer to process
- Search results are limited to 50 per query (Wikipedia API limit)
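The ~500-member ceiling per page is worked around by following the API's `cmcontinue` continuation token. A sketch of that pagination loop (the fetch function is injected here purely so the logic is testable offline):

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def category_members(fetch_json, category: str):
    """Yield every page in a category, following cmcontinue tokens.

    `fetch_json(url)` is any callable returning the decoded JSON
    response. Each API response carries at most 500 members (the
    cmlimit ceiling), so large categories span several requests.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": 500,
        "format": "json",
    }
    while True:
        data = fetch_json(API + "?" + urlencode(params))
        yield from data["query"]["categorymembers"]
        cont = data.get("continue", {}).get("cmcontinue")
        if cont is None:
            return
        params["cmcontinue"] = cont
```

In real use, `fetch_json` would wrap an HTTP GET and `json.loads`; the actor applies the same continuation pattern internally.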
Changelog
- v0.1 (2026-04-23) -- Initial release with URL, title, search, and category input methods. Multi-language support. Full text, sections, categories, links, images, and metadata extraction.