# Wikipedia Scraper - Articles, Summaries, Metadata (`santamaria-automations/wikipedia-scraper`) Actor

Extract Wikipedia articles including full content, summary, thumbnails, categories, external links, coordinates, and Wikidata IDs. Multi-language support for 12+ languages. Export data, run via API, schedule and monitor runs, or integrate with other tools.

- **URL**: https://apify.com/santamaria-automations/wikipedia-scraper.md
- **Developed by:** [Alessandro Santamaria](https://apify.com/santamaria-automations) (community)
- **Categories:** AI, News, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

From $2.00 / 1,000 scraped articles

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) or as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Wikipedia Scraper - Articles, Summaries, Metadata

Scrape Wikipedia articles at scale — full content, summaries, images, categories, and Wikidata links.

Build AI training datasets, knowledge graphs, research corpora, or enrich your app with encyclopedic facts. Powered by the official MediaWiki REST API for clean, reliable, respectful data extraction.

### Features

- **12+ Languages** — English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Japanese, Chinese, Arabic
- **Full content extraction** — plain text, cleaned HTML, and per-section breakdown with titles and heading levels
- **Summaries & descriptions** — one-line descriptions and first-paragraph extracts
- **Images** — thumbnails, main image, and all article images
- **Structured metadata** — categories, external links, references/citations
- **Wikidata linking** — every article comes with its Q-ID for entity resolution
- **Geo coordinates** — lat/lng for places, landmarks, and geographic entities
- **Pageviews** — 30-day view counts from the Wikimedia pageviews API
- **Disambiguation detection** — flag ambiguous pages before ingesting
- **Search** — find articles by keyword, not just by title
- **No auth, no anti-bot** — uses the public MediaWiki API; no tokens, no captchas

### Input

```json
{
  "titles": ["Berlin", "Albert_Einstein", "Machine_learning"],
  "searchQuery": "quantum physics",
  "urls": ["https://en.wikipedia.org/wiki/Quantum_computing"],
  "language": "en",
  "includeFullContent": true,
  "includeImages": true,
  "includeReferences": false,
  "maxSearchResults": 10
}
````

| Field | Type | Description |
|---|---|---|
| `titles` | array | Direct Wikipedia article titles |
| `searchQuery` | string | Keyword search (returns top N matches) |
| `urls` | array | Wikipedia URLs — title is auto-extracted |
| `language` | enum | Wiki edition: `en`, `de`, `fr`, `es`, `it`, `pt`, `nl`, `pl`, `ru`, `ja`, `zh`, `ar` |
| `includeFullContent` | bool | Fetch full article body + sections (default `true`) |
| `includeImages` | bool | Include all image URLs (default `true`) |
| `includeReferences` | bool | Include citations (default `false`) |
| `maxSearchResults` | int | Cap on search results (default `10`) |

### Output Example

Sample output for `Berlin` (English Wikipedia), with long values truncated:

```json
{
  "title": "Berlin",
  "url": "https://en.wikipedia.org/wiki/Berlin",
  "language": "en",
  "page_id": 3354,
  "revision_id": 1234567890,
  "extract": "Berlin is the capital and largest city of Germany by both area and population...",
  "description": "Capital and largest city of Germany",
  "content_full": "Berlin is the capital and largest city of Germany...",
  "content_html": "<section>...</section>",
  "thumbnail_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/.../Berlin.jpg",
  "main_image_url": "https://upload.wikimedia.org/wikipedia/commons/.../Berlin.jpg",
  "images": ["https://upload.wikimedia.org/..."],
  "sections": [
    { "title": "History", "level": 2, "text": "The earliest evidence of settlements..." },
    { "title": "Geography", "level": 2, "text": "Berlin is in northeastern Germany..." }
  ],
  "categories": ["Berlin", "Capitals in Europe", "Cities in Germany"],
  "external_links": ["https://www.berlin.de/", "..."],
  "coordinates": { "lat": 52.52, "lng": 13.405 },
  "wikidata_id": "Q64",
  "last_modified": "2026-04-01T12:34:56Z",
  "word_count": 15842,
  "view_count_30d": 1482391,
  "is_disambiguation": false,
  "scraped_at": "2026-04-07T10:00:00Z"
}
```
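
Downstream code usually filters and reshapes these records. As a rough sketch that relies only on field names shown in the example above, the hypothetical helper `to_chunks` skips disambiguation pages and turns each section into a small record, for instance for indexing or embedding.

```python
def to_chunks(article: dict) -> list:
    """Hypothetical helper: return one chunk per section, tagged with the article's Wikidata ID."""
    if article.get("is_disambiguation"):
        return []  # skip pages that only list possible meanings
    chunks = []
    # `sections` may be missing when full content was not requested.
    for section in article.get("sections") or []:
        chunks.append({
            "wikidata_id": article.get("wikidata_id"),
            "source_url": article.get("url"),
            "heading": section["title"],
            "level": section["level"],
            "text": section["text"],
        })
    return chunks


# Trimmed copy of the example record above.
sample = {
    "title": "Berlin",
    "url": "https://en.wikipedia.org/wiki/Berlin",
    "wikidata_id": "Q64",
    "is_disambiguation": False,
    "sections": [
        {"title": "History", "level": 2, "text": "The earliest evidence of settlements..."},
        {"title": "Geography", "level": 2, "text": "Berlin is in northeastern Germany..."},
    ],
}

for chunk in to_chunks(sample):
    print(chunk["wikidata_id"], chunk["heading"])
```

Keeping `wikidata_id` on every chunk lets each piece of text be joined back to its entity later.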

### Use Cases

- **AI/LLM training data** — Build high-quality, well-structured datasets for fine-tuning language models. Wikipedia is the gold standard for encyclopedic corpora.
- **Knowledge graphs** — Link entities in your database to Wikidata Q-IDs. Every article comes with its canonical identifier, coordinates, and categories.
- **Academic research** — Extract literature review material, cross-reference citations, and build topic-specific corpora across languages.
- **Content generation** — Enrich articles, product pages, and blog posts with verified encyclopedia facts. Add "Did you know" boxes and related topic links.
- **Fact-checking pipelines** — Verify claims against Wikipedia extracts and last-modified timestamps. Flag disambiguation pages automatically.
- **Travel content** — Pull city, landmark, and attraction data with coordinates for travel blogs, booking sites, and map overlays.
- **Biographies** — Scrape person articles for journalism, CRM enrichment, or historical datasets. Link people to their Wikidata records.

### Pricing

Pay-per-event: you only pay for articles you actually extract.

| Event | Price |
|---|---|
| `enrichment-start` | $0.001 |
| `enrichment-result` | $0.002 per article |

**Example costs:**

- **100 articles** — ~$0.20
- **1,000 articles** — ~$2.00
- **10,000 articles** — ~$20.00

No proxy costs: the Actor reads from Wikipedia's public API.
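
The example costs follow directly from the event prices in the table. Here is a small estimator as a sketch; it assumes `enrichment-start` is charged once per run and `enrichment-result` once per extracted article, which the table suggests but does not state outright. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

START_PRICE = Decimal("0.001")   # `enrichment-start`, assumed to be charged once per run
RESULT_PRICE = Decimal("0.002")  # `enrichment-result`, charged per article


def estimate_cost(articles: int, runs: int = 1) -> Decimal:
    """Hypothetical helper: estimated cost in USD for `articles` spread over `runs` runs."""
    if articles < 0 or runs < 1:
        raise ValueError("articles must be >= 0 and runs must be >= 1")
    return START_PRICE * runs + RESULT_PRICE * articles


for n in (100, 1_000, 10_000):
    print(n, estimate_cost(n))
```

For a single run this gives $0.201, $2.001, and $20.001, which round to the approximate figures listed above.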

### Issues & Feedback

Found a bug or want a feature? [Open an issue](https://console.apify.com/actors/ACTOR_ID_PLACEHOLDER/issues).

### Related Actors

- [HTML to Markdown](https://apify.com/santamaria-automations/html-to-markdown) — Convert scraped HTML into LLM-ready Markdown
- [RSS Feed Reader](https://apify.com/santamaria-automations/rss-feed-reader) — Bulk parse RSS, Atom and JSON feeds
- [Website Content Crawler](https://apify.com/apify/website-content-crawler) — Crawl full websites and extract text
- [Google Maps Scraper](https://apify.com/santamaria-automations/google-maps-scraper) — Business listings, reviews, and geo data

# Actor input Schema

## `titles` (type: `array`):

List of Wikipedia article titles to scrape. Use underscores or spaces (e.g. `Albert_Einstein` or `Albert Einstein`). Titles are case-sensitive and should match MediaWiki's canonical form, including the capitalized first letter.
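
On Wikipedia, MediaWiki treats spaces and underscores in titles as equivalent and capitalizes the first letter, while the remaining letters stay case-sensitive. Whether the Actor normalizes titles itself is not stated here, so the sketch below does it on the client; `normalize_title` is a hypothetical helper.

```python
def normalize_title(title: str) -> str:
    """Hypothetical helper: collapse whitespace/underscores to single underscores, capitalize the first letter."""
    joined = "_".join(title.replace("_", " ").split())
    # Only the first character is changed; the rest of the title is case-sensitive.
    return joined[:1].upper() + joined[1:]


print([normalize_title(t) for t in ["Albert Einstein", "machine_learning", "  berlin "]])
```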

## `searchQuery` (type: `string`):

Optional keyword search. Returns the top N matching articles (controlled by Max Search Results). Combine with Titles/URLs for a mixed batch.

## `urls` (type: `array`):

Direct Wikipedia article URLs. Title is extracted automatically (e.g. https://en.wikipedia.org/wiki/Berlin).
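
To illustrate what extracting a title from a URL involves, here is a minimal sketch built on the standard library. It is not the Actor's implementation: `parse_wikipedia_url` is a hypothetical helper, and it assumes article URLs of the form `https://<language>.wikipedia.org/wiki/<title>`.

```python
from __future__ import annotations

from urllib.parse import unquote, urlparse


def parse_wikipedia_url(url: str) -> tuple[str, str]:
    """Hypothetical helper: return (language, title) for a /wiki/ article URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if not host.endswith(".wikipedia.org") or not parsed.path.startswith("/wiki/"):
        raise ValueError(f"not a Wikipedia article URL: {url!r}")
    language = host.split(".")[0]  # first host label, e.g. "en"
    title = unquote(parsed.path[len("/wiki/"):])  # decode percent-encoded characters
    if not title:
        raise ValueError(f"URL has no article title: {url!r}")
    return language, title


print(parse_wikipedia_url("https://en.wikipedia.org/wiki/Quantum_computing"))
```

A helper like this is also handy for deduplicating a mixed batch of `titles` and `urls` before submitting it.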

## `language` (type: `string`):

Wikipedia language edition to use.

## `includeFullContent` (type: `boolean`):

Fetch the full article body, sections, and word count. Disable it to fetch summaries only and reduce bandwidth.

## `includeImages` (type: `boolean`):

Include all image URLs found in the article body.

## `includeReferences` (type: `boolean`):

Include the full citations/references list. Off by default since it can be large.

## `maxSearchResults` (type: `integer`):

Maximum number of articles to fetch when Search Query is used.

## `proxyConfiguration` (type: `object`):

Optional proxy. Wikipedia is public — proxies are rarely needed.

## Actor input object example

```json
{
  "titles": [
    "Berlin",
    "Albert_Einstein",
    "Machine_learning"
  ],
  "language": "en",
  "includeFullContent": true,
  "includeImages": true,
  "includeReferences": false,
  "maxSearchResults": 10,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```

# Actor output Schema

## `articles` (type: `string`):

Dataset of Wikipedia article records

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "titles": [
        "Berlin",
        "Albert_Einstein",
        "Machine_learning"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("santamaria-automations/wikipedia-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "titles": [
        "Berlin",
        "Albert_Einstein",
        "Machine_learning",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("santamaria-automations/wikipedia-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "titles": [
    "Berlin",
    "Albert_Einstein",
    "Machine_learning"
  ]
}' |
apify call santamaria-automations/wikipedia-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=santamaria-automations/wikipedia-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Wikipedia Scraper - Articles, Summaries, Metadata",
        "description": "Extract Wikipedia articles including full content, summary, thumbnails, categories, external links, coordinates, and Wikidata IDs. Multi-language support for 12+ languages. Export data, run via API, schedule and monitor runs, or integrate with other tools.",
        "version": "1.0",
        "x-build-id": "PV7rIxfGqqofXcAKG"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/santamaria-automations~wikipedia-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-santamaria-automations-wikipedia-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/santamaria-automations~wikipedia-scraper/runs": {
            "post": {
                "operationId": "runs-sync-santamaria-automations-wikipedia-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/santamaria-automations~wikipedia-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-santamaria-automations-wikipedia-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "titles": {
                        "title": "Article Titles",
                        "type": "array",
                        "description": "List of Wikipedia article titles to scrape. Use underscores or spaces (e.g. 'Albert_Einstein' or 'Albert Einstein'). Case-sensitive for the first letter matches MediaWiki canonical form.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "searchQuery": {
                        "title": "Search Query",
                        "type": "string",
                        "description": "Optional keyword search. Returns the top N matching articles (controlled by Max Search Results). Combine with Titles/URLs for a mixed batch."
                    },
                    "urls": {
                        "title": "Wikipedia URLs",
                        "type": "array",
                        "description": "Direct Wikipedia article URLs. Title is extracted automatically (e.g. https://en.wikipedia.org/wiki/Berlin).",
                        "items": {
                            "type": "string"
                        }
                    },
                    "language": {
                        "title": "Language",
                        "enum": [
                            "en",
                            "de",
                            "fr",
                            "es",
                            "it",
                            "pt",
                            "nl",
                            "pl",
                            "ru",
                            "ja",
                            "zh",
                            "ar"
                        ],
                        "type": "string",
                        "description": "Wikipedia language edition to use.",
                        "default": "en"
                    },
                    "includeFullContent": {
                        "title": "Include Full Content",
                        "type": "boolean",
                        "description": "Fetch the full article body, sections, and word count. Disable for summary-only to reduce bandwidth.",
                        "default": true
                    },
                    "includeImages": {
                        "title": "Include Images",
                        "type": "boolean",
                        "description": "Include all image URLs found in the article body.",
                        "default": true
                    },
                    "includeReferences": {
                        "title": "Include References",
                        "type": "boolean",
                        "description": "Include the full citations/references list. Off by default since it can be large.",
                        "default": false
                    },
                    "maxSearchResults": {
                        "title": "Max Search Results",
                        "minimum": 1,
                        "maximum": 50,
                        "type": "integer",
                        "description": "Maximum number of articles to fetch when Search Query is used.",
                        "default": 10
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Optional proxy. Wikipedia is public — proxies are rarely needed.",
                        "default": {
                            "useApifyProxy": false
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
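
For projects that do not use a client library, the `run-sync-get-dataset-items` endpoint from the specification above can be called directly. The sketch below only builds the request with the standard library and does not send it; `build_run_request` is a hypothetical helper, and the server URL, path, and `token` query parameter are taken from the specification.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/santamaria-automations~wikipedia-scraper/run-sync-get-dataset-items"


def build_run_request(token: str, run_input: dict) -> Request:
    """Hypothetical helper: build, but do not send, the POST request described by the specification."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(run_input).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


request = build_run_request("<YOUR_API_TOKEN>", {"titles": ["Berlin"]})
print(request.get_method(), request.full_url)
# To execute the run, pass `request` to urllib.request.urlopen() and read the response body.
```

Because this endpoint waits for the run to finish, long batches may be better served by the asynchronous `/runs` endpoint followed by a separate dataset read.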
