Pricing

Pay per event

Wikipedia Scraper

Extract Wikipedia article text, summary, infobox, references, and categories via the Wikipedia API — one row per article, in any language — export to JSON or CSV. We handle title normalisation, redirects, retries, and rate-limit pacing so your dataset arrives clean.

Developer

DevilScrapes

Last modified

2 months ago

🎯 What this scrapes

Wikipedia is the world's most-cited knowledge base and the go-to seed corpus for RAG pipelines, NLP benchmarks, and knowledge graphs. The official REST API at en.wikipedia.org/api/rest_v1/ is reliable, but it hands you one article at a time — no bulk mode, no redirect-following, no scheduling, no structured output. This Actor takes a list of titles or URLs (in any Wikipedia language), normalises them, follows redirects, retries on transient failures, and writes one clean row per article: summary, plain-text body, infobox data, references, categories, and lead image.
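The title-normalisation step described above can be sketched roughly as follows. This is an illustration, not the Actor's actual code: the function names `normalise_title` and `summary_url` are hypothetical, and the use of the REST API's `page/summary` endpoint is an assumption about which endpoint serves the summary.

```python
from urllib.parse import quote, unquote, urlparse


def normalise_title(value: str, default_language: str = "en") -> tuple[str, str]:
    """Return (language, title) for a bare title or a full article URL."""
    value = value.strip()
    language = default_language
    if value.startswith(("http://", "https://")):
        parsed = urlparse(value)
        host = parsed.hostname or ""
        if host.endswith(".wikipedia.org"):
            # en.wikipedia.org and en.m.wikipedia.org both yield "en".
            language = host.split(".")[0]
        # Article URLs look like /wiki/<Title>; the fragment is already split off.
        value = unquote(parsed.path.removeprefix("/wiki/"))
    # MediaWiki treats underscores and runs of whitespace as single spaces.
    title = " ".join(value.replace("_", " ").split())
    # Wikipedia titles are case-insensitive in their first character.
    return language, title[:1].upper() + title[1:]


def summary_url(language: str, title: str) -> str:
    """Build a REST summary URL (assumed endpoint) for a normalised title."""
    return (
        f"https://{language}.wikipedia.org/api/rest_v1/page/summary/"
        + quote(title.replace(" ", "_"), safe="")
    )
```

Returning the language alongside the title lets a pasted URL from another language edition override the run-wide `language` setting, which is one plausible way to handle mixed input.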

Infobox preservation is the feature most Wikipedia scraper tools skip, because parsing infoboxes is genuinely fiddly. Structured facts (birth dates, populations, capitals, taxonomic ranks) are what make a Wikipedia-grounded RAG useful for question-answering, not just paragraph retrieval. We keep the infobox HTML wherever the article has one.

Whether you need to bulk download Wikipedia articles for a RAG corpus, build a multilingual Wikipedia dataset, or run a scheduled refresh on a curated article list, this Actor handles the repetitive plumbing so you can focus on the downstream work.

🔥 Features — what we handle for you

🛡️ Browser fingerprint rotation — curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the upstream sees a real browser client, not a Python script.
🌐 Proxy rotation via Apify Proxy — fresh session and exit IP on every block or throttle response.
🔁 Retries with exponential backoff — up to 5 attempts per article on 408 / 429 / 5xx, with Retry-After headers honoured precisely.
🧱 Rate-limit-aware pacing — when the upstream pushes back we slow down and surface partial progress; we never silently return an empty dataset.
🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
💰 Pay-Per-Event pricing: a small start fee per run, plus a charge for each result that lands in your dataset. No rows, no per-result charge.
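The retry behaviour in the list above (up to 5 attempts on 408 / 429 / 5xx, honouring Retry-After) could look roughly like this. It is a minimal sketch under assumptions: the base delay, the 60-second cap, and the jitter strategy are illustrative choices, not values documented for this Actor, and only the numeric form of Retry-After is handled.

```python
import random
from typing import Optional

RETRYABLE_STATUSES = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5  # taken from the feature list above


def should_retry(status: int, attempt: int) -> bool:
    """Retry only retryable statuses, and only while attempts remain."""
    return status in RETRYABLE_STATUSES and attempt < MAX_ATTEMPTS


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0,
                  jitter: bool = True) -> float:
    """Seconds to wait before the next attempt (attempt is 1-based)."""
    # A numeric Retry-After header takes priority over the computed delay.
    # (The HTTP-date form of the header is not handled in this sketch.)
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Exponential growth: base, 2*base, 4*base, ... limited by the cap.
    delay = min(cap, base * 2 ** (attempt - 1))
    # Full jitter spreads retries out so parallel workers do not sync up.
    return random.uniform(0, delay) if jitter else delay
```

With `jitter=False` the delays for attempts 1 to 4 are 1, 2, 4, and 8 seconds, which makes the schedule easy to reason about when debugging.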

💡 Use cases

RAG corpus seeding: bulk-download Wikipedia articles for a domain-specific knowledge base (Renaissance art, medical terminology, legal concepts) and load them directly into LangChain, LlamaIndex, Chroma, or Weaviate.
Wikipedia dataset for RAG — build a refreshable article corpus that stays current without downloading the 100 GB monthly XML dump.
Multilingual Wikipedia dataset — fetch the same article across 10+ languages, ID-aligned, for cross-lingual evaluation or translation benchmarks.
Wikipedia infobox extraction — pull structured facts (dates, coordinates, taxonomy, population) that most Wikipedia scraper tools discard.
Definition harvesting — pull the lead sentence for every term in a glossary or ontology.
Change monitoring — schedule weekly runs and diff last_modified timestamps to detect article updates.
Wikipedia text extraction API — replace ad-hoc wikipedia Python library calls with a managed, scalable pipeline that handles retries and output formatting for you.
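For the change-monitoring use case, diffing two exported runs might be sketched like this. The function name `changed_articles` is hypothetical; the field names (`language`, `pageid`, `title`, `last_modified`) come from the output table, and treating newly seen articles as changed is an assumption about what you would want.

```python
def changed_articles(previous: list[dict], current: list[dict]) -> list[str]:
    """Titles whose last_modified differs between two exported runs."""
    # Key on (language, pageid) so a renamed article still matches.
    before = {
        (row["language"], row["pageid"]): row.get("last_modified")
        for row in previous
    }
    changed = []
    for row in current:
        key = (row["language"], row["pageid"])
        # Articles missing from the previous run count as changed too.
        if key not in before or before[key] != row.get("last_modified"):
            changed.append(row["title"])
    return changed
```

Because the timestamps are ISO-8601 strings in a fixed format, plain string comparison is enough to detect a difference here.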

⚙️ How to use it

Click Try for free at the top of the Store listing.
Paste your article titles (or full Wikipedia URLs) into the titles field — one per line.
Set the language code if you want a non-English Wikipedia (e.g. de, fr, ja).
Toggle includeFullText on to get the full plain-text body; toggle includeReferences on to capture the references list.
Click Start. Results stream into the run's dataset in real time.
Export from Storage → Dataset as JSON, CSV, or Excel — or fetch programmatically via the Apify API.

Tip: start with a small batch (5-10 titles) to validate the output shape before scaling to thousands.

📥 Input

Field	Type	Required	Default	Notes
`titles`	`array`	yes	`["Web scraping", "Apify"]`	Wikipedia article titles or full article URLs. Spaces are fine — they get URL-encoded automatically.
`language`	`string`	no	`"en"`	Wikipedia language code, usually the ISO 639-1 code (`en`, `de`, `fr`, `ja`, etc.). Maps to `<language>.wikipedia.org`.
`includeFullText`	`boolean`	no	`true`	Fetch the full article body as plain text (footnotes stripped). One extra API call per article.
`includeReferences`	`boolean`	no	`false`	Fetch the references list via the `references` endpoint.
`concurrency`	`integer`	no	`4`	Parallel article fetches, capped at 16. Keep at 4 or below to stay within polite rate-limit bounds.
`proxyConfiguration`	`object`	no	`{"useApifyProxy": false}`	Proxy settings. We handle retries and session rotation; enable Apify Proxy for extra resilience on large runs.

Example input

{
  "titles": [
    "Web scraping",
    "Natural language processing",
    "Retrieval-augmented generation"
  ],
  "language": "en",
  "includeFullText": true,
  "includeReferences": false,
  "concurrency": 4,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
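If you generate the input from code rather than pasting it into the Console, a small client-side helper along these lines can apply the documented defaults before you submit a run. The helper `build_input` is hypothetical, and clamping `concurrency` to the 1 to 16 range on the client is an assumption; the Actor may validate input differently.

```python
import copy

# Defaults copied from the input table above.
DEFAULTS = {
    "language": "en",
    "includeFullText": True,
    "includeReferences": False,
    "concurrency": 4,
    "proxyConfiguration": {"useApifyProxy": False},
}
MAX_CONCURRENCY = 16  # ceiling mentioned under Limitations


def build_input(titles: list[str], **overrides) -> dict:
    """Merge overrides into the defaults and return a run-input dict."""
    cleaned = [t.strip() for t in titles if t and t.strip()]
    if not cleaned:
        raise ValueError("titles must contain at least one non-empty entry")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    # Deep-copy so callers never mutate the shared defaults.
    run_input = {"titles": cleaned, **copy.deepcopy(DEFAULTS), **overrides}
    run_input["concurrency"] = max(1, min(MAX_CONCURRENCY, int(run_input["concurrency"])))
    return run_input
```

For example, `build_input(["Web scraping"], language="de")` yields the example input above with the language switched to German and a single title.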

📤 Output

Every row is one dataset item, one article.

Field	Type	Notes
`title`	`string`	Canonical Wikipedia title after redirect resolution.
`pageid`	`integer | null`	Wikipedia internal page ID.
`language`	`string`	Wikipedia language code this row was fetched from.
`url`	`string`	Canonical article URL.
`summary`	`string | null`	Lead-section summary, plain text.
`description`	`string | null`	Wikidata short description (one line).
`extract_html`	`string | null`	Lead-section HTML extract.
`fulltext`	`string | null`	Plain-text article body; populated when `includeFullText=true`.
`thumbnail_url`	`string | null`	Lead image thumbnail URL.
`original_image_url`	`string | null`	Lead image full-resolution URL.
`categories`	`array`	Category names the article belongs to.
`references`	`array | null`	References list; populated when `includeReferences=true`.
`last_modified`	`string | null`	Last-edit timestamp, ISO-8601.
`scraped_at`	`string`	Timestamp when this row was recorded, ISO-8601.

Example output

{
  "title": "Web scraping",
  "pageid": 1323566,
  "language": "en",
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "description": "Data extraction from websites",
  "summary": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.",
  "thumbnail_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Web_scraping.png/320px-Web_scraping.png",
  "categories": ["Web scraping", "Data mining", "Internet privacy"],
  "last_modified": "2025-04-12T08:34:21Z",
  "scraped_at": "2026-06-01T10:00:00Z"
}
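On the consuming side, a row like the one above can be loaded into a typed object. The sketch below uses a standard-library dataclass as a stand-in (the Actor itself is described as Pydantic-validated); the class name `ArticleRow` and the choice of fields to keep are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 timestamp such as 2025-04-12T08:34:21Z."""
    if value is None:
        return None
    # Older Python versions reject the trailing "Z", so map it to +00:00.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class ArticleRow:
    title: str
    language: str
    url: str
    scraped_at: datetime
    pageid: Optional[int] = None
    summary: Optional[str] = None
    fulltext: Optional[str] = None
    categories: list[str] = field(default_factory=list)
    last_modified: Optional[datetime] = None

    @classmethod
    def from_item(cls, item: dict) -> "ArticleRow":
        """Build a row from one dataset item; nullable fields may be absent."""
        return cls(
            title=item["title"],
            language=item["language"],
            url=item["url"],
            scraped_at=parse_ts(item["scraped_at"]),
            pageid=item.get("pageid"),
            summary=item.get("summary"),
            fulltext=item.get("fulltext"),
            categories=list(item.get("categories") or []),
            last_modified=parse_ts(item.get("last_modified")),
        )
```

Using `.get()` for nullable fields keeps the loader tolerant of rows where `fulltext` or `references` were not requested.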

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

Event	USD	What triggers it
`actor-start`	$0.005	One-off warm-up charge per run
`result`	$0.002	Each article row written to the dataset

Example: one run of 1,000 articles costs $0.005 + 1,000 × $0.002 = $2.005, or about $2.00. No subscription, no minimum commitment, and no card required to start: every new Apify account comes with $5 of free credit.
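The arithmetic behind the pricing table can be checked with a few lines of Python. The rates are copied from the table; the helper names `estimate_cost` and `max_articles` are illustrative.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.002")       # charged per article row


def estimate_cost(articles: int, runs: int = 1) -> Decimal:
    """Total USD for a number of article rows spread over some runs."""
    return runs * ACTOR_START_USD + articles * RESULT_USD


def max_articles(budget_usd: Decimal, runs: int = 1) -> int:
    """How many article rows fit into a budget after the start fees."""
    remaining = budget_usd - runs * ACTOR_START_USD
    return max(0, int(remaining // RESULT_USD))
```

`Decimal` is used instead of floats so that sums of small amounts such as $0.002 stay exact.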

🚧 Limitations

Current revision only — we pull the live version of each article. Version history and revision diffs are a separate API surface and are out of scope here.
Infobox structured parsing — the Wikipedia REST API does not expose infobox fields as clean key-value JSON. We capture the infobox HTML where present; for deeply structured infobox facts use the Wikidata API alongside this Actor.
Concurrency ceiling — we cap concurrency at 16 to stay within polite bot-policy bounds. Very large batches (50k+ articles) run fine; they just take longer than an aggressive parallelised approach would.
Redirect chains: we follow short redirect chains of up to three hops. Circular redirects or chains longer than three hops are logged and skipped; the article title is still written to the dataset with a null body so you can see what was missed.
Not for real-time monitoring — Apify runs are asynchronous. For live change detection, schedule runs via the Apify Scheduler rather than polling the Actor directly.
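The redirect limitation above amounts to a bounded walk with cycle detection. The following sketch shows the idea against an in-memory redirect map; the function name `resolve_redirects`, the map itself, and the default hop limit are assumptions for illustration, and a real run would discover each hop from the API instead.

```python
from typing import Optional


def resolve_redirects(title: str, redirects: dict[str, str],
                      max_hops: int = 3) -> Optional[str]:
    """Follow redirects from title; return None for loops or long chains."""
    seen = {title}
    current = title
    hops = 0
    while current in redirects:
        if hops == max_hops:
            return None  # chain is longer than the hop limit
        current = redirects[current]
        hops += 1
        if current in seen:
            return None  # circular redirect
        seen.add(current)
    return current
```

A `None` result corresponds to the "logged and skipped" case, where the row would be written with a null body.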

❓ FAQ

Is this legal?

Yes — Wikipedia content is published under the CC BY-SA licence. Provide attribution when you republish or redistribute the content.

Does this work for non-English Wikipedia?

Yes. Set language to the matching Wikipedia language code (the subdomain prefix, such as de or ja). Any language that has its own Wikipedia subdomain (200+) is supported. The multilingual Wikipedia dataset use case is one of the top reasons people reach for this Actor.

How does this differ from the wikipedia Python library?

The wikipedia PyPI package is great for one-off lookups in a script. This Actor is for bulk Wikipedia article download: hundreds or thousands of articles in a single run, output already formatted as a clean dataset, with retries and scheduling handled for you. No local environment setup, no rate-limit babysitting.

Can I use this as a Wikipedia text extraction API?

Yes — use the Apify API to trigger runs programmatically and retrieve results via the dataset API. It's the managed version of rolling your own Wikipedia text extraction pipeline.
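A programmatic run could be sketched with the standard library alone, as below. Treat the details as assumptions to verify against the Apify API reference: the sketch targets the `run-sync-get-dataset-items` endpoint, the Actor ID `username~actor-name` is a placeholder (this listing does not state the real ID), and the function names are illustrative.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str,
                      run_input: dict) -> urllib.request.Request:
    """Prepare a POST that starts a run and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items?format=json"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict,
              timeout: float = 300.0) -> list[dict]:
    """Send the request and decode the JSON array of dataset rows."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would look like `run_actor("username~actor-name", token, {"titles": ["Web scraping"]})`. A synchronous call suits small batches; for thousands of titles it is probably safer to start the run asynchronously and read the dataset once it finishes.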

What is the workflow for bulk-downloading Wikipedia articles?

Add your full list of article titles to the titles field (one per entry), set includeFullText=true, and click Start. The Actor fetches articles in parallel (within polite rate limits) and writes every row to the dataset; you export once at the end. No pagination logic on your side.

Can I just use the Wikipedia REST API directly instead?

The official en.wikipedia.org/api/rest_v1/ is a solid API. This Actor wraps it with bulk input handling, redirect resolution, retry logic, proxy rotation, structured output, and Apify scheduling — all the pieces you would otherwise build yourself around the raw API.

What about article history / revision diffs?

Out of scope — we surface only the current revision. If you need revision history, the MediaWiki Action API's revisions endpoint is the right tool.

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab in the Apify Console — we ship fixes weekly and we read every report.
