Wikipedia Scraper

Extract Wikipedia article text, summary, infobox, references, and categories via the Wikipedia API — one row per article, in any language — export to JSON or CSV. We handle title normalisation, redirects, retries, and rate-limit pacing so your dataset arrives clean.

Developer: DevilScrapes (maintained by Community)


🎯 What this scrapes

Wikipedia is the world's most-cited knowledge base and the go-to seed corpus for RAG pipelines, NLP benchmarks, and knowledge graphs. The official REST API at en.wikipedia.org/api/rest_v1/ is reliable, but it hands you one article at a time — no bulk mode, no redirect-following, no scheduling, no structured output. This Actor takes a list of titles or URLs (in any Wikipedia language), normalises them, follows redirects, retries on transient failures, and writes one clean row per article: summary, plain-text body, infobox data, references, categories, and lead image.

Infobox preservation is the feature most Wikipedia scrapers skip, because parsing infoboxes is genuinely fiddly. Structured facts (birth dates, populations, capitals, taxonomic ranks) are what make a Wikipedia-grounded RAG useful for question answering, not just paragraph retrieval. We keep the infobox HTML wherever the article has one.

Whether you need to bulk-download Wikipedia articles for a RAG corpus, build a multilingual Wikipedia dataset, or run a scheduled refresh on a curated article list, this Actor handles the repetitive plumbing so you can focus on the downstream work.

🔥 Features — what we handle for you

  • 🛡️ Browser fingerprint rotation — curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the upstream sees a real browser client, not a Python script.
  • 🌐 Proxy rotation via Apify Proxy — fresh session and exit IP on every block or throttle response.
  • 🔁 Retries with exponential backoff — up to 5 attempts per article on 408 / 429 / 5xx, with Retry-After headers honoured precisely.
  • 🧱 Rate-limit-aware pacing — when the upstream pushes back we slow down and surface partial progress; we never silently return an empty dataset.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing — you pay only for results that land in your dataset. No data, no charge.
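
The retry behaviour described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's actual code: the names `is_retryable` and `retry_delay`, the 1-second base delay, and the 60-second cap are all hypothetical. Only the 5-attempt limit, the retryable status codes, and the Retry-After handling come from the feature list.

```python
import random
from typing import Optional

MAX_ATTEMPTS = 5            # from the feature list
RETRYABLE = {408, 429}      # plus any 5xx, checked in is_retryable()


def is_retryable(status: int) -> bool:
    """True for 408, 429 and any 5xx status code."""
    return status in RETRYABLE or 500 <= status <= 599


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    A numeric Retry-After header takes priority. Otherwise the delay is
    capped exponential backoff with full jitter. `base` and `cap` are
    assumed values, not documented settings of the Actor.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled here
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
```

A caller would loop up to `MAX_ATTEMPTS` times, sleep for `retry_delay(attempt, response_header)` after each retryable status, and give up on anything else.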

💡 Use cases

  • RAG corpus seeding — bulk-download Wikipedia articles for a domain-specific knowledge base (Renaissance art, medical terminology, legal concepts) and load them directly into LangChain, LlamaIndex, Chroma, or Weaviate.
  • Wikipedia dataset for RAG — build a refreshable article corpus that stays current without downloading the 100 GB monthly XML dump.
  • Multilingual Wikipedia dataset — fetch the same article across 10+ languages, ID-aligned, for cross-lingual evaluation or translation benchmarks.
  • Wikipedia infobox extraction — capture the infobox HTML that carries structured facts (dates, coordinates, taxonomy, population), which most Wikipedia scrapers discard.
  • Definition harvesting — pull the lead sentence for every term in a glossary or ontology.
  • Change monitoring — schedule weekly runs and diff last_modified timestamps to detect article updates.
  • Wikipedia text extraction API — replace ad-hoc wikipedia Python library calls with a managed, scalable pipeline that handles retries and output formatting for you.
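
For the change-monitoring use case, comparing two exported datasets might look like the sketch below. It assumes each run has been exported as a list of dicts carrying the `title`, `language`, and `last_modified` fields from the output schema. The function names are hypothetical.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def changed_articles(previous: Iterable[dict], current: Iterable[dict]) -> List[str]:
    """Return titles that are new or whose last_modified moved forward."""
    seen: Dict[Tuple[str, str], str] = {
        (row["language"], row["title"]): row["last_modified"]
        for row in previous
        if row.get("last_modified")
    }
    changed = []
    for row in current:
        new_ts = row.get("last_modified")
        if new_ts is None:
            continue  # nothing to compare against
        old_ts = seen.get((row["language"], row["title"]))
        if old_ts is None or parse_ts(new_ts) > parse_ts(old_ts):
            changed.append(row["title"])
    return changed
```

Keying on `(language, title)` keeps the same article in two languages from being treated as one row.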

⚙️ How to use it

  1. Click Try for free at the top of the Store listing.
  2. Paste your article titles (or full Wikipedia URLs) into the titles field — one per line.
  3. Set the language code if you want a non-English Wikipedia (e.g. de, fr, ja).
  4. Leave includeFullText on (the default) to get the full plain-text body; turn includeReferences on to capture the references list.
  5. Click Start. Results stream into the run's dataset in real time.
  6. Export from Storage → Dataset as JSON, CSV, or Excel — or fetch programmatically via the Apify API.

Tip: start with a small batch (5-10 titles) to validate the output shape before scaling to thousands.
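
The same steps can be driven from code. The sketch below uses the `apify-client` Python package (`ApifyClient.actor(...).call(...)` and `dataset(...).iterate_items()`). The helper names, the `APIFY_TOKEN` environment variable, and the Actor ID placeholder are assumptions; copy the real Actor ID from the Store page.

```python
import os


def build_run_input(titles, language="en", full_text=True, references=False):
    """Assemble the Actor input using the field names from the Input section."""
    if not titles:
        raise ValueError("titles must contain at least one entry")
    return {
        "titles": list(titles),
        "language": language,
        "includeFullText": full_text,
        "includeReferences": references,
    }


def run_actor(actor_id: str, run_input: dict) -> list:
    """Start a run, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])  # assumed env var name
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (placeholder Actor ID, small batch first as the tip suggests):
# rows = run_actor("<username>/<actor-name>", build_run_input(["Web scraping", "Apify"]))
```

Importing the client inside `run_actor` keeps `build_run_input` usable for validating batches without the package installed.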

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| titles | array | yes | ["Web scraping", "Apify"] | Wikipedia article titles or full article URLs. Spaces are fine — they get URL-encoded automatically. |
| language | string | no | "en" | ISO 639-1 language code (en, de, fr, ja, etc.). Maps to <language>.wikipedia.org. |
| includeFullText | boolean | no | true | Fetch the full article body as plain text (footnotes stripped). One extra API call per article. |
| includeReferences | boolean | no | false | Fetch the references list via the references endpoint. |
| concurrency | integer | no | 4 | Parallel article fetches. Keep at 4 or below to stay within polite rate-limit bounds. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Proxy settings. We handle retries and session rotation; enable Apify Proxy for extra resilience on large runs. |
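
As a rough picture of what accepting "titles or full article URLs" involves, the sketch below normalises either form to a `(language, title)` pair and rebuilds an article URL. It is a simplified illustration, not the Actor's implementation: the function names are hypothetical, and capitalising the first letter is a shortcut that matches how most Wikipedias store titles but not every edge case.

```python
from urllib.parse import quote, unquote, urlparse


def normalise_title(raw: str, default_language: str = "en") -> tuple:
    """Return (language, title) for a bare title or a Wikipedia article URL."""
    raw = raw.strip()
    language = default_language
    if raw.startswith(("http://", "https://")):
        parsed = urlparse(raw)
        host = parsed.hostname or ""
        if not host.endswith(".wikipedia.org") or not parsed.path.startswith("/wiki/"):
            raise ValueError(f"not a Wikipedia article URL: {raw}")
        language = host.split(".")[0]          # "de" from de.wikipedia.org
        raw = unquote(parsed.path[len("/wiki/"):])
    # Underscores and runs of whitespace both collapse to single spaces.
    title = " ".join(raw.replace("_", " ").split())
    if not title:
        raise ValueError("empty title")
    return language, title[0].upper() + title[1:]


def article_url(language: str, title: str) -> str:
    """Build an article URL with the title percent-encoded."""
    return f"https://{language}.wikipedia.org/wiki/{quote(title.replace(' ', '_'), safe='')}"
```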

Example input

{
  "titles": [
    "Web scraping",
    "Natural language processing",
    "Retrieval-augmented generation"
  ],
  "language": "en",
  "includeFullText": true,
  "includeReferences": false,
  "concurrency": 4,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

📤 Output

Every row is one dataset item, one article.

| Field | Type | Notes |
| --- | --- | --- |
| title | string | Canonical Wikipedia title after redirect resolution. |
| pageid | integer \| null | Wikipedia internal page ID. |
| language | string | Wikipedia language code this row was fetched from. |
| url | string | Canonical article URL. |
| summary | string \| null | Lead-section summary, plain text. |
| description | string \| null | Wikidata short description (one line). |
| extract_html | string \| null | Lead-section HTML extract. |
| fulltext | string \| null | Plain-text article body; populated when includeFullText=true. |
| thumbnail_url | string \| null | Lead image thumbnail URL. |
| original_image_url | string \| null | Lead image full-resolution URL. |
| categories | array | Category names the article belongs to. |
| references | array \| null | References list; populated when includeReferences=true. |
| last_modified | string \| null | Last-edit timestamp, ISO-8601. |
| scraped_at | string | Timestamp when this row was recorded, ISO-8601. |
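
On the consuming side, the schema above maps naturally onto a typed record. The sketch below uses a standard-library dataclass as a stand-in; the class name `ArticleRow` and the `from_item` helper are hypothetical, and the Actor's own Pydantic models may differ.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


def _ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 string (trailing 'Z' allowed); pass None through."""
    return None if value is None else datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class ArticleRow:
    title: str
    language: str
    url: str
    scraped_at: datetime
    pageid: Optional[int] = None
    summary: Optional[str] = None
    description: Optional[str] = None
    extract_html: Optional[str] = None
    fulltext: Optional[str] = None
    thumbnail_url: Optional[str] = None
    original_image_url: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    references: Optional[list] = None
    last_modified: Optional[datetime] = None

    @classmethod
    def from_item(cls, item: dict) -> "ArticleRow":
        """Build a row from one dataset item; absent nullable fields become None."""
        data = dict(item)
        data["scraped_at"] = _ts(data["scraped_at"])
        data["last_modified"] = _ts(data.get("last_modified"))
        return cls(**data)  # raises TypeError on fields outside the schema
```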

Example output

{
  "title": "Web scraping",
  "pageid": 1323566,
  "language": "en",
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "description": "Data extraction from websites",
  "summary": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.",
  "thumbnail_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Web_scraping.png/320px-Web_scraping.png",
  "categories": ["Web scraping", "Data mining", "Internet privacy"],
  "last_modified": "2025-04-12T08:34:21Z",
  "scraped_at": "2026-06-01T10:00:00Z"
}

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What triggers it |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.002 | Each article row written to the dataset |

Example: 1,000 articles in a single run cost $0.005 + 1,000 × $0.002 = $2.005, or roughly $2.00. No subscription, no minimum commitment, no card required to start. Every new Apify account comes with $5 of free credit.
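
The arithmetic behind that example can be reproduced with a few lines. The two rates come from the table above. The function names are hypothetical, and the estimate ignores any platform charges not listed in this section.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")   # charged once per run
RESULT = Decimal("0.002")        # charged per article row


def estimate_cost(articles: int, runs: int = 1) -> Decimal:
    """Total USD for `articles` rows spread over `runs` runs."""
    return runs * ACTOR_START + articles * RESULT


def rows_within_budget(budget_usd: str) -> int:
    """Largest number of rows a single run can write without exceeding the budget."""
    remaining = Decimal(budget_usd) - ACTOR_START
    return int(remaining // RESULT) if remaining > 0 else 0
```

`Decimal` avoids the rounding noise that binary floats would introduce at these small per-event amounts.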

🚧 Limitations

  • Current revision only — we pull the live version of each article. Version history and revision diffs are a separate API surface and are out of scope here.
  • Infobox structured parsing — the Wikipedia REST API does not expose infobox fields as clean key-value JSON. We capture the infobox HTML where present; for deeply structured infobox facts use the Wikidata API alongside this Actor.
  • Concurrency ceiling — we cap concurrency at 16 to stay within polite bot-policy bounds. Very large batches (50k+ articles) run fine; they just take longer than an aggressive parallelised approach would.
  • Redirect chains — we follow redirect chains of up to three hops. Circular redirects and longer chains are logged and skipped; the requested title is still written to the dataset with a null body so you can see what was missed.
  • Not for real-time monitoring — Apify runs are asynchronous. For periodic change detection, schedule runs via the Apify Scheduler rather than polling the Actor directly.
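
The redirect-chain limitation can be pictured as a bounded walk with cycle detection. In the sketch below the redirect table is a plain dict standing in for API lookups, the function name is hypothetical, and the hop limit is a parameter whose default of 3 mirrors the "longer than three hops" wording above.

```python
from typing import Dict, Optional


def resolve_redirects(title: str, redirects: Dict[str, str],
                      max_hops: int = 3) -> Optional[str]:
    """Follow `redirects` from `title`; return the final title, or None if skipped."""
    seen = {title}
    for _ in range(max_hops):
        target = redirects.get(title)
        if target is None:
            return title      # not a redirect, so this is the article
        if target in seen:
            return None       # circular chain: skip
        seen.add(target)
        title = target
    # Hop budget used up; accept only if we landed on a real article.
    return title if title not in redirects else None
```

A `None` result corresponds to the row that is written with a null body.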

❓ FAQ

Is this legal?

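Yes. Wikipedia text is published under the CC BY-SA licence (version 4.0 at the time of writing). Provide attribution when you republish or redistribute the content, and release adaptations under the same licence. Images carry their own licences, so check each file before reusing it.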

Does this work for non-English Wikipedia?

Yes. Set language to the code used in the Wikipedia subdomain, which is usually the ISO 639-1 code (de, fr, ja). Any language that has its own Wikipedia subdomain (200+) is supported. The multilingual Wikipedia dataset use case is one of the top reasons people reach for this Actor.

How does this differ from the wikipedia Python library?

The wikipedia PyPI package is great for one-off lookups in a script. This Actor is for bulk Wikipedia article download: hundreds or thousands of articles in a single run, output already formatted as a clean dataset, with retries and scheduling handled for you. No local environment setup, no rate-limit babysitting.

Can I use this as a Wikipedia text extraction API?

Yes — use the Apify API to trigger runs programmatically and retrieve results via the dataset API. It's the managed version of rolling your own Wikipedia text extraction pipeline.

What is the workflow for bulk-downloading Wikipedia articles?

Paste your full list of article titles into the titles field (one per entry), set includeFullText=true, and click Start. The Actor fetches articles in parallel (within polite rate limits), writes every row to the dataset, and you export once at the end. No pagination logic on your side.
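
"In parallel, within polite rate limits" usually means a fixed ceiling on in-flight requests. The sketch below shows that pattern with an `asyncio.Semaphore`. The `fetch` callable is supplied by the caller, and the `error` key on failed rows is a hypothetical field that is not part of the Actor's output schema.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def fetch_all(titles: Iterable[str],
                    fetch: Callable[[str], Awaitable[dict]],
                    concurrency: int = 4) -> List[dict]:
    """Fetch every title with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(title: str) -> dict:
        async with semaphore:
            try:
                return await fetch(title)
            except Exception as exc:
                # Keep a row for failures so nothing disappears silently.
                return {"title": title, "error": str(exc)}

    # gather() returns results in the same order as the input titles.
    return await asyncio.gather(*(guarded(t) for t in titles))
```

Run it with `asyncio.run(fetch_all(titles, my_fetch))`, where `my_fetch` is whatever coroutine performs the actual request.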

Can I just call the Wikipedia REST API directly instead?

The official en.wikipedia.org/api/rest_v1/ is a solid API. This Actor wraps it with bulk input handling, redirect resolution, retry logic, proxy rotation, structured output, and Apify scheduling — all the pieces you would otherwise build yourself around the raw API.

What about article history / revision diffs?

Out of scope. We surface only the current revision. If you need revision history, the MediaWiki Action API is the right tool (action=query with prop=revisions).
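
For readers who do need history, a request to the MediaWiki Action API can be assembled as below. The parameters (`action=query`, `prop=revisions`, `rvprop`, `rvlimit`) are standard Action API parameters. The helper name and the chosen `rvprop` fields are illustrative, and this is separate from anything the Actor does.

```python
from urllib.parse import urlencode


def revisions_url(title: str, language: str = "en", limit: int = 10) -> str:
    """Build an Action API URL listing the most recent revisions of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return f"https://{language}.wikipedia.org/w/api.php?{urlencode(params)}"
```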

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab in the Apify Console — we ship fixes weekly and we read every report.