Wikipedia Article Scraper
Pricing
Pay per event
Extract Wikipedia article text, summary, references, and categories: one row per article, in any language. We handle title normalisation, redirects, retries, and rate-limit pacing so your dataset arrives clean.
Developer
DevilScrapes
Maintained by Community
🎯 What this scrapes
Wikipedia is one of the most widely cited open datasets, and the REST API at en.wikipedia.org/api/rest_v1/ is the source most reliable pipelines pull from. This Actor takes a list of titles or URLs (in any Wikipedia language), normalises them, follows redirects, retries on transient errors, and writes one row per article: summary, plain-text body, references list, categories, and lead image. Requests are paced so we stay a polite citizen on the upstream, and your dataset stays consistent across re-runs.
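As a rough illustration of the title normalisation step, here is a minimal sketch that turns a title or article URL into a REST v1 page-summary URL. The function names `normalise_title` and `summary_url` are hypothetical and the Actor's real logic may differ; the sketch also assumes the language comes from the `language` input rather than from the URL's host.

```python
from urllib.parse import quote, unquote, urlsplit


def normalise_title(title_or_url: str) -> str:
    """Turn a title or /wiki/ URL into an underscore-separated page title."""
    text = title_or_url.strip()
    if text.startswith(("http://", "https://")):
        path = urlsplit(text).path
        # Article URLs look like /wiki/Web_scraping; keep what follows /wiki/.
        text = unquote(path.split("/wiki/", 1)[-1])
    # Collapse runs of whitespace into single underscores.
    return "_".join(text.split())


def summary_url(title_or_url: str, language: str = "en") -> str:
    """Build the REST v1 page-summary URL for one article (assumed endpoint choice)."""
    # safe="" also escapes "/" so titles such as "AC/DC" stay one path segment.
    title = quote(normalise_title(title_or_url), safe="")
    return f"https://{language}.wikipedia.org/api/rest_v1/page/summary/{title}"


print(summary_url("Web scraping"))
print(summary_url("https://en.wikipedia.org/wiki/Web_scraping", language="en"))
```

Both calls produce the same URL, which is the point of normalising before fetching: duplicate inputs collapse to one request.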
🔥 What we handle for you
- 🛡️ Browser fingerprint rotation: `curl-cffi` impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
- 🌐 Residential proxy rotation via Apify Proxy (when enabled in `proxyConfiguration`): fresh session and exit IP on every block.
- 🔁 Retries with exponential backoff on `408 / 429 / 5xx`: up to 5 attempts per page, `Retry-After` honoured.
- 🧱 Rate-limit-aware pacing: when the target pushes back, we slow down instead of getting banned.
- 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
- 💰 Pay-Per-Event pricing: you only pay for results that hit your dataset. No data, no charge.
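The retry policy in the list above can be sketched roughly as follows. Only the status codes, the 5-attempt limit, and the `Retry-After` handling come from the description; the base delay, the 60-second cap, the jitter, and the function names are assumptions made for illustration.

```python
import random
from typing import Optional

RETRYABLE = {408, 429} | set(range(500, 600))  # 408 / 429 / 5xx
MAX_ATTEMPTS = 5


def should_retry(status: int, attempt: int) -> bool:
    """attempt is 1-based; give up once MAX_ATTEMPTS tries have been made."""
    return status in RETRYABLE and attempt < MAX_ATTEMPTS


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before the next attempt."""
    if retry_after is not None and retry_after.strip().isdigit():
        # Retry-After given as delta-seconds; the HTTP-date form is ignored here.
        return min(float(retry_after.strip()), cap)
    # Exponential backoff: base, 2*base, 4*base, ... limited by cap.
    backoff = min(base * 2 ** (attempt - 1), cap)
    # Assumed full jitter, so parallel workers do not retry in lockstep.
    return random.uniform(0, backoff)
```

A caller would check `should_retry(status, attempt)` after each response and sleep for `retry_delay(attempt, response_headers.get("Retry-After"))` before trying again.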
💡 Use cases
- Knowledge-base seeding — extract structured summaries for a known list of entities and load into a RAG vector store.
- Definition harvesting — pull the lead sentence for every term in your glossary.
- Multilingual analysis — fetch the same article in 5 languages to compare framing.
- Change monitoring: schedule daily runs and diff `last_modified` to detect updates.
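For the change-monitoring use case, a minimal sketch of the diff step might look like this. It assumes you have already loaded the dataset items from two runs as lists of dicts; `changed_articles` is a hypothetical helper, not part of the Actor.

```python
def changed_articles(previous: list, current: list) -> list:
    """Return titles that are new or whose last_modified differs from the previous run."""
    seen = {(row["language"], row["title"]): row.get("last_modified") for row in previous}
    changed = []
    for row in current:
        key = (row["language"], row["title"])
        if key not in seen or seen[key] != row.get("last_modified"):
            changed.append(row["title"])
    return changed


yesterday = [
    {"language": "en", "title": "Apify", "last_modified": "2024-01-01T00:00:00Z"},
    {"language": "en", "title": "Web scraping", "last_modified": "2024-01-01T00:00:00Z"},
]
today = [
    {"language": "en", "title": "Apify", "last_modified": "2024-01-01T00:00:00Z"},
    {"language": "en", "title": "Web scraping", "last_modified": "2024-02-01T00:00:00Z"},
]
print(changed_articles(yesterday, today))  # ['Web scraping']
```

Keying on the `(language, title)` pair keeps the same article in two languages from being compared against each other.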
⚙️ How to use it
- Click Try for free at the top of the page.
- Fill in the input form — most fields have sensible defaults.
- Click Start. Output streams into the run's dataset.
- Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.
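If you prefer to fetch results over the API rather than export them by hand, the sketch below shows one way to do it with the standard library only. It assumes the Apify API v2 dataset-items endpoint and a bearer token; the helper names are hypothetical, and the dataset ID is whatever your run reports as its default dataset.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the dataset-items URL; fmt can be e.g. "json", "csv", or "xlsx"."""
    query = urlencode({"format": fmt, "clean": "true"})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


def fetch_items(dataset_id: str, token: str) -> list:
    """Download all items of a dataset as parsed JSON (performs a network call)."""
    request = Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

Call `fetch_items("<your dataset ID>", "<your API token>")` once the run has finished; for large datasets you would likely add paging rather than download everything in one request.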
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `titles` | array | yes | `["Web scraping", "Apify"]` | List of Wikipedia article titles (e.g. `Apify`, `Web scraping`) or full article URLs. Spaces are f… |
| `language` | string | no | `"en"` | ISO 639-1 language code (e.g. `en`, `de`, `fr`, `ja`). Maps to the matching… |
| `includeFullText` | boolean | no | `true` | When true, fetch the article body and convert to plain text (footnotes stripped). Costs one extra API call per article. |
| `includeReferences` | boolean | no | `false` | When true, fetch the references via the references endpoint. |
| `concurrency` | integer | no | `4` | Parallel API requests. |
| `proxyConfiguration` | object | no | `{"useApifyProxy": false}` | Wikipedia is generous with public clients. Proxy optional. |
Example input
    {
      "titles": ["Web scraping"],
      "language": "en",
      "includeFullText": false,
      "includeReferences": false,
      "concurrency": 2,
      "proxyConfiguration": {"useApifyProxy": false}
    }
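When generating inputs from a script, a small builder that applies the documented defaults can help catch typos in field names early. This is only a sketch: `build_input` is a hypothetical helper, and rejecting an empty `titles` list is an assumption about how the Actor validates its input.

```python
import json

# Defaults copied from the input table above.
DEFAULTS = {
    "language": "en",
    "includeFullText": True,
    "includeReferences": False,
    "concurrency": 4,
    "proxyConfiguration": {"useApifyProxy": False},
}


def build_input(titles, **overrides) -> dict:
    """Merge caller overrides over the documented defaults."""
    if not titles:
        raise ValueError("titles is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    return {"titles": list(titles), **DEFAULTS, **overrides}


print(json.dumps(build_input(["Web scraping"], concurrency=2), indent=2))
```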
📤 Output
Every row is one dataset item.
| Field | Type | Notes |
|---|---|---|
| `title` | string | Canonical Wikipedia title (after redirects). |
| `pageid` | integer or null | Wikipedia page ID. |
| `language` | string | Wikipedia language code the row was fetched from. |
| `url` | string | Canonical article URL. |
| `summary` | string or null | Lead-section summary (plain text). |
| `description` | string or null | Wikidata short description (1-line). |
| `extract_html` | string or null | Lead-section HTML extract. |
| `fulltext` | string or null | Plain-text article body, when `includeFullText=true`. |
| `thumbnail_url` | string or null | Lead image thumbnail URL. |
| `original_image_url` | string or null | Lead image full-resolution URL. |
| `categories` | array | Category names the page is in. |
| `references` | array or null | References list, when `includeReferences=true`. |
| `last_modified` | string or null | Last edit timestamp (ISO-8601). |
| `scraped_at` | string | When this row was recorded. |
Example output
    {
      "title": "Web scraping",
      "language": "en",
      "url": "https://en.wikipedia.org/wiki/Web_scraping",
      "description": "Data extraction from websites",
      "summary": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. ..."
    }
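On the consuming side, one way to work with these rows is to load them into a small typed record. The dataclass below is a sketch that mirrors the output table; `ArticleRow` and `from_item` are hypothetical names, and `scraped_at` is treated as optional only because the abbreviated example above omits it.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ArticleRow:
    title: str
    language: str
    url: str
    pageid: Optional[int] = None
    summary: Optional[str] = None
    description: Optional[str] = None
    extract_html: Optional[str] = None
    fulltext: Optional[str] = None
    thumbnail_url: Optional[str] = None
    original_image_url: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    references: Optional[list] = None
    last_modified: Optional[str] = None
    scraped_at: Optional[str] = None

    @classmethod
    def from_item(cls, item: dict) -> "ArticleRow":
        """Build a row from a dataset item, ignoring keys this sketch does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in item.items() if k in known})


row = ArticleRow.from_item({
    "title": "Web scraping",
    "language": "en",
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "description": "Data extraction from websites",
})
print(row.title, row.fulltext)  # Web scraping None
```

Dropping unknown keys keeps the loader from breaking if new fields are added to the dataset later.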
💰 Pricing
Pay-Per-Event — you pay only when these events fire:
| Event | USD | What it is |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.002 | Per dataset item |
Example: a single run that returns 1,000 results costs 1,000 × $0.002 + $0.005 = $2.005, or roughly $2.00. No subscription, no minimum, no card to start. Apify gives every new account $5 of free credit.
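To budget a larger job, the arithmetic can be wrapped in a few lines. This sketch uses the two event prices from the table above; `estimate_cost` is a hypothetical helper, and it assumes every run fires exactly one `actor-start` event.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.002")       # charged per dataset item


def estimate_cost(results: int, runs: int = 1) -> Decimal:
    """Estimated USD cost for a number of results spread over a number of runs."""
    return runs * ACTOR_START_USD + results * RESULT_USD


print(estimate_cost(1000))           # 2.005
print(estimate_cost(30000, runs=30)) # 60.150
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly.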
🚧 Limitations
We pull the current version of each article — no version history, no diff. Infobox structured data isn't parsed (the REST API doesn't expose it cleanly); use the Wikidata API for structured facts.
❓ FAQ
Is this legal?
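Yes. Wikipedia text is published under the CC BY-SA license, so reuse is allowed with attribution and share-alike. Images can carry their own licenses, so check them before republishing.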
Does this work for non-English Wikipedia?
Yes — set the language field to the matching ISO code. Languages with their own Wikipedia (200+) are supported.
Can I get the full HTML?
Not from this Actor. Set `includeFullText=true` to get the plain-text body; `extract_html` covers the lead section only. For full HTML, use a Crawlee-based generic crawler.
What about article history?
Out of scope here — we surface only the current revision. Revision history is a different API surface.
💬 Your feedback
Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.
