Wikipedia Article Scraper

Pricing

Pay per event

Extract Wikipedia article text, summary, references, and categories, one row per article, in any language. We handle title normalisation, redirects, retries, and rate-limit pacing so your dataset arrives clean.

Rating: 0.0 (0)

Developer: DevilScrapes

Maintained by Community

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user

Last modified: 9 days ago

🎯 What this scrapes

Wikipedia is one of the most widely cited reference sources, and its REST API at en.wikipedia.org/api/rest_v1/ is a common source for data pipelines. This Actor takes a list of titles or URLs (in any Wikipedia language), normalises them, follows redirects, retries on transient errors, and writes one row per article: summary, optional plain-text body, optional references list, categories, and lead image URLs. Requests are paced to stay polite to the upstream API, so your dataset stays consistent across re-runs.
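The normalisation step can be pictured with a minimal sketch. It assumes the summary comes from the REST API's page/summary endpoint; the Actor's real normalisation rules are not documented here, and the function names normalise_title and summary_url are hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def normalise_title(title_or_url: str) -> str:
    """Turn a title or a /wiki/ article URL into an underscore-separated title."""
    text = title_or_url.strip()
    if text.startswith(("http://", "https://")):
        path = urlsplit(text).path  # e.g. /wiki/Web_scraping
        # Keep only what follows /wiki/ and undo percent-encoding.
        text = unquote(path.split("/wiki/", 1)[-1])
    # Wikipedia treats spaces and underscores in titles as equivalent.
    return "_".join(text.split())


def summary_url(title_or_url: str, language: str = "en") -> str:
    """Build a REST summary URL for one article (assumed endpoint choice)."""
    # safe="" also encodes "/", so a title like AC/DC stays a single path segment.
    title = quote(normalise_title(title_or_url), safe="")
    return f"https://{language}.wikipedia.org/api/rest_v1/page/summary/{title}"
```

Redirects are left to the API in this sketch: the response carries the canonical title, which is what ends up in the title field.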

🔥 What we handle for you

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
  • 🌐 Residential proxy rotation via Apify Proxy: fresh session and exit IP on every block, when proxyConfiguration enables the proxy (it is off by default).
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx: up to 5 attempts per page, Retry-After honoured.
  • 🧱 Rate-limit-aware pacing: when the target pushes back, we slow down instead of getting banned.
  • 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing: you pay per result that hits your dataset, plus a small start charge per run.
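The retry policy in the list above can be sketched as follows. This is an illustration under stated assumptions, not the Actor's code: the 1-second base delay, the 60-second cap, and the names backoff_delay and fetch_with_retries are hypothetical. Only the status codes, the 5-attempt limit, and the Retry-After behaviour come from the description.

```python
import time
from typing import Callable, Optional, Tuple

RETRYABLE = {408, 429}
MAX_ATTEMPTS = 5  # "up to 5 attempts per page"

# (status code, Retry-After in seconds if the server sent one, body)
Response = Tuple[int, Optional[float], str]


def is_retryable(status: int) -> bool:
    return status in RETRYABLE or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        # The server's own hint wins over the computed delay.
        return max(0.0, retry_after)
    # Exponential growth: base, 2*base, 4*base, ... up to the cap.
    return min(cap, base * 2 ** (attempt - 1))


def fetch_with_retries(fetch: Callable[[], Response],
                       sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `fetch` until it succeeds, fails permanently, or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, retry_after, body = fetch()
        if not is_retryable(status) or attempt == MAX_ATTEMPTS:
            return status, retry_after, body
        sleep(backoff_delay(attempt, retry_after))
    raise AssertionError("unreachable: the loop always returns")
```

Passing sleep as a parameter keeps the policy testable without real waiting; production code would usually add random jitter to the computed delay as well.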

💡 Use cases

  • Knowledge-base seeding — extract structured summaries for a known list of entities and load into a RAG vector store.
  • Definition harvesting — pull the lead sentence for every term in your glossary.
  • Multilingual analysis — fetch the same article in 5 languages to compare framing.
  • Change monitoring — schedule daily runs and diff last_modified to detect updates.
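For the change-monitoring use case, a minimal sketch of the diff step might look like this. It assumes you have loaded two runs' dataset items as lists of dicts; the function name changed_titles is hypothetical, and the field names follow the output table.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def changed_titles(previous: Iterable[dict], current: Iterable[dict]) -> List[str]:
    """Titles that are new, or whose last_modified differs from the previous run."""
    # Key on (language, title) so the same title in two languages is kept apart.
    seen: Dict[Tuple[str, str], Optional[str]] = {
        (row["language"], row["title"]): row.get("last_modified")
        for row in previous
    }
    changed = []
    for row in current:
        key = (row["language"], row["title"])
        # Timestamps are compared as strings for equality only, so no parsing is needed.
        if key not in seen or seen[key] != row.get("last_modified"):
            changed.append(row["title"])
    return changed
```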

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Fill in the input form — most fields have sensible defaults.
  3. Click Start. Output streams into the run's dataset.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.
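For step 4, fetching via the API can be sketched with the standard library alone. This assumes the Apify API v2 dataset items endpoint and bearer-token authentication; the dataset ID and token are placeholders you copy from your own run and account, and the function names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen


def dataset_items_url(dataset_id: str, fmt: str = "json",
                      limit: Optional[int] = None) -> str:
    """Build the items URL for one dataset (fmt can also be e.g. "csv")."""
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = str(limit)
    return (f"https://api.apify.com/v2/datasets/{quote(dataset_id, safe='')}"
            f"/items?{urlencode(params)}")


def fetch_items(dataset_id: str, token: str) -> list:
    """Download all items of a dataset as a list of dicts (performs a network call)."""
    request = Request(dataset_items_url(dataset_id),
                      headers={"Authorization": f"Bearer {token}"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

The dataset ID is shown on the run's Storage tab; the official apify-client package wraps the same endpoint if you prefer a client library.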

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| titles | array | yes | ["Web scraping", "Apify"] | List of Wikipedia article titles (e.g. Apify, Web scraping) or full article URLs. Spaces are f… |
| language | string | no | "en" | ISO 639-1 language code (e.g. en, de, fr, ja). Maps to the matching … |
| includeFullText | boolean | no | true | When true, fetch the article body and convert to plain text (footnotes stripped). Costs one extra API call per article. |
| includeReferences | boolean | no | false | When true, fetch the references via the references endpoint. |
| concurrency | integer | no | 4 | Parallel API requests. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Wikipedia is generous with public clients. Proxy optional. |

Example input

{
  "titles": [
    "Web scraping"
  ],
  "language": "en",
  "includeFullText": false,
  "includeReferences": false,
  "concurrency": 2,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
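If you build the input in code rather than in the form, a small helper can apply the defaults from the input table. This is a sketch only: build_input is a hypothetical name, and the validation shown is an assumption about what is sensible, not the Actor's own schema check.

```python
import copy

# Defaults copied from the input table above.
DEFAULTS = {
    "language": "en",
    "includeFullText": True,
    "includeReferences": False,
    "concurrency": 4,
    "proxyConfiguration": {"useApifyProxy": False},
}


def build_input(titles, **overrides) -> dict:
    """Return a run input dict: required titles plus defaults, with overrides applied."""
    cleaned = [t.strip() for t in titles if t and t.strip()]
    if not cleaned:
        raise ValueError("titles is required and must contain at least one entry")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    # deepcopy so callers cannot mutate the shared proxyConfiguration default.
    run_input = {"titles": cleaned, **copy.deepcopy(DEFAULTS), **overrides}
    if not isinstance(run_input["concurrency"], int) or run_input["concurrency"] < 1:
        raise ValueError("concurrency must be a positive integer")
    return run_input
```

Calling build_input(["Web scraping"], includeFullText=False, concurrency=2) yields the same fields and values as the example input.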

📤 Output

Every row is one dataset item.

| Field | Type | Notes |
| --- | --- | --- |
| title | string | Canonical Wikipedia title (after redirects). |
| pageid | integer or null | Wikipedia page ID. |
| language | string | Wikipedia language code the row was fetched from. |
| url | string | Canonical article URL. |
| summary | string or null | Lead-section summary (plain text). |
| description | string or null | Wikidata short description (1-line). |
| extract_html | string or null | Lead-section HTML extract. |
| fulltext | string or null | Plain-text article body, when includeFullText=true. |
| thumbnail_url | string or null | Lead image thumbnail URL. |
| original_image_url | string or null | Lead image full-resolution URL. |
| categories | array | Category names the page is in. |
| references | array or null | References list, when includeReferences=true. |
| last_modified | string or null | Last edit timestamp (ISO-8601). |
| scraped_at | string | When this row was recorded. |

Example output

{
  "title": "Web scraping",
  "language": "en",
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "description": "Data extraction from websites",
  "summary": "Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. ..."
}
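On the consuming side, a typed wrapper for the rows can make downstream code safer. The following is a minimal sketch using a dataclass that mirrors the output table; the class name ArticleRow and its methods are hypothetical. Fields that the abbreviated example above omits are treated as optional here, which is an assumption of the sketch.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class ArticleRow:
    title: str
    language: str
    url: str
    pageid: Optional[int] = None
    summary: Optional[str] = None
    description: Optional[str] = None
    extract_html: Optional[str] = None
    fulltext: Optional[str] = None
    thumbnail_url: Optional[str] = None
    original_image_url: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    references: Optional[list] = None
    last_modified: Optional[str] = None
    scraped_at: Optional[str] = None

    @classmethod
    def from_item(cls, item: dict) -> "ArticleRow":
        """Build a row from a dataset item, ignoring fields this class does not know."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in item.items() if k in known})

    def modified_at(self) -> Optional[datetime]:
        """Parse last_modified; a trailing Z is rewritten for older Python versions."""
        if self.last_modified is None:
            return None
        return datetime.fromisoformat(self.last_modified.replace("Z", "+00:00"))
```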

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.002 | Per dataset item |

Example: 1,000 results in one run cost 1,000 × $0.002 + $0.005 = $2.005, or about $2.00. There is no subscription, no minimum, and no card needed to start: Apify gives every new account $5 of free credit.
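The same arithmetic as a small estimator, for budgeting scheduled runs. The rates are the ones in the pricing table; the function name estimate_cost is hypothetical, and the sketch assumes one actor-start charge per run.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.002")       # charged per dataset item


def estimate_cost(results: int, runs: int = 1) -> Decimal:
    """Estimated USD cost for `results` dataset items spread over `runs` runs."""
    if results < 0 or runs < 0:
        raise ValueError("results and runs must be non-negative")
    # Decimal avoids binary floating-point rounding in currency sums.
    return runs * ACTOR_START_USD + results * RESULT_USD
```

For example, estimate_cost(1000) gives Decimal("2.005"), and 30 daily runs of 100 results each come to estimate_cost(3000, runs=30), which is Decimal("6.150").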

🚧 Limitations

We pull the current version of each article — no version history, no diff. Infobox structured data isn't parsed (the REST API doesn't expose it cleanly); use the Wikidata API for structured facts.

❓ FAQ

Is this legal?

Yes. Wikipedia text is licensed under CC BY-SA, which requires attribution and share-alike when you republish it.

Does this work for non-English Wikipedia?

Yes. Set the language field to the matching language code. Any language with its own Wikipedia edition (200+) is supported.

Can I get the full HTML?

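Not from this Actor. The extract_html field holds HTML for the lead section only, and includeFullText=true returns the article body as plain text. For the full article HTML, use a Crawlee-based generic crawler.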

What about article history?

Out of scope here — we surface only the current revision. Revision history is a different API surface.

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.