Full Wikipedia Scraper

Pricing

Pay per event

Developed by

Lucas Bertocchini

Maintained by Community

This Wikipedia API scrapes and sorts all content from an article, including text, images, links, references, headers, tables, lists, and more. All content is sorted by content type into JSON for easy use.

Last modified

a day ago

🚀 What does Full Wikipedia Scraper do?

This all-purpose Wikipedia API quickly extracts the full content of a Wikipedia article and sorts it into JSON by content type. Full Wikipedia Scraper can scrape any article for:

  • Full sorted article content
    • Sections and Headings
    • Full text (Paragraphs, Quotes, Notes, etc.)
    • Images and Videos
    • Tables and Lists
    • Links and References
    • Infobox, Navbox, and Sidebar
  • Languages
  • Wiki categories
  • Time of last edit

📚 Why use Full Wikipedia Scraper?

Example uses of this scraper include:

  • 🤖 Machine learning datasets — Train NLP models with clean encyclopedic data.
  • 📰 Fact-checking & journalism — Automate retrieval of source content.
  • 🔍 SEO & content analysis — Analyze keyword usage and content structure.
  • 🎓 Academic research — Gather structured information for citations or topic studies.
  • 🧠 Knowledge graph building — Enrich linked data with Wikipedia’s structured info.

The output of Full Wikipedia Scraper is extremely:

  • 💪 Robust
    • All math displayed on pages is embedded in the text in MathJax notation. It can be rendered in HTML by including MathJax in the page.
    • Many easily confused special characters in the text are replaced with more common characters.
  • 🎯 Specific
    • The position in text and data of all links and references is scraped.
    • Most of the original formatting of the article could be recreated from its output.
  • ⚙️ Consistent
    • Wikipedia's formatting varies widely from article to article, which makes scraping tedious.
    • This scraper reads these inconsistent formats and reliably produces a consistent, predictable output (see the output schema below).
  • ⚡ Fast
    • All of this is done in just a couple of seconds!
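
As a rough illustration of the MathJax point above, the sketch below wraps scraped paragraph text in a small HTML page that loads MathJax from a CDN. The helper names `escapeHtml` and `renderMathPage` are hypothetical and not part of the Actor, and the `\( ... \)` delimiters in the sample text are an assumption; check which delimiters your output actually contains.

```typescript
// Hypothetical helper: escapes characters that are special in HTML.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: builds a standalone HTML page that loads MathJax 3
// so that TeX-style math embedded in the paragraphs is typeset in the browser.
function renderMathPage(title: string, paragraphs: string[]): string {
  const body = paragraphs.map((p) => `<p>${escapeHtml(p)}</p>`).join("\n");
  return [
    "<!DOCTYPE html>",
    "<html>",
    "<head>",
    '<meta charset="utf-8">',
    `<title>${escapeHtml(title)}</title>`,
    '<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>',
    "</head>",
    `<body>\n${body}\n</body>`,
    "</html>",
  ].join("\n");
}

// Sample text with an assumed inline-math delimiter.
const mathPage = renderMathPage("Demo", ["Energy: \\(E = mc^2\\)"]);
console.log(mathPage);
```

Writing the returned string to an .html file and opening it in a browser should typeset the formula, provided the delimiters match MathJax's configuration.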

📝 Input

Input consists of a language code (default: English) and a list of articles, each given as the part of the link after /wiki/. Tip: for each article, check that a page really exists at https://{language}.wikipedia.org/wiki/{article}. Example input:

{
  "language": "en",
  "articles": ["Chocolate", "Vanilla"]
}
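
To follow the tip above, it can help to print the URL each input entry corresponds to before starting a run. The sketch below does that; `ScraperInput` and `buildArticleUrls` are hypothetical names used for illustration only, and the space-to-underscore handling is an assumption about how you may have written the titles.

```typescript
// Shape of the Actor input shown above.
interface ScraperInput {
  language: string;
  articles: string[];
}

// Hypothetical helper (not part of the Actor): builds the page URL for each article.
function buildArticleUrls(input: ScraperInput): string[] {
  return input.articles.map((article) => {
    // Wikipedia page links use underscores in place of spaces.
    const slug = article.trim().replace(/ /g, "_");
    return `https://${input.language}.wikipedia.org/wiki/${encodeURIComponent(slug)}`;
  });
}

const urls = buildArticleUrls({ language: "en", articles: ["Chocolate", "Vanilla"] });
console.log(urls);
```

Opening each printed URL in a browser is a quick way to confirm that the page exists.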

📤 Output

The Actor will output one object per article, each with the following properties:

  • title
  • description
  • sections
  • numLanguages
  • languages
  • categories
  • lastmod

The main page content is contained in the "sections" property. This is an array of objects, each with the properties "heading" (string, omitted when a section has no heading) and "content". "content" is an array of objects, each with a "type" property:

{
  "sections": [
    {
      "heading": "First section heading",
      "content": [{ "type": "...", "text": "..." }, "..."]
    },
    { "heading": "Second section heading", "content": ["..."] },
    "..."
  ]
}
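
A minimal sketch of walking this structure is shown below: it groups paragraph text by section heading. The reduced `Section` and `ContentBlock` types cover only the fields used here, and the "(lead)" label for a section without a heading is an assumption made for illustration.

```typescript
// Reduced types: only the fields this sketch needs.
interface ContentBlock {
  type: string;
  text?: string;
}
interface Section {
  heading?: string;
  content: ContentBlock[];
}

// Hypothetical helper: collects the text of all "paragraph" blocks per section.
function paragraphsBySection(sections: Section[]): Map<string, string[]> {
  const result = new Map<string, string[]>();
  for (const section of sections) {
    // Sections without a heading are grouped under an assumed "(lead)" label.
    const key = section.heading ?? "(lead)";
    const texts = section.content
      .filter((block) => block.type === "paragraph" && typeof block.text === "string")
      .map((block) => block.text as string);
    result.set(key, [...(result.get(key) ?? []), ...texts]);
  }
  return result;
}

const grouped = paragraphsBySection([
  { content: [{ type: "note", text: "See also." }, { type: "paragraph", text: "Intro." }] },
  { heading: "History", content: [{ type: "paragraph", text: "A." }, { type: "paragraph", text: "B." }] },
]);
console.log(grouped);
```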

Content types can be one of the following:

  • paragraph
  • quote
  • note
  • list
  • heading
  • gallery
  • image
  • video
  • table
  • refs
  • infobox
  • navbox
  • sidebar

These are the different kinds of content found on Wikipedia. The images below show an example of each, and the appendix of this README contains the full TypeScript schema.
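
Because every content object carries a "type" value from the list above, a quick way to get a feel for an article is to count blocks per type. The sketch below assumes only that list; `countByType` and the reduced types are hypothetical names, not part of the Actor.

```typescript
// The content types listed above.
type ContentType =
  | "paragraph" | "quote" | "note" | "list" | "heading" | "gallery" | "image"
  | "video" | "table" | "refs" | "infobox" | "navbox" | "sidebar";

// Reduced types: only the fields this sketch needs.
interface TypedBlock {
  type: ContentType;
}
interface TypedSection {
  heading?: string;
  content: TypedBlock[];
}

// Hypothetical helper: counts how many blocks of each type an article contains.
function countByType(sections: TypedSection[]): Partial<Record<ContentType, number>> {
  const counts: Partial<Record<ContentType, number>> = {};
  for (const section of sections) {
    for (const block of section.content) {
      counts[block.type] = (counts[block.type] ?? 0) + 1;
    }
  }
  return counts;
}

const typeCounts = countByType([
  { content: [{ type: "note" }, { type: "paragraph" }] },
  { heading: "History", content: [{ type: "paragraph" }, { type: "table" }] },
]);
console.log(typeCounts);
```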

🔍 Example output (simplified with "..."):

{
  "title": "Chocolate",
  "description": "Food produced from cacao seeds",
  "sections": [
    {
      "content": [
        {
          "type": "note",
          "text": "For other uses, see Chocolate (disambiguation).",
          "links": ["..."]
        },
        "...",
        {
          "type": "paragraph",
          "text": "Chocolate is a food made from roasted and ground cocoa beans...",
          "links": [
            {
              "text": "cocoa beans",
              "title": "Cocoa bean",
              "href": "https://en.wikipedia.org/wiki/Cocoa_bean",
              "pos": 49
            },
            "..."
          ]
        }
      ]
    },
    "..."
  ],
  "numLanguages": 154,
  "languages": [
    {
      "autonym": "Afrikaans",
      "localName": "Afrikaans",
      "title": "Sjokolade",
      "lang": "af",
      "href": "https://af.wikipedia.org/wiki/Sjokolade"
    },
    "..."
  ],
  "categories": [
    {
      "text": "Chocolate",
      "links": ["..."]
    },
    "..."
  ],
  "lastmod": "2025-07-22T22:18:00.000Z"
}
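
In the example above, "pos": 49 matches the zero-based character offset at which "cocoa beans" starts in the paragraph text. Assuming that holds in general (an inference from this one example, not a documented guarantee), links can be re-inserted into the text. The sketch below renders a paragraph as Markdown; `toMarkdown` is a hypothetical helper, and it assumes links do not overlap.

```typescript
interface Link {
  text: string;
  title?: string;
  href: string;
  pos: number;
}

// Hypothetical helper: re-inserts links into plain text as Markdown links.
function toMarkdown(text: string, links: Link[]): string {
  // Work from the last link to the first so earlier offsets stay valid.
  const sorted = [...links].sort((a, b) => b.pos - a.pos);
  let out = text;
  for (const link of sorted) {
    const end = link.pos + link.text.length;
    // Skip links whose offset does not line up with the text (assumption violated).
    if (out.slice(link.pos, end) !== link.text) continue;
    out = out.slice(0, link.pos) + `[${link.text}](${link.href})` + out.slice(end);
  }
  return out;
}

const markdown = toMarkdown(
  "Chocolate is a food made from roasted and ground cocoa beans...",
  [{ text: "cocoa beans", title: "Cocoa bean", href: "https://en.wikipedia.org/wiki/Cocoa_bean", pos: 49 }],
);
console.log(markdown);
```

The same right-to-left splicing approach should carry over to reference markers, which also have a "pos" field.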

🖼️ Section Types

[Images: labeled Wikipedia screenshots showing an example of each content type]

This scraper only collects publicly available Wikipedia content. Wikipedia content is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0). When using the data, you must comply with the license terms.

Our scrapers are ethical and do not extract any private data. They only extract publicly available content. We therefore believe that our scrapers, when used for ethical purposes by Apify users, are safe. However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. You can also read our blog post on the legality of web scraping.

📄 Appendix: Full Output TypeScript Schema

export interface Output {
  title: string
  description: string
  numLanguages: number
  categories: {
    text: string
    links: {
      text: string
      title?: string | undefined
      href: string
      pos: number
    }[]
  }[]
  lastmod: string
  sections: {
    heading?: string | undefined
    content: (
      | {
          type: "paragraph"
          text: string
          links: {
            text: string
            title?: string | undefined
            href: string
            pos: number
          }[]
          refs: {
            num: string
            pos: number
          }[]
        }
      | {
          type: "quote"
          isBoxed: boolean
          text: string
          links: {
            text: string
            title?: string | undefined
            href: string
            pos: number
          }[]
          refs: {
            num: string
            pos: number
          }[]
        }
      | {
          type: "note"
          text: string
          links: {
            text: string
            title?: string | undefined
            href: string
            pos: number
          }[]
        }
      | {
          type: "list"
          items: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
          }[]
        }
      | {
          type: "heading"
          text: string
          level: number
        }
      | {
          type: "gallery"
          caption?:
            | {
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | undefined
          items: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            src?: string | undefined
          }[]
        }
      | {
          type: "image"
          caption: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            refs: {
              num: string
              pos: number
            }[]
          }
          src?: string | undefined
          href?: string | undefined
          side: "left" | "right" | "center"
        }
      | {
          type: "video"
          caption: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            refs: {
              num: string
              pos: number
            }[]
          }
          src?: string | undefined
          href?: string | undefined
          side: "left" | "right" | "center"
        }
      | {
          type: "table"
          caption?:
            | {
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                refs: {
                  num: string
                  pos: number
                }[]
              }
            | undefined
          rows: {
            type: "data" | "heading"
            text: string
            cols?: number | undefined
            rows?: number | undefined
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            refs: {
              num: string
              pos: number
            }[]
            color?: string | undefined
          }[][]
          side?: ("left" | "right") | undefined
          isBoxed: boolean
        }
      | {
          type: "refs"
          refs: {
            [x: string]:
              | {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }
              | undefined
          }
        }
      | {
          type: "infobox"
          content: (
            | {
                type: "heading"
                text: string
              }
            | {
                type: "title"
                text: string
              }
            | {
                type: "subtitle"
                text: string
              }
            | {
                type: "image"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                src: string
              }
            | {
                type: "fullrow"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                refs: {
                  num: string
                  pos: number
                }[]
              }
            | {
                type: "row"
                left: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }
                right: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                  refs: {
                    num: string
                    pos: number
                  }[]
                }
              }
          )[]
        }
      | {
          type: "navbox"
          content: (
            | {
                type: "label"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                level: number
              }
            | {
                type: "title"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | {
                type: "items"
                items: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }[]
              }
          )[]
        }
      | {
          type: "sidebar"
          content: (
            | {
                type: "pretitle"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | {
                type: "image"
                src: string
              }
            | {
                type: "items"
                items: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }[]
              }
            | {
                type: "title"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | {
                type: "heading"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
          )[]
        }
    )[]
  }[]
  languages: {
    autonym?: string | undefined
    localName?: string | undefined
    title?: string | undefined
    lang?: string | undefined
    href?: string | undefined
  }[]
}
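
As one example of consuming this schema, the sketch below flattens a "table" block into tab-separated text. The reduced `TableCell` and `TableBlock` types list only the fields used here, `tableToTsv` is a hypothetical helper, and the sketch deliberately ignores the "cols" and "rows" fields, so cells that span several columns or rows are not expanded.

```typescript
// Reduced types: only the table fields this sketch needs.
interface TableCell {
  type: "data" | "heading";
  text: string;
  cols?: number;
  rows?: number;
}
interface TableBlock {
  type: "table";
  rows: TableCell[][];
}

// Hypothetical helper: converts a table block into tab-separated text.
function tableToTsv(table: TableBlock): string {
  return table.rows
    .map((row) =>
      row
        // Tabs and newlines inside a cell would break the TSV layout.
        .map((cell) => cell.text.replace(/[\t\n]+/g, " "))
        .join("\t"),
    )
    .join("\n");
}

const tsv = tableToTsv({
  type: "table",
  rows: [
    [{ type: "heading", text: "Year" }, { type: "heading", text: "Output" }],
    [{ type: "data", text: "2020" }, { type: "data", text: "5.8\nMt" }],
  ],
});
console.log(tsv);
```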

📚 Documentation reference

To learn more about Apify and Actors, take a look at the following resources: