Full Wikipedia Scraper

Under maintenance

This Wikipedia API scrapes and sorts all content from an article, including text, images, links, references, headers, tables, lists, and more. All content is sorted by content type into clean JSON.

Pricing: Pay per event
Developer: Lucas Bertocchini (maintained by Community)
Last modified: 3 months ago

🚀 What does Full Wikipedia Scraper do?

This all-purpose Wikipedia API lets you quickly extract the content of a Wikipedia article, sorted into JSON by content type. Full Wikipedia Scraper can scrape any article for:

- Full sorted article content
  - Sections and headings
  - Full text (paragraphs, quotes, notes, etc.)
  - Images and videos
  - Tables and lists
  - Links and references
  - Infobox, navbox, and sidebar
- Languages
- Wiki categories
- Time of last edit

📚 Why use Full Wikipedia Scraper?

Example uses of this scraper include:

🤖 Machine learning datasets — Train NLP models with clean encyclopedic data.
📰 Fact-checking & journalism — Automate retrieval of source content.
🔍 SEO & content analysis — Analyze keyword usage and content structure.
🎓 Academic research — Gather structured information for citations or topic studies.
🧠 Knowledge graph building — Enrich linked data with Wikipedia’s structured info.

The output of Full Wikipedia Scraper is:

💪 Robust
- All math displayed on pages is embedded in the text in MathJax notation, so it can be rendered in HTML by including MathJax on the page.
- Many easily confused special characters in the text are replaced with more common characters.
🎯 Specific
- The position of every link and reference within its text is scraped.
- Most of the original formatting of the article can be recreated from the output.
⚙️ Consistent
- Wikipedia's formatting varies from article to article, which makes scraping tedious.
- This scraper reads all of these formats and reliably outputs a consistent, predictable format (output schema shown below).
⚡ Fast
- All of this is done in just a couple of seconds!

📝 Input

Input consists of a language code (default: English) and a list of articles, each given as the part of the article URL after /wiki/. Tip: For each article, make sure that a page really exists at https://{language}.wikipedia.org/wiki/{article}. For example:

{
  "language": "en",
  "articles": ["Chocolate", "Vanilla"]
}
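If you start from full article URLs rather than article names, the input can be derived from them. The following is a minimal sketch, not part of the Actor itself: the names ScraperInput, articleUrl, articleFromUrl, and buildInput are hypothetical helpers, and the only thing taken from this README is the input shape and the URL pattern https://{language}.wikipedia.org/wiki/{article}.

```typescript
// Hypothetical helper types and functions; none of these names come from the Actor.
interface ScraperInput {
  language: string
  articles: string[]
}

// Build the page URL that an article name is expected to resolve to.
function articleUrl(language: string, article: string): string {
  return `https://${language}.wikipedia.org/wiki/${article}`
}

// Split a full Wikipedia URL into language code and article name.
// Returns null when the URL does not match the expected pattern.
function articleFromUrl(url: string): { language: string; article: string } | null {
  const match = /^https?:\/\/([a-z0-9-]+)\.wikipedia\.org\/wiki\/([^?#]+)/i.exec(url)
  if (match === null) return null
  const language = match[1]
  const article = match[2]
  if (language === undefined || article === undefined) return null
  return { language, article }
}

// Keep only URLs for the requested language and collect their article names.
function buildInput(language: string, urls: string[]): ScraperInput {
  const articles: string[] = []
  for (const url of urls) {
    const parsed = articleFromUrl(url)
    if (parsed !== null && parsed.language === language) {
      articles.push(parsed.article)
    }
  }
  return { language, articles }
}

const input = buildInput("en", [
  "https://en.wikipedia.org/wiki/Chocolate",
  "https://en.wikipedia.org/wiki/Vanilla",
])
console.log(JSON.stringify(input))
```

This sketch only checks the URL shape. It does not confirm that the page exists, so the tip above still applies.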

📤 Output

The Actor will output one object per article, each with the following properties:

title
description
sections
numLanguages
languages
categories
lastmod

The main page content is contained in the "sections" property. This is an array of objects, each with the properties "heading" (string, absent for sections without a heading) and "content". "content" is an array of objects, each with a "type" property:

{
  "sections": [
    {
      "heading": "First section heading",
      "content": [{ "type": "...", "text": "..." }, "..."]
    },
    { "heading": "Second section heading", "content": ["..."] },
    "..."
  ]
}
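As an illustration of walking this structure, the sketch below collects the paragraph text of each section. It declares only the fields it reads, and the function name paragraphsBySection and the "(lead)" label for sections without a heading are assumptions made for this example.

```typescript
// Minimal local types: only the fields this sketch reads are declared.
interface ContentItem {
  type: string
  text?: string
}

interface Section {
  heading?: string
  content: ContentItem[]
}

interface SectionText {
  heading: string
  paragraphs: string[]
}

// Collect the paragraph text of every section, labelled by its heading.
// "(lead)" is a hypothetical label for sections that have no heading.
function paragraphsBySection(sections: Section[]): SectionText[] {
  return sections.map((section) => ({
    heading: section.heading ?? "(lead)",
    paragraphs: section.content
      .filter((item) => item.type === "paragraph" && typeof item.text === "string")
      .map((item) => item.text ?? ""),
  }))
}
```

Other content types, such as lists and tables, carry their text in different properties, so they would need their own handling.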

Content types can be one of the following:

paragraph
quote
note
list
heading
gallery
image
video
table
refs
infobox
navbox
sidebar

These are the different kinds of content on Wikipedia. See the images below for an example of each, and the appendix of this README for the full TypeScript schema.

🔍 Example output (simplified with "..."):

{
  "title": "Chocolate",
  "description": "Food produced from cacao seeds",
  "sections": [
    {
      "content": [
        {
          "type": "note",
          "text": "For other uses, see Chocolate (disambiguation).",
          "links": ["..."]
        },
        "...",
        {
          "type": "paragraph",
          "text": "Chocolate is a food made from roasted and ground cocoa beans...",
          "links": [
            {
              "text": "cocoa beans",
              "title": "Cocoa bean",
              "href": "https://en.wikipedia.org/wiki/Cocoa_bean",
              "pos": 49
            },
            "..."
          ]
        }
      ]
    },
    "..."
  ],
  "numLanguages": 154,
  "languages": [
    {
      "autonym": "Afrikaans",
      "localName": "Afrikaans",
      "title": "Sjokolade",
      "lang": "af",
      "href": "https://af.wikipedia.org/wiki/Sjokolade"
    },
    "..."
  ],
  "categories": [
    {
      "text": "Chocolate",
      "links": ["..."]
    },
    "..."
  ],
  "lastmod": "2025-07-22T22:18:00.000Z"
}
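In the example above, the link "cocoa beans" has "pos": 49, which matches the zero-based character offset of that phrase in the paragraph text. Assuming "pos" always has this meaning, and that it counts the same units as JavaScript string indices, links can be written back into the text. The function linkify below is a hypothetical helper that does this in Markdown syntax.

```typescript
// Local copy of the link shape from the output schema.
interface Link {
  text: string
  title?: string
  href: string
  pos: number
}

// Re-insert links into plain text as Markdown, using each link's "pos" offset.
function linkify(text: string, links: Link[]): string {
  // Work from the last link backwards so earlier offsets stay valid.
  const sorted = [...links].sort((a, b) => b.pos - a.pos)
  let out = text
  for (const link of sorted) {
    const end = link.pos + link.text.length
    // Skip any link whose offset does not line up with its text.
    if (out.slice(link.pos, end) !== link.text) continue
    out = out.slice(0, link.pos) + `[${link.text}](${link.href})` + out.slice(end)
  }
  return out
}

const paragraph = "Chocolate is a food made from roasted and ground cocoa beans..."
console.log(
  linkify(paragraph, [
    {
      text: "cocoa beans",
      title: "Cocoa bean",
      href: "https://en.wikipedia.org/wiki/Cocoa_bean",
      pos: 49,
    },
  ]),
)
```

The check before each insertion is defensive: if the offset assumption does not hold for some text, the link is left out instead of corrupting the output.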

🖼️ Section Types

⚖️ Legal and ethical use

This scraper only collects publicly available Wikipedia content. Wikipedia content is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0). When using the data, you must comply with the license terms.

Our scrapers are ethical and do not extract any private data. They only extract publicly available content. We therefore believe that our scrapers, when used for ethical purposes by Apify users, are safe. However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. You can also read our blog post on the legality of web scraping.

📄 Appendix: Full Output TypeScript Schema

export interface Output {
  title: string
  description: string
  numLanguages: number
  categories: {
    text: string
    links: {
      text: string
      title?: string | undefined
      href: string
      pos: number
    }[]
  }[]
  lastmod: string
  sections: {
    heading?: string | undefined
    content: (
      | {
          type: "paragraph"
          text: string
          links: {
            text: string
            title?: string | undefined
            href: string
            pos: number
          }[]
          refs: {
            num: string
            pos: number
          }[]
        }
      | {
          type: "quote"
          isBoxed: boolean
          text: string
          links: {
            text: string
            title?: string | undefined
            href: string
            pos: number
          }[]
          refs: {
            num: string
            pos: number
          }[]
        }
      | {
          type: "note"
          text: string
          links: {
            text: string
            title?: string | undefined
            href: string
            pos: number
          }[]
        }
      | {
          type: "list"
          items: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
          }[]
        }
      | {
          type: "heading"
          text: string
          level: number
        }
      | {
          type: "gallery"
          caption?:
            | {
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | undefined
          items: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            src?: string | undefined
          }[]
        }
      | {
          type: "image"
          caption: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            refs: {
              num: string
              pos: number
            }[]
          }
          src?: string | undefined
          href?: string | undefined
          side: "left" | "right" | "center"
        }
      | {
          type: "video"
          caption: {
            text: string
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            refs: {
              num: string
              pos: number
            }[]
          }
          src?: string | undefined
          href?: string | undefined
          side: "left" | "right" | "center"
        }
      | {
          type: "table"
          caption?:
            | {
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                refs: {
                  num: string
                  pos: number
                }[]
              }
            | undefined
          rows: {
            type: "data" | "heading"
            text: string
            cols?: number | undefined
            rows?: number | undefined
            links: {
              text: string
              title?: string | undefined
              href: string
              pos: number
            }[]
            refs: {
              num: string
              pos: number
            }[]
            color?: string | undefined
          }[][]
          side?: ("left" | "right") | undefined
          isBoxed: boolean
        }
      | {
          type: "refs"
          refs: {
            [x: string]:
              | {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }
              | undefined
          }
        }
      | {
          type: "infobox"
          content: (
            | {
                type: "heading"
                text: string
              }
            | {
                type: "title"
                text: string
              }
            | {
                type: "subtitle"
                text: string
              }
            | {
                type: "image"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                src: string
              }
            | {
                type: "fullrow"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                refs: {
                  num: string
                  pos: number
                }[]
              }
            | {
                type: "row"
                left: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }
                right: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                  refs: {
                    num: string
                    pos: number
                  }[]
                }
              }
          )[]
        }
      | {
          type: "navbox"
          content: (
            | {
                type: "label"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
                level: number
              }
            | {
                type: "title"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | {
                type: "items"
                items: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }[]
              }
          )[]
        }
      | {
          type: "sidebar"
          content: (
            | {
                type: "pretitle"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | {
                type: "image"
                src: string
              }
            | {
                type: "items"
                items: {
                  text: string
                  links: {
                    text: string
                    title?: string | undefined
                    href: string
                    pos: number
                  }[]
                }[]
              }
            | {
                type: "title"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
            | {
                type: "heading"
                text: string
                links: {
                  text: string
                  title?: string | undefined
                  href: string
                  pos: number
                }[]
              }
          )[]
        }
    )[]
  }[]
  languages: {
    autonym?: string | undefined
    localName?: string | undefined
    title?: string | undefined
    lang?: string | undefined
    href?: string | undefined
  }[]
}
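Because every content object carries a "type" discriminator, a consumer can narrow on it before reading type-specific fields. The sketch below flattens tables into tab-separated text. It is one possible approach under simplifying assumptions: the local types declare only the fields used, the names isTable and tablesToTsv are hypothetical, and the "cols" and "rows" span properties of cells are ignored.

```typescript
// Trimmed-down local types; the full schema above has more fields.
interface AnyContent {
  type: string
  rows?: unknown
}

interface TableCell {
  type: "data" | "heading"
  text: string
}

interface TableContent {
  type: "table"
  rows: TableCell[][]
}

// Type guard: narrows a generic content object to a table.
function isTable(item: AnyContent): item is TableContent {
  return item.type === "table" && Array.isArray(item.rows)
}

// Convert each table into tab-separated text, one line per table row.
// Tabs and line breaks inside cells are replaced with spaces.
function tablesToTsv(content: AnyContent[]): string[] {
  return content.filter(isTable).map((table) =>
    table.rows
      .map((row) => row.map((cell) => cell.text.replace(/[\t\r\n]+/g, " ")).join("\t"))
      .join("\n"),
  )
}
```

Merged cells would need the span properties to be expanded first, which this sketch leaves out.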
