
Full Wikipedia Scraper
Pricing
Pay per event

This Wikipedia API scrapes and sorts all content from an article, including text, images, links, references, headings, tables, lists, and more. All content is sorted by content type into JSON for easy use.
Last modified
a day ago
🚀 What does Full Wikipedia Scraper do?
This all-purpose Wikipedia API lets you quickly extract the data contained in a Wikipedia article. All article content is sorted into JSON by content type. Full Wikipedia Scraper can scrape any article for:
- Full sorted article content
- Sections and Headings
- Full text (Paragraphs, Quotes, Notes, etc.)
- Images and Videos
- Tables and Lists
- Links and References
- Infobox, Navbox, and Sidebar
- Languages
- Wiki categories
- Time of last edit
📚 Why use Full Wikipedia Scraper?
Example uses of this scraper include:
- 🤖 Machine learning datasets — Train NLP models with clean encyclopedic data.
- 📰 Fact-checking & journalism — Automate retrieval of source content.
- 🔍 SEO & content analysis — Analyze keyword usage and content structure.
- 🎓 Academic research — Gather structured information for citations or topic studies.
- 🧠 Knowledge graph building — Enrich linked data with Wikipedia’s structured info.
The output of Full Wikipedia Scraper is:
- 💪 Robust
  - All math displayed on pages is embedded in the text in MathJax notation. It can be displayed and formatted in HTML by including MathJax in the page.
  - Many easily confused special characters in the text are replaced with more common characters.
- 🎯 Specific
  - The position of every link and reference within the text and data is recorded.
  - Most of the original formatting of the article can be recreated from the output.
- ⚙️ Consistent
  - Wikipedia’s formatting varies widely from article to article, which makes scraping tedious.
  - This scraper reads these inconsistent formats and reliably produces a consistent, predictable format (output schema shown below).
- ⚡ Fast
  - All of this is done in just a couple of seconds!
📝 Input
Input consists of the language code (default: English) and the articles (the part of the link after /wiki/). Tip: For each article, make sure that there really is a page at https://{language}.wikipedia.org/wiki/{article}! For example:
```json
{
  "language": "en",
  "articles": ["Chocolate", "Vanilla"]
}
```
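If you start from full article links rather than article names, the input can be derived from them. The following is a minimal sketch, not part of the actor: `ScraperInput` and `buildInput` are hypothetical names, and it assumes the actor accepts the article name exactly as it appears after /wiki/ in the link (still URL-encoded), which the README does not state explicitly.

```typescript
// Hypothetical helper for preparing actor input; not part of the actor itself.
interface ScraperInput {
  language: string;
  articles: string[];
}

// Convert links of the form https://{language}.wikipedia.org/wiki/{article}
// into the input shape shown above. All links must share one language edition.
function buildInput(urls: string[]): ScraperInput {
  let language: string | undefined;
  const articles: string[] = [];

  for (const raw of urls) {
    const url = new URL(raw);
    const hostMatch = /^([a-z-]+)\.wikipedia\.org$/.exec(url.hostname);
    const article = url.pathname.startsWith("/wiki/")
      ? url.pathname.slice("/wiki/".length)
      : "";
    if (!hostMatch || article === "") {
      throw new Error(`Not a Wikipedia article link: ${raw}`);
    }
    if (language !== undefined && language !== hostMatch[1]) {
      throw new Error(`Mixed languages: ${language} and ${hostMatch[1]}`);
    }
    language = hostMatch[1];
    // Assumption: the article name is passed through as written in the link.
    articles.push(article);
  }

  // Fall back to the documented default language when no links were given.
  return { language: language ?? "en", articles };
}
```

Calling `buildInput` with the links for Chocolate and Vanilla on en.wikipedia.org yields the same object as the example input above.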
📤 Output
The actor will output one object per article, each with the following properties:
- title
- description
- sections
- numLanguages
- languages
- categories
- lastmod
The main page content is contained in the "sections" property. This is an array of objects, each with the properties "heading" (optional string) and "content". "content" is an array of objects, each with a "type" property:
```json
{
  "sections": [
    {
      "heading": "First section heading",
      "content": [{ "type": "...", "text": "..." }, "..."]
    },
    { "heading": "Second section heading", "content": ["..."] },
    "..."
  ]
}
```
Content types can be one of the following:
- paragraph
- quote
- note
- list
- heading
- gallery
- image
- video
- table
- refs
- infobox
- navbox
- sidebar
These are the different kinds of content on Wikipedia. See the images below for an example of each, and the appendix of this README for the full TypeScript schema.
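Because every content object carries a "type", consumers can filter on it. The sketch below collects the paragraph text of each section. The `ContentItem` and `Section` types are a deliberately reduced subset of the output schema, `paragraphsBySection` is a hypothetical helper name, and the "(lead)" label for a section without a heading is an assumption made for illustration.

```typescript
// Reduced subset of the output schema: only the fields this sketch reads.
type ContentItem = { type: string; text?: string };

interface Section {
  heading?: string;
  content: ContentItem[];
}

// Group plain paragraph text by section heading.
function paragraphsBySection(
  sections: Section[],
): { heading: string; paragraphs: string[] }[] {
  return sections.map((section) => ({
    // Assumption: a section without a heading is the article's lead section.
    heading: section.heading ?? "(lead)",
    paragraphs: section.content
      .filter((item) => item.type === "paragraph" && typeof item.text === "string")
      .map((item) => item.text as string),
  }));
}
```

The same filter-by-type pattern applies to the other content types, for example picking out only "table" or "image" objects.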
🔍 Example output (simplified with "..."):
```json
{
  "title": "Chocolate",
  "description": "Food produced from cacao seeds",
  "sections": [
    {
      "content": [
        {
          "type": "note",
          "text": "For other uses, see Chocolate (disambiguation).",
          "links": ["..."]
        },
        "...",
        {
          "type": "paragraph",
          "text": "Chocolate is a food made from roasted and ground cocoa beans...",
          "links": [
            {
              "text": "cocoa beans",
              "title": "Cocoa bean",
              "href": "https://en.wikipedia.org/wiki/Cocoa_bean",
              "pos": 49
            },
            "..."
          ]
        }
      ]
    },
    "..."
  ],
  "numLanguages": 154,
  "languages": [
    {
      "autonym": "Afrikaans",
      "localName": "Afrikaans",
      "title": "Sjokolade",
      "lang": "af",
      "href": "https://af.wikipedia.org/wiki/Sjokolade"
    },
    "..."
  ],
  "categories": [{ "text": "Chocolate", "links": ["..."] }, "..."],
  "lastmod": "2025-07-22T22:18:00.000Z"
}
```
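The "pos" value on each link makes it possible to put links back into the text. In the example above, "cocoa beans" begins at character 49 of the paragraph, so this sketch assumes "pos" is a zero-based character offset into "text". That reading matches the example but is not stated in the README, and offsets may differ for text containing characters outside the Basic Multilingual Plane. `Link`, `escapeHtml`, and `linkify` are hypothetical names.

```typescript
// Reduced link shape taken from the output schema.
interface Link {
  text: string;
  title?: string;
  href: string;
  pos: number;
}

function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Rebuild an HTML fragment by wrapping each linked span in an <a> tag.
// Assumption: link.pos is a zero-based offset of link.text within text.
function linkify(text: string, links: Link[]): string {
  const sorted = [...links].sort((a, b) => a.pos - b.pos);
  let out = "";
  let cursor = 0;

  for (const link of sorted) {
    const end = link.pos + link.text.length;
    // Skip links that overlap an earlier one or do not match the text at pos.
    if (link.pos < cursor || text.slice(link.pos, end) !== link.text) continue;
    out += escapeHtml(text.slice(cursor, link.pos));
    out += `<a href="${escapeHtml(link.href)}">${escapeHtml(link.text)}</a>`;
    cursor = end;
  }

  return out + escapeHtml(text.slice(cursor));
}
```

Links whose recorded position does not line up with the text are left as plain text rather than guessed at, so a wrong assumption about "pos" degrades to unlinked output instead of corrupting it.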
🖼️ Section Types
[Images showing an example of each content type]
⚖️ Legal and ethical use
This scraper only collects publicly available Wikipedia content. Wikipedia content is available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0). When using the data, you must comply with the license terms.
Our scrapers are ethical and do not extract any private data. They only extract publicly available content. We therefore believe that our scrapers, when used for ethical purposes by Apify users, are safe. However, you should be aware that your results could contain personal data. Personal data is protected by the GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. You can also read our blog post on the legality of web scraping.
📄 Appendix: Full Output TypeScript Schema
```typescript
export interface Output {
  title: string
  description: string
  numLanguages: number
  categories: {
    text: string
    links: { text: string; title?: string | undefined; href: string; pos: number }[]
  }[]
  lastmod: string
  sections: {
    heading?: string | undefined
    content: (
      | {
          type: "paragraph"
          text: string
          links: { text: string; title?: string | undefined; href: string; pos: number }[]
          refs: { num: string; pos: number }[]
        }
      | {
          type: "quote"
          isBoxed: boolean
          text: string
          links: { text: string; title?: string | undefined; href: string; pos: number }[]
          refs: { num: string; pos: number }[]
        }
      | {
          type: "note"
          text: string
          links: { text: string; title?: string | undefined; href: string; pos: number }[]
        }
      | {
          type: "list"
          items: {
            text: string
            links: { text: string; title?: string | undefined; href: string; pos: number }[]
          }[]
        }
      | {
          type: "heading"
          text: string
          level: number
        }
      | {
          type: "gallery"
          caption?:
            | {
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
              }
            | undefined
          items: {
            text: string
            links: { text: string; title?: string | undefined; href: string; pos: number }[]
            src?: string | undefined
          }[]
        }
      | {
          type: "image"
          caption: {
            text: string
            links: { text: string; title?: string | undefined; href: string; pos: number }[]
            refs: { num: string; pos: number }[]
          }
          src?: string | undefined
          href?: string | undefined
          side: "left" | "right" | "center"
        }
      | {
          type: "video"
          caption: {
            text: string
            links: { text: string; title?: string | undefined; href: string; pos: number }[]
            refs: { num: string; pos: number }[]
          }
          src?: string | undefined
          href?: string | undefined
          side: "left" | "right" | "center"
        }
      | {
          type: "table"
          caption?:
            | {
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
                refs: { num: string; pos: number }[]
              }
            | undefined
          rows: {
            type: "data" | "heading"
            text: string
            cols?: number | undefined
            rows?: number | undefined
            links: { text: string; title?: string | undefined; href: string; pos: number }[]
            refs: { num: string; pos: number }[]
            color?: string | undefined
          }[][]
          side?: ("left" | "right") | undefined
          isBoxed: boolean
        }
      | {
          type: "refs"
          refs: {
            [x: string]:
              | {
                  text: string
                  links: { text: string; title?: string | undefined; href: string; pos: number }[]
                }
              | undefined
          }
        }
      | {
          type: "infobox"
          content: (
            | { type: "heading"; text: string }
            | { type: "title"; text: string }
            | { type: "subtitle"; text: string }
            | {
                type: "image"
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
                src: string
              }
            | {
                type: "fullrow"
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
                refs: { num: string; pos: number }[]
              }
            | {
                type: "row"
                left: {
                  text: string
                  links: { text: string; title?: string | undefined; href: string; pos: number }[]
                }
                right: {
                  text: string
                  links: { text: string; title?: string | undefined; href: string; pos: number }[]
                  refs: { num: string; pos: number }[]
                }
              }
          )[]
        }
      | {
          type: "navbox"
          content: (
            | {
                type: "label"
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
                level: number
              }
            | {
                type: "title"
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
              }
            | {
                type: "items"
                items: {
                  text: string
                  links: { text: string; title?: string | undefined; href: string; pos: number }[]
                }[]
              }
          )[]
        }
      | {
          type: "sidebar"
          content: (
            | {
                type: "pretitle"
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
              }
            | { type: "image"; src: string }
            | {
                type: "items"
                items: {
                  text: string
                  links: { text: string; title?: string | undefined; href: string; pos: number }[]
                }[]
              }
            | {
                type: "title"
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
              }
            | {
                type: "heading"
                text: string
                links: { text: string; title?: string | undefined; href: string; pos: number }[]
              }
          )[]
        }
    )[]
  }[]
  languages: {
    autonym?: string | undefined
    localName?: string | undefined
    title?: string | undefined
    lang?: string | undefined
    href?: string | undefined
  }[]
}
```
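As one way to consume the "table" content type, the sketch below expands a table into a rectangular grid of cell text. It rests on an assumption the README does not confirm: that the optional "cols" and "rows" fields on a cell are its column span and row span. `Cell`, `TableContent`, and `tableToGrid` are hypothetical names, and the types are a reduced subset of the schema above.

```typescript
// Reduced subset of the "table" content type from the schema.
interface Cell {
  type: "data" | "heading";
  text: string;
  cols?: number; // Assumption: column span of the cell
  rows?: number; // Assumption: row span of the cell
}

interface TableContent {
  type: "table";
  rows: Cell[][];
}

// Expand spanning cells so every grid slot holds the text of the cell covering it.
function tableToGrid(table: TableContent): string[][] {
  const grid: string[][] = [];
  const ensureRow = (r: number): string[] => {
    while (grid.length <= r) grid.push([]);
    return grid[r];
  };

  for (let r = 0; r < table.rows.length; r++) {
    const current = ensureRow(r);
    let c = 0;
    for (const cell of table.rows[r]) {
      // Skip slots already filled by a cell spanning down from an earlier row.
      while (current[c] !== undefined) c++;
      const colSpan = cell.cols ?? 1;
      const rowSpan = cell.rows ?? 1;
      for (let dr = 0; dr < rowSpan; dr++) {
        const target = ensureRow(r + dr);
        for (let dc = 0; dc < colSpan; dc++) target[c + dc] = cell.text;
      }
      c += colSpan;
    }
  }
  return grid;
}
```

Repeating the text of a spanning cell in every slot it covers keeps the grid rectangular, which is convenient for export to CSV or a data frame. If the span assumption is wrong for your data, drop the span handling and map each row to its cell text directly.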
📚 Documentation reference
To learn more about Apify and Actors, take a look at the following resources: