Wikipedia Page Dataset Scraper 📚

Scrape Wikipedia pages and extract structured article content including title, summary, full text, headings, infobox data, categories, references, and internal links. Designed for AI training, RAG pipelines, knowledge base creation, research datasets, and content analysis.

Features

  • Scrapes article content from Wikipedia pages.
  • Extracts structured fields: page_title, page_url, summary, full_text, headings, infobox, categories, references, internal_links, last_updated, and scraped_at.
  • Supports multiple start URLs and optional crawling of linked Wikipedia articles.
  • Uses Playwright and Apify actor conventions for reliable dataset export.

Getting Started

  1. Install dependencies

    $ npm install
  2. Configure input

    • Edit INPUT.json or provide actor input through the Apify platform.
    • Example INPUT.json:
      {
        "startUrls": [
          { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" }
        ],
        "maxPages": 50,
        "followLinks": true
      }
  3. Run locally

    $ npm start
  4. Docker / Actor

    • Build the actor image with the included Dockerfile.
    • .actor/actor.json defines the Apify actor configuration.
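Before a local run, it can help to sanity-check the input against the fields shown in the example above. A minimal sketch, assuming the field names from INPUT.json; the `validateInput` helper and its defaults are illustrative, not part of the actor:

```javascript
// Hypothetical input validator mirroring the INPUT.json fields above.
// The field names match the example input; the checks themselves are illustrative.
function validateInput(input) {
  if (!Array.isArray(input.startUrls) || input.startUrls.length === 0) {
    throw new Error('startUrls must be a non-empty array of { url } objects');
  }
  for (const entry of input.startUrls) {
    if (typeof entry.url !== 'string' || !entry.url.includes('wikipedia.org/wiki/')) {
      throw new Error(`Not a Wikipedia article URL: ${entry.url}`);
    }
  }
  return {
    startUrls: input.startUrls,
    maxPages: input.maxPages ?? 50,       // defaults mirror the example INPUT.json
    followLinks: input.followLinks ?? true,
  };
}

const input = validateInput({
  startUrls: [{ url: 'https://en.wikipedia.org/wiki/Artificial_intelligence' }],
});
console.log(input.maxPages, input.followLinks); // 50 true
```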

Output Fields

  • page_title
  • page_url
  • summary
  • full_text
  • headings
  • infobox
  • categories
  • references
  • internal_links
  • last_updated
  • scraped_at
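A dataset record containing the fields above might look like the following (all values are illustrative, not real scraper output):

```json
{
  "page_title": "Artificial intelligence",
  "page_url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "summary": "Artificial intelligence (AI) is the capability of computational systems to...",
  "full_text": "...",
  "headings": ["History", "Goals", "Techniques"],
  "infobox": {},
  "categories": ["Artificial intelligence", "Computational fields of study"],
  "references": ["..."],
  "internal_links": ["/wiki/Machine_learning", "/wiki/Neural_network"],
  "last_updated": "2024-01-01T00:00:00Z",
  "scraped_at": "2024-01-02T12:00:00Z"
}
```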

File Overview

  • src/main.js – actor entry point that loads input, launches Playwright, and executes the Wikipedia scraper.
  • src/scraper.js – page extraction and crawl logic for Wikipedia articles.
  • .actor/input_schema.json – defines supported actor input fields.
  • .actor/dataset_schema.json – defines the dataset output record fields.
  • .actor/actor.json – actor metadata and Apify configuration.
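When followLinks is enabled, the crawl logic in src/scraper.js has to decide which links count as internal Wikipedia articles. A minimal sketch of that filtering step, assuming standard Wikipedia URL conventions; the function name and exclusion list are illustrative, not the actor's actual code:

```javascript
// Hypothetical helper: keep only links to regular Wikipedia articles,
// skipping namespace pages (File:, Category:, Special:, ...) and in-page anchors.
const EXCLUDED_NAMESPACES =
  /^\/wiki\/(File|Category|Special|Help|Template|Talk|Portal|Wikipedia):/;

function internalArticleLinks(hrefs) {
  return hrefs.filter((href) =>
    href.startsWith('/wiki/') &&
    !EXCLUDED_NAMESPACES.test(href) &&
    !href.includes('#') // drop in-page anchor links
  );
}

console.log(internalArticleLinks([
  '/wiki/Machine_learning',
  '/wiki/File:Example.png',
  '/wiki/Special:Random',
  'https://example.com',
]));
// → [ '/wiki/Machine_learning' ]
```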

Logs & Storage

  • Logs are written to Apify storage during actor execution.
  • Scraped dataset records are stored in Apify dataset storage.

License

This project is provided as-is. Feel free to adapt and extend it for your own Wikipedia scraping needs.