Wikipedia Page Dataset Scraper
Pricing
from $20.00 / 1,000 results
Scrape Wikipedia articles and export structured dataset fields for training, knowledge bases, and research.
Rating
5.0 (1)
Developer
ScrapeAI (maintained by Community)
Actor stats
- Bookmarked: 0
- Total users: 4
- Monthly active users: 2
- Last modified: 23 days ago
Wikipedia Page Dataset Scraper
Scrape Wikipedia pages and extract structured article content including title, summary, full text, headings, infobox data, categories, references, and internal links. Designed for AI training, RAG pipelines, knowledge base creation, research datasets, and content analysis.
Features
- Scrapes article content from Wikipedia pages.
- Extracts structured fields: page_title, page_url, summary, full_text, headings, infobox, categories, references, internal_links, last_updated, and scraped_at.
- Supports multiple start URLs and optional crawling of linked Wikipedia articles.
- Uses Playwright and Apify actor conventions for reliable dataset export.
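
The optional crawling of linked articles depends on deciding which links on a page count as Wikipedia articles. The following is a minimal sketch of one way to do that, not the actor's actual logic: the helper names toArticleUrl and collectArticleLinks are hypothetical, and the namespace prefix list is an assumed, non-exhaustive set.

```javascript
// Hypothetical helpers for illustration; not taken from src/scraper.js.
// Assumed, non-exhaustive list of namespace prefixes for non-article pages.
const NON_ARTICLE_PREFIXES = [
  'Special:', 'File:', 'Help:', 'Category:', 'Talk:', 'Template:',
  'Portal:', 'Wikipedia:', 'User:', 'Draft:', 'Module:',
];

// Resolve an href against the current page and return a clean article URL,
// or null when the link does not point at a Wikipedia article.
function toArticleUrl(href, baseUrl) {
  let url;
  try {
    url = new URL(href, baseUrl);
  } catch {
    return null; // unparsable href or base URL
  }
  if (url.protocol !== 'https:' && url.protocol !== 'http:') return null;
  if (!url.hostname.endsWith('.wikipedia.org')) return null;
  if (!url.pathname.startsWith('/wiki/')) return null;

  const title = url.pathname.slice('/wiki/'.length);
  if (title === '') return null;
  if (NON_ARTICLE_PREFIXES.some((prefix) => title.startsWith(prefix))) return null;

  url.hash = '';   // drop section anchors such as #History
  url.search = ''; // drop query strings
  return url.href;
}

// Return the unique article URLs found among hrefs, in first-seen order,
// excluding links that point back at the current page.
function collectArticleLinks(hrefs, baseUrl) {
  const self = toArticleUrl(baseUrl, baseUrl);
  const seen = new Set();
  for (const href of hrefs) {
    const articleUrl = toArticleUrl(href, baseUrl);
    if (articleUrl !== null && articleUrl !== self) seen.add(articleUrl);
  }
  return [...seen];
}
```

In a crawl, the hrefs would come from the anchors on the rendered page, and the returned URLs would be queued until the page limit is reached.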
Getting Started
- Install dependencies
  $ npm install
- Configure input
  - Edit INPUT.json or provide actor input through the Apify platform.
  - Example INPUT.json:
    {
      "startUrls": [{ "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" }],
      "maxPages": 50,
      "followLinks": true
    }
- Run locally
  $ npm start
- Docker / Actor
  - The Dockerfile can build the image.
  - .actor/actor.json defines the Apify actor configuration.
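
Before a run starts, it can help to check the input shape shown in the example INPUT.json. The sketch below is illustrative only: normalizeInput is a hypothetical helper, and the fallback values are assumptions, since the actor's real defaults are defined in .actor/input_schema.json and are not shown here.

```javascript
// Assumed fallback values for illustration only; the real defaults are
// defined in .actor/input_schema.json.
const ASSUMED_DEFAULTS = { maxPages: 50, followLinks: false };

// Hypothetical helper: validate raw actor input and fill in missing options.
function normalizeInput(raw) {
  const input = raw ?? {};

  if (!Array.isArray(input.startUrls) || input.startUrls.length === 0) {
    throw new Error('Input must contain a non-empty "startUrls" array.');
  }

  // Accept both { "url": "..." } objects and plain strings.
  const startUrls = input.startUrls.map((entry) => {
    const url = typeof entry === 'string' ? entry : entry?.url;
    if (typeof url !== 'string') {
      throw new Error('Each start URL needs a "url" string.');
    }
    return new URL(url).href; // throws a TypeError on an invalid URL
  });

  const maxPages = input.maxPages ?? ASSUMED_DEFAULTS.maxPages;
  if (!Number.isInteger(maxPages) || maxPages < 1) {
    throw new Error('"maxPages" must be a positive integer.');
  }

  const followLinks = input.followLinks ?? ASSUMED_DEFAULTS.followLinks;
  if (typeof followLinks !== 'boolean') {
    throw new Error('"followLinks" must be true or false.');
  }

  return { startUrls, maxPages, followLinks };
}
```

Failing early on malformed input avoids launching a browser for a run that cannot succeed.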
Output Fields
- page_title
- page_url
- summary
- full_text
- headings
- infobox
- categories
- references
- internal_links
- last_updated
- scraped_at
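
To illustrate how a record with these fields might be assembled, here is a minimal sketch. Only the field names come from the list above. The helpers makeRecord and missingFields are hypothetical, and the use of null for missing values and an ISO 8601 string for scraped_at are assumptions rather than documented behavior.

```javascript
// Field names come from the Output Fields list; everything else is assumed.
const OUTPUT_FIELDS = [
  'page_title', 'page_url', 'summary', 'full_text', 'headings', 'infobox',
  'categories', 'references', 'internal_links', 'last_updated', 'scraped_at',
];

// Hypothetical helper: build a record with every output field present.
// Fields the extractor could not find are set to null rather than omitted,
// so each dataset row has the same keys.
function makeRecord(extracted, now = new Date()) {
  const record = {};
  for (const field of OUTPUT_FIELDS) {
    record[field] = extracted[field] ?? null;
  }
  record.scraped_at = now.toISOString(); // assumed ISO 8601 timestamp format
  return record;
}

// List the output fields that have no value in a record.
function missingFields(record) {
  return OUTPUT_FIELDS.filter(
    (field) => record[field] === undefined || record[field] === null,
  );
}
```

Keeping the key set constant across rows makes tabular exports such as CSV easier to work with, and missingFields gives a quick way to spot pages where extraction was incomplete.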
File Overview
- src/main.js: actor entry point that loads input, launches Playwright, and executes the Wikipedia scraper.
- src/scraper.js: page extraction and crawl logic for Wikipedia articles.
- .actor/input_schema.json: defines supported actor input fields.
- .actor/dataset_schema.json: defines the dataset output record fields.
- .actor/actor.json: actor metadata and Apify configuration.
Logs & Storage
- Logs are written to Apify storage during actor execution.
- Scraped dataset records are stored in Apify dataset storage.
License
This project is provided as-is. Feel free to adapt and extend it for your own Wikipedia scraping needs.
