Kaggle Dataset Scraper - ML Dataset Metadata
Pricing
Pay per usage
Scrape Kaggle datasets and competitions. Extract dataset names, download counts, file sizes, usability ratings, tags, and license info.
Developer: Donny Nguyen
Kaggle Dataset Scraper
Extract dataset metadata and competition data from Kaggle at scale. Scrape dataset names, authors, download counts, vote counts, usability ratings, file sizes, tags, licenses, and more from listing pages, individual datasets, and competition pages.
Features
- Dataset listings - Scrape the main Kaggle datasets directory with pagination support
- Individual datasets - Extract detailed metadata from specific dataset pages
- Competition pages - Scrape competition listings and metadata
- User profiles - Extract all datasets from a specific Kaggle user
- Search results - Scrape datasets matching a search query
- Embedded data extraction - Parses `__NEXT_DATA__`, component props, and Kaggle State for comprehensive data capture
- Smart pagination - Automatically enqueues next pages when more results are needed
- Proxy support - Optional residential proxy for higher success rates
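To illustrate the embedded data extraction idea, here is a minimal sketch of pulling the `__NEXT_DATA__` JSON payload out of a page's HTML. It is not the actor's own code (the actor is Cheerio-based); the function name `extract_next_data`, the sample HTML, and the payload keys are assumptions made for illustration.

```python
import json
import re
from typing import Any, Optional

# Matches the JSON payload Next.js embeds in a <script id="__NEXT_DATA__"> tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict[str, Any]]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Hypothetical page snippet; real Kaggle markup and payload keys will differ.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"datasets": [{"title": "Example"}]}}}'
    "</script></body></html>"
)
payload = extract_next_data(sample_html)
print(payload["props"]["pageProps"]["datasets"][0]["title"])
```

Reading the embedded JSON avoids running a browser, which is why this approach suits a lightweight HTML-only scraper.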
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | Array of strings | `["https://www.kaggle.com/datasets"]` | Kaggle URLs to scrape. Supports dataset listings, individual datasets, competitions, user profiles, and search queries. |
| `maxResults` | Integer | `100` | Maximum number of results to extract. Range: 1-10,000. |
| `useResidentialProxy` | Boolean | `false` | Use residential proxies for better success rates. Increases cost but reduces blocking. |
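As a sketch of how the parameters above fit together, the following helper assembles an input object with the documented names, defaults, and the 1-10,000 range check. The helper `build_run_input` is hypothetical and is not part of the actor; how the resulting object is submitted to a run is left out.

```python
from typing import Any, Optional

DEFAULT_URLS = ["https://www.kaggle.com/datasets"]
MAX_RESULTS_RANGE = (1, 10_000)


def build_run_input(
    urls: Optional[list[str]] = None,
    max_results: int = 100,
    use_residential_proxy: bool = False,
) -> dict[str, Any]:
    """Build an input object using the documented parameter names and defaults."""
    low, high = MAX_RESULTS_RANGE
    if not low <= max_results <= high:
        raise ValueError(f"maxResults must be between {low} and {high}")
    return {
        "urls": list(urls) if urls else list(DEFAULT_URLS),
        "maxResults": max_results,
        "useResidentialProxy": use_residential_proxy,
    }


# Example: a search query capped at 500 results, without residential proxies.
run_input = build_run_input(
    urls=["https://www.kaggle.com/datasets?search=climate"],
    max_results=500,
)
print(run_input)
```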
Supported URL Formats
- `https://www.kaggle.com/datasets` - Main dataset directory
- `https://www.kaggle.com/datasets?search=topic` - Search results
- `https://www.kaggle.com/datasets/owner/dataset-name` - Individual dataset
- `https://www.kaggle.com/username/datasets` - User's datasets
- `https://www.kaggle.com/competitions` - Competition listings
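One way to check ahead of time which of these formats a URL falls into is a small classifier over the path and query string. This is a sketch only: the function `classify_kaggle_url` and its category labels are hypothetical, and the actor's own routing logic may differ.

```python
from urllib.parse import parse_qs, urlparse


def classify_kaggle_url(url: str) -> str:
    """Map a URL to one of the documented formats, or 'unsupported'."""
    parsed = urlparse(url)
    if parsed.netloc not in ("www.kaggle.com", "kaggle.com"):
        return "unsupported"
    segments = [part for part in parsed.path.split("/") if part]
    query = parse_qs(parsed.query)
    if segments == ["datasets"]:
        # Same path for both; a search query distinguishes the two.
        return "search" if "search" in query else "dataset_listing"
    if len(segments) == 3 and segments[0] == "datasets":
        return "dataset"
    if len(segments) == 2 and segments[1] == "datasets":
        return "user_datasets"
    if segments and segments[0] == "competitions":
        return "competitions"
    return "unsupported"


for candidate in (
    "https://www.kaggle.com/datasets",
    "https://www.kaggle.com/datasets?search=topic",
    "https://www.kaggle.com/datasets/owner/dataset-name",
    "https://www.kaggle.com/username/datasets",
    "https://www.kaggle.com/competitions",
):
    print(candidate, "->", classify_kaggle_url(candidate))
```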
Output Fields
| Field | Type | Description |
|---|---|---|
| `datasetName` | String | Name/title of the dataset |
| `author` | String | Username of the dataset creator |
| `description` | String | Dataset subtitle or description |
| `fileSize` | String | Total file size (e.g., "1.5 GB", "245.3 MB") |
| `downloadCount` | Integer | Number of times the dataset has been downloaded |
| `voteCount` | Integer | Number of upvotes/votes |
| `usabilityRating` | Number | Kaggle usability rating (0-10 scale) |
| `lastUpdated` | String | Date the dataset was last updated |
| `tags` | Array | List of tags/keywords associated with the dataset |
| `license` | String | License type (e.g., "CC0", "CC BY-SA 4.0") |
| `url` | String | Direct URL to the dataset on Kaggle |
| `scrapedAt` | String | ISO 8601 timestamp of when the data was collected |
Example Output
    {
      "datasetName": "Netflix Movies and TV Shows",
      "author": "shivamb",
      "description": "Listings of all movies and TV shows available on Netflix",
      "fileSize": "3.2 MB",
      "downloadCount": 245000,
      "voteCount": 1892,
      "usabilityRating": 8.8,
      "lastUpdated": "2025-09-15",
      "tags": ["movies and tv shows", "arts and entertainment", "netflix"],
      "license": "CC0: Public Domain",
      "url": "https://www.kaggle.com/datasets/shivamb/netflix-shows",
      "scrapedAt": "2026-02-11T12:00:00.000Z"
    }
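Because `fileSize` is returned as a human-readable string, downstream code that sorts or filters by size needs to convert it to a number. The sketch below parses an abbreviated copy of the record above and adds a byte count. The helper `file_size_to_bytes`, the `fileSizeBytes` key, and the choice of decimal units (1 MB = 1,000,000 bytes) are assumptions; the source does not say whether Kaggle's sizes are decimal or binary.

```python
import json

# Assumption: decimal (SI) units. Swap in powers of 1024 if binary units apply.
UNIT_FACTORS = {"B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3, "TB": 1000**4}


def file_size_to_bytes(size: str) -> int:
    """Convert a string such as '1.5 GB' to an approximate byte count."""
    value, unit = size.split()
    return round(float(value) * UNIT_FACTORS[unit.upper()])


# Abbreviated version of the example record above.
raw = '{"datasetName": "Netflix Movies and TV Shows", "fileSize": "3.2 MB", "downloadCount": 245000}'
record = json.loads(raw)
record["fileSizeBytes"] = file_size_to_bytes(record["fileSize"])
print(record["datasetName"], record["fileSizeBytes"])
```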
Example Use Cases
- Data science research - Discover and catalog datasets for machine learning projects
- Competitive analysis - Track the most popular and downloaded datasets across categories
- Trend analysis - Monitor which data topics are gaining traction on Kaggle
- Dataset discovery - Find datasets by tag, license, or popularity for specific research needs
- Academic research - Build a comprehensive index of available open datasets
- Competition tracking - Monitor active Kaggle competitions and their engagement
Cost Estimate
This actor uses Utility tier Pay-Per-Event pricing at $0.0003 per result.
| Results | Estimated Cost |
|---|---|
| 100 | $0.03 |
| 1,000 | $0.30 |
| 3,333 | ~$1.00 |
| 10,000 | $3.00 |
Approximately 3,333 results per $1.00.
Compute costs are minimal since this is a Cheerio-based scraper (no browser overhead). A typical run of 100 results completes in under 2 minutes using ~256 MB memory.
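The table follows directly from the per-result price, so the arithmetic can be sketched as below. Only the $0.0003 rate comes from this page; the function names are hypothetical, and the estimate covers the per-result charge only, not compute or proxy costs.

```python
from decimal import Decimal

PRICE_PER_RESULT = Decimal("0.0003")  # USD per result, as stated above


def estimate_cost(results: int) -> Decimal:
    """Per-result charge in USD for a given number of results."""
    return PRICE_PER_RESULT * results


def results_for_budget(budget_usd: str) -> int:
    """Whole number of results that fit within a budget given in USD."""
    return int(Decimal(budget_usd) / PRICE_PER_RESULT)


for count in (100, 1_000, 3_333, 10_000):
    print(f"{count:>6} results: ${estimate_cost(count):.2f}")
print("Results per $1.00:", results_for_budget("1.00"))
```

`Decimal` is used instead of `float` so that small per-result prices do not accumulate binary rounding error.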
Limitations
- JavaScript-rendered content - Kaggle uses heavy client-side rendering (React/Next.js). Some pages may yield fewer results than visible in a browser. The scraper compensates by extracting data from embedded JSON, `__NEXT_DATA__`, and component props.
- Rate limiting - Kaggle may throttle or block requests at high concurrency. Use residential proxies for large-scale scraping.
- Private datasets - Only publicly accessible datasets can be scraped. Private or organization-only datasets require authentication.
- Login-gated content - Some Kaggle pages require login to view full content. The scraper extracts what is available without authentication.
- Dynamic loading - Kaggle uses infinite scroll on listing pages. The scraper handles pagination via URL parameters but may not capture all items from dynamically loaded content.
- API changes - Kaggle may update their frontend structure at any time, which could affect extraction accuracy.