Kaggle Dataset Scraper - ML Dataset Metadata

Pricing

Pay per usage

Scrape Kaggle datasets and competitions. Extract dataset names, download counts, file sizes, usability ratings, tags, and license info.

Rating

0.0

(0)

Developer

Donny Nguyen

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: an hour ago

Kaggle Dataset Scraper

Extract dataset metadata and competition data from Kaggle at scale. Scrape dataset names, authors, download counts, vote counts, usability ratings, file sizes, tags, licenses, and more from listing pages, individual datasets, and competition pages.

Features

  • Dataset listings - Scrape the main Kaggle datasets directory with pagination support
  • Individual datasets - Extract detailed metadata from specific dataset pages
  • Competition pages - Scrape competition listings and metadata
  • User profiles - Extract all datasets from a specific Kaggle user
  • Search results - Scrape datasets matching a search query
  • Embedded data extraction - Parses __NEXT_DATA__, component props, and Kaggle State for comprehensive data capture
  • Smart pagination - Automatically enqueues next pages when more results are needed
  • Proxy support - Optional residential proxy for higher success rates
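
As an illustration of the embedded data extraction feature, here is a minimal Python sketch that pulls a __NEXT_DATA__ JSON payload out of raw HTML using only the standard library. It is not the actor's own code (the actor is Cheerio-based). The function name `extract_next_data`, the sample HTML, and the keys inside the sample payload are all assumptions made for the example.

```python
import json
import re
from typing import Optional

# Matches the JSON payload that Next.js embeds in <script id="__NEXT_DATA__">.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if absent or invalid."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Hypothetical page fragment; the key names below are invented for illustration
# and do not describe Kaggle's real payload.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"datasets": [{"title": "Example"}]}}}'
    "</script></body></html>"
)

data = extract_next_data(sample_html)
titles = [d["title"] for d in data["props"]["pageProps"]["datasets"]]
```

Returning None rather than raising lets a caller fall back to another source, such as component props, when the script tag is missing.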

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | Array of strings | ["https://www.kaggle.com/datasets"] | Kaggle URLs to scrape. Supports dataset listings, individual datasets, competitions, user profiles, and search queries. |
| maxResults | Integer | 100 | Maximum number of results to extract. Range: 1-10,000. |
| useResidentialProxy | Boolean | false | Use residential proxies for better success rates. Increases cost but reduces blocking. |
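
The parameters above can be assembled and checked before a run. The following Python sketch builds an input object with the documented defaults and enforces the documented maxResults range. The helper name `build_input` is hypothetical, and the validation is done client-side for the example; the actor may validate its input differently.

```python
import json


def build_input(urls=None, max_results=100, use_residential_proxy=False):
    """Build an input dict using the documented parameter names and defaults."""
    if urls is None:
        urls = ["https://www.kaggle.com/datasets"]  # documented default
    if not urls or not all(isinstance(u, str) for u in urls):
        raise ValueError("urls must be a non-empty list of strings")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_results, bool) or not isinstance(max_results, int):
        raise ValueError("maxResults must be an integer")
    if not 1 <= max_results <= 10_000:
        raise ValueError("maxResults must be between 1 and 10,000")
    return {
        "urls": list(urls),
        "maxResults": max_results,
        "useResidentialProxy": bool(use_residential_proxy),
    }


run_input = build_input(
    urls=["https://www.kaggle.com/datasets?search=topic"],
    max_results=500,
)
print(json.dumps(run_input, indent=2))
```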

Supported URL Formats

  • https://www.kaggle.com/datasets - Main dataset directory
  • https://www.kaggle.com/datasets?search=topic - Search results
  • https://www.kaggle.com/datasets/owner/dataset-name - Individual dataset
  • https://www.kaggle.com/username/datasets - User's datasets
  • https://www.kaggle.com/competitions - Competition listings
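
To check which of these formats a URL matches before submitting it, a small classifier along the following lines may help. This is a sketch based only on the five patterns listed above. The function name `classify_url` and the returned labels are assumptions, and the actor's own routing rules may differ.

```python
from urllib.parse import parse_qs, urlparse


def classify_url(url: str) -> str:
    """Map a Kaggle URL to one of the documented formats, or 'unsupported'."""
    parsed = urlparse(url)
    if parsed.netloc not in ("www.kaggle.com", "kaggle.com"):
        return "unsupported"
    parts = [p for p in parsed.path.split("/") if p]
    if parts == ["datasets"]:
        # A non-empty ?search= value distinguishes search results from the directory.
        return "search" if "search" in parse_qs(parsed.query) else "listing"
    if len(parts) == 3 and parts[0] == "datasets":
        return "dataset"  # /datasets/owner/dataset-name
    if len(parts) == 2 and parts[1] == "datasets":
        return "user"  # /username/datasets
    if parts and parts[0] == "competitions":
        return "competitions"
    return "unsupported"


examples = [
    "https://www.kaggle.com/datasets",
    "https://www.kaggle.com/datasets?search=topic",
    "https://www.kaggle.com/datasets/owner/dataset-name",
    "https://www.kaggle.com/username/datasets",
    "https://www.kaggle.com/competitions",
]
labels = [classify_url(u) for u in examples]
```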

Output Fields

| Field | Type | Description |
|---|---|---|
| datasetName | String | Name/title of the dataset |
| author | String | Username of the dataset creator |
| description | String | Dataset subtitle or description |
| fileSize | String | Total file size (e.g., "1.5 GB", "245.3 MB") |
| downloadCount | Integer | Number of times the dataset has been downloaded |
| voteCount | Integer | Number of upvotes/votes |
| usabilityRating | Number | Kaggle usability rating (0-10 scale) |
| lastUpdated | String | Date the dataset was last updated |
| tags | Array | List of tags/keywords associated with the dataset |
| license | String | License type (e.g., "CC0", "CC BY-SA 4.0") |
| url | String | Direct URL to the dataset on Kaggle |
| scrapedAt | String | ISO 8601 timestamp of when the data was collected |

Example Output

{
  "datasetName": "Netflix Movies and TV Shows",
  "author": "shivamb",
  "description": "Listings of all movies and TV shows available on Netflix",
  "fileSize": "3.2 MB",
  "downloadCount": 245000,
  "voteCount": 1892,
  "usabilityRating": 8.8,
  "lastUpdated": "2025-09-15",
  "tags": ["movies and tv shows", "arts and entertainment", "netflix"],
  "license": "CC0: Public Domain",
  "url": "https://www.kaggle.com/datasets/shivamb/netflix-shows",
  "scrapedAt": "2026-02-11T12:00:00.000Z"
}
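
Because fileSize is returned as a display string, sorting or filtering by size needs a conversion step. The sketch below shows one way to do it in Python. The helper name `file_size_to_bytes` is hypothetical, and the decimal unit factors (1 MB = 1,000,000 bytes) are an assumption; the source does not say whether Kaggle's sizes are decimal or binary.

```python
import re
from decimal import Decimal

# Assumption: decimal (SI) units. Swap in powers of 1024 if binary units apply.
UNIT_FACTORS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*$", re.IGNORECASE)


def file_size_to_bytes(text: str) -> int:
    """Convert a string such as '3.2 MB' to a whole number of bytes."""
    match = SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized file size: {text!r}")
    number, unit = match.groups()
    # Decimal avoids float rounding; int() truncates any fractional byte.
    return int(Decimal(number) * UNIT_FACTORS[unit.upper()])


records = [
    {"datasetName": "Netflix Movies and TV Shows", "fileSize": "3.2 MB"},
    {"datasetName": "Hypothetical large dataset", "fileSize": "1.5 GB"},
]
# Keep only datasets under 100 MB (threshold chosen for illustration).
small = [r for r in records if file_size_to_bytes(r["fileSize"]) < 100 * 10**6]
```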

Example Use Cases

  • Data science research - Discover and catalog datasets for machine learning projects
  • Competitive analysis - Track the most popular and downloaded datasets across categories
  • Trend analysis - Monitor which data topics are gaining traction on Kaggle
  • Dataset discovery - Find datasets by tag, license, or popularity for specific research needs
  • Academic research - Build a comprehensive index of available open datasets
  • Competition tracking - Monitor active Kaggle competitions and their engagement

Cost Estimate

This actor uses Utility tier Pay-Per-Event pricing at $0.0003 per result.

| Results | Estimated Cost |
|---|---|
| 100 | $0.03 |
| 1,000 | $0.30 |
| 3,333 | ~$1.00 |
| 10,000 | $3.00 |

Approximately 3,333 results per $1.00.
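
The per-result figures follow directly from the $0.0003 rate. As a quick check, this Python sketch reproduces the arithmetic with exact decimals. The function names are illustrative, and the calculation covers only the per-result charge, not compute or proxy costs.

```python
from decimal import Decimal

PRICE_PER_RESULT = Decimal("0.0003")  # USD per result, as stated above


def estimate_cost(results: int) -> Decimal:
    """Per-result charge in USD for a given number of results."""
    if results < 0:
        raise ValueError("results must be non-negative")
    return PRICE_PER_RESULT * results


def results_for_budget(budget_usd: str) -> int:
    """Largest whole number of results that fits within a budget."""
    return int(Decimal(budget_usd) / PRICE_PER_RESULT)


for n in (100, 1_000, 10_000):
    print(f"{n:>6} results: ${estimate_cost(n):.2f}")
print("Results per $1.00:", results_for_budget("1.00"))
```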

Compute costs are minimal since this is a Cheerio-based scraper (no browser overhead). A typical run of 100 results completes in under 2 minutes using ~256 MB memory.

Limitations

  • JavaScript-rendered content - Kaggle uses heavy client-side rendering (React/Next.js). Some pages may yield fewer results than visible in a browser. The scraper compensates by extracting data from embedded JSON, __NEXT_DATA__, and component props.
  • Rate limiting - Kaggle may throttle or block requests at high concurrency. Use residential proxies for large-scale scraping.
  • Private datasets - Only publicly accessible datasets can be scraped. Private or organization-only datasets require authentication.
  • Login-gated content - Some Kaggle pages require login to view full content. The scraper extracts what is available without authentication.
  • Dynamic loading - Kaggle uses infinite scroll on listing pages. The scraper handles pagination via URL parameters but may not capture all items from dynamically loaded content.
  • Frontend changes - Kaggle may update its frontend structure at any time, which can break extraction or reduce its accuracy.