Kaggle Datasets Scraper
Pricing
Pay per event
Extract Kaggle dataset metadata at scale: titles, owners, descriptions, tags, license, file types, sizes, downloads, views, and votes. Filter by search, tag, user, file type, or size.
Developer
ParseForge
Maintained by Community
📊 Kaggle Datasets Scraper
🚀 Surface every public dataset on Kaggle in seconds. Filter by keyword, file format, license, sort order, and size. No API key, no registration, no manual CSV wrangling.
🕒 Last updated: 2026-05-06 · 📊 24 fields per record · Powered by the Kaggle public API · No browser, pure HTTP · Up to 1M datasets per run
Kaggle hosts more than 400,000 public datasets contributed by data scientists, ML researchers, and academic groups, ranging from a 16 KB CSV of medical insurance costs to half-gigabyte historical stock dumps and image corpora used in published competitions. Each dataset has a rich metadata footprint that matters in practice: number of downloads, votes, view counts, the kernelCount of public Kaggle notebooks that consume it, the license, the file format, an automated usabilityRating for schema clarity, and a long-form Markdown description. This Actor turns that metadata layer into clean dataset rows you can sort, filter, and pipe into downstream tools.
This Actor is built for ML engineers picking training corpora, data scientists benchmarking model results against community baselines, AI researchers tracking which Kaggle datasets are gaining notebook traction, and academic teams sourcing reproducible inputs for coursework or thesis work. It is a pure HTTP scraper against Kaggle's public dataset endpoints, so runs are fast and cheap. It does not download the actual dataset files - only the metadata layer that helps you decide which ones to pull next. Output is plain JSON, ready to feed into BigQuery, a Postgres staging table, a notebook, or a Make / Zapier workflow.
🎯 Target Audience and Primary Use Cases
| Audience | Use Case |
|---|---|
| 🤖 ML engineers | Source training corpora and benchmark datasets by file format, license, and size |
| 📈 Data scientists | Track which datasets are trending, surfacing newly hot competitions and corpora |
| 🎓 AI researchers | Build reproducible bibliographies of community datasets used in papers |
| 🏫 Academic teams | Pull dataset metadata for coursework, dissertations, and lit reviews |
| 🧪 Product builders | Validate that public training data exists for a given vertical before committing engineering |
📋 What the Kaggle Datasets Scraper does
- 🔎 Search by keyword. Free-text query against dataset titles, descriptions, and tags. Pass finance, nlp, medical imaging, etc., or leave blank to browse without a keyword.
- 🗂️ Filter by file format. CSV, JSON, SQLite, BigQuery, or all formats. Useful when your downstream tooling only accepts one shape.
- ⚖️ Filter by license. Restrict to Creative Commons, GPL, Open Database License, Other, or all licenses.
- 🏷️ Filter by tag. Pass any Kaggle tag slug (classification, nlp, finance, health) to scope the run to a topic, technique, or domain.
- 📐 Filter by size. Set minSize / maxSize in bytes to keep the result set within memory or storage limits for downstream tools.
- 🥇 Sort the way Kaggle does. Hottest, most votes, recently updated, most active, recently published.
- 📜 Optional full description enrichment. When enabled, each record is enriched with the dataset's long-form Markdown description, full tag list, and version history. Disable for faster runs when you only need card-level fields.
Each output record represents one public Kaggle dataset. Alongside the title, owner, and URL, the row includes the canonical ref (owner/slug), license, total bytes, current version number, usability rating, downloads, views, votes, the count of public notebooks (kernels) that reference the dataset, the topic count, last-updated timestamp, an array of tag slugs, a compact version history, and (optionally) the full Markdown description.
💡 Why it matters: Kaggle is one of the largest public catalogues of curated ML data on the web, and it sits behind a JS-heavy UI that is hard to crawl. A clean metadata feed lets you make data-sourcing decisions at the speed of SQL.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to find every CC0 image-classification dataset over 100 MB and export the list as CSV.
⚙️ Input
| Field | Type | Required | Description |
|---|---|---|---|
| search | string | no | Free-text query against titles, descriptions, and tags. Leave blank to browse without a keyword. |
| maxItems | integer | no | Max datasets to return. Free tier capped at 10. Paid up to 1,000,000. |
| sortBy | enum | no | One of hottest, votes, updated, active, published. Defaults to hottest. |
| fileType | enum | no | One of all, csv, json, sqlite, bigQuery. |
| license | enum | no | One of all, cc, gpl, odb, other. |
| tag | enum | no | Pick a Kaggle tag slug from a dropdown of 400+ canonical taxonomy values (e.g. classification, nlp, finance, health). |
| user | string | no | Filter to datasets owned by a single Kaggle user or organisation slug (e.g. timoboz, mlg-ulb, organizations/google). |
| minSize | integer | no | Lower bound on dataset size in bytes. |
| maxSize | integer | no | Upper bound on dataset size in bytes. |
| includeDescription | boolean | no | Fetch the dataset detail endpoint per record to add description, full tags, and version history. Defaults to true. |
| proxyConfiguration | object | no | Apify proxy configuration. Recommended for large jobs. |
Example: top 100 finance datasets ranked by community votes, with full descriptions:
{"search": "finance", "sortBy": "votes", "fileType": "all", "license": "all", "includeDescription": true, "maxItems": 100}
Example: every CSV dataset under 50 MB tagged nlp, sorted by recency, no description bodies for a faster run:
{"tag": "nlp", "fileType": "csv", "maxSize": 52428800, "sortBy": "updated", "includeDescription": false, "maxItems": 500}
⚠️ Good to know: the Kaggle public API tolerates direct calls but rate-limits aggressively under sustained load. For runs above a few thousand datasets, enable Apify Residential proxy in the input.
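The input examples above can also be built in code. The following is a minimal Python sketch, not part of the Actor itself: the helper name `build_input` and its keyword arguments are hypothetical, while the enum values and field names are copied from the input table. It rejects unknown enum values early and converts a size given in MiB into the byte count that maxSize expects.

```python
from typing import Any, Dict, Optional

# Allowed enum values, copied from the input table.
SORT_BY = {"hottest", "votes", "updated", "active", "published"}
FILE_TYPES = {"all", "csv", "json", "sqlite", "bigQuery"}
LICENSES = {"all", "cc", "gpl", "odb", "other"}

MIB = 1024 * 1024  # 50 * MIB == 52428800, the maxSize used in the example


def build_input(
    search: str = "",
    sort_by: str = "hottest",
    file_type: str = "all",
    license_: str = "all",
    tag: Optional[str] = None,
    max_size_mib: Optional[int] = None,
    include_description: bool = True,
    max_items: int = 10,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and return an Actor input dict."""
    if sort_by not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if file_type not in FILE_TYPES:
        raise ValueError(f"fileType must be one of {sorted(FILE_TYPES)}")
    if license_ not in LICENSES:
        raise ValueError(f"license must be one of {sorted(LICENSES)}")
    if not 1 <= max_items <= 1_000_000:
        raise ValueError("maxItems must be between 1 and 1,000,000")

    run_input: Dict[str, Any] = {
        "sortBy": sort_by,
        "fileType": file_type,
        "license": license_,
        "includeDescription": include_description,
        "maxItems": max_items,
    }
    # Optional filters are only sent when they are actually set.
    if search:
        run_input["search"] = search
    if tag:
        run_input["tag"] = tag
    if max_size_mib is not None:
        run_input["maxSize"] = max_size_mib * MIB
    return run_input
```

Calling `build_input(tag="nlp", file_type="csv", max_size_mib=50, sort_by="updated", include_description=False, max_items=500)` yields the same filters as the second example, with license set explicitly to all.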
📊 Output
Each dataset row is one public Kaggle dataset. The optional detail enrichment adds the long-form Markdown description, the canonical tag list, and the dataset's full version history.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🖼️ thumbnailImageUrl | string (URL) | https://storage.googleapis.com/kaggle-datasets-images/310/684/.../dataset-thumbnail.jpg |
| 📛 title | string | Credit Card Fraud Detection |
| 🆔 ref | string | mlg-ulb/creditcardfraud |
| 🔗 url | string (URL) | https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud |
| ✏️ subtitle | string | Anonymized credit card transactions labeled as fraudulent or genuine |
| 👤 ownerName | string | Machine Learning Group - ULB |
| 🪪 ownerRef | string | organizations/mlg-ulb |
| 🧑‍🎨 creatorName | string | Timo Bozsolik |
| 🔗 creatorUrl | string | timoboz |
| 📜 licenseName | string | Database: Open Database, Contents: Database Contents |
| 📦 totalBytes | integer | 69155672 |
| 🔢 currentVersionNumber | integer | 3 |
| ⭐ usabilityRating | number | 0.85294 |
| ⬇️ downloadCount | integer | 1132640 |
| 👁️ viewCount | integer | 12618050 |
| ❤️ voteCount | integer | 13166 |
| 📓 kernelCount | integer | 5984 |
| 💬 topicCount | integer | 0 |
| ✨ isFeatured | boolean | false |
| 🕒 lastUpdated | string (ISO) | 2018-03-23T01:17:27.913Z |
| 📝 description | string (Markdown) | Context\n---\nIt is important... |
| 🏷️ tags | array of strings | ["finance", "crime"] |
| 📚 versions | array of objects | [{"versionNumber":3,"creationDate":"2018-03-23T01:17:27.913Z","status":"Ready",...}] |
| ⏱️ scrapedAt | string (ISO) | 2026-05-05T23:33:29.575Z |
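To show how a row with this schema might be consumed, here is a small Python sketch that turns one record into a typed summary. The `DatasetSummary` class and the `summarise` and `parse_iso` helpers are hypothetical names introduced for illustration; only the field names and the sample values come from the schema table.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Tuple

MIB = 1024 * 1024


@dataclass(frozen=True)
class DatasetSummary:
    """Hypothetical typed view over a few schema fields."""
    ref: str
    title: str
    total_mib: float
    downloads_per_view: float
    last_updated: datetime
    tags: Tuple[str, ...]


def parse_iso(timestamp: str) -> datetime:
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so replace it with an explicit UTC offset first.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def summarise(record: Dict[str, Any]) -> DatasetSummary:
    views = record.get("viewCount") or 0
    downloads = record.get("downloadCount") or 0
    return DatasetSummary(
        ref=record["ref"],
        title=record["title"],
        total_mib=round(record["totalBytes"] / MIB, 2),
        # Guard against division by zero for datasets with no views yet.
        downloads_per_view=downloads / views if views else 0.0,
        last_updated=parse_iso(record["lastUpdated"]),
        tags=tuple(record.get("tags") or []),
    )


# Values taken from the example column of the schema table.
sample = {
    "ref": "mlg-ulb/creditcardfraud",
    "title": "Credit Card Fraud Detection",
    "totalBytes": 69155672,
    "downloadCount": 1132640,
    "viewCount": 12618050,
    "lastUpdated": "2018-03-23T01:17:27.913Z",
    "tags": ["finance", "crime"],
}
summary = summarise(sample)
```

Keeping the derived fields (size in MiB, downloads per view) separate from the raw record makes it easy to drop them again before loading rows into a warehouse table.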
📦 Sample records
✨ Why choose this Actor
| | Capability |
|---|---|
| 🪶 | No browser overhead. Pure HTTP against Kaggle's public dataset endpoints. Cheap to run, fast to finish. |
| 🔎 | Six filter axes. Keyword, sort order, file type, license, tag, size. Combine freely. |
| 📓 | Notebook traction signal. Every record carries kernelCount, the number of public Kaggle notebooks that already use the dataset. Use it to spot which datasets practitioners actually adopt. |
| ⭐ | Usability score included. Kaggle's automated 0-1 score on metadata completeness, schema clarity, and licensing comes for free with every row. |
| 📚 | Full version history. Each record carries the dataset's complete versioning trail: number, creation date, status, release notes, and contributor. |
| 🏷️ | Tags as flat slugs. Tags arrive as a clean string array, not nested objects. Drops straight into a SQL text[] or BigQuery ARRAY<STRING>. |
| 💾 | Clean dataset shape. 24 well-typed fields, no nulls on populated records, plus a direct dataset URL and thumbnail. |
📊 Kaggle hosts more than 400,000 public datasets. This Actor exposes the full metadata layer behind that catalogue with no manual scraping.
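As one way to use the kernelCount and usabilityRating fields together, the Python sketch below ranks records by notebooks per 1,000 downloads after dropping rows under a usability threshold. The metric, the function names, and the two `example/...` records are illustrative assumptions, not something the Actor computes; the first record reuses the values from the schema table.

```python
from typing import Any, Dict, List


def traction(record: Dict[str, Any]) -> float:
    """Public notebooks per 1,000 downloads (an illustrative metric)."""
    downloads = record.get("downloadCount") or 0
    kernels = record.get("kernelCount") or 0
    return kernels / downloads * 1000 if downloads else 0.0


def rank_by_traction(
    records: List[Dict[str, Any]], min_usability: float = 0.0
) -> List[Dict[str, Any]]:
    """Keep records at or above min_usability, highest traction first."""
    eligible = [
        r for r in records if (r.get("usabilityRating") or 0.0) >= min_usability
    ]
    return sorted(eligible, key=traction, reverse=True)


records = [
    # Values from the schema table.
    {"ref": "mlg-ulb/creditcardfraud", "downloadCount": 1132640,
     "kernelCount": 5984, "usabilityRating": 0.85294},
    # Hypothetical records, made up for this example only.
    {"ref": "example/small-but-busy", "downloadCount": 1000,
     "kernelCount": 50, "usabilityRating": 0.4},
    {"ref": "example/never-downloaded", "downloadCount": 0,
     "kernelCount": 3, "usabilityRating": 1.0},
]

ranked_refs = [r["ref"] for r in rank_by_traction(records, min_usability=0.5)]
```

A ratio like this favours small datasets with a few notebooks, so in practice you may want to combine it with a minimum download count.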
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ Kaggle Datasets Scraper (this Actor) | Apify usage only | Whole public catalogue | Live | Keyword, sort, file type, license, tag, size | None, run from console |
| Official CLI | Free | Same | Live | Same | Local install, account, API token |
| Manual JSON harvesting | Free | Same | Live | DIY | Pagination, retries, parsing yourself |
| Paid live data marketplaces | High monthly | Curated subsets only | Live | Per-vendor | Account, billing, API key |
| Static community dumps | Free | Stale, partial | Months out of date | Whatever the dump captured | Find and trust the dump |
For most teams the calculus is simple: a hosted scraper that returns clean JSON is worth more than the time spent re-implementing pagination, retries, and detail enrichment.
🚀 How to use
- 🆔 Create a free account. Sign up for a free Apify account, which comes with $5 of credit.
- 🔎 Open the Actor. Find the Kaggle Datasets Scraper on Apify Store.
- 📝 Fill the input form. Set a keyword and pick the filters that matter (file type, license, tag, size, sort).
- ▶️ Run. Click Start. The log streams listing pages and how many datasets have been collected.
- ⬇️ Export. Download as JSON, CSV, Excel, or stream into a Make / Zapier / n8n workflow.
⏱️ Total time to first row: under a minute for most filter combinations.
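If you export JSON and only need a handful of columns locally, a short script can flatten the records yourself. This is a minimal Python sketch using only the standard library; the column selection, the `records_to_csv` name, and the choice to join tags with a semicolon are assumptions made for the example.

```python
import csv
import io
from typing import Any, Dict, Iterable

# Columns to keep; any other fields in the record are ignored.
COLUMNS = ["ref", "title", "licenseName", "totalBytes",
           "downloadCount", "voteCount", "tags"]


def records_to_csv(records: Iterable[Dict[str, Any]]) -> str:
    """Flatten exported records into CSV text with a fixed column set."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        row = dict(record)
        # The tags array becomes one semicolon-separated cell.
        row["tags"] = ";".join(record.get("tags") or [])
        writer.writerow(row)
    return buffer.getvalue()


# Values taken from the example column of the schema table.
csv_text = records_to_csv([{
    "ref": "mlg-ulb/creditcardfraud",
    "title": "Credit Card Fraud Detection",
    "licenseName": "Database: Open Database, Contents: Database Contents",
    "totalBytes": 69155672,
    "downloadCount": 1132640,
    "voteCount": 13166,
    "kernelCount": 5984,
    "tags": ["finance", "crime"],
}])
```

The `csv` module quotes the license name automatically because it contains a comma, so the output stays parseable.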
💼 Business use cases
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🔌 Automating Kaggle Datasets Scraper
Run this Actor on a schedule, from your own backend, or as part of a larger pipeline.
- Node.js via the Apify JS client
- Python via the Apify Python client
- Docs: Apify Actor API reference
Schedules are first-class on Apify. Set a cron, point it at this Actor's input, and your dataset stays fresh without any glue code.
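For a backend without the Apify client installed, a run can also be started through the Apify REST API. The Python sketch below uses only the standard library and the `run-sync-get-dataset-items` endpoint, which starts a run and returns its dataset items in one response. The Actor ID `example-user~example-actor` is a placeholder, not this Actor's real ID, and the function names are hypothetical; substitute the ID shown on the Actor's API tab and your own token.

```python
import json
import urllib.request
from typing import Any, Dict, List
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def build_run_request(
    actor_id: str, token: str, run_input: Dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a POST that runs an Actor synchronously."""
    url = f"{API_BASE}/acts/{quote(actor_id, safe='~')}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_items(
    request: urllib.request.Request, timeout: float = 300
) -> List[Dict[str, Any]]:
    """Send the request and decode the JSON array of dataset rows."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Placeholder Actor ID and token, for illustration only.
request = build_run_request(
    "example-user~example-actor",
    "YOUR_APIFY_TOKEN",
    {"search": "finance", "sortBy": "votes", "maxItems": 100},
)
# items = fetch_items(request)  # uncomment to perform the network call
```

The synchronous endpoint suits short runs. For large jobs it is usually safer to start the run asynchronously and read the dataset once the run has finished.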
❓ Frequently Asked Questions
🔌 Integrate with any app
- Make - drop the Actor into a no-code automation
- Zapier - trigger Zaps from each finished run
- n8n - self-hostable workflow automation
- Slack - send completion notifications and dataset links to a channel
- Webhooks - POST run events to any endpoint
- Google Sheets - mirror the dataset into a Sheet for collaborators
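For the webhook route, the receiving service usually needs just one value from the event: the ID of the dataset the run produced. The Python sketch below assumes Apify's default webhook payload, where `eventType` names the event and `resource` holds the run object. The function name and the sample IDs are hypothetical, and a custom payload template would change the shape.

```python
from typing import Any, Dict, Optional


def dataset_id_from_webhook(payload: Dict[str, Any]) -> Optional[str]:
    """Return the run's default dataset ID for successful runs, else None."""
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")


# Hypothetical payload with made-up IDs, shaped like the default template.
example_payload = {
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "eventData": {"actorId": "exampleActorId", "actorRunId": "exampleRunId"},
    "resource": {"id": "exampleRunId", "status": "SUCCEEDED",
                 "defaultDatasetId": "exampleDatasetId"},
}
dataset_id = dataset_id_from_webhook(example_payload)
```

Ignoring every event except a successful run keeps failed or aborted runs from triggering downstream loads.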
🔗 Recommended Actors
- 🤗 Hugging Face Model Scraper - public model catalogue with download counts and licenses
- 📚 Semantic Scholar Scraper - peer-reviewed papers with citation metadata
- 🧬 medRxiv Scraper - medical preprints for biomedical AI training corpora
- 🏛️ FRED Economic Data Scraper - public economic time series for finance and macro models
- 🏥 ClinicalTrials Scraper - structured clinical trial registry data
💡 Pro Tip: browse the complete ParseForge collection for more public-data scrapers built with the same conventions.
🆘 Need Help? Open our contact form
Disclaimer: This Actor is an independent project and is not affiliated with, endorsed by, or sponsored by Kaggle or Google LLC. It only reads public dataset metadata. You are responsible for complying with applicable laws, Kaggle's terms of service, and the per-dataset licenses when using the data downstream.