Kaggle Datasets Scraper

Pricing

Pay per event

Extract Kaggle dataset metadata at scale: titles, owners, descriptions, tags, license, file types, sizes, downloads, views, and votes. Filter by search, tag, user, file type, or size.

Developer

ParseForge

Maintained by Community

📊 Kaggle Datasets Scraper

🚀 Surface every public dataset on Kaggle in seconds. Filter by keyword, file format, license, sort order, and size. No API key, no registration, no manual CSV wrangling.

🕒 Last updated: 2026-05-06 · 📊 24 fields per record · Powered by the Kaggle public API · No browser, pure HTTP · Up to 1M datasets per run

Kaggle hosts more than 400,000 public datasets contributed by data scientists, ML researchers, and academic groups, ranging from a 16 KB CSV of medical insurance costs to half-gigabyte historical stock dumps and image corpora used in published competitions. Each dataset has a rich metadata footprint that matters in practice: number of downloads, votes, view counts, the kernelCount of public Kaggle notebooks that consume it, the license, the file format, an automated usabilityRating for schema clarity, and a long-form Markdown description. This Actor turns that metadata layer into clean dataset rows you can sort, filter, and pipe into downstream tools.

This Actor is built for ML engineers picking training corpora, data scientists benchmarking model results against community baselines, AI researchers tracking which Kaggle datasets are gaining notebook traction, and academic teams sourcing reproducible inputs for coursework or thesis work. It is a pure HTTP scraper against Kaggle's public dataset endpoints, so runs are fast and cheap. It does not download the actual dataset files - only the metadata layer that helps you decide which ones to pull next. Output is plain JSON, ready to feed into BigQuery, a Postgres staging table, a notebook, or a Make / Zapier workflow.

🎯 Target Audience and Primary Use Cases

| Audience | Use Case |
| --- | --- |
| 🤖 ML engineers | Source training corpora and benchmark datasets by file format, license, and size |
| 📈 Data scientists | Track which datasets are trending, surfacing newly hot competitions and corpora |
| 🎓 AI researchers | Build reproducible bibliographies of community datasets used in papers |
| 🏫 Academic teams | Pull dataset metadata for coursework, dissertations, and lit reviews |
| 🧪 Product builders | Validate that public training data exists for a given vertical before committing engineering |

📋 What the Kaggle Datasets Scraper does

  • 🔎 Search by keyword. Free-text query against dataset titles, descriptions, and tags. Pass finance, nlp, medical imaging, etc., or leave blank to browse without a keyword.
  • 🗂️ Filter by file format. CSV, JSON, SQLite, BigQuery, or all formats. Useful when your downstream tooling only accepts one shape.
  • ⚖️ Filter by license. Restrict to Creative Commons, GPL, Open Database License, Other, or all licenses.
  • 🏷️ Filter by tag. Pass any Kaggle tag slug (classification, nlp, finance, health) to scope the run to a topic, technique, or domain.
  • 📐 Filter by size. Set minSize / maxSize in bytes to keep the result set within memory or storage limits for downstream tools.
  • 🥇 Sort the way Kaggle does. Hottest, most votes, recently updated, most active, recently published.
  • 📜 Optional full description enrichment. When enabled, each record is enriched with the dataset's long-form Markdown description, full tag list, and version history. Disable for faster runs when you only need card-level fields.

Each output record represents one public Kaggle dataset. Alongside the title, owner, and URL, the row includes the canonical ref (owner/slug), license, total bytes, current version number, usability rating, downloads, views, votes, the count of public notebooks (kernels) that reference the dataset, the topic count, last-updated timestamp, an array of tag slugs, a compact version history, and (optionally) the full Markdown description.

💡 Why it matters: Kaggle is one of the largest public catalogues of curated ML data on the web, and it sits behind a JS-heavy UI that is hard to crawl. A clean metadata feed lets you make data-sourcing decisions at the speed of SQL.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to find every CC0 image-classification dataset over 100 MB and export the list as CSV.


⚙️ Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| search | string | no | Free-text query against titles, descriptions, and tags. Leave blank to browse without a keyword. |
| maxItems | integer | no | Max datasets to return. Free tier capped at 10. Paid up to 1,000,000. |
| sortBy | enum | no | One of hottest, votes, updated, active, published. Defaults to hottest. |
| fileType | enum | no | One of all, csv, json, sqlite, bigQuery. |
| license | enum | no | One of all, cc, gpl, odb, other. |
| tag | enum | no | Pick a Kaggle tag slug from a dropdown of 400+ canonical taxonomy values (e.g. classification, nlp, finance, health). |
| user | string | no | Filter to datasets owned by a single Kaggle user or organisation slug (e.g. timoboz, mlg-ulb, organizations/google). |
| minSize | integer | no | Lower bound on dataset size in bytes. |
| maxSize | integer | no | Upper bound on dataset size in bytes. |
| includeDescription | boolean | no | Fetch the dataset detail endpoint per record to add description, full tags, and version history. Defaults to true. |
| proxyConfiguration | object | no | Apify proxy configuration. Recommended for large jobs. |

Example: top 100 finance datasets ranked by community votes, with full descriptions:

{
  "search": "finance",
  "sortBy": "votes",
  "fileType": "all",
  "license": "all",
  "includeDescription": true,
  "maxItems": 100
}

Example: up to 500 CSV datasets under 50 MB tagged nlp, sorted by most recently updated, with description bodies skipped for a faster run:

{
  "tag": "nlp",
  "fileType": "csv",
  "maxSize": 52428800,
  "sortBy": "updated",
  "includeDescription": false,
  "maxItems": 500
}
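
The byte bounds in the size filter are easy to get wrong by hand. The Python sketch below builds an input object from megabyte values and checks the enum fields against the values listed in the input table. The helper names (`mb_to_bytes`, `build_input`) are illustrative and not part of the Actor, and the conversion assumes 1 MB = 1,048,576 bytes, which is what the 52428800 figure in the example implies.

```python
from typing import Any, Dict, Optional

# Allowed enum values, copied from the input table above.
SORT_BY = {"hottest", "votes", "updated", "active", "published"}
FILE_TYPE = {"all", "csv", "json", "sqlite", "bigQuery"}
LICENSE = {"all", "cc", "gpl", "odb", "other"}


def mb_to_bytes(mb: float) -> int:
    """Convert megabytes to bytes, assuming 1 MB = 1024 * 1024 bytes."""
    return int(mb * 1024 * 1024)


def build_input(
    search: Optional[str] = None,
    tag: Optional[str] = None,
    sort_by: str = "hottest",
    file_type: str = "all",
    license_name: str = "all",
    min_size_mb: Optional[float] = None,
    max_size_mb: Optional[float] = None,
    include_description: bool = True,
    max_items: int = 10,
) -> Dict[str, Any]:
    """Build an Actor input dict, rejecting values the input table does not list."""
    for name, value, allowed in (
        ("sortBy", sort_by, SORT_BY),
        ("fileType", file_type, FILE_TYPE),
        ("license", license_name, LICENSE),
    ):
        if value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    if min_size_mb is not None and max_size_mb is not None and min_size_mb > max_size_mb:
        raise ValueError("minSize must not exceed maxSize")

    run_input: Dict[str, Any] = {
        "sortBy": sort_by,
        "fileType": file_type,
        "license": license_name,
        "includeDescription": include_description,
        "maxItems": max_items,
    }
    # Optional fields are left out entirely when unset.
    if search:
        run_input["search"] = search
    if tag:
        run_input["tag"] = tag
    if min_size_mb is not None:
        run_input["minSize"] = mb_to_bytes(min_size_mb)
    if max_size_mb is not None:
        run_input["maxSize"] = mb_to_bytes(max_size_mb)
    return run_input
```

Calling `build_input(tag="nlp", file_type="csv", max_size_mb=50, sort_by="updated", include_description=False, max_items=500)` yields the same fields as the second example, plus an explicit `"license": "all"`.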

⚠️ Good to know: the Kaggle public API tolerates direct calls but rate-limits aggressively under sustained load. For runs above a few thousand datasets, enable Apify Residential proxy in the input.


📊 Output

Each row represents one public Kaggle dataset. The optional detail enrichment adds the long-form Markdown description, the canonical tag list, and the dataset's full version history.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🖼️ thumbnailImageUrl | string (URL) | https://storage.googleapis.com/kaggle-datasets-images/310/684/.../dataset-thumbnail.jpg |
| 📛 title | string | Credit Card Fraud Detection |
| 🆔 ref | string | mlg-ulb/creditcardfraud |
| 🔗 url | string (URL) | https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud |
| ✏️ subtitle | string | Anonymized credit card transactions labeled as fraudulent or genuine |
| 👤 ownerName | string | Machine Learning Group - ULB |
| 🪪 ownerRef | string | organizations/mlg-ulb |
| 🧑‍🎨 creatorName | string | Timo Bozsolik |
| 🔗 creatorUrl | string | timoboz |
| 📜 licenseName | string | Database: Open Database, Contents: Database Contents |
| 📦 totalBytes | integer | 69155672 |
| 🔢 currentVersionNumber | integer | 3 |
| ⭐ usabilityRating | number | 0.85294 |
| ⬇️ downloadCount | integer | 1132640 |
| 👁️ viewCount | integer | 12618050 |
| ❤️ voteCount | integer | 13166 |
| 📓 kernelCount | integer | 5984 |
| 💬 topicCount | integer | 0 |
| ✨ isFeatured | boolean | false |
| 🕒 lastUpdated | string (ISO) | 2018-03-23T01:17:27.913Z |
| 📝 description | string (Markdown) | Context\n---\nIt is important... |
| 🏷️ tags | array of strings | ["finance", "crime"] |
| 📚 versions | array of objects | [{"versionNumber":3,"creationDate":"2018-03-23T01:17:27.913Z","status":"Ready",...}] |
| ⏱️ scrapedAt | string (ISO) | 2026-05-05T23:33:29.575Z |
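
To show how a row of this shape might be consumed downstream, here is a minimal Python sketch that loads a subset of the fields into a typed record. The `DatasetRecord` class and its method names are illustrative assumptions, not something the Actor ships; the sample values are the ones from the Example column of the schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List


def parse_iso(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp ending in 'Z' into a timezone-aware datetime.

    Before Python 3.11, fromisoformat accepts only 3 or 6 fractional digits,
    which matches the millisecond timestamps shown in the schema.
    """
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


@dataclass
class DatasetRecord:
    ref: str
    title: str
    total_bytes: int
    usability_rating: float
    download_count: int
    kernel_count: int
    last_updated: datetime
    tags: List[str]

    @classmethod
    def from_row(cls, row: Dict[str, Any]) -> "DatasetRecord":
        """Map the camelCase output fields onto typed attributes."""
        return cls(
            ref=row["ref"],
            title=row["title"],
            total_bytes=int(row["totalBytes"]),
            usability_rating=float(row["usabilityRating"]),
            download_count=int(row["downloadCount"]),
            kernel_count=int(row["kernelCount"]),
            last_updated=parse_iso(row["lastUpdated"]),
            tags=list(row.get("tags") or []),
        )

    def size_mb(self) -> float:
        """Dataset size in megabytes, assuming 1 MB = 1024 * 1024 bytes."""
        return self.total_bytes / (1024 * 1024)


# Values taken from the Example column of the schema table.
SAMPLE_ROW = {
    "ref": "mlg-ulb/creditcardfraud",
    "title": "Credit Card Fraud Detection",
    "totalBytes": 69155672,
    "usabilityRating": 0.85294,
    "downloadCount": 1132640,
    "kernelCount": 5984,
    "lastUpdated": "2018-03-23T01:17:27.913Z",
    "tags": ["finance", "crime"],
}

record = DatasetRecord.from_row(SAMPLE_ROW)
```

Only eight of the 24 fields are mapped here; the rest can be added the same way if a pipeline needs them.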


✨ Why choose this Actor

| | Capability |
| --- | --- |
| 🪶 | No browser overhead. Pure HTTP against Kaggle's public dataset endpoints. Cheap to run, fast to finish. |
| 🔎 | Seven filter axes. Keyword, sort order, file type, license, tag, owner, size. Combine freely. |
| 📓 | Notebook traction signal. Every record carries kernelCount, the number of public Kaggle notebooks that already use the dataset. Use it to spot which datasets practitioners actually adopt. |
| | Usability score included. Kaggle's automated 0-1 score on metadata completeness, schema clarity, and licensing comes for free with every row. |
| 📚 | Full version history. Each record carries the dataset's complete versioning trail: number, creation date, status, release notes, and contributor. |
| 🏷️ | Tags as flat slugs. Tags arrive as a clean string array, not nested objects. Drops straight into a SQL text[] or BigQuery ARRAY<STRING>. |
| 💾 | Clean dataset shape. 24 well-typed fields, no nulls on populated records, plus a direct dataset URL and thumbnail. |
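
One way to put kernelCount and usabilityRating to work is to rank exported rows by how many notebooks a dataset attracts relative to its downloads. The Python sketch below does that. The ratio is an illustrative heuristic chosen for this example, not a Kaggle metric, and the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def notebooks_per_thousand_downloads(row: Dict[str, Any]) -> float:
    """Public notebooks per 1,000 downloads; 0.0 when there are no downloads."""
    downloads = row.get("downloadCount") or 0
    if downloads <= 0:
        return 0.0
    return 1000.0 * (row.get("kernelCount") or 0) / downloads


def rank_by_traction(
    rows: Iterable[Dict[str, Any]], min_usability: float = 0.0
) -> List[Dict[str, Any]]:
    """Drop rows below a usability threshold, then sort by the ratio, highest first."""
    kept = [r for r in rows if (r.get("usabilityRating") or 0.0) >= min_usability]
    return sorted(kept, key=notebooks_per_thousand_downloads, reverse=True)
```

For the schema's example row (5,984 notebooks against 1,132,640 downloads) the ratio comes out at roughly 5.3 notebooks per thousand downloads.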

📊 Kaggle hosts more than 400,000 public datasets. This Actor exposes the full metadata layer behind that catalogue with no manual scraping.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ Kaggle Datasets Scraper (this Actor) | Apify usage only | Whole public catalogue | Live | Keyword, sort, file type, license, tag, owner, size | None, run from console |
| Official Kaggle CLI | Free | Same | Live | Same | Local install, account, API token |
| Manual JSON harvesting | Free | Same | Live | DIY | Pagination, retries, parsing yourself |
| Paid live data marketplaces | High monthly | Curated subsets only | Live | Per-vendor | Account, billing, API key |
| Static community dumps | Free | Stale, partial | Months out of date | Whatever the dump captured | Find and trust the dump |

For most teams the calculus is simple: a hosted scraper that returns clean JSON is worth more than the time spent re-implementing pagination, retries, and detail enrichment.


🚀 How to use

  1. 🆔 Create a free account. Sign up for a free Apify account with $5 of credit.
  2. 🔎 Open the Actor. Find the Kaggle Datasets Scraper on Apify Store.
  3. 📝 Fill the input form. Set a keyword and pick the filters that matter (file type, license, tag, size, sort).
  4. ▶️ Run. Click Start. The log streams listing pages and how many datasets have been collected.
  5. ⬇️ Export. Download as JSON, CSV, Excel, or stream into a Make / Zapier / n8n workflow.

⏱️ Total time to first row: under a minute for most filter combinations.


💼 Business use cases

🤖 ML and AI engineering

  • Source training corpora that match a target file format and license
  • Track which datasets are gaining notebook traction this month
  • Build internal data catalogues seeded with Kaggle metadata
  • Pre-screen datasets by usability rating before downloading

📈 Data science and analytics

  • Benchmark internal models against community baselines
  • Discover trending datasets in a vertical (finance, health, NLP)
  • Compare licensing terms across candidate datasets at scale
  • Pull a fresh top-100 list of community-favourite datasets weekly

🎓 Academia and research

  • Build reproducible bibliographies of Kaggle datasets cited in papers
  • Quantify how dataset adoption evolves through kernelCount over time
  • Seed coursework and capstone projects with curated dataset shortlists
  • Track which competition datasets remain active years after the contest
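
Tracking kernelCount over time amounts to comparing two exports of the same datasets. The Python sketch below computes the change per dataset between an older and a newer snapshot, keyed on `ref`. The function name is hypothetical, and it assumes both snapshots are lists of output rows as described in the schema.

```python
from typing import Any, Dict, Iterable


def kernel_count_deltas(
    old_rows: Iterable[Dict[str, Any]], new_rows: Iterable[Dict[str, Any]]
) -> Dict[str, int]:
    """Return the kernelCount change per ref for datasets present in both snapshots."""
    old_counts = {r["ref"]: r.get("kernelCount") or 0 for r in old_rows}
    deltas: Dict[str, int] = {}
    for row in new_rows:
        ref = row["ref"]
        # Datasets that appear only in the newer snapshot have no baseline, so skip them.
        if ref in old_counts:
            deltas[ref] = (row.get("kernelCount") or 0) - old_counts[ref]
    return deltas
```

Storing each run's `scrapedAt` alongside the deltas gives the time axis for a longitudinal series.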

🧪 Product and platform teams

  • Validate that public training data exists before committing engineering
  • Source seed datasets for AI feature prototypes and evaluations
  • Map dataset gaps your platform could fill with proprietary data
  • Run weekly sweeps to feed an internal data marketplace

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🤝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

🔌 Automating Kaggle Datasets Scraper

Run this Actor on a schedule, from your own backend, or as part of a larger pipeline.

Schedules are first-class on Apify. Set a cron, point it at this Actor's input, and your dataset stays fresh without any glue code.
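
For the "from your own backend" case, the following Python sketch starts a run over HTTP using only the standard library. It assumes the Apify API v2 endpoint that runs an Actor synchronously and returns its dataset items; the Actor ID shown is a hypothetical placeholder, so copy the real one from the Actor's page. Synchronous runs are only suitable for short jobs, so larger runs would need the asynchronous run endpoints instead.

```python
import json
import urllib.request
from typing import Any, Dict, List

API_BASE = "https://api.apify.com/v2"
# Hypothetical Actor ID, used for illustration only.
ACTOR_ID = "parseforge~kaggle-datasets-scraper"


def build_run_request(
    token: str, run_input: Dict[str, Any], actor_id: str = ACTOR_ID
) -> urllib.request.Request:
    """Build (but do not send) a POST request that runs the Actor with the given input."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(token: str, run_input: Dict[str, Any], timeout: float = 300.0) -> List[Dict[str, Any]]:
    """Send the request and return the dataset rows as a list of dicts."""
    request = build_run_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the call easy to unit-test without network access. The token is passed in a header rather than the query string so it stays out of URL logs.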



🔌 Integrate with any app

  • Make - drop the Actor into a no-code automation
  • Zapier - trigger Zaps from each finished run
  • n8n - self-hostable workflow automation
  • Slack - send completion notifications and dataset links to a channel
  • Webhooks - POST run events to any endpoint
  • Google Sheets - mirror the dataset into a Sheet for collaborators

💡 Pro Tip: browse the complete ParseForge collection for more public-data scrapers built with the same conventions.


🆘 Need Help? Open our contact form


Disclaimer: This Actor is an independent project and is not affiliated with, endorsed by, or sponsored by Kaggle or Google LLC. It only reads public dataset metadata. You are responsible for complying with applicable laws, Kaggle's terms of service, and the per-dataset licenses when using the data downstream.