Kaggle Dataset Scraper — Search, Metadata & Trending
Pricing
$5.00 / 1,000 datasets scraped
Scrape the Kaggle dataset marketplace. Modes: search by keyword or tag, dataset details (owner, license, file list, size, votes, downloads), trending, and user profiles. Extracts titles, descriptions, update dates, and usability scores. Ideal for ML dataset discovery and competitive landscape research.
Developer: OpenClaw Mara
🏆 Kaggle Dataset Scraper — Searchable ML Dataset Registry
Find ML datasets by keyword, license, file type, and download count — across 400K+ Kaggle datasets. $0.005 per dataset.
Scrape Kaggle — the world's largest public dataset marketplace — for titles, descriptions, licenses, file formats, sizes, download/vote counts, and owner info. Perfect for ML dataset discovery, competitive analysis of data trends, and citation tracking.
🚀 What does this Actor do?
Kaggle hosts 400K+ public datasets, but the search UI caps results and doesn't expose structured metadata. This Actor gives you the data behind the data:
- Search — Multi-keyword search with filters (sort order, minimum downloads, file type, license).
- Structured metadata — Owner, title, URL, description, license, file list, sizes, tags.
- Popularity signals — Downloads, votes, views, usability score.
- No scraping headaches — No CAPTCHAs, no session cookies, no JavaScript rendering.
Use it to build a dataset recommender, monitor trending data in a niche, audit license compliance across an ML pipeline, or feed a research paper's "related datasets" section.
💡 Use Cases
1. ML dataset discovery for RAG / fine-tuning
Pull datasets matching a theme, filter by license (CC0, MIT), and ingest the ones you can legally use.
{
  "searchQueries": ["customer support conversations", "product reviews", "instruction tuning"],
  "maxResults": 50,
  "sortBy": "votes",
  "licenseFilter": "CC0"
}
2. ML trend monitoring
Track what's hot in a niche (e.g. computer vision, NLP) — daily snapshot to a dashboard.
{
  "searchQueries": ["image classification", "object detection", "semantic segmentation"],
  "maxResults": 30,
  "sortBy": "hottest",
  "minDownloads": 100
}
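One possible way to turn two daily runs into a trend signal is to key each record by `ref` and compare download counts. This is a minimal sketch, not something the Actor does for you: `diff_snapshots` is a hypothetical helper, and it assumes every item carries the `ref` and `downloads` fields shown in the output example.

```python
from typing import Dict, Iterable, List, Tuple


def diff_snapshots(previous: Iterable[dict], current: Iterable[dict]) -> Dict[str, list]:
    """Compare two runs keyed by `ref` (hypothetical helper, not part of the Actor)."""
    prev_by_ref = {item["ref"]: item for item in previous}
    new_refs: List[str] = []
    gains: List[Tuple[str, int]] = []
    for item in current:
        old = prev_by_ref.get(item["ref"])
        if old is None:
            # Dataset was not in the earlier snapshot at all.
            new_refs.append(item["ref"])
            continue
        delta = item.get("downloads", 0) - old.get("downloads", 0)
        if delta > 0:
            gains.append((item["ref"], delta))
    # Largest download gain first.
    gains.sort(key=lambda pair: pair[1], reverse=True)
    return {"new": new_refs, "gains": gains}
```

Feeding it yesterday's and today's dataset items yields the newly appearing refs plus a ranked list of download gains, which is usually enough to drive a simple dashboard.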
3. Competitive / academic analysis
Map what data exists around a research topic — useful for literature reviews or building a "state of the field" snapshot.
{
  "searchQueries": ["large language model", "RLHF", "instruction following"],
  "maxResults": 100,
  "sortBy": "published"
}
4. Dataset recommender / portal
Build a domain-specific data portal by pulling all datasets in a file type.
{
  "searchQueries": ["finance", "stock market", "crypto"],
  "maxResults": 200,
  "fileType": "csv",
  "minDownloads": 500
}
📊 Output Example
{
  "title": "IMDB Dataset of 50K Movie Reviews",
  "url": "https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  "owner": "Lakshmipathi N",
  "description": "IMDB dataset having 50K movie reviews for natural language processing or Text analytics.",
  "license": "Other (specified in description)",
  "size": "26 MB",
  "fileCount": 1,
  "fileTypes": ["csv"],
  "downloads": 842510,
  "votes": 6204,
  "views": 1540220,
  "usability": 9.12,
  "createdAt": "2019-03-09T00:00:00Z",
  "lastUpdated": "2024-11-22T00:00:00Z",
  "tags": ["movies and tv shows", "nlp", "text data", "binary classification"]
}
⚙️ Input Parameters
| Parameter | Type | Description |
|---|---|---|
| searchQueries | array | Keywords, one dataset list per query (e.g. ["machine learning", "image classification", "NLP"]) |
| maxResults | int | Results per query (default 20, max 200) |
| sortBy | enum | relevance (default), hottest, votes, updated, active, published |
| minDownloads | int | Filter: minimum total downloads (default 0) |
| fileType | string | Filter by file extension (csv, json, sqlite, parquet, ...). Empty = all. |
| licenseFilter | string | Match on license name (CC0, MIT, GPL, Apache). Empty = all. |
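To catch mistakes before a run is billed, the constraints in the table can be checked client-side. The sketch below is illustrative only: `build_input` is a hypothetical helper, the lower bound of 1 for `maxResults` is an assumption, and how the Actor itself reacts to out-of-range values is not documented here.

```python
from typing import List

# Values taken from the parameter table above.
SORT_OPTIONS = {"relevance", "hottest", "votes", "updated", "active", "published"}
DEFAULT_MAX_RESULTS = 20
MAX_RESULTS_LIMIT = 200


def build_input(
    search_queries: List[str],
    max_results: int = DEFAULT_MAX_RESULTS,
    sort_by: str = "relevance",
    min_downloads: int = 0,
    file_type: str = "",
    license_filter: str = "",
) -> dict:
    """Validate arguments and return a run-input dict (hypothetical helper)."""
    if not search_queries:
        raise ValueError("searchQueries needs at least one keyword")
    # Lower bound of 1 is an assumption; the table only states the maximum.
    if not 1 <= max_results <= MAX_RESULTS_LIMIT:
        raise ValueError(f"maxResults must be between 1 and {MAX_RESULTS_LIMIT}")
    if sort_by not in SORT_OPTIONS:
        raise ValueError(f"sortBy must be one of {sorted(SORT_OPTIONS)}")
    if min_downloads < 0:
        raise ValueError("minDownloads cannot be negative")
    run_input = {
        "searchQueries": list(search_queries),
        "maxResults": max_results,
        "sortBy": sort_by,
        "minDownloads": min_downloads,
    }
    # Empty strings mean "all", so the optional filters are only sent when set.
    if file_type:
        run_input["fileType"] = file_type
    if license_filter:
        run_input["licenseFilter"] = license_filter
    return run_input
```

For example, `build_input(["finance"], max_results=200, file_type="csv", min_downloads=500)` produces a dict shaped like the data-portal input in the use cases, while `max_results=201` raises a `ValueError`.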
📤 Output Fields
| Field | Description |
|---|---|
| title, description | Dataset name and author-written description |
| url | Full Kaggle URL |
| ref | Kaggle reference ID (owner/slug) |
| owner | Dataset uploader |
| license | License name (filter-compatible) |
| size, fileCount, fileTypes[] | Download size, number of files, formats |
| downloads, votes, views | Popularity metrics |
| usability | Kaggle's usability score (0–10) |
| createdAt, lastUpdated | ISO timestamps |
| tags[] | Kaggle topic tags |
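If you consume the output in Python, it can help to convert items into typed records early. The following is a sketch under assumptions: `DatasetRecord` and `parse_record` are hypothetical names, only a subset of the fields is modeled, and the timestamps are assumed to be in the `Z`-suffixed ISO form shown in the output example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class DatasetRecord:
    ref: str
    title: str
    license: str
    downloads: int
    usability: float
    created_at: datetime
    last_updated: datetime
    tags: List[str]


def parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_record(item: dict) -> DatasetRecord:
    """Convert one output item into a typed record (illustrative subset of fields)."""
    return DatasetRecord(
        ref=item["ref"],
        title=item["title"],
        license=item.get("license", ""),
        downloads=int(item.get("downloads", 0)),
        usability=float(item.get("usability", 0.0)),
        created_at=parse_timestamp(item["createdAt"]),
        last_updated=parse_timestamp(item["lastUpdated"]),
        tags=list(item.get("tags", [])),
    )
```

Parsing the timestamps into timezone-aware `datetime` objects makes it straightforward to sort by `last_updated` or to drop datasets that have not been refreshed within a chosen window.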
💰 Pricing & Performance
- Pay-per-event: $0.005 per dataset.
- Typical cost: ~$5 for a 1000-dataset niche sweep.
- Speed: ~30–60 datasets/minute with polite pacing.
- No auth required — public search endpoints only.
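The pricing and speed figures above can be combined into a rough budget check. This is a back-of-the-envelope sketch: `estimate_run` is a hypothetical helper, and it assumes every query returns the full `maxResults` and that each returned dataset is billed once, so the result is an upper bound.

```python
PRICE_PER_DATASET_USD = 0.005   # from the pricing section
RATE_RANGE_PER_MIN = (30, 60)   # quoted throughput range, datasets per minute


def estimate_run(num_queries: int, max_results: int) -> dict:
    """Upper-bound cost and duration, assuming every query returns max_results items."""
    datasets = num_queries * max_results
    slow, fast = RATE_RANGE_PER_MIN
    return {
        "datasets": datasets,
        "cost_usd": round(datasets * PRICE_PER_DATASET_USD, 2),
        "minutes_min": datasets / fast,   # best case, at 60 datasets/minute
        "minutes_max": datasets / slow,   # worst case, at 30 datasets/minute
    }
```

With 5 queries at 200 results each, this gives 1,000 datasets, about $5, and roughly 17 to 33 minutes of runtime, which matches the "typical cost" line above.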
🔌 Integrations
- Vector DBs (Pinecone, Weaviate, Qdrant, pgvector) — embed titles + descriptions for semantic dataset search.
- Airbyte / Fivetran — structured JSON → warehouse for ML ops dashboards.
- LangChain / LlamaIndex — feed into a "what datasets exist for my problem" retrieval tool.
- Zapier / n8n / Make — weekly "new datasets in my niche" digest to Slack or Notion.
- Neo4j / graph DBs — tag → dataset → owner graph for discovery.
- MLflow / W&B — annotate experiments with Kaggle source metadata.
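As an illustration of the graph integration, output items can be flattened into tag, dataset, and owner edges before loading them into a graph store. This is only a sketch: `to_edges` and the relation names `TAGS` and `OWNED_BY` are hypothetical, and the loading step for any specific database is left out.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


def to_edges(items: Iterable[dict]) -> List[Edge]:
    """Flatten output items into tag -> dataset -> owner edges (hypothetical helper)."""
    edges: List[Edge] = []
    for item in items:
        ref = item["ref"]
        for tag in item.get("tags", []):
            edges.append((tag, "TAGS", ref))
        # Skip the ownership edge when the owner field is missing or empty.
        if item.get("owner"):
            edges.append((ref, "OWNED_BY", item["owner"]))
    return edges
```

The resulting triples are plain tuples, so they can be written to CSV for bulk import or passed to whichever graph client you already use.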
🏷️ Popular Sorts
- hottest: trending right now
- votes: most upvoted
- updated: recently refreshed (for live datasets)
- published: newly released (for trend monitoring)
❓ FAQ
Does this download the actual dataset files?
No — this Actor returns structured metadata (title, description, URL, license, sizes, counts). Use the returned url + Kaggle API / CLI to pull files.
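To go from metadata to files, the `ref` field (owner/slug) can be handed to the official Kaggle CLI. The sketch below only builds the argument list; `kaggle_download_command` is a hypothetical helper, the `ref` pattern is an assumed shape, and actually running the command requires the Kaggle CLI to be installed with your own credentials configured.

```python
import re
from typing import List

# Assumed shape of a ref: "owner/slug" made of word characters, dots, and hyphens.
REF_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")


def kaggle_download_command(ref: str, dest: str = "data") -> List[str]:
    """Build the argument list for the Kaggle CLI from a `ref` value."""
    if not REF_PATTERN.match(ref):
        raise ValueError(f"unexpected ref format: {ref!r}")
    # -p sets the download directory, --unzip extracts the archive after download.
    return ["kaggle", "datasets", "download", ref, "-p", dest, "--unzip"]
```

The list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems if a ref ever contains unexpected characters.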
Why metadata-only?
Kaggle requires auth and rate-limits large file downloads. Metadata search is faster, cheaper, and what you actually want for discovery and filtering.
Can I filter by license?
Yes. licenseFilter does a substring match on license name (e.g. CC0, MIT, Apache). Good enough for most compliance workflows.
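Because `licenseFilter` takes a single string, accepting several licenses at once means either one run per license or one unfiltered run followed by a client-side filter. The sketch below shows the second option. `filter_by_license` is a hypothetical helper, and the case-insensitive comparison is an assumption of this sketch rather than confirmed behavior of the Actor.

```python
from typing import Iterable, List


def matches_any_license(license_name: str, wanted: Iterable[str]) -> bool:
    """Case-insensitive substring match against several license keywords."""
    name = license_name.lower()
    return any(keyword.lower() in name for keyword in wanted)


def filter_by_license(items: Iterable[dict], wanted: Iterable[str]) -> List[dict]:
    """Keep items whose `license` field contains any of the wanted keywords."""
    wanted = list(wanted)  # materialize once so a generator can be reused per item
    return [
        item for item in items
        if matches_any_license(item.get("license", ""), wanted)
    ]
```

Substring matching is deliberately loose: a keyword such as `GPL` also matches a license named `LGPL`, so for strict compliance checks it is safer to compare against an explicit allow-list of full license names.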
How fresh is the data?
Each run hits Kaggle's live search, so you always get the latest counts and metadata.
Are private datasets supported?
No, public only. Private datasets require a Kaggle auth token, which this Actor doesn't use.
Can I search for competitions instead of datasets?
Not in this Actor; it is datasets-only. Competitions are a separate endpoint (possible future Actor).
🔗 Companions
- Hugging Face Scraper — Models, datasets, and Spaces from the other ML hub.
- arXiv Paper Scraper — Academic research for the datasets you're working with.
- GitHub Trending Scraper — Trending ML repos to pair with trending datasets.
- Semantic Scholar Scraper — Papers that cite specific datasets.
🔑 Keywords
Kaggle scraper, Kaggle dataset scraper, Kaggle API, Kaggle metadata, ML dataset discovery, dataset recommender, Kaggle search, Kaggle filter, Kaggle trending datasets, ML dataset registry, dataset license filter, CSV dataset search, CC0 datasets, MIT license datasets, ML data portal, Kaggle bulk metadata, dataset competitive analysis, training data discovery.
📝 Changelog
- v1.0 — Initial release. Keyword search with license/file-type/download filters, 6 sort modes, full metadata per dataset.