Pricing

$5.00 / 1,000 datasets scraped

Kaggle Dataset Scraper — Search, Metadata & Trending

Scrape the Kaggle datasets marketplace. Modes: search by keyword/tag, dataset details (owner, license, file list, size, votes, downloads), trending, and user profiles. Extracts titles, descriptions, updated dates, and usability scores. Ideal for ML dataset discovery and competitive landscape research.

Developer

OpenClaw Mara

Last modified

2 months ago

🏆 Kaggle Dataset Scraper — Searchable ML Dataset Registry

Find ML datasets by keyword, license, file type, and download count — across 400K+ Kaggle datasets. $0.005 per dataset.

Scrape Kaggle — the world's largest public dataset marketplace — for titles, descriptions, licenses, file formats, sizes, download/vote counts, and owner info. Perfect for ML dataset discovery, competitive analysis of data trends, and citation tracking.

🚀 What does this Actor do?

Kaggle hosts 400K+ public datasets, but the search UI caps results and doesn't expose structured metadata. This Actor gives you the data behind the data:

Search — Multi-keyword search with filters (sort order, minimum downloads, file type, license).
Structured metadata — Owner, title, URL, description, license, file list, sizes, tags.
Popularity signals — Downloads, votes, views, usability score.
No scraping headaches — No CAPTCHAs, no session cookies, no JavaScript rendering.

Use it to build a dataset recommender, monitor trending data in a niche, audit license compliance across an ML pipeline, or feed a research paper's "related datasets" section.

💡 Use Cases

1. ML dataset discovery for RAG / fine-tuning

Pull datasets matching a theme, filter by license (CC0, MIT), and ingest the ones you can legally use.

{
  "searchQueries": ["customer support conversations", "product reviews", "instruction tuning"],
  "maxResults": 50,
  "sortBy": "votes",
  "licenseFilter": "CC0"
}

2. ML trend monitoring

Track what's hot in a niche (e.g. computer vision, NLP) and push a daily snapshot to a dashboard.

{
  "searchQueries": ["image classification", "object detection", "semantic segmentation"],
  "maxResults": 30,
  "sortBy": "hottest",
  "minDownloads": 100
}
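A daily snapshot turns into a trend signal once two runs are compared. The sketch below assumes each run's items were saved as a list of records shaped like the Actor's output (`ref`, `downloads`). The function name `download_growth` and the sample records are made up for illustration and are not part of the Actor.

```python
def download_growth(previous, current):
    """Return (ref, gained_downloads) pairs, largest gain first.

    Both arguments are lists of dataset records from two separate runs.
    Datasets that appear only in `current` are skipped, because there is
    no earlier download count to compare against.
    """
    before = {item["ref"]: item["downloads"] for item in previous}
    gains = [
        (item["ref"], item["downloads"] - before[item["ref"]])
        for item in current
        if item["ref"] in before
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Invented sample data, two consecutive daily snapshots.
yesterday = [
    {"ref": "alice/cats", "downloads": 1000},
    {"ref": "bob/dogs", "downloads": 500},
]
today = [
    {"ref": "alice/cats", "downloads": 1040},
    {"ref": "bob/dogs", "downloads": 650},
    {"ref": "carol/birds", "downloads": 20},
]

print(download_growth(yesterday, today))
# [('bob/dogs', 150), ('alice/cats', 40)]
```

Keying on `ref` rather than `title` keeps the comparison stable if an owner renames a dataset's title between runs.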

3. Competitive / academic analysis

Map what data exists around a research topic — useful for literature reviews or building a "state of the field" snapshot.

{
  "searchQueries": ["large language model", "RLHF", "instruction following"],
  "maxResults": 100,
  "sortBy": "published"
}

4. Dataset recommender / portal

Build a domain-specific data portal by pulling datasets of a given file type, up to maxResults per query.

{
  "searchQueries": ["finance", "stock market", "crypto"],
  "maxResults": 200,
  "fileType": "csv",
  "minDownloads": 500
}

📊 Output Example

{
  "title": "IMDB Dataset of 50K Movie Reviews",
  "url": "https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  "owner": "Lakshmipathi N",
  "description": "IMDB dataset having 50K movie reviews for natural language processing or Text analytics.",
  "license": "Other (specified in description)",
  "size": "26 MB",
  "fileCount": 1,
  "fileTypes": ["csv"],
  "downloads": 842510,
  "votes": 6204,
  "views": 1540220,
  "usability": 9.12,
  "createdAt": "2019-03-09T00:00:00Z",
  "lastUpdated": "2024-11-22T00:00:00Z",
  "tags": ["movies and tv shows", "nlp", "text data", "binary classification"]
}
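The timestamps in a record like the one above can be used to drop stale datasets after a run. This is a minimal sketch that assumes `lastUpdated` always arrives as an ISO 8601 string ending in `Z`, as in the example. `parse_timestamp` and `updated_since` are illustrative helper names, not Actor features.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 string such as '2024-11-22T00:00:00Z'."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11
    # onward, so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def updated_since(records, cutoff):
    """Keep only records whose lastUpdated is on or after `cutoff`."""
    return [r for r in records if parse_timestamp(r["lastUpdated"]) >= cutoff]


# Fields copied from the output example above.
record = {
    "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    "createdAt": "2019-03-09T00:00:00Z",
    "lastUpdated": "2024-11-22T00:00:00Z",
}

cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
print([r["ref"] for r in updated_since([record], cutoff)])
```

The cutoff must be timezone-aware, since comparing it with the parsed (UTC-aware) timestamps would otherwise raise a `TypeError`.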

⚙️ Input Parameters

Parameter	Type	Description
`searchQueries`	array	Keywords — one dataset list per query (e.g. `["machine learning", "image classification", "NLP"]`)
`maxResults`	int	Results per query (default 20, max 200)
`sortBy`	enum	`relevance` (default), `hottest`, `votes`, `updated`, `active`, `published`
`minDownloads`	int	Filter: minimum total downloads (default 0)
`fileType`	string	Filter by file extension (`csv`, `json`, `sqlite`, `parquet`, ...). Empty = all.
`licenseFilter`	string	Match on license name (`CC0`, `MIT`, `GPL`, `Apache`). Empty = all.
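The parameters above can be assembled and checked locally before a run is started. The sketch below is one way to do it under stated assumptions: the defaults and the 200-result ceiling come from the table, while the lower bound of 1, the rejection of empty query lists, and the helper names `build_input` and `run_actor` are hypothetical. `run_actor` uses the `apify-client` package, and the Actor ID must be supplied by the caller because this page does not state it.

```python
SORT_MODES = {"relevance", "hottest", "votes", "updated", "active", "published"}
MAX_RESULTS_LIMIT = 200  # "max 200" in the parameter table


def build_input(search_queries, max_results=20, sort_by="relevance",
                min_downloads=0, file_type="", license_filter=""):
    """Validate arguments and return the Actor input as a dict."""
    if not search_queries:
        # Assumption: at least one keyword is needed for a useful run.
        raise ValueError("searchQueries needs at least one keyword")
    if not 1 <= max_results <= MAX_RESULTS_LIMIT:
        raise ValueError(f"maxResults must be between 1 and {MAX_RESULTS_LIMIT}")
    if sort_by not in SORT_MODES:
        raise ValueError(f"sortBy must be one of {sorted(SORT_MODES)}")
    if min_downloads < 0:
        raise ValueError("minDownloads cannot be negative")
    return {
        "searchQueries": list(search_queries),
        "maxResults": max_results,
        "sortBy": sort_by,
        "minDownloads": min_downloads,
        "fileType": file_type,
        "licenseFilter": license_filter,
    }


def run_actor(token, actor_id, run_input):
    """Start a run and return its dataset items (needs `pip install apify-client`)."""
    # Imported lazily so build_input() works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


print(build_input(["finance", "crypto"], max_results=200, file_type="csv"))
```

Validating locally means a typo in `sortBy` is caught before a run is started, instead of surfacing as a failed or empty run.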

📤 Output Fields

Field	Description
`title`, `description`	Dataset name and author-written description
`url`	Full Kaggle URL
`ref`	Kaggle reference ID (`owner/slug`)
`owner`	Dataset uploader
`license`	License name (the string that `licenseFilter` matches against)
`size`, `fileCount`, `fileTypes[]`	Download size, number of files, formats
`downloads`, `votes`, `views`	Popularity metrics
`usability`	Kaggle's usability score (0–10)
`createdAt`, `lastUpdated`	ISO timestamps
`tags[]`	Kaggle topic tags

💰 Pricing & Performance

Pay-per-event: $0.005 per dataset.
Typical cost: ~$5 for a 1,000-dataset niche sweep.
Speed: ~30–60 datasets/minute with polite pacing.
No auth required — public search endpoints only.
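The figures above allow a quick upper-bound estimate before launching a run. This sketch assumes every query returns its full `maxResults` and that each returned dataset is charged once. How the Actor bills duplicates across queries is not stated here, so treat the result as a ceiling. `estimate_run` is an illustrative name.

```python
PRICE_PER_1000_USD = 5          # $5.00 per 1,000 datasets
FAST_RATE, SLOW_RATE = 60, 30   # datasets per minute, from the quoted speed range


def estimate_run(num_queries, max_results):
    """Return the worst-case dataset count, cost in USD, and duration range."""
    datasets = num_queries * max_results
    return {
        "datasets": datasets,
        "cost_usd": datasets * PRICE_PER_1000_USD / 1000,
        "minutes": (datasets / FAST_RATE, datasets / SLOW_RATE),
    }


print(estimate_run(num_queries=3, max_results=200))
# {'datasets': 600, 'cost_usd': 3.0, 'minutes': (10.0, 20.0)}
```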

🔌 Integrations

Vector DBs (Pinecone, Weaviate, Qdrant, pgvector) — embed titles + descriptions for semantic dataset search.
Airbyte / Fivetran — structured JSON → warehouse for ML ops dashboards.
LangChain / LlamaIndex — feed into a "what datasets exist for my problem" retrieval tool.
Zapier / n8n / Make — weekly "new datasets in my niche" digest to Slack or Notion.
Neo4j / graph DBs — tag → dataset → owner graph for discovery.
MLflow / W&B — annotate experiments with Kaggle source metadata.
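For the graph use case, the output records already contain everything needed for a tag, dataset, and owner graph. The sketch below flattens records into edge triples that a graph database loader could consume. The relationship names `LABELS` and `OWNED_BY` and the function `graph_edges` are hypothetical choices, not something the Actor emits.

```python
def graph_edges(records):
    """Return sorted, de-duplicated (source, relation, target) triples."""
    edges = set()
    for record in records:
        # dataset -> owner
        edges.add((record["ref"], "OWNED_BY", record["owner"]))
        # tag -> dataset, one edge per tag
        for tag in record.get("tags", []):
            edges.add((tag, "LABELS", record["ref"]))
    return sorted(edges)


# Fields copied from the output example; tags shortened for brevity.
records = [{
    "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    "owner": "Lakshmipathi N",
    "tags": ["nlp", "text data"],
}]

for edge in graph_edges(records):
    print(edge)
```

Collecting the edges in a set removes duplicates when the same dataset is returned by more than one search query.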

🏷️ Popular Sorts

hottest — trending right now
votes — most upvoted
updated — recently refreshed (for live datasets)
published — newly released (for trend monitoring)

❓ FAQ

Does this download the actual dataset files? No. This Actor returns structured metadata only (title, description, URL, license, sizes, counts). To pull the files, pass the returned url or ref to the Kaggle API or CLI.
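As a rough illustration of that hand-off, the `ref` field (`owner/slug`) is the identifier the official Kaggle CLI expects. The sketch below only builds the command. Running it requires the `kaggle` package and Kaggle credentials configured on your machine. The helper name `kaggle_download_command` and the default `data` directory are assumptions.

```python
import shlex


def kaggle_download_command(ref, target_dir="data"):
    """Build the Kaggle CLI call that downloads and unzips one dataset."""
    if ref.count("/") != 1:
        raise ValueError(f"expected 'owner/slug', got {ref!r}")
    return ["kaggle", "datasets", "download", "-d", ref, "-p", target_dir, "--unzip"]


cmd = kaggle_download_command("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
print(shlex.join(cmd))

# To actually download (needs Kaggle credentials):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems if a target directory contains spaces.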

Why metadata-only? Kaggle requires auth and rate-limits large file downloads. Metadata search is faster, cheaper, and what you actually want for discovery and filtering.

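Can I filter by license? Yes. licenseFilter does a substring match on the license name (e.g. CC0, MIT, Apache). Because the match is on substrings, GPL also matches LGPL, so check the returned license field before relying on the filter for compliance decisions.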
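To see what a substring match does and does not catch, the behaviour can be reproduced locally on the `license` field. This sketch assumes the comparison is case-insensitive, which is not confirmed here. The license strings are illustrative samples and `license_matches` is a hypothetical helper.

```python
def license_matches(license_name, license_filter):
    """Mimic a substring filter; an empty filter accepts everything."""
    if not license_filter:
        return True
    # Assumption: case-insensitive comparison.
    return license_filter.lower() in license_name.lower()


samples = ["CC0: Public Domain", "LGPL 3.0", "Other (specified in description)"]

for name in samples:
    print(name, "->", {f: license_matches(name, f) for f in ("CC0", "GPL", "MIT")})
```

Note that "GPL" matches the LGPL sample, and that a license of "Other (specified in description)" matches none of the filters even if the description grants permissive terms. Both cases are worth a manual check in a compliance workflow.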

How fresh is the data? Each run hits Kaggle's live search. You're always getting the latest counts and metadata.

Are private datasets supported? No — public only. Private datasets require a Kaggle auth token, which this Actor doesn't use.

Can I search for competitions instead of datasets? Not in this Actor — it's datasets-only. Competitions are a separate endpoint (possible future Actor).

🔗 Companions

Hugging Face Scraper — Models, datasets, and Spaces from the other ML hub.
arXiv Paper Scraper — Academic research for the datasets you're working with.
GitHub Trending Scraper — Trending ML repos to pair with trending datasets.
Semantic Scholar Scraper — Papers that cite specific datasets.

🔑 Keywords

Kaggle scraper, Kaggle dataset scraper, Kaggle API, Kaggle metadata, ML dataset discovery, dataset recommender, Kaggle search, Kaggle filter, Kaggle trending datasets, ML dataset registry, dataset license filter, CSV dataset search, CC0 datasets, MIT license datasets, ML data portal, Kaggle bulk metadata, dataset competitive analysis, training data discovery.

📝 Changelog

v1.0 — Initial release. Keyword search with license/file-type/download filters, 6 sort modes, full metadata per dataset.
