Kaggle Dataset Scraper — Search, Metadata & Trending

Pricing

$5.00 / 1,000 datasets scraped

Scrape the Kaggle datasets marketplace. Modes: search by keyword/tag, dataset details (owner, license, file list, size, votes, downloads), trending, and user profiles. Extracts titles, descriptions, update dates, and usability scores. Ideal for ML dataset discovery and competitive landscape research.

Rating

0.0

(0)

Developer

OpenClaw Mara

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 0

Last modified: 18 days ago

🏆 Kaggle Dataset Scraper — Searchable ML Dataset Registry

Find ML datasets by keyword, license, file type, and download count — across 400K+ Kaggle datasets. $0.005 per dataset.

Scrape Kaggle — the world's largest public dataset marketplace — for titles, descriptions, licenses, file formats, sizes, download/vote counts, and owner info. Perfect for ML dataset discovery, competitive analysis of data trends, and citation tracking.

🚀 What does this Actor do?

Kaggle hosts 400K+ public datasets, but the search UI caps results and doesn't expose structured metadata. This Actor gives you the data behind the data:

  • Search — Multi-keyword search with filters (sort order, minimum downloads, file type, license).
  • Structured metadata — Owner, title, URL, description, license, file list, sizes, tags.
  • Popularity signals — Downloads, votes, views, usability score.
  • No scraping headaches — No CAPTCHAs, no session cookies, no JavaScript rendering.

Use it to build a dataset recommender, monitor trending data in a niche, audit license compliance across an ML pipeline, or feed a research paper's "related datasets" section.

💡 Use Cases

1. ML dataset discovery for RAG / fine-tuning

Pull datasets matching a theme, filter by license (CC0, MIT), and ingest the ones you can legally use.

{
  "searchQueries": ["customer support conversations", "product reviews", "instruction tuning"],
  "maxResults": 50,
  "sortBy": "votes",
  "licenseFilter": "CC0"
}
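
One way to submit an input like the one above programmatically is through the Apify API's synchronous run endpoint. The sketch below only builds the HTTP request with the standard library and does not send it. `ACTOR_ID` and the token are placeholders (assumptions, not values from this page); the real Actor ID is shown on the Actor's Apify Store page.

```python
import json
import urllib.request

# Placeholder, not the real ID: replace with "<username>~<actor-name>" from the Store page.
ACTOR_ID = "USERNAME~ACTOR_NAME"


def build_run_request(actor_input: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that runs the Actor and
    returns its dataset items in the response body."""
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    {
        "searchQueries": ["customer support conversations", "product reviews"],
        "maxResults": 50,
        "sortBy": "votes",
        "licenseFilter": "CC0",
    },
    token="YOUR_APIFY_TOKEN",  # placeholder
)
```

With a real Actor ID and token, `json.load(urllib.request.urlopen(request))` should return the list of dataset records. Long sweeps may exceed the synchronous endpoint's time limit, in which case an asynchronous run is the safer choice.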

2. ML trend monitoring

Track what's hot in a niche (e.g. computer vision, NLP) — daily snapshot to a dashboard.

{
  "searchQueries": ["image classification", "object detection", "semantic segmentation"],
  "maxResults": 30,
  "sortBy": "hottest",
  "minDownloads": 100
}
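
For the daily-snapshot idea, a dashboard usually needs the change between two runs rather than the raw counts. The sketch below compares two snapshots by `ref` and ranks datasets by downloads gained. The function name and the sample values are illustrative assumptions, not part of the Actor.

```python
def download_deltas(previous: list[dict], current: list[dict]) -> list[tuple[str, int]]:
    """Return (ref, downloads gained) pairs, largest gain first.

    Datasets that are missing from the previous snapshot are skipped,
    because no baseline exists for them yet.
    """
    before = {row["ref"]: row["downloads"] for row in previous}
    deltas = [
        (row["ref"], row["downloads"] - before[row["ref"]])
        for row in current
        if row["ref"] in before
    ]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# Illustrative values only, not real Kaggle data.
yesterday = [
    {"ref": "alice/cats", "downloads": 1000},
    {"ref": "bob/dogs", "downloads": 400},
]
today = [
    {"ref": "alice/cats", "downloads": 1040},
    {"ref": "bob/dogs", "downloads": 520},
    {"ref": "eve/birds", "downloads": 10},
]
trending = download_deltas(yesterday, today)
```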

3. Competitive / academic analysis

Map what data exists around a research topic — useful for literature reviews or building a "state of the field" snapshot.

{
  "searchQueries": ["large language model", "RLHF", "instruction following"],
  "maxResults": 100,
  "sortBy": "published"
}

4. Dataset recommender / portal

Build a domain-specific data portal by pulling all datasets in a file type.

{
  "searchQueries": ["finance", "stock market", "crypto"],
  "maxResults": 200,
  "fileType": "csv",
  "minDownloads": 500
}

📊 Output Example

{
  "title": "IMDB Dataset of 50K Movie Reviews",
  "url": "https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  "owner": "Lakshmipathi N",
  "description": "IMDB dataset having 50K movie reviews for natural language processing or Text analytics.",
  "license": "Other (specified in description)",
  "size": "26 MB",
  "fileCount": 1,
  "fileTypes": ["csv"],
  "downloads": 842510,
  "votes": 6204,
  "views": 1540220,
  "usability": 9.12,
  "createdAt": "2019-03-09T00:00:00Z",
  "lastUpdated": "2024-11-22T00:00:00Z",
  "tags": ["movies and tv shows", "nlp", "text data", "binary classification"]
}
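
A record shaped like the example above can be turned into a few derived signals for ranking or filtering. The sketch below is one possible approach: the `summarize` helper and the derived field names are assumptions made for illustration, and only the input keys come from the output example.

```python
from datetime import datetime


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO timestamp that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(record: dict) -> dict:
    """Derive simple engagement ratios and the span between creation and last update."""
    downloads = record["downloads"]
    views = record["views"]
    created = parse_timestamp(record["createdAt"])
    updated = parse_timestamp(record["lastUpdated"])
    return {
        "ref": record["ref"],
        # Ratios are None when the denominator is zero.
        "votes_per_1k_downloads": round(1000 * record["votes"] / downloads, 2) if downloads else None,
        "downloads_per_view": round(downloads / views, 3) if views else None,
        "days_from_creation_to_update": (updated - created).days,
    }


# Subset of the output example shown above.
record = {
    "ref": "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
    "downloads": 842510,
    "votes": 6204,
    "views": 1540220,
    "createdAt": "2019-03-09T00:00:00Z",
    "lastUpdated": "2024-11-22T00:00:00Z",
}
summary = summarize(record)
```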

⚙️ Input Parameters

| Parameter | Type | Description |
|---|---|---|
| searchQueries | array | Keywords; one dataset list per query (e.g. ["machine learning", "image classification", "NLP"]) |
| maxResults | int | Results per query (default 20, max 200) |
| sortBy | enum | relevance (default), hottest, votes, updated, active, published |
| minDownloads | int | Filter: minimum total downloads (default 0) |
| fileType | string | Filter by file extension (csv, json, sqlite, parquet, ...). Empty = all. |
| licenseFilter | string | Match on license name (CC0, MIT, GPL, Apache). Empty = all. |
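
The defaults and limits listed above can be checked on the client before a run is started. The following is a minimal sketch, not the Actor's own validation: the `build_input` helper is hypothetical, and the lower bound of 1 for `maxResults` and the non-empty `searchQueries` requirement are assumptions.

```python
SORT_MODES = {"relevance", "hottest", "votes", "updated", "active", "published"}


def build_input(
    search_queries: list[str],
    max_results: int = 20,
    sort_by: str = "relevance",
    min_downloads: int = 0,
    file_type: str = "",
    license_filter: str = "",
) -> dict:
    """Return an Actor input dict, raising ValueError on out-of-range values."""
    if not search_queries:
        raise ValueError("searchQueries needs at least one keyword")  # assumed requirement
    if not 1 <= max_results <= 200:  # upper bound from the table; lower bound assumed
        raise ValueError("maxResults must be between 1 and 200")
    if sort_by not in SORT_MODES:
        raise ValueError(f"sortBy must be one of {sorted(SORT_MODES)}")
    if min_downloads < 0:
        raise ValueError("minDownloads cannot be negative")
    return {
        "searchQueries": list(search_queries),
        "maxResults": max_results,
        "sortBy": sort_by,
        "minDownloads": min_downloads,
        "fileType": file_type,            # empty string = all file types
        "licenseFilter": license_filter,  # empty string = all licenses
    }


actor_input = build_input(["finance", "crypto"], max_results=200, file_type="csv")
```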

📤 Output Fields

| Field | Description |
|---|---|
| title, description | Dataset name and author-written description |
| url | Full Kaggle URL |
| ref | Kaggle reference ID (owner/slug) |
| owner | Dataset uploader |
| license | License name (filter-compatible) |
| size, fileCount, fileTypes[] | Download size, number of files, formats |
| downloads, votes, views | Popularity metrics |
| usability | Kaggle's usability score (0–10) |
| createdAt, lastUpdated | ISO timestamps |
| tags[] | Kaggle topic tags |

💰 Pricing & Performance

  • Pay-per-event: $0.005 per dataset.
  • Typical cost: ~$5 for a 1000-dataset niche sweep.
  • Speed: ~30–60 datasets/minute with polite pacing.
  • No auth required — public search endpoints only.
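
The figures above make it easy to estimate a run before launching it. The sketch below assumes every returned dataset is billed (including any duplicates across queries) and that each query returns its full `maxResults`, so it gives an upper bound. The `estimate_run` helper is illustrative.

```python
PRICE_PER_DATASET_USD = 0.005  # pay-per-event price quoted above
MIN_RATE, MAX_RATE = 30, 60    # datasets per minute quoted above


def estimate_run(num_queries: int, max_results: int) -> dict:
    """Upper-bound estimate of billed datasets, cost, and duration for one run."""
    datasets = num_queries * max_results
    return {
        "datasets": datasets,
        "cost_usd": round(datasets * PRICE_PER_DATASET_USD, 2),
        "minutes_fastest": round(datasets / MAX_RATE, 1),
        "minutes_slowest": round(datasets / MIN_RATE, 1),
    }


# Five queries at the maximum of 200 results each: a 1,000-dataset sweep.
estimate = estimate_run(num_queries=5, max_results=200)
```

For this sweep the estimate comes to $5.00 and roughly 17 to 33 minutes.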

🔌 Integrations

  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector) — embed titles + descriptions for semantic dataset search.
  • Airbyte / Fivetran — structured JSON → warehouse for ML ops dashboards.
  • LangChain / LlamaIndex — feed into a "what datasets exist for my problem" retrieval tool.
  • Zapier / n8n / Make — weekly "new datasets in my niche" digest to Slack or Notion.
  • Neo4j / graph DBs — tag → dataset → owner graph for discovery.
  • MLflow / W&B — annotate experiments with Kaggle source metadata.

Useful sortBy values for these workflows:

  • hottest: trending right now
  • votes: most upvoted
  • updated: recently refreshed (for live datasets)
  • published: newly released (for trend monitoring)
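
As an illustration of the graph integration, the tag → dataset → owner structure can be flattened into an edge list before loading it into a graph database. This is a sketch only: the relationship names `TAGS` and `OWNED_BY` are hypothetical, and the loading step depends on the database driver in use.

```python
def graph_edges(records: list[dict]) -> list[tuple[str, str, str]]:
    """Return sorted, de-duplicated (source, relationship, target) triples."""
    edges = set()
    for record in records:
        edges.add((record["ref"], "OWNED_BY", record["owner"]))
        for tag in record.get("tags", []):
            edges.add((tag, "TAGS", record["ref"]))
    return sorted(edges)


# Illustrative records, not real Kaggle data.
records = [
    {"ref": "alice/cats", "owner": "Alice", "tags": ["images", "animals"]},
    {"ref": "alice/dogs", "owner": "Alice", "tags": ["images"]},
]
edges = graph_edges(records)
```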

❓ FAQ

Does this download the actual dataset files? No. This Actor returns structured metadata only (title, description, URL, license, sizes, counts). To pull the files, pass the returned ref (or url) to the Kaggle API or CLI.
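
As a sketch of that hand-off, the returned `ref` can be turned into a Kaggle CLI invocation. The helper below only builds the command and does not run it. It assumes the official `kaggle` CLI is installed and configured with Kaggle credentials; the `data` destination folder is an arbitrary choice.

```python
import shlex


def kaggle_download_command(ref: str, dest: str = "data") -> list[str]:
    """Build the CLI command that downloads and unzips one dataset."""
    owner, _, slug = ref.partition("/")
    if not owner or not slug:
        raise ValueError(f"expected 'owner/slug', got {ref!r}")
    return ["kaggle", "datasets", "download", "-d", ref, "-p", dest, "--unzip"]


command = kaggle_download_command("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
printable = shlex.join(command)  # shell-safe string for logging
```

Passing `command` to `subprocess.run(command, check=True)` would perform the download.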

Why metadata-only? Kaggle requires auth and rate-limits large file downloads. Metadata search is faster, cheaper, and what you actually want for discovery and filtering.

Can I filter by license? Yes. licenseFilter does a substring match on license name (e.g. CC0, MIT, Apache). Good enough for most compliance workflows.
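
Substring matching is simple, but it has edge cases worth knowing before relying on it for compliance. The sketch below mimics the described behavior on the client side. Case-insensitive comparison is an assumption, since the source does not say how the Actor handles case, and the license strings are illustrative.

```python
def license_matches(license_name: str, license_filter: str) -> bool:
    """Substring match on the license name; an empty filter matches everything."""
    if not license_filter:
        return True
    return license_filter.lower() in license_name.lower()


checks = {
    "cc0_hit": license_matches("CC0: Public Domain", "CC0"),
    "unspecified": license_matches("Other (specified in description)", "CC0"),
    # Caveat: a short filter can also match a different license family.
    "gpl_in_lgpl": license_matches("LGPL 3.0", "GPL"),
    "empty_filter": license_matches("MIT", ""),
}
```

The `gpl_in_lgpl` case evaluates to `True`, so a compliance workflow would likely want an exact-match allowlist applied to the returned `license` field as a second step.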

How fresh is the data? Each run hits Kaggle's live search. You're always getting the latest counts and metadata.

Are private datasets supported? No — public only. Private datasets require a Kaggle auth token, which this Actor doesn't use.

Can I search for competitions instead of datasets? Not in this Actor — it's datasets-only. Competitions are a separate endpoint (possible future Actor).

🔑 Keywords

Kaggle scraper, Kaggle dataset scraper, Kaggle API, Kaggle metadata, ML dataset discovery, dataset recommender, Kaggle search, Kaggle filter, Kaggle trending datasets, ML dataset registry, dataset license filter, CSV dataset search, CC0 datasets, MIT license datasets, ML data portal, Kaggle bulk metadata, dataset competitive analysis, training data discovery.

📝 Changelog

  • v1.0 — Initial release. Keyword search with license/file-type/download filters, 6 sort modes, full metadata per dataset.