Kaggle Datasets & Models Scraper
Pricing
Pay per event
Scrape datasets and ML models from Kaggle including metadata, votes, downloads, and more
Developer
Stas Persiianenko
Extract metadata for datasets and ML models from Kaggle — the world's largest data science community with 15 million users, 300,000+ datasets, and a growing library of open-source ML models. No Kaggle account or API key required.
What Does It Do?
This actor scrapes Kaggle's public API to return structured metadata about datasets and machine learning models. You can filter by keyword, sort by popularity or recency, and collect hundreds of results in seconds.
Use it to discover ML training datasets, benchmark models, track popular research topics, or automate data pipeline discovery — all without a Kaggle login.
Who Is It For?
🧪 ML researchers and data scientists — Find relevant datasets for your next project. Search by topic (climate, NLP, finance) and sort by votes or downloads to surface the best-quality data fast.
🏢 AI/ML teams at companies — Audit the Kaggle landscape for datasets in your domain. Identify which public datasets competitors are using or which models are trending.
📊 Data journalists and analysts — Track trending datasets across topics. Build lists of the most-used datasets in a given field for reporting or research.
🤖 AI application developers — Programmatically discover training datasets and pre-trained models to power your LLM fine-tuning, classification, or computer vision pipelines.
📈 Market researchers — Monitor Kaggle model and dataset growth over time. Identify what AI topics are gaining traction by tracking vote counts and download trends.
Why Use This Actor?
- ✅ No login required — Kaggle's public API works without authentication
- ✅ Structured JSON output — All fields are normalized and flattened (no nested objects)
- ✅ Covers both datasets and models — Scrape either or both in one run
- ✅ Keyword search support — Filter by any topic (e.g., "climate", "healthcare", "llama")
- ✅ Multiple sort orders — Hottest, most votes, downloads, recently updated
- ✅ Pagination handled automatically — Just set `maxResults` and get paginated results
- ✅ Fast and cheap — Pure HTTP, no browser needed, results in seconds
What Data Is Extracted?
Datasets
| Field | Description | Example |
|---|---|---|
| `type` | Record type | "dataset" |
| `id` | Kaggle dataset ID | 29 |
| `ref` | Owner/slug reference | "berkeleyearth/climate-change-earth-surface-temperature-data" |
| `url` | Full Kaggle URL | "https://www.kaggle.com/datasets/..." |
| `title` | Dataset title | "Climate Change: Earth Surface Temperature Data" |
| `subtitle` | Short description | "Exploring global temperatures since 1750" |
| `description` | Long description | "This dataset contains records of surface temperatures..." |
| `ownerName` | Owner display name | "Berkeley Earth" |
| `ownerRef` | Owner slug | "organizations/berkeleyearth" |
| `creatorName` | Creator name | "John Doe" |
| `licenseName` | Data license | "CC BY-NC-SA 4.0" |
| `totalBytes` | File size in bytes | 88843537 |
| `voteCount` | Number of votes/upvotes | 2453 |
| `downloadCount` | Total download count | 181589 |
| `viewCount` | Total page views | 1242616 |
| `kernelCount` | Number of notebooks using it | 695 |
| `currentVersionNumber` | Dataset version | 2 |
| `usabilityRating` | Kaggle usability score (0 to 1) | 0.76 |
| `isPrivate` | Whether dataset is private | false |
| `isFeatured` | Whether featured by Kaggle | false |
| `lastUpdated` | Last update timestamp | "2024-02-22T08:53:54.627Z" |
| `thumbnailImageUrl` | Thumbnail image URL | "https://storage.googleapis.com/..." |
| `tags` | List of topic tags | ["climate", "education", "data visualization"] |
Models
| Field | Description | Example |
|---|---|---|
| `type` | Record type | "model" |
| `id` | Kaggle model ID | 619281 |
| `ref` | Owner/slug reference | "kienngx/nemotron-nano-30b-trained" |
| `url` | Full Kaggle URL | "https://www.kaggle.com/models/..." |
| `title` | Model title | "Nemotron-Nano-30B variances" |
| `subtitle` | Short description | "LoRA fine-tuned adapter for reasoning tasks" |
| `description` | Model description (up to 500 chars) | "## Model Details..." |
| `author` | Author display name | "Ngô Xuân Kiên" |
| `slug` | Model slug | "nemotron-nano-30b-trained" |
| `voteCount` | Number of votes | 55 |
| `isPrivate` | Whether model is private | false |
| `authorImageUrl` | Author avatar URL | "https://storage.googleapis.com/kaggle-avatars/..." |
How Much Does It Cost to Scrape Kaggle Datasets?
This actor uses Pay-Per-Event (PPE) pricing. You are charged a flat fee when a run starts, plus a per-result fee that depends on your Apify subscription tier:
| Event | FREE | BRONZE | SILVER | GOLD | PLATINUM | DIAMOND |
|---|---|---|---|---|---|---|
| Run start (one-time) | $0.005 | $0.005 | $0.005 | $0.005 | $0.005 | $0.005 |
| Per result extracted | $0.00115 | $0.001 | $0.00078 | $0.0006 | $0.0004 | $0.00028 |
Cost examples (BRONZE tier):
- 100 datasets = $0.005 + 100 × $0.001 = $0.105
- 500 datasets = $0.005 + 500 × $0.001 = $0.505
- 1,000 datasets + models = $0.005 + 1,000 × $0.001 = $1.005
Free plan estimate: Apify's free tier includes $5/month, which is enough for approximately 4,300 results per month at the FREE tier rate ($0.00115/result).
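The arithmetic behind these estimates can be sketched as a small helper. The rates are copied from the pricing table above; the function names `estimate_cost` and `max_results_for_budget` are hypothetical, and actual billing is determined by Apify, so treat the output as a rough estimate only.

```python
import math

# Per-result rates in USD, copied from the pricing table above.
PER_RESULT_USD = {
    "FREE": 0.00115,
    "BRONZE": 0.001,
    "SILVER": 0.00078,
    "GOLD": 0.0006,
    "PLATINUM": 0.0004,
    "DIAMOND": 0.00028,
}
RUN_START_USD = 0.005  # flat fee charged once per run


def estimate_cost(results: int, tier: str = "BRONZE") -> float:
    """Estimated cost in USD of one run that returns `results` items."""
    return RUN_START_USD + results * PER_RESULT_USD[tier]


def max_results_for_budget(budget_usd: float, tier: str = "FREE") -> int:
    """Largest result count a single run can return within `budget_usd`."""
    remaining = budget_usd - RUN_START_USD
    return max(0, math.floor(remaining / PER_RESULT_USD[tier]))


print(f"BRONZE, 100 results: ${estimate_cost(100):.3f}")
print(f"FREE tier, $5 budget: {max_results_for_budget(5.0)} results")
```

Splitting the same number of results across several runs costs slightly more, because the run-start fee is charged once per run.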
How to Use It
Step 1: Choose what to scrape
Set `searchMode` to:
- `"datasets"` — scrape Kaggle datasets only
- `"models"` — scrape ML models only
- `"both"` — scrape datasets and models (results split 50/50)
Step 2: Optionally add a search query
Use the search field to filter results by keyword (e.g., "natural language processing", "computer vision", "finance").
Step 3: Set your sort order
For datasets: hottest (trending), votes (most upvoted), updated (recent), active (most notebooks), published (newest).
For models: hotness (trending), downloadCount (most downloaded).
Step 4: Set maxResults
Set maxResults to however many items you need. The actor will paginate automatically. For example, maxResults: 200 fetches 200 items across multiple pages.
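As a rough illustration of what "across multiple pages" means, the page count follows from simple ceiling division. This sketch assumes a page size of 20 items, the figure quoted in the Tips section of this README; the actor's internal pagination logic is not documented here, and the helper names are hypothetical.

```python
import math

PAGE_SIZE = 20  # assumption: page size quoted in the Tips section


def pages_needed(max_results: int, page_size: int = PAGE_SIZE) -> int:
    """Number of page requests needed to collect `max_results` items."""
    if max_results <= 0:
        return 0
    return math.ceil(max_results / page_size)


def page_plan(max_results: int, page_size: int = PAGE_SIZE) -> list[int]:
    """How many items to keep from each page, in request order."""
    full_pages, remainder = divmod(max(0, max_results), page_size)
    return [page_size] * full_pages + ([remainder] if remainder else [])


print(pages_needed(200))
print(page_plan(50))
```

Under that assumption, `maxResults: 200` corresponds to ten page requests, and a value that is not a multiple of 20 simply truncates the last page.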
Step 5: Run and download results
Results appear in the actor's dataset. Download as JSON, CSV, XLSX, or use the API.
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchMode` | string | "datasets" | What to scrape: datasets, models, or both |
| `search` | string | "" | Keyword to filter results |
| `datasetSortBy` | string | "hottest" | Dataset sort: hottest, votes, updated, active, published |
| `modelSortBy` | string | "hotness" | Model sort: hotness, downloadCount |
| `maxResults` | integer | 100 | Maximum number of items to return |
| `maxRequestRetries` | integer | 3 | Retry attempts for failed requests |
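Because the sort options differ between datasets and models, it can help to check an input locally before starting a paid run. The following is a minimal sketch, not part of the actor: `build_run_input` is a hypothetical helper, the allowed values and defaults are taken from the table above, and the 10,000 upper bound comes from the FAQ. The lower bound of 1 is an assumption.

```python
SEARCH_MODES = {"datasets", "models", "both"}
DATASET_SORTS = {"hottest", "votes", "updated", "active", "published"}
MODEL_SORTS = {"hotness", "downloadCount"}
MAX_RESULTS_LIMIT = 10_000  # upper bound quoted in the FAQ


def build_run_input(
    search_mode: str = "datasets",
    search: str = "",
    dataset_sort_by: str = "hottest",
    model_sort_by: str = "hotness",
    max_results: int = 100,
    max_request_retries: int = 3,
) -> dict:
    """Check values against the documented options and return the input dict."""
    if search_mode not in SEARCH_MODES:
        raise ValueError(f"searchMode must be one of {sorted(SEARCH_MODES)}")
    if dataset_sort_by not in DATASET_SORTS:
        raise ValueError(f"datasetSortBy must be one of {sorted(DATASET_SORTS)}")
    if model_sort_by not in MODEL_SORTS:
        raise ValueError(f"modelSortBy must be one of {sorted(MODEL_SORTS)}")
    if not 1 <= max_results <= MAX_RESULTS_LIMIT:
        raise ValueError(f"maxResults must be between 1 and {MAX_RESULTS_LIMIT}")
    return {
        "searchMode": search_mode,
        "search": search,
        "datasetSortBy": dataset_sort_by,
        "modelSortBy": model_sort_by,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


run_input = build_run_input(search_mode="both", search="llm", max_results=200)
print(run_input)
```

The returned dictionary uses the same keys as the table, so it can be passed as the actor input as-is.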
Output Example

    {
        "type": "dataset",
        "id": 29,
        "ref": "berkeleyearth/climate-change-earth-surface-temperature-data",
        "url": "https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data",
        "title": "Climate Change: Earth Surface Temperature Data",
        "subtitle": "Exploring global temperatures since 1750",
        "description": "",
        "ownerName": "Berkeley Earth",
        "ownerRef": "organizations/berkeleyearth",
        "creatorName": "[Deleted User]",
        "licenseName": "CC BY-NC-SA 4.0",
        "totalBytes": 88843537,
        "voteCount": 2453,
        "downloadCount": 181589,
        "viewCount": 1242616,
        "kernelCount": 695,
        "currentVersionNumber": 2,
        "usabilityRating": 0.7647059,
        "isPrivate": false,
        "isFeatured": false,
        "lastUpdated": "2017-05-01T17:29:10.78Z",
        "thumbnailImageUrl": "https://storage.googleapis.com/kaggle-datasets-images/29/33/default-backgrounds/dataset-thumbnail.jpg",
        "tags": ["atmospheric science", "environment", "business", "news"]
    }
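Since every field is flat, derived values are easy to compute from a record. The sketch below uses a subset of the example record above; the derived field names (`sizeMB`, `votesPer1kDownloads`, `usabilityPct`) are hypothetical and are not produced by the actor.

```python
# Subset of the example record shown above.
record = {
    "ref": "berkeleyearth/climate-change-earth-surface-temperature-data",
    "totalBytes": 88843537,
    "voteCount": 2453,
    "downloadCount": 181589,
    "usabilityRating": 0.7647059,
}


def summarize(rec: dict) -> dict:
    """Compute a few convenience values from one dataset record."""
    downloads = rec["downloadCount"]
    votes_per_1k = 1000 * rec["voteCount"] / downloads if downloads else 0.0
    return {
        "ref": rec["ref"],
        "sizeMB": round(rec["totalBytes"] / 1_000_000, 2),  # decimal megabytes
        "votesPer1kDownloads": round(votes_per_1k, 2),
        "usabilityPct": round(100 * rec["usabilityRating"]),
    }


print(summarize(record))
```

The same byte-to-megabyte conversion is what a spreadsheet or Airtable formula field would do for file size display.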
Tips and Best Practices
💡 Use keyword search for niche topics — The search field is powerful. Try terms like "llm fine-tuning", "medical imaging", "stock market", or "speech recognition" to find highly specific datasets.
💡 Sort by votes for quality — Highly upvoted datasets tend to be well-documented, clean, and widely used. Use votes sort when you want the most trusted datasets in a category.
💡 Run both mode for AI model discovery — If you're building an LLM application, run with searchMode: "both" and search: "llm" to discover both training datasets and pre-trained models in one run.
💡 Download in CSV for spreadsheets — Click "Export" → "CSV" in the run output tab to get a clean spreadsheet of all results.
💡 Set a higher maxResults for comprehensive coverage — Kaggle returns 20 items per page. For broad topic searches, set maxResults: 500 or higher to get full coverage.
💡 Track trends over time — Schedule this actor to run weekly and save results to a Google Sheet to track which datasets are growing in popularity.
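One way to turn scheduled runs into a trend report is to compare two result sets by `ref`. This is a minimal sketch under the assumption that you keep the items from each run; `vote_growth` is a hypothetical helper, and the refs and vote counts in the sample data are made up for illustration.

```python
def vote_growth(previous: list[dict], current: list[dict]) -> list[dict]:
    """Vote change per item present in both snapshots, largest gain first."""
    previous_votes = {item["ref"]: item.get("voteCount", 0) for item in previous}
    growth = []
    for item in current:
        ref = item["ref"]
        if ref in previous_votes:
            delta = item.get("voteCount", 0) - previous_votes[ref]
            growth.append({"ref": ref, "voteDelta": delta})
    return sorted(growth, key=lambda entry: entry["voteDelta"], reverse=True)


# Illustrative snapshots only; these are not real Kaggle records.
last_week = [
    {"ref": "example-owner/dataset-a", "voteCount": 100},
    {"ref": "example-owner/dataset-b", "voteCount": 40},
]
this_week = [
    {"ref": "example-owner/dataset-a", "voteCount": 105},
    {"ref": "example-owner/dataset-b", "voteCount": 70},
    {"ref": "example-owner/dataset-c", "voteCount": 12},
]

for entry in vote_growth(last_week, this_week):
    print(entry["ref"], entry["voteDelta"])
```

Items that appear only in the newer snapshot are skipped here; reporting them separately as "new this week" would be a reasonable extension.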
Integrations
📊 Connect to Google Sheets
Export Kaggle dataset metadata directly to Google Sheets for tracking and analysis:
- Run this actor with your search query
- Go to Integrations → Google Sheets in the Apify console
- Connect your spreadsheet and select the export fields
🔗 Connect to Airtable
Build a dataset discovery database in Airtable:
- Use the Apify → Airtable integration to push results to a base
- Create filtered views by license type, vote count, or topic tags
- Add a formula field to compute bytes to GB for file size display
⚡ Use with Zapier or Make
Automate notifications when popular new datasets appear:
- Schedule this actor to run daily
- Use Zapier's Apify trigger to catch new results
- Send Slack alerts or email digests for datasets with `voteCount > 500`
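If you prefer to apply the vote threshold in your own code rather than in a Zapier or Make filter step, the check is a one-line comparison on `voteCount`. The sketch below assumes the items have already been fetched from the run's dataset; `popular_datasets` is a hypothetical helper and the sample items are made up for illustration.

```python
def popular_datasets(items: list[dict], min_votes: int = 500) -> list[dict]:
    """Keep dataset records whose voteCount is strictly above `min_votes`."""
    return [
        item
        for item in items
        if item.get("type") == "dataset" and item.get("voteCount", 0) > min_votes
    ]


# Illustrative items only; these are not real Kaggle records.
items = [
    {"type": "dataset", "title": "Example A", "voteCount": 2453},
    {"type": "dataset", "title": "Example B", "voteCount": 500},
    {"type": "model", "title": "Example C", "voteCount": 900},
]

for item in popular_datasets(items):
    print(f"{item['title']}: {item['voteCount']} votes")
```

The comparison is strict, matching `voteCount > 500`, so an item with exactly 500 votes is not included.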
📦 Export to MongoDB or BigQuery
Store results in your data warehouse for longitudinal analysis:
- Download results as JSONL from the dataset
- Use `mongoimport` or BigQuery's JSON import to load
- Query by tag, vote count, download count over time
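If you fetch items through the API instead of downloading a JSONL export, you can write the newline-delimited file yourself. This is a small sketch using only the standard library; `write_jsonl` is a hypothetical helper and the sample item is a trimmed version of the output example in this README.

```python
import io
import json
from typing import Iterable, TextIO


def write_jsonl(items: Iterable[dict], out: TextIO) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    for item in items:
        # ensure_ascii=False keeps non-ASCII names readable in the file.
        out.write(json.dumps(item, ensure_ascii=False) + "\n")
        count += 1
    return count


sample = [
    {"type": "dataset", "id": 29, "voteCount": 2453, "tags": ["environment", "news"]},
]

buffer = io.StringIO()  # stand-in for open("kaggle.jsonl", "w", encoding="utf-8")
written = write_jsonl(sample, buffer)
print(written, "line(s) written")
```

Both `mongoimport` and BigQuery load jobs accept newline-delimited JSON, which is why each record goes on its own line rather than inside one large array.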
API Usage
Node.js

    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

    // Start the actor and wait for the run to finish.
    const run = await client.actor('automation-lab/kaggle-scraper').call({
        searchMode: 'datasets',
        search: 'natural language processing',
        datasetSortBy: 'votes',
        maxResults: 100,
    });

    // Read the results from the run's default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Got ${items.length} datasets`);
    items.forEach(item => {
        console.log(`${item.title} — ${item.voteCount} votes, ${item.downloadCount} downloads`);
    });

Python

    from apify_client import ApifyClient

    client = ApifyClient("YOUR_APIFY_TOKEN")

    # Start the actor and wait for the run to finish.
    run = client.actor("automation-lab/kaggle-scraper").call(run_input={
        "searchMode": "both",
        "search": "computer vision",
        "maxResults": 200,
    })

    # Read the results from the run's default dataset.
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    print(f"Extracted {len(items)} items")
    for item in items:
        print(f"[{item['type']}] {item['title']} — {item.get('voteCount', 0)} votes")

cURL

    curl -X POST "https://api.apify.com/v2/acts/automation-lab~kaggle-scraper/runs?token=YOUR_APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "searchMode": "datasets",
        "search": "finance",
        "datasetSortBy": "hottest",
        "maxResults": 50
      }'

Use with Claude and MCP
You can use this actor directly from Claude Code, Claude Desktop, or any MCP-compatible AI assistant.
Claude Code (Terminal)

    claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/kaggle-scraper"

Then ask Claude:
- "Find the top 20 most-voted climate change datasets on Kaggle"
- "Search Kaggle for NLP models and return the hottest ones"
- "Get 50 Kaggle datasets about healthcare, sorted by download count"
Claude Desktop / Cursor / VS Code
Add to your MCP config file (`~/.config/claude/claude_desktop_config.json` or `.vscode/mcp.json`):

    {
        "mcpServers": {
            "apify": {
                "type": "http",
                "url": "https://mcp.apify.com?tools=automation-lab/kaggle-scraper",
                "headers": {
                    "Authorization": "Bearer YOUR_APIFY_TOKEN"
                }
            }
        }
    }

Example prompts:
- "Search Kaggle for 'time series' datasets and show me the ones with more than 1000 downloads"
- "What are the hottest ML models on Kaggle right now?"
- "Find datasets related to 'autonomous vehicles' and export them to a CSV"
Legality and Terms of Service
This actor only accesses Kaggle's public API endpoints that are available without authentication. The data returned is publicly visible on the Kaggle website to all visitors without login.
We only collect metadata (titles, descriptions, vote counts, tags) — not dataset files or model weights. No personal data is collected beyond publicly displayed creator names.
Important: Always check Kaggle's Terms of Service and Privacy Policy before using scraped data for commercial purposes. Respect dataset licenses (CC0, CC BY, etc.) when using the actual dataset files.
FAQ
Q: Do I need a Kaggle account or API key?
A: No. This actor uses Kaggle's public API, which works without authentication. No credentials needed.
Q: Can I download the actual dataset files?
A: No. This actor collects metadata only. To download dataset files, you need a Kaggle account and can use the official Kaggle API or CLI.
Q: How many results can I get?
A: maxResults supports up to 10,000. In practice, Kaggle's public API typically returns results from the most recent and most popular content. For very broad queries, you may get fewer unique results than requested.
Q: The actor returned 0 results for my search. Why?
A: A few things to check:
- The search term may be too specific (try a broader keyword)
- For models mode, make sure `modelSortBy` is `hotness` or `downloadCount` (not `votes` or `updated`, which the API does not support)
- Try a run with no `search` value first to confirm the basic mode works
Q: Can I get private datasets?
A: No. This actor only accesses publicly available content. Private datasets require authentication.
Q: Does this work for Kaggle competitions too?
A: Not yet. This actor focuses on datasets and models. Competition metadata is on a different API endpoint — contact us if you need that.
Q: What does usabilityRating mean?
A: Kaggle calculates a usability score (0-1) based on whether a dataset has a description, column descriptions, file documentation, a license, and a proper cover image. A score of 1.0 means fully documented.
Related Scrapers
Looking for more data from the AI and ML ecosystem? Check out our other automation-lab actors:
- Hugging Face Models Scraper — Scrape Hugging Face models and datasets
- arXiv Paper Scraper — Scrape academic papers from arXiv
- GitHub Repositories Scraper — Scrape GitHub repositories and star counts