Kaggle Datasets & Models Scraper

Pricing

Pay per event

Scrape datasets and ML models from Kaggle including metadata, votes, downloads, and more

Rating: 0.0 (0)

Developer: Stas Persiianenko (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 2 days ago

Extract metadata for datasets and ML models from Kaggle — the world's largest data science community with 15 million users, 300,000+ datasets, and a growing library of open-source ML models. No Kaggle account or API key required.

What Does It Do?

This actor scrapes Kaggle's public API to return structured metadata about datasets and machine learning models. You can filter by keyword, sort by popularity or recency, and collect hundreds of results in seconds.

Use it to discover ML training datasets, benchmark models, track popular research topics, or automate data pipeline discovery — all without a Kaggle login.

Who Is It For?

🧪 ML researchers and data scientists — Find relevant datasets for your next project. Search by topic (climate, NLP, finance) and sort by votes or downloads to surface the best-quality data fast.

🏢 AI/ML teams at companies — Audit the Kaggle landscape for datasets in your domain. Identify which public datasets competitors are using or which models are trending.

📊 Data journalists and analysts — Track trending datasets across topics. Build lists of the most-used datasets in a given field for reporting or research.

🤖 AI application developers — Programmatically discover training datasets and pre-trained models to power your LLM fine-tuning, classification, or computer vision pipelines.

📈 Market researchers — Monitor Kaggle model and dataset growth over time. Identify what AI topics are gaining traction by tracking vote counts and download trends.

Why Use This Actor?

  • No login required — Kaggle's public API works without authentication
  • Structured JSON output — All fields are normalized and flattened (no nested objects)
  • Covers both datasets and models — Scrape either or both in one run
  • Keyword search support — Filter by any topic (e.g., "climate", "healthcare", "llama")
  • Multiple sort orders — Hottest, most votes, downloads, recently updated
  • Pagination handled automatically — Just set maxResults and get paginated results
  • Fast and cheap — Pure HTTP, no browser needed, results in seconds

What Data Is Extracted?

Datasets

| Field | Description | Example |
| --- | --- | --- |
| type | Record type | "dataset" |
| id | Kaggle dataset ID | 29 |
| ref | Owner/slug reference | "berkeleyearth/climate-change-earth-surface-temperature-data" |
| url | Full Kaggle URL | "https://www.kaggle.com/datasets/..." |
| title | Dataset title | "Climate Change: Earth Surface Temperature Data" |
| subtitle | Short description | "Exploring global temperatures since 1750" |
| description | Long description | "This dataset contains records of surface temperatures..." |
| ownerName | Owner display name | "Berkeley Earth" |
| ownerRef | Owner slug | "organizations/berkeleyearth" |
| creatorName | Creator name | "John Doe" |
| licenseName | Data license | "CC BY-NC-SA 4.0" |
| totalBytes | File size in bytes | 88843537 |
| voteCount | Number of votes/upvotes | 2453 |
| downloadCount | Total download count | 181589 |
| viewCount | Total page views | 1242616 |
| kernelCount | Number of notebooks using it | 695 |
| currentVersionNumber | Dataset version | 2 |
| usabilityRating | Kaggle usability score (0-1) | 0.76 |
| isPrivate | Whether dataset is private | false |
| isFeatured | Whether featured by Kaggle | false |
| lastUpdated | Last update timestamp | "2024-02-22T08:53:54.627Z" |
| thumbnailImageUrl | Thumbnail image URL | "https://storage.googleapis.com/..." |
| tags | List of topic tags | ["climate", "education", "data visualization"] |

Models

| Field | Description | Example |
| --- | --- | --- |
| type | Record type | "model" |
| id | Kaggle model ID | 619281 |
| ref | Owner/slug reference | "kienngx/nemotron-nano-30b-trained" |
| url | Full Kaggle URL | "https://www.kaggle.com/models/..." |
| title | Model title | "Nemotron-Nano-30B variances" |
| subtitle | Short description | "LoRA fine-tuned adapter for reasoning tasks" |
| description | Model description (up to 500 chars) | "## Model Details..." |
| author | Author display name | "Ngô Xuân Kiên" |
| slug | Model slug | "nemotron-nano-30b-trained" |
| voteCount | Number of votes | 55 |
| isPrivate | Whether model is private | false |
| authorImageUrl | Author avatar URL | "https://storage.googleapis.com/kaggle-avatars/..." |

How Much Does It Cost to Scrape Kaggle Datasets?

This actor uses Pay-Per-Event (PPE) pricing. You are charged a flat fee when a run starts, plus a per-result fee that depends on your Apify subscription tier:

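| Event | FREE | BRONZE | SILVER | GOLD | PLATINUM | DIAMOND |
| --- | --- | --- | --- | --- | --- | --- |
| Run start (one-time) | $0.005 | $0.005 | $0.005 | $0.005 | $0.005 | $0.005 |
| Per result extracted | $0.00115 | $0.001 | $0.00078 | $0.0006 | $0.0004 | $0.00028 |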

Cost examples (BRONZE tier):

  • 100 datasets = $0.005 + 100 × $0.001 = $0.105
  • 500 datasets = $0.005 + 500 × $0.001 = $0.505
  • 1,000 datasets + models = $0.005 + 1,000 × $0.001 = $1.005

Free plan estimate: Apify's free tier includes $5/month, which is enough for approximately 4,300 results per month at the FREE tier rate ($0.00115/result).
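
The arithmetic above can be sketched in Python. The function names are illustrative (not part of the actor), the fees are copied from the pricing table, and the budget helper assumes all results are collected in a single run, so the start fee is paid once.

```python
# Fees copied from the pricing table above (USD).
RUN_START_FEE = 0.005
PER_RESULT_FEE = {
    "FREE": 0.00115,
    "BRONZE": 0.001,
    "SILVER": 0.00078,
    "GOLD": 0.0006,
    "PLATINUM": 0.0004,
    "DIAMOND": 0.00028,
}


def estimate_cost(results: int, tier: str = "BRONZE") -> float:
    """Estimated cost of one run: start fee plus the per-result fee."""
    return round(RUN_START_FEE + results * PER_RESULT_FEE[tier], 6)


def max_results_for_budget(budget: float, tier: str = "FREE") -> int:
    """Largest result count one run can return within the given budget."""
    remaining = budget - RUN_START_FEE
    if remaining <= 0:
        return 0
    # The small epsilon guards against floating-point error on exact multiples.
    return int(remaining / PER_RESULT_FEE[tier] + 1e-9)


print(estimate_cost(100, "BRONZE"))         # 0.105
print(max_results_for_budget(5.0, "FREE"))  # 4343
```

Splitting the same budget across several runs lowers the total slightly, because each run pays the start fee again.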

How to Use It

Step 1: Choose what to scrape

Set searchMode to:

  • "datasets" — scrape Kaggle datasets only
  • "models" — scrape ML models only
  • "both" — scrape datasets and models (results split 50/50)

Step 2: Optionally add a search query

Use the search field to filter results by keyword (e.g., "natural language processing", "computer vision", "finance").

Step 3: Set your sort order

For datasets: hottest (trending), votes (most upvoted), updated (recent), active (most notebooks), published (newest).

For models: hotness (trending), downloadCount (most downloaded).

Step 4: Set maxResults

Set maxResults to however many items you need. The actor will paginate automatically. For example, maxResults: 200 fetches 200 items across multiple pages.
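
As a rough guide to how many pages a run will fetch, the sketch below assumes the page size of 20 items mentioned in the tips section of this README. The helper name is hypothetical, and the actor's real paging logic may differ.

```python
import math

# Assumption: Kaggle returns 20 items per page, as stated in the tips section.
PAGE_SIZE = 20


def pages_needed(max_results: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages required to collect max_results items."""
    if max_results <= 0:
        return 0
    return math.ceil(max_results / page_size)


print(pages_needed(200))  # 10
```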

Step 5: Run and download results

Results appear in the actor's dataset. Download as JSON, CSV, XLSX, or use the API.

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searchMode | string | "datasets" | What to scrape: datasets, models, or both |
| search | string | "" | Keyword to filter results |
| datasetSortBy | string | "hottest" | Dataset sort: hottest, votes, updated, active, published |
| modelSortBy | string | "hotness" | Model sort: hotness, downloadCount |
| maxResults | integer | 100 | Maximum number of items to return |
| maxRequestRetries | integer | 3 | Retry attempts for failed requests |
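
A local pre-check can catch invalid combinations before a run is started. The sketch below is not the actor's own validation. The helper name is hypothetical, the allowed values come from the table above, the upper bound of 10,000 comes from the FAQ, and the lower bound of 1 is an assumption.

```python
SEARCH_MODES = {"datasets", "models", "both"}
DATASET_SORTS = {"hottest", "votes", "updated", "active", "published"}
MODEL_SORTS = {"hotness", "downloadCount"}


def build_run_input(
    search_mode: str = "datasets",
    search: str = "",
    dataset_sort_by: str = "hottest",
    model_sort_by: str = "hotness",
    max_results: int = 100,
    max_request_retries: int = 3,
) -> dict:
    """Validate the parameters locally and return the actor input as a dict."""
    if search_mode not in SEARCH_MODES:
        raise ValueError(f"searchMode must be one of {sorted(SEARCH_MODES)}")
    if dataset_sort_by not in DATASET_SORTS:
        raise ValueError(f"datasetSortBy must be one of {sorted(DATASET_SORTS)}")
    if model_sort_by not in MODEL_SORTS:
        raise ValueError(f"modelSortBy must be one of {sorted(MODEL_SORTS)}")
    # Upper bound taken from the FAQ; lower bound is an assumption.
    if not 1 <= max_results <= 10_000:
        raise ValueError("maxResults must be between 1 and 10000")
    return {
        "searchMode": search_mode,
        "search": search,
        "datasetSortBy": dataset_sort_by,
        "modelSortBy": model_sort_by,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


run_input = build_run_input(search_mode="both", search="llm", max_results=200)
```

The returned dict can be passed as run_input to the Python client shown in the API Usage section.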

Output Example

{
  "type": "dataset",
  "id": 29,
  "ref": "berkeleyearth/climate-change-earth-surface-temperature-data",
  "url": "https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data",
  "title": "Climate Change: Earth Surface Temperature Data",
  "subtitle": "Exploring global temperatures since 1750",
  "description": "",
  "ownerName": "Berkeley Earth",
  "ownerRef": "organizations/berkeleyearth",
  "creatorName": "[Deleted User]",
  "licenseName": "CC BY-NC-SA 4.0",
  "totalBytes": 88843537,
  "voteCount": 2453,
  "downloadCount": 181589,
  "viewCount": 1242616,
  "kernelCount": 695,
  "currentVersionNumber": 2,
  "usabilityRating": 0.7647059,
  "isPrivate": false,
  "isFeatured": false,
  "lastUpdated": "2017-05-01T17:29:10.78Z",
  "thumbnailImageUrl": "https://storage.googleapis.com/kaggle-datasets-images/29/33/default-backgrounds/dataset-thumbnail.jpg",
  "tags": ["atmospheric science", "environment", "business", "news"]
}
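
A record like this can be turned into a one-line summary. The helper below is an illustrative sketch, not part of the actor. It reads fields with .get() because model records do not carry totalBytes or usabilityRating, and it uses decimal gigabytes (1 GB = 10^9 bytes).

```python
def summarize(record: dict) -> str:
    """Build a short human-readable summary of a dataset or model record."""
    parts = [
        f"[{record.get('type', '?')}] {record.get('title', '')}",
        f"{record.get('voteCount', 0)} votes",
    ]
    total_bytes = record.get("totalBytes")
    if total_bytes is not None:
        parts.append(f"{total_bytes / 1e9:.3f} GB")
    usability = record.get("usabilityRating")
    if usability is not None:
        # usabilityRating is a 0-1 score; show it as a percentage.
        parts.append(f"usability {round(usability * 100)}%")
    return " | ".join(parts)


example = {
    "type": "dataset",
    "title": "Climate Change: Earth Surface Temperature Data",
    "voteCount": 2453,
    "totalBytes": 88843537,
    "usabilityRating": 0.7647059,
}
print(summarize(example))
# [dataset] Climate Change: Earth Surface Temperature Data | 2453 votes | 0.089 GB | usability 76%
```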

Tips and Best Practices

💡 Use keyword search for niche topics — The search field is powerful. Try terms like "llm fine-tuning", "medical imaging", "stock market", or "speech recognition" to find highly specific datasets.

💡 Sort by votes for quality — Highly upvoted datasets tend to be well-documented, clean, and widely used. Use votes sort when you want the most trusted datasets in a category.

💡 Run both mode for AI model discovery — If you're building an LLM application, run with searchMode: "both" and search: "llm" to discover both training datasets and pre-trained models in one run.

💡 Download in CSV for spreadsheets — Click "Export" → "CSV" in the run output tab to get a clean spreadsheet of all results.

💡 Set a higher maxResults for comprehensive coverage — Kaggle returns 20 items per page. For broad topic searches, set maxResults: 500 or higher to get full coverage.

💡 Track trends over time — Schedule this actor to run weekly and save results to a Google Sheet to track which datasets are growing in popularity.
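
One way to compare two scheduled runs is to diff their vote counts. The sketch below assumes you have saved the items from an earlier run and a later run as lists of dicts. The function name is hypothetical, and records are matched on type plus ref.

```python
def vote_growth(previous: list[dict], current: list[dict]) -> dict[str, int]:
    """Return the vote-count change per item, largest gain first.

    Only items present in both snapshots are compared.
    """
    before = {
        f"{item['type']}:{item['ref']}": item.get("voteCount", 0)
        for item in previous
    }
    growth = {}
    for item in current:
        key = f"{item['type']}:{item['ref']}"
        if key in before:
            growth[key] = item.get("voteCount", 0) - before[key]
    return dict(sorted(growth.items(), key=lambda pair: pair[1], reverse=True))


last_week = [
    {"type": "dataset", "ref": "a/one", "voteCount": 10},
    {"type": "dataset", "ref": "b/two", "voteCount": 50},
]
this_week = [
    {"type": "dataset", "ref": "a/one", "voteCount": 25},
    {"type": "dataset", "ref": "b/two", "voteCount": 52},
    {"type": "dataset", "ref": "c/new", "voteCount": 5},
]
print(vote_growth(last_week, this_week))
# {'dataset:a/one': 15, 'dataset:b/two': 2}
```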

Integrations

📊 Connect to Google Sheets

Export Kaggle dataset metadata directly to Google Sheets for tracking and analysis:

  1. Run this actor with your search query
  2. Go to Integrations → Google Sheets in the Apify console
  3. Connect your spreadsheet and select the export fields

🔗 Connect to Airtable

Build a dataset discovery database in Airtable:

  1. Use the Apify → Airtable integration to push results to a base
  2. Create filtered views by license type, vote count, or topic tags
  3. Add a formula field to compute bytes to GB for file size display

⚡ Use with Zapier or Make

Automate notifications when popular new datasets appear:

  1. Schedule this actor to run daily
  2. Use Zapier's Apify trigger to catch new results
  3. Send Slack alerts or email digests for datasets with voteCount > 500
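
If you prefer to apply the threshold in your own code before sending alerts, a minimal sketch could look like this. The function name and the default threshold of 500 are illustrative, and items is assumed to be the list of records returned by a run.

```python
def popular_datasets(items: list[dict], min_votes: int = 500) -> list[dict]:
    """Keep dataset records whose voteCount is strictly above min_votes."""
    return [
        item
        for item in items
        if item.get("type") == "dataset" and item.get("voteCount", 0) > min_votes
    ]


sample = [
    {"type": "dataset", "title": "A", "voteCount": 2453},
    {"type": "dataset", "title": "B", "voteCount": 500},
    {"type": "model", "title": "C", "voteCount": 900},
]
for item in popular_datasets(sample):
    print(f"{item['title']}: {item['voteCount']} votes")
# A: 2453 votes
```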

📦 Export to MongoDB or BigQuery

Store results in your data warehouse for longitudinal analysis:

  1. Download results as JSONL from the dataset
  2. Use mongoimport or BigQuery's JSON import to load
  3. Query by tag, vote count, download count over time
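
If the items are already in memory (for example, fetched with the Python client), they can also be written to a JSONL file locally. This is a sketch under that assumption. The function name and file name are illustrative.

```python
import json
from pathlib import Path


def write_jsonl(items: list[dict], path: str) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for item in items:
            # ensure_ascii=False keeps non-ASCII names readable in the file.
            handle.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


records = [
    {"type": "model", "author": "Ngô Xuân Kiên", "voteCount": 55},
    {"type": "dataset", "ownerName": "Berkeley Earth", "voteCount": 2453},
]
written = write_jsonl(records, "kaggle_items.jsonl")
print(f"Wrote {written} records")
```

The resulting file has one document per line, which is the layout that mongoimport and BigQuery's newline-delimited JSON import expect.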

API Usage

Node.js

// Requires the apify-client package and an ES module context (top-level await).
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/kaggle-scraper').call({
    searchMode: 'datasets',
    search: 'natural language processing',
    datasetSortBy: 'votes',
    maxResults: 100,
});

// Read the results from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} datasets`);
items.forEach((item) => {
    console.log(`${item.title}: ${item.voteCount} votes, ${item.downloadCount} downloads`);
});

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("automation-lab/kaggle-scraper").call(run_input={
    "searchMode": "both",
    "search": "computer vision",
    "maxResults": 200,
})

# Read the results from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Extracted {len(items)} items")
for item in items:
    print(f"[{item['type']}] {item['title']}: {item.get('voteCount', 0)} votes")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~kaggle-scraper/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "searchMode": "datasets",
    "search": "finance",
    "datasetSortBy": "hottest",
    "maxResults": 50
  }'

Use with Claude and MCP

You can use this actor directly from Claude Code, Claude Desktop, or any MCP-compatible AI assistant.

Claude Code (Terminal)

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/kaggle-scraper"

Then ask Claude:

  • "Find the top 20 most-voted climate change datasets on Kaggle"
  • "Search Kaggle for NLP models and return the hottest ones"
  • "Get 50 Kaggle datasets about healthcare, sorted by download count"

Claude Desktop / Cursor / VS Code

Add to your MCP config file (~/.config/claude/claude_desktop_config.json or .vscode/mcp.json):

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/kaggle-scraper",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Example prompts:

  • "Search Kaggle for 'time series' datasets and show me the ones with more than 1000 downloads"
  • "What are the hottest ML models on Kaggle right now?"
  • "Find datasets related to 'autonomous vehicles' and export them to a CSV"

Legality and Terms of Service

This actor only accesses Kaggle's public API endpoints that are available without authentication. The data returned is publicly visible on the Kaggle website to all visitors without login.

We only collect metadata (titles, descriptions, vote counts, tags) — not dataset files or model weights. No personal data is collected beyond publicly displayed creator names.

Important: Always check Kaggle's Terms of Service and Privacy Policy before using scraped data for commercial purposes. Respect dataset licenses (CC0, CC BY, etc.) when using the actual dataset files.

FAQ

Q: Do I need a Kaggle account or API key?
A: No. This actor uses Kaggle's public API, which works without authentication. No credentials are needed.

Q: Can I download the actual dataset files?
A: No. This actor collects metadata only. To download dataset files, you need a Kaggle account and can use the official Kaggle API or CLI.

Q: How many results can I get?
A: maxResults supports up to 10,000. In practice, Kaggle's public API typically returns results from the most recent and most popular content. For very broad queries, you may get fewer unique results than requested.

Q: The actor returned 0 results for my search. Why?
A: Common causes and checks:

  • The search term is too specific (try a broader keyword)
  • In models mode, modelSortBy must be hotness or downloadCount (the API does not support votes or updated for models)
  • Run once with no search value to confirm that the basic mode works

Q: Can I get private datasets?
A: No. This actor only accesses publicly available content. Private datasets require authentication.

Q: Does this work for Kaggle competitions too?
A: Not yet. This actor focuses on datasets and models. Competition metadata is on a different API endpoint; contact us if you need that.

Q: What does usabilityRating mean?
A: Kaggle calculates a usability score (0-1) based on whether a dataset has a description, column descriptions, file documentation, a license, and a proper cover image. A score of 1.0 means fully documented.

Looking for more data from the AI and ML ecosystem? Check out our other automation-lab actors: