Deprecated

Pricing

Pay per usage

Kaggle Datasets & Models Scraper

Scrape datasets and ML models from Kaggle including metadata, votes, downloads, and more

Rating

0.0

(0)

Developer

Stas Persiianenko

Last modified

5 days ago

What Does It Do?

This actor scrapes Kaggle's public API to return structured metadata about datasets and machine learning models. You can filter by keyword, sort by popularity or recency, and collect hundreds of results in seconds.

Use it to discover ML training datasets, benchmark models, track popular research topics, or automate data pipeline discovery — all without a Kaggle login.

Who Is It For?

🧪 ML researchers and data scientists — Find relevant datasets for your next project. Search by topic (climate, NLP, finance) and sort by votes or downloads to surface the best-quality data fast.

🏢 AI/ML teams at companies — Audit the Kaggle landscape for datasets in your domain. Identify which public datasets competitors are using or which models are trending.

📊 Data journalists and analysts — Track trending datasets across topics. Build lists of the most-used datasets in a given field for reporting or research.

🤖 AI application developers — Programmatically discover training datasets and pre-trained models to power your LLM fine-tuning, classification, or computer vision pipelines.

📈 Market researchers — Monitor Kaggle model and dataset growth over time. Identify what AI topics are gaining traction by tracking vote counts and download trends.

Why Use This Actor?

✅ No login required — Kaggle's public API works without authentication
✅ Structured JSON output — All fields are normalized and flattened (no nested objects)
✅ Covers both datasets and models — Scrape either or both in one run
✅ Keyword search support — Filter by any topic (e.g., "climate", "healthcare", "llama")
✅ Multiple sort orders — Hottest, most votes, downloads, recently updated
✅ Pagination handled automatically — Just set maxResults and get paginated results
✅ Fast and cheap — Pure HTTP, no browser needed, results in seconds

What Data Is Extracted?

Datasets

Field	Description	Example
`type`	Record type	`"dataset"`
`id`	Kaggle dataset ID	`29`
`ref`	Owner/slug reference	`"berkeleyearth/climate-change-earth-surface-temperature-data"`
`url`	Full Kaggle URL	`"https://www.kaggle.com/datasets/..."`
`title`	Dataset title	`"Climate Change: Earth Surface Temperature Data"`
`subtitle`	Short description	`"Exploring global temperatures since 1750"`
`description`	Long description	`"This dataset contains records of surface temperatures..."`
`ownerName`	Owner display name	`"Berkeley Earth"`
`ownerRef`	Owner slug	`"organizations/berkeleyearth"`
`creatorName`	Creator name	`"John Doe"`
`licenseName`	Data license	`"CC BY-NC-SA 4.0"`
`totalBytes`	File size in bytes	`88843537`
`voteCount`	Number of votes/upvotes	`2453`
`downloadCount`	Total download count	`181589`
`viewCount`	Total page views	`1242616`
`kernelCount`	Number of notebooks using it	`695`
`currentVersionNumber`	Dataset version	`2`
`usabilityRating`	Kaggle usability score 0-1	`0.76`
`isPrivate`	Whether dataset is private	`false`
`isFeatured`	Whether featured by Kaggle	`false`
`lastUpdated`	Last update timestamp	`"2024-02-22T08:53:54.627Z"`
`thumbnailImageUrl`	Thumbnail image URL	`"https://storage.googleapis.com/..."`
`tags`	List of topic tags	`["climate", "education", "data visualization"]`

Models

Field	Description	Example
`type`	Record type	`"model"`
`id`	Kaggle model ID	`619281`
`ref`	Owner/slug reference	`"kienngx/nemotron-nano-30b-trained"`
`url`	Full Kaggle URL	`"https://www.kaggle.com/models/..."`
`title`	Model title	`"Nemotron-Nano-30B variances"`
`subtitle`	Short description	`"LoRA fine-tuned adapter for reasoning tasks"`
`description`	Model description (up to 500 chars)	`"## Model Details..."`
`author`	Author display name	`"Ngô Xuân Kiên"`
`slug`	Model slug	`"nemotron-nano-30b-trained"`
`voteCount`	Number of votes	`55`
`isPrivate`	Whether model is private	`false`
`authorImageUrl`	Author avatar URL	`"https://storage.googleapis.com/kaggle-avatars/..."`

How Much Does It Cost to Scrape Kaggle Datasets?

This actor uses Pay-Per-Event (PPE) pricing. You are charged a flat fee when a run starts, plus a per-result fee that depends on your Apify subscription tier:

Event	FREE	BRONZE	SILVER	GOLD	PLATINUM	DIAMOND
Run start (one-time)	$0.005	$0.005	$0.005	$0.005	$0.005	$0.005
Per result extracted	$0.00115	$0.001	$0.00078	$0.0006	$0.0004	$0.00028

Cost examples (BRONZE tier):

100 datasets = $0.005 + 100 × $0.001 = $0.105
500 datasets = $0.005 + 500 × $0.001 = $0.505
1,000 datasets + models = $0.005 + 1,000 × $0.001 = $1.005

Free plan estimate: Apify's free tier includes $5/month, which is enough for approximately 4,300 results per month at the FREE tier rate ($0.00115/result).
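The arithmetic above can be sketched as a small estimator. This is only an illustration: the prices are copied from the pricing table, while the function names `estimate_run_cost` and `max_results_for_budget` are hypothetical helpers, not part of the actor or the Apify client. It also assumes the whole budget is spent in a single run, so the run start fee is paid once.

```python
import math

# Prices in USD, copied from the pricing table above.
RUN_START_FEE = 0.005
PER_RESULT_FEE = {
    "FREE": 0.00115,
    "BRONZE": 0.001,
    "SILVER": 0.00078,
    "GOLD": 0.0006,
    "PLATINUM": 0.0004,
    "DIAMOND": 0.00028,
}


def estimate_run_cost(num_results: int, tier: str = "BRONZE") -> float:
    """Return the estimated cost in USD of one run that yields num_results items."""
    return RUN_START_FEE + num_results * PER_RESULT_FEE[tier]


def max_results_for_budget(budget_usd: float, tier: str = "FREE") -> int:
    """Return how many results one run can yield without exceeding budget_usd."""
    remaining = budget_usd - RUN_START_FEE
    if remaining <= 0:
        return 0
    return math.floor(remaining / PER_RESULT_FEE[tier])


print(f"${estimate_run_cost(100, 'BRONZE'):.3f}")  # 100 results on BRONZE
print(max_results_for_budget(5.0, "FREE"))         # results covered by $5 on FREE
```

Splitting the same budget across many runs lowers the result count slightly, because each run pays the start fee again.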

How to Use It

Step 1: Choose what to scrape

Set searchMode to:

"datasets" — scrape Kaggle datasets only
"models" — scrape ML models only
"both" — scrape datasets and models (results split 50/50)

Step 2: Optionally add a search query

Use the search field to filter results by keyword (e.g., "natural language processing", "computer vision", "finance").

Step 3: Set your sort order

For datasets: hottest (trending), votes (most upvoted), updated (recent), active (most notebooks), published (newest).

For models: hotness (trending), downloadCount (most downloaded).

Step 4: Set maxResults

Set maxResults to however many items you need. The actor will paginate automatically. For example, maxResults: 200 fetches 200 items across multiple pages.
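To get a feel for how many pages a given maxResults implies, here is a rough sketch. It assumes 20 items per page (the figure given in the tips section) for both datasets and models, and it assumes that an odd maxResults in "both" mode gives the extra item to datasets. The actor's real splitting logic is not documented here, and `plan_pages` is a hypothetical helper.

```python
import math

PAGE_SIZE = 20  # items per Kaggle page, per the tips section (assumed for both types)


def plan_pages(search_mode: str, max_results: int) -> dict[str, int]:
    """Return the number of pages to request for each record type."""
    if search_mode == "both":
        # Assumption: results are split evenly, the odd item goes to datasets.
        models = max_results // 2
        quotas = {"datasets": max_results - models, "models": models}
    elif search_mode in ("datasets", "models"):
        quotas = {search_mode: max_results}
    else:
        raise ValueError(f"unknown searchMode: {search_mode!r}")
    return {kind: math.ceil(quota / PAGE_SIZE) for kind, quota in quotas.items()}


print(plan_pages("datasets", 200))  # {'datasets': 10}
print(plan_pages("both", 200))      # {'datasets': 5, 'models': 5}
```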

Step 5: Run and download results

Results appear in the actor's dataset. Download as JSON, CSV, XLSX, or use the API.

Input Parameters

Parameter	Type	Default	Description
`searchMode`	string	`"datasets"`	What to scrape: `datasets`, `models`, or `both`
`search`	string	`""`	Keyword to filter results
`datasetSortBy`	string	`"hottest"`	Dataset sort: `hottest`, `votes`, `updated`, `active`, `published`
`modelSortBy`	string	`"hotness"`	Model sort: `hotness`, `downloadCount`
`maxResults`	integer	`100`	Maximum number of items to return
`maxRequestRetries`	integer	`3`	Retry attempts for failed requests
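Because an unsupported sort value can lead to empty results, it may help to validate the input before starting a run. The sketch below builds an input object from the parameters in the table above. `build_run_input` is a hypothetical helper, and the check that maxResults is positive is an assumption rather than a documented rule.

```python
# Allowed values, copied from the parameter table above.
SEARCH_MODES = {"datasets", "models", "both"}
DATASET_SORTS = {"hottest", "votes", "updated", "active", "published"}
MODEL_SORTS = {"hotness", "downloadCount"}


def build_run_input(search_mode="datasets", search="", dataset_sort_by="hottest",
                    model_sort_by="hotness", max_results=100, max_request_retries=3):
    """Validate the arguments and return a dict shaped like the actor's input."""
    if search_mode not in SEARCH_MODES:
        raise ValueError(f"searchMode must be one of {sorted(SEARCH_MODES)}")
    if dataset_sort_by not in DATASET_SORTS:
        raise ValueError(f"datasetSortBy must be one of {sorted(DATASET_SORTS)}")
    if model_sort_by not in MODEL_SORTS:
        raise ValueError(f"modelSortBy must be one of {sorted(MODEL_SORTS)}")
    if max_results < 1:  # assumption: the actor expects a positive integer
        raise ValueError("maxResults must be a positive integer")
    return {
        "searchMode": search_mode,
        "search": search,
        "datasetSortBy": dataset_sort_by,
        "modelSortBy": model_sort_by,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


run_input = build_run_input(search_mode="models", search="llama",
                            model_sort_by="downloadCount")
print(run_input)
```

The returned dict can be passed as `run_input` to the Apify client call shown in the API Usage section.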

Output Example

{
  "type": "dataset",
  "id": 29,
  "ref": "berkeleyearth/climate-change-earth-surface-temperature-data",
  "url": "https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data",
  "title": "Climate Change: Earth Surface Temperature Data",
  "subtitle": "Exploring global temperatures since 1750",
  "description": "",
  "ownerName": "Berkeley Earth",
  "ownerRef": "organizations/berkeleyearth",
  "creatorName": "[Deleted User]",
  "licenseName": "CC BY-NC-SA 4.0",
  "totalBytes": 88843537,
  "voteCount": 2453,
  "downloadCount": 181589,
  "viewCount": 1242616,
  "kernelCount": 695,
  "currentVersionNumber": 2,
  "usabilityRating": 0.7647059,
  "isPrivate": false,
  "isFeatured": false,
  "lastUpdated": "2017-05-01T17:29:10.78Z",
  "thumbnailImageUrl": "https://storage.googleapis.com/kaggle-datasets-images/29/33/default-backgrounds/dataset-thumbnail.jpg",
  "tags": ["atmospheric science", "environment", "business", "news"]
}
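A record like the one above usually needs light post-processing before analysis. The following sketch converts totalBytes to MiB and parses lastUpdated into a timezone-aware datetime. It assumes every timestamp carries fractional seconds and a trailing Z, as in the example. `summarize` is a hypothetical helper, and only a subset of the fields is used.

```python
from datetime import datetime, timezone

# Subset of the example record above.
record = {
    "ref": "berkeleyearth/climate-change-earth-surface-temperature-data",
    "totalBytes": 88843537,
    "lastUpdated": "2017-05-01T17:29:10.78Z",
    "tags": ["atmospheric science", "environment", "business", "news"],
}


def summarize(record):
    """Return a compact, analysis-friendly view of one dataset record."""
    # Assumption: timestamps always look like 2017-05-01T17:29:10.78Z.
    updated = datetime.strptime(record["lastUpdated"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "ref": record["ref"],
        "size_mib": round(record["totalBytes"] / 1024 ** 2, 1),
        "last_updated": updated.replace(tzinfo=timezone.utc),
        "tags": ", ".join(record["tags"]),
    }


summary = summarize(record)
print(summary["size_mib"], summary["last_updated"].isoformat())
```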

Tips and Best Practices

💡 Use keyword search for niche topics — The search field is powerful. Try terms like "llm fine-tuning", "medical imaging", "stock market", or "speech recognition" to find highly specific datasets.

💡 Sort by votes for quality — Highly upvoted datasets tend to be well-documented, clean, and widely used. Use votes sort when you want the most trusted datasets in a category.

💡 Run both mode for AI model discovery — If you're building an LLM application, run with searchMode: "both" and search: "llm" to discover both training datasets and pre-trained models in one run.

💡 Download in CSV for spreadsheets — Click "Export" → "CSV" in the run output tab to get a clean spreadsheet of all results.

💡 Set a higher maxResults for comprehensive coverage — Kaggle returns 20 items per page. For broad topic searches, set maxResults: 500 or higher to get full coverage.

💡 Track trends over time — Schedule this actor to run weekly and save results to a Google Sheet to track which datasets are growing in popularity.
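For the trend-tracking tip, one simple approach is to compare voteCount between two scheduled runs. The sketch below assumes each snapshot is a list of output records keyed by ref. `vote_growth` is a hypothetical helper, and the earlier vote count in the sample is made up for illustration.

```python
def vote_growth(previous, current):
    """Return {ref: vote delta} for items present in both snapshots."""
    before = {item["ref"]: item.get("voteCount", 0) for item in previous}
    return {
        item["ref"]: item.get("voteCount", 0) - before[item["ref"]]
        for item in current
        if item["ref"] in before
    }


REF = "berkeleyearth/climate-change-earth-surface-temperature-data"
last_week = [{"ref": REF, "voteCount": 2400}]  # made-up earlier value
this_week = [{"ref": REF, "voteCount": 2453}]  # value from the output example

print(vote_growth(last_week, this_week))
```

Items that appear only in the newer snapshot are skipped, since there is no baseline to compare against.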

Integrations

📊 Connect to Google Sheets

Export Kaggle dataset metadata directly to Google Sheets for tracking and analysis:

Run this actor with your search query
Go to Integrations → Google Sheets in the Apify console
Connect your spreadsheet and select the export fields

🔗 Connect to Airtable

Build a dataset discovery database in Airtable:

Use the Apify → Airtable integration to push results to a base
Create filtered views by license type, vote count, or topic tags
Add a formula field to compute bytes to GB for file size display

⚡ Use with Zapier or Make

Automate notifications when popular new datasets appear:

Schedule this actor to run daily
Use Zapier's Apify trigger to catch new results
Send Slack alerts or email digests for datasets with voteCount > 500
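If you prefer to filter in your own code before sending alerts, the voteCount > 500 rule can be sketched as follows. `popular_new_datasets` and the `seen_refs` set are hypothetical. In practice you would persist the refs from earlier runs somewhere, such as a key-value store.

```python
def popular_new_datasets(items, seen_refs, min_votes=500):
    """Return datasets above min_votes whose ref was not seen in earlier runs."""
    fresh = [
        item for item in items
        if item.get("type") == "dataset"
        and item.get("voteCount", 0) > min_votes
        and item["ref"] not in seen_refs
    ]
    # Most-voted first, so the alert leads with the strongest candidates.
    return sorted(fresh, key=lambda item: item["voteCount"], reverse=True)


# Illustrative records: the first two reuse values from the field tables,
# the third is made up for the example.
items = [
    {"type": "dataset", "voteCount": 2453,
     "ref": "berkeleyearth/climate-change-earth-surface-temperature-data"},
    {"type": "model", "voteCount": 55, "ref": "kienngx/nemotron-nano-30b-trained"},
    {"type": "dataset", "voteCount": 120, "ref": "example-owner/example-dataset"},
]

alerts = popular_new_datasets(items, seen_refs=set())
print([item["ref"] for item in alerts])
```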

📦 Export to MongoDB or BigQuery

Store results in your data warehouse for longitudinal analysis:

Download results as JSONL from the dataset
Use mongoimport or BigQuery's JSON import to load
Query by tag, vote count, download count over time
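For longitudinal analysis it helps to stamp every record with the date of the run before loading it. The sketch below serializes items as JSON Lines with an added `snapshotDate` field. Both that field name and the `to_jsonl` helper are assumptions for illustration and are not produced by the actor.

```python
import json
from datetime import date


def to_jsonl(items, snapshot_date):
    """Serialize items as JSON Lines, stamping each record with the snapshot date."""
    lines = []
    for item in items:
        stamped = {**item, "snapshotDate": snapshot_date.isoformat()}
        # ensure_ascii=False keeps non-ASCII names readable in the output file.
        lines.append(json.dumps(stamped, ensure_ascii=False))
    if not lines:
        return ""
    return "\n".join(lines) + "\n"


items = [
    {"type": "model", "ref": "kienngx/nemotron-nano-30b-trained",
     "author": "Ngô Xuân Kiên", "voteCount": 55},
]
payload = to_jsonl(items, date(2024, 1, 15))
print(payload, end="")
```

Write `payload` to a `.jsonl` file with `encoding="utf-8"`, then load it with mongoimport or BigQuery's JSON import as described above.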

API Usage

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/kaggle-scraper').call({
    searchMode: 'datasets',
    search: 'natural language processing',
    datasetSortBy: 'votes',
    maxResults: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} datasets`);
items.forEach(item => {
    console.log(`${item.title} — ${item.voteCount} votes, ${item.downloadCount} downloads`);
});

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("automation-lab/kaggle-scraper").call(run_input={
    "searchMode": "both",
    "search": "computer vision",
    "maxResults": 200,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Extracted {len(items)} items")
for item in items:
    print(f"[{item['type']}] {item['title']} — {item.get('voteCount', 0)} votes")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~kaggle-scraper/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "searchMode": "datasets",
    "search": "finance",
    "datasetSortBy": "hottest",
    "maxResults": 50
  }'

Use with Claude and MCP

You can use this actor directly from Claude Code, Claude Desktop, or any MCP-compatible AI assistant.

Claude Code (Terminal)

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/kaggle-scraper"

Then ask Claude:

"Find the top 20 most-voted climate change datasets on Kaggle"
"Search Kaggle for NLP models and return the hottest ones"
"Get 50 Kaggle datasets about healthcare, sorted by download count"

Claude Desktop / Cursor / VS Code

Add to your MCP config file (~/.config/claude/claude_desktop_config.json or .vscode/mcp.json):

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/kaggle-scraper",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Example prompts:

"Search Kaggle for 'time series' datasets and show me the ones with more than 1000 downloads"
"What are the hottest ML models on Kaggle right now?"
"Find datasets related to 'autonomous vehicles' and export them to a CSV"

Legality and Terms of Service

This actor only accesses Kaggle's public API endpoints that are available without authentication. The data returned is publicly visible on the Kaggle website to all visitors without login.

We only collect metadata (titles, descriptions, vote counts, tags) — not dataset files or model weights. No personal data is collected beyond publicly displayed creator names.

Important: Always check Kaggle's Terms of Service and Privacy Policy before using scraped data for commercial purposes. Respect dataset licenses (CC0, CC BY, etc.) when using the actual dataset files.

FAQ

Q: Do I need a Kaggle account or API key?
A: No. This actor uses Kaggle's public API, which works without authentication.

Q: Can I download the actual dataset files?
A: No. This actor collects metadata only. To download dataset files, you need a Kaggle account and the official Kaggle API or CLI.

Q: How many results can I get?
A: maxResults supports up to 10,000. In practice, Kaggle's public API tends to surface only the most recent and most popular content, so very broad queries may return fewer unique results than requested.

Q: The actor returned 0 results for my search. Why?
A: Check the following:

The search term may be too specific; try a broader keyword
In models mode, modelSortBy must be hotness or downloadCount (the API does not support votes or updated for models)
Run once with an empty search value to confirm that the basic mode works

Q: Can I get private datasets?
A: No. This actor only accesses publicly available content. Private datasets require authentication.

Q: Does this work for Kaggle competitions too?
A: Not yet. This actor focuses on datasets and models. Competition metadata lives on a different API endpoint. Contact us if you need it.

Q: What does usabilityRating mean?
A: Kaggle calculates a usability score (0-1) based on whether a dataset has a description, column descriptions, file documentation, a license, and a proper cover image. A score of 1.0 means fully documented.

Related Scrapers

Looking for more data from the AI and ML ecosystem? Check out our other automation-lab actors:

Hugging Face Models Scraper — Scrape Hugging Face models and datasets
arXiv Paper Scraper — Scrape academic papers from arXiv
GitHub Repositories Scraper — Scrape GitHub repositories and star counts
