Kaggle Scraper

Pricing

from $3.00 / 1,000 results

Scrape Kaggle datasets, competitions, notebooks, and user profiles. Datasets are open via the public API; competitions and notebooks need Kaggle API credentials.

Rating

5.0

(17)

Developer

Crawler Bros

Maintained by Community

Actor stats

17

Bookmarked

2

Total users

1

Monthly active users

2 days ago

Last modified

Scrape Kaggle — the world's largest data-science community. Search public datasets, fetch by ref or URL, browse trending datasets, list a user's datasets, and (with API credentials) pull competitions and notebooks (kernels). Pure HTTP via the official Kaggle public API at kaggle.com/api/v1/*.

What this actor does

  • 8 modes: search, byDataset, byCompetition, byNotebook, byUser, trendingDatasets, trendingNotebooks, byUrl
  • Two auth tiers:
    • Public (no auth): datasets search/list/view, byUser, trendingDatasets, byUrl for datasets/users
    • Auth required: competitions, notebooks (kernels), trendingNotebooks
  • Filters: owner, sort order, file type, license family, min votes, min downloads, min usability, max size
  • URL auto-detection: paste any kaggle.com/datasets/<owner>/<slug>, /competitions/<slug>, /code/<owner>/<slug>, or user URL
  • Empty fields are omitted — every record only contains populated fields
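
The URL auto-detection described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual code: the function name `classify_kaggle_url` and the `RESERVED` set of top-level path names are assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical set of top-level paths that are site sections, not usernames.
RESERVED = {"datasets", "competitions", "code", "models", "learn", "discussions"}


def classify_kaggle_url(url):
    """Return (record_type, ref) for a kaggle.com URL, or None if unrecognized."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("kaggle.com", "www.kaggle.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # /datasets/<owner>/<slug>
    if len(parts) >= 3 and parts[0] == "datasets":
        return ("dataset", f"{parts[1]}/{parts[2]}")
    # /competitions/<slug>
    if len(parts) >= 2 and parts[0] == "competitions":
        return ("competition", parts[1])
    # /code/<owner>/<slug>
    if len(parts) >= 3 and parts[0] == "code":
        return ("kernel", f"{parts[1]}/{parts[2]}")
    # /<username>
    if len(parts) == 1 and parts[0] not in RESERVED:
        return ("user", parts[0])
    return None


print(classify_kaggle_url("https://www.kaggle.com/datasets/heptapod/titanic"))
```

The returned type names mirror the recordType values listed in the Output section.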

Output

Each record is a flat dict. Field names you might see (omit-empty applies):

Common

  • recordType: dataset / competition / kernel / user
  • ref — Kaggle reference (e.g. heptapod/titanic)
  • scrapedAt

Dataset

  • datasetId, title, subtitle, description
  • ownerName, ownerRef, creatorName, creatorUrl
  • licenseName, lastUpdated
  • totalBytes, downloadCount, voteCount, viewCount, kernelCount
  • currentVersionNumber, usabilityRating
  • isPrivate, isFeatured, thumbnailImageUrl
  • tags[], files[], fileCount
  • datasetUrl

Competition

  • competitionId, title, description, category
  • organizationName, organizationRef, tags
  • deadline, enabledDate, evaluationMetric
  • rewardType, rewardQuantity, teamCount
  • submissionsDisabled, isKernelsSubmissionsOnly
  • competitionUrl

Kernel (notebook)

  • kernelId, title, author, language, kernelType
  • lastRunTime, totalVotes, totalViews, totalComments
  • kernelUrl

User

  • username, displayName, profileUrl
  • totalDatasetsListed
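
Because empty fields are omitted, code that consumes the output should not assume any field is present. The sketch below shows one defensive way to read records; the helper name `summarize` and the sample values are made up for illustration.

```python
from collections import Counter


def summarize(records):
    """Count records per recordType and find the most-voted dataset ref."""
    counts = Counter(r.get("recordType", "unknown") for r in records)
    datasets = [r for r in records if r.get("recordType") == "dataset"]
    # voteCount may be absent (omit-empty), so fall back to 0 when comparing.
    top = max(datasets, key=lambda r: r.get("voteCount", 0), default=None)
    return counts, (top.get("ref") if top else None)


# Illustrative records only; the values are invented.
sample = [
    {"recordType": "dataset", "ref": "heptapod/titanic", "voteCount": 12},
    {"recordType": "dataset", "ref": "someone/other"},  # no voteCount field
    {"recordType": "user", "username": "heptapod"},
]
counts, top_ref = summarize(sample)
print(dict(counts), top_ref)
```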

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | enum | search | One of the 8 modes |
| searchQuery | string | titanic | Free-text query |
| datasetRefs | array | | owner/slug refs (mode=byDataset) |
| competitionRefs | array | | Competition slugs (mode=byCompetition, auth) |
| kernelRefs | array | | owner/slug refs (mode=byNotebook, auth) |
| userSlugs | array | | Usernames (mode=byUser) |
| startUrls | array | | Kaggle URLs (mode=byUrl) |
| ownerSlug | string | | Filter to user/org |
| sortBy | enum | hottest | hottest / votes / updated / active / published |
| fileType | enum | all | all / csv / sqlite / json / bigQuery |
| licenseGroup | enum | all | all / cc / gpl / odb / other |
| minVotes | integer | | Drop below this vote count |
| minDownloads | integer | | Drop below this download count |
| minUsability | integer | | Drop below this usability rating |
| maxSizeBytes | integer | | Drop datasets larger than this |
| kernelSortBy | enum | hotness | Notebook sort key (auth modes) |
| kernelLanguage | enum | all | Notebook language (auth modes) |
| kernelType | enum | all | script / notebook (auth modes) |
| kaggleUsername | string | | Required for competition / notebook modes |
| kaggleApiKey | string (secret) | | Required for competition / notebook modes |
| maxItems | integer | 50 | Hard cap (1–10000) |
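
The minVotes, minDownloads, minUsability, and maxSizeBytes inputs read as per-record thresholds. A rough sketch of that behavior is below. It assumes the thresholds are compared against the voteCount, downloadCount, usabilityRating, and totalBytes output fields, and that a missing field counts as 0; neither assumption is confirmed by the actor's documentation, and `passes_filters` is a hypothetical name.

```python
def passes_filters(record, *, min_votes=None, min_downloads=None,
                   min_usability=None, max_size_bytes=None):
    """Return True if a dataset record satisfies every threshold that is set."""
    # Assumed mapping of input filters to output fields.
    lower_bounds = {
        "voteCount": min_votes,
        "downloadCount": min_downloads,
        "usabilityRating": min_usability,
    }
    for field, limit in lower_bounds.items():
        # Assumption: a missing (omitted) field is treated as 0.
        if limit is not None and record.get(field, 0) < limit:
            return False
    if max_size_bytes is not None and record.get("totalBytes", 0) > max_size_bytes:
        return False
    return True


# Invented record for illustration.
record = {"voteCount": 40, "downloadCount": 900, "usabilityRating": 0.9, "totalBytes": 50_000}
print(passes_filters(record, min_votes=25, max_size_bytes=1_000_000))
```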

Examples

Search top Titanic datasets

{
  "mode": "search",
  "searchQuery": "titanic",
  "sortBy": "votes",
  "maxItems": 25
}

Trending CSV datasets

{
  "mode": "trendingDatasets",
  "fileType": "csv",
  "minUsability": 0.8,
  "maxItems": 50
}

Lookup a specific dataset

{
  "mode": "byDataset",
  "datasetRefs": ["heptapod/titanic"]
}

Browse a user's datasets

{
  "mode": "byUser",
  "userSlugs": ["heptapod"]
}

Lookup by URL (auto-detect)

{
  "mode": "byUrl",
  "startUrls": [
    "https://www.kaggle.com/datasets/heptapod/titanic",
    "https://www.kaggle.com/heptapod"
  ]
}

Competition lookup (auth required)

{
  "mode": "byCompetition",
  "competitionRefs": ["titanic"],
  "kaggleUsername": "your-username",
  "kaggleApiKey": "your-api-key"
}

How to get Kaggle API credentials

  1. Sign in to kaggle.com.
  2. Go to Account settings → "API" → "Create New Token".
  3. A kaggle.json file downloads. Use the username and key fields here as kaggleUsername and kaggleApiKey.
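
As a small convenience, the mapping from kaggle.json to the actor's input fields could be scripted like this. The helper name `credentials_from_kaggle_json` is hypothetical, and the file path is whatever location you saved the token to.

```python
import json
from pathlib import Path


def credentials_from_kaggle_json(path):
    """Read a kaggle.json token file and return the two actor input fields."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # kaggle.json stores the credentials under "username" and "key".
    return {"kaggleUsername": data["username"], "kaggleApiKey": data["key"]}
```

The returned dict can be merged into a run input, for example `{"mode": "byCompetition", "competitionRefs": ["titanic"], **credentials_from_kaggle_json("kaggle.json")}`.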

You only need credentials for byCompetition, byNotebook, and trendingNotebooks modes. All dataset modes work without auth.
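
A pre-flight check along these lines can catch a missing credential before a run is started. It is only a sketch: `missing_credentials` is a hypothetical helper, and it does not cover byUrl runs that point at competition or notebook URLs.

```python
# Modes that the text above lists as requiring Kaggle API credentials.
AUTH_MODES = {"byCompetition", "byNotebook", "trendingNotebooks"}


def missing_credentials(run_input):
    """Return the credential fields that are required but absent from run_input."""
    mode = run_input.get("mode", "search")  # "search" is the documented default
    if mode not in AUTH_MODES:
        return []
    return [k for k in ("kaggleUsername", "kaggleApiKey") if not run_input.get(k)]


print(missing_credentials({"mode": "byCompetition", "kaggleUsername": "your-username"}))
```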

Reliability

  • Direct calls to the official kaggle.com/api/v1/* endpoints
  • Exponential backoff retries on 429 and 500–504 responses
  • HTML 404 fallback handling (Kaggle redirects unknown refs to a 404 HTML page)
  • No proxy needed — works from datacenter IPs
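
For readers unfamiliar with the retry pattern mentioned above, here is a generic sketch of exponential backoff on retryable status codes. It is not the actor's implementation: the attempt count, base delay, jitter, and the `send` callable (which is assumed to return a `(status_code, body)` tuple) are all illustrative choices, and the retryable set assumes the bullet means 429 plus 500 through 504.

```python
import random
import time

# Assumed retryable statuses: 429 plus the 500 to 504 range.
RETRYABLE = {429, 500, 501, 502, 503, 504}


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Delay doubles each attempt: base, 2*base, 4*base, ... plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    # All attempts exhausted: return the last retryable response.
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.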

Limitations

  • The Kaggle public API exposes user info indirectly; byUser records are derived from the user's first listed datasets and contain only username, displayName, and a count of listed datasets.
  • Competitions, notebooks (kernels), and trending notebooks all require Kaggle API credentials — these are private endpoints (return 401 Unauthenticated without auth).
  • The license filter passes one of 5 broad families (cc/gpl/odb/other/all); finer-grained licenses like cc-by-sa-4.0 are returned in the output's licenseName field but cannot be filtered server-side.
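
Since finer-grained licenses cannot be filtered server-side, one workaround is to post-filter the output on licenseName. The sketch below assumes a case-insensitive exact match on that field is what you want; the helper name `filter_by_license` and the sample records are invented for illustration.

```python
def filter_by_license(records, wanted):
    """Keep records whose licenseName equals `wanted`, ignoring case and padding."""
    wanted_norm = wanted.strip().lower()
    return [
        r for r in records
        # licenseName may be absent because empty fields are omitted.
        if r.get("licenseName", "").strip().lower() == wanted_norm
    ]


# Invented records for illustration.
records = [
    {"ref": "a/one", "licenseName": "cc-by-sa-4.0"},
    {"ref": "b/two", "licenseName": "CC-BY-SA-4.0"},
    {"ref": "c/three"},
]
print([r["ref"] for r in filter_by_license(records, "cc-by-sa-4.0")])
```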
  • Single-version datasets only — version history is not enumerated.

FAQ

Do I need a Kaggle account? Only for competitions / notebooks. Dataset search and lookup work anonymously.

How fresh is the data? Real-time — every run hits the live Kaggle API.

Can I download dataset files? No. This actor exposes Kaggle metadata — refs, file lists, vote / download counts, license, etc. To download files, use the Kaggle CLI with the ref from this actor's output.
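
As a sketch of that hand-off, the snippet below builds a Kaggle CLI download command from a ref taken from this actor's output. It assumes the `kaggle` CLI is installed and configured with your token; `kaggle_download_command` is a hypothetical helper name.

```python
def kaggle_download_command(ref, dest_dir="."):
    """Build the argument list for `kaggle datasets download` from an owner/slug ref."""
    owner, sep, slug = ref.partition("/")
    if not (owner and sep and slug) or "/" in slug:
        raise ValueError(f"expected an owner/slug ref, got {ref!r}")
    # -d selects the dataset, -p sets the download directory.
    return ["kaggle", "datasets", "download", "-d", ref, "-p", dest_dir]


cmd = kaggle_download_command("heptapod/titanic", "data")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```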

Why are some fields missing? Empty / null fields are omitted — only populated fields appear in the output.

Why does the daily test run only return datasets? The default prefill targets dataset search, one of the modes that work without credentials. Once you provide kaggleUsername + kaggleApiKey, all 8 modes are available.