OpenML Dataset Scraper

Scrape ML datasets, tasks, flows, and runs from OpenML - the open science platform for machine learning

Developer

Stas Persiianenko

Maintained by Community

OpenML Scraper

Extract ML datasets, benchmark tasks, and algorithm flows from OpenML, the open science platform for machine learning. Get structured metadata for thousands of public ML benchmark datasets including feature counts, instance counts, class distributions, quality metrics, tags, download URLs, and more.

No API key required. No proxy needed. Pure HTTP access to OpenML's public REST API.

What does it do?

OpenML Scraper connects to the OpenML public REST API (openml.org/api/v1/json) and extracts structured data for three resource types:

  • 📊 Datasets: ML benchmark datasets with quality metrics (features, instances, classes, missing values), tags, descriptions, download URLs, and metadata
  • 🎯 Tasks: supervised classification and regression tasks defining evaluation procedures and target attributes
  • ⚙️ Flows: algorithm and pipeline implementations (scikit-learn, Weka, R packages, etc.) uploaded by the community

Results are pushed to an Apify dataset in clean, flat JSON format, ready for analysis, filtering, or export to CSV/Excel.

Who is it for?

ML researchers who want to browse and discover datasets for benchmarking without clicking through the OpenML web UI. Filter by name, status, or type and get all metadata in a single structured output.

AutoML engineers building dataset recommendation systems or experiment tracking pipelines. Use the scraper to programmatically catalog available benchmark datasets and their properties.

Data scientists who need to audit which OpenML datasets match their constraints (minimum features, instances, classes) for reproducible research.

Platform builders creating dataset directories or ML curriculum tools who need a machine-readable catalog of public benchmark datasets.

Students and educators exploring the landscape of ML datasets for teaching purposes. Quickly find datasets by name, size, or domain tag.

Why use it?

OpenML's REST API is public and powerful, but integrating it into workflows requires building custom fetch/pagination/normalization code. This actor handles all of that:

  • ✅ Pagination built-in: fetches all matching results up to your maxResults limit, automatically handling page offsets
  • ✅ Rich metadata: goes beyond the list API to fetch full dataset descriptions, upload dates, licence info, and download URLs
  • ✅ Quality metrics extracted: flattens the nested quality array into named fields (numberOfFeatures, numberOfInstances, etc.)
  • ✅ No auth needed: OpenML's public API requires no API key
  • ✅ Retry logic: configurable retry count for transient failures
  • ✅ Clean flat output: no nested objects, ready for Apify datasets table view and CSV export
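
The pagination and quality-flattening steps listed above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual source. The record shape (a `did` key and a `quality` list of name/value pairs), the helper names `flatten_dataset` and `paginate`, and the page size of 1000 are all assumptions made for the example.

```python
from typing import Callable, Iterator

# Assumed mapping from OpenML quality names to the flat output field names.
QUALITY_FIELDS = {
    "NumberOfFeatures": "numberOfFeatures",
    "NumberOfInstances": "numberOfInstances",
    "NumberOfClasses": "numberOfClasses",
    "NumberOfMissingValues": "numberOfMissingValues",
}


def flatten_dataset(record: dict) -> dict:
    """Turn one list-API record (assumed shape) into a flat dict."""
    flat = {
        "id": int(record["did"]),
        "name": record["name"],
        "status": record.get("status"),
    }
    for quality in record.get("quality", []):
        field = QUALITY_FIELDS.get(quality["name"])
        if field is not None:
            # Values are assumed to arrive as strings such as "150.0".
            flat[field] = int(float(quality["value"]))
    return flat


def paginate(fetch_page: Callable[[int, int], list], max_results: int,
             page_size: int = 1000) -> Iterator[dict]:
    """Yield records page by page until max_results or a short page."""
    offset = 0
    while offset < max_results:
        limit = min(page_size, max_results - offset)
        page = fetch_page(offset, limit)
        yield from page
        if len(page) < limit:
            break  # short page: the server has no more records
        offset += len(page)
```

Here `fetch_page(offset, limit)` stands in for whatever HTTP call returns one page of records; keeping it injectable makes the loop easy to exercise without network access.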

Data extracted

Datasets

Field | Description
id | OpenML dataset ID
name | Dataset name
version | Dataset version number
status | active, deactivated, or in_preparation
format | File format (ARFF, CSV, etc.)
url | OpenML dataset page URL
downloadUrl | Direct ARFF file download URL
numberOfFeatures | Total number of attributes/columns
numberOfInstances | Total number of rows/samples
numberOfClasses | Number of target classes (classification datasets)
numberOfMissingValues | Count of missing values across all cells
uploadDate | When the dataset was uploaded
description | Dataset description (up to 500 chars)
licence | Licence (Public, CC BY, etc.)
defaultTargetAttribute | Default prediction target column name
tags | Array of tags (domain, study labels, source)

Tasks

Field | Description
id | Task ID
name | Task name (usually dataset name)
taskType | Task type (Supervised Classification, Supervised Regression, etc.)
taskTypeId | Numeric task type ID
datasetId | Source dataset ID
status | Task status
targetFeature | Target column to predict
estimationProcedure | Cross-validation procedure ID
evaluationMeasures | Primary evaluation metric
numberOfFeatures | Features in the underlying dataset
numberOfInstances | Instances in the underlying dataset
url | OpenML task page URL

Flows

Field | Description
id | Flow ID
name | Flow name (e.g., sklearn.ensemble.forest.RandomForestClassifier)
fullName | Full name with version (e.g., sklearn...RandomForestClassifier(8))
version | Flow version number
externalVersion | External library version tag
uploaderId | User ID of the uploader
url | OpenML flow page URL

How much does it cost to scrape OpenML datasets?

💡 Free plan estimate: ~100 free results per month on the Apify Free plan.

The actor uses Pay-Per-Event (PPE) pricing:

Event | BRONZE | SILVER | GOLD | PLATINUM | DIAMOND
Run started | flat fee | flat fee | flat fee | flat fee | flat fee
Per result | ~$0.000029 | ~$0.0000225 | ~$0.0000173 | ~$0.0000115 | ~$0.0000081

Example costs:

  • 100 datasets: ~$0.008
  • 500 datasets: ~$0.019
  • 1,000 datasets: ~$0.034

OpenML has ~6,000 active datasets, ~100,000 tasks, and ~20,000 flows. A full catalog export at BRONZE pricing costs ~$0.18–$2.89 depending on resource type.
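
The arithmetic behind these estimates can be reproduced with a small calculation. This is a sketch only: the pricing table does not state the flat run-start fee, so the $0.005 used here is an assumption inferred from the example costs, and `estimate_cost` is a hypothetical helper rather than part of the actor.

```python
# Per-result prices in USD, copied from the pricing table.
PER_RESULT_USD = {
    "BRONZE": 0.000029,
    "SILVER": 0.0000225,
    "GOLD": 0.0000173,
    "PLATINUM": 0.0000115,
    "DIAMOND": 0.0000081,
}

# Assumption: the flat "run started" fee is not listed in this README.
# 0.005 USD is inferred from the example costs and may be wrong.
ASSUMED_START_FEE_USD = 0.005


def estimate_cost(results: int, tier: str = "BRONZE",
                  start_fee: float = ASSUMED_START_FEE_USD) -> float:
    """Estimate one run's cost: flat start fee plus per-result charges."""
    if results < 0:
        raise ValueError("results must be non-negative")
    return start_fee + results * PER_RESULT_USD[tier]


for n in (100, 500, 1000):
    print(f"{n} results at BRONZE: ~${estimate_cost(n):.4f}")
```

Check the actor's pricing tab for the actual start fee before relying on these numbers.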

How to use it

Step 1: Choose your resource type

Select whether you want Datasets, Tasks, or Flows from the "What to scrape" section.

Step 2: Filter (optional)

For datasets, enter a name filter in Search by name (e.g., iris, mnist, breast cancer) and set the Dataset status filter to active.

Step 3: Set a result limit

Set Max results to control how many items to return. Start small (20–50) to preview the output before running a large batch.

Step 4: Run and export

Click Save & Run. Results appear in the Dataset tab. Export to JSON, CSV, or Excel from the Export button.

Input parameters

Parameter | Type | Default | Description
resourceType | string | datasets | What to scrape: datasets, tasks, or flows
searchQuery | string | "" (empty) | Filter by name (datasets: API-side; flows: client-side)
status | string | active | Dataset status: active, deactivated, in_preparation, any
maxResults | integer | 100 | Maximum results to return (1–10,000)
maxRequestRetries | integer | 3 | Retry attempts per failed request
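
To catch typos before starting a paid run, the input can be validated locally against the documented values. The sketch below is an illustration under assumptions: `build_run_input` is a hypothetical helper, and the actor may enforce additional rules that are not listed in the table.

```python
ALLOWED_RESOURCE_TYPES = {"datasets", "tasks", "flows"}
ALLOWED_STATUSES = {"active", "deactivated", "in_preparation", "any"}


def build_run_input(resource_type: str = "datasets", search_query: str = "",
                    status: str = "active", max_results: int = 100,
                    max_request_retries: int = 3) -> dict:
    """Check values against the documented options and return the input dict."""
    if resource_type not in ALLOWED_RESOURCE_TYPES:
        raise ValueError(f"resourceType must be one of {sorted(ALLOWED_RESOURCE_TYPES)}")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    if not 1 <= max_results <= 10_000:
        raise ValueError("maxResults must be between 1 and 10,000")
    if max_request_retries < 0:
        raise ValueError("maxRequestRetries must not be negative")
    return {
        "resourceType": resource_type,
        "searchQuery": search_query,
        "status": status,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


print(build_run_input(search_query="iris", max_results=20))
```

The returned dict can be passed as the run input in any of the API clients.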

Output example

{
    "resourceType": "dataset",
    "id": 61,
    "name": "iris",
    "version": 1,
    "status": "active",
    "format": "ARFF",
    "url": "https://www.openml.org/d/61",
    "downloadUrl": "https://openml.org/data/v1/download/61/iris.arff",
    "numberOfFeatures": 5,
    "numberOfInstances": 150,
    "numberOfClasses": 3,
    "numberOfMissingValues": 0,
    "uploadDate": "2014-04-06T23:23:39",
    "description": "Fisher's Iris Plants Database...",
    "licence": "Public",
    "defaultTargetAttribute": "class",
    "tags": ["Botany", "Machine Learning", "uci"]
}

Tips for best results

  • 🔍 Name search matches by prefix for datasets: searching for iris returns iris, iris-2, etc. Use short, common dataset names.
  • ⚙️ Flow search is a substring match: searching for sklearn matches any flow whose name contains sklearn. It scans all flows (up to 20,000), which typically takes ~30–60 seconds.
  • 📊 Use status: any to include deactivated and in-preparation datasets in your catalog.
  • ⚡ Set maxResults to 100 for quick previews. For full catalogs, set it to 10,000 and expect 2–5 minutes of runtime.
  • 🔄 Tasks don't support name filtering: all tasks are returned in order of task ID. Filter by task type in your downstream pipeline.
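
Since tasks cannot be filtered by the actor, a downstream step along these lines can do it. This is a minimal sketch that assumes the items carry the `taskType` field described in the Tasks table; the sample rows are made up for illustration and are not real OpenML records.

```python
from collections import Counter


def count_task_types(tasks: list) -> Counter:
    """Count how many tasks exist per task type."""
    return Counter(task.get("taskType", "unknown") for task in tasks)


def tasks_of_type(tasks: list, task_type: str) -> list:
    """Keep only tasks whose taskType matches exactly."""
    return [task for task in tasks if task.get("taskType") == task_type]


# Made-up sample rows shaped like the actor's task output.
sample_tasks = [
    {"id": 1, "name": "example-a", "taskType": "Supervised Classification"},
    {"id": 2, "name": "example-b", "taskType": "Supervised Regression"},
    {"id": 3, "name": "example-c", "taskType": "Supervised Classification"},
]

print(count_task_types(sample_tasks))
print(tasks_of_type(sample_tasks, "Supervised Regression"))
```

Counting first is a cheap way to see which task type strings actually occur before hard-coding one in a filter.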

Integrations

🔗 Export to Google Sheets

Use the Google Sheets integration to automatically push extracted datasets to a spreadsheet for collaborative review or ML experiment planning.

📊 Connect to Power BI or Tableau

Export the dataset as CSV from the Apify console and import it into your BI tool to build dashboards comparing dataset sizes, feature counts, and class distributions.

🤖 AutoML pipeline seeding

Run this actor on a schedule to keep a local database of OpenML datasets fresh. Use the dataset list to auto-select benchmark datasets for your AutoML framework's evaluation suite.
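
One way to keep such a local database fresh is to write each run's items into SQLite keyed by dataset ID, so repeated runs overwrite stale rows instead of duplicating them. This is a sketch under assumptions: the table name, column names, and the `upsert_datasets` helper are hypothetical, and only a few of the output fields are stored.

```python
import json
import sqlite3


def upsert_datasets(conn: sqlite3.Connection, items: list) -> None:
    """Insert or overwrite one row per dataset, keyed by OpenML dataset ID."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS openml_datasets (
               id INTEGER PRIMARY KEY,
               name TEXT NOT NULL,
               version INTEGER,
               number_of_instances INTEGER,
               number_of_features INTEGER,
               tags TEXT
           )"""
    )
    rows = [
        (
            item["id"],
            item["name"],
            item.get("version"),
            item.get("numberOfInstances"),
            item.get("numberOfFeatures"),
            json.dumps(item.get("tags", [])),  # tags is an array, stored as JSON text
        )
        for item in items
    ]
    # INSERT OR REPLACE overwrites the existing row with the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO openml_datasets VALUES (?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path for a persistent catalog
upsert_datasets(conn, [{"id": 61, "name": "iris", "version": 1,
                        "numberOfInstances": 150, "numberOfFeatures": 5,
                        "tags": ["uci"]}])
print(conn.execute("SELECT id, name FROM openml_datasets").fetchall())
```

The `items` list is what the API clients return from the run's default dataset.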

🔔 Monitor new datasets via webhook

Configure an Apify webhook to trigger your downstream pipeline whenever new datasets matching your filter are found. Useful for ML research groups that want to stay current with new public benchmarks.

API usage

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/openml-scraper').call({
    resourceType: 'datasets',
    searchQuery: 'mnist',
    status: 'active',
    maxResults: 50,
});

// Read the results from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python

from apify_client import ApifyClient

client = ApifyClient(token="YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("automation-lab/openml-scraper").call(run_input={
    "resourceType": "datasets",
    "searchQuery": "mnist",
    "status": "active",
    "maxResults": 50,
})

# Iterate over the results stored in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

cURL

curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~openml-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "resourceType": "datasets",
    "searchQuery": "iris",
    "status": "active",
    "maxResults": 10
  }'

Use with Claude and MCP (AI agent access)

This actor is available as an MCP (Model Context Protocol) tool, letting AI agents like Claude query OpenML datasets directly in conversation.

Claude Code (terminal)

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/openml-scraper"

Claude Desktop / Cursor / VS Code

Add to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "type": "http",
            "url": "https://mcp.apify.com?tools=automation-lab/openml-scraper",
            "headers": {
                "Authorization": "Bearer YOUR_API_TOKEN"
            }
        }
    }
}

Example prompts for Claude:

  • "Find all active OpenML datasets with 'breast cancer' in the name"
  • "Get 100 OpenML benchmark datasets with at least 1000 instances"
  • "List the first 20 supervised classification tasks on OpenML"
  • "Find all scikit-learn algorithm flows on OpenML"

Legality and terms of service

OpenML data is publicly available under the OpenML terms of service. The datasets themselves are shared under various open licences (Public Domain, CC BY, etc.), which are included in the licence field. This actor only accesses the public REST API through documented endpoints and does not scrape HTML content. Commercial use of the data depends on individual dataset licences.

FAQ

Q: Why does the flow search take a long time?
A: OpenML's API doesn't support server-side name filtering for flows. The actor paginates through all flows and filters client-side. With 20,000+ flows, this can take 30–120 seconds. For fast results on flows, set maxResults to 50–100 and omit the searchQuery to get the latest flows by ID.

Q: The actor returned fewer results than my maxResults. Why?
A: OpenML may not have that many resources matching your filter. For example, searching for iris as a dataset name returns ~5 datasets (multiple versions). This is expected behavior.

Q: How do I get the actual dataset file (ARFF/CSV)?
A: Each result includes a downloadUrl field with the direct ARFF download link. You can load the downloaded file in your ML framework (e.g., arff.load() in Python), or fetch the dataset by its id with the OpenML Python client.

Q: Can I filter datasets by minimum number of instances or features?
A: Not directly via the actor input. Run the actor with no filter to get all datasets, then filter in your downstream pipeline using the numberOfInstances and numberOfFeatures fields.
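
A downstream filter of this kind might look like the sketch below. `filter_datasets` is a hypothetical helper, and treating items with a missing quality field as non-matching is a design assumption, not documented actor behavior.

```python
def filter_datasets(items: list, min_instances: int = 0,
                    min_features: int = 0, min_classes: int = 0) -> list:
    """Keep items that meet every minimum; a minimum of 0 disables that check."""

    def at_least(item: dict, field: str, minimum: int) -> bool:
        if minimum <= 0:
            return True
        value = item.get(field)
        # Items that lack the field cannot be shown to satisfy the constraint.
        return value is not None and value >= minimum

    return [
        item for item in items
        if at_least(item, "numberOfInstances", min_instances)
        and at_least(item, "numberOfFeatures", min_features)
        and at_least(item, "numberOfClasses", min_classes)
    ]


iris = {"id": 61, "name": "iris", "numberOfInstances": 150,
        "numberOfFeatures": 5, "numberOfClasses": 3}
print(filter_datasets([iris], min_instances=1000))  # iris has only 150 rows
```

Regression datasets may have no meaningful class count, so leave `min_classes` at 0 unless you only want classification datasets.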

Q: The description field is truncated. Can I get the full description?
A: The description is truncated at 500 characters to keep dataset sizes manageable. OpenML descriptions can be several kilobytes of text. If you need full descriptions, use the id field to call https://www.openml.org/api/v1/json/data/{id} directly.
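
Fetching the full description for one dataset could be done with the standard library alone, as in this sketch. The response layout (a top-level `data_set_description` object containing `description`) is an assumption about the OpenML API and should be verified against a live response; `full_description` and `fetch_json` are hypothetical helpers.

```python
import json
import urllib.request

API_URL = "https://www.openml.org/api/v1/json/data/{id}"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """GET a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def full_description(dataset_id: int, fetch=fetch_json) -> str:
    """Return the untruncated description for one dataset ID."""
    payload = fetch(API_URL.format(id=dataset_id))
    # Assumption: the description sits under "data_set_description".
    return payload.get("data_set_description", {}).get("description", "")


if __name__ == "__main__":
    # Performs a real HTTP request to OpenML.
    print(full_description(61)[:200])
```

The `fetch` parameter is injectable so the parsing logic can be tested offline or wrapped with your own retry handling.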