Pricing

Pay per event

OpenML Dataset Scraper

Scrape ML datasets, tasks, and flows from OpenML - the open science platform for machine learning

Developer

Stas Persiianenko

Last modified: 14 days ago

OpenML Scraper

Extract ML datasets, benchmark tasks, and algorithm flows from OpenML — the open science platform for machine learning. Get structured metadata for thousands of public ML benchmark datasets including feature counts, instance counts, class distributions, quality metrics, tags, download URLs, and more.

No API key required. No proxy needed. Pure HTTP access to OpenML's public REST API.

What does it do?

OpenML Scraper connects to the OpenML public REST API (openml.org/api/v1/json) and extracts structured data for three resource types:

📊 Datasets — ML benchmark datasets with quality metrics (features, instances, classes, missing values), tags, descriptions, download URLs, and metadata
🎯 Tasks — Supervised classification and regression tasks defining evaluation procedures and target attributes
⚙️ Flows — Algorithm and pipeline implementations (scikit-learn, Weka, R packages, etc.) uploaded by the community

Results are pushed to an Apify dataset in clean, flat JSON format — ready for analysis, filtering, or export to CSV/Excel.
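
As a rough illustration of the kind of request involved, the sketch below builds list URLs against the base path named above. The path-segment filter layout (`/list/limit/.../offset/...`) and the `data`/`task`/`flow` prefixes are assumptions based on OpenML's public API documentation, and `build_list_url` is a hypothetical helper, not the actor's actual code.

```python
from urllib.parse import quote

BASE_URL = "https://www.openml.org/api/v1/json"  # base path named in the text

# Resource type -> list endpoint prefix (assumed from OpenML's API docs).
LIST_PATHS = {"datasets": "data", "tasks": "task", "flows": "flow"}


def build_list_url(resource_type, limit, offset=0, **filters):
    """Build a list URL such as .../data/list/limit/10/offset/0/status/active."""
    if resource_type not in LIST_PATHS:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    parts = [BASE_URL, LIST_PATHS[resource_type], "list",
             "limit", str(limit), "offset", str(offset)]
    for key, value in filters.items():
        # Filters are path segments, so values are percent-encoded.
        parts += [key, quote(str(value), safe="")]
    return "/".join(parts)


print(build_list_url("datasets", 10, status="active", data_name="iris"))
```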

Who is it for?

ML researchers who want to browse and discover datasets for benchmarking without clicking through the OpenML web UI. Filter by name, status, or type and get all metadata in a single structured output.

AutoML engineers building dataset recommendation systems or experiment tracking pipelines. Use the scraper to programmatically catalog available benchmark datasets and their properties.

Data scientists who need to audit which OpenML datasets match their constraints (minimum features, instances, classes) for reproducible research.

Platform builders creating dataset directories or ML curriculum tools who need a machine-readable catalog of public benchmark datasets.

Students and educators exploring the landscape of ML datasets for teaching purposes — quickly find datasets by name, size, or domain tag.

Why use it?

OpenML's REST API is public and powerful, but integrating it into workflows requires building custom fetch/pagination/normalization code. This actor handles all of that:

✅ Pagination built-in — fetches all matching results up to your maxResults limit, automatically handling page offsets
✅ Rich metadata — goes beyond the list API to fetch full dataset descriptions, upload dates, licence info, and download URLs
✅ Quality metrics extracted — flattens the nested quality array into named fields (numberOfFeatures, numberOfInstances, etc.)
✅ No auth needed — OpenML's public API requires no API key
✅ Retry logic — configurable retry count for transient failures
✅ Clean flat output — no nested objects, ready for Apify datasets table view and CSV export
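
To make the pagination and retry points concrete, here is a minimal sketch of offset-based paging with a retry wrapper. It is not the actor's source: `fetch_page`, `collect`, the page size of 100, and the linear backoff are all illustrative assumptions, and a fake in-memory fetcher stands in for the HTTP call.

```python
import time


def fetch_with_retry(fetch_page, offset, limit, max_retries=3, delay=0.0):
    """Call fetch_page(offset, limit), retrying on exceptions up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fetch_page(offset, limit)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(delay * (attempt + 1))  # simple linear backoff


def collect(fetch_page, max_results, page_size=100, max_retries=3):
    """Walk page offsets until max_results items are collected or a short page arrives."""
    items = []
    offset = 0
    while len(items) < max_results:
        limit = min(page_size, max_results - len(items))
        page = fetch_with_retry(fetch_page, offset, limit, max_retries)
        items.extend(page)
        if len(page) < limit:  # short page: nothing left to fetch
            break
        offset += len(page)
    return items


# Stand-in for an HTTP call: serves slices of 250 fake records.
RECORDS = [{"id": i} for i in range(250)]


def fake_fetch(offset, limit):
    return RECORDS[offset:offset + limit]


print(len(collect(fake_fetch, max_results=120)))   # 120
print(len(collect(fake_fetch, max_results=1000)))  # 250
```

Passing the fetcher in as a callable keeps the paging logic testable without network access.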

Data extracted

Datasets

Field	Description
`id`	OpenML dataset ID
`name`	Dataset name
`version`	Dataset version number
`status`	`active`, `deactivated`, or `in_preparation`
`format`	File format (ARFF, CSV, etc.)
`url`	OpenML dataset page URL
`downloadUrl`	Direct ARFF file download URL
`numberOfFeatures`	Total number of attributes/columns
`numberOfInstances`	Total number of rows/samples
`numberOfClasses`	Number of target classes (classification datasets)
`numberOfMissingValues`	Count of missing values across all cells
`uploadDate`	When the dataset was uploaded
`description`	Dataset description (up to 500 chars)
`licence`	Licence (Public, CC BY, etc.)
`defaultTargetAttribute`	Default prediction target column name
`tags`	Array of tags (domain, study labels, source)
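
The quality fields above come from flattening a nested array. The sketch below shows one way that could work, assuming each raw list entry carries a `quality` array of `{"name", "value"}` pairs with string values such as `"150.0"`. That response shape and the `flatten_qualities` helper are assumptions for illustration, not the actor's confirmed implementation.

```python
# Maps quality names (as assumed to appear in the list response) to output fields.
QUALITY_FIELDS = {
    "NumberOfFeatures": "numberOfFeatures",
    "NumberOfInstances": "numberOfInstances",
    "NumberOfClasses": "numberOfClasses",
    "NumberOfMissingValues": "numberOfMissingValues",
}


def flatten_qualities(raw):
    """Turn [{'name': ..., 'value': ...}, ...] into named integer fields."""
    flat = {field: None for field in QUALITY_FIELDS.values()}
    for quality in raw.get("quality", []):
        field = QUALITY_FIELDS.get(quality.get("name"))
        if field is None:
            continue  # ignore qualities the output does not expose
        try:
            # Values are assumed to arrive as strings such as "150.0".
            flat[field] = int(float(quality["value"]))
        except (KeyError, TypeError, ValueError):
            flat[field] = None  # missing or non-numeric value
    return flat


sample = {
    "did": 61,
    "name": "iris",
    "quality": [
        {"name": "NumberOfFeatures", "value": "5.0"},
        {"name": "NumberOfInstances", "value": "150.0"},
        {"name": "NumberOfClasses", "value": "3.0"},
        {"name": "NumberOfMissingValues", "value": "0.0"},
        {"name": "MajorityClassSize", "value": "50.0"},
    ],
}
print(flatten_qualities(sample))
```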

Tasks

Field	Description
`id`	Task ID
`name`	Task name (usually dataset name)
`taskType`	Task type (Supervised Classification, Supervised Regression, etc.)
`taskTypeId`	Numeric task type ID
`datasetId`	Source dataset ID
`status`	Task status
`targetFeature`	Target column to predict
`estimationProcedure`	Cross-validation procedure ID
`evaluationMeasures`	Primary evaluation metric
`numberOfFeatures`	Features in the underlying dataset
`numberOfInstances`	Instances in the underlying dataset
`url`	OpenML task page URL

Flows

Field	Description
`id`	Flow ID
`name`	Flow name (e.g., `sklearn.ensemble.forest.RandomForestClassifier`)
`fullName`	Full name with version (e.g., `sklearn...RandomForestClassifier(8)`)
`version`	Flow version number
`externalVersion`	External library version tag
`uploaderId`	User ID of the uploader
`url`	OpenML flow page URL

How much does it cost to scrape OpenML datasets?

💡 Free plan estimate: ~100 free results per month on the Apify Free plan.

The actor uses Pay-Per-Event (PPE) pricing:

Event	BRONZE	SILVER	GOLD	PLATINUM	DIAMOND
Run started	flat fee	flat fee	flat fee	flat fee	flat fee
Per result	~$0.000029	~$0.0000225	~$0.0000173	~$0.0000115	~$0.00001

Example costs:

100 datasets: ~$0.008
500 datasets: ~$0.019
1,000 datasets: ~$0.034

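OpenML has ~6,000 active datasets, ~100,000 tasks, and ~20,000 flows. A full catalog export at BRONZE pricing costs ~$0.18–$2.89 depending on resource type. Note that maxResults is capped at 10,000 per run, so only the dataset catalog fits within a single run.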
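
The example costs can be approximated with simple arithmetic. The sketch below uses the BRONZE per-result price from the table. The run-start fee is listed only as a flat fee, so the `ASSUMED_START_FEE` value of about $0.005 is an inference from the example costs, not a published price.

```python
PER_RESULT_BRONZE = 0.000029  # BRONZE per-result price from the pricing table
ASSUMED_START_FEE = 0.005     # assumption: inferred from the example costs


def estimate_cost(results, per_result=PER_RESULT_BRONZE, start_fee=ASSUMED_START_FEE):
    """Estimated run cost in USD: flat start fee plus a per-result charge."""
    return start_fee + results * per_result


for n in (100, 500, 1000, 6000):
    print(f"{n:>6} results: ~${estimate_cost(n):.4f}")
```

Under that assumed fee the estimates land close to the listed examples. Treat them as rough figures and check the actor's pricing tab for current values.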

How to use it

Step 1 — Choose your resource type

Select whether you want Datasets, Tasks, or Flows from the "What to scrape" section.

Step 2 — Filter (optional)

For datasets, enter a name filter in Search by name (e.g., iris, mnist, breast cancer) and set the Dataset status filter to active.

Step 3 — Set a result limit

Set Max results to control how many items to return. Start small (20–50) to preview the output before running a large batch.

Step 4 — Run and export

Click Save & Run. Results appear in the Dataset tab. Export to JSON, CSV, or Excel from the Export button.

Input parameters

Parameter	Type	Default	Description
`resourceType`	string	`datasets`	What to scrape: `datasets`, `tasks`, or `flows`
`searchQuery`	string	(empty)	Filter by name (datasets: API-side; flows: client-side; not supported for tasks)
`status`	string	`active`	Dataset status: `active`, `deactivated`, `in_preparation`, `any`
`maxResults`	integer	`100`	Maximum results to return (1–10,000)
`maxRequestRetries`	integer	`3`	Retry attempts per failed request
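
When calling the actor programmatically, it can help to validate input before starting a paid run. The helper below is a hypothetical sketch: `build_run_input` is not part of any Apify library, and it only encodes the defaults and ranges from the table above (the lower bound of 0 for retries is an assumption).

```python
ALLOWED_RESOURCE_TYPES = {"datasets", "tasks", "flows"}
ALLOWED_STATUSES = {"active", "deactivated", "in_preparation", "any"}


def build_run_input(resource_type="datasets", search_query="", status="active",
                    max_results=100, max_request_retries=3):
    """Return a run-input dict, rejecting values outside the documented ranges."""
    if resource_type not in ALLOWED_RESOURCE_TYPES:
        raise ValueError(f"resourceType must be one of {sorted(ALLOWED_RESOURCE_TYPES)}")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    if not 1 <= max_results <= 10_000:
        raise ValueError("maxResults must be between 1 and 10,000")
    if max_request_retries < 0:  # assumed lower bound
        raise ValueError("maxRequestRetries must not be negative")
    return {
        "resourceType": resource_type,
        "searchQuery": search_query,
        "status": status,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


print(build_run_input(search_query="iris", max_results=20))
```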

Output example

{
  "resourceType": "dataset",
  "id": 61,
  "name": "iris",
  "version": 1,
  "status": "active",
  "format": "ARFF",
  "url": "https://www.openml.org/d/61",
  "downloadUrl": "https://openml.org/data/v1/download/61/iris.arff",
  "numberOfFeatures": 5,
  "numberOfInstances": 150,
  "numberOfClasses": 3,
  "numberOfMissingValues": 0,
  "uploadDate": "2014-04-06T23:23:39",
  "description": "Fisher's Iris Plants Database...",
  "licence": "Public",
  "defaultTargetAttribute": "class",
  "tags": ["Botany", "Machine Learning", "uci"]
}

Tips for best results

🔍 Dataset name search is a prefix match: searching for iris returns iris, iris-2, etc. Use short, common dataset names.
⚙️ Flow search is a substring match: searching for sklearn matches any flow whose name contains sklearn. It scans all flows (roughly 20,000), which can take 30–120 seconds.
📊 Use status: any to include deactivated and in-preparation datasets in your catalog.
⚡ Set maxResults to 100 for quick previews. For full catalogs, set it to 10,000 and expect 2–5 minutes of runtime.
🔄 Tasks don't support name filtering — all tasks are returned in order of task ID. Filter by task type in your downstream pipeline.
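
The difference between the two matching modes, and the downstream task-type filter, can be sketched in a few lines. The helper names are hypothetical and case-insensitive matching is an assumption. The actor's exact matching rules may differ.

```python
def prefix_match(names, query):
    """Names that start with the query (dataset-style search)."""
    q = query.lower()
    return [name for name in names if name.lower().startswith(q)]


def substring_match(names, query):
    """Names that contain the query anywhere (flow-style search)."""
    q = query.lower()
    return [name for name in names if q in name.lower()]


def by_task_type(tasks, task_type):
    """Keep only task records whose taskType equals task_type."""
    return [task for task in tasks if task.get("taskType") == task_type]


names = ["iris", "iris-2", "fisher-iris"]
print(prefix_match(names, "iris"))     # ['iris', 'iris-2']
print(substring_match(names, "iris"))  # ['iris', 'iris-2', 'fisher-iris']

tasks = [
    {"id": 1, "taskType": "Supervised Classification"},
    {"id": 2, "taskType": "Supervised Regression"},
]
print(by_task_type(tasks, "Supervised Regression"))
```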

Integrations

🔗 Export to Google Sheets

Use the Google Sheets integration to automatically push extracted datasets to a spreadsheet for collaborative review or ML experiment planning.

📊 Connect to Power BI or Tableau

Export the dataset as CSV from the Apify console and import it into your BI tool to build dashboards comparing dataset sizes, feature counts, and class distributions.

🤖 AutoML pipeline seeding

Run this actor on a schedule to keep a local database of OpenML datasets fresh. Use the dataset list to auto-select benchmark datasets for your AutoML framework's evaluation suite.

🔔 Monitor new datasets via webhook

Configure an Apify webhook to trigger your downstream pipeline whenever new datasets matching your filter are found. Useful for ML research groups that want to stay current with new public benchmarks.
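
For scheduled runs, a downstream pipeline still has to work out which items are new. One simple approach, sketched below under the assumption that the `id` field is stable across runs, is to compare each run's items against the IDs already stored. `find_new_items` is a hypothetical helper, and persisting the ID set is left to the caller.

```python
def find_new_items(items, seen_ids):
    """Return (items not seen before, updated set of seen ids)."""
    seen = set(seen_ids)
    new_items = [item for item in items if item["id"] not in seen]
    updated = seen | {item["id"] for item in items}
    return new_items, updated


previous_ids = {61, 969}
latest_run = [
    {"id": 61, "name": "iris"},
    {"id": 41510, "name": "iris"},
]
new_items, previous_ids = find_new_items(latest_run, previous_ids)
print(new_items)
print(sorted(previous_ids))
```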

API usage

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/openml-scraper').call({
  resourceType: 'datasets',
  searchQuery: 'mnist',
  status: 'active',
  maxResults: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python

from apify_client import ApifyClient

client = ApifyClient(token="YOUR_API_TOKEN")

run = client.actor("automation-lab/openml-scraper").call(run_input={
    "resourceType": "datasets",
    "searchQuery": "mnist",
    "status": "active",
    "maxResults": 50,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

cURL

curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~openml-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "resourceType": "datasets",
    "searchQuery": "iris",
    "status": "active",
    "maxResults": 10
  }'

Use with Claude and MCP (AI agent access)

This actor is available as an MCP (Model Context Protocol) tool, letting AI agents like Claude query OpenML datasets directly in conversation.

Claude Code (terminal)

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/openml-scraper"

Claude Desktop / Cursor / VS Code

Add to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/openml-scraper",
      "headers": {
        "Authorization": "Bearer YOUR_API_TOKEN"
      }
    }
  }
}

Example prompts for Claude:

"Find all active OpenML datasets with 'breast cancer' in the name"
"Get 100 OpenML benchmark datasets with at least 1000 instances"
"List the first 20 supervised classification tasks on OpenML"
"Find all scikit-learn algorithm flows on OpenML"

Legality and terms of service

OpenML data is publicly available under the OpenML terms of service. The datasets themselves are shared under various open licences (Public Domain, CC BY, etc.) which are included in the licence field. This actor only accesses the public REST API using documented endpoints — no scraping of HTML content. Commercial use of the data depends on individual dataset licences.

FAQ

Q: Why does the flow search take a long time?
A: OpenML's API doesn't support server-side name filtering for flows, so the actor paginates through all flows and filters client-side. With 20,000+ flows, this can take 30–120 seconds. For fast results on flows, set maxResults to 50–100 and omit searchQuery to get the latest flows by ID.

Q: The actor returned fewer results than my maxResults. Why?
A: OpenML may not have that many resources matching your filter. For example, searching for iris as a dataset name returns ~5 datasets (multiple versions). This is expected behavior.

Q: How do I get the actual dataset file (ARFF/CSV)?
A: Each result includes a downloadUrl field with the direct ARFF download link. You can load the file in your ML framework (e.g., arff.load() in Python), or fetch the dataset by its id with the OpenML Python client.

Q: Can I filter datasets by minimum number of instances or features?
A: Not directly via the actor input. Run the actor with no filter to get all datasets, then filter in your downstream pipeline using the numberOfInstances and numberOfFeatures fields.

Q: The description field is truncated. Can I get the full description?
A: The description is truncated at 500 characters to keep dataset sizes manageable, since OpenML descriptions can be several kilobytes of text. If you need full descriptions, use the id field to call https://www.openml.org/api/v1/json/data/{id} directly.
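
For the question above about minimum instance or feature counts, a downstream filter over the exported items might look like the sketch below. `filter_datasets` is a hypothetical helper. It assumes the field names from the Datasets table and treats a missing value as failing any positive minimum.

```python
def filter_datasets(items, min_instances=0, min_features=0, min_classes=0):
    """Keep items meeting every minimum; missing values fail a positive minimum."""
    def at_least(item, field, minimum):
        value = item.get(field)
        return minimum <= 0 or (value is not None and value >= minimum)

    return [
        item for item in items
        if at_least(item, "numberOfInstances", min_instances)
        and at_least(item, "numberOfFeatures", min_features)
        and at_least(item, "numberOfClasses", min_classes)
    ]


items = [
    {"id": 61, "name": "iris", "numberOfInstances": 150,
     "numberOfFeatures": 5, "numberOfClasses": 3},
    {"id": 2, "name": "example-large", "numberOfInstances": 5000,
     "numberOfFeatures": 20, "numberOfClasses": None},
]
print([d["name"] for d in filter_datasets(items, min_instances=1000)])
print([d["name"] for d in filter_datasets(items, min_classes=2)])
```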

ACL Anthology Scraper — scrape NLP/ML research papers from the ACL Anthology
ArXiv Paper Scraper — extract ML and AI paper metadata from arXiv
