OpenML Dataset Scraper
Scrape ML datasets, tasks, and flows from OpenML, the open science platform for machine learning
Pricing
Pay per event
Developer
Stas Persiianenko
Maintained by Community
OpenML Scraper
Extract ML datasets, benchmark tasks, and algorithm flows from OpenML, the open science platform for machine learning. Get structured metadata for thousands of public ML benchmark datasets including feature counts, instance counts, class distributions, quality metrics, tags, download URLs, and more.
No API key required. No proxy needed. Pure HTTP access to OpenML's public REST API.
What does it do?
OpenML Scraper connects to the OpenML public REST API (openml.org/api/v1/json) and extracts structured data for three resource types:
- Datasets: ML benchmark datasets with quality metrics (features, instances, classes, missing values), tags, descriptions, download URLs, and metadata
- Tasks: Supervised classification and regression tasks defining evaluation procedures and target attributes
- Flows: Algorithm and pipeline implementations (scikit-learn, Weka, R packages, etc.) uploaded by the community
Results are pushed to an Apify dataset in clean, flat JSON format, ready for analysis, filtering, or export to CSV/Excel.
Who is it for?
ML researchers who want to browse and discover datasets for benchmarking without clicking through the OpenML web UI. Filter by name, status, or type and get all metadata in a single structured output.
AutoML engineers building dataset recommendation systems or experiment tracking pipelines. Use the scraper to programmatically catalog available benchmark datasets and their properties.
Data scientists who need to audit which OpenML datasets match their constraints (minimum features, instances, classes) for reproducible research.
Platform builders creating dataset directories or ML curriculum tools who need a machine-readable catalog of public benchmark datasets.
Students and educators exploring the landscape of ML datasets for teaching purposes, who can quickly find datasets by name, size, or domain tag.
Why use it?
OpenML's REST API is public and powerful, but integrating it into workflows requires building custom fetch/pagination/normalization code. This actor handles all of that:
- Pagination built-in: fetches all matching results up to your `maxResults` limit, automatically handling page offsets
- Rich metadata: goes beyond the list API to fetch full dataset descriptions, upload dates, licence info, and download URLs
- Quality metrics extracted: flattens the nested `quality` array into named fields (`numberOfFeatures`, `numberOfInstances`, etc.)
- No auth needed: OpenML's public API requires no API key
- Retry logic: configurable retry count for transient failures
- Clean flat output: no nested objects, ready for the Apify dataset table view and CSV export
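To make the pagination and quality-flattening points concrete, here is a minimal Python sketch of both ideas. It is not the actor's actual code: the helper names, the page size of 1,000, and the shape of `SAMPLE_RECORD` (keys such as `did` and `quality`) are assumptions about what an OpenML list entry looks like.

```python
from typing import Any, Dict, Iterator, Tuple

# ASSUMPTION: illustrative shape of one entry in an OpenML dataset list
# response; the real keys and value types may differ.
SAMPLE_RECORD: Dict[str, Any] = {
    "did": 61,
    "name": "iris",
    "status": "active",
    "quality": [
        {"name": "NumberOfFeatures", "value": "5.0"},
        {"name": "NumberOfInstances", "value": "150.0"},
        {"name": "NumberOfClasses", "value": "3.0"},
    ],
}


def page_windows(max_results: int, page_size: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs that together cover max_results items."""
    offset = 0
    while offset < max_results:
        limit = min(page_size, max_results - offset)
        yield offset, limit
        offset += limit


def flatten_qualities(record: Dict[str, Any]) -> Dict[str, Any]:
    """Copy a record, replacing the nested quality list with flat fields."""
    flat = {key: value for key, value in record.items() if key != "quality"}
    for quality in record.get("quality", []):
        name = quality["name"]
        field = name[0].lower() + name[1:]  # NumberOfFeatures -> numberOfFeatures
        number = float(quality["value"])
        # Keep whole numbers as ints so counts do not show up as 5.0.
        flat[field] = int(number) if number.is_integer() else number
    return flat


if __name__ == "__main__":
    print(list(page_windows(2500)))
    print(flatten_qualities(SAMPLE_RECORD))
```

Each `(offset, limit)` pair would map to one list request, and the last window is shortened so the total never exceeds the requested maximum.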
Data extracted
Datasets
| Field | Description |
|---|---|
| `id` | OpenML dataset ID |
| `name` | Dataset name |
| `version` | Dataset version number |
| `status` | `active`, `deactivated`, or `in_preparation` |
| `format` | File format (ARFF, CSV, etc.) |
| `url` | OpenML dataset page URL |
| `downloadUrl` | Direct ARFF file download URL |
| `numberOfFeatures` | Total number of attributes/columns |
| `numberOfInstances` | Total number of rows/samples |
| `numberOfClasses` | Number of target classes (classification datasets) |
| `numberOfMissingValues` | Count of missing values across all cells |
| `uploadDate` | When the dataset was uploaded |
| `description` | Dataset description (up to 500 chars) |
| `licence` | Licence (Public, CC BY, etc.) |
| `defaultTargetAttribute` | Default prediction target column name |
| `tags` | Array of tags (domain, study labels, source) |
Tasks
| Field | Description |
|---|---|
| `id` | Task ID |
| `name` | Task name (usually the dataset name) |
| `taskType` | Task type (Supervised Classification, Supervised Regression, etc.) |
| `taskTypeId` | Numeric task type ID |
| `datasetId` | Source dataset ID |
| `status` | Task status |
| `targetFeature` | Target column to predict |
| `estimationProcedure` | Cross-validation procedure ID |
| `evaluationMeasures` | Primary evaluation metric |
| `numberOfFeatures` | Features in the underlying dataset |
| `numberOfInstances` | Instances in the underlying dataset |
| `url` | OpenML task page URL |
Flows
| Field | Description |
|---|---|
| `id` | Flow ID |
| `name` | Flow name (e.g., `sklearn.ensemble.forest.RandomForestClassifier`) |
| `fullName` | Full name with version (e.g., `sklearn...RandomForestClassifier(8)`) |
| `version` | Flow version number |
| `externalVersion` | External library version tag |
| `uploaderId` | User ID of the uploader |
| `url` | OpenML flow page URL |
How much does it cost to scrape OpenML datasets?
Free plan estimate: ~100 free results per month on the Apify Free plan.
The actor uses Pay-Per-Event (PPE) pricing:
| Event | BRONZE | SILVER | GOLD | PLATINUM | DIAMOND |
|---|---|---|---|---|---|
| Run started | flat fee | flat fee | flat fee | flat fee | flat fee |
| Per result | ~$0.000029 | ~$0.0000225 | ~$0.0000173 | ~$0.0000115 | ~$0.0000081 |
Example costs:
- 100 datasets: ~$0.008
- 500 datasets: ~$0.019
- 1,000 datasets: ~$0.034
OpenML has ~6,000 active datasets, ~100,000 tasks, and ~20,000 flows. A full catalog export at BRONZE pricing costs ~$0.18 to $2.89 depending on resource type.
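The arithmetic behind these figures can be sketched in Python as a flat start fee plus a per-result charge. The per-result prices are copied from the table above. The start fee is not stated in this README, so the `0.005` value is an assumption inferred from the example costs, and `estimate_cost` is a hypothetical helper, not part of the actor.

```python
# Per-result prices in USD, copied from the pricing table above.
PER_RESULT_USD = {
    "BRONZE": 0.000029,
    "SILVER": 0.0000225,
    "GOLD": 0.0000173,
    "PLATINUM": 0.0000115,
    "DIAMOND": 0.0000081,
}

# ASSUMPTION: the table only says "flat fee". This value is back-calculated
# from the example costs and may not match the real run-start fee.
ASSUMED_START_FEE_USD = 0.005


def estimate_cost(
    n_results: int,
    tier: str = "BRONZE",
    start_fee: float = ASSUMED_START_FEE_USD,
) -> float:
    """Estimate the cost of one run: start fee plus per-result charges."""
    if n_results < 0:
        raise ValueError("n_results must be non-negative")
    return start_fee + n_results * PER_RESULT_USD[tier]


if __name__ == "__main__":
    for n in (100, 500, 1000):
        print(f"{n} results at BRONZE: ~${estimate_cost(n):.4f}")
```

Under that assumed start fee, the BRONZE estimates land close to the example costs listed above. Check the actor's pricing tab for the actual fee before budgeting a large run.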
How to use it
Step 1: Choose your resource type
Select whether you want Datasets, Tasks, or Flows from the "What to scrape" section.
Step 2: Filter (optional)
For datasets, enter a name filter in Search by name (e.g., `iris`, `mnist`, `breast cancer`) and set the Dataset status filter to `active`.
Step 3: Set a result limit
Set Max results to control how many items to return. Start small (20 to 50) to preview the output before running a large batch.
Step 4: Run and export
Click Save & Run. Results appear in the Dataset tab. Export to JSON, CSV, or Excel from the Export button.
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `resourceType` | string | `datasets` | What to scrape: `datasets`, `tasks`, or `flows` |
| `searchQuery` | string | `""` | Filter by name (datasets: API-side; flows: client-side) |
| `status` | string | `active` | Dataset status: `active`, `deactivated`, `in_preparation`, `any` |
| `maxResults` | integer | `100` | Maximum results to return (1 to 10,000) |
| `maxRequestRetries` | integer | `3` | Retry attempts per failed request |
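When building the input programmatically, it can help to validate it against the table above before starting a run. The sketch below uses a hypothetical helper, `build_run_input`, which is not part of the actor or the Apify client. The allowed values and defaults come from the table, and the rule that retries must be non-negative is an assumption.

```python
from typing import Any, Dict

ALLOWED_RESOURCE_TYPES = {"datasets", "tasks", "flows"}
ALLOWED_STATUSES = {"active", "deactivated", "in_preparation", "any"}


def build_run_input(
    resource_type: str = "datasets",
    search_query: str = "",
    status: str = "active",
    max_results: int = 100,
    max_request_retries: int = 3,
) -> Dict[str, Any]:
    """Return an actor input dict, raising ValueError on out-of-range values."""
    if resource_type not in ALLOWED_RESOURCE_TYPES:
        raise ValueError(f"resourceType must be one of {sorted(ALLOWED_RESOURCE_TYPES)}")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    if not 1 <= max_results <= 10_000:
        raise ValueError("maxResults must be between 1 and 10,000")
    # ASSUMPTION: the table gives no range for retries; negative values are rejected.
    if max_request_retries < 0:
        raise ValueError("maxRequestRetries must be non-negative")
    return {
        "resourceType": resource_type,
        "searchQuery": search_query,
        "status": status,
        "maxResults": max_results,
        "maxRequestRetries": max_request_retries,
    }


if __name__ == "__main__":
    print(build_run_input(search_query="iris", max_results=10))
```

The returned dict can be passed as `run_input` to the Python client call shown in the API usage section.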
Output example
```json
{
  "resourceType": "dataset",
  "id": 61,
  "name": "iris",
  "version": 1,
  "status": "active",
  "format": "ARFF",
  "url": "https://www.openml.org/d/61",
  "downloadUrl": "https://openml.org/data/v1/download/61/iris.arff",
  "numberOfFeatures": 5,
  "numberOfInstances": 150,
  "numberOfClasses": 3,
  "numberOfMissingValues": 0,
  "uploadDate": "2014-04-06T23:23:39",
  "description": "Fisher's Iris Plants Database...",
  "licence": "Public",
  "defaultTargetAttribute": "class",
  "tags": ["Botany", "Machine Learning", "uci"]
}
```
Tips for best results
- Name search is exact-prefix for datasets: a search for `iris` returns `iris`, `iris-2`, etc. Use short, common dataset names.
- Flow search is a substring match: searching for `sklearn` matches any flow whose name contains `sklearn`. It scans all flows (up to 20,000), which takes ~30 to 60 seconds.
- Use `status: any` to include deactivated and in-preparation datasets in your catalog.
- Set `maxResults` to 100 for quick previews. For full catalogs, set it to 10,000 and expect 2 to 5 minutes of runtime.
- Tasks don't support name filtering: all tasks are returned in order of task ID. Filter by task type in your downstream pipeline.
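Because tasks cannot be filtered by name, the task-type filter has to happen after the run. A minimal Python sketch of that downstream step follows. The two sample items are hypothetical and only use field names from the Tasks table, and both helper functions are illustrative.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical task items, shaped like the Tasks output table.
SAMPLE_TASKS: List[Dict[str, Any]] = [
    {"id": 1, "name": "example-a", "taskType": "Supervised Classification"},
    {"id": 2, "name": "example-b", "taskType": "Supervised Regression"},
    {"id": 3, "name": "example-c", "taskType": "Supervised Classification"},
]


def count_task_types(items: List[Dict[str, Any]]) -> Counter:
    """Count how many items fall under each taskType value."""
    return Counter(item.get("taskType") for item in items)


def filter_by_task_type(items: List[Dict[str, Any]], task_type: str) -> List[Dict[str, Any]]:
    """Keep only items whose taskType matches exactly."""
    return [item for item in items if item.get("taskType") == task_type]


if __name__ == "__main__":
    print(count_task_types(SAMPLE_TASKS))
    print(filter_by_task_type(SAMPLE_TASKS, "Supervised Regression"))
```

Counting the task types first is a quick way to see which exact `taskType` strings appear in a run before filtering on one of them.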
Integrations
Export to Google Sheets
Use the Google Sheets integration to automatically push extracted datasets to a spreadsheet for collaborative review or ML experiment planning.
Connect to Power BI or Tableau
Export the dataset as CSV from the Apify console and import it into your BI tool to build dashboards comparing dataset sizes, feature counts, and class distributions.
AutoML pipeline seeding
Run this actor on a schedule to keep a local database of OpenML datasets fresh. Use the dataset list to auto-select benchmark datasets for your AutoML framework's evaluation suite.
Monitor new datasets via webhook
Configure an Apify webhook to trigger your downstream pipeline whenever new datasets matching your filter are found. Useful for ML research groups that want to stay current with new public benchmarks.
API usage
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/openml-scraper').call({
    resourceType: 'datasets',
    searchQuery: 'mnist',
    status: 'active',
    maxResults: 50,
});

// Read the results from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient(token="YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("automation-lab/openml-scraper").call(
    run_input={
        "resourceType": "datasets",
        "searchQuery": "mnist",
        "status": "active",
        "maxResults": 50,
    }
)

# Iterate over the results in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
cURL
```shell
curl -X POST \
  "https://api.apify.com/v2/acts/automation-lab~openml-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"resourceType": "datasets", "searchQuery": "iris", "status": "active", "maxResults": 10}'
```
Use with Claude and MCP (AI agent access)
This actor is available as an MCP (Model Context Protocol) tool, letting AI agents like Claude query OpenML datasets directly in conversation.
Claude Code (terminal)
```shell
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/openml-scraper"
```
Claude Desktop / Cursor / VS Code
Add to your MCP config file:
```json
{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/openml-scraper",
      "headers": {
        "Authorization": "Bearer YOUR_API_TOKEN"
      }
    }
  }
}
```
Example prompts for Claude:
- "Find all active OpenML datasets with 'breast cancer' in the name"
- "Get 100 OpenML benchmark datasets with at least 1000 instances"
- "List the first 20 supervised classification tasks on OpenML"
- "Find all scikit-learn algorithm flows on OpenML"
Legality and terms of service
OpenML data is publicly available under the OpenML terms of service. The datasets themselves are shared under various open licences (Public Domain, CC BY, etc.), which are included in the `licence` field. This actor only accesses the public REST API using documented endpoints and does not scrape HTML content. Commercial use of the data depends on individual dataset licences.
FAQ
Q: Why does the flow search take a long time?
A: OpenML's API doesn't support server-side name filtering for flows. The actor paginates through all flows and filters client-side. With 20,000+ flows, this can take 30 to 120 seconds. For fast results on flows, set `maxResults` to 50 to 100 and omit the `searchQuery` to get the latest flows by ID.
Q: The actor returned fewer results than my maxResults β why?
A: OpenML may not have that many resources matching your filter. For example, searching for iris as a dataset name returns ~5 datasets (multiple versions). This is expected behavior.
Q: How do I get the actual dataset file (ARFF/CSV)?
A: Each result includes a `downloadUrl` field with the direct ARFF download link. You can download that file and parse it in your ML framework (e.g., with `arff.load()` from the liac-arff package in Python), or load the dataset by its `id` with the OpenML Python client.
Q: Can I filter datasets by minimum number of instances or features?
A: Not directly via the actor input. Run the actor with no filter to get all datasets, then filter in your downstream pipeline using the numberOfInstances and numberOfFeatures fields.
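A minimal Python sketch of that downstream filter is shown below. `meets_constraints` is a hypothetical helper and the second sample item is made up for illustration. Only the field names come from the Datasets table. Items that lack a constrained field are treated as not matching, which is a design assumption.

```python
from typing import Any, Dict, List, Optional

# The first item mirrors the output example; the second is hypothetical.
SAMPLE_ITEMS: List[Dict[str, Any]] = [
    {"id": 61, "name": "iris", "numberOfInstances": 150, "numberOfFeatures": 5, "numberOfClasses": 3},
    {"id": -1, "name": "example-large", "numberOfInstances": 70000, "numberOfFeatures": 785, "numberOfClasses": 10},
]


def meets_constraints(
    item: Dict[str, Any],
    min_instances: Optional[int] = None,
    min_features: Optional[int] = None,
    min_classes: Optional[int] = None,
) -> bool:
    """Return True if the item satisfies every minimum that was given."""
    constraints = {
        "numberOfInstances": min_instances,
        "numberOfFeatures": min_features,
        "numberOfClasses": min_classes,
    }
    for field, minimum in constraints.items():
        if minimum is None:
            continue  # no constraint on this field
        value = item.get(field)
        # ASSUMPTION: a missing or null metric counts as a failed constraint.
        if value is None or value < minimum:
            return False
    return True


if __name__ == "__main__":
    selected = [item for item in SAMPLE_ITEMS if meets_constraints(item, min_instances=1000)]
    print([item["name"] for item in selected])
```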
Q: The description field is truncated β can I get the full description?
A: The description is truncated at 500 characters to keep dataset sizes manageable. OpenML descriptions can be several kilobytes of text. If you need full descriptions, use the id field to call https://www.openml.org/api/v1/json/data/{id} directly.
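As a rough Python sketch of that direct call, the snippet below builds the URL from the answer above and pulls the description out of the response. The `data_set_description` wrapper key is an assumption about the response shape, and the helper names are hypothetical, so inspect a real response before relying on it.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def description_url(dataset_id: int) -> str:
    """Build the per-dataset endpoint URL mentioned in the FAQ answer."""
    return f"https://www.openml.org/api/v1/json/data/{dataset_id}"


def extract_description(payload: Dict[str, Any]) -> str:
    """Pull the description text out of a parsed response, or return ''."""
    # ASSUMPTION: the JSON body wraps the fields in "data_set_description".
    return payload.get("data_set_description", {}).get("description", "")


def fetch_full_description(dataset_id: int, timeout: float = 30.0) -> str:
    """Download the dataset record and return its untruncated description."""
    with urlopen(description_url(dataset_id), timeout=timeout) as response:
        return extract_description(json.load(response))


if __name__ == "__main__":
    # Requires network access; 61 is the iris dataset from the output example.
    print(fetch_full_description(61)[:200])
```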
Related scrapers
- ACL Anthology Scraper: scrape NLP/ML research papers from the ACL Anthology
- ArXiv Paper Scraper: extract ML and AI paper metadata from arXiv
