Cern Opendata Actor
Pricing
from $0.05 / 1,000 results
Developer

Maksim Kudriavtsev
Maintained by Community
CERN OpenData Harvester
This Actor collects the CERN OpenData catalog while working around the 10,000-result limit: it reads the public sitemaps, converts the record links into API calls and, if necessary, fetches additional records by recid ranges. Normalized data is written to a dataset; raw payloads can optionally be stored in a key-value store.
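The link-to-API conversion step can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the function name `portal_to_api_url` and the regular expression are hypothetical, and the URL shapes are inferred from the `portal_url` and `api_url` fields shown in the output example.

```python
import re
from typing import Optional

# Assumption: record pages listed in the sitemaps have the form
# https://opendata.cern.ch/record/<recid>, as in the portal_url output field.
RECORD_URL = re.compile(r"^https://opendata\.cern\.ch/record/(\d+)/?$")


def portal_to_api_url(portal_url: str) -> Optional[str]:
    """Map a portal record link to its API URL, or return None for other links."""
    match = RECORD_URL.match(portal_url.strip())
    if match is None:
        # Not a record page (for example a service page), so there is no API call.
        return None
    return f"https://opendata.cern.ch/api/records/{match.group(1)}"
```

Links that do not match the record pattern return `None`, so a caller can skip them instead of issuing a request that is likely to fail.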
Input
Defined in .actor/input_schema.json. All fields are optional and have defaults:
- sitemapIndexUrl (string, default https://opendata.cern.ch/sitemap.xml): Root sitemap index.
- enableSitemapScan (boolean, default true): Enable sitemap crawling.
- enableRecidScan (boolean, default true): Enable searching by recid ranges.
- maxWorkers (integer, default 12): Number of parallel downloads of API pages from sitemaps.
- retries (integer, default 5): Number of attempts for requests that fail with an HTTP error, 429, or 5xx.
- recidMax (integer, default 120000): Upper limit of recid for crawling.
- recidStep (integer, default 500): Recid step size for search requests.
- pageSize (integer, default 200): Search API page size.
- skipIds (array of string): List of IDs/slugs to skip (the default includes service pages).
- storeRaw (boolean, default false): Whether to store raw API payloads in the key-value store.
- rawKeyValuePrefix (string, default raw-cern-opendata/): Key prefix in the key-value store when storeRaw=true.
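As an illustration of how these fields fit together, the sketch below merges a partial input with the documented defaults and derives the recid windows implied by recidMax and recidStep. The helper names `resolve_input` and `recid_windows` are hypothetical, and the assumptions that windows start at recid 1 and have inclusive bounds are not confirmed by the source. skipIds is left out because its default list is not given here.

```python
from typing import Any, Dict, Iterator, Tuple

# Defaults copied from the field list above (skipIds omitted: its default is not listed).
DEFAULT_INPUT: Dict[str, Any] = {
    "sitemapIndexUrl": "https://opendata.cern.ch/sitemap.xml",
    "enableSitemapScan": True,
    "enableRecidScan": True,
    "maxWorkers": 12,
    "retries": 5,
    "recidMax": 120000,
    "recidStep": 500,
    "pageSize": 200,
    "storeRaw": False,
    "rawKeyValuePrefix": "raw-cern-opendata/",
}


def resolve_input(user_input: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user-supplied fields on the defaults without mutating either dict."""
    return {**DEFAULT_INPUT, **user_input}


def recid_windows(recid_max: int, recid_step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) recid windows covering 1..recid_max.

    Assumption: recids start at 1 and each search request covers one window.
    """
    if recid_step <= 0:
        raise ValueError("recid_step must be positive")
    start = 1
    while start <= recid_max:
        end = min(start + recid_step - 1, recid_max)
        yield start, end
        start = end + 1


config = resolve_input({"recidMax": 2000, "storeRaw": True})
windows = list(recid_windows(config["recidMax"], config["recidStep"]))
```

Under these assumptions, the default values (recidMax 120000, recidStep 500) give 240 windows, so a smaller recidStep increases the number of search requests proportionally.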
Output
Dataset (default)
Each record is a normalized object:
{
  "id": "12345",
  "title": "Some dataset",
  "type": "dataset",
  "experiment": "CMS",
  "availability": "open",
  "file_count": 4,
  "portal_url": "https://opendata.cern.ch/record/12345",
  "api_url": "https://opendata.cern.ch/api/records/12345",
  "created": "2020-01-01T00:00:00",
  "updated": "2020-05-01T00:00:00",
  "files": [
    { "key": "...", "uri": "...", "size": 123, "availability": "...", "checksum": "...", "version_id": "...", "tags": [...] }
  ],
  "description": "...",
  "keywords": [...],
  "collections": [...],
  "distribution": {...},
  "pids": {...},
  "publisher": "...",
  "language": "...",
  "run_period": "...",
  "source": "sitemap|recid-range",
  "bucket_url": "https://opendata.cern.ch/api/files/<bucket>"
}
The dataset's overview view shows these fields: id, title, type, experiment, availability, file_count, portal_url, api_url, created, updated.
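For downstream processing, a record can be reduced to the same columns as the overview view. The sketch below is one possible way to do that on exported records; the helper names `overview_row` and `total_file_size` are hypothetical and not part of the Actor, and the sample record is made up for illustration.

```python
from typing import Any, Dict

# Field names taken from the overview view described above.
OVERVIEW_FIELDS = (
    "id", "title", "type", "experiment", "availability",
    "file_count", "portal_url", "api_url", "created", "updated",
)


def overview_row(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the overview fields; missing fields become None."""
    return {field: record.get(field) for field in OVERVIEW_FIELDS}


def total_file_size(record: Dict[str, Any]) -> int:
    """Sum the size of all entries in files, treating a missing size as 0."""
    return sum(entry.get("size") or 0 for entry in record.get("files") or [])


# Made-up sample record following the normalized shape shown in the Output section.
sample = {
    "id": "12345",
    "title": "Some dataset",
    "type": "dataset",
    "experiment": "CMS",
    "file_count": 2,
    "files": [{"key": "a.root", "size": 123}, {"key": "b.root", "size": 77}],
    "source": "sitemap",
}

row = overview_row(sample)
size = total_file_size(sample)
```

Using `record.get` keeps the projection tolerant of records that lack some fields, which is a design choice for this sketch rather than documented behavior of the Actor.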