Cern Opendata Actor
Under maintenance

Pricing

from $0.05 / 1,000 results

Harvests the CERN OpenData catalog

Rating

0.0

(0)

Developer

Maksim Kudriavtsev

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: a month ago

CERN OpenData Harvester

The Actor collects the CERN OpenData catalog while working around the 10,000-result limit: it reads the public sitemaps, converts the links it finds into API calls, and, if necessary, fetches additional records by recid range. Normalized records are written to a dataset; raw payloads can optionally be stored in a key-value store.
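
The link-to-API conversion can be sketched in Python as follows. This is a minimal illustration, not the Actor's actual code: the function name `portal_to_api_url` is hypothetical, the URL shapes are taken from the `portal_url` and `api_url` fields in the output example, and the sketch assumes that only links of the form `/record/<numeric recid>` are converted.

```python
from typing import Optional
from urllib.parse import urlparse


def portal_to_api_url(portal_url: str) -> Optional[str]:
    """Map a portal record link to its API counterpart.

    Returns None for links that are not of the form /record/<numeric recid>
    (assumption: such links are not handled by this sketch).
    """
    parsed = urlparse(portal_url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2 or parts[0] != "record" or not parts[1].isdigit():
        return None
    # Same host, with the path rewritten from /record/<recid> to /api/records/<recid>.
    return f"{parsed.scheme}://{parsed.netloc}/api/records/{parts[1]}"
```

For example, `portal_to_api_url("https://opendata.cern.ch/record/12345")` returns `"https://opendata.cern.ch/api/records/12345"`.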

Input

Defined in .actor/input_schema.json. All fields are optional and have defaults:

  • sitemapIndexUrl (string, default https://opendata.cern.ch/sitemap.xml): root sitemap index.
  • enableSitemapScan (boolean, default true): enable sitemap crawling.
  • enableRecidScan (boolean, default true): enable searching by recid ranges.
  • maxWorkers (integer, default 12): number of parallel downloads of API pages found via sitemaps.
  • retries (integer, default 5): number of attempts for requests that fail with an HTTP error (including 429 and 5xx).
  • recidMax (integer, default 120000): upper recid limit for crawling.
  • recidStep (integer, default 500): recid step size for search requests.
  • pageSize (integer, default 200): search API page size.
  • skipIds (array of string): list of IDs/slugs to skip (the default includes service pages).
  • storeRaw (boolean, default false): whether to store raw API payloads in the key-value store.
  • rawKeyValuePrefix (string, default raw-cern-opendata/): key prefix in the key-value store when storeRaw=true.
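
As a rough illustration of how these options fit together, the Python sketch below overlays user input on the documented defaults and splits the recid space into windows of `recidStep`. The helper names `resolve_input` and `recid_windows` are hypothetical, and the windowing scheme (inclusive windows starting at recid 1) is an assumption; the Actor may partition the range differently. The default for `skipIds` is not listed above, so it is left out.

```python
from typing import Any, Dict, Iterator, Optional, Tuple

# Defaults copied from the option list; skipIds is omitted because its
# default value is not documented here.
DEFAULTS: Dict[str, Any] = {
    "sitemapIndexUrl": "https://opendata.cern.ch/sitemap.xml",
    "enableSitemapScan": True,
    "enableRecidScan": True,
    "maxWorkers": 12,
    "retries": 5,
    "recidMax": 120000,
    "recidStep": 500,
    "pageSize": 200,
    "storeRaw": False,
    "rawKeyValuePrefix": "raw-cern-opendata/",
}


def resolve_input(user_input: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Overlay user-supplied values on the documented defaults."""
    return {**DEFAULTS, **(user_input or {})}


def recid_windows(recid_max: int, recid_step: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) recid windows covering 1..recid_max.

    Assumption: recids start at 1 and windows do not overlap.
    """
    if recid_step <= 0:
        raise ValueError("recid_step must be positive")
    start = 1
    while start <= recid_max:
        end = min(start + recid_step - 1, recid_max)
        yield start, end
        start = end + 1
```

Under these assumptions, the default `recidMax` and `recidStep` give 240 windows, from (1, 500) to (119501, 120000).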

Output

Dataset (default)

Each record is a normalized object:

{
  "id": "12345",
  "title": "Some dataset",
  "type": "dataset",
  "experiment": "CMS",
  "availability": "open",
  "file_count": 4,
  "portal_url": "https://opendata.cern.ch/record/12345",
  "api_url": "https://opendata.cern.ch/api/records/12345",
  "created": "2020-01-01T00:00:00",
  "updated": "2020-05-01T00:00:00",
  "files": [
    { "key": "...", "uri": "...", "size": 123, "availability": "...", "checksum": "...", "version_id": "...", "tags": [...] }
  ],
  "description": "...",
  "keywords": [...],
  "collections": [...],
  "distribution": {...},
  "pids": {...},
  "publisher": "...",
  "language": "...",
  "run_period": "...",
  "source": "sitemap|recid-range",
  "bucket_url": "https://opendata.cern.ch/api/files/<bucket>"
}

The dataset's overview view shows the following fields: id, title, type, experiment, availability, file_count, portal_url, api_url, created, updated.
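
When working with exported records, the same projection can be reproduced on the client side. The Python sketch below is only an illustration: `overview_row` and `total_file_size` are hypothetical helpers, and they assume records shaped like the example above, with missing fields tolerated.

```python
from typing import Any, Dict

# Field names taken from the overview list above.
OVERVIEW_FIELDS = (
    "id", "title", "type", "experiment", "availability",
    "file_count", "portal_url", "api_url", "created", "updated",
)


def overview_row(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the overview fields; missing fields become None."""
    return {field: record.get(field) for field in OVERVIEW_FIELDS}


def total_file_size(record: Dict[str, Any]) -> int:
    """Sum the `size` of every entry in `files`, counting missing sizes as 0."""
    return sum(f.get("size") or 0 for f in record.get("files") or [])
```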