# NCI GDC Cancer Genomics Scraper (`parseforge/nci-gdc-cancer-scraper`) Actor

Scrape projects, cases, files, and annotations from the NCI Genomic Data Commons (GDC) public API. Filter by primary site or program (TCGA / CPTAC / TARGET) and get rich summary fields like case\_count, file\_count, file\_size, disease\_type and demographics. No API key required.

- **URL**: https://apify.com/parseforge/nci-gdc-cancer-scraper.md
- **Developed by:** [ParseForge](https://apify.com/parseforge) (community)
- **Categories:** Education, Developer tools, Business
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $29.62 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases as your subscription plan tier increases.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
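
As a rough worked example of the headline rate, the sketch below estimates the result-based charge for a run. It assumes the listed "from" price of $29.62 per 1,000 results applies linearly and ignores plan discounts and any other billable events, so treat the figure as an estimate only. `estimate_cost` is a hypothetical helper, not part of any Apify library.

```python
from decimal import Decimal, ROUND_HALF_UP

# Headline rate from the pricing section: $29.62 per 1,000 results.
# Assumption: the charge scales linearly with the number of results.
PRICE_PER_1000 = Decimal("29.62")


def estimate_cost(results: int) -> Decimal:
    """Estimate the result-based charge in USD, rounded to cents."""
    if results < 0:
        raise ValueError("results must be non-negative")
    cost = PRICE_PER_1000 * results / 1000
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(10))      # 0.30
print(estimate_cost(10_000))  # 296.20
```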

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

![ParseForge Banner](https://github.com/ParseForge/apify-assets/raw/main/banner.jpg)

## 🧬 NCI Genomic Data Commons (GDC) Cancer Scraper

> 🚀 **Export cancer genomics metadata in seconds.** Pull **TCGA, CPTAC, TARGET, HCMI, BEATAML, and 20+ NCI programs** across projects, cases, files, and annotations. No API key, no registration, no manual REST stitching.

> 🕒 **Last updated:** 2026-05-13 · **📊 4 entity modes** · **🏥 26 NCI programs** · **🧬 50+ primary sites** · **🌐 GDC public API**

The **NCI GDC Cancer Scraper** queries the [NCI Genomic Data Commons](https://portal.gdc.cancer.gov) public REST API and returns rich records across **four entity types**: projects, cases, files, and annotations. GDC is the National Cancer Institute's open data platform for cancer research, hosting harmonized genomic and clinical data from TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, and other landmark programs.

The catalog covers **50+ primary tumour sites** (breast, lung, brain, colon, ovary, pancreas, prostate, kidney, liver, and more) and **26 NCI cancer programs**, totalling thousands of projects, hundreds of thousands of cases, and millions of harmonized data files. This Actor exposes both site and program filters at the API level, so disease-specific or program-specific exports are fast.

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Cancer biologists, computational oncologists, clinical bioinformaticians, biostatisticians, pharma R&D teams, journalists, regulatory analysts, ML researchers | Cohort discovery, program-level metadata audits, file inventory, annotation tracking, ML training datasets, demographic surveys, cross-program comparisons |

---

### 📋 What the NCI GDC Scraper does

Four entity modes in a single Actor:

- 🏥 **Projects.** Project IDs (TCGA-BRCA, CPTAC-3, TARGET-AML, etc.), program affiliation, primary sites, disease types, dbGaP accession, releasable / released state, and a summary block with file count, case count, and total file size in bytes.
- 👤 **Cases.** Case IDs, submitter IDs, primary site, disease type, project context, demographic block, diagnoses (with stage, vital status, age at diagnosis), exposures, index date, and timestamps.
- 📁 **Files.** File IDs, file names, data category (Sequencing Reads, Transcriptome Profiling, etc.), data format (BAM, VCF, TSV, etc.), data type, experimental strategy (WGS, WXS, RNA-Seq, etc.), file size, MD5 sum, access (open / controlled), state, associated cases, and analysis workflow.
- 📝 **Annotations.** Annotation IDs, entity ID and type, submitter ID, category, classification, notes, status, project context, and timestamps.

Filter any mode by **primary site** (50+ options) or **NCI program** (26 options) to scope your export. The Actor pushes both filters to the GDC API, so filtering happens server-side.

> 💡 **Why it matters:** TCGA and CPTAC underpin most modern cancer-genomics research. Building your own GDC filter compiler and paginator means days of plumbing; this Actor returns ready-joined records on every run.

---

### 🎬 Full Demo

_🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded GDC dataset._

---

### ⚙️ Input

<table>
<thead>
<tr><th>Input</th><th>Type</th><th>Default</th><th>Behavior</th></tr>
</thead>
<tbody>
<tr><td><code>entity</code></td><td>enum</td><td><code>"projects"</code></td><td>One of <code>projects</code>, <code>cases</code>, <code>files</code>, <code>annotations</code>.</td></tr>
<tr><td><code>maxItems</code></td><td>integer</td><td><code>10</code></td><td>Records to return. Free plan caps at 10, paid plan at 1,000,000.</td></tr>
<tr><td><code>primarySite</code></td><td>string</td><td><code>""</code></td><td>One of 50+ primary tumour sites. Empty = all.</td></tr>
<tr><td><code>program</code></td><td>string</td><td><code>""</code></td><td>One of 26 NCI programs (TCGA, CPTAC, TARGET, HCMI, BEATAML1.0, MMRF, CCDI, etc.). Empty = all.</td></tr>
</tbody>
</table>
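
A minimal sketch of building and checking an input object before starting a run. The allowed `entity` values and the `maxItems` ceiling come from the table above; the lower bound of 1 comes from the OpenAPI schema in this document. `build_input` is a hypothetical helper, and the Actor may apply further validation of its own (for example, the enumerated site and program names).

```python
ENTITIES = {"projects", "cases", "files", "annotations"}
MAX_ITEMS_LIMIT = 1_000_000  # upper bound quoted in the input table


def build_input(entity="projects", max_items=10, primary_site="", program=""):
    """Return an input dict for the Actor, raising ValueError on bad values."""
    if entity not in ENTITIES:
        raise ValueError(f"entity must be one of {sorted(ENTITIES)}")
    if not 1 <= max_items <= MAX_ITEMS_LIMIT:
        raise ValueError(f"maxItems must be between 1 and {MAX_ITEMS_LIMIT}")
    return {
        "entity": entity,
        "maxItems": max_items,
        "primarySite": primary_site,
        "program": program,
    }


print(build_input(entity="cases", program="TCGA", max_items=10_000))
```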

**Example: up to 10,000 TCGA cases in a single dataset.**

```json
{
    "entity": "cases",
    "program": "TCGA",
    "maxItems": 10000
}
````

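**Example: up to 500 files for breast-cancer cases.** The input has no data-format or access filter, so narrow the results to open BAM files downstream using the `data_format` and `access` output fields.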

```json
{
    "entity": "files",
    "primarySite": "Breast",
    "maxItems": 500
}
```

> ⚠️ **Good to Know:** filter field paths differ by entity. For projects, `primary_site` is checked at the project level; for cases, at the case level; for files, at `cases.primary_site`; for annotations, at `case.primary_site`. The Actor handles this routing for you, so the same `primarySite` input works across all four modes. The same applies to `program.name` versus `project.program.name`.
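
The routing described in the note can be sketched as a lookup table plus a small filter builder. This is an illustration, not the Actor's actual source: the primary-site paths are taken from the note, while the assignment of program paths to entities (especially for files and annotations) is an assumption. The `op` / `content` structure follows the GDC API filter syntax, and `build_filter` is a hypothetical name.

```python
import json

# Primary-site field path per entity, as listed in the note above.
PRIMARY_SITE_FIELD = {
    "projects": "primary_site",
    "cases": "primary_site",
    "files": "cases.primary_site",
    "annotations": "case.primary_site",
}

# Program field path per entity. The note only names `program.name` and
# `project.program.name`; which entity uses which path is assumed here.
PROGRAM_FIELD = {
    "projects": "program.name",
    "cases": "project.program.name",
    "files": "cases.project.program.name",  # assumption
    "annotations": "project.program.name",  # assumption
}


def build_filter(entity, primary_site="", program=""):
    """Compile the two inputs into a GDC-style filter dict (None if unfiltered)."""
    clauses = []
    if primary_site:
        clauses.append({"op": "in", "content": {
            "field": PRIMARY_SITE_FIELD[entity], "value": [primary_site]}})
    if program:
        clauses.append({"op": "in", "content": {
            "field": PROGRAM_FIELD[entity], "value": [program]}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"op": "and", "content": clauses}


print(json.dumps(build_filter("files", primary_site="Breast"), indent=2))
```

Keeping the paths in plain dictionaries means a new entity or filter only needs a new table entry rather than new branching logic.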

***

### 📊 Output

Output shape varies by entity. Each record always carries a `url` to the GDC portal and a `scrapedAt` timestamp.

#### 🧾 Projects schema

| Field | Type | Example |
|---|---|---|
| 🆔 `project_id` | string | `"TCGA-LGG"` |
| 🔗 `url` | string | `"https://portal.gdc.cancer.gov/projects/TCGA-LGG"` |
| 📛 `name` | string | `"Brain Lower Grade Glioma"` |
| 📍 `primary_site` | string\[] | `["Brain"]` |
| 🦠 `disease_type` | string\[] | `["Gliomas"]` |
| 🏥 `program` | object | `{ name: "TCGA", program_id, dbgap_accession_number }` |
| 🔓 `releasable` | boolean | `true` |
| ✅ `released` | boolean | `true` |
| 📂 `state` | string | `"open"` |
| 📊 `summary` | object | `{ file_count, case_count, file_size }` |
| 🕒 `scrapedAt` | ISO 8601 | `"2026-05-13T..."` |

#### 🧾 Cases / Files / Annotations

Each mode emits its native ID, GDC portal URL, the expand block (demographic, diagnoses, exposures for cases; cases + analysis for files; project for annotations), and timestamps.
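
Because the expand blocks are nested, a flattening step is often useful before loading records into a spreadsheet or a SQL table. The sketch below is one generic way to do it. The example record is hypothetical, and its inner field names (`gender`, `age_at_diagnosis`) are assumptions about the GDC schema rather than a documented output contract.

```python
def flatten(record, parent=""):
    """Flatten nested dicts into dot-separated keys.

    Lists of scalars are joined with ';'. Lists containing dicts or lists
    are expanded with a numeric index in the key path.
    """
    flat = {}
    for key, value in record.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        elif isinstance(value, list):
            if all(not isinstance(v, (dict, list)) for v in value):
                flat[path] = ";".join(str(v) for v in value)
            else:
                for i, v in enumerate(value):
                    if isinstance(v, dict):
                        flat.update(flatten(v, f"{path}.{i}"))
                    else:
                        flat[f"{path}.{i}"] = v
        else:
            flat[path] = value
    return flat


# Hypothetical case record for illustration only.
case = {
    "case_id": "example-case",
    "primary_site": "Brain",
    "demographic": {"gender": "female"},
    "diagnoses": [{"age_at_diagnosis": 12345}],
}
print(flatten(case))
```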

#### 📦 Sample records

<details>
<summary><strong>🏥 Project: HCMI-CMDC (Human Cancer Model Initiative)</strong></summary>

```json
{
    "project_id": "HCMI-CMDC",
    "url": "https://portal.gdc.cancer.gov/projects/HCMI-CMDC",
    "name": "NCI Cancer Model Development for the Human Cancer Model Initiative",
    "primary_site": ["Ovary", "Skin", "Breast", "Brain", "Pancreas", "Lung"],
    "program": {
        "dbgap_accession_number": "phs001486",
        "program_id": "a5448c11-d46a-56aa-a5e1-5c1aa06404df",
        "name": "HCMI"
    },
    "releasable": true,
    "released": true,
    "state": "open",
    "summary": {
        "file_count": 43662,
        "case_count": 805,
        "file_size": 317056196803411
    },
    "scrapedAt": "2026-05-13T22:26:22.463Z"
}
```

</details>

<details>
<summary><strong>🧠 Project: TCGA-LGG (Brain Lower Grade Glioma)</strong></summary>

```json
{
    "project_id": "TCGA-LGG",
    "url": "https://portal.gdc.cancer.gov/projects/TCGA-LGG",
    "name": "Brain Lower Grade Glioma",
    "primary_site": ["Brain"],
    "disease_type": ["Gliomas"],
    "program": { "name": "TCGA", "dbgap_accession_number": "phs000178" },
    "state": "open"
}
```

</details>

<details>
<summary><strong>🫁 Project: CPTAC-3 (CPTAC Brain, Head/Neck, Kidney, Lung)</strong></summary>

```json
{
    "project_id": "CPTAC-3",
    "url": "https://portal.gdc.cancer.gov/projects/CPTAC-3",
    "name": "CPTAC-Brain, Head and Neck, Kidney, Lung, Pancreas, Uterus",
    "primary_site": ["Bronchus and lung", "Brain", "Kidney", "Breast", "Pancreas"],
    "program": { "name": "CPTAC" },
    "dbgap_accession_number": "phs001287",
    "summary": {
        "file_count": 100015,
        "case_count": 1683,
        "file_size": 699022036930795
    }
}
```

</details>
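
The `file_size` values in the `summary` block are raw byte counts, which are hard to read at this scale. A small sketch of turning a summary block into friendlier numbers, using decimal units (1 TB = 10^12 bytes) and the HCMI-CMDC figures from the sample above. `summarize` is a hypothetical helper and assumes non-zero file and case counts.

```python
def summarize(summary):
    """Derive readable metrics from a project `summary` block.

    Assumes file_count and case_count are greater than zero.
    """
    size_tb = summary["file_size"] / 1e12
    mean_gb = summary["file_size"] / summary["file_count"] / 1e9
    return {
        "file_size_tb": round(size_tb, 2),
        "files_per_case": round(summary["file_count"] / summary["case_count"], 1),
        "mean_file_gb": round(mean_gb, 2),
    }


# Summary block copied from the HCMI-CMDC sample record.
hcmi = {"file_count": 43662, "case_count": 805, "file_size": 317056196803411}
print(summarize(hcmi))
```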

***

### ✨ Why choose this Actor

| | Capability |
|---|---|
| 🏥 | **All four GDC entities.** Projects, cases, files, and annotations in one Actor. |
| 🎯 | **Server-side filtering.** Primary site and program filters compile to GDC filter JSON and run at the API level. |
| 🧬 | **Expand-ready.** Demographics, diagnoses, exposures, project, and analysis blocks expanded per record. |
| 🔬 | **26 NCI programs.** TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, ALCHEMIST, MATCH, ORGANOID, and more. |
| 📍 | **50+ primary sites.** GDC primary-site values from adrenal gland to vulva. |
| ⚡ | **Fast.** REST pagination with offset, page size 100-1000. |
| 🚫 | **No authentication.** Works on the public GDC API. No login or API key. |

> 📊 GDC hosts the largest harmonized cancer-genomics dataset in the world.
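
Offset pagination of the kind mentioned in the table can be sketched as a generator of offset and size pairs, suitable for the GDC API's `from` and `size` query parameters. The zero-based offset and the default page size of 1,000 are assumptions for illustration, and `page_requests` is a hypothetical name rather than the Actor's own code.

```python
def page_requests(max_items, page_size=1000):
    """Yield (offset, size) pairs that together cover `max_items` records."""
    if max_items < 0 or page_size < 1:
        raise ValueError("max_items must be >= 0 and page_size >= 1")
    offset = 0
    while offset < max_items:
        # The last page is trimmed so the total never exceeds max_items.
        size = min(page_size, max_items - offset)
        yield offset, size
        offset += size


print(list(page_requests(2500)))
# [(0, 1000), (1000, 1000), (2000, 500)]
```

A real paginator would also stop early once the API returns fewer records than requested, since the filtered result set may be smaller than `max_items`.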

***

### 📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| **⭐ NCI GDC Scraper** *(this Actor)* | $5 free credit, then pay-per-use | Full GDC catalogue | **Live per run** | site, program, entity | ⚡ 2 min |
| Hand-rolled GDC REST queries | Free | Full | Manual | Manual | 🐢 Days |
| cBioPortal | Free | Curated subset | Quarterly | Many | ⏳ Hours |
| Commercial cancer-genomics platforms | $$$/year | Proprietary | Per-release | Many | ⏳ Weeks |

Pick this Actor when you want broad cancer-genomics coverage, four ready-built entity modes, and no pipeline maintenance.

***

### 🚀 How to use

1. 📝 **Sign up.** [Create a free account with $5 credit](https://console.apify.com/sign-up?fpr=vmoqkp) (takes 2 minutes).
2. 🌐 **Open the Actor.** Go to the NCI GDC Cancer Scraper page on the Apify Store.
3. 🎯 **Set input.** Pick an entity, optionally filter by primary site or program, and set `maxItems`.
4. 🚀 **Run it.** Click **Start** and let the Actor collect your data.
5. 📥 **Download.** Grab your results in the **Dataset** tab as CSV, Excel, JSON, or XML.

> ⏱️ Total time from signup to downloaded dataset: **3-5 minutes.** No coding required.

***

### 💼 Business use cases

<table>
<tr>
<td width="50%" valign="top">

#### 💊 Pharma & Translational Research

- Cohort discovery for trial recruitment
- Tumour-type prevalence dashboards
- Biomarker prevalence by program
- Indication-specific case inventories

</td>
<td width="50%" valign="top">

#### 🧬 Computational Oncology

- TCGA / CPTAC harmonized data audits
- WGS, WXS, and RNA-Seq file inventories
- Demographic distributions per disease cohort
- Cross-program survival analyses

</td>
</tr>
<tr>
<td width="50%" valign="top">

#### 📰 Reporting & Scientometrics

- Public-data programme accounting
- Coverage maps by primary site
- Program-level open-data benchmarks
- Annotation tracking for data quality

</td>
<td width="50%" valign="top">

#### 🤖 ML & AI for Cancer Genomics

- Training-cohort assembly with site / program filters
- File-inventory feeds for download pipelines
- Demographic feature engineering
- Knowledge graphs joining cases, files, and projects

</td>
</tr>
</table>

***

### 🔌 Automating NCI GDC Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

- 🟢 **Node.js.** Install the `apify-client` NPM package.
- 🐍 **Python.** Use the `apify-client` PyPI package.
- 📚 See the [Apify API documentation](https://docs.apify.com/api/v2) for full details.

The [Apify Schedules feature](https://docs.apify.com/platform/schedules) lets you trigger this Actor on any cron interval. Weekly or monthly refreshes pick up each new GDC data release.

***

### 🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

<table>
<tr>
<td width="50%">

#### 🎓 Research and academia

- Reproducible cancer-genomics studies with versioned dataset pulls
- Teaching datasets for oncology and bioinformatics
- Open-source benchmark publications
- Cross-database joins with cBioPortal and Ensembl

</td>
<td width="50%">

#### 🎨 Personal and creative

- Indie cancer-data visualization apps
- Educational dashboards for science communication
- Public-health storytelling
- Portfolio projects on biomedical NLP

</td>
</tr>
<tr>
<td width="50%">

#### 🤝 Non-profit and civic

- Cancer-charity awareness dashboards
- Open-science cancer-equity tracking
- Public-domain references for journalism
- Civic transparency on federally funded cancer research

</td>
<td width="50%">

#### 🧪 Experimentation

- Train tumour-classification models on real cohorts
- Prototype agentic tools that resolve TCGA IDs
- Benchmark cancer-genomics libraries on real data
- Generate cohort embeddings at scale

</td>
</tr>
</table>

***

### 🤖 Ask an AI assistant about this scraper

Open a ready-to-send prompt about this ParseForge Actor in the AI of your choice:

- 💬 [**ChatGPT**](https://chat.openai.com/?q=How%20do%20I%20use%20the%20NCI%20GDC%20Cancer%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🧠 [**Claude**](https://claude.ai/new?q=How%20do%20I%20use%20the%20NCI%20GDC%20Cancer%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🔍 [**Perplexity**](https://perplexity.ai/search?q=How%20do%20I%20use%20the%20NCI%20GDC%20Cancer%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🅒 [**Copilot**](https://copilot.microsoft.com/?q=How%20do%20I%20use%20the%20NCI%20GDC%20Cancer%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)

***

### ❓ Frequently Asked Questions

#### 🧩 How does it work?

Pick an entity (projects / cases / files / annotations), optionally apply site or program filters, click Start, and the Actor hits the GDC REST API with server-side filtering and pagination. Records are emitted as clean JSON ready for download. No browser automation, no captchas, no setup.

#### 🏥 Where does the data come from?

Directly from the GDC public API at `api.gdc.cancer.gov`. The GDC is operated by the National Cancer Institute (NCI) at the U.S. National Institutes of Health.

#### 🔓 Does it return controlled-access data?

No. This Actor returns only open-access metadata. Files marked `access: "controlled"` show up in the inventory but their contents require dbGaP authorization to download.

#### 🧬 What is the difference between projects, cases, files, and annotations?

A **project** is a study (e.g. TCGA-BRCA). A **case** is a patient or participant within a project. A **file** is a single data artifact (BAM, VCF, TSV) tied to one or more cases. An **annotation** is a curated note on a case, file, or sample (e.g. data quality flags).

#### 📂 Which file types can the files mode list?

Every type GDC hosts, including aligned and unaligned sequencing reads (BAM, FASTQ), variant calls (VCF, MAF), copy-number data, gene expression matrices, methylation tables, and clinical TSVs. Use the `data_format`, `data_type`, and `experimental_strategy` fields in the output to filter downstream.
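
A minimal sketch of that downstream filtering, applied to dataset items after a files-mode run. The field names (`data_format`, `access`, `file_id`) are the ones mentioned in this README; the three records are hypothetical placeholders, and `select_files` is an illustrative helper.

```python
def select_files(items, **criteria):
    """Keep file records whose fields equal every given criterion."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in criteria.items())
    ]


# Hypothetical records for illustration; real items carry many more fields.
files = [
    {"file_id": "example-1", "data_format": "BAM", "access": "open"},
    {"file_id": "example-2", "data_format": "BAM", "access": "controlled"},
    {"file_id": "example-3", "data_format": "VCF", "access": "open"},
]

open_bams = select_files(files, data_format="BAM", access="open")
print([f["file_id"] for f in open_bams])  # ['example-1']
```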

#### 🔁 How often is the dataset refreshed?

GDC ships data releases on a roughly quarterly cadence and adds smaller updates between releases. Every run of this Actor hits the live API, so your dataset reflects the current GDC release.

#### ⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to run this Actor on any cron interval (weekly, monthly) and keep a downstream cancer-genomics database in sync.

#### ⚖️ Is this data legal to use?

GDC metadata is publicly accessible under U.S. federal open-data policies. Review the GDC data-use policies for your specific use case, especially for redistribution. Controlled-access raw data requires dbGaP authorization and is not returned by this Actor.

#### 💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and unlocks scheduling, higher concurrency, and larger datasets.

#### 🧪 What if I need download URLs for files?

The `file_id` field is the canonical GDC handle. Pass it to the GDC Data Download API at `api.gdc.cancer.gov/data/{file_id}` for open-access files, or request a companion downloader via the contact form below.
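
As a sketch of that hand-off, the helper below turns a file record into a download URL and skips anything that is not open access. The endpoint path comes from the answer above; the record shown is a hypothetical placeholder, and `download_url` is an illustrative name.

```python
from urllib.parse import quote

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def download_url(file_record):
    """Return the GDC download URL for an open-access record, else None."""
    if file_record.get("access") != "open":
        return None  # controlled files need dbGaP authorization
    return f"{GDC_DATA_ENDPOINT}/{quote(file_record['file_id'], safe='')}"


# Hypothetical record; a real `file_id` comes from the Actor's output.
record = {"file_id": "example-file-id", "access": "open"}
print(download_url(record))
```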

#### 🆘 What if I need help?

Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.

***

### 🔌 Integrate with any app

NCI GDC Scraper connects to any cloud service via [Apify integrations](https://apify.com/integrations):

- [**Make**](https://docs.apify.com/platform/integrations/make) - Automate multi-step workflows
- [**Zapier**](https://docs.apify.com/platform/integrations/zapier) - Connect with 5,000+ apps
- [**Slack**](https://docs.apify.com/platform/integrations/slack) - Get run notifications in your channels
- [**Airbyte**](https://docs.apify.com/platform/integrations/airbyte) - Pipe cancer-genomics metadata into your warehouse
- [**GitHub**](https://docs.apify.com/platform/integrations/github) - Trigger runs from commits and releases
- [**Google Drive**](https://docs.apify.com/platform/integrations/drive) - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh GDC case lists into your cohort builder, or alert your team in Slack when a new project releases.
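
One way to detect a newly released project, sketched under the assumption that you keep the projects-mode output from the previous run: compare `project_id` values between two runs and alert on the difference. `new_project_ids` is a hypothetical helper, and the sample IDs are project IDs quoted elsewhere in this README.

```python
def new_project_ids(previous, current):
    """Return project IDs present in `current` but not in `previous`, sorted."""
    seen = {project["project_id"] for project in previous}
    return sorted({project["project_id"] for project in current} - seen)


last_run = [{"project_id": "TCGA-LGG"}, {"project_id": "HCMI-CMDC"}]
this_run = last_run + [{"project_id": "CPTAC-3"}]

print(new_project_ids(last_run, this_run))  # ['CPTAC-3']
```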

***

### 🔗 Recommended Actors

- [**🤗 Hugging Face Model Scraper**](https://apify.com/parseforge/hugging-face-model-scraper) - Model metadata, downloads, and benchmarks
- [**🏥 FINRA BrokerCheck Scraper**](https://apify.com/parseforge/finra-brokercheck-scraper) - U.S. broker and firm regulatory disclosures
- [**🏨 Greatschools Scraper**](https://apify.com/parseforge/greatschools-scraper) - U.S. school ratings and demographics
- [**📈 Smart Apify Actor Scraper**](https://apify.com/parseforge/smart-apify-actor-scraper) - Apify Store Actor metadata and quality signals

> 💡 **Pro Tip:** browse the complete [ParseForge collection](https://apify.com/parseforge) for more reference-data scrapers.

***

**🆘 Need Help?** [**Open our contact form**](https://tally.so/r/BzdKgA) to request a new scraper, propose a custom data project, or report an issue.

***

> **⚠️ Disclaimer:** this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the NCI Genomic Data Commons, the National Cancer Institute, or the National Institutes of Health. All trademarks mentioned are the property of their respective owners. Only publicly available open cancer-genomics metadata is collected.

# Actor input Schema

## `entity` (type: `string`):

Which GDC public endpoint to scrape.

## `maxItems` (type: `integer`):

Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000

## `primarySite` (type: `string`):

Filter records by primary tumour / tissue site. Leave empty for all sites.

## `program` (type: `string`):

Filter by NCI cancer program. Leave empty for all programs.

## Actor input object example

```json
{
  "entity": "projects",
  "maxItems": 10,
  "primarySite": "",
  "program": ""
}
```

# Actor output Schema

## `overview` (type: `string`):

Overview of scraped data

## `fullData` (type: `string`):

Complete dataset

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "entity": "projects",
    "maxItems": 10,
    "primarySite": "",
    "program": ""
};

// Run the Actor and wait for it to finish
const run = await client.actor("parseforge/nci-gdc-cancer-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "entity": "projects",
    "maxItems": 10,
    "primarySite": "",
    "program": "",
}

# Run the Actor and wait for it to finish
run = client.actor("parseforge/nci-gdc-cancer-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
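
If installing the client library is not an option, the `run-sync-get-dataset-items` endpoint listed in the OpenAPI specification in this document can be called with the Python standard library alone. The sketch below only builds the request; sending it is left as a commented-out line because it starts a billable run. `build_request` is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/parseforge~nci-gdc-cancer-scraper/run-sync-get-dataset-items"


def build_request(token, actor_input):
    """Build (but do not send) the POST request for a synchronous run."""
    query = urllib.parse.urlencode({"token": token})
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}{ACTOR_PATH}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("<YOUR_API_TOKEN>", {"entity": "projects", "maxItems": 10})
print(request.get_method(), request.full_url)

# To execute the run and read the dataset items:
# items = json.load(urllib.request.urlopen(request))
```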

## CLI example

```bash
echo '{
  "entity": "projects",
  "maxItems": 10,
  "primarySite": "",
  "program": ""
}' |
apify call parseforge/nci-gdc-cancer-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=parseforge/nci-gdc-cancer-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "NCI GDC Cancer Genomics Scraper",
        "description": "Scrape projects, cases, files, and annotations from the NCI Genomic Data Commons (GDC) public API. Filter by primary site or program (TCGA / CPTAC / TARGET) and get rich summary fields like case_count, file_count, file_size, disease_type and demographics. No API key required.",
        "version": "0.0",
        "x-build-id": "bomT6yMWAsG8tAcdu"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/parseforge~nci-gdc-cancer-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-parseforge-nci-gdc-cancer-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/parseforge~nci-gdc-cancer-scraper/runs": {
            "post": {
                "operationId": "runs-sync-parseforge-nci-gdc-cancer-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/parseforge~nci-gdc-cancer-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-parseforge-nci-gdc-cancer-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "entity": {
                        "title": "Entity",
                        "enum": [
                            "projects",
                            "cases",
                            "files",
                            "annotations"
                        ],
                        "type": "string",
                        "description": "Which GDC public endpoint to scrape.",
                        "default": "projects"
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000"
                    },
                    "primarySite": {
                        "title": "Primary Site",
                        "enum": [
                            "",
                            "Adrenal gland",
                            "Anus and anal canal",
                            "Base of tongue",
                            "Bladder",
                            "Bones, joints and articular cartilage of limbs",
                            "Bones, joints and articular cartilage of other and unspecified sites",
                            "Brain",
                            "Breast",
                            "Bronchus and lung",
                            "Cervix uteri",
                            "Colon",
                            "Connective, subcutaneous and other soft tissues",
                            "Corpus uteri",
                            "Esophagus",
                            "Eye and adnexa",
                            "Gallbladder",
                            "Heart, mediastinum, and pleura",
                            "Hematopoietic and reticuloendothelial systems",
                            "Kidney",
                            "Larynx",
                            "Lip",
                            "Liver and intrahepatic bile ducts",
                            "Lymph nodes",
                            "Meninges",
                            "Nasal cavity and middle ear",
                            "Nasopharynx",
                            "Oropharynx",
                            "Other and ill-defined sites",
                            "Other and unspecified parts of mouth",
                            "Other and unspecified parts of tongue",
                            "Ovary",
                            "Pancreas",
                            "Penis",
                            "Peripheral nerves and autonomic nervous system",
                            "Prostate gland",
                            "Rectum",
                            "Rectosigmoid junction",
                            "Renal pelvis",
                            "Retroperitoneum and peritoneum",
                            "Skin",
                            "Small intestine",
                            "Spinal cord, cranial nerves, and other parts of central nervous system",
                            "Stomach",
                            "Testis",
                            "Thymus",
                            "Thyroid gland",
                            "Tonsil",
                            "Trachea",
                            "Ureter",
                            "Uterus, NOS",
                            "Vagina",
                            "Vulva"
                        ],
                        "type": "string",
                        "description": "Filter records by primary tumour / tissue site. Leave empty for all sites.",
                        "default": ""
                    },
                    "program": {
                        "title": "Program",
                        "enum": [
                            "",
                            "ALCHEMIST",
                            "APOLLO",
                            "BEATAML1.0",
                            "CCDI",
                            "CCG",
                            "CDDP_EAGLE",
                            "CGCI",
                            "CMI",
                            "CPTAC",
                            "CTSP",
                            "EXCEPTIONAL_RESPONDERS",
                            "FM",
                            "HCMI",
                            "MATCH",
                            "MMRF",
                            "MP2PRT",
                            "NCICCR",
                            "OHSU",
                            "ORGANOID",
                            "RC",
                            "REBC",
                            "TARGET",
                            "TCGA",
                            "TRIO",
                            "VAREPOP",
                            "WCDT"
                        ],
                        "type": "string",
                        "description": "Filter by NCI cancer program. Leave empty for all programs.",
                        "default": ""
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
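
The `primarySite` and `program` inputs in the schema above only accept the listed `enum` values, and both default to an empty string (no filter). As a minimal sketch, a run input could be checked locally against those lists before a run is started. The `INPUT_PROPERTIES` dictionary is an abbreviated copy of the schema (the real enum lists are longer), and `build_run_input` is a hypothetical helper, not part of the Actor or of any Apify library.

```python
from typing import Any, Dict

# Abbreviated, hand-copied fragment of the input schema shown above.
# Assumption: only a few of the real enum values are listed here for brevity.
INPUT_PROPERTIES: Dict[str, Dict[str, Any]] = {
    "primarySite": {
        "type": "string",
        "enum": ["", "Brain", "Breast", "Bronchus and lung", "Kidney"],
        "default": "",
    },
    "program": {
        "type": "string",
        "enum": ["", "CPTAC", "TARGET", "TCGA"],
        "default": "",
    },
}


def build_run_input(properties: Dict[str, Dict[str, Any]], **overrides: str) -> Dict[str, Any]:
    """Start from the schema defaults, apply overrides, and reject non-enum values.

    Hypothetical helper for illustration only.
    """
    unknown = set(overrides) - set(properties)
    if unknown:
        raise ValueError(f"Unknown input field(s): {sorted(unknown)}")

    run_input: Dict[str, Any] = {}
    for name, spec in properties.items():
        value = overrides.get(name, spec.get("default"))
        allowed = spec.get("enum")
        # Enum matching is exact and case-sensitive, as in JSON Schema.
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of the allowed values")
        run_input[name] = value
    return run_input


if __name__ == "__main__":
    # An empty string means "no filter", matching the schema defaults.
    print(build_run_input(INPUT_PROPERTIES, primarySite="Breast", program="TCGA"))
```

The resulting dictionary has the shape the schema describes for these two fields, so it could be passed as the run input through whichever Apify client the project already uses. Checking locally is optional and only catches typos such as `"tcga"` earlier.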

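The `runsResponseSchema` above nests all run details under a top-level `data` object, including `status`, `defaultDatasetId`, `startedAt`, and `usageTotalUsd`. The following sketch shows one way to pull those fields out of an already-decoded response. `RunSummary` and `summarize_run` are hypothetical names, and the sample payload reuses the schema's example values with made-up placeholder IDs; it is not real run output.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class RunSummary:
    """Hypothetical container for a few fields of the run response."""
    run_id: str
    status: str
    dataset_id: Optional[str]
    started_at: Optional[datetime]
    usage_total_usd: Optional[float]


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a date-time string such as 2025-01-08T00:00:00.000Z."""
    if not value:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_run(response: Dict[str, Any]) -> RunSummary:
    """Read selected fields from the `data` object of a run response."""
    data = response["data"]
    return RunSummary(
        run_id=data["id"],
        status=data["status"],
        dataset_id=data.get("defaultDatasetId"),
        started_at=parse_timestamp(data.get("startedAt")),
        usage_total_usd=data.get("usageTotalUsd"),
    )


# Illustrative payload: example values come from the schema, IDs are placeholders.
SAMPLE_RESPONSE: Dict[str, Any] = {
    "data": {
        "id": "RUN_ID_PLACEHOLDER",
        "status": "READY",
        "startedAt": "2025-01-08T00:00:00.000Z",
        "defaultDatasetId": "DATASET_ID_PLACEHOLDER",
        "usageTotalUsd": 0.00005,
    }
}

if __name__ == "__main__":
    print(summarize_run(SAMPLE_RESPONSE))
```

The `defaultDatasetId` value is the field to keep when the scraped records need to be fetched afterwards, since the schema lists it as the run's default dataset.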