NCI GDC Cancer Genomics Scraper
Pricing
from $29.62 / 1,000 results
Scrape projects, cases, files, and annotations from the NCI Genomic Data Commons (GDC) public API. Filter by primary site or program (TCGA / CPTAC / TARGET) and get rich summary fields like case_count, file_count, file_size, disease_type and demographics. No API key required.
Developer
ParseForge
Maintained by Community

🧬 NCI Genomic Data Commons (GDC) Cancer Scraper
🚀 Export cancer genomics metadata in seconds. Pull TCGA, CPTAC, TARGET, HCMI, BEATAML, and 20+ NCI programs across projects, cases, files, and annotations. No API key, no registration, no manual REST stitching.
🕒 Last updated: 2026-05-13 · 📊 4 entity modes · 🏥 26 NCI programs · 🧬 50+ primary sites · 🌐 GDC public API
The NCI GDC Cancer Scraper queries the NCI Genomic Data Commons public REST API and returns rich records across four entity types: projects, cases, files, and annotations. GDC is the National Cancer Institute's open data platform for cancer research, hosting harmonized genomic and clinical data from TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, and other landmark programs (26 in total).
The catalog covers 50+ primary tumour sites (breast, lung, brain, colon, ovary, pancreas, prostate, kidney, liver, and more) and 26 NCI cancer programs, totalling thousands of projects, hundreds of thousands of cases, and millions of harmonized data files. This Actor exposes both site and program filters at the API level, so disease-specific or program-specific exports are fast.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Cancer biologists, computational oncologists, clinical bioinformaticians, biostatisticians, pharma R&D teams, journalists, regulatory analysts, ML researchers | Cohort discovery, program-level metadata audits, file inventory, annotation tracking, ML training datasets, demographic surveys, cross-program comparisons |
📋 What the NCI GDC Scraper does
Four entity modes in a single Actor:
- 🏥 Projects. Project IDs (TCGA-BRCA, CPTAC-3, TARGET-AML, etc.), program affiliation, primary sites, disease types, dbGaP accession, releasable / released state, and a summary block with file count, case count, and total file size in bytes.
- 👤 Cases. Case IDs, submitter IDs, primary site, disease type, project context, demographic block, diagnoses (with stage, vital status, age at diagnosis), exposures, index date, and timestamps.
- 📁 Files. File IDs, file names, data category (Sequencing Reads, Transcriptome Profiling, etc.), data format (BAM, VCF, TSV, etc.), data type, experimental strategy (WGS, WXS, RNA-Seq, etc.), file size, MD5 sum, access (open / controlled), state, associated cases, and analysis workflow.
- 📝 Annotations. Annotation IDs, entity ID and type, submitter ID, category, classification, notes, status, project context, and timestamps.
Filter any mode by primary site (50 options) or NCI program (26 options) to scope your export. Filters are pushed to the GDC API server-side.
💡 Why it matters: TCGA and CPTAC underpin most modern cancer-genomics research. Building your own GDC filter compiler and paginator means days of plumbing; this Actor returns ready-joined records on every run.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded GDC dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `entity` | enum | `"projects"` | One of `projects`, `cases`, `files`, `annotations`. |
| `maxItems` | integer | `10` | Records to return. The free plan caps at 10, the paid plan at 1,000,000. |
| `primarySite` | string | `""` | One of 50 primary tumour sites. Empty = all. |
| `program` | string | `""` | One of 26 NCI programs (TCGA, CPTAC, TARGET, HCMI, BEATAML1.0, MMRF, CCDI, etc.). Empty = all. |
Example: up to 10,000 TCGA cases in a single dataset.
{"entity": "cases","program": "TCGA","maxItems": 10000}
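Example: up to 500 files linked to breast-cancer cases. The input has no access or format filter, so narrow the result to open BAM files downstream using the `access` and `data_format` output fields.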
{"entity": "files","primarySite": "Breast","maxItems": 500}
⚠️ Good to Know: filter field paths differ by entity. For projects, `primary_site` is checked at the project level; for cases, at the case level; for files, at `cases.primary_site`; for annotations, at `case.primary_site`. The Actor handles this routing for you, so the same `primarySite` input works across all four modes. The same applies to `program.name` versus `project.program.name`.
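The routing described above can be sketched as a small filter compiler. This is an illustration, not the Actor's actual code: `build_gdc_filter` is a hypothetical helper, the primary-site paths are the ones listed in the note, and the assignment of program paths to entities (in particular for files and annotations) is an assumption. The `op` / `content` / `field` / `value` layout follows the GDC API's documented filter JSON.

```python
# Primary-site field paths per entity, as listed in the note above.
PRIMARY_SITE_FIELD = {
    "projects": "primary_site",
    "cases": "primary_site",
    "files": "cases.primary_site",
    "annotations": "case.primary_site",
}

# Program field paths per entity. ASSUMPTION: the README only names
# `program.name` and `project.program.name`; the mapping below, and the
# paths for files and annotations, are guesses for illustration.
PROGRAM_FIELD = {
    "projects": "program.name",
    "cases": "project.program.name",
    "files": "cases.project.program.name",
    "annotations": "project.program.name",
}


def build_gdc_filter(entity, primary_site="", program=""):
    """Compile Actor-style inputs into a GDC filter dict (None if unfiltered)."""
    if entity not in PRIMARY_SITE_FIELD:
        raise ValueError(f"unknown entity: {entity!r}")
    clauses = []
    if primary_site:
        clauses.append({
            "op": "in",
            "content": {"field": PRIMARY_SITE_FIELD[entity], "value": [primary_site]},
        })
    if program:
        clauses.append({
            "op": "in",
            "content": {"field": PROGRAM_FIELD[entity], "value": [program]},
        })
    if not clauses:
        return None          # empty inputs mean "all records"
    if len(clauses) == 1:
        return clauses[0]    # a single clause needs no wrapper
    return {"op": "and", "content": clauses}
```

Calling `build_gdc_filter("files", primary_site="Breast")` yields a single `in` clause on `cases.primary_site`; supplying both inputs wraps the two clauses in an `and` node.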
📊 Output
Output shape varies by entity. Each record always carries a url to the GDC portal and a scrapedAt timestamp.
🧾 Projects schema
| Field | Type | Example |
|---|---|---|
| 🆔 `project_id` | string | `"TCGA-LGG"` |
| 🔗 `url` | string | `"https://portal.gdc.cancer.gov/projects/TCGA-LGG"` |
| 📛 `name` | string | `"Brain Lower Grade Glioma"` |
| 📍 `primary_site` | string[] | `["Brain"]` |
| 🦠 `disease_type` | string[] | `["Gliomas"]` |
| 🏥 `program` | object | `{ name: "TCGA", program_id, dbgap_accession_number }` |
| 🔓 `releasable` | boolean | `true` |
| ✅ `released` | boolean | `true` |
| 📂 `state` | string | `"open"` |
| 📊 `summary` | object | `{ file_count, case_count, file_size }` |
| 🕒 `scrapedAt` | ISO 8601 | `"2026-05-13T..."` |
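As a rough sketch of how the `summary` block in the schema above might be used downstream, the helpers below total case counts, file counts, and bytes across exported project records. The function names are hypothetical, and the code assumes only the field names shown in the table; records with a missing or empty `summary` are counted as zero.

```python
def summarize_projects(records):
    """Add up the summary block across a list of project records."""
    totals = {"projects": 0, "case_count": 0, "file_count": 0, "file_size": 0}
    for record in records:
        summary = record.get("summary") or {}   # tolerate a missing block
        totals["projects"] += 1
        for key in ("case_count", "file_count", "file_size"):
            totals[key] += int(summary.get(key) or 0)
    return totals


def human_size(num_bytes):
    """Render a byte count (file_size is in bytes) with binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024
```

For example, `human_size(summarize_projects(records)["file_size"])` gives a readable total for a whole export, where `records` is the list of project items loaded from the dataset.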
🧾 Cases / Files / Annotations
Each mode emits its native ID, GDC portal URL, the expand block (demographic, diagnoses, exposures for cases; cases + analysis for files; project for annotations), and timestamps.
📦 Sample records
✨ Why choose this Actor
| Capability | |
|---|---|
| 🏥 | All four GDC entities. Projects, cases, files, and annotations in one Actor. |
| 🎯 | Server-side filtering. Primary site and program filters compile to GDC filter JSON and run at the API level. |
| 🧬 | Expand-ready. Demographics, diagnoses, exposures, project, and analysis blocks expanded per record. |
| 🔬 | 26 NCI programs. TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, ALCHEMIST, MATCH, ORGANOID, and more. |
| 📍 | 50 primary sites. Every ICD-O site from adrenal gland to vulva. |
| ⚡ | Fast. REST pagination with offset, page size 100-1000. |
| 🚫 | No authentication. Works on the public GDC API. No login or API key. |
📊 GDC hosts one of the largest harmonized cancer-genomics datasets in the world.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ NCI GDC Scraper (this Actor) | $5 free credit, then pay-per-use | Full GDC catalogue | Live per run | site, program, entity | ⚡ 2 min |
| Hand-rolled GDC REST queries | Free | Full | Manual | Manual | 🐢 Days |
| cBioPortal | Free | Curated subset | Quarterly | Many | ⏳ Hours |
| Commercial cancer-genomics platforms | $$$/year | Proprietary | Per-release | Many | ⏳ Weeks |
Pick this Actor when you want broad cancer-genomics coverage, four ready-built entity modes, and no pipeline maintenance.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the NCI GDC Cancer Scraper page on the Apify Store.
- 🎯 Set input. Pick an entity, optionally filter by primary site or program, and set `maxItems`.
- 🚀 Run it. Click Start and let the Actor collect your data.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating NCI GDC Scraper
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. Weekly or monthly refreshes catch every GDC data release.
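A minimal sketch of programmatic control with the Python `apify-client` package might look like the following. The helper names, the `ACTOR_ID` placeholder, and the `APIFY_TOKEN` environment variable are assumptions for illustration; look up the real Actor ID on the Store page. The input keys mirror the Input table, and the sketch assumes a client version in which `call()` returns the run as a dict.

```python
import os

ENTITIES = ("projects", "cases", "files", "annotations")
ACTOR_ID = "<username>/<actor-name>"  # placeholder, not the real Actor ID


def build_run_input(entity="projects", max_items=10, primary_site="", program=""):
    """Validate and assemble the Actor input described in the Input table."""
    if entity not in ENTITIES:
        raise ValueError(f"entity must be one of {ENTITIES}")
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    return {
        "entity": entity,
        "maxItems": max_items,
        "primarySite": primary_site,
        "program": program,
    }


def run_actor(token, actor_id, run_input):
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily so the input helper works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    items = run_actor(
        os.environ["APIFY_TOKEN"],  # assumed environment variable name
        ACTOR_ID,
        build_run_input(entity="cases", program="TCGA", max_items=100),
    )
    print(f"fetched {len(items)} records")
```

Keeping input validation separate from the network call makes it easy to reuse the same input builder from a scheduled job or a test.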
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
❓ Frequently Asked Questions
🧩 How does it work?
Pick an entity (projects / cases / files / annotations), optionally apply site or program filters, click Start, and the Actor hits the GDC REST API with server-side filtering and pagination. Records are emitted as clean JSON ready for download. No browser automation, no captchas, no setup.
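For readers curious what offset pagination against the GDC REST API involves, here is a hedged sketch using only the standard library. It is not the Actor's source: `fetch_page` and `iter_hits` are hypothetical names. The `from`, `size`, and `filters` query parameters and the `data.hits` / `data.pagination.total` response fields follow the public GDC API documentation.

```python
import json
import urllib.parse
import urllib.request

GDC_API = "https://api.gdc.cancer.gov"


def fetch_page(entity, offset, size, filters=None):
    """Fetch one page from /projects, /cases, /files, or /annotations."""
    params = {"from": offset, "size": size, "format": "json"}
    if filters is not None:
        params["filters"] = json.dumps(filters)  # GDC expects a JSON string
    url = f"{GDC_API}/{entity}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)["data"]


def iter_hits(entity, max_items, page_size=100, filters=None, fetch=fetch_page):
    """Yield up to max_items records, advancing the offset page by page."""
    offset = 0
    emitted = 0
    while emitted < max_items:
        size = min(page_size, max_items - emitted)
        data = fetch(entity, offset, size, filters)
        hits = data.get("hits", [])
        if not hits:
            return                      # nothing left on the server
        for hit in hits:
            if emitted >= max_items:
                return
            yield hit
            emitted += 1
        offset += len(hits)
        total = data.get("pagination", {}).get("total")
        if total is not None and offset >= total:
            return                      # reached the last page
```

The `fetch` argument is injectable so the loop can be exercised with a stub instead of live network calls.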
🏥 Where does the data come from?
Directly from the GDC public API at api.gdc.cancer.gov. The GDC is operated by the National Cancer Institute (NCI) at the U.S. National Institutes of Health.
🔓 Does it return controlled-access data?
No. This Actor returns only open-access metadata. Files marked access: "controlled" show up in the inventory but their contents require dbGaP authorization to download.
🧬 What is the difference between projects, cases, files, and annotations?
A project is a study (e.g. TCGA-BRCA). A case is a patient or participant within a project. A file is a single data artifact (BAM, VCF, TSV) tied to one or more cases. An annotation is a curated note on a case, file, or sample (e.g. data quality flags).
📂 Which file types can the files mode list?
Every type GDC hosts, including aligned and unaligned sequencing reads (BAM, FASTQ), variant calls (VCF, MAF), copy-number data, gene expression matrices, methylation tables, and clinical TSVs. Use the data_format, data_type, and experimental_strategy fields in the output to filter downstream.
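Downstream filtering on those fields could look roughly like this. `select_files` is a hypothetical helper and the matching is case-insensitive by choice; only the field names (`access`, `data_format`, `experimental_strategy`) come from the output described above.

```python
def select_files(records, access=None, data_format=None, experimental_strategy=None):
    """Keep file records whose fields match every criterion that is set."""
    wanted = {
        "access": access,
        "data_format": data_format,
        "experimental_strategy": experimental_strategy,
    }

    def matches(record):
        for field, value in wanted.items():
            if value is None:
                continue                      # criterion not requested
            if str(record.get(field, "")).lower() != value.lower():
                return False
        return True

    return [record for record in records if matches(record)]
```

For instance, `select_files(records, access="open", data_format="BAM")` narrows a files export to open-access BAM entries.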
🔁 How often is the dataset refreshed?
GDC ships data releases on a roughly quarterly cadence and adds smaller updates between releases. Every run of this Actor hits the live API, so your dataset reflects the current GDC release.
⏰ Can I schedule regular runs?
Yes. Use Apify Schedules to run this Actor on any cron interval (weekly, monthly) and keep a downstream cancer-genomics database in sync.
⚖️ Is this data legal to use?
GDC metadata is publicly accessible under U.S. federal open-data policies. Review the GDC data-use policies for your specific use case, especially for redistribution. Controlled-access raw data requires dbGaP authorization and is not returned by this Actor.
💳 Do I need a paid Apify plan to use this Actor?
No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and unlocks scheduling, higher concurrency, and larger datasets.
🧪 What if I need download URLs for files?
The file_id field is the canonical GDC handle. Pass it to the GDC Data Download API at api.gdc.cancer.gov/data/{file_id} for open-access files, or request a companion downloader via the contact form below.
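A small sketch of turning a `file_id` into a download, assuming an open-access file and the endpoint named above. The helper names are hypothetical, and controlled-access files would additionally need an authorization token, which this sketch does not handle.

```python
import shutil
import urllib.request
from urllib.parse import quote

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def data_url(file_id):
    """Build the GDC download URL for a single file_id."""
    return f"{GDC_DATA_ENDPOINT}/{quote(file_id, safe='')}"


def download_open_file(file_id, dest_path):
    """Stream an open-access file to disk and return the destination path."""
    with urllib.request.urlopen(data_url(file_id), timeout=120) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)   # copies in chunks, not all at once
    return dest_path
```

Streaming with `shutil.copyfileobj` avoids loading large genomic files into memory.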
🆘 What if I need help?
Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.
🔌 Integrate with any app
NCI GDC Scraper connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe cancer-genomics metadata into your warehouse
- GitHub - Trigger runs from commits and releases
- Google Drive - Export datasets straight to Sheets
You can also use webhooks to trigger downstream actions when a run finishes. Push fresh GDC case lists into your cohort builder, or alert your team in Slack when a new project releases.
🔗 Recommended Actors
- 🤗 Hugging Face Model Scraper - Model metadata, downloads, and benchmarks
- 🏥 FINRA BrokerCheck Scraper - U.S. broker and firm regulatory disclosures
- 🏨 Greatschools Scraper - U.S. school ratings and demographics
- 📈 Smart Apify Actor Scraper - Apify Store actor metadata and quality signals
💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the NCI Genomic Data Commons, the National Cancer Institute, or the National Institutes of Health. All trademarks mentioned are the property of their respective owners. Only publicly available open cancer-genomics metadata is collected.
