NCI GDC Cancer Genomics Scraper

Pricing

from $29.62 / 1,000 results

Scrape projects, cases, files, and annotations from the NCI Genomic Data Commons (GDC) public API. Filter by primary site or program (TCGA / CPTAC / TARGET) and get rich summary fields like case_count, file_count, file_size, disease_type and demographics. No API key required.

Developer: ParseForge (maintained by Community)

🧬 NCI Genomic Data Commons (GDC) Cancer Scraper

🚀 Export cancer genomics metadata in seconds. Pull projects, cases, files, and annotations from TCGA, CPTAC, TARGET, HCMI, BEATAML, and 20+ other NCI programs. No API key, no registration, no manual REST stitching.

🕒 Last updated: 2026-05-13 · 📊 4 entity modes · 🏥 26 NCI programs · 🧬 50+ primary sites · 🌐 GDC public API

The NCI GDC Cancer Scraper queries the NCI Genomic Data Commons public REST API and returns rich records across four entity types: projects, cases, files, and annotations. GDC is the National Cancer Institute's open data platform for cancer research, hosting harmonized genomic and clinical data from TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, and other landmark programs.

The catalog covers 50+ primary tumour sites (breast, lung, brain, colon, ovary, pancreas, prostate, kidney, liver, and more) and 26 NCI cancer programs, totalling thousands of projects, hundreds of thousands of cases, and millions of harmonized data files. This Actor exposes both site and program filters at the API level, so disease-specific or program-specific exports are fast.

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| Cancer biologists, computational oncologists, clinical bioinformaticians, biostatisticians, pharma R&D teams, journalists, regulatory analysts, ML researchers | Cohort discovery, program-level metadata audits, file inventory, annotation tracking, ML training datasets, demographic surveys, cross-program comparisons |

📋 What the NCI GDC Scraper does

Four entity modes in a single Actor:

  • 🏥 Projects. Project IDs (TCGA-BRCA, CPTAC-3, TARGET-AML, etc.), program affiliation, primary sites, disease types, dbGaP accession, releasable / released state, and a summary block with file count, case count, and total file size in bytes.
  • 👤 Cases. Case IDs, submitter IDs, primary site, disease type, project context, demographic block, diagnoses (with stage, vital status, age at diagnosis), exposures, index date, and timestamps.
  • 📁 Files. File IDs, file names, data category (Sequencing Reads, Transcriptome Profiling, etc.), data format (BAM, VCF, TSV, etc.), data type, experimental strategy (WGS, WXS, RNA-Seq, etc.), file size, MD5 sum, access (open / controlled), state, associated cases, and analysis workflow.
  • 📝 Annotations. Annotation IDs, entity ID and type, submitter ID, category, classification, notes, status, project context, and timestamps.

Filter any mode by primary site (50 options) or NCI program (26 options) to scope your export. Filters are pushed to the GDC API server-side.

💡 Why it matters: TCGA and CPTAC underpin most modern cancer-genomics research. Building your own GDC filter compiler and paginator means days of plumbing; this Actor returns ready-joined records on every run.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded GDC dataset.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| entity | enum | "projects" | One of projects, cases, files, annotations. |
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| primarySite | string | "" | One of 50 primary tumour sites. Empty = all. |
| program | string | "" | One of 26 NCI programs (TCGA, CPTAC, TARGET, HCMI, BEATAML1.0, MMRF, CCDI, etc.). Empty = all. |

Example: TCGA cases in a single dataset, capped at 10,000 records. Raise maxItems if the program holds more cases than that.

{
  "entity": "cases",
  "program": "TCGA",
  "maxItems": 10000
}

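Example: up to 500 file records for breast-cancer cases. The input has no access or data-format filter, so pick out open BAM files downstream using the access and data_format output fields.

{
  "entity": "files",
  "primarySite": "Breast",
  "maxItems": 500
}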

⚠️ Good to Know: filter field paths differ by entity. For projects, primary_site is checked at the project level; for cases, at the case level; for files, at cases.primary_site; for annotations, at case.primary_site. The Actor handles this routing for you, so the same primarySite input works across all four modes. The same applies to program.name versus project.program.name.
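The routing described above can be sketched in a few lines. This is a minimal illustration, not the Actor's actual code: the function name build_gdc_filter is hypothetical, the primary-site paths are taken from the note above, and the program paths for files and annotations are assumptions. The output uses the GDC API's nested op / content filter format.

```python
from typing import Optional

# Primary-site field path per entity, as described in the note above.
PRIMARY_SITE_FIELD = {
    "projects": "primary_site",
    "cases": "primary_site",
    "files": "cases.primary_site",
    "annotations": "case.primary_site",
}

# Program field path per entity. The files and annotations paths are
# assumptions made for this sketch.
PROGRAM_FIELD = {
    "projects": "program.name",
    "cases": "project.program.name",
    "files": "cases.project.program.name",
    "annotations": "project.program.name",
}


def build_gdc_filter(entity: str, primary_site: str = "", program: str = "") -> Optional[dict]:
    """Compile the two Actor-style inputs into a GDC filter object (hypothetical helper)."""
    if entity not in PRIMARY_SITE_FIELD:
        raise ValueError(f"unknown entity: {entity!r}")

    clauses = []
    if primary_site:
        clauses.append({
            "op": "in",
            "content": {"field": PRIMARY_SITE_FIELD[entity], "value": [primary_site]},
        })
    if program:
        clauses.append({
            "op": "in",
            "content": {"field": PROGRAM_FIELD[entity], "value": [program]},
        })

    if not clauses:
        return None          # empty inputs mean "no filter"
    if len(clauses) == 1:
        return clauses[0]    # a single clause needs no "and" wrapper
    return {"op": "and", "content": clauses}
```

Calling build_gdc_filter("files", primary_site="Breast") yields a single "in" clause on cases.primary_site, while supplying both inputs wraps the two clauses in an "and" node.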


📊 Output

Output shape varies by entity. Each record always carries a url to the GDC portal and a scrapedAt timestamp.

🧾 Projects schema

| Field | Type | Example |
| --- | --- | --- |
| 🆔 project_id | string | "TCGA-LGG" |
| 🔗 url | string | "https://portal.gdc.cancer.gov/projects/TCGA-LGG" |
| 📛 name | string | "Brain Lower Grade Glioma" |
| 📍 primary_site | string[] | ["Brain"] |
| 🦠 disease_type | string[] | ["Gliomas"] |
| 🏥 program | object | { name: "TCGA", program_id, dbgap_accession_number } |
| 🔓 releasable | boolean | true |
| released | boolean | true |
| 📂 state | string | "open" |
| 📊 summary | object | { file_count, case_count, file_size } |
| 🕒 scrapedAt | ISO 8601 | "2026-05-13T..." |
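As an illustration of how the projects schema might be consumed, the sketch below rolls project records up by program using the program.name and summary fields listed above. The helper name summarize_by_program is hypothetical, and any records you feed it in a quick test would be made-up values, not real GDC figures.

```python
from collections import defaultdict


def summarize_by_program(projects):
    """Aggregate project records (shaped like the schema above) per program name."""
    totals = defaultdict(lambda: {"projects": 0, "cases": 0, "files": 0, "bytes": 0})
    for record in projects:
        # Tolerate missing blocks so one sparse record does not abort the run.
        program = (record.get("program") or {}).get("name", "unknown")
        summary = record.get("summary") or {}
        bucket = totals[program]
        bucket["projects"] += 1
        bucket["cases"] += summary.get("case_count", 0)
        bucket["files"] += summary.get("file_count", 0)
        bucket["bytes"] += summary.get("file_size", 0)  # file_size is in bytes
    return dict(totals)
```

Keeping file_size in bytes until the final reporting step avoids rounding drift when many projects are summed.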

🧾 Cases / Files / Annotations

Each mode emits its native ID, GDC portal URL, the expand block (demographic, diagnoses, exposures for cases; cases + analysis for files; project for annotations), and timestamps.


✨ Why choose this Actor

  • 🏥 All four GDC entities. Projects, cases, files, and annotations in one Actor.
  • 🎯 Server-side filtering. Primary site and program filters compile to GDC filter JSON and run at the API level.
  • 🧬 Expand-ready. Demographics, diagnoses, exposures, project, and analysis blocks expanded per record.
  • 🔬 26 NCI programs. TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, ALCHEMIST, MATCH, ORGANOID, and more.
  • 📍 50 primary sites. Every ICD-O site from adrenal gland to vulva.
  • Fast. REST pagination with offset, page size 100-1000.
  • 🚫 No authentication. Works on the public GDC API. No login or API key.

📊 GDC hosts one of the largest harmonized cancer-genomics datasets in the world.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ NCI GDC Scraper (this Actor) | $5 free credit, then pay-per-use | Full GDC catalogue | Live per run | site, program, entity | ⚡ 2 min |
| Hand-rolled GDC REST queries | Free | Full | Manual | Manual | 🐢 Days |
| cBioPortal | Free | Curated subset | Quarterly | Many | ⏳ Hours |
| Commercial cancer-genomics platforms | $$$/year | Proprietary | Per-release | Many | ⏳ Weeks |

Pick this Actor when you want broad cancer-genomics coverage, four ready-built entity modes, and no pipeline maintenance.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the NCI GDC Cancer Scraper page on the Apify Store.
  3. 🎯 Set input. Pick an entity, optionally filter by primary site or program, and set maxItems.
  4. 🚀 Run it. Click Start and let the Actor collect your data.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

💊 Pharma & Translational Research

  • Cohort discovery for trial recruitment
  • Tumour-type prevalence dashboards
  • Biomarker prevalence by program
  • Indication-specific case inventories

🧬 Computational Oncology

  • TCGA / CPTAC harmonized data audits
  • WGS, WXS, and RNA-Seq file inventories
  • Demographic distributions per disease cohort
  • Cross-program survival analyses

📰 Reporting & Scientometrics

  • Public-data programme accounting
  • Coverage maps by primary site
  • Program-level open-data benchmarks
  • Annotation tracking for data quality

🤖 ML & AI for Cancer Genomics

  • Training-cohort assembly with site / program filters
  • File-inventory feeds for download pipelines
  • Demographic feature engineering
  • Knowledge graphs joining cases, files, and projects

🔌 Automating NCI GDC Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. A weekly or monthly refresh is enough to pick up each GDC data release.
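A programmatic run with the Python client might look like the sketch below. Treat it as a starting point under stated assumptions: ACTOR_ID is a placeholder to replace with the ID shown on the Actor's Store page, build_run_input and run_actor are hypothetical helper names, and the call / iterate_items usage follows the apify-client package as commonly documented, so check it against the client version you install.

```python
ACTOR_ID = "<username>/<actor-name>"  # placeholder: copy the real ID from the Store page

ALLOWED_ENTITIES = {"projects", "cases", "files", "annotations"}


def build_run_input(entity="projects", max_items=10, primary_site="", program=""):
    """Build the Actor input object from the four documented inputs."""
    if entity not in ALLOWED_ENTITIES:
        raise ValueError(f"entity must be one of {sorted(ALLOWED_ENTITIES)}")
    return {
        "entity": entity,
        "maxItems": max_items,
        "primarySite": primary_site,
        "program": program,
    }


def run_actor(run_input, token):
    """Start the Actor, wait for it to finish, and return the dataset items."""
    # Imported lazily so the input builder works without the client installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not start")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage, assuming an APIFY_TOKEN environment variable holds your API token:
#   import os
#   items = run_actor(build_run_input("cases", 100, program="TCGA"), os.environ["APIFY_TOKEN"])
```

Validating the input locally before starting a run keeps a typo in the entity name from costing a paid run.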


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Reproducible cancer-genomics studies with versioned dataset pulls
  • Teaching datasets for oncology and bioinformatics
  • Open-source benchmark publications
  • Cross-database joins with cBioPortal and Ensembl

🎨 Personal and creative

  • Indie cancer-data visualization apps
  • Educational dashboards for science communication
  • Public-health storytelling
  • Portfolio projects on biomedical NLP

🤝 Non-profit and civic

  • Cancer-charity awareness dashboards
  • Open-science cancer-equity tracking
  • Public-domain references for journalism
  • Civic transparency on federally funded cancer research

🧪 Experimentation

  • Train tumour-classification models on real cohorts
  • Prototype agentic tools that resolve TCGA IDs
  • Benchmark cancer-genomics libraries on real data
  • Generate cohort embeddings at scale


❓ Frequently Asked Questions

🧩 How does it work?

Pick an entity (projects / cases / files / annotations), optionally apply site or program filters, click Start, and the Actor hits the GDC REST API with server-side filtering and pagination. Records are emitted as clean JSON ready for download. No browser automation, no captchas, no setup.
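For readers curious what offset pagination against the GDC API involves, here is a minimal sketch. It is not the Actor's source code. It assumes the public GDC endpoints accept from, size, and filters query parameters and return hits plus a pagination.total count under data; the names fetch_page and iter_hits are hypothetical.

```python
import json
import urllib.parse
import urllib.request

GDC_API = "https://api.gdc.cancer.gov"


def fetch_page(endpoint, offset, size, filters=None):
    """Fetch one page from a GDC endpoint such as 'projects' or 'cases'."""
    params = {"from": offset, "size": size, "format": "json"}
    if filters is not None:
        params["filters"] = json.dumps(filters)  # filters travel as a JSON string
    url = f"{GDC_API}/{endpoint}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)["data"]


def iter_hits(endpoint, max_items, page_size=100, filters=None, fetch=fetch_page):
    """Yield up to max_items records, advancing the offset page by page."""
    offset = 0
    yielded = 0
    while yielded < max_items:
        size = min(page_size, max_items - yielded)
        data = fetch(endpoint, offset, size, filters)
        hits = data["hits"]
        if not hits:
            break  # nothing left on the server
        for hit in hits[: max_items - yielded]:
            yield hit
            yielded += 1
        offset += len(hits)
        if offset >= data["pagination"]["total"]:
            break  # last page reached


# Usage (performs live HTTP requests):
#   for project in iter_hits("projects", max_items=10):
#       print(project["project_id"])
```

The fetch argument is injectable so the paging logic can be exercised with a stub instead of live requests.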

🏥 Where does the data come from?

Directly from the GDC public API at api.gdc.cancer.gov. The GDC is operated by the National Cancer Institute (NCI) at the U.S. National Institutes of Health.

🔓 Does it return controlled-access data?

No. This Actor returns only open-access metadata. Files marked access: "controlled" show up in the inventory but their contents require dbGaP authorization to download.

🧬 What is the difference between projects, cases, files, and annotations?

A project is a study (e.g. TCGA-BRCA). A case is a patient or participant within a project. A file is a single data artifact (BAM, VCF, TSV) tied to one or more cases. An annotation is a curated note on a case, file, or sample (e.g. data quality flags).

📂 Which file types can the files mode list?

Every type GDC hosts, including aligned and unaligned sequencing reads (BAM, FASTQ), variant calls (VCF, MAF), copy-number data, gene expression matrices, methylation tables, and clinical TSVs. Use the data_format, data_type, and experimental_strategy fields in the output to filter downstream.
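Downstream filtering on those fields can be as simple as the sketch below. The helper name select_files is hypothetical, and it assumes each file record is a dict carrying data_format, access, and experimental_strategy as plain strings.

```python
def select_files(records, data_format=None, access=None, experimental_strategy=None):
    """Keep file records whose fields match every criterion that was supplied."""
    wanted = {
        "data_format": data_format,
        "access": access,
        "experimental_strategy": experimental_strategy,
    }
    # Ignore criteria left as None and compare case-insensitively.
    active = {field: value.lower() for field, value in wanted.items() if value}
    return [
        record
        for record in records
        if all(str(record.get(field, "")).lower() == value for field, value in active.items())
    ]
```

For example, select_files(items, data_format="BAM", access="open") narrows a files export to open-access BAM entries.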

🔁 How often is the dataset refreshed?

GDC ships data releases on a roughly quarterly cadence and adds smaller updates between releases. Every run of this Actor hits the live API, so your dataset reflects the current GDC release.

⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to run this Actor on any cron interval (weekly, monthly) and keep a downstream cancer-genomics database in sync.

⚖️ Can I legally use this data?

GDC metadata is publicly accessible under U.S. federal open-data policies. Review the GDC data-use policies for your specific use case, especially for redistribution. Controlled-access raw data requires dbGaP authorization and is not returned by this Actor.

💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and unlocks scheduling, higher concurrency, and larger datasets.

🧪 What if I need download URLs for files?

The file_id field is the canonical GDC handle. Pass it to the GDC Data Download API at api.gdc.cancer.gov/data/{file_id} for open-access files, or request a companion downloader via the contact form below.
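A minimal sketch of that hand-off, assuming an open-access file and the data endpoint named above. The helper names are hypothetical, and saving under the file_id is a simplification made here; the server may suggest a different file name in its response headers.

```python
import shutil
import urllib.request
from pathlib import Path

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def download_url(file_id):
    """Build the download URL for a single GDC file_id."""
    file_id = file_id.strip()
    if not file_id or "/" in file_id:
        raise ValueError("file_id must be a single non-empty identifier")
    return f"{GDC_DATA_ENDPOINT}/{file_id}"


def download_open_file(file_id, dest_dir="."):
    """Stream an open-access file to disk; controlled files need dbGaP authorization."""
    file_id = file_id.strip()
    url = download_url(file_id)  # validates the identifier before it is used in a path
    target = Path(dest_dir) / file_id
    with urllib.request.urlopen(url, timeout=120) as resp, open(target, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks instead of loading into memory
    return target
```

Streaming with copyfileobj matters here because sequencing files can be far larger than available memory.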

🆘 What if I need help?

Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.


🔌 Integrate with any app

NCI GDC Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe cancer-genomics metadata into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh GDC case lists into your cohort builder, or alert your team in Slack when a new project releases.


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the NCI Genomic Data Commons, the National Cancer Institute, or the National Institutes of Health. All trademarks mentioned are the property of their respective owners. Only publicly available open cancer-genomics metadata is collected.