Pricing

from $29.62 / 1,000 results

NCI GDC Cancer Genomics Scraper

Scrape projects, cases, files, and annotations from the NCI Genomic Data Commons (GDC) public API. Filter by primary site or program (TCGA / CPTAC / TARGET) and get rich summary fields like case_count, file_count, file_size, disease_type and demographics. No API key required.

Rating

0.0

(0)

Developer

ParseForge

Last modified

6 days ago

🧬 NCI Genomic Data Commons (GDC) Cancer Scraper

🚀 Export cancer genomics metadata in seconds. Pull TCGA, CPTAC, TARGET, HCMI, BEATAML, and 20+ NCI programs across projects, cases, files, and annotations. No API key, no registration, no manual REST stitching.

🕒 Last updated: 2026-05-13 · 📊 4 entity modes · 🏥 26 NCI programs · 🧬 50+ primary sites · 🌐 GDC public API

The NCI GDC Cancer Scraper queries the NCI Genomic Data Commons public REST API and returns rich records across four entity types: projects, cases, files, and annotations. GDC is the National Cancer Institute's open data platform for cancer research, hosting harmonized genomic and clinical data from TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, and 20+ other landmark programs.

The catalog covers 50+ primary tumour sites (breast, lung, brain, colon, ovary, pancreas, prostate, kidney, liver, and more) and 26 NCI cancer programs, totalling thousands of projects, hundreds of thousands of cases, and millions of harmonized data files. This Actor exposes both site and program filters at the API level, so disease-specific or program-specific exports are fast.

🎯 Target Audience	💡 Primary Use Cases
Cancer biologists, computational oncologists, clinical bioinformaticians, biostatisticians, pharma R&D teams, journalists, regulatory analysts, ML researchers	Cohort discovery, program-level metadata audits, file inventory, annotation tracking, ML training datasets, demographic surveys, cross-program comparisons

📋 What the NCI GDC Scraper does

Four entity modes in a single Actor:

🏥 Projects. Project IDs (TCGA-BRCA, CPTAC-3, TARGET-AML, etc.), program affiliation, primary sites, disease types, dbGaP accession, releasable / released state, and a summary block with file count, case count, and total file size in bytes.
👤 Cases. Case IDs, submitter IDs, primary site, disease type, project context, demographic block, diagnoses (with stage, vital status, age at diagnosis), exposures, index date, and timestamps.
📁 Files. File IDs, file names, data category (Sequencing Reads, Transcriptome Profiling, etc.), data format (BAM, VCF, TSV, etc.), data type, experimental strategy (WGS, WXS, RNA-Seq, etc.), file size, MD5 sum, access (open / controlled), state, associated cases, and analysis workflow.
📝 Annotations. Annotation IDs, entity ID and type, submitter ID, category, classification, notes, status, project context, and timestamps.

Filter any mode by primary site (50 options) or NCI program (26 options) to scope your export. Filters are pushed to the GDC API server-side.

💡 Why it matters: TCGA and CPTAC underpin most modern cancer-genomics research. Building your own GDC filter compiler and paginator means days of plumbing; this Actor returns ready-joined records on every run.

🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded GDC dataset.

⚙️ Input

Input	Type	Default	Behavior
entity	enum	"projects"	One of projects, cases, files, annotations.
maxItems	integer	10	Records to return. Free plan caps at 10, paid plan at 1,000,000.
primarySite	string	""	One of 50 primary tumour sites. Empty = all.
program	string	""	One of 26 NCI programs (TCGA, CPTAC, TARGET, HCMI, BEATAML1.0, MMRF, CCDI, etc.). Empty = all.

Example: up to 10,000 TCGA cases in a single dataset. Raise maxItems if the program holds more cases than that.

{
    "entity": "cases",
    "program": "TCGA",
    "maxItems": 10000
}

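Example: up to 500 file records for breast-cancer cases. The input has no access or data-format filter, so narrow the result to open BAM files downstream with the access and data_format output fields.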

{
    "entity": "files",
    "primarySite": "Breast",
    "maxItems": 500
}

⚠️ Good to Know: filter field paths differ by entity. For projects, primary_site is checked at the project level; for cases, at the case level; for files, at cases.primary_site; for annotations, at case.primary_site. The Actor handles this routing for you, so the same primarySite input works across all four modes. The same applies to program.name versus project.program.name.
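The routing described above can be sketched as a small filter compiler. This is an illustration, not the Actor's actual code: it assumes the GDC filter syntax of nested op / content objects, takes the primary_site paths from the note above, and the program paths for files and annotations are assumptions. The function name build_filters is hypothetical.

```python
from typing import Optional

# primary_site paths follow the note above.
PRIMARY_SITE_FIELD = {
    "projects": "primary_site",
    "cases": "primary_site",
    "files": "cases.primary_site",
    "annotations": "case.primary_site",
}
# Only the projects and cases program paths come from the note above.
PROGRAM_FIELD = {
    "projects": "program.name",
    "cases": "project.program.name",
    "files": "cases.project.program.name",  # assumed path
    "annotations": "project.program.name",  # assumed path
}


def build_filters(entity: str, primary_site: str = "", program: str = "") -> Optional[dict]:
    """Compile the two inputs into a GDC-style filter, or None when both are empty."""
    if entity not in PRIMARY_SITE_FIELD:
        raise ValueError(f"unknown entity: {entity!r}")
    clauses = []
    if primary_site:
        clauses.append(
            {"op": "in", "content": {"field": PRIMARY_SITE_FIELD[entity], "value": [primary_site]}}
        )
    if program:
        clauses.append(
            {"op": "in", "content": {"field": PROGRAM_FIELD[entity], "value": [program]}}
        )
    if not clauses:
        return None  # empty inputs mean "no filter"
    if len(clauses) == 1:
        return clauses[0]
    return {"op": "and", "content": clauses}  # both filters must match
```

The same primarySite value therefore produces a different field path per entity, which is why one input can serve all four modes.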

📊 Output

Output shape varies by entity. Each record always carries a url to the GDC portal and a scrapedAt timestamp.

🧾 Projects schema

Field	Type	Example
🆔 `project_id`	string	`"TCGA-LGG"`
🔗 `url`	string	`"https://portal.gdc.cancer.gov/projects/TCGA-LGG"`
📛 `name`	string	`"Brain Lower Grade Glioma"`
📍 `primary_site`	string[]	`["Brain"]`
🦠 `disease_type`	string[]	`["Gliomas"]`
🏥 `program`	object	`{ name: "TCGA", program_id, dbgap_accession_number }`
🔓 `releasable`	boolean	`true`
✅ `released`	boolean	`true`
📂 `state`	string	`"open"`
📊 `summary`	object	`{ file_count, case_count, file_size }`
🕒 `scrapedAt`	ISO 8601	`"2026-05-13T..."`
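As a rough illustration of working with the summary block, the sketch below ranks project records by case count and converts file_size from bytes to terabytes. It assumes records shaped like the schema above; the function name summarize_projects and the file_size_tb key are hypothetical.

```python
def summarize_projects(projects, top=5):
    """Return the `top` projects by case_count with file_size converted to TB."""
    rows = []
    for project in projects:
        summary = project.get("summary") or {}  # tolerate a missing summary block
        rows.append(
            {
                "project_id": project["project_id"],
                "case_count": summary.get("case_count", 0),
                # file_size is reported in bytes; 1 TB = 1e12 bytes here
                "file_size_tb": round(summary.get("file_size", 0) / 1e12, 3),
            }
        )
    rows.sort(key=lambda row: row["case_count"], reverse=True)
    return rows[:top]
```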

🧾 Cases / Files / Annotations

Each of the other three modes emits its native ID, its GDC portal URL, its expanded blocks (demographic, diagnoses, and exposures for cases; cases and analysis for files; project for annotations), and timestamps.

📦 Sample records

✨ Why choose this Actor

	Capability
🏥	All four GDC entities. Projects, cases, files, and annotations in one Actor.
🎯	Server-side filtering. Primary site and program filters compile to GDC filter JSON and run at the API level.
🧬	Expand-ready. Demographics, diagnoses, exposures, project, and analysis blocks expanded per record.
🔬	26 NCI programs. TCGA, CPTAC, TARGET, HCMI, BEATAML, MMRF, CCDI, ALCHEMIST, MATCH, ORGANOID, and more.
📍	50 primary sites. Every ICD-O site from adrenal gland to vulva.
⚡	Fast. REST pagination with offset, page size 100-1000.
🚫	No authentication. Works on the public GDC API. No login or API key.

📊 GDC hosts one of the largest harmonized cancer-genomics datasets in the world.

📈 How it compares to alternatives

Approach	Cost	Coverage	Refresh	Filters	Setup
⭐ NCI GDC Scraper (this Actor)	$5 free credit, then pay-per-use	Full GDC catalogue	Live per run	site, program, entity	⚡ 2 min
Hand-rolled GDC REST queries	Free	Full	Manual	Manual	🐢 Days
cBioPortal	Free	Curated subset	Quarterly	Many	⏳ Hours
Commercial cancer-genomics platforms	$$$/year	Proprietary	Per-release	Many	⏳ Weeks

Pick this Actor when you want broad cancer-genomics coverage, four ready-built entity modes, and no pipeline maintenance.

🚀 How to use

📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
🌐 Open the Actor. Go to the NCI GDC Cancer Scraper page on the Apify Store.
🎯 Set input. Pick an entity, optionally filter by primary site or program, and set maxItems.
🚀 Run it. Click Start and let the Actor collect your data.
📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.

💼 Business use cases

💊 Pharma & Translational Research

Cohort discovery for trial recruitment
Tumour-type prevalence dashboards
Biomarker prevalence by program
Indication-specific case inventories

🧬 Computational Oncology

TCGA / CPTAC harmonized data audits
WGS, WXS, and RNA-Seq file inventories
Demographic distributions per disease cohort
Cross-program survival analyses

📰 Reporting & Scientometrics

Public-data programme accounting
Coverage maps by primary site
Program-level open-data benchmarks
Annotation tracking for data quality

🤖 ML & AI for Cancer Genomics

Training-cohort assembly with site / program filters
File-inventory feeds for download pipelines
Demographic feature engineering
Knowledge graphs joining cases, files, and projects

🔌 Automating NCI GDC Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

🟢 Node.js. Install the apify-client NPM package.
🐍 Python. Use the apify-client PyPI package.
📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Weekly or monthly refreshes keep your dataset aligned with new GDC data releases.
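A minimal Python sketch of a programmatic run, assuming the apify-client PyPI package is installed. The input keys come from the Input table above; the actor_id argument is a placeholder because this page does not state the Actor's ID, and the helper names build_run_input and run_actor are hypothetical.

```python
ENTITIES = {"projects", "cases", "files", "annotations"}


def build_run_input(entity="projects", max_items=10, primary_site="", program=""):
    """Build the Actor input, leaving out empty filters (empty means 'all')."""
    if entity not in ENTITIES:
        raise ValueError(f"entity must be one of {sorted(ENTITIES)}")
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    run_input = {"entity": entity, "maxItems": max_items}
    if primary_site:
        run_input["primarySite"] = primary_site
    if program:
        run_input["program"] = program
    return run_input


def run_actor(token, actor_id, run_input):
    """Start a run, wait for it to finish, and return the dataset items as a list."""
    # Imported here so build_run_input stays usable without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call might look like run_actor(token, "<username>/<actor-name>", build_run_input("cases", 500, program="TCGA")), with the token read from your environment and the placeholder replaced by the ID shown on the Actor's store page.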

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

Reproducible cancer-genomics studies with versioned dataset pulls
Teaching datasets for oncology and bioinformatics
Open-source benchmark publications
Cross-database joins with cBioPortal and Ensembl

🎨 Personal and creative

Indie cancer-data visualization apps
Educational dashboards for science communication
Public-health storytelling
Portfolio projects on biomedical NLP

🤝 Non-profit and civic

Cancer-charity awareness dashboards
Open-science cancer-equity tracking
Public-domain references for journalism
Civic transparency on federally funded cancer research

🧪 Experimentation

Train tumour-classification models on real cohorts
Prototype agentic tools that resolve TCGA IDs
Benchmark cancer-genomics libraries on real data
Generate cohort embeddings at scale

❓ Frequently Asked Questions

🧩 How does it work?

Pick an entity (projects / cases / files / annotations), optionally apply site or program filters, click Start, and the Actor hits the GDC REST API with server-side filtering and pagination. Records are emitted as clean JSON ready for download. No browser automation, no captchas, no setup.
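The pagination step can be sketched as an offset loop. This is an assumption-laden illustration rather than the Actor's code: it assumes each response carries its records under data.hits and a total under data.pagination.total, which is how the GDC API reports search results, and it takes a hypothetical fetch_page(offset, size) callable so the loop stays independent of any HTTP library.

```python
from typing import Callable, Iterator


def paginate(
    fetch_page: Callable[[int, int], dict],
    max_items: int,
    page_size: int = 100,
) -> Iterator[dict]:
    """Yield up to max_items hits, asking for at most page_size records per call."""
    offset = 0
    emitted = 0
    while emitted < max_items:
        # Never request more than is still needed.
        size = min(page_size, max_items - emitted)
        payload = fetch_page(offset, size)
        hits = payload["data"]["hits"]
        if not hits:
            return  # the server has nothing more to give
        for hit in hits:
            if emitted >= max_items:
                return
            yield hit
            emitted += 1
        offset += len(hits)
        if offset >= payload["data"]["pagination"]["total"]:
            return  # the last page has been consumed
```

In a real client, fetch_page would issue a request to the entity endpoint with the offset and size as the from and size query parameters, plus the compiled filter.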

🏥 Where does the data come from?

Directly from the GDC public API at api.gdc.cancer.gov. The GDC is operated by the National Cancer Institute (NCI) at the U.S. National Institutes of Health.

🔓 Does it return controlled-access data?

No. This Actor returns only open-access metadata. Files marked access: "controlled" show up in the inventory but their contents require dbGaP authorization to download.

🧬 What is the difference between projects, cases, files, and annotations?

A project is a study (e.g. TCGA-BRCA). A case is a patient or participant within a project. A file is a single data artifact (BAM, VCF, TSV) tied to one or more cases. An annotation is a curated note on a case, file, or sample (e.g. data quality flags).

📂 Which file types can the files mode list?

Every type GDC hosts, including aligned and unaligned sequencing reads (BAM, FASTQ), variant calls (VCF, MAF), copy-number data, gene expression matrices, methylation tables, and clinical TSVs. Use the data_format, data_type, and experimental_strategy fields in the output to filter downstream.
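Downstream filtering on those fields can be as simple as the sketch below. It assumes each file record is a dict carrying data_format, experimental_strategy, and access as described in the Files mode above; the function name select_files is hypothetical.

```python
def select_files(records, *, data_format=None, experimental_strategy=None, access=None):
    """Keep the records that match every criterion that was actually supplied."""
    wanted = {
        "data_format": data_format,
        "experimental_strategy": experimental_strategy,
        "access": access,
    }
    # Criteria left as None are ignored.
    active = {field: value for field, value in wanted.items() if value is not None}
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in active.items())
    ]
```

For example, select_files(items, data_format="BAM", access="open") keeps only open-access BAM entries from a files-mode dataset.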

🔁 How often is the dataset refreshed?

GDC ships data releases on a roughly quarterly cadence and adds smaller updates between releases. Every run of this Actor hits the live API, so your dataset reflects the current GDC release.

⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to run this Actor on any cron interval (weekly, monthly) and keep a downstream cancer-genomics database in sync.

⚖️ Is this data legal to use?

GDC metadata is publicly accessible under U.S. federal open-data policies. Review the GDC data-use policies for your specific use case, especially for redistribution. Controlled-access raw data requires dbGaP authorization and is not returned by this Actor.

💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and unlocks scheduling, higher concurrency, and larger datasets.

🧪 What if I need download URLs for files?

The file_id field is the canonical GDC handle. Pass it to the GDC Data Download API at api.gdc.cancer.gov/data/{file_id} for open-access files, or request a companion downloader via the contact form below.
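A small sketch of turning a file record into a download URL, using the endpoint named above. It assumes the record carries the file_id and access fields from the Files mode and returns nothing for controlled-access files; the function name open_download_url is hypothetical.

```python
from typing import Optional

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def open_download_url(record: dict) -> Optional[str]:
    """Return the data URL for an open-access file record, else None."""
    if record.get("access") != "open":
        return None  # controlled files need dbGaP authorization
    file_id = record.get("file_id")
    if not file_id:
        return None  # no handle to build a URL from
    return f"{GDC_DATA_ENDPOINT}/{file_id}"
```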

🆘 What if I need help?

Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.

🔌 Integrate with any app

NCI GDC Scraper connects to any cloud service via Apify integrations:

Make - Automate multi-step workflows
Zapier - Connect with 5,000+ apps
Slack - Get run notifications in your channels
Airbyte - Pipe cancer-genomics metadata into your warehouse
GitHub - Trigger runs from commits and releases
Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh GDC case lists into your cohort builder, or alert your team in Slack when a new project releases.

🔗 Recommended Actors

🤗 Hugging Face Model Scraper - Model metadata, downloads, and benchmarks
🏥 FINRA BrokerCheck Scraper - U.S. broker and firm regulatory disclosures
🏨 Greatschools Scraper - U.S. school ratings and demographics
📈 Smart Apify Actor Scraper - Apify Store actor metadata and quality signals

💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.

🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.

⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the NCI Genomic Data Commons, the National Cancer Institute, or the National Institutes of Health. All trademarks mentioned are the property of their respective owners. Only publicly available open cancer-genomics metadata is collected.
