RCSB PDB Protein Structure Scraper

Pricing

from $28.87 / 1,000 results

Scrape protein structure entries from the RCSB Protein Data Bank including title, authors, citation, experimental method (X-ray, EM, NMR), resolution, cell parameters, symmetry, polymer entities, keywords and entry metadata. No API key required.

Developer

ParseForge

Maintained by Community

🧬 RCSB Protein Data Bank Scraper

🚀 Export 3D macromolecular structure metadata in seconds. Pull 220,000+ PDB entries with resolution, experimental method, unit cell, primary citation, and deposit history. No API key, no registration, no manual REST stitching.

🕒 Last updated: 2026-05-13 · 📊 22 fields per record · 🧬 220,000+ structures · 🔬 9 experimental methods · 🌐 RCSB public API

The RCSB PDB Scraper queries the RCSB Search API and Data API and returns 22 fields per structure, including the 4-character PDB ID, title and descriptor, classification keywords, experimental method (X-ray, cryo-EM, NMR, neutron, fiber, powder, scattering), combined resolution, unit-cell dimensions and crystal symmetry (for X-ray entries), deposit and release dates, polymer composition and atom count, the audit-author list, and the full primary citation (title, journal, year, authors, DOI, PubMed ID). The Protein Data Bank has been the global archive of 3D biological macromolecular structures since 1971.

The catalog covers proteins, nucleic acids, complexes, viruses, ribosomes, membrane proteins, and small-molecule ligands across X-ray diffraction, electron microscopy (cryo-EM), solution and solid-state NMR, neutron diffraction, fiber, powder, electron crystallography, and solution scattering. This Actor makes the data downloadable as CSV, Excel, JSON, or XML in under a minute. Crystallographic fields (unit cell, space group, refinement resolution) are surfaced only when relevant to the experiment.

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Structural biologists, cryo-EM researchers, computational chemists, drug discovery teams, bioinformaticians, journal editors, citation analysts, ML researchers | Structure browsing, citation graphs, method benchmarking, drug-target validation, training sets for AI structure prediction, deposition tracking, journal scientometrics |

📋 What the RCSB PDB Scraper does

Two retrieval modes, plus an optional method filter, in a single run:

  • 🔎 Full-text search. Query the RCSB search API for any text (e.g. hemoglobin, SARS-CoV-2 spike, kinase inhibitor).
  • 🆔 Explicit IDs. Pass a list of 4-character PDB entry IDs (e.g. ["3GOU", "1HHO"]) to fetch full metadata directly.
  • 🔬 Method filter. Restrict by experimental method (X-ray, cryo-EM, NMR, neutron, fiber, powder, scattering, electron crystallography).

Each record returns the PDB ID, RCSB explorer URL, structure title and descriptor, classification keywords, experimental method, combined resolution, unit-cell dimensions and space group (for X-ray only), refinement resolution, deposit and release dates, polymer entity count, atom and monomer counts, the audit-author list, and the full primary citation block.

💡 Why it matters: PDB structures are the bedrock of structural biology, drug discovery, and the AlphaFold era. The RCSB API surfaces fields across multiple endpoints; this Actor joins them into a single, denormalized row per entry, complete with citation metadata.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded PDB dataset.


⚙️ Input

| Input | Type | Default | Behavior |
|---|---|---|---|
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| searchQuery | string | "hemoglobin" | Full-text search query. Leave empty when you pass explicit IDs. |
| pdbIds | string[] | [] | Explicit 4-character PDB IDs. Overrides the search query when set. |
| experimentalMethod | string | "" | One of 9 experimental methods. Empty = any. |

Example: 100 cryo-EM structures matching "spike".

{
  "maxItems": 100,
  "searchQuery": "spike",
  "experimentalMethod": "ELECTRON MICROSCOPY"
}

Example: explicit IDs for hemoglobin classics.

{
  "maxItems": 5,
  "pdbIds": ["3GOU", "1HHO", "2DN1", "1A3N", "4HHB"]
}
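
The input rules from the table above can be sketched as a small Python helper that assembles the JSON before a run. This is only an illustration: `build_input` and the alphanumeric check on IDs are assumptions made for the example, not part of the Actor, which may validate input differently.

```python
import re
from typing import List, Optional

# Assumption: "4-character PDB ID" is read here as four alphanumeric characters.
PDB_ID_PATTERN = re.compile(r"[A-Za-z0-9]{4}")


def build_input(
    search_query: str = "",
    pdb_ids: Optional[List[str]] = None,
    experimental_method: str = "",
    max_items: int = 10,
) -> dict:
    """Assemble an input object using the field names from the Input table."""
    ids = [pid.strip().upper() for pid in (pdb_ids or [])]
    bad = [pid for pid in ids if not PDB_ID_PATTERN.fullmatch(pid)]
    if bad:
        raise ValueError(f"Not 4-character PDB IDs: {bad}")
    if not ids and not search_query:
        raise ValueError("Provide either a search query or explicit PDB IDs")
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")

    payload: dict = {"maxItems": max_items}
    if ids:
        # Explicit IDs override the search query, so the query is left out.
        payload["pdbIds"] = ids
    else:
        payload["searchQuery"] = search_query
    if experimental_method:
        payload["experimentalMethod"] = experimental_method
    return payload


print(build_input(search_query="spike", experimental_method="ELECTRON MICROSCOPY", max_items=100))
```

Normalizing IDs to upper case is a convenience of the sketch; the examples in this README already use upper-case IDs.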

⚠️ Good to Know: the unit-cell block, crystal count, and refinement resolution apply only to X-ray entries; the Actor omits these fields for cryo-EM and NMR structures rather than emitting always-null columns. Resolution comes from rcsb_entry_info.resolution_combined and, when present, is always an array (some entries report more than one value). The primary citation is taken from rcsb_primary_citation and falls back to the first citation entry.
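
Because some fields can be absent and resolution is an array, downstream code usually needs a little defensive handling. The sketch below shows one way to do that in Python; the helper names and the sample records are made up for illustration and are not real Actor output.

```python
from typing import Optional

# Fields this README describes as X-ray only; they may be missing from a record.
XRAY_ONLY_FIELDS = ("crystals_number", "cell", "symmetry", "ls_d_res_high")


def best_resolution(record: dict) -> Optional[float]:
    """Return the smallest (best) resolution in angstroms, or None if absent."""
    values = record.get("resolution_combined") or []
    return min(values) if values else None


def with_xray_columns(record: dict) -> dict:
    """Copy a record and fill any omitted X-ray-only field with None.

    Handy before loading mixed-method records into a fixed-column table.
    """
    row = dict(record)
    for field in XRAY_ONLY_FIELDS:
        row.setdefault(field, None)
    return row


# Illustrative records only (values invented for the example).
em_record = {"rcsb_id": "10AD", "experimental_method": "EM", "resolution_combined": [3.44]}
multi_record = {"rcsb_id": "0000", "resolution_combined": [2.1, 1.8], "ls_d_res_high": 1.8}

print(best_resolution(em_record))
print(best_resolution(multi_record))
print(sorted(with_xray_columns(em_record)))
```

Taking the minimum treats the lowest angstrom value as the best resolution; pick a different reduction if your analysis needs every reported value.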


📊 Output

Each PDB record contains up to 22 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🆔 rcsb_id | string | "10AD" |
| 🔗 url | string | "https://www.rcsb.org/structure/10AD" |
| 🏷️ title | string \| null | "Cryo-EM structure of the human BK channel bound to the agonist NS1619" |
| 🧪 descriptor | string | structure descriptor |
| 🔖 keywords | string \| null | "MEMBRANE PROTEIN" |
| 📝 keyword_text | string \| null | "BK, Slo1, MEMBRANE PROTEIN" |
| 🔬 experimental_method | string \| null | "EM" |
| 📏 resolution_combined | number[] \| null | [3.44] |
| 🧊 crystals_number | number | (X-ray only) 1 |
| 📐 cell | object | (X-ray only) { length_a, length_b, length_c, angle_alpha, angle_beta, angle_gamma, Z_PDB } |
| 🔷 symmetry | object | (X-ray only) { space_group_name_H_M, Int_Tables_number } |
| 🎯 ls_d_res_high | number | (X-ray only) refinement resolution |
| 📅 deposit_date | string \| null | "2026-01-08T00:00:00.000+00:00" |
| 📤 release_date | string \| null | "2026-02-04T00:00:00.000+00:00" |
| 🔁 revision_date | string \| null | "2026-02-11T00:00:00.000+00:00" |
| 🧬 polymer_entity_count | number \| null | 1 |
| 🍯 branched_entity_count | number \| null | 0 |
| 🧩 polymer_composition | string \| null | "homomeric protein" |
| ⚛️ deposited_atom_count | number \| null | 28028 |
| 🔗 deposited_polymer_monomer_count | number \| null | 4452 |
| 👥 audit_authors | string[] | ["Gonzalez-Sanabria, N.", "Contreras, G.F."] |
| 📰 primary_citation | object \| null | { title, journal, year, doi, pubmed, authors } |
| 🕒 scrapedAt | string (ISO 8601) | "2026-05-13T22:26:22.583Z" |
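
The date fields in the schema are ISO 8601 strings, so they can be parsed with the Python standard library. The following sketch computes the deposit-to-release lag for one record; `parse_timestamp` and `days_to_release` are hypothetical helper names, and the sample values are copied from the Example column above.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string; a trailing 'Z' is rewritten for older Pythons."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def days_to_release(record: dict) -> Optional[int]:
    """Whole days between deposit and public release, or None if a date is missing."""
    deposited = parse_timestamp(record.get("deposit_date"))
    released = parse_timestamp(record.get("release_date"))
    if deposited is None or released is None:
        return None
    return (released - deposited).days


record = {
    "rcsb_id": "10AD",
    "deposit_date": "2026-01-08T00:00:00.000+00:00",
    "release_date": "2026-02-04T00:00:00.000+00:00",
    "scrapedAt": "2026-05-13T22:26:22.583Z",
}

print(days_to_release(record))  # 27
```

Both date fields are nullable in the schema, which is why the helper returns None instead of raising when one is missing.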


✨ Why choose this Actor

  • 🧬 Global coverage. 220,000+ macromolecular structures across all PDB experimental methods.
  • 🎯 Two retrieval modes. Run a full-text search or pass explicit PDB IDs in a single input.
  • 🔬 Method-aware fields. Crystallographic fields appear only for X-ray entries; cryo-EM and NMR rows stay clean.
  • 📰 Full citation block. Title, journal, year, DOI, PubMed ID, and author list per structure.
  • Fast. Parallel detail fetches (concurrency 8) cut a 50-entry run from 42 s to under 10 s.
  • 🔁 Always fresh. Every run hits the RCSB API live, so newly released entries appear within hours of public release.
  • 🚫 No authentication. Works on the public RCSB search and data APIs. No login or API key.

📊 The Protein Data Bank powers every modern structure-based drug discovery pipeline and is the training ground for the AlphaFold era.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ RCSB PDB Scraper (this Actor) | $5 free credit, then pay-per-use | 220,000+ entries | Live per run | text, IDs, method | ⚡ 2 min |
| RCSB REST + custom scripts | Free | Full PDB | Manual | Many, hand-rolled | 🐢 Days |
| PDBe SOLR API | Free | Mirror of PDB | Live | Many | ⏳ Hours |
| Crystallographic supplements from journals | Paid | Per-paper | Per-issue | None | 🕒 Variable |

Pick this Actor when you want broad structural-biology coverage, ready-joined records, and no pipeline maintenance.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the RCSB Protein Data Bank Scraper page on the Apify Store.
  3. 🎯 Set input. Enter a search query or paste a list of PDB IDs, optionally filter by method.
  4. 🚀 Run it. Click Start and let the Actor collect your data.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

💊 Structure-Based Drug Discovery

  • Target-validation surveys across approved drug classes
  • Cryo-EM resolution surveys for cohorts of GPCRs or kinases
  • Ligand-bound vs apo audits for screening campaigns
  • Competitive intel on filed structures

🧬 Structural Biology Research

  • Annual deposition trends across methods
  • X-ray vs cryo-EM resolution distributions
  • Polymer composition statistics by therapeutic area
  • Collaborations and author network analyses

📰 Scientometrics & Citation Analysis

  • DOI and PubMed cross-link feeds for biblio databases
  • Author productivity dashboards
  • Journal coverage of structural biology output
  • Time-to-publication after deposit

🤖 ML & AI for Structure Prediction

  • Curated training sets filtered by resolution
  • Method-stratified evaluation sets
  • Citation-linked benchmark suites
  • Multi-modal joins with UniProt and ChEMBL

🔌 Automating RCSB PDB Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Weekly refreshes catch every new PDB release.
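
As a rough sketch of programmatic control, the snippet below starts a run with the apify-client Python package and reads the resulting dataset. `ACTOR_ID` is a placeholder (copy the real ID from the Actor's Store page), and reading the token from an `APIFY_TOKEN` environment variable is an assumption of this example rather than a requirement.

```python
import os

# Placeholder: replace with the real Actor ID shown on the Apify Store page.
ACTOR_ID = "<username>/<actor-name>"


def fetch_records(client, actor_id: str, run_input: dict) -> list:
    """Run the Actor, wait for it to finish, and return all dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__" and os.environ.get("APIFY_TOKEN"):
    # Imported lazily so the helper above stays usable without the package.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    records = fetch_records(client, ACTOR_ID, {"searchQuery": "hemoglobin", "maxItems": 10})
    print(f"Fetched {len(records)} records")
```

Passing the client in as an argument keeps `fetch_records` easy to exercise with a stub in unit tests, without network access or credentials.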


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Reproducible structural-biology studies with versioned dataset pulls
  • Teaching datasets for crystallography and cryo-EM coursework
  • Open-source benchmarks for structure-prediction models
  • Cross-database joins with UniProt, ChEMBL, and AlphaFold DB

🎨 Personal and creative

  • Indie 3D-structure viewers and educational apps
  • Visualizations for science-communication content
  • Hobbyist databases for crystallography enthusiasts
  • Portfolio projects on protein-structure analysis

🤝 Non-profit and civic

  • Open-access pathogen-structure feeds during outbreaks
  • Pandemic-response structural-biology mapping
  • Public-domain references for science journalism
  • Open-data education for high schools and museums

🧪 Experimentation

  • Train surface-prediction or pocket-detection models
  • Prototype agentic tools that resolve PDB IDs
  • Benchmark structure-search libraries on real data
  • Generate structural embeddings at scale


❓ Frequently Asked Questions

🧩 How does it work?

Enter a search query or a list of PDB IDs, click Start, and the Actor hits the RCSB Search API to resolve IDs, then fetches full metadata per entry from the RCSB Data API at concurrency 8. Records are emitted as clean, joined JSON. No browser automation, no captchas, no setup.

🧬 Where does the data come from?

Directly from the RCSB Search API (search.rcsb.org/rcsbsearch/v2/query) and Data API (data.rcsb.org/rest/v1/core/entry). The Protein Data Bank is maintained jointly by RCSB PDB (USA), PDBe (Europe), and PDBj (Japan) under the wwPDB.
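
The two-step flow (resolve IDs through the Search API, then fetch each entry from the Data API with 8 parallel workers) can be sketched in plain Python. This is not the Actor's code: the request payload reflects the RCSB Search API v2 format as I understand it, the function names are invented for the example, and the details should be checked against the RCSB documentation.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"
ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry"


def build_search_payload(text: str, rows: int = 10) -> dict:
    """Full-text query returning entry IDs (assumed Search API v2 shape)."""
    return {
        "query": {"type": "terminal", "service": "full_text", "parameters": {"value": text}},
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": rows}},
    }


def search_ids(text: str, rows: int = 10) -> List[str]:
    """POST the query and return the matching PDB IDs."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(build_search_payload(text, rows)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        raw = response.read()
    if not raw:  # an empty body is treated as "no matches"
        return []
    return [hit["identifier"] for hit in json.loads(raw).get("result_set", [])]


def fetch_entry(pdb_id: str) -> dict:
    """GET the core entry record for one PDB ID."""
    with urllib.request.urlopen(f"{ENTRY_URL}/{pdb_id}", timeout=30) as response:
        return json.load(response)


def fetch_all(pdb_ids: List[str], fetch: Callable[[str], dict] = fetch_entry, concurrency: int = 8) -> List[dict]:
    """Fetch entries in parallel; pool.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch, pdb_ids))
```

A call such as `fetch_all(search_ids("hemoglobin"))` would chain the two steps. The `fetch` parameter exists so the fan-out can be tested with a stub instead of live requests.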

🔬 Why are unit-cell and refinement fields missing for some structures?

Crystallographic fields apply only to X-ray entries. Cryo-EM, NMR, neutron, and fiber-diffraction structures do not have a unit cell or a ls_d_res_high refinement resolution. The Actor omits those fields for non-X-ray rows to keep the dataset clean.

📏 What does resolution_combined actually contain?

A numeric array from rcsb_entry_info.resolution_combined. Most entries return a single value; multi-experiment structures may return multiple. Units are angstroms.

📰 Which citation field is primary_citation?

It is taken from rcsb_primary_citation if present, otherwise the first item in the citation array. It contains the publication title, journal, year, DOI, PubMed ID, and authors.
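
The fallback rule described here amounts to a few lines of Python. The sketch assumes a Data API entry is a dictionary with an optional `rcsb_primary_citation` object and an optional `citation` list; `pick_primary_citation` is a hypothetical name, and the Actor's own implementation is not shown in this README.

```python
from typing import Optional


def pick_primary_citation(entry: dict) -> Optional[dict]:
    """Prefer rcsb_primary_citation; otherwise use the first citation, if any."""
    primary = entry.get("rcsb_primary_citation")
    if primary:
        return primary
    citations = entry.get("citation") or []
    return citations[0] if citations else None


# Illustrative inputs (not real API responses).
with_primary = {"rcsb_primary_citation": {"title": "A"}, "citation": [{"title": "B"}]}
fallback_only = {"citation": [{"title": "B"}, {"title": "C"}]}

print(pick_primary_citation(with_primary))
print(pick_primary_citation(fallback_only))
print(pick_primary_citation({}))
```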

🔁 How often is the dataset refreshed?

RCSB releases new and updated entries weekly on Wednesdays. Every run of this Actor pulls live, so your dataset reflects the current state of the PDB at run time.

🆔 Can I fetch one specific PDB ID?

Yes. Pass it in the pdbIds array, leave searchQuery empty, and you will get back a single record with the full metadata block.

⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to run this Actor on any cron interval (daily, weekly) and keep a downstream structural-biology database in sync.

⚖️ Can I reuse and redistribute the data?

The Protein Data Bank is released under a CC0 dedication, and the raw structure metadata is publicly accessible. Review wwPDB licensing for your specific use case, especially for redistribution.

💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and unlocks scheduling, higher concurrency, and larger datasets.

🧪 What if I need atom coordinates?

This Actor returns metadata only, not the mmCIF or PDB coordinate files. For coordinates, fetch directly from the RCSB Files API, or reach out via the contact form below to request a companion coordinate fetcher.
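
If you only need a handful of coordinate files, a short script built on the `rcsb_id` values from this Actor may be enough. The sketch below assumes the public download URL pattern `https://files.rcsb.org/download/<ID>.cif`; confirm it against the RCSB file-download documentation before relying on it, and note that the helper names are invented for the example.

```python
import os
import urllib.request

# Assumed base URL for RCSB file downloads.
FILES_BASE = "https://files.rcsb.org/download"


def coordinate_url(pdb_id: str, fmt: str = "cif") -> str:
    """Build the download URL for one entry in mmCIF ('cif') or PDB ('pdb') format."""
    if fmt not in ("cif", "pdb"):
        raise ValueError("fmt must be 'cif' or 'pdb'")
    return f"{FILES_BASE}/{pdb_id.upper()}.{fmt}"


def download_coordinates(pdb_id: str, dest_dir: str = ".", fmt: str = "cif") -> str:
    """Download one coordinate file and return the local path."""
    path = os.path.join(dest_dir, f"{pdb_id.upper()}.{fmt}")
    urllib.request.urlretrieve(coordinate_url(pdb_id, fmt), path)
    return path


print(coordinate_url("4hhb"))  # https://files.rcsb.org/download/4HHB.cif
```

mmCIF is the safer default in this sketch because some large structures may not be available in the legacy PDB format.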

🆘 What if I need help?

Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.


🔌 Integrate with any app

RCSB PDB Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe PDB data into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh PDB metadata into your research backend, or alert your team in Slack when a watched ID is released.
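
A webhook receiver typically needs two small pieces of logic: pull the dataset ID out of the notification, then check the new records for IDs you care about. The sketch below assumes the default Apify webhook payload, with an `eventType` string and a `resource` object that carries `defaultDatasetId`; verify the exact shape in the Apify webhook documentation. Both function names are hypothetical.

```python
from typing import Iterable, List, Optional


def dataset_id_from_webhook(payload: dict) -> Optional[str]:
    """Return the run's dataset ID for a successful run, otherwise None."""
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    return (payload.get("resource") or {}).get("defaultDatasetId")


def watched_hits(records: Iterable[dict], watched_ids: Iterable[str]) -> List[dict]:
    """Keep only records whose rcsb_id is on the watch list (case-insensitive)."""
    watched = {pid.upper() for pid in watched_ids}
    return [r for r in records if str(r.get("rcsb_id", "")).upper() in watched]


# Illustrative payload and records (not captured from a real run).
payload = {"eventType": "ACTOR.RUN.SUCCEEDED", "resource": {"defaultDatasetId": "abc123"}}
records = [{"rcsb_id": "4HHB"}, {"rcsb_id": "10AD"}]

print(dataset_id_from_webhook(payload))
print(watched_hits(records, ["10ad"]))
```

From there, the matching records could be forwarded to Slack or a research backend using whichever integration you already have in place.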


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by RCSB PDB, the wwPDB, or any of its partner sites. All trademarks mentioned are the property of their respective owners. Only publicly available open structural-biology data is collected.