RCSB PDB Protein Structure Scraper
Pricing
from $28.87 / 1,000 results
Scrape protein structure entries from the RCSB Protein Data Bank including title, authors, citation, experimental method (X-ray, EM, NMR), resolution, cell parameters, symmetry, polymer entities, keywords and entry metadata. No API key required.
Developer: ParseForge (maintained by Community)

🧬 RCSB Protein Data Bank Scraper
🚀 Export 3D macromolecular structure metadata in seconds. Pull 220,000+ PDB entries with resolution, experimental method, unit cell, primary citation, and deposit history. No API key, no registration, no manual REST stitching.
🕒 Last updated: 2026-05-13 · 📊 22 fields per record · 🧬 220,000+ structures · 🔬 9 experimental methods · 🌐 RCSB public API
The RCSB PDB Scraper queries the RCSB Search API and Data API and returns 22 fields per structure, including the 4-character PDB ID, title and descriptor, classification keywords, experimental method (X-ray, cryo-EM, NMR, neutron, fiber, powder, scattering), combined resolution, unit-cell dimensions and crystal symmetry (for X-ray entries), deposit and release dates, polymer composition and atom count, the audit-author list, and the full primary citation (title, journal, year, authors, DOI, PubMed ID). The Protein Data Bank has been the global archive of 3D biological macromolecular structures since 1971.
The catalog covers proteins, nucleic acids, complexes, viruses, ribosomes, membrane proteins, and small-molecule ligands across X-ray diffraction, electron microscopy (cryo-EM), solution and solid-state NMR, neutron diffraction, fiber, powder, electron crystallography, and solution scattering. This Actor makes the data downloadable as CSV, Excel, JSON, or XML in under a minute. Crystallographic fields (unit cell, space group, refinement resolution) are surfaced only when relevant to the experiment.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Structural biologists, cryo-EM researchers, computational chemists, drug discovery teams, bioinformaticians, journal editors, citation analysts, ML researchers | Structure browsing, citation graphs, method benchmarking, drug-target validation, training sets for AI structure prediction, deposition tracking, journal scientometrics |
📋 What the RCSB PDB Scraper does
Two retrieval modes in a single run:
- 🔎 Full-text search. Query the RCSB search API for any text (e.g. `hemoglobin`, `SARS-CoV-2 spike`, `kinase inhibitor`).
- 🆔 Explicit IDs. Pass a list of 4-character PDB entry IDs (e.g. `["3GOU", "1HHO"]`) to fetch full metadata directly.
- 🔬 Method filter. Restrict by experimental method (X-ray, cryo-EM, NMR, neutron, fiber, powder, scattering, electron crystallography).
Each record returns the PDB ID, RCSB explorer URL, structure title and descriptor, classification keywords, experimental method, combined resolution, unit-cell dimensions and space group (for X-ray only), refinement resolution, deposit and release dates, polymer entity count, atom and monomer counts, the audit-author list, and the full primary citation block.
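The Actor's exact request is not published, so the following is only a sketch of how a full-text search with an optional method filter could be expressed in the public RCSB Search API query format. The helper name `build_search_payload` is hypothetical, and filtering on the `exptl.method` attribute is an assumption about how the method filter is applied.

```python
import json

# Search endpoint named in the FAQ of this page.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_search_payload(query, method="", rows=10):
    """Build a Search API request body (hypothetical helper, not the Actor's code)."""
    text_node = {
        "type": "terminal",
        "service": "full_text",
        "parameters": {"value": query},
    }
    if method:
        # Assumption: the method filter maps to an exact match on exptl.method.
        method_node = {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": "exptl.method",
                "operator": "exact_match",
                "value": method,
            },
        }
        query_node = {
            "type": "group",
            "logical_operator": "and",
            "nodes": [text_node, method_node],
        }
    else:
        query_node = text_node
    return {
        "query": query_node,
        "return_type": "entry",
        "request_options": {"paginate": {"start": 0, "rows": rows}},
    }


payload = build_search_payload("spike", "ELECTRON MICROSCOPY", rows=100)
print(json.dumps(payload, indent=2))
```

Sending this body as a POST to `SEARCH_URL` should return matching entry IDs, which can then be passed to the Data API for full metadata.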
💡 Why it matters: PDB structures are the bedrock of structural biology, drug discovery, and the AlphaFold era. The RCSB API surfaces fields across multiple endpoints; this Actor joins them into a single, denormalized row per entry, complete with citation metadata.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded PDB dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `maxItems` | integer | `10` | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| `searchQuery` | string | `"hemoglobin"` | Full-text search query. Empty if you pass explicit IDs. |
| `pdbIds` | string[] | `[]` | Explicit 4-character PDB IDs. Overrides the search query when set. |
| `experimentalMethod` | string | `""` | One of 9 experimental methods. Empty = any. |
Example: 100 cryo-EM structures matching "spike".
{"maxItems": 100,"searchQuery": "spike","experimentalMethod": "ELECTRON MICROSCOPY"}
Example: explicit IDs for hemoglobin classics.
{"maxItems": 5,"pdbIds": ["3GOU", "1HHO", "2DN1", "1A3N", "4HHB"]}
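The precedence rule in the input table (explicit IDs override the search query) can be sketched as follows. This is an illustration only: `resolve_mode` is a hypothetical helper, and both the ID validation pattern and the truncation of the ID list to `maxItems` are assumptions rather than documented Actor behavior.

```python
import re

# Assumption: a PDB ID is exactly four alphanumeric characters.
PDB_ID = re.compile(r"^[0-9A-Za-z]{4}$")


def resolve_mode(run_input):
    """Decide which retrieval mode an input selects (hypothetical helper)."""
    ids = [i.strip().upper() for i in run_input.get("pdbIds", [])]
    bad = [i for i in ids if not PDB_ID.match(i)]
    if bad:
        raise ValueError(f"Not 4-character PDB IDs: {bad}")
    limit = run_input.get("maxItems", 10)
    if ids:
        # Explicit IDs win over any search query.
        return {"mode": "ids", "ids": ids[:limit]}
    return {
        "mode": "search",
        "query": run_input.get("searchQuery", "hemoglobin"),
        "method": run_input.get("experimentalMethod", ""),
    }


print(resolve_mode({"maxItems": 5, "pdbIds": ["3GOU", "1HHO"]}))
print(resolve_mode({"searchQuery": "spike", "experimentalMethod": "ELECTRON MICROSCOPY"}))
```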
⚠️ Good to Know: the unit-cell block, crystal count, and refinement resolution apply only to X-ray entries; the Actor omits these fields cleanly for cryo-EM and NMR structures rather than emitting always-null columns. Resolution comes from `rcsb_entry_info.resolution_combined` and is always an array (some methods return more than one value). The primary citation is taken from `rcsb_primary_citation` and falls back to the first `citation` entry.
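As a rough illustration of the two rules above, the sketch below reads the resolution array and applies the citation fallback to a Data API entry. The function names and the sample `entry` dictionary are made up for the example; only the key names come from this page.

```python
def pick_primary_citation(entry):
    """Prefer rcsb_primary_citation, else fall back to the first citation item."""
    primary = entry.get("rcsb_primary_citation")
    if primary:
        return primary
    citations = entry.get("citation") or []
    return citations[0] if citations else None


def resolution_list(entry):
    """Return the resolution array, or None when the entry has no resolution."""
    info = entry.get("rcsb_entry_info") or {}
    return info.get("resolution_combined")


# Made-up entry with no rcsb_primary_citation, so the fallback applies.
entry = {
    "rcsb_entry_info": {"resolution_combined": [3.44]},
    "citation": [{"title": "Fallback citation"}],
}
print(pick_primary_citation(entry))  # {'title': 'Fallback citation'}
print(resolution_list(entry))        # [3.44]
```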
📊 Output
Each PDB record contains up to 22 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🆔 `rcsb_id` | string | "10AD" |
| 🔗 `url` | string | "https://www.rcsb.org/structure/10AD" |
| 🏷️ `title` | string \| null | "Cryo-EM structure of the human BK channel bound to the agonist NS1619" |
| 🧪 `descriptor` | string | structure descriptor |
| 🔖 `keywords` | string \| null | "MEMBRANE PROTEIN" |
| 📝 `keyword_text` | string \| null | "BK, Slo1, MEMBRANE PROTEIN" |
| 🔬 `experimental_method` | string \| null | "EM" |
| 📏 `resolution_combined` | number[] \| null | [3.44] |
| 🧊 `crystals_number` | number | (X-ray only) 1 |
| 📐 `cell` | object | (X-ray only) { length_a, length_b, length_c, angle_alpha, angle_beta, angle_gamma, Z_PDB } |
| 🔷 `symmetry` | object | (X-ray only) { space_group_name_H_M, Int_Tables_number } |
| 🎯 `ls_d_res_high` | number | (X-ray only) refinement resolution |
| 📅 `deposit_date` | string \| null | "2026-01-08T00:00:00.000+00:00" |
| 📤 `release_date` | string \| null | "2026-02-04T00:00:00.000+00:00" |
| 🔁 `revision_date` | string \| null | "2026-02-11T00:00:00.000+00:00" |
| 🧬 `polymer_entity_count` | number \| null | 1 |
| 🍯 `branched_entity_count` | number \| null | 0 |
| 🧩 `polymer_composition` | string \| null | "homomeric protein" |
| ⚛️ `deposited_atom_count` | number \| null | 28028 |
| 🔗 `deposited_polymer_monomer_count` | number \| null | 4452 |
| 👥 `audit_authors` | string[] | ["Gonzalez-Sanabria, N.", "Contreras, G.F."] |
| 📰 `primary_citation` | object \| null | { title, journal, year, doi, pubmed, authors } |
| 🕒 `scrapedAt` | ISO 8601 | "2026-05-13T22:26:22.583Z" |
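Because `resolution_combined` is an array that may also be null, downstream code usually has to reduce it before filtering. The sketch below shows one way to keep only records at or better than a resolution cutoff. The helper names are hypothetical, and the second sample record is a made-up placeholder rather than a real entry.

```python
def best_resolution(record):
    """Smallest (best) value in resolution_combined, or None if absent."""
    values = record.get("resolution_combined") or []
    return min(values) if values else None


def filter_by_resolution(records, max_angstrom):
    """Keep records whose best resolution is at or below max_angstrom."""
    kept = []
    for record in records:
        best = best_resolution(record)
        if best is not None and best <= max_angstrom:
            kept.append(record)
    return kept


records = [
    # Values taken from the schema examples above.
    {"rcsb_id": "10AD", "experimental_method": "EM", "resolution_combined": [3.44]},
    # Placeholder row standing in for an entry without a resolution.
    {"rcsb_id": "XXXX", "experimental_method": "NMR", "resolution_combined": None},
]
print([r["rcsb_id"] for r in filter_by_resolution(records, 3.5)])  # ['10AD']
```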
✨ Why choose this Actor
| | Capability |
|---|---|
| 🧬 | Global coverage. 220,000+ macromolecular structures across all PDB experimental methods. |
| 🎯 | Two retrieval modes. Run a full-text search or pass explicit PDB IDs in a single input. |
| 🔬 | Method-aware fields. Crystallographic fields appear only for X-ray entries; cryo-EM and NMR rows stay clean. |
| 📰 | Full citation block. Title, journal, year, DOI, PubMed ID, and author list per structure. |
| ⚡ | Fast. Parallel detail fetches (concurrency 8) bring 50 entries from 42 s to under 10 s. |
| 🔁 | Always fresh. Every run hits the RCSB API live, so newly released entries appear within hours of public release. |
| 🚫 | No authentication. Works on the public RCSB search and data APIs. No login or API key. |
📊 The Protein Data Bank powers every modern structure-based drug discovery pipeline and is the training ground for the AlphaFold era.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ RCSB PDB Scraper (this Actor) | $5 free credit, then pay-per-use | 220,000+ entries | Live per run | text, IDs, method | ⚡ 2 min |
| RCSB REST + custom scripts | Free | Full PDB | Manual | Many, hand-rolled | 🐢 Days |
| PDBe SOLR API | Free | Mirror of PDB | Live | Many | ⏳ Hours |
| Crystallographic supplements from journals | Paid | Per-paper | Per-issue | None | 🕒 Variable |
Pick this Actor when you want broad structural-biology coverage, ready-joined records, and no pipeline maintenance.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the RCSB Protein Data Bank Scraper page on the Apify Store.
- 🎯 Set input. Enter a search query or paste a list of PDB IDs, optionally filter by method.
- 🚀 Run it. Click Start and let the Actor collect your data.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
🔌 Automating RCSB PDB Scraper
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. Weekly refreshes catch every new PDB release.
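A minimal Python sketch of a programmatic run, assuming the `apify-client` package is installed (`pip install apify-client`). The helper names, the `APIFY_TOKEN` environment variable, and the `<username>/<actor-name>` placeholder are illustrative; substitute the Actor ID shown on the Store page.

```python
def build_run_input(query="", pdb_ids=None, method="", max_items=10):
    """Assemble an input object using the field names from the Input table."""
    run_input = {"maxItems": max_items}
    if pdb_ids:
        run_input["pdbIds"] = list(pdb_ids)
    else:
        run_input["searchQuery"] = query
    if method:
        run_input["experimentalMethod"] = method
    return run_input


def run_actor(token, actor_id, run_input):
    """Start the Actor, wait for it to finish, and return the dataset items."""
    # Imported here so the input helper works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example usage (needs a real token and the Actor's ID):
#   import os
#   items = run_actor(
#       os.environ["APIFY_TOKEN"],
#       "<username>/<actor-name>",
#       build_run_input(query="spike", method="ELECTRON MICROSCOPY", max_items=100),
#   )
print(build_run_input(pdb_ids=["3GOU", "1HHO"], max_items=5))
```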
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🤖 Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge actor in the AI of your choice:
- 💬 ChatGPT
- 🧠 Claude
- 🔍 Perplexity
- 🅒 Copilot
❓ Frequently Asked Questions
🧩 How does it work?
Enter a search query or a list of PDB IDs, click Start, and the Actor hits the RCSB Search API to resolve IDs, then fetches full metadata per entry from the RCSB Data API at concurrency 8. Records are emitted as clean, joined JSON. No browser automation, no captchas, no setup.
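The fetch stage described above can be approximated with a bounded worker pool. This is a sketch under assumptions, not the Actor's implementation: it uses a Python thread pool as a stand-in for whatever concurrency mechanism the Actor uses, and the function names are hypothetical. The entry URL is the Data API path named elsewhere in this FAQ.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

DATA_URL = "https://data.rcsb.org/rest/v1/core/entry/{}"


def fetch_entry(pdb_id):
    """Fetch one entry's metadata from the RCSB Data API as a dict."""
    with urlopen(DATA_URL.format(pdb_id), timeout=30) as resp:
        return json.load(resp)


def fetch_all(pdb_ids, fetch=fetch_entry, concurrency=8):
    """Fetch entries with at most `concurrency` requests in flight.

    Results come back in the same order as pdb_ids.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch, pdb_ids))


# Example usage (performs live HTTP requests):
#   entries = fetch_all(["3GOU", "1HHO", "4HHB"])
#   print([e["rcsb_id"] for e in entries])
```

Passing `fetch` as a parameter keeps the pool logic testable without network access.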
🧬 Where does the data come from?
Directly from the RCSB Search API (search.rcsb.org/rcsbsearch/v2/query) and Data API (data.rcsb.org/rest/v1/core/entry). The Protein Data Bank is maintained jointly by RCSB PDB (USA), PDBe (Europe), and PDBj (Japan) under the wwPDB.
🔬 Why are unit-cell and refinement fields missing for some structures?
The Actor surfaces crystallographic fields (unit cell, symmetry, crystal count, and the ls_d_res_high refinement resolution) only for X-ray entries. For cryo-EM, NMR, neutron, and fiber-diffraction rows it omits those fields to keep the dataset clean.
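One way to picture this omission is a filter that removes the X-ray-only keys from every other row. The field names come from the schema on this page. The helper name is hypothetical, and the `"X-ray"` label used to recognize X-ray rows is an assumption, since the page only shows `"EM"` as an example value.

```python
XRAY_ONLY_FIELDS = ("crystals_number", "cell", "symmetry", "ls_d_res_high")


def drop_xray_fields(record, xray_label="X-ray"):
    """Return a copy of record without X-ray-only keys unless it is an X-ray row.

    xray_label is an assumed value for experimental_method.
    """
    if record.get("experimental_method") == xray_label:
        return dict(record)
    return {k: v for k, v in record.items() if k not in XRAY_ONLY_FIELDS}


em_row = {"rcsb_id": "10AD", "experimental_method": "EM", "cell": None, "symmetry": None}
print(drop_xray_fields(em_row))  # {'rcsb_id': '10AD', 'experimental_method': 'EM'}
```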
📏 What does resolution_combined actually contain?
A numeric array from rcsb_entry_info.resolution_combined. Most entries return a single value; multi-experiment structures may return multiple. Units are angstroms.
📰 Which citation field is primary_citation?
It is taken from rcsb_primary_citation if present, otherwise the first item in the citation array. It contains the publication title, journal, year, DOI, PubMed ID, and authors.
🔁 How often is the dataset refreshed?
RCSB releases new and updated entries weekly on Wednesdays. Every run of this Actor pulls live, so your dataset reflects the current state of the PDB at run time.
🆔 Can I fetch one specific PDB ID?
Yes. Pass it in the pdbIds array, leave searchQuery empty, and you will get back a single record with the full metadata block.
⏰ Can I schedule regular runs?
Yes. Use Apify Schedules to run this Actor on any cron interval (daily, weekly) and keep a downstream structural-biology database in sync.
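For a scheduled sync, a downstream job often only needs the entries released since the previous run. A minimal sketch, assuming `release_date` keeps the ISO 8601 format shown in the schema; the helper name and the cutoff dates are illustrative.

```python
from datetime import datetime, timezone


def released_since(records, cutoff):
    """Keep records whose release_date is on or after a timezone-aware cutoff."""
    fresh = []
    for record in records:
        raw = record.get("release_date")
        if raw and datetime.fromisoformat(raw) >= cutoff:
            fresh.append(record)
    return fresh


# release_date value taken from the schema example.
records = [{"rcsb_id": "10AD", "release_date": "2026-02-04T00:00:00.000+00:00"}]
cutoff = datetime(2026, 2, 1, tzinfo=timezone.utc)
print([r["rcsb_id"] for r in released_since(records, cutoff)])  # ['10AD']
```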
⚖️ Is this data legal to use?
The Protein Data Bank is released under a CC0 dedication. The raw structure metadata is publicly accessible. Review wwPDB licensing for your specific use case, especially for redistribution.
💳 Do I need a paid Apify plan to use this Actor?
No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and unlocks scheduling, higher concurrency, and larger datasets.
🧪 What if I need atom coordinates?
This Actor returns metadata only, not the mmCIF or PDB coordinate files. For coordinates, fetch directly from the RCSB Files API, or reach out via the contact form below to request a companion coordinate fetcher.
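If you only need coordinates for a handful of IDs, a small helper can build the download link yourself. This sketch assumes the RCSB file download service at `files.rcsb.org/download/` is what the answer above refers to; the function name is hypothetical.

```python
def coordinate_url(pdb_id, fmt="cif"):
    """Build a download URL for an entry's mmCIF ("cif") or legacy PDB ("pdb") file."""
    if fmt not in ("cif", "pdb"):
        raise ValueError("fmt must be 'cif' or 'pdb'")
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.{fmt}"


print(coordinate_url("4hhb"))  # https://files.rcsb.org/download/4HHB.cif

# Example download (performs a live HTTP request):
#   from urllib.request import urlretrieve
#   urlretrieve(coordinate_url("4HHB"), "4HHB.cif")
```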
🆘 What if I need help?
Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.
🔌 Integrate with any app
RCSB PDB Scraper connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe PDB data into your warehouse
- GitHub - Trigger runs from commits and releases
- Google Drive - Export datasets straight to Sheets
You can also use webhooks to trigger downstream actions when a run finishes. Push fresh PDB metadata into your research backend, or alert your team in Slack when a watched ID is released.
🔗 Recommended Actors
- 🤗 Hugging Face Model Scraper - Model metadata, downloads, and benchmarks
- 🏥 FINRA BrokerCheck Scraper - U.S. broker and firm regulatory disclosures
- 🏨 Greatschools Scraper - U.S. school ratings and demographics
- 📈 Smart Apify Actor Scraper - Apify Store actor metadata and quality signals
💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by RCSB PDB, the wwPDB, or any of its partner sites. All trademarks mentioned are the property of their respective owners. Only publicly available open structural-biology data is collected.