PubChem Compound Scraper
Pricing
from $20.00 / 1,000 result items
Export chemical compound data from PubChem, the world's largest open chemistry database with 119M+ compounds. Look up by CID, name, SMILES, or InChIKey. Pull molecular formulas, weights, structures, synonyms, IUPAC names, and properties.
Developer
ParseForge
Maintained by Community

🧪 PubChem Compound Scraper
🚀 Export chemistry data from PubChem in seconds. Look up 119M+ compounds by CID, name, SMILES, or InChIKey. Pull molecular formulas, weights, structures, IUPAC names, synonyms, and up to 23 computed properties.
🕒 Last updated: 2026-05-22 · 📊 19 fields per record · 🧪 119M+ compounds · 🔬 NIH official source · 🔍 4 lookup modes
The PubChem Compound Scraper taps PubChem, the world's largest open chemistry database, maintained by the NIH National Library of Medicine. The Actor returns 19 structured fields per record, including PubChem CID, IUPAC name, molecular formula and weight, canonical and isomeric SMILES, InChI, InChIKey, computed physicochemical properties, and the full synonym list.
The catalog covers 119 million unique chemical compounds, drawn from hundreds of contributing organizations, including the FDA, EPA, DrugBank, ChEMBL, NIST, and pharma research consortia. This Actor exposes four lookup modes (CID, name, SMILES, InChIKey) and lets you cherry-pick which of 23 PubChem-computed properties to return.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| Chemists, pharma R&D, cheminformaticians, materials scientists, drug-discovery teams, regulatory analysts, chemistry educators | Compound lookup and enrichment, SAR/QSAR feature engineering, ADMET screening inputs, regulatory dossiers, synonym normalization, structure-to-property mapping |
📋 What the PubChem Compound Scraper does
Four lookup workflows in a single Actor:
- 🔢 CID lookup. Numeric PubChem identifiers like `2244` (aspirin), `3672` (ibuprofen).
- 📛 Name lookup. Common names like `aspirin`, `caffeine`, `paclitaxel`.
- 🧬 SMILES lookup. Pass a structure string and resolve to the canonical PubChem record.
- 🔑 InChIKey lookup. Hash-based exact-match lookup, ideal for deduplication.
Pick from 23 PubChem-computed properties (molecular formula, weight, exact mass, SMILES variants, InChI, IUPAC name, XLogP, TPSA, complexity, charge, H-bond donor/acceptor counts, rotatable bonds, heavy atoms, stereocenters, 3D volume, feature count, and more). Toggle synonym fetching to also pull every common name registered for each compound.
💡 Why it matters: PubChem is the de facto reference for compound metadata in cheminformatics. Building your own client means juggling the PUG REST API, throttling, retries, and per-property batching. This Actor delivers a tidy record per compound, ready for downstream modelling, dashboards, or reports.
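To make the "hand-rolled client" alternative concrete, here is a minimal sketch of how a property request to PUG REST can be assembled for the lookup modes listed above. It only builds the URL and sends nothing. The helper name `property_url` is hypothetical, and this is not the Actor's internal code. It assumes the public `compound/<namespace>/<identifiers>/property/<list>/JSON` URL pattern, and it batches only CIDs, since the other namespaces are normally resolved one identifier per request.

```python
from urllib.parse import quote

PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
NAMESPACES = {"cid", "name", "smiles", "inchikey"}


def property_url(mode: str, identifiers: list[str], properties: list[str]) -> str:
    """Build a PUG REST URL that returns the given properties as JSON."""
    if mode not in NAMESPACES:
        raise ValueError(f"unsupported lookup mode: {mode!r}")
    if not identifiers or not properties:
        raise ValueError("need at least one identifier and one property")
    if mode != "cid" and len(identifiers) != 1:
        # Only CIDs are batched here; other namespaces go one per request.
        raise ValueError("only cid mode accepts several identifiers per URL")
    # Percent-encode each identifier, including '/', which occurs in some SMILES.
    ids = ",".join(quote(i, safe="") for i in identifiers)
    props = ",".join(properties)
    return f"{PUG_BASE}/compound/{mode}/{ids}/property/{props}/JSON"


print(property_url("cid", ["2244", "3672"], ["MolecularFormula", "XLogP"]))
```

SMILES strings with special characters are often safer to send as a POST body or URL argument than in the path, which this sketch does not cover.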
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `maxItems` | integer | `10` | Records to return. The free plan caps at 10, paid plans at 1,000,000. |
| `mode` | enum | `"cid"` | One of `cid`, `name`, `smiles`, `inchikey`. |
| `identifiers` | string[] | 5 example CIDs | List of identifiers to resolve, in the chosen mode. |
| `properties` | string[] | 13 core properties | Subset of the 23 PubChem-computed properties. |
| `includeSynonyms` | boolean | `true` | Also fetch the list of common names and synonyms per compound. |
Example: lookup 5 common drugs by name with synonyms.
{"maxItems": 5, "mode": "name", "identifiers": ["aspirin", "ibuprofen", "caffeine", "paracetamol", "metformin"], "includeSynonyms": true}
Example: minimal property pull by CID for a screening library.
{"maxItems": 1000, "mode": "cid", "identifiers": ["2244", "3672", "1983", "5793", "2519"], "properties": ["MolecularFormula", "MolecularWeight", "CanonicalSMILES", "XLogP", "TPSA"], "includeSynonyms": false}
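If you generate inputs from code rather than by hand, a small builder can catch typos before a run starts. The sketch below uses only the field names from the input table. The function `build_input` is a hypothetical helper, not part of the Actor, and it assumes that leaving out `properties` makes the Actor fall back to its default set.

```python
import json

MODES = ("cid", "name", "smiles", "inchikey")


def build_input(mode, identifiers, properties=None, include_synonyms=True, max_items=10):
    """Assemble an Actor input dict using the field names from the input table."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if not identifiers:
        raise ValueError("identifiers must not be empty")
    run_input = {
        "maxItems": max_items,
        "mode": mode,
        # The examples pass identifiers as strings, CIDs included.
        "identifiers": [str(i) for i in identifiers],
        "includeSynonyms": include_synonyms,
    }
    if properties is not None:
        # Assumption: omitting the key leaves the Actor's default property set.
        run_input["properties"] = list(properties)
    return run_input


print(json.dumps(build_input("cid", [2244, 3672], ["MolecularFormula", "XLogP"], False), indent=2))
```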
⚠️ Good to Know: PubChem PUG REST applies rate limits to free public callers. The Actor batches and paces requests automatically so you avoid 503s.
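For readers curious what "batches and paces" can look like, the following is a minimal sketch of a fixed-rate throttle combined with simple batching. It illustrates the general technique only and is not the Actor's implementation. The `Pacer` class and `chunks` helper are hypothetical names. The default of 5 requests per second follows PubChem's published usage guidance, and the batch size of 2 is arbitrary.

```python
import time


class Pacer:
    """Enforce a minimum interval between calls (simple fixed-rate throttle)."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self):
        """Block until the next request slot and return the seconds slept."""
        now = self._clock()
        delay = self._next_allowed - now
        if delay > 0:
            self._sleep(delay)
            now += delay
        # Reserve the slot after this one.
        self._next_allowed = now + self.min_interval
        return max(delay, 0.0)


def chunks(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


pacer = Pacer(max_per_second=5)
for batch in chunks(["2244", "3672", "1983", "5793", "2519"], 2):
    pacer.wait()
    print("would request CIDs:", ",".join(batch))
```

The clock and sleep functions are injectable so the pacing logic can be checked without real delays.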
📊 Output
Each record contains 19 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🆔 `cid` | integer | `2244` |
| 🏷️ `title` | string or null | `"Aspirin"` |
| 🧬 `iupacName` | string or null | `"2-acetyloxybenzoic acid"` |
| ⚗️ `molecularFormula` | string or null | `"C9H8O4"` |
| ⚖️ `molecularWeight` | string or null | `"180.16"` |
| 📐 `canonicalSMILES` | string or null | `"CC(=O)OC1=CC=CC=C1C(=O)O"` |
| 🌀 `isomericSMILES` | string or null | `"CC(=O)OC1=CC=CC=C1C(=O)O"` |
| 🔗 `inchi` | string or null | `"InChI=1S/C9H8O4/..."` |
| 🔑 `inchiKey` | string or null | `"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"` |
| 💧 `xLogP` | number or null | `1.2` |
| 🎯 `exactMass` | string or null | `"180.04225873"` |
| 🧮 `tpsa` | number or null | `63.6` |
| 🔋 `hBondDonorCount` | integer or null | `1` |
| 🔌 `hBondAcceptorCount` | integer or null | `4` |
| 🔄 `rotatableBondCount` | integer or null | `3` |
| 📝 `synonyms` | string[] or null | `["Aspirin", "Acetylsalicylic acid", "ASA", ...]` |
| 🧱 `properties` | object or null | `{ "Complexity": 212, "HeavyAtomCount": 13, ... }` |
| 🔗 `url` | string | `"https://pubchem.ncbi.nlm.nih.gov/compound/2244"` |
| 🕓 `scrapedAt` | ISO 8601 string | `"2026-05-22T00:00:00.000Z"` |
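
Two details of the schema matter downstream: `molecularWeight` and `exactMass` arrive as strings, and `inchiKey` is the natural key for deduplication. The sketch below shows one way to handle both after download. The helper names `normalize` and `dedupe_by_inchikey` are hypothetical, and the sample record uses only values from the schema table.

```python
def normalize(record: dict) -> dict:
    """Return a copy with string-typed masses converted to floats (None stays None)."""
    out = dict(record)
    for key in ("molecularWeight", "exactMass"):
        value = out.get(key)
        out[key] = float(value) if value is not None else None
    return out


def dedupe_by_inchikey(records: list[dict]) -> list[dict]:
    """Keep the first record per InChIKey; records without a key are all kept."""
    seen = set()
    unique = []
    for rec in records:
        key = rec.get("inchiKey")
        if key is not None and key in seen:
            continue
        if key is not None:
            seen.add(key)
        unique.append(rec)
    return unique


aspirin = {"cid": 2244, "molecularWeight": "180.16", "exactMass": "180.04225873",
           "inchiKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}
rows = dedupe_by_inchikey([aspirin, dict(aspirin)])
print(len(rows), normalize(rows[0])["molecularWeight"])
```

Converting to `float` is fine for modelling features. Keep the original strings if you need the exact decimal text PubChem reports.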
📦 Sample records
✨ Why choose this Actor
| | Capability |
|---|---|
| 🌐 | Massive coverage. 119M+ compounds from the NIH National Library of Medicine. |
| 🔍 | Four lookup modes. CID, name, SMILES, and InChIKey resolve to the same canonical record. |
| 🧱 | 23 computed properties. Pick only the ones your model needs and save downstream cleanup. |
| 📝 | Synonym lists. Resolve trade names, salts, generics, and historical spellings in one shot. |
| ⚡ | Fast. 100 compounds in under a minute, paced under the public rate limit. |
| 🔁 | Always fresh. Every run hits the live PubChem feed. |
| 🚫 | No API key. Public PubChem REST needs no registration. |
📊 PubChem is the most widely cited chemical reference in modern cheminformatics, drug discovery, and materials research.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ PubChem Compound Scraper (this Actor) | $5 free credit, then pay-per-use | 119M+ compounds | Live per run | CID, name, SMILES, InChIKey | ⚡ 2 min |
| Manual web download from PubChem | Free | Per-compound | Manual | None | 🐢 Hours |
| Hand-coded PUG REST client | Free | Full | Per-build | Custom | ⏳ Days |
| Commercial cheminformatics suites | $$$$/year | Curated | Vendor schedule | Vendor-defined | 🕒 Sales cycle |
Pick this Actor when you want broad coverage, multi-mode lookup, and zero infrastructure to maintain.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the PubChem Compound Scraper page on the Apify Store.
- 🎯 Set input. Pick a lookup mode, paste identifiers, choose which properties to fetch.
- 🚀 Run it. Click Start and let the Actor collect your data.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating PubChem Compound Scraper
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
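As a rough illustration of the Python route, the sketch below starts a run and collects the dataset items with `apify-client`. Several things here are assumptions: the Actor ID string is a placeholder (copy the real one from the Actor's Store page), the token is read from an `APIFY_TOKEN` environment variable, `run_scraper` is a hypothetical helper, and the run object is treated as a dictionary, as the client's documented examples do. Check this against the client version you install.

```python
import os


def run_scraper(client, actor_id: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return all dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__" and os.environ.get("APIFY_TOKEN"):
    from apify_client import ApifyClient  # pip install apify-client

    # Hypothetical Actor ID: replace with the ID shown on the Store page.
    ACTOR_ID = "parseforge/pubchem-compound-scraper"
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    items = run_scraper(client, ACTOR_ID, {
        "maxItems": 5,
        "mode": "name",
        "identifiers": ["aspirin", "caffeine"],
        "includeSynonyms": False,
    })
    for item in items:
        print(item["cid"], item.get("molecularFormula"))
```

Passing the client in as a parameter keeps the helper easy to test with a stub and leaves token handling to the caller.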
The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases in sync automatically.
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🤖 Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge actor in the AI of your choice:
- 💬 ChatGPT
- 🧠 Claude
- 🔍 Perplexity
- 🅒 Copilot
❓ Frequently Asked Questions
🧩 How does it work?
Pick a lookup mode, paste your identifiers, choose which PubChem-computed properties to return, and click Start. The Actor calls the public PubChem feed, paces requests to stay within rate limits, and emits one tidy record per compound.
📏 How accurate is the data?
All numeric properties are PubChem-computed values served live from the NIH source. Synonyms are aggregated from PubChem's depositor network and cover trade names, salts, generics, and historical spellings.
🔁 How often is the dataset refreshed?
PubChem updates continuously as depositors submit new compounds and properties. Every Actor run pulls the current state of each compound at run time.
🧬 What's the difference between canonical and isomeric SMILES?
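Canonical SMILES is a normalized representation of atom connectivity, without stereochemistry or isotope labels. Isomeric SMILES adds both. For compounds with no stereocenters or isotope labels, such as aspirin, the two strings are identical. Use isomeric SMILES whenever stereochemistry matters in modelling.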
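A quick way to see the difference in your own data is to check whether a string carries any stereo or isotope markers. The sketch below is a plain text heuristic, not a chemistry-aware parser, and `carries_isomeric_info` is a hypothetical helper. It looks for the SMILES stereo symbols (`@`, `/`, `\`) and bracketed isotope labels such as `[13C]`.

```python
import re

ISOTOPE = re.compile(r"\[\d+[A-Za-z]")  # e.g. [13C], [2H]


def carries_isomeric_info(smiles: str) -> bool:
    """Heuristic: True if the string has stereo or isotope markers."""
    has_stereo = any(ch in smiles for ch in "@/\\")
    return has_stereo or bool(ISOTOPE.search(smiles))


# Alanine written without and with its stereocenter (L-alanine).
print(carries_isomeric_info("CC(C(=O)O)N"))
print(carries_isomeric_info("C[C@@H](C(=O)O)N"))
# Aspirin has no stereocenters, so its two SMILES fields are identical.
print(carries_isomeric_info("CC(=O)OC1=CC=CC=C1C(=O)O"))
```

For anything beyond a sanity check, a cheminformatics toolkit is the safer choice.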
⏰ Can I schedule regular runs?
Yes. Use Apify Schedules to run this Actor on any cron interval and keep a downstream database in sync.
⚖️ Is this data legal to use?
PubChem data is in the public domain in the United States. Many international jurisdictions treat it similarly. Review the downstream terms of your specific use case before redistribution.
💼 Can I use this data commercially?
Yes. PubChem's data policy permits commercial use. You are responsible for complying with any downstream regulatory requirements and the terms of contributing depositors.
💳 Do I need a paid Apify plan to use this Actor?
No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and gives you access to scheduling, higher concurrency, and larger datasets.
🔁 What happens if a run fails or gets interrupted?
Apify automatically retries transient errors. If a run still fails, you can inspect the log in the Runs tab, fix the input, and re-run. Partial datasets from failed runs are preserved so you never lose progress.
🆘 What if I need help?
Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.
🔌 Integrate with any app
PubChem Compound Scraper connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe compound data into your warehouse
- GitHub - Trigger runs from commits and releases
- Google Drive - Export datasets straight to Sheets
You can also use webhooks to trigger downstream actions when a run finishes. Push fresh compound data into your product backend, or alert your team in Slack.
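As a sketch of the webhook path, the function below takes the JSON body of a webhook call and turns it into a dataset download URL. It assumes Apify's default payload shape, where `eventType` names the event and `resource` holds the run object with its `defaultDatasetId`. The helper name `export_url_from_webhook` and the sample payload are hypothetical. Verify the field names against the payload your webhook actually sends.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def export_url_from_webhook(payload: dict, fmt: str = "csv"):
    """Return a dataset download URL for a succeeded run, else None."""
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    dataset_id = payload.get("resource", {}).get("defaultDatasetId")
    if not dataset_id:
        return None
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode({'format': fmt})}"


# Hypothetical payload, trimmed to the two fields this sketch reads.
sample = {"eventType": "ACTOR.RUN.SUCCEEDED", "resource": {"defaultDatasetId": "abc123"}}
print(export_url_from_webhook(sample))
```

Datasets that are not publicly shared also need an API token on the request, which is left out here.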
🔗 Recommended Actors
- 🧬 KEGG Pathways Scraper - Biological pathways, compounds, genes, drugs
- 🏥 ClinicalTrials.gov Scraper - Global clinical research registry
- 📚 PubMed Scraper - Biomedical literature search
- 🔬 ArXiv Scraper - Preprint research papers
- 📊 GBIF Biodiversity Scraper - Global species occurrence data
💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the NIH National Library of Medicine, PubChem, or any government body. All trademarks mentioned are the property of their respective owners. Only publicly available open data is collected.