PubChem Compound Scraper

Pricing

from $20.00 / 1,000 result items

Export chemical compound data from PubChem, the world's largest open chemistry database with 119M+ compounds. Look up by CID, name, SMILES, or InChIKey. Pull molecular formulas, weights, structures, synonyms, IUPAC names, and properties.

Developer

ParseForge

Maintained by Community

🧪 PubChem Compound Scraper

🚀 Export chemistry data from PubChem in seconds. Look up 119M+ compounds by CID, name, SMILES, or InChIKey. Pull molecular formulas, weights, structures, IUPAC names, synonyms, and any of 23 computed properties.

🕒 Last updated: 2026-05-22 · 📊 19 fields per record · 🧪 119M+ compounds · 🔬 NIH official source · 🔍 4 lookup modes

The PubChem Compound Scraper taps PubChem, the world's largest open chemistry database, maintained by the NIH National Library of Medicine. The Actor returns 19 structured fields per record, including PubChem CID, IUPAC name, molecular formula and weight, canonical and isomeric SMILES, InChI, InChIKey, computed physicochemical properties, and the full synonym list.

The catalog covers 119 million unique chemical compounds, drawn from hundreds of contributing organizations, including the FDA, EPA, DrugBank, ChEMBL, NIST, and pharma research consortia. This Actor exposes four lookup modes (CID, name, SMILES, InChIKey) and lets you cherry-pick which of 23 PubChem-computed properties to return.

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| Chemists, pharma R&D, cheminformaticians, materials scientists, drug-discovery teams, regulatory analysts, chemistry educators | Compound lookup and enrichment, SAR/QSAR feature engineering, ADMET screening inputs, regulatory dossiers, synonym normalization, structure-to-property mapping |

📋 What the PubChem Compound Scraper does

Four lookup workflows in a single Actor:

  • 🔢 CID lookup. Numeric PubChem identifiers like 2244 (aspirin), 3672 (ibuprofen).
  • 📛 Name lookup. Common names like aspirin, caffeine, paclitaxel.
  • 🧬 SMILES lookup. Pass a structure string and resolve to the canonical PubChem record.
  • 🔑 InChIKey lookup. Hash-based exact-match lookup, ideal for deduplication.

Pick from 23 PubChem-computed properties (molecular formula, weight, exact mass, SMILES variants, InChI, IUPAC name, XLogP, TPSA, complexity, charge, H-bond donor/acceptor counts, rotatable bonds, heavy atoms, stereocenters, 3D volume, feature count, and more). Toggle synonym fetching to also pull every common name registered for each compound.

💡 Why it matters: PubChem is the de facto reference for compound metadata in cheminformatics. Building your own client means juggling the PUG REST API, throttling, retries, and per-property batching. This Actor delivers a tidy record per compound, ready for downstream modelling, dashboards, or reports.
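
To make the PUG REST juggling concrete, here is a minimal sketch of how a property request URL can be assembled for each of the four lookup modes. It follows the public PUG REST path layout (`compound/<namespace>/<identifier>/property/<list>/JSON`); the function and constant names are hypothetical, and nothing here is taken from the Actor's own source. SMILES strings that contain characters such as `/` may need to be sent as a POST body or query argument instead of a path segment, which this sketch does not cover.

```python
from typing import Sequence
from urllib.parse import quote

# Base path of the public PUG REST service.
PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# The four lookup modes described above, used as PUG REST namespaces.
LOOKUP_MODES = {"cid", "name", "smiles", "inchikey"}


def property_url(mode: str, identifier: str, properties: Sequence[str]) -> str:
    """Build a PUG REST property URL for one identifier (hypothetical helper)."""
    if mode not in LOOKUP_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    if not properties:
        raise ValueError("at least one property name is required")
    # Percent-encode the identifier so spaces in names survive the URL path.
    encoded = quote(identifier, safe="")
    return f"{PUG_BASE}/compound/{mode}/{encoded}/property/{','.join(properties)}/JSON"


if __name__ == "__main__":
    print(property_url("cid", "2244", ["MolecularFormula", "MolecularWeight"]))
    print(property_url("name", "acetylsalicylic acid", ["XLogP", "TPSA"]))
```

The sketch only builds URLs; a real client would still need the throttling, retries, and batching mentioned above.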


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| mode | enum | "cid" | One of cid, name, smiles, inchikey. |
| identifiers | string[] | 5 example CIDs | List of identifiers to resolve, in the chosen mode. |
| properties | string[] | 13 core properties | Subset of 23 PubChem-computed properties. |
| includeSynonyms | boolean | true | Also fetch the list of common names and synonyms per compound. |

Example: lookup 5 common drugs by name with synonyms.

{
  "maxItems": 5,
  "mode": "name",
  "identifiers": ["aspirin", "ibuprofen", "caffeine", "paracetamol", "metformin"],
  "includeSynonyms": true
}

Example: minimal property pull by CID for a screening library.

{
  "maxItems": 1000,
  "mode": "cid",
  "identifiers": ["2244", "3672", "1983", "5793", "2519"],
  "properties": ["MolecularFormula", "MolecularWeight", "CanonicalSMILES", "XLogP", "TPSA"],
  "includeSynonyms": false
}
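
If you generate inputs like these from a script, it can help to check them before starting a run. The following is a rough sketch of such a pre-flight check based only on the input table above. `validate_input` is a hypothetical helper, not part of the Actor, and it is deliberately stricter than the Actor in two ways that are assumptions: it requires an explicit `identifiers` list and a `maxItems` of at least 1.

```python
from typing import Any, Dict, List

ALLOWED_MODES = ("cid", "name", "smiles", "inchikey")
MAX_ITEMS_CEILING = 1_000_000  # paid-plan cap from the input table


def validate_input(run_input: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []

    mode = run_input.get("mode", "cid")  # "cid" is the documented default
    if mode not in ALLOWED_MODES:
        problems.append(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")

    max_items = run_input.get("maxItems", 10)  # 10 is the documented default
    if isinstance(max_items, bool) or not isinstance(max_items, int):
        problems.append("maxItems must be an integer")
    elif not 1 <= max_items <= MAX_ITEMS_CEILING:
        problems.append(f"maxItems must be between 1 and {MAX_ITEMS_CEILING}")

    identifiers = run_input.get("identifiers", [])
    if not identifiers:
        problems.append("identifiers must not be empty")
    elif mode == "cid":
        # CIDs are numeric PubChem identifiers such as 2244.
        bad = [i for i in identifiers if not str(i).isdigit()]
        if bad:
            problems.append(f"non-numeric CIDs: {bad}")

    return problems


if __name__ == "__main__":
    print(validate_input({"mode": "cid", "identifiers": ["2244", "aspirin"]}))
```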

⚠️ Good to Know: PubChem PUG REST applies rate limits to free public callers. The Actor batches and paces requests automatically so you avoid 503s.
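
For readers curious what "batches and paces" can look like, here is a small illustrative sketch. It is not the Actor's implementation. The helper names are hypothetical, and the default of five requests per second is an assumption based on PubChem's published usage guidance, so check the current policy before relying on it.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split identifiers into batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def paced(
    batches: Iterable[List[str]],
    max_per_second: float = 5.0,  # assumed limit, see the note above
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List[str]]:
    """Yield batches no faster than `max_per_second` per second."""
    if max_per_second <= 0:
        raise ValueError("max_per_second must be positive")
    min_gap = 1.0 / max_per_second
    last: Optional[float] = None
    for batch in batches:
        now = clock()
        if last is not None and now - last < min_gap:
            # Wait out the remainder of the gap before releasing the batch.
            sleep(min_gap - (now - last))
        last = clock()
        yield batch


if __name__ == "__main__":
    cids = ["2244", "3672", "1983", "5793", "2519"]
    for batch in paced(chunked(cids, 2)):
        print("would request:", ",".join(batch))
```

The `sleep` and `clock` parameters are injectable so the pacing logic can be tested without real delays.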


📊 Output

Each record contains 19 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🆔 cid | integer | 2244 |
| 🏷️ title | string or null | "Aspirin" |
| 🧬 iupacName | string or null | "2-acetyloxybenzoic acid" |
| ⚗️ molecularFormula | string or null | "C9H8O4" |
| ⚖️ molecularWeight | string or null | "180.16" |
| 📐 canonicalSMILES | string or null | "CC(=O)OC1=CC=CC=C1C(=O)O" |
| 🌀 isomericSMILES | string or null | "CC(=O)OC1=CC=CC=C1C(=O)O" |
| 🔗 inchi | string or null | "InChI=1S/C9H8O4/..." |
| 🔑 inchiKey | string or null | "BSYNRYMUTXBXSQ-UHFFFAOYSA-N" |
| 💧 xLogP | number or null | 1.2 |
| 🎯 exactMass | string or null | "180.04225873" |
| 🧮 tpsa | number or null | 63.6 |
| 🔋 hBondDonorCount | integer or null | 1 |
| 🔌 hBondAcceptorCount | integer or null | 4 |
| 🔄 rotatableBondCount | integer or null | 3 |
| 📝 synonyms | string[] or null | ["Aspirin", "Acetylsalicylic acid", "ASA", ...] |
| 🧱 properties | object or null | { "Complexity": 212, "HeavyAtomCount": 13, ... } |
| 🔗 url | string | "https://pubchem.ncbi.nlm.nih.gov/compound/2244" |
| 🕓 scrapedAt | string (ISO 8601) | "2026-05-22T00:00:00.000Z" |
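
Note that `molecularWeight` and `exactMass` are typed as strings in the schema while `xLogP` and `tpsa` are numbers. If your downstream code needs numeric masses, a conversion step along these lines may help. This is a minimal sketch with hypothetical helper names, using only the example values from the table above.

```python
from typing import Any, Dict, Optional


def to_float(value: Any) -> Optional[float]:
    """Convert a string such as "180.16" to float; return None if not possible."""
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def with_numeric_masses(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with the two mass fields as floats."""
    out = dict(record)  # shallow copy so the original record is untouched
    for key in ("molecularWeight", "exactMass"):
        out[key] = to_float(record.get(key))
    return out


if __name__ == "__main__":
    aspirin = {"cid": 2244, "molecularWeight": "180.16", "exactMass": "180.04225873"}
    print(with_numeric_masses(aspirin))
```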


✨ Why choose this Actor

  • 🌐 Massive coverage. 119M+ compounds from the NIH National Library of Medicine.
  • 🔍 Four lookup modes. CID, name, SMILES, and InChIKey resolve to the same canonical record.
  • 🧱 23 computed properties. Pick only the ones your model needs and save downstream cleanup.
  • 📝 Synonym lists. Resolve trade names, salts, generics, and historical spellings in one shot.
  • Fast. 100 compounds in under a minute, paced under the public rate limit.
  • 🔁 Always fresh. Every run hits the live PubChem feed.
  • 🚫 No API key. Public PubChem REST needs no registration.

📊 PubChem is the most widely cited chemical reference in modern cheminformatics, drug discovery, and materials research.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ PubChem Compound Scraper (this Actor) | $5 free credit, then pay-per-use | 119M+ compounds | Live per run | CID, name, SMILES, InChIKey | ⚡ 2 min |
| Manual web download from PubChem | Free | Per-compound | Manual | None | 🐢 Hours |
| Hand-coded PUG REST client | Free | Full | Per-build | Custom | ⏳ Days |
| Commercial cheminformatics suites | $$$$/year | Curated | Vendor schedule | Vendor-defined | 🕒 Sales cycle |

Pick this Actor when you want broad coverage, multi-mode lookup, and zero infrastructure to maintain.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the PubChem Compound Scraper page on the Apify Store.
  3. 🎯 Set input. Pick a lookup mode, paste identifiers, choose which properties to fetch.
  4. 🚀 Run it. Click Start and let the Actor collect your data.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

💊 Pharma R&D

  • Hit triage and library enrichment
  • ADMET property pulls for early screening
  • Synonym normalization across legacy datasets
  • Regulatory dossier reference checks

🧪 Cheminformatics and ML

  • Build SAR/QSAR feature tables
  • Train generative-chemistry models with real properties
  • Standardize SMILES/InChI representations
  • Benchmark predicted vs PubChem-computed properties
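
As one example of turning records into model features, the sketch below counts Lipinski rule-of-five violations from four output fields. The thresholds are the commonly cited ones (molecular weight ≤ 500, logP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10). The function name is hypothetical, and XLogP is a computed estimate, so treat the result as a screening hint and not a measurement.

```python
from typing import Any, Dict, Optional

# (output field, upper limit) pairs for the rule of five.
RULE_OF_FIVE = (
    ("molecularWeight", 500.0),
    ("xLogP", 5.0),
    ("hBondDonorCount", 5),
    ("hBondAcceptorCount", 10),
)


def lipinski_violations(record: Dict[str, Any]) -> Optional[int]:
    """Count rule-of-five violations; return None if any input is missing."""
    violations = 0
    for field, limit in RULE_OF_FIVE:
        value = record.get(field)
        if value is None:
            return None  # cannot score a record with missing properties
        # molecularWeight arrives as a string, so convert before comparing.
        if float(value) > limit:
            violations += 1
    return violations


if __name__ == "__main__":
    aspirin = {
        "molecularWeight": "180.16",
        "xLogP": 1.2,
        "hBondDonorCount": 1,
        "hBondAcceptorCount": 4,
    }
    print(lipinski_violations(aspirin))
```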

🧱 Materials and chemicals

  • Specialty-chemical sourcing reference data
  • Polymer monomer property tables
  • Catalyst and ligand databases
  • Raw-material substitution screens

📋 Regulatory and EHS

  • Synonym matching for hazardous-substance lists
  • Inventory reconciliation across regulatory IDs
  • Safety data sheet (SDS) cross-referencing
  • Tracking ingredient identifiers across jurisdictions
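
Synonym matching of this kind usually comes down to a normalized lookup table. Here is a minimal sketch that builds one from the `synonyms` and `cid` output fields. The helper names are hypothetical, and the normalization (case folding plus whitespace collapsing) is an assumption you may want to tighten for your own lists. When two compounds share a synonym, the first record seen wins.

```python
from typing import Any, Dict, Iterable, Optional


def normalize_name(name: str) -> str:
    """Case-fold and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.casefold().split())


def build_synonym_index(records: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Map each normalized synonym to the CID of the first record using it."""
    index: Dict[str, int] = {}
    for record in records:
        # synonyms may be null when includeSynonyms is false.
        for name in record.get("synonyms") or []:
            index.setdefault(normalize_name(name), record["cid"])
    return index


def lookup_cid(index: Dict[str, int], name: str) -> Optional[int]:
    """Resolve a free-text name to a CID, or None if it is unknown."""
    return index.get(normalize_name(name))


if __name__ == "__main__":
    records = [{"cid": 2244, "synonyms": ["Aspirin", "Acetylsalicylic acid", "ASA"]}]
    index = build_synonym_index(records)
    print(lookup_cid(index, "  ACETYLSALICYLIC   acid "))
```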

🔌 Automating PubChem Compound Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly refreshes keep downstream databases in sync automatically.
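
A minimal sketch of a programmatic run with the `apify-client` Python package follows. The Actor ID below is a hypothetical placeholder (copy the real one from the Actor's page), and the token is assumed to be in an `APIFY_TOKEN` environment variable. The client is passed in as an argument so the fetch logic can be exercised without network access.

```python
import os
from typing import Any, Dict, List

# Hypothetical placeholder; replace with the ID shown on the Actor's page.
ACTOR_ID = "parseforge/pubchem-compound-scraper"


def fetch_compounds(client: Any, actor_id: str, run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Start a run, wait for it to finish, and return its dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main() -> None:
    # Imported here so the helper above has no hard dependency.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    items = fetch_compounds(
        client,
        ACTOR_ID,
        {"mode": "name", "identifiers": ["aspirin", "caffeine"], "maxItems": 2},
    )
    for item in items:
        print(item.get("cid"), item.get("title"))
```

Call `main()` from your own script or scheduler once the placeholder ID and token are in place.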


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Course datasets for medicinal-chemistry and cheminformatics classes
  • Reproducible papers with cited, versioned compound pulls
  • Open-science notebooks that ground analyses in PubChem
  • Thesis projects on structure-property relationships

🎨 Personal and creative

  • Hobbyist science blogs and explainers
  • Visualization projects on molecular property distributions
  • Educational apps that teach chemistry through real compounds
  • Side projects exploring natural-product chemistry

🤝 Non-profit and civic

  • Public-health communication around medicines and toxins
  • Environmental advocacy with chemical-property evidence
  • Citizen-science projects on consumer-product ingredients
  • Educational resources for under-served STEM programs

🧪 Experimentation

  • Train property-prediction ML models on real labels
  • Validate generative-chemistry tools against PubChem ground truth
  • Prototype agent pipelines that answer chemistry questions
  • Build LLM-grounded chemistry assistants with cited records


❓ Frequently Asked Questions

🧩 How does it work?

Pick a lookup mode, paste your identifiers, choose which PubChem-computed properties to return, and click Start. The Actor calls the public PubChem PUG REST API, paces requests to stay within rate limits, and emits one tidy record per compound.

📏 How accurate is the data?

All numeric properties are PubChem-computed values served live from the NIH source. Synonyms are aggregated from PubChem's depositor network and cover trade names, salts, generics, and historical spellings.

🔁 How often is the dataset refreshed?

PubChem updates continuously as depositors submit new compounds and properties. Every Actor run pulls the current state of each compound at run time.

🧬 What's the difference between canonical and isomeric SMILES?

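Canonical SMILES is a normalized 2D representation that omits stereochemistry and isotope labels. Isomeric SMILES preserves both. For compounds with no stereocenters or isotopic labels, such as aspirin, the two strings are identical. Use isomeric SMILES when your modelling depends on accurate 3D or stereo-specific structure.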
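
To see the difference in practice, the sketch below flags SMILES strings that carry stereo or isotope markers. It is a rough text heuristic, not a SMILES parser, and the function name is hypothetical. The L-alanine strings are standard SMILES written here for illustration, not values copied from an Actor run.

```python
import re

# An isotope label is a mass number right after an opening bracket, e.g. [13C].
_ISOTOPE = re.compile(r"\[\d+[A-Za-z]")


def has_stereo_or_isotope(smiles: str) -> bool:
    """Heuristically detect chirality (@), bond direction (/ \\), or isotopes."""
    return (
        "@" in smiles
        or "/" in smiles
        or "\\" in smiles
        or bool(_ISOTOPE.search(smiles))
    )


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"       # no stereocenters
    alanine_flat = "CC(C(=O)O)N"               # stereo stripped
    alanine_l = "C[C@@H](C(=O)O)N"             # L-alanine with stereo marker
    for s in (aspirin, alanine_flat, alanine_l):
        print(s, has_stereo_or_isotope(s))
```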

⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to run this Actor on any cron interval and keep a downstream database in sync.

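⚖️ Is PubChem data free to reuse?

PubChem data is generally free to access and reuse, and content authored by the U.S. government is in the public domain in the United States. Many international jurisdictions treat it similarly. Some depositor-contributed records may carry their own terms, so review the terms that apply to your specific use case before redistribution.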

💼 Can I use this data commercially?

Yes. PubChem's data policy permits commercial use. You are responsible for complying with any downstream regulatory requirements and the terms of contributing depositors.

💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and gives you access to scheduling, higher concurrency, and larger datasets.

🔁 What happens if a run fails or gets interrupted?

Apify automatically retries transient errors. If a run still fails, you can inspect the log in the Runs tab, fix the input, and re-run. Partial datasets from failed runs are preserved, so the records collected before the failure remain available.

🆘 What if I need help?

Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.


🔌 Integrate with any app

PubChem Compound Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe compound data into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh compound data into your product backend, or alert your team in Slack.


💡 Pro Tip: browse the complete ParseForge collection for more reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by the NIH National Library of Medicine, PubChem, or any government body. All trademarks mentioned are the property of their respective owners. Only publicly available open data is collected.