PubMed Scraper — Papers, DOI & MeSH to JSON

Pricing

Pay per event

Search PubMed by query and export structured paper rows — title, authors, abstract, journal, DOI, PMID, MeSH terms, publication date — to JSON or CSV. A clean PubMed API wrapper that handles NCBI pagination, rate limits, and retries for research and ML pipelines.

Developer

DevilScrapes

Maintained by Community

🎯 What this scrapes

NCBI's E-utilities is the canonical gateway to PubMed's 36+ million records, and it punishes naive callers with hard rate limits, chained esearch → efetch calls, XML quirks, and intermittent 500s. This PubMed scraper turns a free-form search query into a fully typed dataset (PMID, title, authors, abstract, journal, DOI, MeSH terms, publication types, author-supplied keywords, canonical PubMed URL) and absorbs the upstream friction: paged fetches, backoff on 429s, transient-error retries, and XML-to-JSON coercion. Provide your NCBI API key to lift throughput from 3 req/s to 10 req/s; the rows come out identical either way.

This is a research and metadata tool only — abstracts, titles, identifiers, and controlled vocabulary. It never touches patient records, never claims to be clinical decision support, and deliberately does not fetch full text (full-text licensing lives on the publisher's side, not PubMed's). We scrape what PubMed openly indexes.
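
The chained esearch → efetch flow described above can be sketched as plain URL construction. This is a minimal illustration, not the Actor's actual code: the helper names and the page size of 100 are assumptions, while the endpoint paths and the db, term, retstart, retmax, retmode, id, and api_key parameters are standard E-utilities ones.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retstart=0, retmax=100, api_key=None):
    """Build an esearch URL that returns one page of PMIDs as JSON."""
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids, api_key=None):
    """Build an efetch URL that returns the full XML records for some PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def page_offsets(max_results, page_size=100):
    """Yield (retstart, retmax) pairs that together cover max_results records."""
    for start in range(0, max_results, page_size):
        yield start, min(page_size, max_results - start)
```

Each page would be one esearch call for PMIDs followed by one efetch call for the records behind them; `page_offsets(250)` splits a 250-record request into three pages.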

🔥 Features

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
  • 🌐 Optional residential proxy rotation via Apify Proxy (off by default): a fresh session and exit IP on every block.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After honoured.
  • 🧱 Rate-limit-aware pacing — when NCBI pushes back, we slow down, surface a status message, and keep going. You never get a silent empty dataset.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing — you only pay for results that hit your dataset. No data, no charge.
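
The retry behaviour in the list above (408 / 429 / 5xx, up to 5 attempts, Retry-After honoured) might look roughly like this. It is a sketch under assumptions: the 1-second base, 30-second cap, and function names are illustrative, and only the delta-seconds form of Retry-After is handled.

```python
import random

MAX_ATTEMPTS = 5  # "up to 5 attempts per page" from the feature list
RETRYABLE_STATUSES = {408, 429}


def is_retryable(status):
    """True for 408, 429, and any 5xx response."""
    return status in RETRYABLE_STATUSES or 500 <= status <= 599


def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0, jitter=0.0):
    """Seconds to wait before retrying after failed attempt `attempt` (1-based).

    A server-supplied Retry-After value (in seconds) takes priority;
    otherwise the delay doubles on each attempt up to `cap`.
    """
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * 2 ** (attempt - 1))
    # Optional random jitter spreads out retries from parallel workers.
    return delay + random.uniform(0, jitter)
```

A caller would loop up to `MAX_ATTEMPTS` times, sleeping for `backoff_delay(attempt, retry_after)` whenever `is_retryable(status)` is true.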

💡 Use cases

  • Clinical RAG pipelines — pull fresh PubMed metadata on a schedule and embed abstracts into a vector store for a medical-literature chatbot or pharmacovigilance alert.
  • Literature reviews and meta-analyses — retrieve every paper matching a topic + date range in one run; export to CSV for your review management tool.
  • Pharma competitive intel — track new mentions of a drug, compound, or trial ID across PubMed as they appear.
  • Author publication monitoring — daily [Author] diff to feed a personal or departmental RSS-style alert.
  • MeSH-based corpus assembly — extract every paper tagged with specific MeSH headings to build a training corpus or annotation benchmark.
  • Bulk PubMed dataset download — run a broad query (e.g. "CRISPR"[MeSH] AND 2020:2025[PDat]) and export thousands of records in a single job.

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Fill in the input form — searchQuery is the only required field; everything else has sensible defaults.
  3. Click Start. Output streams into the run's dataset in real time.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or pull via the Apify API.

Need a repeating feed? Wire a Schedule to the Actor. Each run picks up new results; combine with a named dataset to build an append-only archive.
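
For a scheduled feed, an append-only archive needs a way to keep only the rows it has not seen before. A minimal sketch, assuming you track previously archived PMIDs yourself (the function name is hypothetical and this is not part of the Actor):

```python
def new_rows(previous_pmids, current_rows):
    """Return rows whose pmid is not already archived, preserving order.

    Duplicates inside current_rows are also dropped, so each PMID is
    appended to the archive at most once.
    """
    seen = set(previous_pmids)
    fresh = []
    for row in current_rows:
        if row["pmid"] not in seen:
            seen.add(row["pmid"])
            fresh.append(row)
    return fresh
```

The same diff works for the author-monitoring use case: run the same `[Author]` query daily and alert on whatever `new_rows` returns.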

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| searchQuery | string | yes | diabetes mellitus type 2 review | PubMed-style query. Field tags like [Author], [Title], [MeSH], [PDat] are fully supported. |
| maxResults | integer | no | 30 | Total PubMed records to fetch. No hard cap; set a large number for bulk PubMed dataset downloads. |
| sortBy | string | no | most_recent | Field used to order results. Accepts any value supported by the E-utilities sort parameter. |
| apiKey | string | no | | NCBI API key. Lifts rate limit from 3 req/s to 10 req/s. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | NCBI does not IP-block under standard use, so proxy is optional. Residential proxies are available if your environment requires them. |

Example input

{
  "searchQuery": "crispr review 2024",
  "maxResults": 50,
  "sortBy": "most_recent",
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

📤 Output

Every row is one PubMed record. All fields are Pydantic-validated before they hit your dataset.

| Field | Type | Notes |
| --- | --- | --- |
| pmid | string | PubMed ID, the canonical stable identifier. |
| pmcid | string or null | PubMed Central ID when the record is in PMC. |
| doi | string or null | Digital Object Identifier. |
| title | string | Full paper title. |
| abstract | string or null | Abstract text, including structured headings where present. |
| authors | array | Author names (Last F format), preserving original order. |
| journal | string or null | Journal full name. |
| journal_iso | string or null | Journal ISO abbreviation (e.g. Nat Rev Drug Discov). |
| publication_types | array | Publication-type labels, e.g. Review, Journal Article, Clinical Trial. |
| mesh_terms | array | MeSH headings assigned by NCBI indexers. |
| keywords | array | Author-supplied keywords. |
| pub_date | string or null | Best-available publication date: ISO-8601 (2024-03-01) or year-only when that is all PubMed records. |
| pubmed_url | string | Canonical PubMed URL for this record. |
| scraped_at | string | ISO-8601 timestamp of when this row was recorded. |

Example output

{
  "pmid": "39000123",
  "pmcid": "PMC11234567",
  "doi": "10.1038/s41573-024-00901-3",
  "title": "Advances in CRISPR-Cas12a therapeutics: a 2024 review",
  "abstract": "CRISPR-based gene editing has matured rapidly ...",
  "authors": ["Smith J", "Patel R", "Chen W"],
  "journal": "Nature Reviews Drug Discovery",
  "journal_iso": "Nat Rev Drug Discov",
  "publication_types": ["Review", "Journal Article"],
  "mesh_terms": ["CRISPR-Cas Systems", "Gene Editing", "Therapeutics"],
  "keywords": ["CRISPR", "gene therapy", "Cas12a"],
  "pub_date": "2024-03-01",
  "pubmed_url": "https://pubmed.ncbi.nlm.nih.gov/39000123/",
  "scraped_at": "2026-06-01T09:12:00Z"
}
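
Because pub_date can be a full ISO-8601 date, a bare year, or null, downstream code usually needs to normalise it before sorting or filtering. One possible approach, with a hypothetical helper name, pins year-only values to 1 January and reports the precision alongside:

```python
from datetime import date


def parse_pub_date(value):
    """Return (date, precision) for a pub_date string, or None when missing.

    Year-only values are pinned to 1 January and flagged with precision
    "year" so callers can tell them apart from real day-level dates.
    Any other shape raises ValueError from date.fromisoformat.
    """
    if value is None:
        return None
    if len(value) == 4 and value.isdigit():
        return date(int(value), 1, 1), "year"
    return date.fromisoformat(value), "day"
```

Keeping the precision flag avoids treating a year-only record as if it had really been published on 1 January.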

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.002 | Per dataset item added |

Example: 1,000 results ≈ $2.00 (1,000 × $0.002 per result, plus the $0.005 start charge). No subscription, no monthly minimum, no card to start; every new Apify account gets $5 of free credit.
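
The arithmetic behind that estimate, using the two event prices from the table. The function name is illustrative; Decimal is used so the currency maths stays exact.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.002")       # charged per dataset item


def estimate_cost(results, runs=1):
    """Estimated charge in USD for `results` items spread over `runs` runs."""
    return ACTOR_START_USD * runs + RESULT_USD * results
```

For scheduled jobs, pass the number of runs as well: a daily run returning 50 rows over 30 days is `estimate_cost(50 * 30, runs=30)`.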

🚧 Limitations

  • Metadata only — this Actor hits E-utilities (esearch + efetch). Full text lives on publisher sites and is out of scope. The pmcid field gives you a pointer to PubMed Central when the paper is openly available there.
  • Citation graphs — which papers cite which — are not in scope. Use the iCite API for that.
  • Older records — some fields (especially abstract, doi, mesh_terms) may be absent for pre-1970 records. The Actor surfaces null rather than fabricating data.
  • NCBI rate limits — the Actor honours NCBI's stated quota (3 req/s without an API key, 10 req/s with one). We will not race past these limits; doing so gets the entire endpoint burned for everyone. Provide an apiKey for high-volume jobs.
  • Patient data and PHI — PubMed indexes abstracts and metadata only. There is no patient data here, and this tool must not be used as clinical decision support.
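
The quota described above (3 req/s without a key, 10 req/s with one) comes down to a minimum gap between requests. A simple pacer could enforce it as follows; this is a sketch of the general technique, not the Actor's internals, and the class name and injectable clock/sleep hooks are assumptions made for testability.

```python
import time


class Pacer:
    """Spaces out calls so they never exceed NCBI's stated request rate."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # 10 req/s with an API key, 3 req/s without one.
        self.min_interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._next_slot = None  # earliest time the next request may start

    def wait(self):
        """Block until the next request slot is free, then reserve it."""
        now = self._clock()
        if self._next_slot is not None and now < self._next_slot:
            self._sleep(self._next_slot - now)
            now = self._next_slot
        self._next_slot = now + self.min_interval
```

Calling `pacer.wait()` immediately before every E-utilities request keeps a single-threaded client inside the quota.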

❓ FAQ

Do I need an NCBI API key to run this PubMed scraper?

No — without one you get ~3 req/s, which is enough for queries returning up to a few thousand records in a reasonable time. With an API key you lift to 10 req/s. Get yours free at the NCBI account portal.

Is this a PubMed API wrapper I can call programmatically?

Yes — once the Actor runs, the output dataset is accessible via the Apify REST API (JSON / NDJSON / CSV / XLSX). You can also trigger runs via API and poll for completion. See the Apify API docs for details.
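
As a rough illustration of a programmatic call, the sketch below builds a request against Apify's synchronous run endpoint using only the standard library. The Actor ID shown in the usage note is a placeholder, not this Actor's real ID, and the helper names are hypothetical; long-running jobs are better served by starting a run and polling, as described above.

```python
import json
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build a POST that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_and_fetch(actor_id, token, run_input, timeout=300):
    """Send the request and decode the JSON array of dataset rows."""
    request = build_run_request(actor_id, token, run_input)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Usage would look like `run_and_fetch("<username>~<actor-name>", token, {"searchQuery": "crispr review 2024", "maxResults": 50})`, with the real Actor ID substituted in.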

Can I bulk-download thousands of PubMed records?

Yes. Set maxResults to however many records you need. The Actor pages through E-utilities results and streams rows into your dataset as it goes. For very large jobs, provide an apiKey to get the 10 req/s quota.

Can I filter by date range?

Use the [PDat] qualifier in your searchQuery, e.g. "COVID-19"[MeSH] AND 2020:2024[PDat]. NCBI documents the full Entrez query syntax in the PubMed User Guide.
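
If you build queries in code, a small helper can append the date range consistently. This is an illustrative sketch; the function name is hypothetical, and it simply wraps the original term in parentheses before adding the [PDat] clause.

```python
def with_date_range(term, start_year, end_year):
    """Restrict a PubMed query to publication years start_year..end_year."""
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    # Parentheses keep any OR/NOT logic inside `term` intact.
    return f"({term}) AND {start_year}:{end_year}[PDat]"
```

The result goes straight into the searchQuery input field.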

Why are some abstracts empty?

PubMed does not always store abstract text — older records, letters, and some conference papers are abstract-less. The Actor returns null for missing fields rather than inserting placeholder text.

What about full text and clinical literature search?

Full text lives on the publisher's site. When a paper is openly available in PubMed Central, the pmcid field gives you the identifier to fetch it directly from PMC. To build a clinical literature search on top of this Actor, pair it with your own embedding pipeline; the output schema is designed to map directly onto LangChain's Document format.
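
One way that mapping might look is shown below. To stay dependency-free, the sketch returns plain dicts with the same two fields a LangChain Document takes (page_content and metadata); the choice of which row fields go into metadata is an assumption, and rows without an abstract are skipped because there is nothing useful to embed.

```python
def to_document(row):
    """Map one dataset row to a page_content/metadata dict, or None."""
    if not row.get("abstract"):
        return None  # null or empty abstract: nothing to embed
    return {
        "page_content": f"{row['title']}\n\n{row['abstract']}",
        "metadata": {
            "pmid": row["pmid"],
            "doi": row.get("doi"),
            "journal": row.get("journal"),
            "pub_date": row.get("pub_date"),
            "mesh_terms": row.get("mesh_terms", []),
            "source": row["pubmed_url"],
        },
    }


def to_documents(rows):
    """Convert rows to document dicts, dropping abstract-less records."""
    return [doc for doc in map(to_document, rows) if doc is not None]
```

With LangChain installed, each dict can be passed on as `Document(**doc)`.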

Does this handle retracted papers?

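PubMed keeps retracted papers in the index and tags them with the "Retracted Publication" publication type; the retraction notice itself is a separate record typed "Retraction of Publication". The Actor surfaces the publication_types array so you can filter either kind out downstream.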
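
A downstream filter could be as small as the sketch below. It drops both kinds of record: PubMed labels a retracted article "Retracted Publication" and the notice announcing the retraction "Retraction of Publication". The function names are illustrative.

```python
RETRACTION_TYPES = {"Retracted Publication", "Retraction of Publication"}


def is_retraction_related(row):
    """True if the row is a retracted paper or a retraction notice."""
    return bool(RETRACTION_TYPES & set(row.get("publication_types", [])))


def drop_retractions(rows):
    """Return only rows that carry no retraction-related publication type."""
    return [row for row in rows if not is_retraction_related(row)]
```

Apply it before embedding or analysis so retracted findings do not leak into a corpus unnoticed.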

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.