PubMed Search Scraper

Pricing

from $1.00 / 1,000 results

Search PubMed (NCBI E-utilities) for biomedical articles by keyword, date range, and article type. Returns title, authors, journal, abstract, DOI, MeSH terms, keywords, and citation. Free public API, no proxy, no cookies. Optional NCBI API key for higher rate limits.

Rating: 5.0 (9)

Developer: Crawler Bros (maintained by Community)

Actor stats: 9 bookmarked, 2 total users, 1 monthly active user, last modified 4 days ago

Search PubMed, NCBI's biomedical literature database with 35M+ citations (mostly from 1966 onwards), by keyword, date range, and article type. Returns structured records with title, authors, journal, abstract, DOI, MeSH terms, keywords, and a formatted citation string. HTTP-only via NCBI's free public E-utilities API. No proxy, no cookies.

What it does

You provide one or more search terms (or paste full PubMed search URLs); the actor:

  1. Builds a PubMed query for each term, optionally narrowed by date range, article type, and free-full-text flag.
  2. Calls esearch to get the matching PMIDs (paginated).
  3. Calls esummary in batches of 200 for the metadata (title, authors, journal, dates, DOI, publication types).
  4. Calls efetch in batches of 50 for the abstract, MeSH headings, and author keywords.
  5. Merges everything into one flat record per article, dedupes across search terms by PMID, and pushes to the dataset.
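
Steps 1 to 4 can be sketched as follows. This is a minimal illustration, not the actor's actual code: `build_query`, `batched`, and the `ARTICLE_TYPE_TAGS` mapping are hypothetical names, and the exact filter strings the actor sends are an assumption. The `[dp]`, `[pt]`, and `free full text[sb]` tags are standard PubMed search syntax.

```python
from typing import Iterator, List, Optional, Sequence

# Hypothetical, partial mapping from articleType input values to PubMed
# publication-type names. The actor's real mapping is not documented here.
ARTICLE_TYPE_TAGS = {
    "review": "Review",
    "clinical_trial": "Clinical Trial",
    "meta_analysis": "Meta-Analysis",
    "systematic_review": "Systematic Review",
}


def build_query(
    term: str,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
    article_type: str = "any",
    free_full_text_only: bool = False,
) -> str:
    """Combine a search term with optional PubMed filters using AND."""
    parts = [f"({term})"]
    if date_from or date_to:
        # Assumption: open-ended ranges are closed with very early/late dates.
        start = date_from or "1800/01/01"
        end = date_to or "3000/12/31"
        parts.append(f"{start}:{end}[dp]")
    if article_type != "any":
        if article_type not in ARTICLE_TYPE_TAGS:
            raise ValueError(f"unmapped article type: {article_type}")
        parts.append(f'"{ARTICLE_TYPE_TAGS[article_type]}"[pt]')
    if free_full_text_only:
        parts.append("free full text[sb]")
    return " AND ".join(parts)


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

With the batch sizes given above, `batched(pmids, 200)` would feed esummary and `batched(pmids, 50)` would feed efetch.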

Empty fields are omitted (no nulls) — when an article has no abstract or no MeSH terms, those keys are simply absent.
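
The omit-empty rule could look like the sketch below. `omit_empty` is a hypothetical helper, and treating `0` and `False` as real values (so they are kept) is an assumption about the actor's behavior.

```python
def omit_empty(record: dict) -> dict:
    """Drop keys whose value is None, an empty string, or an empty container."""
    cleaned = {}
    for key, value in record.items():
        if value is None:
            continue
        if isinstance(value, (str, list, dict, tuple, set)) and len(value) == 0:
            continue
        # Assumption: numeric zero and False are meaningful and stay in the record.
        cleaned[key] = value
    return cleaned
```

On the consuming side this means reading optional fields with `record.get("abstract")` rather than `record["abstract"]`.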

Input

| Field | Type | Default | Description |
|---|---|---|---|
| searchTerms | array of strings (required) | ["machine learning oncology"] | One or more PubMed queries. Supports full PubMed syntax: boolean (AND/OR/NOT), MeSH (cancer[MeSH]), field tags, etc. |
| searchUrls | array of strings | [] | Optional. Paste full PubMed search URLs; the actor extracts the term= param and merges with searchTerms. |
| pmidList | array of strings | [] | Optional direct-lookup mode: a list of PubMed IDs to fetch without searching (e.g. ["38123456", "36438426"]). Bypasses esearch and goes straight to esummary/efetch. Combine with or use instead of searchTerms. |
| maxItemsPerTerm | integer | 25 (range 1–500) | Per-term result cap. Total dataset = sum across terms minus duplicates. |
| dateFrom | string | | Earliest publication date, YYYY/MM/DD (e.g. 2024/01/01). Optional. |
| dateTo | string | | Latest publication date, YYYY/MM/DD. Optional. |
| articleType | enum | any | One of any, review, clinical_trial, meta_analysis, case_report, randomized_controlled_trial, systematic_review, editorial, letter, comment, practice_guideline, observational_study, comparative_study, multicenter_study. |
| freeFullTextOnly | boolean | false | If true, restrict to articles with free full-text access (PMC). |
| language | enum | any | Restrict by article language. One of any, english, spanish, french, german, chinese, japanese, italian, portuguese, russian, korean. |
| journalFilter | string | | Restrict to a specific journal (exact name, e.g. Nature or New England Journal of Medicine). Optional. |
| authorFilter | string | | Restrict to articles by a specific author in PubMed format (e.g. Smith J or Smith JR). Optional. |
| meshFilter | array of strings | [] | Restrict to articles tagged with these MeSH (Medical Subject Headings) terms, AND-joined (e.g. ["Lung Neoplasms", "Machine Learning"] returns articles tagged with both). |
| affiliationFilter | string | | Restrict to articles where any author's affiliation contains this substring (e.g. Harvard, Mayo Clinic, Beijing). Optional. |
| includeCitedByCount | boolean | false | Add a citedByCount field to each record via NCBI elink. Adds 1 elink call per page of PMIDs. Useful for citation-network analysis. |
| apiKey | string (Secret, optional) | | Free NCBI API key. Raises rate limit from 3 → 10 req/s. Sign up at https://www.ncbi.nlm.nih.gov/account/. Useful for bulk runs. |

Example input

{
  "searchTerms": ["machine learning oncology", "covid vaccine efficacy"],
  "maxItemsPerTerm": 50,
  "dateFrom": "2023/01/01",
  "dateTo": "2024/12/31",
  "articleType": "review",
  "freeFullTextOnly": true
}
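
Before starting a run, an input like the one above can be checked locally. The sketch below is a hypothetical pre-flight check based only on the constraints listed in the input table (the 1 to 500 range and the YYYY/MM/DD date format); the actor's own validation may differ.

```python
from datetime import datetime
from typing import List


def validate_input(run_input: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    max_items = run_input.get("maxItemsPerTerm", 25)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int) or not 1 <= max_items <= 500:
        errors.append("maxItemsPerTerm must be an integer between 1 and 500")

    parsed = {}
    for key in ("dateFrom", "dateTo"):
        value = run_input.get(key)
        if value is None:
            continue
        try:
            parsed[key] = datetime.strptime(value, "%Y/%m/%d")
        except (TypeError, ValueError):
            errors.append(f"{key} must use the YYYY/MM/DD format")

    if "dateFrom" in parsed and "dateTo" in parsed and parsed["dateFrom"] > parsed["dateTo"]:
        errors.append("dateFrom must not be later than dateTo")

    return errors
```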

Output

One record per unique article. Empty fields are omitted (no nulls).

{
  "pmid": "38123456",
  "title": "Machine learning for early lung-cancer detection: a systematic review",
  "authors": ["Smith J", "Doe JR", "Brown KL"],
  "authorsAbbreviated": ["Smith J", "Doe JR", "Brown KL"],
  "authorCount": 3,
  "journal": "Nature Reviews Oncology",
  "journalAbbrev": "Nat Rev Oncol",
  "publicationDate": "2024-03-15",
  "epubDate": "2024-02-20",
  "volume": "21",
  "issue": "5",
  "pages": "300-315",
  "issn": "1759-4774",
  "elocationId": "doi: 10.1000/foo",
  "language": "eng",
  "doi": "10.1000/foo",
  "pmcId": "PMC9876543",
  "pmcUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9876543/",
  "abstract": "BACKGROUND: Lung cancer remains... METHODS: We searched MEDLINE...",
  "meshTerms": ["Lung Neoplasms", "Machine Learning", "Early Detection of Cancer"],
  "keywords": ["deep learning", "screening", "CT imaging"],
  "authorAffiliations": ["Harvard Medical School, Boston, USA.", "MIT, Cambridge, USA."],
  "conflictOfInterest": "The authors declare no conflict of interest.",
  "grants": [
    {"grantId": "R01-CA-12345", "agency": "NCI NIH HHS", "country": "United States"}
  ],
  "referenceCount": 86,
  "citedByCount": 8,
  "articleTypes": ["Journal Article", "Systematic Review"],
  "tags": ["Systematic Review", "Free article"],
  "articleUrl": "https://pubmed.ncbi.nlm.nih.gov/38123456/",
  "shareLinks": {
    "twitter": "https://twitter.com/intent/tweet?text=...",
    "facebook": "https://www.facebook.com/sharer/sharer.php?u=...",
    "permalink": "https://pubmed.ncbi.nlm.nih.gov/38123456/"
  },
  "citation": "Smith J, Doe JR, Brown KL. Machine learning for early lung-cancer detection: a systematic review. Nat Rev Oncol. 2024;21(5):300-315. doi:10.1000/foo",
  "inputQuery": "machine learning oncology",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
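
The flattened `abstract` value in the record above can be reproduced from PubMed's XML, where structured abstracts arrive as several `AbstractText` elements carrying a `Label` attribute. The function below is a sketch of one way to do the flattening; `flatten_abstract` is a hypothetical name and the actor's exact whitespace handling is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def flatten_abstract(abstract_xml: str) -> Optional[str]:
    """Join AbstractText sections into one string, prefixing section labels."""
    root = ET.fromstring(abstract_xml)
    sections = []
    for node in root.iter("AbstractText"):
        # itertext() also picks up text inside inline markup such as <i> or <sub>.
        text = " ".join("".join(node.itertext()).split())
        if not text:
            continue
        label = node.get("Label")
        sections.append(f"{label}: {text}" if label else text)
    # Returning None lets the caller omit the field entirely.
    return " ".join(sections) or None
```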

Output fields

  • pmid — PubMed ID (stable, never changes).
  • title — article title (trailing period stripped).
  • authors / authorsAbbreviated / authorCount — full and abbreviated author lists + count.
  • journal / journalAbbrev — full journal name + abbreviated (NLM-style) name.
  • publicationDate — ISO date YYYY-MM-DD (best parse from PubMed's free-form pubdate).
  • epubDate — ISO date of electronic publication when available.
  • volume / issue / pages / issn / elocationId — bibliographic identifiers when present.
  • language — ISO 639-2 code (e.g. eng, spa, chi).
  • doi — Digital Object Identifier when registered.
  • pmcId — PubMed Central ID (e.g. PMC1234567) when the article is archived in PubMed Central (PMC).
  • pmcUrl — direct URL to the free full-text version on PubMed Central (when pmcId is set).
  • abstract — full abstract text. Multi-section abstracts are flattened with section labels (e.g. "BACKGROUND: ... METHODS: ...").
  • meshTerms — array of MeSH descriptor names (curated medical subject headings).
  • keywords — author-supplied keywords (when available).
  • authorAffiliations — deduped list of institutional affiliations parsed from author metadata (e.g. ["Harvard Medical School, Boston, USA.", "MIT, Cambridge, USA."]).
  • conflictOfInterest — author-declared conflict-of-interest statement (when present).
  • grants — funding sources as [{grantId, agency, country}] rows (NIH grants, foundation grants, etc.).
  • referenceCount — number of references cited by this article (parsed from PubMed's reference list).
  • citedByCount — number of PubMed articles citing this article. Only populated when includeCitedByCount: true.
  • articleTypes — raw publication types from PubMed.
  • tags — derived human-readable tags from articleTypes (Review, Systematic Review, Meta-Analysis, Clinical Trial, Case Report, Randomized Controlled Trial) plus Free article when freeFullTextOnly is set.
  • articleUrl — direct link to the PubMed page.
  • shareLinks {twitter, facebook, permalink} — pre-filled share URLs.
  • citation — compact AMA-style citation string assembled from the metadata.
  • inputQuery — the search term (or extracted-from-URL term) that surfaced this article.
  • scrapedAt — ISO-8601 UTC timestamp.
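
As an illustration of how the `citation` field relates to the other fields, the sketch below rebuilds the citation shown in the example record. `format_citation` is a hypothetical helper; the actor's real formatting rules for edge cases (missing volume, titles ending in a question mark) are not documented, so those branches are assumptions.

```python
def format_citation(record: dict) -> str:
    """Assemble a compact AMA-style citation from whichever fields are present."""
    parts = []

    authors = record.get("authorsAbbreviated") or record.get("authors") or []
    if authors:
        parts.append(", ".join(authors) + ".")

    title = record.get("title")
    if title:
        # Keep terminal ? or ! as-is; otherwise end the title with one period.
        parts.append(title if title.endswith(("?", "!")) else title.rstrip(".") + ".")

    journal = record.get("journalAbbrev") or record.get("journal")
    if journal:
        parts.append(journal.rstrip(".") + ".")

    year = (record.get("publicationDate") or "")[:4]
    volume = record.get("volume") or ""
    if volume and record.get("issue"):
        volume += f"({record['issue']})"
    locator = ";".join(piece for piece in (year, volume) if piece)
    if record.get("pages"):
        locator = f"{locator}:{record['pages']}" if locator else record["pages"]
    if locator:
        parts.append(locator + ".")

    if record.get("doi"):
        parts.append(f"doi:{record['doi']}")

    return " ".join(parts)
```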

Use cases

  • Literature reviews — build a structured corpus of every relevant paper for a systematic review or meta-analysis.
  • Research-trend tracking — monitor weekly/monthly volume of publications on a topic to spot rising fields.
  • Bibliometric analysis — pull thousands of records with structured authors/journals for citation-network analysis.
  • Author / journal mapping — find every paper by a specific researcher or in a target journal across a date range.
  • Curated newsletters — generate a fresh weekly list of new articles matching your topic + filter combination.

FAQ

Does it need a proxy or cookies? No. The E-utilities API is public and works from datacenter IPs. No login or authentication is required.

Do I need an API key? No — the free public limit is 3 requests/second, plenty for default-sized runs. Supply an optional apiKey (free signup at https://www.ncbi.nlm.nih.gov/account/) to raise the limit to 10 req/s for bulk extraction.
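
For readers calling E-utilities directly, the two limits can be respected with a simple client-side throttle. The sketch below is one possible approach, not the actor's implementation; `RateLimiter` and `limiter_for` are hypothetical names, and the injectable `clock` and `sleep` arguments exist only to make the class easy to test.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Space calls so that at most `requests_per_second` are issued."""

    def __init__(
        self,
        requests_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request slot is available."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval


def limiter_for(api_key: Optional[str] = None) -> RateLimiter:
    # 3 req/s without a key, 10 req/s with one, as stated above.
    return RateLimiter(10 if api_key else 3)
```

Call `limiter.wait()` immediately before each HTTP request.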

Can I search by author, journal, or MeSH term? Yes — PubMed query syntax is fully supported. Examples:

  • Author: Smith J[Author]
  • Journal: "Nature Reviews Oncology"[Journal]
  • MeSH: Cancer[MeSH] AND machine learning
  • Combine: (cancer OR tumor) AND machine learning AND 2023:2024[dp]

Why is my abstract empty? Some PubMed records ship without an abstract, usually editorials, letters, or older articles. In those cases the abstract field is simply omitted, following the omit-empty rule.

How does deduplication work? Across search terms within the same run, articles are deduped by PMID. Each article appears at most once in the dataset, with inputQuery set to the first term that surfaced it.
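
The first-term-wins behavior described above can be sketched like this. `dedupe_by_pmid` is a hypothetical function and the input shape (a dict of term to records, in run order) is an assumption made for illustration.

```python
from typing import Dict, List


def dedupe_by_pmid(results_by_term: Dict[str, List[dict]]) -> List[dict]:
    """Keep one record per PMID, tagged with the first term that surfaced it."""
    seen: Dict[str, dict] = {}
    # Dicts preserve insertion order, so terms are processed in run order.
    for term, records in results_by_term.items():
        for record in records:
            pmid = record["pmid"]
            if pmid not in seen:
                seen[pmid] = {**record, "inputQuery": term}
    return list(seen.values())
```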

What if my query has zero results? You get a single sentinel record {type: "pubmed_scraper_error", reason: "no_results", searchTerms: [...]} so the dataset is non-empty. The run completes successfully — empty datasets aren't treated as failures.
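
Downstream code should separate the sentinel from real article records before processing. A minimal sketch, assuming dataset items have already been loaded as a list of dicts (`split_results` is a hypothetical helper):

```python
from typing import List, Tuple

SENTINEL_TYPE = "pubmed_scraper_error"


def split_results(items: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate article records from sentinel error records."""
    articles = [item for item in items if item.get("type") != SENTINEL_TYPE]
    errors = [item for item in items if item.get("type") == SENTINEL_TYPE]
    return articles, errors
```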

Can I paste PubMed URLs instead of writing terms? Yes — drop full URLs like https://pubmed.ncbi.nlm.nih.gov/?term=cancer+immunotherapy into searchUrls. The actor extracts the term= param and treats it as a search term.
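
Extracting the term from a pasted URL takes only the standard library. The sketch below shows one way to do it; `term_from_url` is a hypothetical name, and how the actor treats URLs without a `term=` parameter is an assumption.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def term_from_url(url: str) -> Optional[str]:
    """Return the decoded term= query parameter, or None if it is missing."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("term", [])
    # parse_qs decodes "+" and percent-escapes, so the term is ready to use.
    term = values[0].strip() if values else ""
    return term or None
```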

Is the data fresh? PubMed updates within hours of publication for indexed journals. The actor pulls live every run; results reflect PubMed's current state at fetch time.