PubMed Search Scraper
Pricing: from $1.00 / 1,000 results
Search PubMed (NCBI E-utilities) for biomedical articles by keyword, date range, and article type. Returns title, authors, journal, abstract, DOI, MeSH terms, keywords, and citation. Free public API, no proxy, no cookies. Optional NCBI API key for higher rate limits.
Rating: 5.0 (9)
Developer: Crawler Bros
Actor stats: 9 bookmarked, 2 total users, 1 monthly active user
Last modified: 4 days ago
Search PubMed — the world's largest biomedical literature database, covering 35M+ articles from 1966 onwards — by keyword, date range, and article type. Returns structured records with title, authors, journal, abstract, DOI, MeSH terms, keywords, and a clean citation. HTTP-only via NCBI's free public E-utilities API. No proxy, no cookies.
What it does
You provide one or more search terms (or paste full PubMed search URLs); the actor:
- Builds a PubMed query for each term, optionally narrowed by date range, article type, and free-full-text flag.
- Calls `esearch` to get the matching PMIDs (paginated).
- Calls `esummary` in batches of 200 for the metadata (title, authors, journal, dates, DOI, publication types).
- Calls `efetch` in batches of 50 for the abstract, MeSH headings, and author keywords.
- Merges everything into one flat record per article, dedupes across search terms by PMID, and pushes to the dataset.
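The batching in the steps above can be sketched as follows. This is a minimal illustration, not the actor's actual code: it assumes the public E-utilities endpoints under `https://eutils.ncbi.nlm.nih.gov/entrez/eutils` and the standard `db`, `id`, `retmode`, and `api_key` parameters. The helper names `chunked` and `build_batch_urls` are hypothetical, and the sketch only builds request URLs without sending anything.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities base URL.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Batch sizes taken from the step list above.
ESUMMARY_BATCH = 200
EFETCH_BATCH = 50


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_urls(pmids, api_key=None):
    """Return (esummary_urls, efetch_urls) for a list of PMID strings.

    Hypothetical helper: it only builds URLs and sends no requests.
    """
    extra = {"api_key": api_key} if api_key else {}
    esummary_urls = [
        f"{EUTILS_BASE}/esummary.fcgi?"
        + urlencode({"db": "pubmed", "id": ",".join(batch), "retmode": "json", **extra})
        for batch in chunked(pmids, ESUMMARY_BATCH)
    ]
    efetch_urls = [
        f"{EUTILS_BASE}/efetch.fcgi?"
        + urlencode({"db": "pubmed", "id": ",".join(batch), "retmode": "xml", **extra})
        for batch in chunked(pmids, EFETCH_BATCH)
    ]
    return esummary_urls, efetch_urls
```

With 450 PMIDs, for example, this yields 3 `esummary` requests and 9 `efetch` requests, which shows why the `efetch` step dominates the request count on larger runs.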
Empty fields are omitted (no nulls) — when an article has no abstract or no MeSH terms, those keys are simply absent.
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `searchTerms` | array of strings (required) | `["machine learning oncology"]` | One or more PubMed queries. Supports full PubMed syntax: boolean (`AND`/`OR`/`NOT`), MeSH (`cancer[MeSH]`), field tags, etc. |
| `searchUrls` | array of strings | `[]` | Optional. Paste full PubMed search URLs; the actor extracts the `term=` param and merges with `searchTerms`. |
| `pmidList` | array of strings | `[]` | Optional direct-lookup mode: a list of PubMed IDs to fetch without searching (e.g. `["38123456", "36438426"]`). Bypasses `esearch` and goes straight to `esummary`/`efetch`. Combine with or use instead of `searchTerms`. |
| `maxItemsPerTerm` | integer | `25` (1–500) | Per-term result cap. Total dataset = sum across terms minus duplicates. |
| `dateFrom` | string | – | Earliest publication date, `YYYY/MM/DD` (e.g. `2024/01/01`). Optional. |
| `dateTo` | string | – | Latest publication date, `YYYY/MM/DD`. Optional. |
| `articleType` | enum | `any` | One of `any`, `review`, `clinical_trial`, `meta_analysis`, `case_report`, `randomized_controlled_trial`, `systematic_review`, `editorial`, `letter`, `comment`, `practice_guideline`, `observational_study`, `comparative_study`, `multicenter_study`. |
| `freeFullTextOnly` | boolean | `false` | If `true`, restrict to articles with free full-text access (PMC). |
| `language` | enum | `any` | Restrict by article language. One of `any`, `english`, `spanish`, `french`, `german`, `chinese`, `japanese`, `italian`, `portuguese`, `russian`, `korean`. |
| `journalFilter` | string | – | Restrict to a specific journal (exact name, e.g. `Nature` or `New England Journal of Medicine`). Optional. |
| `authorFilter` | string | – | Restrict to articles by a specific author in PubMed format (e.g. `Smith J` or `Smith JR`). Optional. |
| `meshFilter` | array of strings | `[]` | Restrict to articles tagged with these MeSH (Medical Subject Headings) terms, AND-joined (e.g. `["Lung Neoplasms", "Machine Learning"]` returns articles tagged with both). |
| `affiliationFilter` | string | – | Restrict to articles where any author's affiliation contains this substring (e.g. `Harvard`, `Mayo Clinic`, `Beijing`). Optional. |
| `includeCitedByCount` | boolean | `false` | Add a `citedByCount` field to each record via NCBI `elink`. Adds 1 `elink` call per page of PMIDs. Useful for citation-network analysis. |
| `apiKey` | string (Secret, optional) | – | Free NCBI API key. Raises rate limit from 3 → 10 req/s. Sign up at https://www.ncbi.nlm.nih.gov/account/. Useful for bulk runs. |
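To make the filter fields in the table concrete, here is a rough sketch of how they could map onto a single PubMed query string. How the actor really builds its query is not documented here, so the function `build_query`, the partial `ARTICLE_TYPES` mapping, and the wide fallback bounds for open-ended date ranges are all assumptions. The field tags (`[dp]`, `[pt]`, `[sb]`, `[la]`, `[Journal]`, `[Author]`, `[MeSH]`, `[ad]`) are standard PubMed search tags.

```python
# Partial, hypothetical mapping from articleType values to PubMed publication types.
ARTICLE_TYPES = {
    "review": "Review",
    "clinical_trial": "Clinical Trial",
    "meta_analysis": "Meta-Analysis",
    "systematic_review": "Systematic Review",
}


def build_query(term, date_from=None, date_to=None, article_type="any",
                free_full_text_only=False, language="any",
                journal=None, author=None, mesh=(), affiliation=None):
    """Combine a search term and optional filters into one PubMed query string."""
    parts = [f"({term})"]
    if date_from or date_to:
        # Open-ended ranges fall back to wide bounds (an assumption of this sketch).
        parts.append(f"{date_from or '1800/01/01'}:{date_to or '3000/12/31'}[dp]")
    if article_type != "any":
        parts.append(f'"{ARTICLE_TYPES[article_type]}"[pt]')
    if free_full_text_only:
        parts.append("free full text[sb]")
    if language != "any":
        parts.append(f"{language}[la]")
    if journal:
        parts.append(f'"{journal}"[Journal]')
    if author:
        parts.append(f"{author}[Author]")
    # MeSH headings are AND-joined, matching the meshFilter description.
    parts.extend(f'"{heading}"[MeSH]' for heading in mesh)
    if affiliation:
        parts.append(f"{affiliation}[ad]")
    return " AND ".join(parts)
```

Wrapping the user's term in parentheses keeps any `OR` inside it from capturing the appended filters.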
Example input
    {
      "searchTerms": ["machine learning oncology", "covid vaccine efficacy"],
      "maxItemsPerTerm": 50,
      "dateFrom": "2023/01/01",
      "dateTo": "2024/12/31",
      "articleType": "review",
      "freeFullTextOnly": true
    }
Output
One record per unique article. Empty fields are omitted (no nulls).
    {
      "pmid": "38123456",
      "title": "Machine learning for early lung-cancer detection: a systematic review",
      "authors": ["Smith J", "Doe JR", "Brown KL"],
      "authorsAbbreviated": ["Smith J", "Doe JR", "Brown KL"],
      "authorCount": 3,
      "journal": "Nature Reviews Oncology",
      "journalAbbrev": "Nat Rev Oncol",
      "publicationDate": "2024-03-15",
      "epubDate": "2024-02-20",
      "volume": "21",
      "issue": "5",
      "pages": "300-315",
      "issn": "1759-4774",
      "elocationId": "doi: 10.1000/foo",
      "language": "eng",
      "doi": "10.1000/foo",
      "pmcId": "PMC9876543",
      "pmcUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9876543/",
      "abstract": "BACKGROUND: Lung cancer remains... METHODS: We searched MEDLINE...",
      "meshTerms": ["Lung Neoplasms", "Machine Learning", "Early Detection of Cancer"],
      "keywords": ["deep learning", "screening", "CT imaging"],
      "authorAffiliations": ["Harvard Medical School, Boston, USA.", "MIT, Cambridge, USA."],
      "conflictOfInterest": "The authors declare no conflict of interest.",
      "grants": [
        {"grantId": "R01-CA-12345", "agency": "NCI NIH HHS", "country": "United States"}
      ],
      "referenceCount": 86,
      "citedByCount": 8,
      "articleTypes": ["Journal Article", "Systematic Review"],
      "tags": ["Systematic Review", "Free article"],
      "articleUrl": "https://pubmed.ncbi.nlm.nih.gov/38123456/",
      "shareLinks": {
        "twitter": "https://twitter.com/intent/tweet?text=...",
        "facebook": "https://www.facebook.com/sharer/sharer.php?u=...",
        "permalink": "https://pubmed.ncbi.nlm.nih.gov/38123456/"
      },
      "citation": "Smith J, Doe JR, Brown KL. Machine learning for early lung-cancer detection: a systematic review. Nat Rev Oncol. 2024;21(5):300-315. doi:10.1000/foo",
      "inputQuery": "machine learning oncology",
      "scrapedAt": "2024-12-16T14:23:11+00:00"
    }
Output fields
- `pmid`: PubMed ID (stable, never changes).
- `title`: article title (trailing period stripped).
- `authors` / `authorsAbbreviated` / `authorCount`: full and abbreviated author lists, plus the count.
- `journal` / `journalAbbrev`: full journal name and abbreviated (NLM-style) name.
- `publicationDate`: ISO date `YYYY-MM-DD` (best parse from PubMed's free-form `pubdate`).
- `epubDate`: ISO date of electronic publication when available.
- `volume` / `issue` / `pages` / `issn` / `elocationId`: bibliographic identifiers when present.
- `language`: ISO 639-2 code (e.g. `eng`, `spa`, `chi`).
- `doi`: Digital Object Identifier when registered.
- `pmcId`: PubMed Central ID (e.g. `PMC1234567`) when the article is in PMC's open-access archive.
- `pmcUrl`: direct URL to the free full-text version on PubMed Central (when `pmcId` is set).
- `abstract`: full abstract text. Multi-section abstracts are flattened with section labels (e.g. `"BACKGROUND: ... METHODS: ..."`).
- `meshTerms`: array of MeSH descriptor names (curated medical subject headings).
- `keywords`: author-supplied keywords (when available).
- `authorAffiliations`: deduped list of institutional affiliations parsed from author metadata (e.g. `["Harvard Medical School, Boston, USA.", "MIT, Cambridge, USA."]`).
- `conflictOfInterest`: author-declared conflict-of-interest statement (when present).
- `grants`: funding sources as `[{grantId, agency, country}]` rows (NIH grants, foundation grants, etc.).
- `referenceCount`: number of references cited by this article (parsed from PubMed's reference list).
- `citedByCount`: number of PubMed articles citing this article. Only populated when `includeCitedByCount: true`.
- `articleTypes`: raw publication types from PubMed.
- `tags`: derived human-readable tags from `articleTypes` (`Review`, `Systematic Review`, `Meta-Analysis`, `Clinical Trial`, `Case Report`, `Randomized Controlled Trial`) plus `Free article` when `freeFullTextOnly` is set.
- `articleUrl`: direct link to the PubMed page.
- `shareLinks`: `{twitter, facebook, permalink}` pre-filled share URLs.
- `citation`: compact AMA-style citation string assembled from the metadata.
- `inputQuery`: the search term (or extracted-from-URL term) that surfaced this article.
- `scrapedAt`: ISO-8601 UTC timestamp.
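As an illustration of how the `citation` string relates to the other fields, the sketch below rebuilds the citation from the sample record's metadata. The function `build_citation` is hypothetical and the exact formatting rules the actor applies are an assumption inferred from the one example shown; it may differ for records with missing pieces.

```python
def build_citation(record):
    """Assemble a compact citation string; missing pieces are skipped."""
    parts = []
    authors = record.get("authors") or []
    if authors:
        parts.append(", ".join(authors) + ".")
    if record.get("title"):
        parts.append(record["title"] + ".")
    journal = record.get("journalAbbrev") or record.get("journal")
    if journal:
        parts.append(journal + ".")
    # Year, volume(issue) and pages form one "2024;21(5):300-315." segment.
    locator = (record.get("publicationDate") or "")[:4]
    if record.get("volume"):
        locator += (";" if locator else "") + record["volume"]
        if record.get("issue"):
            locator += f"({record['issue']})"
    if record.get("pages"):
        locator += (":" if locator else "") + record["pages"]
    if locator:
        parts.append(locator + ".")
    if record.get("doi"):
        parts.append("doi:" + record["doi"])
    return " ".join(parts)
```

Because every piece is read with `.get()`, the same function works on records where `volume`, `issue`, `pages`, or `doi` were omitted.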
Use cases
- Literature reviews — build a structured corpus of every relevant paper for a systematic review or meta-analysis.
- Research-trend tracking — monitor weekly/monthly volume of publications on a topic to spot rising fields.
- Bibliometric analysis — pull thousands of records with structured authors/journals for citation-network analysis.
- Author / journal mapping — find every paper by a specific researcher or in a target journal across a date range.
- Curated newsletters — generate a fresh weekly list of new articles matching your topic + filter combination.
FAQ
Does it need a proxy or cookies?
No. PubMed's E-utilities is fully public and works from datacenter IPs. No login or auth required.
Do I need an API key?
No — the free public limit is 3 requests/second, plenty for default-sized runs. Supply an optional apiKey (free signup at https://www.ncbi.nlm.nih.gov/account/) to raise the limit to 10 req/s for bulk extraction.
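For readers calling E-utilities directly, the two limits quoted above translate into a minimum spacing between requests. The following is a minimal client-side throttle sketch; the `RateLimiter` class and `limiter_for` helper are hypothetical names and say nothing about how the actor itself paces its calls. The clock and sleep functions are injectable so the logic can be checked without real waiting.

```python
import time


class RateLimiter:
    """Space calls so that at most `rate` of them start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request may be sent; return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


def limiter_for(api_key=None):
    """3 req/s without a key, 10 req/s with one (the limits quoted above)."""
    return RateLimiter(10 if api_key else 3)
```

Calling `limiter.wait()` before each request keeps a loop under the limit: roughly 0.33 s between calls without a key and 0.1 s with one.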
Can I search by author, journal, or MeSH term?
Yes. PubMed query syntax is fully supported. Examples:
- Author: `Smith J[Author]`
- Journal: `"Nature Reviews Oncology"[Journal]`
- MeSH: `Cancer[MeSH] AND machine learning`
- Combine: `(cancer OR tumor) AND machine learning AND 2023:2024[dp]`
Why is my abstract empty?
~5% of PubMed records ship without an abstract — usually editorials, letters, or older articles. The abstract field is simply omitted in those cases (omit-empty contract).
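Since optional keys are absent rather than null, downstream code should read them with a default instead of indexing directly. A small sketch of that pattern follows; the `summarize` function and its output format are purely illustrative.

```python
def summarize(record):
    """Return a one-line summary that tolerates absent optional keys."""
    abstract = record.get("abstract")        # key is absent, not null, when empty
    mesh = record.get("meshTerms", [])       # same for MeSH terms
    if abstract and len(abstract) > 80:
        preview = abstract[:80] + "..."
    else:
        preview = abstract or "[no abstract]"
    return f"{record['pmid']}: {preview} | MeSH: {', '.join(mesh) or 'none'}"
```

Using `record["abstract"]` instead would raise `KeyError` on editorials and letters that ship without an abstract.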
How does deduplication work?
Across search terms within the same run, articles are deduped by PMID. Each article appears at most once in the dataset, with inputQuery set to the first term that surfaced it.
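The first-term-wins behaviour described here can be sketched in a few lines. This is an illustrative reimplementation under the stated rule, not the actor's code; `dedupe_by_pmid` and its input shape are assumptions.

```python
def dedupe_by_pmid(results_per_term):
    """Merge per-term result lists, keeping the first record seen for each PMID.

    `results_per_term` is a list of (term, records) pairs in input order.
    """
    seen = {}
    for term, records in results_per_term:
        for record in records:
            pmid = record["pmid"]
            if pmid not in seen:
                # Tag the record with the first term that surfaced it.
                seen[pmid] = {**record, "inputQuery": term}
    return list(seen.values())
```

Dictionaries preserve insertion order in Python 3.7+, so the output keeps the order in which articles were first encountered.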
What if my query has zero results?
You get a single sentinel record {type: "pubmed_scraper_error", reason: "no_results", searchTerms: [...]} so the dataset is non-empty. The run completes successfully — empty datasets aren't treated as failures.
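Because the sentinel shares the dataset with real article records, consumers may want to separate the two before processing. A minimal sketch, assuming the sentinel is identified by the `type` value shown above; `split_sentinels` is a hypothetical helper.

```python
SENTINEL_TYPE = "pubmed_scraper_error"


def split_sentinels(items):
    """Split dataset items into (articles, error_sentinels)."""
    articles = [item for item in items if item.get("type") != SENTINEL_TYPE]
    errors = [item for item in items if item.get("type") == SENTINEL_TYPE]
    return articles, errors
```

An empty `articles` list together with a `no_results` sentinel then signals a successful run that simply matched nothing.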
Can I paste PubMed URLs instead of writing terms?
Yes — drop full URLs like https://pubmed.ncbi.nlm.nih.gov/?term=cancer+immunotherapy into searchUrls. The actor extracts the term= param and treats it as a search term.
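The `term=` extraction can be reproduced with Python's standard library. The sketch below shows one way to do it; `extract_term` is a hypothetical name and the actor's own parsing may handle more edge cases.

```python
from urllib.parse import parse_qs, urlparse


def extract_term(url):
    """Return the decoded `term` query parameter of a PubMed URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("term")
    return values[0] if values else None
```

For the URL above this returns `cancer immunotherapy`, since `parse_qs` decodes `+` and percent-escapes back into plain text.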
Is the data fresh?
PubMed updates within hours of publication for indexed journals. The actor pulls live every run; results reflect PubMed's current state at fetch time.