Pricing

from $20.00 / 1,000 papers

PubMed Scraper for RAG: Papers as Chunked JSON

Scrape PubMed citations by search term, MeSH, and article type. Returns RAG-ready JSON with full-text chunks from PMC Open Access (cl100k_base, 512/50) and abstract fallback. Drop-in for LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector. Skip GROBID / Pubmed Parser. $0.02 per paper.

Developer

Devansh Tiwari

Last modified: a month ago

What does PubMed Scraper for RAG do?

This Apify Actor scrapes PubMed citations matching your search criteria, fetches PMC Open Access full-text where available, and splits the resulting plain text into tokenizer-aware chunks (512 tokens, 50-token overlap, tiktoken cl100k_base) ready to embed or feed into a RAG index.
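To make the 512/50 scheme concrete, here is a minimal sketch of sliding-window chunking over a token sequence. It is an illustration under assumptions, not the Actor's actual code: the function name `chunk_tokens` and the `start` field are hypothetical, and the input is a plain list of integers standing in for cl100k_base token ids.

```python
from typing import Dict, List


def chunk_tokens(token_ids: List[int], size: int = 512, overlap: int = 50) -> List[Dict]:
    """Split a token sequence into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # 462 tokens of new material per chunk at 512/50
    chunks: List[Dict] = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start:start + size]
        # `start` is a hypothetical field for illustration; real records carry `text`.
        chunks.append({"idx": len(chunks), "start": start, "tokens": len(window)})
        if start + size >= len(token_ids):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Stand-in for real token ids. With tiktoken installed they would come from
# tiktoken.get_encoding("cl100k_base").encode(text).
demo = chunk_tokens(list(range(1200)))
print([(c["start"], c["tokens"]) for c in demo])
```

With these settings each chunk after the first repeats the last 50 tokens of its predecessor, so a sentence cut at a chunk boundary still appears whole in one of the two neighbours in most cases.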

Each output record contains clean metadata (PMID, PMCID, DOI, title, authors, journal, MeSH terms, article types, publication date) and a chunks array of { idx, text, tokens } ready for direct ingestion into a vector database.

Try it in the Apify Console. Fill in a search term (e.g. SARS-CoV-2 vaccines), optional MeSH terms or article types, a date range, a paper cap, and hit Start. Download results as JSON, CSV, or Excel.

Because the Actor runs on the Apify platform, you also get scheduled runs, HTTP API access, integrations with Zapier / n8n / Make, proxy rotation, monitoring, and alerts. No infrastructure to run yourself.

Why use PubMed Scraper for RAG?

Skip the PubMed XML parsing grind. Clean Paper records, not raw E-utilities <PubmedArticle> trees.
PMC Open Access full-text when it exists. Around 20-30% of PubMed articles are in PMC OA; this Actor grabs NXML and strips it to readable prose. Abstract fallback otherwise.
MeSH + article-type filtering built in. Search by MeSH ontology (e.g. Neoplasms) or limit to Review / Meta-Analysis / Clinical Trial.
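Pre-chunked for RAG. Chunks are measured with tiktoken cl100k_base. The text works with OpenAI text-embedding-3, Claude, Cohere, and most BGE/E5/nomic embedding models, but the token counts are exact only for models that use cl100k_base; treat them as approximate elsewhere.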
Vector-DB neutral. Drop into Qdrant, Pinecone, Weaviate, pgvector (Supabase / Neon), Chroma, Milvus without reformatting.
Framework-ready. Works with LangChain, LlamaIndex, Haystack, LangGraph.
Respects NCBI rate limits. 3 requests per second without an API key, 10 per second with one.
Cheap. $0.02 per paper. A week of clinical-trial reviews (~200 papers) costs around $4.

How to use PubMed Scraper for RAG

Open the Actor in Apify Console.
Set searchTerm (free-text query; supports PubMed syntax like SARS-CoV-2[mesh] AND vaccine[ti]).
(Optional) Set meshTerms to filter by Medical Subject Headings.
(Optional) Set articleTypes to narrow to clinical_trial / review / meta_analysis / case_report / observational_study.
Set dateFrom / dateTo in YYYY-MM-DD format.
(Optional) Set ncbiApiKey for 10 req/s (free from NCBI). Without it you get 3 req/s.
Click Start. Expect ~3 sec per paper without an API key, ~1 sec with.
Download results from the Storage tab.

Input

Field	Type	Default	Description
`searchTerm`	string	`SARS-CoV-2 vaccines`	Free-text PubMed query, supports `[mh]`, `[ti]`, `[pt]` syntax
`meshTerms`	string[]	`[]`	MeSH headings. Each entry wrapped with `[mh]`, OR-combined
`articleTypes`	string[]	`[]`	Enum values: clinical_trial / review / meta_analysis / case_report / observational_study
`dateFrom`	string (YYYY-MM-DD)	`2024-01-01`	Inclusive publication-date lower bound
`dateTo`	string (YYYY-MM-DD)	`2024-01-15`	Inclusive publication-date upper bound
`maxPapers`	integer	`50`	Hard cap on papers returned (1 to 100000)
`ncbiApiKey`	string (secret)		Optional NCBI API key for 10 req/s instead of 3

Example input:

{
    "searchTerm": "SARS-CoV-2 vaccines",
    "articleTypes": ["review", "meta_analysis"],
    "dateFrom": "2024-01-01",
    "dateTo": "2024-01-31",
    "maxPapers": 100
}
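As a rough illustration of how these fields could combine into a single PubMed query, the sketch below turns an input object into a search expression. The Input table only states that `meshTerms` entries are wrapped with `[mh]` and OR-combined; everything else here is an assumption, including the AND between clauses, the enum-to-publication-type mapping, and the names `build_query` and `PUBLICATION_TYPES`.

```python
from typing import Dict, List

# Assumed mapping from the Actor's enum values to PubMed publication-type names.
PUBLICATION_TYPES: Dict[str, str] = {
    "clinical_trial": "Clinical Trial",
    "review": "Review",
    "meta_analysis": "Meta-Analysis",
    "case_report": "Case Reports",
    "observational_study": "Observational Study",
}


def build_query(actor_input: dict) -> str:
    """Combine the input fields into one PubMed search expression (assumed logic)."""
    clauses: List[str] = [f"({actor_input['searchTerm']})"]
    mesh = actor_input.get("meshTerms") or []
    if mesh:
        # Documented behaviour: each heading wrapped with [mh], OR-combined.
        clauses.append("(" + " OR ".join(f'"{term}"[mh]' for term in mesh) + ")")
    types = actor_input.get("articleTypes") or []
    if types:
        clauses.append("(" + " OR ".join(f'"{PUBLICATION_TYPES[t]}"[pt]' for t in types) + ")")
    date_from, date_to = actor_input.get("dateFrom"), actor_input.get("dateTo")
    if date_from and date_to:
        # PubMed date ranges use slashes: YYYY/MM/DD:YYYY/MM/DD[dp]
        clauses.append(f"({date_from.replace('-', '/')}:{date_to.replace('-', '/')}[dp])")
    return " AND ".join(clauses)


example = {
    "searchTerm": "SARS-CoV-2 vaccines",
    "articleTypes": ["review", "meta_analysis"],
    "dateFrom": "2024-01-01",
    "dateTo": "2024-01-31",
    "maxPapers": 100,
}
query = build_query(example)
print(query)
```

Pasting the printed expression into the PubMed search box is a quick way to preview roughly which citations a run might cover before paying for it.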

Output

Each paper becomes one dataset item. You can download the dataset in JSON, HTML, CSV, or Excel.

{
    "pmid": "38012345",
    "pmcid": "PMC10123456",
    "doi": "10.1038/s41586-024-01234-5",
    "title": "Durable protection against SARS-CoV-2 variants after boosting",
    "abstract": "...",
    "authors": ["Jane Smith", "John Doe"],
    "journal": "Nature",
    "mesh_terms": ["SARS-CoV-2", "Vaccines", "Immunization, Secondary"],
    "article_types": ["Journal Article", "Research Support, N.I.H., Extramural"],
    "publication_date": "2024-03-15",
    "pubmed_url": "https://pubmed.ncbi.nlm.nih.gov/38012345",
    "pdf_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10123456/pdf/",
    "source": "full_text",
    "chunks": [
        { "idx": 0, "text": "...", "tokens": 487 },
        { "idx": 1, "text": "...", "tokens": 512 }
    ]
}

pmcid, doi, and pdf_url may be null. source is "full_text" when PMC OA NXML was parsed and "abstract" when the Actor fell back to the abstract.

Data table

Field	Type	Description
`pmid`	string	PubMed ID (primary identifier)
`pmcid`	string?	PubMed Central ID (present for PMC OA papers)
`doi`	string?	DOI (when provided by PubMed)
`title`	string	Paper title
`abstract`	string	Abstract as returned by E-utilities
`authors`	string[]	Author display names, order preserved
`journal`	string	Journal title
`mesh_terms`	string[]	MeSH Descriptor headings assigned by NLM
`article_types`	string[]	PubMed publication types
`publication_date`	ISO date	`YYYY-MM-DD`
`pubmed_url`	string	Link to the PubMed citation
`pdf_url`	string?	PMC PDF link, when available
`source`	`"full_text"` | `"abstract"`	Text origin
`chunks`	Chunk[]	Token-aware chunks for RAG
`chunks[].idx`	number	0-indexed position
`chunks[].text`	string	Chunk text
`chunks[].tokens`	number	Token count (≤ 512)
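
Most vector databases expect one record per chunk rather than one per paper. The sketch below flattens a dataset item shaped like the fields above into per-chunk records with paper-level metadata attached. The function name `to_vector_records`, the `pmid:idx` ID scheme, and the choice of metadata fields are illustrative assumptions, and the sample item reuses the placeholder values from the example output.

```python
from typing import Dict, Iterator


def to_vector_records(paper: Dict) -> Iterator[Dict]:
    """Yield one record per chunk, copying paper-level metadata onto each."""
    metadata = {
        "pmid": paper["pmid"],
        "pmcid": paper.get("pmcid"),  # may be None
        "doi": paper.get("doi"),      # may be None
        "title": paper["title"],
        "journal": paper["journal"],
        "publication_date": paper["publication_date"],
        "source": paper["source"],    # "full_text" or "abstract"
    }
    for chunk in paper["chunks"]:
        yield {
            "id": f"{paper['pmid']}:{chunk['idx']}",  # hypothetical ID scheme
            "text": chunk["text"],
            "metadata": {**metadata, "chunk_idx": chunk["idx"], "tokens": chunk["tokens"]},
        }


sample_paper = {
    "pmid": "38012345",
    "pmcid": "PMC10123456",
    "doi": None,
    "title": "Durable protection against SARS-CoV-2 variants after boosting",
    "journal": "Nature",
    "publication_date": "2024-03-15",
    "source": "full_text",
    "chunks": [
        {"idx": 0, "text": "first chunk", "tokens": 487},
        {"idx": 1, "text": "second chunk", "tokens": 512},
    ],
}
records = list(to_vector_records(sample_paper))
print([r["id"] for r in records])
```

Keeping `source` in the metadata makes it possible to filter retrieval to full-text chunks only, or to weight abstract chunks differently.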

Pricing

$0.02 per paper (PPR, pay per result).

How much does it cost to scrape PubMed?

Volume	Estimated cost
100 papers	~$2
1,000 papers	~$20
10,000 papers	~$200
100,000 papers	~$2,000

No subscription. No minimum. You pay only for successful records.
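
The table above is straight multiplication, which a small helper can reproduce for any volume. This is a sketch that assumes the flat $0.02 rate applies at every volume and ignores any platform charges outside the per-result price; `estimate_cost` is a hypothetical name.

```python
from decimal import Decimal

PRICE_PER_PAPER = Decimal("0.02")  # USD per successful record, from the pricing above


def estimate_cost(papers: int) -> Decimal:
    """Return the estimated charge in USD for a given number of papers."""
    if papers < 0:
        raise ValueError("papers must be non-negative")
    return PRICE_PER_PAPER * papers


for volume in (100, 1_000, 10_000, 100_000):
    print(volume, estimate_cost(volume))
```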

Tips

Get an NCBI API key (free at https://www.ncbi.nlm.nih.gov/account/settings/) for 10 req/s. Turns a 55-min run into roughly 17 min for 1000 papers.
Use specific searchTerm syntax. PubMed query operators like [ti], [mh], [au], [dp] work inside the searchTerm field. Example: cancer[mh] AND immunotherapy[ti] AND 2024[dp].
Filter article types to cut noise. ["review", "meta_analysis"] removes editorials, letters, and news pieces.
source: "abstract" is expected for most records. Only PMC OA papers have full-text NXML. Plan your RAG index accordingly (abstracts are dense and still make good chunks).
Schedule weekly runs to keep an embedding index fresh on newly-published biomedical research.
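For large backfills, split the date range into month-sized windows and launch one Actor run per window. NCBI enforces its limit per API key (or per IP address without one), so parallel runs that share a key also share its request budget.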
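
The month-sized windows from the last tip can be generated rather than typed by hand. A minimal sketch, assuming each window becomes the dateFrom / dateTo pair of a separate run; `month_windows` is a hypothetical helper, not part of the Actor.

```python
import calendar
from datetime import date
from typing import Iterator, Tuple


def month_windows(start: date, end: date) -> Iterator[Tuple[str, str]]:
    """Yield inclusive (dateFrom, dateTo) pairs, one per calendar month."""
    current = start
    while current <= end:
        last_day = calendar.monthrange(current.year, current.month)[1]
        window_end = min(date(current.year, current.month, last_day), end)
        yield current.isoformat(), window_end.isoformat()
        # Move to the first day of the next month.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)


windows = list(month_windows(date(2023, 11, 15), date(2024, 2, 10)))
for date_from, date_to in windows:
    print({"dateFrom": date_from, "dateTo": date_to})
```

Because both bounds are inclusive, consecutive windows end on the last day of one month and start on the first day of the next, so no publication date is covered twice.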

FAQ and limitations

Is scraping PubMed legal?

PubMed's E-utilities API is explicitly designed for programmatic access (see NCBI's E-utilities documentation). This Actor respects NCBI's stated rate limits (3 req/s without an API key, 10 with) and fetches only publicly available content. Full text comes from the PMC Open Access subset, whose articles are released under Creative Commons or similar licenses that permit reuse; the exact terms vary by article.

What's not in v1?

Other databases (bioRxiv, medRxiv, ClinicalTrials.gov, Semantic Scholar). arXiv has its own companion Actor in this portfolio.
Species / affiliation / journal / language filters are deferred to a v2 premium tier.
Citation graph extraction.
PDF OCR for non-PMC papers (they fall back to abstract).
Section-aware splitting (Abstract / Methods / Results as separate records).
MeSH hierarchy expansion (searching Cancer does not auto-include child terms).
Real-time streaming or incremental sync.

Rate limits

Without API key: 3 req/s (~55 min for 1000 papers)
With API key: 10 req/s (~17 min for 1000 papers)

Get a free key at https://www.ncbi.nlm.nih.gov/account/settings/.
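
For planning other run sizes, the two runtime figures above can be turned into a rough estimator. Both figures are consistent with about 10 rate-limited requests' worth of time per paper; that number is inferred from the quoted runtimes, not a documented property of the Actor, so treat `requests_per_paper` as an assumption to tune against real runs.

```python
def estimate_minutes(papers: int, requests_per_second: float,
                     requests_per_paper: float = 10.0) -> float:
    """Estimate run time in minutes from the NCBI request rate.

    requests_per_paper is an inferred figure (see the note above), not a
    documented value.
    """
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return papers * requests_per_paper / requests_per_second / 60


without_key = estimate_minutes(1000, 3)   # about 55.6 minutes
with_key = estimate_minutes(1000, 10)     # about 16.7 minutes
print(f"{without_key:.1f} min without key, {with_key:.1f} min with key")
```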

Support

Found a bug or want a feature? Use the Issues tab on the Actor's page. Custom requirements (other databases, different chunking, section-aware splitting)? Reach out via the Actor's Support link.

Disclaimer

Output metadata is from PubMed's public E-utilities API. Full text, when available, comes from the PMC Open Access archive, which is licensed under CC-BY or equivalent open licenses. Check individual papers' licenses on their PMC page before downstream commercial use.

Built with Apify + Crawlee + TypeScript. Part of the actorstack portfolio. Sister Actor: arXiv Scraper for RAG.
