PubMed Scraper for RAG: Papers as Chunked JSON
Pricing
from $20.00 / 1,000 papers
Scrape PubMed citations by search term, MeSH, and article type. Returns RAG-ready JSON with full-text chunks from PMC Open Access (cl100k_base, 512/50) and abstract fallback. Drop-in for LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector. Skip GROBID / Pubmed Parser. $0.02 per paper.
Developer
Devansh Tiwari
5 days ago
Last modified
Scrape PubMed biomedical research into RAG-ready JSON in one call. Pulls papers by free-text search, MeSH ontology, article type, and date range. Fetches PMC Open Access full-text when available and falls back to the abstract otherwise. Returns fixed-token chunks (cl100k_base, 512 tokens / 50 overlap) with full metadata, ready to drop into LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector, or Chroma. Built for AI training data teams, pharma/biotech researchers, and anyone who's written "just one more PubMed Parser patch" at 2 AM.
What does PubMed Scraper for RAG do?
This Apify Actor scrapes PubMed citations matching your search criteria, fetches PMC Open Access full-text where available, and splits the resulting plain text into tokenizer-aware chunks (512 tokens, 50-token overlap, tiktoken cl100k_base) ready to embed or feed into a RAG index.
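As a rough illustration of what "512 tokens, 50-token overlap" means, here is a minimal sliding-window sketch over a list of token IDs. It is not the Actor's actual splitter: the function name `chunk_token_ids` is hypothetical, and the input is a stand-in list of integers rather than real `cl100k_base` output (with tiktoken you would encode the text first). The sample output further down shows a first chunk of 487 tokens, so the real splitter may also respect text boundaries.

```python
from typing import List, Sequence


def chunk_token_ids(
    token_ids: Sequence[int], size: int = 512, overlap: int = 50
) -> List[List[int]]:
    """Split token IDs into windows of at most `size` tokens.

    Each window shares `overlap` tokens with the previous one.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(list(token_ids[start:start + size]))
        if start + size >= len(token_ids):
            break  # this window already reached the end of the input
    return chunks


# Stand-in token IDs (assumption): 1,200 integers instead of real tokenizer output.
ids = list(range(1200))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # [512, 512, 276]
```

With the defaults, each window starts 462 tokens after the previous one, so consecutive chunks repeat 50 tokens of context.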
Each output record contains clean metadata (PMID, PMCID, DOI, title, authors, journal, MeSH terms, article types, publication date) and a chunks array of { idx, text, tokens } ready for direct ingestion into a vector database.
Try it in the Apify Console. Fill in a search term (e.g. SARS-CoV-2 vaccines), optional MeSH terms or article types, a date range, a paper cap, and hit Start. Download results as JSON, CSV, or Excel.
Built on the Apify platform, you also get: scheduled runs, HTTP API access, integrations with Zapier / n8n / Make, proxy rotation, monitoring, and alerts. No infrastructure to run yourself.
Why use PubMed Scraper for RAG?
- Skip the PubMed XML parsing grind. Clean Paper records, not raw E-utilities `<PubmedArticle>` trees.
- PMC Open Access full-text when it exists. Around 20-30% of PubMed articles are in PMC OA; this Actor grabs NXML and strips it to readable prose. Abstract fallback otherwise.
- MeSH + article-type filtering built in. Search by MeSH ontology (e.g. `Neoplasms`) or limit to `Review` / `Meta-Analysis` / `Clinical Trial`.
- Pre-chunked for RAG. `tiktoken cl100k_base` tokenization, compatible with OpenAI `text-embedding-3`, Claude, Cohere, and most BGE/E5/nomic embedding models.
- Vector-DB neutral. Drop into Qdrant, Pinecone, Weaviate, pgvector (Supabase / Neon), Chroma, Milvus without reformatting.
- Framework-ready. Works with LangChain, LlamaIndex, Haystack, LangGraph.
- Respects NCBI rate limits. 3 requests per second without an API key, 10 per second with one.
- Cheap. $0.02 per paper. A week of clinical-trial reviews (~200 papers) costs around $4.
How to use PubMed Scraper for RAG
- Open the Actor in Apify Console.
- Set `searchTerm` (free-text query; supports PubMed syntax like `SARS-CoV-2[mesh] AND vaccine[ti]`).
- (Optional) Set `meshTerms` to filter by Medical Subject Headings.
- (Optional) Set `articleTypes` to narrow to `clinical_trial` / `review` / `meta_analysis` / `case_report` / `observational_study`.
- Set `dateFrom` / `dateTo` in `YYYY-MM-DD` format.
- (Optional) Set `ncbiApiKey` for 10 req/s (free from NCBI). Without it you get 3 req/s.
- Click Start. Expect ~3 sec per paper without an API key, ~1 sec with.
- Download results from the Storage tab.
Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `searchTerm` | string | | `SARS-CoV-2 vaccines` | Free-text PubMed query, supports `[mh]`, `[ti]`, `[pt]` syntax |
| `meshTerms` | string[] | | `[]` | MeSH headings. Each entry wrapped with `[mh]`, OR-combined |
| `articleTypes` | string[] | | `[]` | Enum values: `clinical_trial` / `review` / `meta_analysis` / `case_report` / `observational_study` |
| `dateFrom` | string (YYYY-MM-DD) | | `2024-01-01` | Inclusive publication-date lower bound |
| `dateTo` | string (YYYY-MM-DD) | | `2024-01-15` | Inclusive publication-date upper bound |
| `maxPapers` | integer | | `50` | Hard cap on papers returned (1 to 100000) |
| `ncbiApiKey` | string (secret) | | | Optional NCBI API key for 10 req/s instead of 3 |
Example input:
{
  "searchTerm": "SARS-CoV-2 vaccines",
  "articleTypes": ["review", "meta_analysis"],
  "dateFrom": "2024-01-01",
  "dateTo": "2024-01-31",
  "maxPapers": 100
}
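To make the filter semantics concrete, here is a sketch of how an input like this could be combined into a single PubMed query string. The exact query the Actor sends is not documented here, so treat this as an assumption: `build_query` and the `PUBLICATION_TYPES` mapping are hypothetical names, and only the "each MeSH entry wrapped with `[mh]`, OR-combined" rule comes from the input table.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from the Actor's enum values to PubMed publication types.
PUBLICATION_TYPES: Dict[str, str] = {
    "clinical_trial": "Clinical Trial",
    "review": "Review",
    "meta_analysis": "Meta-Analysis",
    "case_report": "Case Reports",
    "observational_study": "Observational Study",
}


def build_query(
    search_term: str,
    mesh_terms: Optional[List[str]] = None,
    article_types: Optional[List[str]] = None,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
) -> str:
    """Combine the input fields into one AND-joined PubMed query (sketch)."""
    parts = [f"({search_term})"]
    if mesh_terms:
        # Each MeSH entry is wrapped with [mh]; entries are OR-combined.
        parts.append("(" + " OR ".join(f'"{m}"[mh]' for m in mesh_terms) + ")")
    if article_types:
        tags = [f'"{PUBLICATION_TYPES[t]}"[pt]' for t in article_types]
        parts.append("(" + " OR ".join(tags) + ")")
    if date_from and date_to:
        # PubMed date ranges use slashes: YYYY/MM/DD:YYYY/MM/DD[dp]
        start = date_from.replace("-", "/")
        end = date_to.replace("-", "/")
        parts.append(f"{start}:{end}[dp]")
    return " AND ".join(parts)


query = build_query(
    "SARS-CoV-2 vaccines",
    article_types=["review", "meta_analysis"],
    date_from="2024-01-01",
    date_to="2024-01-31",
)
print(query)
```

Pasting the printed query into the PubMed search box is a quick way to sanity-check a filter combination before paying for a run.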
Output
Each paper becomes one dataset item. You can download the dataset in JSON, HTML, CSV, or Excel.
{
  "pmid": "38012345",
  "pmcid": "PMC10123456",
  "doi": "10.1038/s41586-024-01234-5",
  "title": "Durable protection against SARS-CoV-2 variants after boosting",
  "abstract": "...",
  "authors": ["Jane Smith", "John Doe"],
  "journal": "Nature",
  "mesh_terms": ["SARS-CoV-2", "Vaccines", "Immunization, Secondary"],
  "article_types": ["Journal Article", "Research Support, N.I.H., Extramural"],
  "publication_date": "2024-03-15",
  "pubmed_url": "https://pubmed.ncbi.nlm.nih.gov/38012345",
  "pdf_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10123456/pdf/",
  "source": "full_text",
  "chunks": [
    { "idx": 0, "text": "...", "tokens": 487 },
    { "idx": 1, "text": "...", "tokens": 512 }
  ]
}
`pmcid`, `doi`, and `pdf_url` may be null. `source` is `"full_text"` when PMC OA NXML was parsed, `"abstract"` when it fell back.
Data table
| Field | Type | Description |
|---|---|---|
| `pmid` | string | PubMed ID (primary identifier) |
| `pmcid` | string? | PubMed Central ID (present for PMC OA papers) |
| `doi` | string? | DOI (when provided by PubMed) |
| `title` | string | Paper title |
| `abstract` | string | Abstract as returned by E-utilities |
| `authors` | string[] | Author display names, order preserved |
| `journal` | string | Journal title |
| `mesh_terms` | string[] | MeSH Descriptor headings assigned by NLM |
| `article_types` | string[] | PubMed publication types |
| `publication_date` | ISO date | YYYY-MM-DD |
| `pubmed_url` | string | Link to the PubMed citation |
| `pdf_url` | string? | PMC PDF link, when available |
| `source` | `"full_text" \| "abstract"` | Text origin |
| `chunks` | Chunk[] | Token-aware chunks for RAG |
| `chunks[].idx` | number | 0-indexed position |
| `chunks[].text` | string | Chunk text |
| `chunks[].tokens` | number | Token count (≤ 512) |
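Most vector stores want one record per chunk rather than one per paper. The sketch below flattens a dataset item into per-chunk documents using only the fields listed above. The function name `to_documents`, the `"<pmid>-<idx>"` ID scheme, and the choice of metadata keys are assumptions for illustration; the sample record uses placeholder chunk text and token counts.

```python
from typing import Any, Dict, Iterator


def to_documents(paper: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one vector-store document per chunk, carrying the paper's metadata."""
    metadata = {
        "pmid": paper["pmid"],
        "pmcid": paper.get("pmcid"),  # may be None
        "doi": paper.get("doi"),  # may be None
        "title": paper["title"],
        "journal": paper["journal"],
        "publication_date": paper["publication_date"],
        "source": paper["source"],  # "full_text" or "abstract"
    }
    for chunk in paper["chunks"]:
        yield {
            # Hypothetical ID scheme: PMID plus chunk index.
            "id": f"{paper['pmid']}-{chunk['idx']}",
            "text": chunk["text"],
            "metadata": {**metadata, "chunk_idx": chunk["idx"], "tokens": chunk["tokens"]},
        }


# Placeholder record shaped like the output example (chunk values are made up).
paper = {
    "pmid": "38012345",
    "pmcid": None,
    "doi": "10.1038/s41586-024-01234-5",
    "title": "Durable protection against SARS-CoV-2 variants after boosting",
    "journal": "Nature",
    "publication_date": "2024-03-15",
    "source": "abstract",
    "chunks": [
        {"idx": 0, "text": "first chunk", "tokens": 2},
        {"idx": 1, "text": "second chunk", "tokens": 2},
    ],
}
docs = list(to_documents(paper))
print(docs[0]["id"])
```

Keeping `source` in the metadata lets you filter retrieval to full-text chunks only, or weight abstract chunks differently, without re-ingesting.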
Pricing
$0.02 per paper (PPR, pay per result).
How much does it cost to scrape PubMed?
| Volume | Estimated cost |
|---|---|
| 100 papers | ~$2 |
| 1,000 papers | ~$20 |
| 10,000 papers | ~$200 |
| 100,000 papers | ~$2,000 |
No subscription. No minimum. You pay only for successful records.
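The table is plain linear arithmetic at $0.02 per paper, so a budget check before a large run can be a one-liner. A small sketch (the helper name `estimate_cost` is hypothetical, and it ignores any other platform charges):

```python
PRICE_PER_PAPER_USD = 0.02  # pay-per-result price stated above


def estimate_cost(papers: int) -> float:
    """Estimated cost in USD for a given number of successful records."""
    return round(papers * PRICE_PER_PAPER_USD, 2)


for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} papers -> ${estimate_cost(n):,.2f}")
```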
Tips
- Get an NCBI API key (free at https://www.ncbi.nlm.nih.gov/account/settings/) for 10 req/s. Turns a 55-min run into roughly 17 min for 1000 papers.
- Use specific `searchTerm` syntax. PubMed query operators like `[ti]`, `[mh]`, `[au]`, `[dp]` work inside the `searchTerm` field. Example: `cancer[mh] AND immunotherapy[ti] AND 2024[dp]`.
- Filter article types to cut noise. `["review", "meta_analysis"]` removes editorials, letters, and news pieces.
- `source: "abstract"` is expected for most records. Only PMC OA papers have full-text NXML. Plan your RAG index accordingly (abstracts are dense and still make good chunks).
- Schedule weekly runs to keep an embedding index fresh on newly-published biomedical research.
- For large backfills, split into month-sized windows and run in parallel (separate Actor runs) to stay under the rate limit cleanly.
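For the backfill tip, here is one way to generate month-sized `dateFrom` / `dateTo` pairs to feed into separate runs. The helper `month_windows` is a hypothetical name; it relies on both bounds being inclusive, as the input table states.

```python
import calendar
from datetime import date
from typing import List, Tuple


def month_windows(start: date, end: date) -> List[Tuple[str, str]]:
    """Split [start, end] into inclusive month-sized (dateFrom, dateTo) pairs."""
    windows: List[Tuple[str, str]] = []
    cursor = start
    while cursor <= end:
        last_day = calendar.monthrange(cursor.year, cursor.month)[1]
        month_end = date(cursor.year, cursor.month, last_day)
        window_end = min(month_end, end)
        windows.append((cursor.isoformat(), window_end.isoformat()))
        # Jump to the first day of the next month (rolling over the year in December).
        cursor = date(cursor.year + (cursor.month == 12), cursor.month % 12 + 1, 1)
    return windows


for date_from, date_to in month_windows(date(2024, 1, 15), date(2024, 3, 10)):
    print({"dateFrom": date_from, "dateTo": date_to})
```

Because the windows do not overlap, the same paper should not land in two runs, though deduplicating on `pmid` after merging is still a cheap safety net.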
FAQ and limitations
Is scraping PubMed legal?
PubMed's E-utilities API is explicitly designed for programmatic access (E-utilities docs). This Actor respects NCBI's stated rate limits (3 req/s without API key, 10 with) and fetches only publicly-available content. Full-text comes from the PMC Open Access subset, which is licensed for reuse under CC-BY or equivalent.
What's not in v1?
- Other databases (bioRxiv, medRxiv, ClinicalTrials.gov, Semantic Scholar). arXiv has its own companion Actor in this portfolio.
- Species / affiliation / journal / language filters are deferred to a v2 premium tier.
- Citation graph extraction.
- PDF OCR for non-PMC papers (they fall back to abstract).
- Section-aware splitting (Abstract / Methods / Results as separate records).
- MeSH hierarchy expansion (searching `Cancer` does not auto-include child terms).
- Real-time streaming or incremental sync.
Rate limits
- Without API key: 3 req/s (~55 min for 1000 papers)
- With API key: 10 req/s (~17 min for 1000 papers)
Get a free key at https://www.ncbi.nlm.nih.gov/account/settings/.
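To size a run ahead of time, the two figures above can be scaled linearly. This is a back-of-the-envelope sketch, assuming run time grows in proportion to paper count. The helper name `estimate_minutes` is hypothetical, and real runs will vary with the share of papers that have PMC full text.

```python
# Minutes per 1,000 papers, taken from the figures above.
MINUTES_PER_1000 = {False: 55.0, True: 17.0}  # key: whether an NCBI API key is set


def estimate_minutes(papers: int, has_api_key: bool) -> float:
    """Rough run time in minutes, assuming linear scaling with paper count."""
    return round(papers / 1000 * MINUTES_PER_1000[has_api_key], 1)


print(estimate_minutes(5000, has_api_key=False))  # 275.0
print(estimate_minutes(5000, has_api_key=True))   # 85.0
```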
Support
Found a bug or want a feature? Use the Issues tab on the Actor's page. Custom requirements (other databases, different chunking, section-aware splitting)? Reach out via the Actor's Support link.
Disclaimer
Output metadata is from PubMed's public E-utilities API. Full text, when available, comes from the PMC Open Access archive, which is licensed under CC-BY or equivalent open licenses. Check individual papers' licenses on their PMC page before downstream commercial use.
Built with Apify + Crawlee + TypeScript. Part of the actorstack portfolio. Sister Actor: arXiv Scraper for RAG.