PubMed Scraper for RAG: Papers as Chunked JSON

Pricing

from $20.00 / 1,000 papers

Scrape PubMed citations by search term, MeSH, and article type. Returns RAG-ready JSON with full-text chunks from PMC Open Access (cl100k_base, 512/50) and abstract fallback. Drop-in for LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector. Skip GROBID / Pubmed Parser. $0.02 per paper.

Developer

Devansh Tiwari

Maintained by Community

Scrape PubMed biomedical research into RAG-ready JSON in one call. Pulls papers by free-text search, MeSH ontology, article type, and date range. Fetches PMC Open Access full-text when available and falls back to the abstract otherwise. Returns fixed-token chunks (cl100k_base, 512 tokens / 50 overlap) with full metadata, ready to drop into LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector, or Chroma. Built for AI training data teams, pharma/biotech researchers, and anyone who's written "just one more PubMed Parser patch" at 2 AM.

What does PubMed Scraper for RAG do?

This Apify Actor scrapes PubMed citations matching your search criteria, fetches PMC Open Access full-text where available, and splits the resulting plain text into tokenizer-aware chunks (512 tokens, 50-token overlap, tiktoken cl100k_base) ready to embed or feed into a RAG index.

Each output record contains clean metadata (PMID, PMCID, DOI, title, authors, journal, MeSH terms, article types, publication date) and a chunks array of { idx, text, tokens } ready for direct ingestion into a vector database.
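The fixed-token chunking described above can be sketched as a sliding window over a token list. This is a minimal illustration, not the Actor's actual code: the name chunk_tokens and the whitespace stand-in tokenizer are assumptions made here, while the Actor itself is described as tokenizing with tiktoken cl100k_base.

```python
from typing import Callable, Dict, List


def chunk_tokens(
    tokens: List[str],
    detokenize: Callable[[List[str]], str] = " ".join,
    size: int = 512,
    overlap: int = 50,
) -> List[Dict[str, object]]:
    """Split a token sequence into windows of `size` tokens that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks: List[Dict[str, object]] = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + size]
        chunks.append(
            {"idx": len(chunks), "text": detokenize(window), "tokens": len(window)}
        )
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Stand-in tokenizer (assumption): one "token" per whitespace-separated word.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words)
print([c["tokens"] for c in chunks])
```

With 1,000 stand-in tokens this yields three chunks; the second starts 462 tokens into the text, so it repeats the last 50 tokens of the first.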

Try it in the Apify Console. Fill in a search term (e.g. SARS-CoV-2 vaccines), optional MeSH terms or article types, a date range, a paper cap, and hit Start. Download results as JSON, CSV, or Excel.

Because the Actor runs on the Apify platform, you also get scheduled runs, HTTP API access, integrations with Zapier / n8n / Make, proxy rotation, monitoring, and alerts. No infrastructure to run yourself.

Why use PubMed Scraper for RAG?

  • Skip the PubMed XML parsing grind. Clean Paper records, not raw E-utilities <PubmedArticle> trees.
  • PMC Open Access full-text when it exists. Around 20-30% of PubMed articles are in PMC OA; this Actor grabs NXML and strips it to readable prose. Abstract fallback otherwise.
  • MeSH + article-type filtering built in. Search by MeSH ontology (e.g. Neoplasms) or limit to Review / Meta-Analysis / Clinical Trial.
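  • Pre-chunked for RAG. Chunks are sized with tiktoken cl100k_base, so token counts are exact for OpenAI text-embedding-3 models. The chunk text is plain prose, so it can also be embedded with Cohere or BGE/E5/nomic models, or passed as context to LLMs such as Claude; those tokenizers will count tokens somewhat differently.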
  • Vector-DB neutral. Drop into Qdrant, Pinecone, Weaviate, pgvector (Supabase / Neon), Chroma, Milvus without reformatting.
  • Framework-ready. Works with LangChain, LlamaIndex, Haystack, LangGraph.
  • Respects NCBI rate limits. 3 requests per second without an API key, 10 per second with one.
  • Cheap. $0.02 per paper. A week of clinical-trial reviews (~200 papers) costs around $4.

How to use PubMed Scraper for RAG

  1. Open the Actor in Apify Console.
  2. Set searchTerm (free-text query; supports PubMed syntax like SARS-CoV-2[mesh] AND vaccine[ti]).
  3. (Optional) Set meshTerms to filter by Medical Subject Headings.
  4. (Optional) Set articleTypes to narrow to clinical_trial / review / meta_analysis / case_report / observational_study.
  5. Set dateFrom / dateTo in YYYY-MM-DD format.
  6. (Optional) Set ncbiApiKey for 10 req/s (free from NCBI). Without it you get 3 req/s.
  7. Click Start. Expect ~3 sec per paper without an API key, ~1 sec with.
  8. Download results from the Storage tab.

Input

Field | Type | Required | Default | Description
searchTerm | string | | SARS-CoV-2 vaccines | Free-text PubMed query; supports [mh], [ti], [pt] syntax
meshTerms | string[] | | [] | MeSH headings. Each entry is wrapped with [mh] and the entries are OR-combined
articleTypes | string[] | | [] | Enum values: clinical_trial / review / meta_analysis / case_report / observational_study
dateFrom | string (YYYY-MM-DD) | | 2024-01-01 | Inclusive publication-date lower bound
dateTo | string (YYYY-MM-DD) | | 2024-01-15 | Inclusive publication-date upper bound
maxPapers | integer | | 50 | Hard cap on papers returned (1 to 100000)
ncbiApiKey | string (secret) | | | Optional NCBI API key for 10 req/s instead of 3
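To make the filter semantics concrete, here is one way the input fields could be combined into a single PubMed query string. This is a sketch under assumptions: the build_query helper, the enum-to-publication-type mapping, the OR-combination of article types, and the AND between filter groups are guesses for illustration. Only the [mh] wrapping and OR-combination of meshTerms is stated in the table.

```python
from typing import List, Optional

# Assumed mapping from the Actor's enum values to PubMed publication types.
PUBLICATION_TYPES = {
    "clinical_trial": "Clinical Trial",
    "review": "Review",
    "meta_analysis": "Meta-Analysis",
    "case_report": "Case Reports",
    "observational_study": "Observational Study",
}


def build_query(
    search_term: str,
    mesh_terms: Optional[List[str]] = None,
    article_types: Optional[List[str]] = None,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
) -> str:
    """Combine the filters into one PubMed query (hypothetical helper)."""
    parts = [f"({search_term})"]
    if mesh_terms:
        # Each heading is tagged with [mh]; headings are OR-combined.
        mesh = " OR ".join(f'"{term}"[mh]' for term in mesh_terms)
        parts.append(f"({mesh})")
    if article_types:
        # Assumption: article types are OR-combined as well.
        types = " OR ".join(f'"{PUBLICATION_TYPES[t]}"[pt]' for t in article_types)
        parts.append(f"({types})")
    if date_from and date_to:
        # PubMed date ranges use YYYY/MM/DD:YYYY/MM/DD with the [dp] tag.
        low = date_from.replace("-", "/")
        high = date_to.replace("-", "/")
        parts.append(f"({low}:{high}[dp])")
    return " AND ".join(parts)


query = build_query(
    "SARS-CoV-2 vaccines",
    article_types=["review", "meta_analysis"],
    date_from="2024-01-01",
    date_to="2024-01-31",
)
print(query)
```

Pasting the printed string into the PubMed search box is a quick way to check how many papers a given filter set is likely to match before starting a run.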

Example input:

{
  "searchTerm": "SARS-CoV-2 vaccines",
  "articleTypes": ["review", "meta_analysis"],
  "dateFrom": "2024-01-01",
  "dateTo": "2024-01-31",
  "maxPapers": 100
}
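The same input can be sent over Apify's HTTP API. The sketch below only builds the request with the standard library and leaves the network call commented out. ACTOR_ID and API_TOKEN are placeholders, not this Actor's real identifier, and build_run_request is a hypothetical helper name; the run-sync-get-dataset-items endpoint and Bearer-token header are part of Apify's public API.

```python
import json
import urllib.request

ACTOR_ID = "username~pubmed-scraper-for-rag"  # placeholder, replace with the real Actor ID
API_TOKEN = "YOUR_APIFY_TOKEN"  # placeholder, replace with your Apify API token


def build_run_request(
    actor_input: dict, actor_id: str = ACTOR_ID, token: str = API_TOKEN
) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    {
        "searchTerm": "SARS-CoV-2 vaccines",
        "articleTypes": ["review", "meta_analysis"],
        "dateFrom": "2024-01-01",
        "dateTo": "2024-01-31",
        "maxPapers": 100,
    }
)

# To actually start the run (needs network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     papers = json.load(response)
```

Synchronous endpoints are subject to a time limit, so this shape suits small jobs; larger backfills are probably better started asynchronously and read from the dataset afterwards.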

Output

Each paper becomes one dataset item. You can download the dataset in JSON, HTML, CSV, or Excel.

{
  "pmid": "38012345",
  "pmcid": "PMC10123456",
  "doi": "10.1038/s41586-024-01234-5",
  "title": "Durable protection against SARS-CoV-2 variants after boosting",
  "abstract": "...",
  "authors": ["Jane Smith", "John Doe"],
  "journal": "Nature",
  "mesh_terms": ["SARS-CoV-2", "Vaccines", "Immunization, Secondary"],
  "article_types": ["Journal Article", "Research Support, N.I.H., Extramural"],
  "publication_date": "2024-03-15",
  "pubmed_url": "https://pubmed.ncbi.nlm.nih.gov/38012345",
  "pdf_url": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10123456/pdf/",
  "source": "full_text",
  "chunks": [
    { "idx": 0, "text": "...", "tokens": 487 },
    { "idx": 1, "text": "...", "tokens": 512 }
  ]
}

pmcid, doi, and pdf_url may be null. source is "full_text" when PMC OA NXML was parsed and "abstract" when the Actor fell back to the abstract.
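Most vector stores want one entry per chunk rather than one per paper. A minimal sketch of that flattening step follows; the to_documents helper, the "pmid-idx" ID scheme, and the choice of metadata fields are illustrative assumptions, not something the Actor prescribes.

```python
from typing import Any, Dict, List

# Paper-level fields copied onto every chunk (an arbitrary selection).
METADATA_FIELDS = (
    "pmid", "pmcid", "doi", "title", "journal", "publication_date", "source",
)


def to_documents(paper: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one paper record into one document per chunk (hypothetical helper)."""
    metadata = {field: paper.get(field) for field in METADATA_FIELDS}
    documents = []
    for chunk in paper["chunks"]:
        documents.append(
            {
                "id": f"{paper['pmid']}-{chunk['idx']}",
                "text": chunk["text"],
                "metadata": {
                    **metadata,
                    "chunk_idx": chunk["idx"],
                    "tokens": chunk["tokens"],
                },
            }
        )
    return documents


paper = {
    "pmid": "38012345",
    "pmcid": None,  # nullable fields are passed through as None
    "doi": None,
    "title": "Durable protection against SARS-CoV-2 variants after boosting",
    "journal": "Nature",
    "publication_date": "2024-03-15",
    "source": "abstract",
    "chunks": [{"idx": 0, "text": "...", "tokens": 487}],
}
documents = to_documents(paper)
print(documents[0]["id"])
```

Some stores restrict ID formats (for example to integers or UUIDs), so the string IDs here may need to be hashed or mapped before upserting.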

Data table

Field | Type | Description
pmid | string | PubMed ID (primary identifier)
pmcid | string? | PubMed Central ID (present for PMC OA papers)
doi | string? | DOI (when provided by PubMed)
title | string | Paper title
abstract | string | Abstract as returned by E-utilities
authors | string[] | Author display names, order preserved
journal | string | Journal title
mesh_terms | string[] | MeSH Descriptor headings assigned by NLM
article_types | string[] | PubMed publication types
publication_date | ISO date | YYYY-MM-DD
pubmed_url | string | Link to the PubMed citation
pdf_url | string? | PMC PDF link, when available
source | "full_text" | "abstract" | Text origin
chunks | Chunk[] | Token-aware chunks for RAG
chunks[].idx | number | 0-indexed position
chunks[].text | string | Chunk text
chunks[].tokens | number | Token count (≤ 512)
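Before indexing, it can be worth checking that downloaded records match the field descriptions above. The following is a small sanity check written for illustration; validate_paper is a hypothetical helper, and the rules it enforces are only those implied by the table (allowed source values, 0-indexed chunk positions, token counts of at most 512).

```python
from typing import Any, Dict, List

MAX_TOKENS = 512
ALLOWED_SOURCES = {"full_text", "abstract"}


def validate_paper(paper: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems: List[str] = []
    if not paper.get("pmid"):
        problems.append("missing pmid")
    if paper.get("source") not in ALLOWED_SOURCES:
        problems.append(f"unexpected source: {paper.get('source')!r}")
    for position, chunk in enumerate(paper.get("chunks") or []):
        if chunk.get("idx") != position:
            problems.append(
                f"chunk at position {position} has idx {chunk.get('idx')!r}"
            )
        tokens = chunk.get("tokens")
        if not isinstance(tokens, int) or not 0 < tokens <= MAX_TOKENS:
            problems.append(f"chunk {position} has token count {tokens!r}")
    return problems


good = {
    "pmid": "38012345",
    "source": "full_text",
    "chunks": [
        {"idx": 0, "text": "...", "tokens": 487},
        {"idx": 1, "text": "...", "tokens": 512},
    ],
}
bad = {"pmid": "", "source": "pdf", "chunks": [{"idx": 1, "text": "...", "tokens": 600}]}
print(validate_paper(good), len(validate_paper(bad)))
```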

Pricing

$0.02 per paper (PPR, pay per result).

How much does it cost to scrape PubMed?

Volume | Estimated cost
100 papers | ~$2
1,000 papers | ~$20
10,000 papers | ~$200
100,000 papers | ~$2,000

No subscription. No minimum. You pay only for successful records.

Tips

  • Get an NCBI API key (free at https://www.ncbi.nlm.nih.gov/account/settings/) for 10 req/s. For 1,000 papers, that cuts a roughly 55-minute run to about 17 minutes.
  • Use specific searchTerm syntax. PubMed query operators like [ti], [mh], [au], [dp] work inside the searchTerm field. Example: cancer[mh] AND immunotherapy[ti] AND 2024[dp].
  • Filter article types to cut noise. ["review", "meta_analysis"] removes editorials, letters, and news pieces.
  • source: "abstract" is expected for most records. Only PMC OA papers have full-text NXML. Plan your RAG index accordingly (abstracts are dense and still make good chunks).
  • Schedule weekly runs to keep an embedding index fresh on newly-published biomedical research.
  • For large backfills, split the date range into month-sized windows and run each window as a separate Actor run. Runs that share one NCBI API key also share its rate limit, so parallel runs help most when each uses its own key.
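The month-sized windows suggested in the last tip can be generated with a few lines of standard-library Python. This is a sketch; month_windows is a hypothetical helper, and each returned pair is meant to be used as the dateFrom / dateTo of one Actor run.

```python
import calendar
from datetime import date
from typing import List, Tuple


def month_windows(date_from: str, date_to: str) -> List[Tuple[str, str]]:
    """Split an inclusive YYYY-MM-DD range into per-calendar-month windows."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    if start > end:
        raise ValueError("date_from must not be later than date_to")
    windows: List[Tuple[str, str]] = []
    cursor = start
    while cursor <= end:
        last_day = calendar.monthrange(cursor.year, cursor.month)[1]
        # Clip the final window to the requested end date.
        window_end = min(date(cursor.year, cursor.month, last_day), end)
        windows.append((cursor.isoformat(), window_end.isoformat()))
        # Jump to the first day of the following month.
        if cursor.month == 12:
            cursor = date(cursor.year + 1, 1, 1)
        else:
            cursor = date(cursor.year, cursor.month + 1, 1)
    return windows


windows = month_windows("2024-01-15", "2024-03-10")
for window_from, window_to in windows:
    print({"dateFrom": window_from, "dateTo": window_to})
```

Because both bounds are inclusive and the windows do not overlap, a paper should land in exactly one window, provided its publication date is resolved to a single day.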

FAQ and limitations

PubMed's E-utilities API is explicitly designed for programmatic access (see the E-utilities documentation). This Actor respects NCBI's stated rate limits (3 req/s without an API key, 10 req/s with one) and fetches only publicly available content. Full text comes from the PMC Open Access subset, whose articles are released under Creative Commons or similar licenses; the exact terms vary by article.

What's not in v1?

  • Other databases (bioRxiv, medRxiv, ClinicalTrials.gov, Semantic Scholar). arXiv has its own companion Actor in this portfolio.
  • Species / affiliation / journal / language filters are deferred to a v2 premium tier.
  • Citation graph extraction.
  • PDF OCR for non-PMC papers (they fall back to abstract).
  • Section-aware splitting (Abstract / Methods / Results as separate records).
  • MeSH hierarchy expansion (searching Cancer does not auto-include child terms).
  • Real-time streaming or incremental sync.

Rate limits

  • Without API key: 3 req/s (~55 min for 1000 papers)
  • With API key: 10 req/s (~17 min for 1000 papers)

Get a free key at https://www.ncbi.nlm.nih.gov/account/settings/.
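The run-time figures above scale roughly linearly with the number of papers, which makes a back-of-the-envelope estimate easy. The per-paper timings in this sketch are derived from the figures quoted above (about 55 and 17 minutes per 1,000 papers); estimate_minutes is a hypothetical helper, and real runs will vary with network conditions and the share of full-text papers.

```python
# Seconds per paper implied by ~55 min and ~17 min per 1,000 papers.
SECONDS_PER_PAPER = {False: 3.3, True: 1.0}


def estimate_minutes(papers: int, has_api_key: bool) -> float:
    """Rough run time in minutes for a given number of papers."""
    return papers * SECONDS_PER_PAPER[has_api_key] / 60


for has_key in (False, True):
    label = "with API key" if has_key else "without API key"
    print(f"1,000 papers {label}: ~{estimate_minutes(1000, has_key):.0f} min")
```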

Support

Found a bug or want a feature? Use the Issues tab on the Actor's page. Custom requirements (other databases, different chunking, section-aware splitting)? Reach out via the Actor's Support link.

Disclaimer

Output metadata comes from PubMed's public E-utilities API. Full text, when available, comes from the PMC Open Access subset, whose articles carry Creative Commons or similar open licenses. Terms vary by article and some restrict commercial use, so check each paper's license on its PMC page before downstream commercial use.


Built with Apify + Crawlee + TypeScript. Part of the actorstack portfolio. Sister Actor: arXiv Scraper for RAG.