Pricing

from $10.00 / 1,000 results

Unified Preprint Search

One Apify Actor, five sources: PubMed, arXiv, bioRxiv, medRxiv, chemRxiv.

Developer

Logical Vivacity

Last modified

3 months ago

Unified Preprint & Journal Search

Run a single keyword query and get a normalized, deduplicated dataset with one paper per row and, optionally, citation counts.

What it does

Accepts a Boolean keyword query (AND between groups, OR within groups).
Queries each selected source.
Normalizes results into a unified record schema.
Deduplicates across sources by DOI.
Optionally enriches each paper with a Semantic Scholar citation count.
Streams every paper to the dataset as it's processed (no in-memory accumulation).
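
The deduplication and streaming steps above can be sketched as a generator. This is a minimal illustration, not the Actor's actual code: the helper names `normalize_doi` and `dedupe_by_doi` are hypothetical, and the DOI normalization (lower-casing, stripping resolver prefixes) is an assumption about what "same DOI" means.

```python
from typing import Dict, Iterable, Iterator, Optional, Set


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip common prefixes (assumed normalization)."""
    if not doi:
        return None
    cleaned = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    return cleaned or None


def dedupe_by_doi(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield records one at a time, skipping any DOI that was already seen."""
    seen: Set[str] = set()
    for record in records:
        doi = normalize_doi(record.get("doi"))
        if doi is None:
            # No DOI means nothing to compare against, so the record passes through.
            yield record
            continue
        if doi in seen:
            continue
        seen.add(doi)
        yield record
```

Only the set of seen DOIs is held in memory here; each paper is handed on as soon as it is processed. Records without a DOI are never merged, which is why the same paper can appear once per source.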

Inputs

Field	Type	Default	Description
`keywords`	array of arrays of strings (or flat array)	— (required)	Nested groups: outer = AND, inner = OR. A flat list is treated as AND of single-term groups.
`sources`	array of enum	all 5	Subset of `pubmed`, `arxiv`, `biorxiv`, `medrxiv`, `chemrxiv`.
`startDate`	ISO date string	—	Drop records with `date` before this.
`endDate`	ISO date string	—	Drop records with `date` after this.
`maxResultsPerSource`	integer	`200`	Hard cap per source.
`enrichWithCitations`	boolean	`false`	Look up citation counts (slow, rate-limited).
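
The two accepted shapes of `keywords` can be reduced to one. The sketch below shows one way to do that; `normalize_keywords` is a hypothetical helper, and accepting a mix of strings and lists in the same array is an assumption that goes beyond what the table states.

```python
from typing import List, Sequence, Union

KeywordInput = Sequence[Union[str, Sequence[str]]]


def normalize_keywords(keywords: KeywordInput) -> List[List[str]]:
    """Return nested groups: the outer list is AND, each inner list is OR."""
    groups: List[List[str]] = []
    for item in keywords:
        if isinstance(item, str):
            # A bare string becomes a single-term group.
            groups.append([item])
        else:
            groups.append(list(item))
    return groups
```

With this, `["CRISPR", "zebrafish"]` becomes `[["CRISPR"], ["zebrafish"]]`, while an already nested input is returned unchanged.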

Example query

{
  "keywords": [
    ["COVID-19", "SARS-CoV-2"],
    ["deep learning", "machine learning"],
    ["medical imaging"]
  ],
  "sources": ["pubmed", "arxiv", "biorxiv"],
  "maxResultsPerSource": 100,
  "enrichWithCitations": false
}

This means: (COVID-19 OR SARS-CoV-2) AND (deep learning OR machine learning) AND (medical imaging).
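
To make the AND/OR semantics concrete, here is a small sketch that renders nested groups as a Boolean expression and tests a piece of text against them. Both function names are hypothetical, and the case-insensitive substring matching is an assumption for illustration; each source may apply its own search logic.

```python
from typing import List


def render_query(groups: List[List[str]]) -> str:
    """Join terms with OR inside a group and groups with AND."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups)


def matches(text: str, groups: List[List[str]]) -> bool:
    """True if every group has at least one term contained in the text."""
    haystack = text.lower()
    return all(
        any(term.lower() in haystack for term in group)
        for group in groups
    )


groups = [
    ["COVID-19", "SARS-CoV-2"],
    ["deep learning", "machine learning"],
    ["medical imaging"],
]
print(render_query(groups))
```

A title such as "Deep learning for COVID-19 medical imaging triage" satisfies all three groups, whereas "COVID-19 deep learning survey" fails the third.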

Output sample

Each dataset item:

{
  "source": "arxiv",
  "title": "Deep learning for COVID-19 chest X-ray classification",
  "authors": ["Jane Doe", "John Smith"],
  "abstract": "We propose a CNN architecture ...",
  "doi": "10.1234/example.2024.0001",
  "date": "2024-03-12",
  "journal": "arXiv:2403.01234",
  "url": "https://arxiv.org/abs/2403.01234",
  "citations_count": 17,
  "raw": { "...": "any unmapped fields preserved here" }
}
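
The `raw` field suggests a mapping step that copies known fields into the unified schema and keeps everything else. A minimal sketch of that idea follows. The source-side field names in `ARXIV_FIELD_MAP` and the `to_unified` helper are hypothetical, chosen only to illustrate the pattern; the Actor's real mapping is not documented here.

```python
from typing import Any, Dict

# Unified field name -> hypothetical field name in one source's payload.
ARXIV_FIELD_MAP: Dict[str, str] = {
    "title": "title",
    "authors": "author_names",
    "abstract": "summary",
    "doi": "doi",
    "date": "published",
    "url": "link",
}


def to_unified(source: str, payload: Dict[str, Any],
               field_map: Dict[str, str]) -> Dict[str, Any]:
    """Copy mapped fields into the unified record and keep the rest in `raw`."""
    record: Dict[str, Any] = {"source": source}
    for unified_name, source_name in field_map.items():
        # Missing source fields become None rather than raising.
        record[unified_name] = payload.get(source_name)
    mapped = set(field_map.values())
    record["raw"] = {k: v for k, v in payload.items() if k not in mapped}
    return record
```

Keeping unmapped fields under `raw` means a consumer can still reach source-specific metadata without the unified schema having to grow for every source.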

Limitations

Rate limits. PubMed (NCBI E-utilities), arXiv, and Semantic Scholar all rate-limit anonymous traffic. To raise the citation-enrichment limit, paste a free Semantic Scholar API key into the `semanticScholarApiKey` input field.
X-rxiv server dumps. bioRxiv, medRxiv, and chemRxiv require a local JSONL dump. The first run on a fresh container downloads these dumps (medRxiv ~30 min, bioRxiv ~3 hr, chemRxiv ~15 min). Plan memory and timeout accordingly, or pre-bake the dumps into the Docker image for production use.
Date filtering is applied client-side; not all sources return well-formed dates.
Dedup is by DOI only, so papers without a DOI may appear once per source.
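
Client-side date filtering with unreliable dates can be sketched as below. The helper names and the `keep_undated` flag are hypothetical: how the Actor treats a record whose `date` cannot be parsed is not stated, so the sketch makes that choice an explicit parameter. The bounds are inclusive, matching "before" and "after" in the input descriptions.

```python
from datetime import date
from typing import Any, Dict, Optional


def parse_iso_date(value: Any) -> Optional[date]:
    """Parse the leading YYYY-MM-DD of a string, or return None if malformed."""
    if not isinstance(value, str):
        return None
    try:
        return date.fromisoformat(value[:10])
    except ValueError:
        return None


def in_date_range(record: Dict[str, Any],
                  start: Optional[date] = None,
                  end: Optional[date] = None,
                  keep_undated: bool = True) -> bool:
    """Keep records dated within [start, end]; undated ones follow keep_undated."""
    parsed = parse_iso_date(record.get("date"))
    if parsed is None:
        # Assumed policy: the caller decides what happens to unparseable dates.
        return keep_undated
    if start is not None and parsed < start:
        return False
    if end is not None and parsed > end:
        return False
    return True
```

If undated records matter for your use case, it is worth inspecting the `date` values in a sample run before relying on `startDate` and `endDate`.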

bioRxiv & medRxiv Preprint Scraper

crawlergang/biorxiv-medrxiv-scraper

Scrape preprints from bioRxiv and medRxiv, the leading open-access preprint servers for biology and medicine. Search by date range, fetch by DOI, or retrieve published journal version information.

Crawler Gang

5.0

bioRxiv & medRxiv Preprint Scraper

crawlerbros/biorxiv-medrxiv-scraper

Crawler Bros

arXiv Preprint Scraper

chrisp1211/arxiv-scraper-max

arXiv Preprint Scraper. No API key required. Pay only per result; empty or failed runs cost nothing.

Christian Pichichero

Preprint-to-Publication Lineage Resolver

flintglade/preprint-publication-lineage-resolver

Resolve evidence-backed links among preprints, publications, versions, and corrections using official Crossref, Europe PMC, arXiv, bioRxiv, and medRxiv metadata.

Flintglade

ArXiv Preprint Paper Search

scrupulous_waterbird_m4w/arxiv-papers

Search and extract arXiv preprint papers by category, author, title, and date range. Returns title, authors, abstract, PDF URL, categories, primary category, and submission date as structured records.

Mori

bioRxiv and medRxiv Preprints Scraper

parseforge/biorxiv-recent-scraper

Track the latest preprints from bioRxiv or medRxiv inside any date window. Returns DOI, title, authors, posting date, category, abstract, version, server, JATS XML link, and license. Useful for literature surveillance, competitive science intelligence, and rapid biomedical research review.

ParseForge

bioRxiv + medRxiv Scraper for RAG

getascraper/biorxiv-medrxiv-rag-extractor

Scrape bioRxiv and medRxiv preprints by server, category, and date range. Returns RAG-ready JSON with JATS full-text chunks (cl100k_base, 512/50) when available and abstract fallback otherwise. Drop-in for LangChain, LlamaIndex, Qdrant, Pinecone, Weaviate, pgvector.

GetAScraper

medRxiv Scraper

parseforge/medrxiv-scraper

Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.

ParseForge

ArXiv Preprint Paper Search

ryanclinton/arxiv-paper-search

Search and extract preprint research papers from the ArXiv open-access repository. Query over 2.4 million academic papers across physics, mathematics, computer science, biology, economics, and more with structured JSON output, no API key required.