Unified Preprint Search
Pricing
from $10.00 / 1,000 results
Developer
Logical Vivacity
Maintained by Community
Unified Preprint & Journal Search
One Apify Actor, five sources: PubMed, arXiv, bioRxiv, medRxiv, chemRxiv.
Run a single keyword query and get a normalized, deduplicated dataset with one paper per row and, optionally, citation counts.
What it does
- Accepts a Boolean keyword query (AND between groups, OR within groups).
- Queries each selected source.
- Normalizes results into a unified record schema.
- Deduplicates across sources by DOI.
- Optionally enriches each paper with a Semantic Scholar citation count.
- Streams every paper to the dataset as it's processed (no in-memory accumulation).
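The deduplicate-then-stream steps above can be sketched as a generator. This is only an illustration, not the Actor's actual code: `normalize_doi` and `dedupe_by_doi` are hypothetical names, and the DOI normalization (lower-casing, stripping `https://doi.org/` and `doi:` prefixes) is an assumption. Only the set of seen DOIs is held in memory; the records themselves are yielded one at a time.

```python
from typing import Dict, Iterable, Iterator, Set


def normalize_doi(doi: str) -> str:
    """Lower-case a DOI and strip common prefixes (an assumed normalization)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def dedupe_by_doi(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield records one at a time, skipping any DOI that was already seen."""
    seen: Set[str] = set()
    for record in records:
        doi = record.get("doi")
        if not doi:
            # No DOI means nothing to compare on, so the record passes through.
            yield record
            continue
        key = normalize_doi(doi)
        if key in seen:
            continue
        seen.add(key)
        yield record


# Made-up sample records for illustration.
papers = [
    {"source": "pubmed", "title": "A", "doi": "10.1234/example.2024.0001"},
    {"source": "biorxiv", "title": "A", "doi": "https://doi.org/10.1234/EXAMPLE.2024.0001"},
    {"source": "arxiv", "title": "B", "doi": None},
    {"source": "pubmed", "title": "B", "doi": None},
]
unique = list(dedupe_by_doi(papers))
```

In this sample the bioRxiv copy of paper A is dropped, while both DOI-less copies of paper B survive, which mirrors the DOI-only deduplication noted under Limitations.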
Inputs
| Field | Type | Default | Description |
|---|---|---|---|
| `keywords` | array of arrays of strings (or flat array) | none (required) | Nested groups: outer = AND, inner = OR. A flat list is treated as AND of single-term groups. |
| `sources` | array of enum | all 5 | Subset of `pubmed`, `arxiv`, `biorxiv`, `medrxiv`, `chemrxiv`. |
| `startDate` | ISO date string | none | Drop records with date before this. |
| `endDate` | ISO date string | none | Drop records with date after this. |
| `maxResultsPerSource` | integer | 200 | Hard cap per source. |
| `enrichWithCitations` | boolean | false | Look up citation counts (slow, rate-limited). |
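The two accepted shapes of `keywords` can be reduced to one nested form before querying. The sketch below is a guess at how that might look: `normalize_keywords` is a hypothetical helper, and accepting a mix of strings and lists in the same array is an assumption the table does not confirm.

```python
from typing import List, Sequence, Union

KeywordInput = Sequence[Union[str, Sequence[str]]]


def normalize_keywords(keywords: KeywordInput) -> List[List[str]]:
    """Return AND-of-OR groups; a bare string becomes a single-term group."""
    groups: List[List[str]] = []
    for item in keywords:
        if isinstance(item, str):
            groups.append([item])      # flat entry -> its own AND group
        else:
            groups.append(list(item))  # nested entry -> OR group
    if not groups or any(len(group) == 0 for group in groups):
        raise ValueError("keywords must contain at least one non-empty group")
    return groups


flat = normalize_keywords(["CRISPR", "zebrafish"])
nested = normalize_keywords([["COVID-19", "SARS-CoV-2"], ["medical imaging"]])
```

Here `flat` becomes `[["CRISPR"], ["zebrafish"]]`, meaning CRISPR AND zebrafish, while `nested` is returned unchanged.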
Example query

    {
      "keywords": [
        ["COVID-19", "SARS-CoV-2"],
        ["deep learning", "machine learning"],
        ["medical imaging"]
      ],
      "sources": ["pubmed", "arxiv", "biorxiv"],
      "maxResultsPerSource": 100,
      "enrichWithCitations": false
    }

This means: (COVID-19 OR SARS-CoV-2) AND (deep learning OR machine learning) AND (medical imaging).
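To make the AND/OR semantics concrete, the following sketch renders nested groups as a Boolean expression and checks a piece of text against them. Both function names are hypothetical, and the case-insensitive substring match is an assumption made for illustration; each upstream source has its own query syntax, which this does not attempt to reproduce.

```python
from typing import List


def to_boolean_expression(groups: List[List[str]]) -> str:
    """Render groups as '(a OR b) AND (c)'."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups)


def matches(groups: List[List[str]], text: str) -> bool:
    """True if every group has at least one term occurring in the text."""
    haystack = text.lower()
    return all(
        any(term.lower() in haystack for term in group)
        for group in groups
    )


groups = [
    ["COVID-19", "SARS-CoV-2"],
    ["deep learning", "machine learning"],
    ["medical imaging"],
]
expression = to_boolean_expression(groups)
hit = matches(groups, "Deep learning for COVID-19 in medical imaging")
miss = matches(groups, "Deep learning for COVID-19 chest X-ray classification")
```

The second text fails because no term from the third group ("medical imaging") appears in it, even though the first two groups are satisfied.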
Output sample
Each dataset item:

    {
      "source": "arxiv",
      "title": "Deep learning for COVID-19 chest X-ray classification",
      "authors": ["Jane Doe", "John Smith"],
      "abstract": "We propose a CNN architecture ...",
      "doi": "10.1234/example.2024.0001",
      "date": "2024-03-12",
      "journal": "arXiv:2403.01234",
      "url": "https://arxiv.org/abs/2403.01234",
      "citations_count": 17,
      "raw": { "...": "any unmapped fields preserved here" }
    }

Limitations
- Rate limits. PubMed (NCBI E-utilities), arXiv and Semantic Scholar all rate-limit anonymous traffic. To raise the citation-enrichment limit, paste a free Semantic Scholar API key into the `semanticScholarApiKey` input field.
- X-rxiv server dumps. bioRxiv, medRxiv and chemRxiv require a local JSONL dump. The first run on a fresh container will trigger a download of these dumps (medRxiv ~30 min, bioRxiv ~3 hr, chemRxiv ~15 min). Plan memory and timeout accordingly, or pre-bake dumps into the Docker image for production use.
- Date filtering is applied client-side; not all sources return well-formed dates.
- Deduplication is by DOI only, so papers without a DOI may appear once per source.
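The client-side date filtering mentioned above has to cope with dates that do not parse. A minimal sketch of one way to handle that follows; `parse_iso_date`, `in_date_range` and the `keep_undated` flag are all hypothetical, and how the Actor itself treats malformed dates is not documented here.

```python
from datetime import date
from typing import Dict, Optional


def parse_iso_date(value: object) -> Optional[date]:
    """Parse the leading YYYY-MM-DD of a string, or return None if malformed."""
    if not isinstance(value, str):
        return None
    try:
        # Slicing drops any trailing time part such as 'T00:00:00Z'.
        return date.fromisoformat(value[:10])
    except ValueError:
        return None


def in_date_range(
    record: Dict,
    start: Optional[date] = None,
    end: Optional[date] = None,
    keep_undated: bool = True,  # assumed policy for unparseable dates
) -> bool:
    """Apply inclusive startDate/endDate bounds to a record's 'date' field."""
    parsed = parse_iso_date(record.get("date"))
    if parsed is None:
        return keep_undated
    if start is not None and parsed < start:
        return False
    if end is not None and parsed > end:
        return False
    return True


kept = in_date_range({"date": "2024-03-12"}, date(2024, 1, 1), date(2024, 12, 31))
dropped = in_date_range({"date": "2024-03-12"}, start=date(2024, 4, 1))
undated = in_date_range({"date": "n/a"}, start=date(2024, 1, 1))
```

Keeping undated records by default avoids silently losing papers from sources with poor date metadata; passing `keep_undated=False` gives the stricter behavior.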