Unified Preprint Search

Pricing

from $10.00 / 1,000 results

Developer

Logical Vivacity

Maintained by Community

Unified Preprint & Journal Search

One Apify Actor, five sources: PubMed, arXiv, bioRxiv, medRxiv, chemRxiv.

Run a single keyword query and get a normalized, deduplicated dataset with one paper per row and, optionally, citation counts.

What it does

  • Accepts a Boolean keyword query (AND between groups, OR within groups).
  • Queries each selected source.
  • Normalizes results into a unified record schema.
  • Deduplicates across sources by DOI.
  • Optionally enriches each paper with a Semantic Scholar citation count.
  • Streams every paper to the dataset as it's processed (no in-memory accumulation).
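The deduplication and streaming steps above can be pictured with a short sketch. This is not the Actor's code: `normalize_doi` and `dedupe_by_doi` are hypothetical names, and lowercasing the DOI and stripping a resolver prefix is an assumed normalization. Only the set of seen DOIs is held in memory; the papers themselves are yielded one at a time.

```python
from typing import Dict, Iterable, Iterator, Optional, Set


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip a resolver prefix (assumed normalization)."""
    if not doi:
        return None
    cleaned = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
    return cleaned or None


def dedupe_by_doi(records: Iterable[Dict]) -> Iterator[Dict]:
    """Yield each record once per DOI, in arrival order.

    Records without a DOI cannot be matched, so they always pass through.
    """
    seen: Set[str] = set()
    for record in records:
        doi = normalize_doi(record.get("doi"))
        if doi is None:
            yield record  # no key to compare on
            continue
        if doi in seen:
            continue  # same paper already emitted by another source
        seen.add(doi)
        yield record


papers = [
    {"source": "pubmed", "doi": "10.1234/example.2024.0001"},
    {"source": "biorxiv", "doi": "https://doi.org/10.1234/EXAMPLE.2024.0001"},
    {"source": "arxiv", "doi": None},
    {"source": "pubmed", "doi": None},
]
unique = list(dedupe_by_doi(papers))
```

In this sketch the first record wins, so the order in which sources are queried decides which copy of a paper is kept.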

Inputs

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| keywords | array of arrays of strings (or flat array) | (required) | Nested groups: outer = AND, inner = OR. A flat list is treated as AND of single-term groups. |
| sources | array of enum | all 5 | Subset of pubmed, arxiv, biorxiv, medrxiv, chemrxiv. |
| startDate | ISO date string |  | Drop records with date before this. |
| endDate | ISO date string |  | Drop records with date after this. |
| maxResultsPerSource | integer | 200 | Hard cap per source. |
| enrichWithCitations | boolean | false | Look up citation counts (slow, rate-limited). |
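The two accepted shapes of `keywords` can be reduced to one. The sketch below shows one way to do that, assuming a flat list becomes one single-term group per entry as the table describes; `normalize_keywords` is a hypothetical helper, not a function exposed by the Actor.

```python
from typing import List, Sequence, Union

KeywordInput = Sequence[Union[str, Sequence[str]]]


def normalize_keywords(keywords: KeywordInput) -> List[List[str]]:
    """Return keywords as nested groups: outer list = AND, inner list = OR."""
    groups: List[List[str]] = []
    for item in keywords:
        if isinstance(item, str):
            groups.append([item])  # flat entry becomes a single-term group
        else:
            groups.append(list(item))
    if not groups:
        raise ValueError("keywords is required and must not be empty")
    return groups


flat = normalize_keywords(["CRISPR", "off-target"])
nested = normalize_keywords([["COVID-19", "SARS-CoV-2"], ["medical imaging"]])
```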

Example query

{
  "keywords": [
    ["COVID-19", "SARS-CoV-2"],
    ["deep learning", "machine learning"],
    ["medical imaging"]
  ],
  "sources": ["pubmed", "arxiv", "biorxiv"],
  "maxResultsPerSource": 100,
  "enrichWithCitations": false
}

This means: (COVID-19 OR SARS-CoV-2) AND (deep learning OR machine learning) AND (medical imaging).
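To make the AND/OR semantics concrete, the sketch below renders the nested groups as a Boolean expression and checks a piece of text against them. Both helpers are hypothetical, and the case-insensitive substring test is an assumption for illustration; each source's own search API may match terms differently.

```python
from typing import List


def to_boolean_expression(groups: List[List[str]]) -> str:
    """Join terms with OR inside a group and groups with AND."""
    return " AND ".join("(" + " OR ".join(group) + ")" for group in groups)


def matches(text: str, groups: List[List[str]]) -> bool:
    """True if every group has at least one term present in the text."""
    lowered = text.lower()
    return all(any(term.lower() in lowered for term in group) for group in groups)


groups = [
    ["COVID-19", "SARS-CoV-2"],
    ["deep learning", "machine learning"],
    ["medical imaging"],
]
expression = to_boolean_expression(groups)
hit = matches("Machine learning for SARS-CoV-2 medical imaging triage", groups)
miss = matches("Deep learning for COVID-19 drug discovery", groups)
```

Here `hit` is true because all three groups are satisfied, while `miss` is false because no term from the third group appears.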

Output sample

Each dataset item:

{
  "source": "arxiv",
  "title": "Deep learning for COVID-19 chest X-ray classification",
  "authors": ["Jane Doe", "John Smith"],
  "abstract": "We propose a CNN architecture ...",
  "doi": "10.1234/example.2024.0001",
  "date": "2024-03-12",
  "journal": "arXiv:2403.01234",
  "url": "https://arxiv.org/abs/2403.01234",
  "citations_count": 17,
  "raw": { "...": "any unmapped fields preserved here" }
}
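The sketch below shows one way a source payload could be mapped onto this record shape, with unmapped fields kept under `raw`. The helper `to_unified`, the `field_map` argument, and the payload keys (`headline`, `creators`, `identifier`, `comment`) are all hypothetical; the Actor's real per-source mappings are not documented here.

```python
from typing import Any, Dict

UNIFIED_FIELDS = (
    "title", "authors", "abstract", "doi",
    "date", "journal", "url", "citations_count",
)


def to_unified(source: str, payload: Dict[str, Any],
               field_map: Dict[str, str]) -> Dict[str, Any]:
    """Map a source payload onto the unified schema; keep the rest in `raw`."""
    record: Dict[str, Any] = {"source": source}
    for field in UNIFIED_FIELDS:
        record[field] = None  # every unified key is always present
    raw: Dict[str, Any] = {}
    for key, value in payload.items():
        target = field_map.get(key)
        if target in UNIFIED_FIELDS:
            record[target] = value
        else:
            raw[key] = value  # unmapped field preserved as-is
    record["raw"] = raw
    return record


# Hypothetical payload and mapping, for illustration only.
payload = {
    "headline": "Deep learning for COVID-19 chest X-ray classification",
    "creators": ["Jane Doe", "John Smith"],
    "identifier": "10.1234/example.2024.0001",
    "comment": "12 pages",
}
field_map = {"headline": "title", "creators": "authors", "identifier": "doi"}
item = to_unified("arxiv", payload, field_map)
```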

Limitations

  • Rate limits. PubMed (NCBI E-utilities), arXiv and Semantic Scholar all rate-limit anonymous traffic. To raise the citation-enrichment limit, paste a free Semantic Scholar API key into the semanticScholarApiKey input field.
  • X-rxiv server dumps. bioRxiv, medRxiv and chemRxiv require a local JSONL dump. The first run on a fresh container will trigger a download of these dumps (medRxiv ~30 min, bioRxiv ~3 hr, chemRxiv ~15 min). Plan memory and timeout accordingly, or pre-bake dumps into the Docker image for production use.
  • Date filtering is applied client-side; not all sources return well-formed dates.
  • Deduplication is by DOI only, so a paper without a DOI may appear once for each source that returns it.
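Client-side date filtering with imperfect dates might look like the sketch below. It treats `startDate` and `endDate` as inclusive bounds and, as an assumption, keeps records whose date cannot be parsed; how the Actor actually handles malformed dates is not stated. `parse_iso_date`, `filter_by_date`, and the `keep_undated` flag are hypothetical names.

```python
from datetime import date
from typing import Any, Dict, Iterable, Iterator, Optional


def parse_iso_date(value: Any) -> Optional[date]:
    """Parse the leading YYYY-MM-DD of a string, or return None if malformed."""
    if not isinstance(value, str):
        return None
    try:
        return date.fromisoformat(value.strip()[:10])
    except ValueError:
        return None


def filter_by_date(records: Iterable[Dict], start_date: Optional[str] = None,
                   end_date: Optional[str] = None,
                   keep_undated: bool = True) -> Iterator[Dict]:
    """Drop records dated before start_date or after end_date."""
    start = parse_iso_date(start_date)
    end = parse_iso_date(end_date)
    for record in records:
        parsed = parse_iso_date(record.get("date"))
        if parsed is None:
            if keep_undated:  # assumption: malformed dates are not dropped
                yield record
            continue
        if start is not None and parsed < start:
            continue
        if end is not None and parsed > end:
            continue
        yield record


records = [
    {"title": "A", "date": "2024-03-12"},
    {"title": "B", "date": "2019-11-02T08:00:00Z"},
    {"title": "C", "date": "not a date"},
    {"title": "D", "date": None},
]
kept = [r["title"] for r in filter_by_date(records, start_date="2020-01-01")]
strict = [r["title"] for r in filter_by_date(records, "2020-01-01", None, False)]
```

Setting `keep_undated=False` gives the stricter behavior, where only records with a parseable, in-range date survive.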