bioRxiv & medRxiv Preprint Scraper

Pricing

from $3.00 / 1,000 results

Scrape preprints from bioRxiv and medRxiv, the leading open-access preprint servers for biology and medicine. Search by date range, fetch by DOI, or retrieve published journal version information.

Rating

5.0 (11)

Developer

Crawler Gang

Maintained by Community

Actor stats

  • Bookmarked: 11
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 8 days ago

The scraper is powered by the official bioRxiv/medRxiv API.

No account, no API key, and no proxy required. Works on the Apify free plan.


What It Does

  • Search by date range: retrieve all preprints posted within a date window of any length (ranges longer than 90 days are split automatically into 90-day API chunks and paginated)
  • Fetch by DOI: look up one or more specific preprints using their DOIs
  • Published version info: check whether a preprint has been published in a journal and retrieve the journal DOI and name
  • Filter by category: narrow results to a specific scientific field (neuroscience, genomics, immunology, etc.)
  • Both servers: query bioRxiv, medRxiv, or both in a single run

Use Cases

  • Track new preprints in your research field
  • Build a literature monitoring or alerting pipeline
  • Analyze publishing trends across biomedical disciplines
  • Identify preprints that have been formally published in journals
  • Aggregate author/institution data for research network analysis

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | select | search | search (date range), byDoi (DOI lookup), or published (journal version info) |
| server | select | biorxiv | biorxiv, medrxiv, or both |
| dateFrom | date | 2024-01-01 | Start date (YYYY-MM-DD). Required for mode=search |
| dateTo | date | 2024-01-07 | End date (YYYY-MM-DD). Required for mode=search |
| dois | array | | One or more DOIs to look up (required for mode=byDoi and mode=published) |
| category | select | All | Filter to a specific scientific category (mode=search only) |
| maxItems | integer | 50 | Maximum number of records to return (1–10000) |
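
The rules in the table above (which fields each mode requires, and the allowed range for maxItems) can be checked before a run is started. The following is a minimal sketch, not part of the scraper itself: the helper name `build_input` is hypothetical, and it assumes the input is passed as a JSON-style object keyed by the parameter names listed above.

```python
from datetime import date

VALID_MODES = {"search", "byDoi", "published"}
VALID_SERVERS = {"biorxiv", "medrxiv", "both"}


def build_input(mode="search", server="biorxiv", date_from=None, date_to=None,
                dois=None, category=None, max_items=50):
    """Assemble an input object (hypothetical helper) and check the table's rules."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if server not in VALID_SERVERS:
        raise ValueError(f"unknown server: {server!r}")
    if not 1 <= max_items <= 10000:
        raise ValueError("maxItems must be between 1 and 10000")

    run_input = {"mode": mode, "server": server, "maxItems": max_items}
    if mode == "search":
        if not (date_from and date_to):
            raise ValueError("dateFrom and dateTo are required for mode=search")
        # fromisoformat raises ValueError for strings that are not valid ISO dates.
        if date.fromisoformat(date_from) > date.fromisoformat(date_to):
            raise ValueError("dateFrom must not be later than dateTo")
        run_input.update(dateFrom=date_from, dateTo=date_to)
        if category:
            # Assumption: the category value is passed through unchanged.
            run_input["category"] = category
    else:
        if not dois:
            raise ValueError(f"dois is required for mode={mode}")
        run_input["dois"] = list(dois)
    return run_input
```

For example, `build_input(date_from="2024-01-01", date_to="2024-01-07")` yields a search input with the documented defaults, while `build_input(mode="byDoi")` raises because no DOIs were given.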

Supported Categories

Neuroscience, Bioinformatics, Genomics, Microbiology, Cell Biology, Biochemistry, Evolutionary Biology, Pharmacology and Toxicology, Immunology, Molecular Biology, Genetics, Cancer Biology, Scientific Communication, Pathology, Systems Biology, Ecology, Physiology, Epidemiology, Developmental Biology, Clinical Trials, Bioengineering, Plant Biology, Zoology, Biophysics, Synthetic Biology.


Output Fields

search and byDoi Modes

| Field | Type | Description |
|---|---|---|
| doi | string | Preprint DOI |
| title | string | Preprint title |
| authors | string | All authors as a single string |
| authorList | array | Authors as an array of strings |
| correspondingAuthor | string | Name of the corresponding author |
| institution | string | Corresponding author's institution |
| submittedDate | string | Date submitted (YYYY-MM-DD) |
| version | integer | Version number of the preprint |
| type | string | Preprint type (e.g. "new results") |
| license | string | License code (e.g. "cc_by", "cc0") |
| category | string | Scientific category |
| server | string | Source server (biorxiv or medrxiv) |
| abstractText | string | Full abstract text |
| jatsXmlUrl | string | URL to the JATS/XML version |
| previewUrl | string | URL to view the preprint on biorxiv/medrxiv |
| isPublished | boolean | Whether the preprint has a journal publication |
| publishedDoi | string | Journal publication DOI (if published) |
| scrapedAt | string | Timestamp when the record was scraped (ISO-8601) |

published Mode

| Field | Type | Description |
|---|---|---|
| doi | string | bioRxiv/medRxiv preprint DOI |
| title | string | Preprint title |
| authors | string | Authors string |
| category | string | Scientific category |
| server | string | Preprint server |
| isPublished | boolean | Whether a journal publication exists |
| publishedDoi | string | Journal publication DOI |
| publishedJournal | string | Journal name |
| publishedDate | string | Journal publication date |
| preprintDate | string | Date originally submitted as preprint |
| preprintDoi | string | Original preprint DOI |
| scrapedAt | string | Timestamp when the record was scraped |

Sample Output

Preprint Record

{
  "doi": "10.1101/2024.01.15.575123",
  "title": "A Study of Neural Circuits in the Hippocampus",
  "authors": "Smith J, Jones A, Brown C",
  "authorList": ["Smith J", "Jones A", "Brown C"],
  "correspondingAuthor": "Smith J",
  "institution": "Harvard University",
  "submittedDate": "2024-01-15",
  "version": 1,
  "type": "new results",
  "license": "cc_by",
  "category": "neuroscience",
  "server": "biorxiv",
  "abstractText": "This paper studies hippocampal circuits...",
  "jatsXmlUrl": "https://www.biorxiv.org/content/10.1101/2024.01.15.575123v1.source.xml",
  "previewUrl": "https://www.biorxiv.org/content/10.1101/2024.01.15.575123",
  "isPublished": false,
  "scrapedAt": "2026-05-23T10:00:00+00:00"
}
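
Because each record carries a version number, downstream code may want a single record per DOI. The sketch below keeps the highest version seen for each DOI. It assumes that a run can return more than one version of the same preprint, which this page does not state, and the function name `latest_versions` is hypothetical.

```python
def latest_versions(records):
    """Return one record per DOI, keeping the highest `version` seen."""
    best = {}
    for rec in records:
        doi = rec.get("doi")
        if doi is None:
            continue  # a record without a DOI cannot be grouped, so skip it
        current = best.get(doi)
        # Treat a missing version as 0 so any numbered version wins.
        if current is None or rec.get("version", 0) > current.get("version", 0):
            best[doi] = rec
    return list(best.values())
```

The result preserves the order in which each DOI first appeared in the input.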

FAQ

Does this require an API key or account? No. The bioRxiv/medRxiv API is completely public and free. No registration required.

What is the maximum date range I can query? There is no fixed maximum. The bioRxiv API returns up to 100 preprints per call and works on a 90-day window, so this scraper automatically splits larger date ranges into 90-day chunks and paginates through each of them.
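
The splitting step described in this answer can be illustrated as follows. This is a sketch under stated assumptions rather than the scraper's actual code: the function name `split_into_chunks` is hypothetical, and it assumes a chunk covers at most 90 calendar days with both endpoints included.

```python
from datetime import date, timedelta


def split_into_chunks(date_from, date_to, max_days=90):
    """Split an inclusive YYYY-MM-DD range into consecutive chunks of at most max_days days."""
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    if start > end:
        raise ValueError("date_from must not be later than date_to")

    chunks = []
    while start <= end:
        # Both endpoints count, so a 90-day chunk ends 89 days after it starts.
        chunk_end = min(start + timedelta(days=max_days - 1), end)
        chunks.append((start.isoformat(), chunk_end.isoformat()))
        start = chunk_end + timedelta(days=1)  # next chunk begins the following day
    return chunks
```

A one-week range such as 2024-01-01 to 2024-01-07 stays as a single chunk, while 2024-01-01 to 2024-06-30 is split into three chunks that neither overlap nor leave gaps.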

How do I fetch a specific preprint? Use mode=byDoi and enter the DOI (e.g. 10.1101/2024.01.01.612345) in the dois field.

Can I check if preprints have been published? Yes — use mode=published with a list of DOIs to retrieve journal publication information including the journal name and published DOI.
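
Once a published-mode run has finished, its records can be reduced to a lookup from preprint DOI to journal information. The sketch below uses only the field names from the published Mode table; the function name `published_map` is hypothetical, and falling back from preprintDoi to doi is an assumption.

```python
def published_map(records):
    """Map each preprint DOI to (journal DOI, journal name) for published records."""
    result = {}
    for rec in records:
        # Only keep records that report a journal publication with a DOI.
        if rec.get("isPublished") and rec.get("publishedDoi"):
            key = rec.get("preprintDoi") or rec.get("doi")
            result[key] = (rec["publishedDoi"], rec.get("publishedJournal"))
    return result
```

Records where isPublished is false, or where publishedDoi is absent, are left out of the mapping.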

What categories are available? bioRxiv covers biological sciences; medRxiv covers health sciences and clinical research. See the category dropdown in the input form for the full list.

Can I query both bioRxiv and medRxiv at once? Yes — set server=both and the scraper will query both servers and combine results.

Why are some preprints missing fields like institution or abstractText? These fields are only included when the data is available in the API response. Records with missing data will simply omit those fields rather than including null values.
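
Since absent fields are omitted rather than set to null, code that expects a fixed set of columns should supply its own default. A minimal sketch using the Python standard library follows; the column selection and the function name `to_csv` are illustrative choices, not part of the scraper.

```python
import csv
import io

# Illustrative subset of the output fields; adjust to the columns you need.
COLUMNS = ["doi", "title", "institution", "abstractText", "publishedDoi"]


def to_csv(records, columns=COLUMNS):
    """Write records to CSV text, filling omitted fields with an empty string."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=columns,
        restval="",             # value used when a record lacks a column
        extrasaction="ignore",  # drop fields that are not in `columns`
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

When reading individual records, `record.get("institution")` avoids a KeyError in the same way.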

How many records can I retrieve per run? Up to 10,000 records per run. For larger datasets, use narrower date ranges or run the scraper several times with consecutive, non-overlapping date ranges.