medRxiv Scraper

Pricing

Pay per event

Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.

Developer: ParseForge

Maintained by Community

Actor stats

Bookmarked: 0 · Total users: 7 · Monthly active users: 1 · Last modified: a day ago

🧬 medRxiv Preprint Scraper

🚀 Scrape medRxiv health-science preprints in seconds. Filter by topic, subject collection, date range, or author. No API key, no registration, no manual CSV wrangling.

🕒 Last updated: 2026-05-18 · 📊 18 fields per record · 210,000+ live preprints · 50 subject collections · 3 sort orders

medRxiv is the leading open preprint server for the health sciences, hosting clinical research, epidemiology, public health, and biomedical work months before formal peer review. This Actor turns any medRxiv search (keyword, subject collection, posted-date window, or author) into a structured dataset of full preprint records, complete with title, all authors, posting date, DOI, abstract, full text, PDF link, license, funding statement, competing-interest declarations, data-availability statement, and any data or code repository link the authors disclose. The output drops straight into Google Sheets, BigQuery, Postgres, Notion, or any other tool your team already uses.

Preprint data is hard to harvest at scale. medRxiv exposes a faceted search interface, but its public API is built around date-interval and DOI lookups rather than keyword or author search; the site also sits behind Cloudflare and a Varnish edge that rate-limits direct datacenter traffic. This Actor closes the gap: pick a search query, narrow it with subject collection, date range, or author filters, set how many records you want, and the data lands in your dataset within minutes. Systematic reviewers, clinical research teams, public-health analysts, bibliometric researchers, science journalists, and AI training-set curators all use this kind of feed for living reviews, evidence surveillance, citation tracking, and topic modelling on emerging health research.

| 👥 Target audience | 🎯 Primary use case |
| --- | --- |
| Systematic reviewers and meta-analysts | Run living searches across health-science preprints |
| Clinical research and trial teams | Track new evidence in a therapeutic area in near real time |
| Public-health analysts and epidemiologists | Surveil outbreak, vaccine, and intervention literature as it lands |
| Bibliometric and science-policy researchers | Build datasets on topic emergence, author networks, and funding patterns |
| Science journalists and editors | Spot newsworthy preprints in a chosen field within hours of posting |
| AI/ML and NLP teams | Curate domain-specific corpora of biomedical text for training and evaluation |

📋 What the medRxiv Preprint Scraper does

  • 🔍 Any keyword search. Drive results from a free-text query just like the medRxiv search box.
  • 🗂️ 50 subject collections. Restrict to a specific medical specialty (Epidemiology, Cardiovascular Medicine, Infectious Diseases, HIV/AIDS, Public and Global Health, and 45 more).
  • 📅 Date range filter. Slice by posting date with dateFrom and dateTo (YYYY-MM-DD).
  • 👤 Author filter. Narrow to preprints where a given author name appears.
  • 📰 Full record per preprint. Title, full author list, posting date, DOI, abstract, full text, PDF URL, license, funding, competing interests, data-availability statement, and disclosed data or code URLs.
  • 🔁 Sort order. Choose newest first, oldest first, or relevance-ranked.

Each record returns the article URL, title, author list (with affiliation metadata where exposed), posting date, DOI, abstract, full text body, subject area assignment from medRxiv, PDF link, citation string, license text, funding statement, competing-interest statement, author declarations block, data-availability statement, disclosed data or code repository URL, supplementary materials, and the scrape timestamp.

💡 Why it matters: preprints often surface findings weeks or months before journal publication. In fast-moving health topics (pandemics, vaccines, drug repurposing) waiting for the indexed-in-MEDLINE version means working from stale evidence. This Actor lets analysts track the live preprint frontier without manual click-through.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to filter, run, and export medRxiv preprints into Google Sheets and the BI tool of your choice.


⚙️ Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| startUrl | string | no | Direct medRxiv.org search URL. Apply filters on medrxiv.org and paste the URL. Mutually exclusive with the filter fields below. |
| maxItems | integer | no | Maximum number of preprints to return. Free plan: capped at 10. Paid plan: up to 1,000,000. |
| searchQuery | string | no | Free-text search query (the same string you would type in the medRxiv search box). |
| subjectCollection | enum | no | One of 50 medRxiv subject collections (slugs like epidemiology, cardiovascular-medicine, infectious-diseases, hiv-aids). |
| dateFrom | string | no | Lower bound for posting date, YYYY-MM-DD. |
| dateTo | string | no | Upper bound for posting date, YYYY-MM-DD. |
| author | string | no | Restrict to preprints with this author name in the author list. |
| orderBy | enum | no | relevance (best match), newest (newest first), or oldest (oldest first). |

Example: pull the latest 50 infectious-disease preprints from January 2026 onward, sorted newest first.

{
  "searchQuery": "antimicrobial resistance",
  "subjectCollection": "infectious-diseases",
  "dateFrom": "2026-01-01",
  "orderBy": "newest",
  "maxItems": 50
}

Example: paste a pre-filtered medRxiv search URL and grab the first 200 results.

{
  "startUrl": "https://www.medrxiv.org/search/bacterial%20infection",
  "maxItems": 200
}
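
If you assemble inputs in code rather than in the Apify console, it can help to check the constraints from the input table before starting a run. The sketch below is one possible pre-flight check, not part of the Actor: `build_input` and `search_url` are hypothetical helper names, the set of fields treated as "filter fields" is our reading of the table, and the URL shape simply mirrors the `startUrl` example above.

```python
import re
from datetime import datetime
from urllib.parse import quote

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
ORDER_BY = {"relevance", "newest", "oldest"}
# Assumption: these are the "filter fields" that startUrl excludes.
FILTER_FIELDS = ("searchQuery", "subjectCollection", "dateFrom", "dateTo", "author")


def search_url(query: str) -> str:
    """Build a search URL shaped like the startUrl example (query percent-encoded in the path)."""
    return "https://www.medrxiv.org/search/" + quote(query, safe="")


def build_input(**fields) -> dict:
    """Drop unset fields and check the constraints listed in the input table."""
    run_input = {k: v for k, v in fields.items() if v is not None}

    clashes = [k for k in FILTER_FIELDS if k in run_input]
    if "startUrl" in run_input and clashes:
        raise ValueError("startUrl cannot be combined with: " + ", ".join(clashes))

    for key in ("dateFrom", "dateTo"):
        if key in run_input:
            value = run_input[key]
            if not DATE_RE.fullmatch(value):
                raise ValueError(f"{key} must be YYYY-MM-DD, got {value!r}")
            datetime.strptime(value, "%Y-%m-%d")  # rejects impossible dates such as 2026-02-30

    if "dateFrom" in run_input and "dateTo" in run_input:
        if run_input["dateFrom"] > run_input["dateTo"]:  # ISO dates sort as strings
            raise ValueError("dateFrom must not be later than dateTo")

    if "orderBy" in run_input and run_input["orderBy"] not in ORDER_BY:
        raise ValueError(f"orderBy must be one of {sorted(ORDER_BY)}")

    if "maxItems" in run_input and run_input["maxItems"] < 1:
        raise ValueError("maxItems must be a positive integer")

    return run_input
```

For example, `build_input(startUrl=search_url("bacterial infection"), maxItems=200)` yields the second input shown above, while combining `startUrl` with `author` raises a `ValueError`.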

⚠️ Good to Know: medRxiv sits behind Cloudflare and a Varnish edge that rate-limits datacenter IPs. The Actor routes every request through Apify Residential US proxies and rotates the session on each attempt, so transient 503 responses are retried transparently. Free-plan runs are capped at 10 preprints; upgrade to a paid plan for full batches.
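
To make the retry behaviour concrete, here is a generic sketch of per-attempt session rotation with exponential backoff. It illustrates the pattern only and is not the Actor's source code: `fetch_with_retries`, `TransientError`, the attempt count, and the delays are all assumptions, and the caller supplies the `fetch` function that actually performs the request through a proxy session.

```python
import time
import uuid
from typing import Callable


class TransientError(Exception):
    """Stands in for a retryable failure such as an HTTP 503 from the edge."""


def fetch_with_retries(
    fetch: Callable[[str, str], str],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url, session_id) with a fresh session per attempt, backing off between failures."""
    for attempt in range(max_attempts):
        session_id = uuid.uuid4().hex  # new session id, so each attempt can use a new proxy session
        try:
            return fetch(url, session_id)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.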


📊 Output

Each record is one medRxiv preprint, normalised across subject collections.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🔗 url | string | https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1 |
| 📝 title | string | Genomic epidemiology of ESBL-producing Escherichia coli and Klebsiella pneumoniae... |
| 👥 authors | string[] | ["Germanie Delaisie Abomo", "Gabriel Cedric Bessala", "..."] |
| 🧑‍🔬 authorDetails | object[] | [{ "name": "...", "affiliation": "..." }, ...] |
| 📅 publicationDate | string | Posted March 18, 2026. |
| 🔢 doi | string | https://doi.org/10.64898/2026.03.16.26348538 |
| 📄 abstract | string | Background Livestock production systems in peri-urban areas are associated with... |
| 📰 fullText | string | Full article body text |
| 🗂️ subjectAreas | string[] | ["Epidemiology"] |
| 📑 pdfUrl | string | https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1.full.pdf |
| 📚 citationInformation | string | Auto-generated citation string |
| 📜 licenseInformation | string | It is made available under a CC-BY 4.0 International license. |
| 💰 fundingStatement | string | Funding sources disclosed by authors |
| ⚖️ competingInterestStatement | string | Competing-interest declaration |
| 🪪 authorDeclarations | string | Author declarations block (IRB, consent, etc.) |
| 🗄️ dataAvailability | string | Data-availability statement |
| 🧪 dataCodeUrl | string | https://data.wastewaterscan.org/about |
| 📎 supplementaryMaterials | string[] | Links to supplementary files |
| 🕒 scrapedTimestamp | ISO date | 2026-05-18T01:30:00.000Z |
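
Two of the example values are display strings rather than machine-friendly values: `publicationDate` reads "Posted March 18, 2026." and `doi` is a full resolver URL. If your downstream tool wants an ISO date and a bare DOI, a small post-processing step like the sketch below may help. The helper names are hypothetical, and the patterns assume every record follows the shape of the examples in the table.

```python
import re
from datetime import datetime
from typing import Optional

_POSTED = re.compile(r"Posted\s+([A-Za-z]+ \d{1,2}, \d{4})")
_DOI_PREFIX = re.compile(r"^https?://(?:dx\.)?doi\.org/")


def posting_date_iso(publication_date: Optional[str]) -> Optional[str]:
    """Turn 'Posted March 18, 2026.' into '2026-03-18'; return None if the text does not match."""
    match = _POSTED.search(publication_date or "")
    if not match:
        return None
    try:
        # %B parses English month names under the default (C) locale.
        return datetime.strptime(match.group(1), "%B %d, %Y").date().isoformat()
    except ValueError:
        return None


def bare_doi(doi: Optional[str]) -> str:
    """Strip the doi.org resolver prefix, leaving e.g. '10.64898/2026.03.16.26348538'."""
    return _DOI_PREFIX.sub("", doi or "")


def normalise(record: dict) -> dict:
    """Return a copy of the record with two extra, machine-friendly fields."""
    return {
        **record,
        "postedOn": posting_date_iso(record.get("publicationDate")),  # hypothetical field name
        "doiBare": bare_doi(record.get("doi")),                        # hypothetical field name
    }
```

The original fields are left untouched, so the added columns can be dropped again if the record shape changes.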

✨ Why choose this Actor

| 🪄 | Capability |
| --- | --- |
| 🧬 | Full health-science coverage. All 50 medRxiv subject collections, from Addiction Medicine to Urology, in a single Actor. |
| 🔁 | Filter or paste. Drive the search from filter fields, or paste a pre-filtered medrxiv.org URL. |
| 🛡️ | Cloudflare and Varnish handled. Apify Residential US proxies with per-attempt session rotation absorb the rate limits that block direct fetches. |
| 📦 | 18 normalised fields. Same shape across every collection, so an epidemiology preprint and a cardiology preprint drop into the same table. |
| 📅 | Date and author filters. Narrow by posting date range or author name without touching the UI. |
| 💸 | Pay-per-event or flat. Compatible with both Apify pricing models. |
| 🪵 | Resilient. Per-request session rotation plus short exponential backoff retries shrug off the transient 503s that medRxiv's edge throws under load. |

📊 medRxiv has hosted 210,000+ preprints across 50 medical specialties since 2019, with hundreds of new posts every week.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ medRxiv Preprint Scraper (this Actor) | Pay only for runs | All 50 subject collections, 210,000+ preprints | On-demand | Keyword, collection, date range, author, sort | None |
| Official PubMed mirrors | Free | Indexed-after-peer-review only | Days to weeks behind preprints | MeSH terms, not preprint fields | API key, query syntax |
| Paid scholarly aggregators | Subscription | Mixed; preprints often partial | Daily | Vendor-specific | Sales contract |
| Manual browser saves | Time | Limited | Manual | Manual | Slow |
| Generic web-scraping tools | DIY | Brittle, breaks on edge updates | DIY | DIY | High |

Most teams already have a citation database. This Actor is the live-preprint layer that sits in front of it.


🚀 How to use

  1. 🔐 Sign up. Create a free Apify account (no credit card needed for the free tier).
  2. 🔎 Open the Actor. Search the Apify Store for "medRxiv" or open this page.
  3. 🧩 Fill in input. Either paste a medRxiv search URL into startUrl, or set searchQuery, subjectCollection, dateFrom / dateTo, and author as needed.
  4. ▶️ Click Start. The Actor uses Apify Residential US proxies to load and parse search and detail pages.
  5. 📥 Export. Download the dataset as JSON, CSV, or Excel, or push it directly to Google Sheets, BigQuery, a webhook, or S3.

⏱️ Total time: about 2 minutes from sign-up to first export.
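
If you download the JSON export and only need a few columns in a spreadsheet, a short script can trim and flatten the records. The sketch below is one way to do that with the standard library; `records_to_csv` is a hypothetical helper, and joining list fields such as `authors` with "; " is our own choice, not something the Actor's built-in CSV export is documented to do.

```python
import csv
import io
from typing import Iterable, List


def records_to_csv(records: Iterable[dict], columns: List[str]) -> str:
    """Render the chosen columns as CSV text, joining list values with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for column in columns:
            value = record.get(column, "")
            # List fields (authors, subjectAreas, ...) become one readable cell.
            row[column] = "; ".join(map(str, value)) if isinstance(value, list) else value
        writer.writerow(row)
    return buffer.getvalue()
```

Typical use would be to load the downloaded file with `json.load`, call `records_to_csv(records, ["title", "authors", "doi"])`, and write the returned text to disk.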


💼 Business use cases

🏥 Clinical research teams

  • Run weekly searches in a therapeutic area to surface new preprints
  • Build evidence dashboards for internal review committees
  • Track preprints by competing labs or trial sponsors
  • Feed a regulatory-affairs intelligence pipeline

🦠 Public-health and epidemiology

  • Surveil outbreak literature (respiratory, vector-borne, AMR) in near real time
  • Track vaccine effectiveness and uptake research as it posts
  • Map study-design trends across regions and pathogens
  • Brief ministries and public-health agencies on emerging evidence

📊 Bibliometric and policy analysts

  • Build datasets on topic emergence and growth curves
  • Map author networks and institutional output across subject collections
  • Compare preprint-to-publication conversion rates over time
  • Quantify funding-statement patterns by topic and country

📰 Science journalism and comms

  • Spot newsworthy preprints in a beat within hours of posting
  • Pull DOIs and abstracts directly into an editorial CMS
  • Track corrections and version updates across an evidence story
  • Build "what's new on medRxiv this week" digests
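
For tracking version updates, one approach is to group records by DOI and keep the highest version seen. The sketch below assumes that a preprint's versions share one DOI and that the version appears as a trailing `v1`, `v2`, ... in the `url` field, as in the schema example; whether a given run returns more than one version of the same preprint is not stated here, so treat this as a starting point. Both function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

_VERSION = re.compile(r"v(\d+)$")


def version_of(url: str) -> int:
    """Read the trailing version number from a preprint URL; 0 if none is present."""
    match = _VERSION.search(url.rstrip("/"))
    return int(match.group(1)) if match else 0


def latest_versions(records: Iterable[dict]) -> List[dict]:
    """Keep one record per preprint, preferring the highest version number."""
    latest: Dict[str, dict] = {}
    for record in records:
        # Fall back to the version-less URL when a record has no DOI.
        key = record.get("doi") or _VERSION.sub("", record["url"].rstrip("/"))
        current = latest.get(key)
        if current is None or version_of(record["url"]) > version_of(current["url"]):
            latest[key] = record
    return list(latest.values())
```

Comparing the result of one run with the result of the previous run then shows which preprints gained a new version in between.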

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🤝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

🔌 Automating medRxiv Preprint Scraper

Trigger this Actor from any code path or scheduler.

Schedule it on Apify with Schedules to refresh evidence pulls daily, weekly, or hourly. Pipe the dataset to Google Sheets, BigQuery, Snowflake, or your own webhook for downstream analytics.
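
As an illustration of triggering a run from code, the sketch below calls Apify's synchronous "run Actor and get dataset items" REST endpoint using only the Python standard library. The Actor ID `parseforge~medrxiv-scraper` is a placeholder assumption (copy the real ID from the Actor's API tab), and `build_run_request` and `run_actor` are hypothetical helper names. The synchronous endpoint waits for the run to finish, so for large `maxItems` values an asynchronous run plus a later dataset download is likely the better fit.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Prepare a POST that starts a run and returns its dataset items as JSON."""
    url = f"{API_BASE}/acts/{quote(actor_id, safe='~')}/run-sync-get-dataset-items?format=json"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # Apify API token from your account settings
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and return the scraped records (performs a network call)."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (placeholder Actor ID; requires a valid token):
# records = run_actor(
#     "parseforge~medrxiv-scraper",
#     token="<APIFY_TOKEN>",
#     run_input={"searchQuery": "antimicrobial resistance", "orderBy": "newest", "maxItems": 10},
# )
```

Keeping request construction separate from sending makes it possible to inspect or log the exact call before any credits are spent.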


🔌 Integrate with any app

Pipe medRxiv data into the tools you already use.

  • Google Sheets - drop preprint records into a live sheet for analysts.
  • BigQuery - warehouse evidence pulls for SQL analytics.
  • Webhooks - notify Slack, Zapier, or your own service when a run completes.
  • Airbyte - sync to Snowflake, Postgres, or any Airbyte destination.
  • Zapier - trigger downstream automations on every new run.
  • Make - build no-code workflows around scraped preprints.

💡 Pro Tip: browse the complete ParseForge collection for more research, scholarly, and health-data scrapers.


🆘 Need Help? Open our contact form and we will reply within one business day.


⚠️ Disclaimer: This Actor extracts publicly accessible preprint data from medRxiv.org for legitimate research, evidence-surveillance, and analytics purposes. It does not collect personal user data beyond what authors and institutions publicly publish on their preprints. Use of this Actor is at your own risk and subject to medRxiv.org's terms of service and the per-preprint Creative Commons or CC0 license. ParseForge is not affiliated with, endorsed by, or sponsored by medRxiv, Cold Spring Harbor Laboratory, BMJ, or Yale University.