Pricing

Pay per event

medRxiv Scraper

Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.

Developer

ParseForge

🧬 medRxiv Preprint Scraper

🚀 Scrape medRxiv health-science preprints in seconds. Filter by topic, subject collection, date range, or author. No API key, no registration, no manual CSV wrangling.

🕒 Last updated: 2026-05-18 · 📊 18 fields per record · 210,000+ live preprints · 50 subject collections · 3 sort orders

medRxiv is the leading open preprint server for the health sciences, hosting clinical research, epidemiology, public health, and biomedical work months before formal peer review. This Actor turns any medRxiv search (keyword, subject collection, posted-date window, or author) into a structured dataset of full preprint records, complete with title, all authors, posting date, DOI, abstract, full text, PDF link, license, funding statement, competing-interest declarations, data-availability statement, and any data or code repository link the authors disclose. The output drops straight into Google Sheets, BigQuery, Postgres, Notion, or any other tool your team already uses.

Preprint data is hard to harvest at scale. medRxiv exposes a faceted search interface, but its public API is organised around date windows and DOIs rather than free-text keyword or author search; the site also sits behind Cloudflare and a Varnish edge that rate-limits direct datacenter traffic. This Actor closes the gap: pick a search query, narrow it with subject collection, date range, or author filters, set how many records you want, and the data lands in your dataset within minutes. Systematic reviewers, clinical research teams, public-health analysts, bibliometric researchers, science journalists, and AI training-set curators all use this kind of feed for living reviews, evidence surveillance, citation tracking, and topic modelling on emerging health research.

👥 Target audience	🎯 Primary use case
Systematic reviewers and meta-analysts	Run living searches across health-science preprints
Clinical research and trial teams	Track new evidence in a therapeutic area in near real time
Public-health analysts and epidemiologists	Surveil outbreak, vaccine, and intervention literature as it lands
Bibliometric and science-policy researchers	Build datasets on topic emergence, author networks, and funding patterns
Science journalists and editors	Spot newsworthy preprints in a chosen field within hours of posting
AI/ML and NLP teams	Curate domain-specific corpora of biomedical text for training and evaluation

📋 What the medRxiv Preprint Scraper does

🔍 Any keyword search. Drive results from a free-text query just like the medRxiv search box.
🗂️ 50 subject collections. Restrict to a specific medical specialty (Epidemiology, Cardiovascular Medicine, Infectious Diseases, HIV/AIDS, Public and Global Health, and 45 more).
📅 Date range filter. Slice by posting date with dateFrom and dateTo (YYYY-MM-DD).
👤 Author filter. Narrow to preprints where a given author name appears.
📰 Full record per preprint. Title, full author list, posting date, DOI, abstract, full text, PDF URL, license, funding, competing interests, data-availability statement, and disclosed data or code URLs.
🔁 Sort order. Choose newest first, oldest first, or relevance-ranked.

Each record returns the article URL, title, author list (with affiliation metadata where exposed), posting date, DOI, abstract, full text body, subject area assignment from medRxiv, PDF link, citation string, license text, funding statement, competing-interest statement, author declarations block, data-availability statement, disclosed data or code repository URL, supplementary materials, and the scrape timestamp.

💡 Why it matters: preprints often surface findings weeks or months before journal publication. In fast-moving health topics (pandemics, vaccines, drug repurposing) waiting for the indexed-in-MEDLINE version means working from stale evidence. This Actor lets analysts track the live preprint frontier without manual click-through.

🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to filter, run, and export medRxiv preprints into Google Sheets and the BI tool of your choice.

⚙️ Input

Field	Type	Required	Description
startUrl	string	no	Direct medRxiv.org search URL. Apply filters on medrxiv.org and paste the URL. Mutually exclusive with the filter fields below.
maxItems	integer	no	Maximum number of preprints to return. Free plan: capped at 10. Paid plan: up to 1,000,000.
searchQuery	string	no	Free-text search query (the same string you would type in the medRxiv search box).
subjectCollection	enum	no	One of 50 medRxiv subject collections (slugs like epidemiology, cardiovascular-medicine, infectious-diseases, hiv-aids).
dateFrom	string	no	Lower bound for posting date, YYYY-MM-DD.
dateTo	string	no	Upper bound for posting date, YYYY-MM-DD.
author	string	no	Restrict to preprints with this author name in the author list.
orderBy	enum	no	relevance (best match), newest (newest first), or oldest (oldest first).

Example: pull the 50 most recent preprints matching "antimicrobial resistance" in the Infectious Diseases collection, posted from 1 January 2026 onward and sorted newest first.

{
    "searchQuery": "antimicrobial resistance",
    "subjectCollection": "infectious-diseases",
    "dateFrom": "2026-01-01",
    "orderBy": "newest",
    "maxItems": 50
}

Example: paste a pre-filtered medRxiv search URL and grab the first 200 results.

{
    "startUrl": "https://www.medrxiv.org/search/bacterial%20infection",
    "maxItems": 200
}
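If you assemble inputs in code rather than by hand, a small helper can catch mistakes before a run is started. The sketch below is one possible way to do it, not part of the Actor: `build_input` is a hypothetical name, the field names and enum values come from the input table, and treating `orderBy` as one of the fields that `startUrl` excludes is an assumption.

```python
import re
from datetime import date

# Field names and enum values are taken from the input table above.
FILTER_FIELDS = ("searchQuery", "subjectCollection", "dateFrom", "dateTo", "author", "orderBy")
ORDER_BY_VALUES = ("relevance", "newest", "oldest")
ISO_DAY = re.compile(r"\d{4}-\d{2}-\d{2}")


def build_input(start_url=None, max_items=10, **filters):
    """Assemble an input dict, rejecting combinations the table rules out."""
    unknown = set(filters) - set(FILTER_FIELDS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    filters = {key: value for key, value in filters.items() if value is not None}
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")
    if start_url is not None:
        # Assumption: orderBy counts as one of the filter fields startUrl excludes.
        if filters:
            raise ValueError("startUrl is mutually exclusive with the filter fields")
        return {"startUrl": start_url, "maxItems": max_items}
    for key in ("dateFrom", "dateTo"):
        value = filters.get(key)
        if value is not None:
            if not ISO_DAY.fullmatch(value):
                raise ValueError(f"{key} must be YYYY-MM-DD, got {value!r}")
            date.fromisoformat(value)  # rejects impossible dates such as 2026-02-30
    # Zero-padded ISO dates sort correctly as plain strings.
    if "dateFrom" in filters and "dateTo" in filters and filters["dateFrom"] > filters["dateTo"]:
        raise ValueError("dateFrom must not be later than dateTo")
    order_by = filters.get("orderBy")
    if order_by is not None and order_by not in ORDER_BY_VALUES:
        raise ValueError(f"orderBy must be one of {ORDER_BY_VALUES}")
    return {**filters, "maxItems": max_items}
```

Calling `build_input(searchQuery="antimicrobial resistance", subjectCollection="infectious-diseases", dateFrom="2026-01-01", orderBy="newest", max_items=50)` reproduces the first example input; passing `start_url` together with any filter raises `ValueError`.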

⚠️ Good to Know: medRxiv sits behind Cloudflare and a Varnish edge that rate-limits datacenter IPs. The Actor routes every request through Apify Residential US proxies with per-attempt session rotation, so transient 503s retry transparently. Free-plan runs are capped at 10 preprints; upgrade to a paid plan for full batches.
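The retry behaviour described above follows a common pattern: a fresh proxy session for each attempt, plus exponential backoff between attempts. The sketch below illustrates that pattern only; it is not the Actor's source code. `fetch_with_retries` and the injected `fetch(url, session_id)` callable are hypothetical, and the delay values and the jitter are assumptions.

```python
import random
import time
import uuid

RETRYABLE_STATUSES = {503}  # the transient status the note above mentions


def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, session_id) until it returns a non-retryable status.

    `fetch` is supplied by the caller and returns (status_code, body). How the
    session id is mapped onto a proxy session is left to that callable.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        session_id = uuid.uuid4().hex  # new session id for every attempt
        status, body = fetch(url, session_id)
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)  # 1s, 2s, 4s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise RuntimeError(f"{url} still returned HTTP {status} after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.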

📊 Output

Each record is one medRxiv preprint, normalised across subject collections.

🧾 Schema

Field	Type	Example
🔗 `url`	string	`https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1`
📝 `title`	string	`Genomic epidemiology of ESBL-producing Escherichia coli and Klebsiella pneumoniae...`
👥 `authors`	string[]	`["Germanie Delaisie Abomo", "Gabriel Cedric Bessala", "..."]`
🧑‍🔬 `authorDetails`	object[]	`[{ "name": "...", "affiliation": "..." }, ...]`
📅 `publicationDate`	string	`Posted March 18, 2026.`
🔢 `doi`	string	`https://doi.org/10.64898/2026.03.16.26348538`
📄 `abstract`	string	`Background Livestock production systems in peri-urban areas are associated with...`
📰 `fullText`	string	Full article body text
🗂️ `subjectAreas`	string[]	`["Epidemiology"]`
📑 `pdfUrl`	string	`https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1.full.pdf`
📚 `citationInformation`	string	Auto-generated citation string
📜 `licenseInformation`	string	`It is made available under a CC-BY 4.0 International license.`
💰 `fundingStatement`	string	Funding sources disclosed by authors
⚖️ `competingInterestStatement`	string	Competing-interest declaration
🪪 `authorDeclarations`	string	Author declarations block (IRB, consent, etc.)
🗄️ `dataAvailability`	string	Data-availability statement
🧪 `dataCodeUrl`	string	`https://data.wastewaterscan.org/about`
📎 `supplementaryMaterials`	string[]	Links to supplementary files
🕒 `scrapedTimestamp`	ISO date	`2026-05-18T01:30:00.000Z`
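Judging by the examples in the table, `publicationDate` arrives as a display string ("Posted March 18, 2026.") and `doi` as a full URL. If your downstream tables need a real date and a bare DOI, a post-processing step along these lines may help. This is a sketch built only on the example values shown above; `postedOn` is a hypothetical key added for illustration, and other records may be formatted differently.

```python
import re
from datetime import datetime

DOI_PREFIX = "https://doi.org/"
POSTED_PATTERN = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")


def parse_posted_date(raw):
    """Turn 'Posted March 18, 2026.' into a date; return None if unparseable."""
    match = POSTED_PATTERN.search(raw or "")
    if not match:
        return None
    try:
        # %B matches English month names under the default C/English locale.
        return datetime.strptime(match.group(0), "%B %d, %Y").date()
    except ValueError:
        return None


def bare_doi(raw):
    """Strip the https://doi.org/ resolver prefix if it is present."""
    raw = raw or ""
    return raw[len(DOI_PREFIX):] if raw.startswith(DOI_PREFIX) else raw


def normalise(record):
    """Return a copy of the record with a parsed date and a bare DOI."""
    out = dict(record)
    out["postedOn"] = parse_posted_date(record.get("publicationDate"))  # hypothetical key
    out["doi"] = bare_doi(record.get("doi"))
    return out
```

The original `publicationDate` string is left untouched, so nothing is lost if parsing fails.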

📦 Sample records

✨ Why choose this Actor

🪄	Capability
🧬	Full health-science coverage. All 50 medRxiv subject collections, from Addiction Medicine to Urology, in a single Actor.
🔁	Filter or paste. Drive the search from filter fields, or paste a pre-filtered medrxiv.org URL.
🛡️	Cloudflare and Varnish handled. Apify Residential US proxies with per-attempt session rotation absorb the rate limits that block direct fetches.
📦	18 normalised fields. Same shape across every collection, so an epidemiology preprint and a cardiology preprint drop into the same table.
📅	Date and author filters. Narrow by posting date range or author name without touching the UI.
💸	Pay-per-event or flat. Compatible with both Apify pricing models.
🪵	Resilient. Per-request session rotation plus short exponential backoff retries shrug off the transient 503s that medRxiv's edge throws under load.

📊 medRxiv has hosted 210,000+ preprints across 50 medical specialties since 2019, with hundreds of new posts every week.

📈 How it compares to alternatives

Approach	Cost	Coverage	Refresh	Filters	Setup
⭐ medRxiv Preprint Scraper (this Actor)	Pay only for runs	All 50 subject collections, 210,000+ preprints	On-demand	Keyword, collection, date range, author, sort	None
Official PubMed mirrors	Free	Indexed-after-peer-review only	Days to weeks behind preprints	MeSH terms, not preprint fields	API key, query syntax
Paid scholarly aggregators	Subscription	Mixed; preprints often partial	Daily	Vendor-specific	Sales contract
Manual browser saves	Time	Limited	Manual	Manual	Slow
Generic web-scraping tools	DIY	Brittle, breaks on edge updates	DIY	DIY	High

Most teams already have a citation database. This Actor is the live-preprint layer that sits in front of it.

🚀 How to use

🔐 Sign up. Create a free Apify account (no credit card needed for the free tier).
🔎 Open the Actor. Search the Apify Store for "medRxiv" or open this page.
🧩 Fill in input. Either paste a medRxiv search URL into startUrl, or set searchQuery, subjectCollection, dateFrom / dateTo, and author as needed.
▶️ Click Start. The Actor uses Apify Residential US proxies to load and parse search and detail pages.
📥 Export. Download the dataset as JSON, CSV, or Excel, or push it directly to Google Sheets, BigQuery, S3, or a webhook.

⏱️ Total time: about 2 minutes from sign-up to first export.

💼 Business use cases

🏥 Clinical research teams

Run weekly searches in a therapeutic area to surface new preprints
Build evidence dashboards for internal review committees
Track preprints by competing labs or trial sponsors
Feed a regulatory-affairs intelligence pipeline
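A weekly search like the first item above could be driven by a rolling date window computed at run time. The sketch below shows one way to build such an input; `weekly_input` is a hypothetical helper, the default of 200 items is arbitrary, and whether `dateFrom` and `dateTo` are inclusive is not stated in the input table, so consecutive windows may overlap by a day.

```python
from datetime import date, timedelta


def weekly_input(search_query, subject_collection, today=None, days=7, max_items=200):
    """Build an Actor input covering the `days` days up to `today`."""
    today = today or date.today()
    return {
        "searchQuery": search_query,
        "subjectCollection": subject_collection,
        "dateFrom": (today - timedelta(days=days)).isoformat(),  # YYYY-MM-DD
        "dateTo": today.isoformat(),
        "orderBy": "newest",
        "maxItems": max_items,
    }
```

For example, `weekly_input("heart failure", "cardiovascular-medicine", today=date(2026, 5, 18))` covers 2026-05-11 through 2026-05-18. Deduplicating on `doi` downstream guards against any overlap between runs.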

🦠 Public-health and epidemiology

Surveil outbreak literature (respiratory, vector-borne, AMR) in near real time
Track vaccine effectiveness and uptake research as it posts
Map study-design trends across regions and pathogens
Brief ministries and public-health agencies on emerging evidence

📊 Bibliometric and policy analysts

Build datasets on topic emergence and growth curves
Map author networks and institutional output across subject collections
Compare preprint-to-publication conversion rates over time
Quantify funding-statement patterns by topic and country
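As a starting point for the author-network item above, co-authorship edges can be counted directly from the `authors` array of each record. This is a minimal sketch: `coauthor_edges` is a hypothetical helper, and it matches names as exact strings, so spelling variants of the same person are counted as different authors.

```python
from collections import Counter
from itertools import combinations


def coauthor_edges(records):
    """Count how many preprints each pair of authors shares."""
    edges = Counter()
    for record in records:
        # Sort so each pair has one canonical order; set() drops repeated names.
        authors = sorted(set(record.get("authors") or []))
        for pair in combinations(authors, 2):
            edges[pair] += 1
    return edges
```

The resulting counter maps `(author_a, author_b)` tuples to shared-preprint counts and can be loaded into a graph library or exported as an edge list.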

📰 Science journalism and comms

Spot newsworthy preprints in a beat within hours of posting
Pull DOIs and abstracts directly into an editorial CMS
Track corrections and version updates across an evidence story
Build "what's new on medRxiv this week" digests

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

Empirical datasets for papers, thesis work, and coursework
Longitudinal studies tracking changes across snapshots
Reproducible research with cited, versioned data pulls
Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

Side projects, portfolio demos, and indie app launches
Data visualizations, dashboards, and infographics
Content research for bloggers, YouTubers, and podcasters
Hobbyist collections and personal trackers

🤝 Non-profit and civic

Transparency reporting and accountability projects
Advocacy campaigns backed by public-interest data
Community-run databases for local issues
Investigative journalism on public records

🧪 Experimentation

Prototype AI and machine-learning pipelines with real data
Validate product-market hypotheses before engineering spend
Train small domain-specific models on niche corpora
Test dashboard concepts with live input

🔌 Automating medRxiv Preprint Scraper

Trigger this Actor from any code path or scheduler.

🟨 Node.js / JavaScript: use the apify-client npm package.
🐍 Python: use the apify-client PyPI package.
📘 HTTP API: see the Apify API docs for direct REST calls.

Schedule it on Apify with Schedules to refresh evidence pulls daily, weekly, or hourly. Pipe the dataset to Google Sheets, BigQuery, Snowflake, or your own webhook for downstream analytics.
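For the Python route, a run can be started and its dataset read back with the `apify-client` package. The sketch below is a minimal example rather than official usage for this Actor: `run_scraper` and `make_client` are hypothetical helpers, the Actor ID is a placeholder to be copied from the Actor's page, and the token is your own Apify API token.

```python
def make_client(token):
    """Create an Apify API client (requires `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily so the helpers load without it

    return ApifyClient(token)


def run_scraper(client, actor_id, run_input):
    """Start the Actor, wait for the run to finish, and return its dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (placeholders, not real values):
#   client = make_client("<YOUR_APIFY_TOKEN>")
#   items = run_scraper(client, "<username>/<actor-name>", {
#       "searchQuery": "antimicrobial resistance",
#       "orderBy": "newest",
#       "maxItems": 50,
#   })
#   print(len(items), "preprints")
```

Taking the client as a parameter keeps `run_scraper` easy to exercise with a stub in tests, and the same function can be reused from a scheduler or a notebook.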

❓ Frequently Asked Questions

🔌 Integrate with any app

Pipe medRxiv data into the tools you already use.

Google Sheets - drop preprint records into a live sheet for analysts.
BigQuery - warehouse evidence pulls for SQL analytics.
Webhooks - notify Slack, Zapier, or your own service when a run completes.
Airbyte - sync to Snowflake, Postgres, or any Airbyte destination.
Zapier - trigger downstream automations on every new run.
Make - build no-code workflows around scraped preprints.

🔗 Recommended Actors

🧪 bioRxiv & medRxiv Preprint Scraper - sister Actor covering bioRxiv and the broader preprint network.
📚 PubMed Citation Scraper - peer-reviewed biomedical citations from PubMed.
🌐 OpenAlex Scholarly Works Scraper - cross-disciplinary scholarly metadata at scale.
📐 arXiv Preprint Scraper - preprint coverage for physics, math, CS, and quantitative biology.
🧠 Semantic Scholar Scraper - citation graph and paper metadata across academic disciplines.

💡 Pro Tip: browse the complete ParseForge collection for more research, scholarly, and health-data scrapers.

🆘 Need Help? Open our contact form and we will reply within one business day.

⚠️ Disclaimer: This Actor extracts publicly accessible preprint data from medRxiv.org for legitimate research, evidence-surveillance, and analytics purposes. It does not collect personal user data beyond what authors and institutions publicly publish on their preprints. Use of this Actor is at your own risk and subject to medRxiv.org's terms of service and the per-preprint Creative Commons or CC0 license. ParseForge is not affiliated with, endorsed by, or sponsored by medRxiv, Cold Spring Harbor Laboratory, BMJ, or Yale University.
