medRxiv Scraper
Pricing
Pay per event
Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.
Developer
ParseForge
Maintained by Community

🧬 medRxiv Preprint Scraper
🚀 Scrape medRxiv health-science preprints in seconds. Filter by topic, subject collection, date range, or author. No API key, no registration, no manual CSV wrangling.
🕒 Last updated: 2026-05-18 · 📊 18 fields per record · 210,000+ live preprints · 50 subject collections · 3 sort orders
medRxiv is the leading open preprint server for the health sciences, hosting clinical research, epidemiology, public health, and biomedical work months before formal peer review. This Actor turns any medRxiv search (keyword, subject collection, posted-date window, or author) into a structured dataset of full preprint records, complete with title, all authors, posting date, DOI, abstract, full text, PDF link, license, funding statement, competing-interest declarations, data-availability statement, and any data or code repository link the authors disclose. The output drops straight into Google Sheets, BigQuery, Postgres, Notion, or any other tool your team already uses.
Preprint data is hard to harvest at scale. medRxiv exposes a faceted search interface but no public bulk API for researchers; the site sits behind Cloudflare and a Varnish edge that rate-limits direct datacenter traffic. This Actor closes the gap: pick a search query, narrow it with subject collection, date range, or author filters, set how many records you want, and the data lands in your dataset within minutes. Systematic reviewers, clinical research teams, public-health analysts, bibliometric researchers, science journalists, and AI training-set curators all use this kind of feed for living reviews, evidence surveillance, citation tracking, and topic modelling on emerging health research.
| 👥 Target audience | 🎯 Primary use case |
|---|---|
| Systematic reviewers and meta-analysts | Run living searches across health-science preprints |
| Clinical research and trial teams | Track new evidence in a therapeutic area in near real time |
| Public-health analysts and epidemiologists | Surveil outbreak, vaccine, and intervention literature as it lands |
| Bibliometric and science-policy researchers | Build datasets on topic emergence, author networks, and funding patterns |
| Science journalists and editors | Spot newsworthy preprints in a chosen field within hours of posting |
| AI/ML and NLP teams | Curate domain-specific corpora of biomedical text for training and evaluation |
📋 What the medRxiv Preprint Scraper does
- 🔍 Any keyword search. Drive results from a free-text query just like the medRxiv search box.
- 🗂️ 50 subject collections. Restrict to a specific medical specialty (Epidemiology, Cardiovascular Medicine, Infectious Diseases, HIV/AIDS, Public and Global Health, and 45 more).
- 📅 Date range filter. Slice by posting date with `dateFrom` and `dateTo` (YYYY-MM-DD).
- 👤 Author filter. Narrow to preprints where a given author name appears.
- 📰 Full record per preprint. Title, full author list, posting date, DOI, abstract, full text, PDF URL, license, funding, competing interests, data-availability statement, and disclosed data or code URLs.
- 🔁 Sort order. Choose newest first, oldest first, or relevance-ranked.
Each record returns the article URL, title, author list (with affiliation metadata where exposed), posting date, DOI, abstract, full text body, subject area assignment from medRxiv, PDF link, citation string, license text, funding statement, competing-interest statement, author declarations block, data-availability statement, disclosed data or code repository URL, supplementary materials, and the scrape timestamp.
💡 Why it matters: preprints often surface findings weeks or months before journal publication. In fast-moving health topics (pandemics, vaccines, drug repurposing), waiting for the MEDLINE-indexed version means working from stale evidence. This Actor lets analysts track the live preprint frontier without manual click-through.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to filter, run, and export medRxiv preprints into Google Sheets and the BI tool of your choice.
⚙️ Input
| Field | Type | Required | Description |
|---|---|---|---|
| `startUrl` | string | no | Direct medRxiv.org search URL. Apply filters on medrxiv.org and paste the URL. Mutually exclusive with the filter fields below. |
| `maxItems` | integer | no | Maximum number of preprints to return. Free plan: capped at 10. Paid plan: up to 1,000,000. |
| `searchQuery` | string | no | Free-text search query (the same string you would type in the medRxiv search box). |
| `subjectCollection` | enum | no | One of 50 medRxiv subject collections (slugs like `epidemiology`, `cardiovascular-medicine`, `infectious-diseases`, `hiv-aids`). |
| `dateFrom` | string | no | Lower bound for posting date, YYYY-MM-DD. |
| `dateTo` | string | no | Upper bound for posting date, YYYY-MM-DD. |
| `author` | string | no | Restrict to preprints with this author name in the author list. |
| `orderBy` | enum | no | `relevance` (best match), `newest` (newest first), or `oldest` (oldest first). |
Example: pull the 50 newest infectious-disease preprints matching "antimicrobial resistance" that were posted on or after 2026-01-01.
{"searchQuery": "antimicrobial resistance","subjectCollection": "infectious-diseases","dateFrom": "2026-01-01","orderBy": "newest","maxItems": 50}
Example: paste a pre-filtered medRxiv search URL and grab the first 200 results.
{"startUrl": "https://www.medrxiv.org/search/bacterial%20infection","maxItems": 200}
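The two input styles above can be checked locally before a run is started. The following is a minimal sketch, not the Actor's own validation: `build_run_input` and `keyword_search_url` are hypothetical helper names, the list of fields treated as clashing with `startUrl` is an assumption based on the input table, and the URL builder only covers the plain keyword form shown in the example.

```python
import re
from datetime import date
from urllib.parse import quote

# Assumption: these are the "filter fields" that cannot be combined with startUrl.
FILTER_FIELDS = ("searchQuery", "subjectCollection", "dateFrom", "dateTo", "author")
ORDER_BY_VALUES = ("relevance", "newest", "oldest")


def keyword_search_url(query):
    """Build a plain keyword search URL in the same shape as the example above."""
    return "https://www.medrxiv.org/search/" + quote(query, safe="")


def _parse_day(name, value):
    """Accept only zero-padded YYYY-MM-DD strings that are real calendar days."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        raise ValueError(f"{name} must be YYYY-MM-DD, got {value!r}")
    return date.fromisoformat(value)  # raises ValueError for e.g. 2026-02-30


def build_run_input(**fields):
    """Drop unset fields, validate the rest, and return the input dict."""
    run_input = {key: value for key, value in fields.items() if value is not None}

    if "startUrl" in run_input:
        clash = [name for name in FILTER_FIELDS if name in run_input]
        if clash:
            raise ValueError("startUrl cannot be combined with: " + ", ".join(clash))

    days = {
        name: _parse_day(name, run_input[name])
        for name in ("dateFrom", "dateTo")
        if name in run_input
    }
    if len(days) == 2 and days["dateFrom"] > days["dateTo"]:
        raise ValueError("dateFrom must not be later than dateTo")

    if "orderBy" in run_input and run_input["orderBy"] not in ORDER_BY_VALUES:
        raise ValueError("orderBy must be one of: " + ", ".join(ORDER_BY_VALUES))

    max_items = run_input.get("maxItems")
    if max_items is not None and (not isinstance(max_items, int) or max_items < 1):
        raise ValueError("maxItems must be a positive integer")

    return run_input
```

Calling `build_run_input(startUrl=keyword_search_url("bacterial infection"), maxItems=200)` reproduces the second example, while passing `startUrl` together with `author` raises a `ValueError` before any run is billed.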
⚠️ Good to Know: medRxiv sits behind Cloudflare and a Varnish edge that rate-limits datacenter IPs. The Actor routes every request through Apify Residential US proxies with per-attempt session rotation, so transient 503s retry transparently. Free-plan runs are capped at 10 preprints; upgrade to a paid plan for full batches.
📊 Output
Each record is one medRxiv preprint, normalised across subject collections.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔗 `url` | string | https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1 |
| 📝 `title` | string | Genomic epidemiology of ESBL-producing Escherichia coli and Klebsiella pneumoniae... |
| 👥 `authors` | string[] | ["Germanie Delaisie Abomo", "Gabriel Cedric Bessala", "..."] |
| 🧑‍🔬 `authorDetails` | object[] | [{ "name": "...", "affiliation": "..." }, ...] |
| 📅 `publicationDate` | string | Posted March 18, 2026. |
| 🔢 `doi` | string | https://doi.org/10.64898/2026.03.16.26348538 |
| 📄 `abstract` | string | Background Livestock production systems in peri-urban areas are associated with... |
| 📰 `fullText` | string | Full article body text |
| 🗂️ `subjectAreas` | string[] | ["Epidemiology"] |
| 📑 `pdfUrl` | string | https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1.full.pdf |
| 📚 `citationInformation` | string | Auto-generated citation string |
| 📜 `licenseInformation` | string | It is made available under a CC-BY 4.0 International license. |
| 💰 `fundingStatement` | string | Funding sources disclosed by authors |
| ⚖️ `competingInterestStatement` | string | Competing-interest declaration |
| 🪪 `authorDeclarations` | string | Author declarations block (IRB, consent, etc.) |
| 🗄️ `dataAvailability` | string | Data-availability statement |
| 🧪 `dataCodeUrl` | string | https://data.wastewaterscan.org/about |
| 📎 `supplementaryMaterials` | string[] | Links to supplementary files |
| 🕒 `scrapedTimestamp` | ISO date | 2026-05-18T01:30:00.000Z |
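Two of the fields in the schema arrive as display strings rather than machine-friendly values: `publicationDate` carries the "Posted ..." wording and `doi` is a full resolver URL. The sketch below shows one way to normalise them after export. The helper names are hypothetical, and it assumes every record follows the English "Posted Month D, YYYY." pattern shown in the example row, which the source does not guarantee.

```python
import re
from datetime import date, datetime


def parse_posted_date(text):
    """Extract the calendar date from a string like 'Posted March 18, 2026.'."""
    match = re.search(r"([A-Z][a-z]+ \d{1,2}, \d{4})", text)
    if match is None:
        raise ValueError(f"unrecognised publicationDate: {text!r}")
    # %B expects English month names under the default C locale.
    return datetime.strptime(match.group(1), "%B %d, %Y").date()


def bare_doi(doi_url):
    """Strip the doi.org resolver prefix, leaving only the DOI itself."""
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi_url)


def normalise(record):
    """Return a copy of a record with ISO posting date and bare DOI added."""
    cleaned = dict(record)
    cleaned["postedOn"] = parse_posted_date(record["publicationDate"]).isoformat()
    cleaned["doiBare"] = bare_doi(record["doi"])
    return cleaned
```

Keeping the original fields and adding `postedOn` and `doiBare` alongside them (both names are illustrative) means a malformed value can still be traced back to what the Actor returned.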
📦 Sample records
✨ Why choose this Actor
| 🪄 | Capability |
|---|---|
| 🧬 | Full health-science coverage. All 50 medRxiv subject collections, from Addiction Medicine to Urology, in a single Actor. |
| 🔁 | Filter or paste. Drive the search from filter fields, or paste a pre-filtered medrxiv.org URL. |
| 🛡️ | Cloudflare and Varnish handled. Apify Residential US proxies with per-attempt session rotation absorb the rate limits that block direct fetches. |
| 📦 | 18 normalised fields. Same shape across every collection, so an epidemiology preprint and a cardiology preprint drop into the same table. |
| 📅 | Date and author filters. Narrow by posting date range or author name without touching the UI. |
| 💸 | Pay-per-event or flat. Compatible with both Apify pricing models. |
| 🪵 | Resilient. Per-request session rotation plus short exponential backoff retries shrug off the transient 503s that medRxiv's edge throws under load. |
📊 medRxiv has hosted 210,000+ preprints across 50 medical specialties since 2019, with hundreds of new posts every week.
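The retry behaviour named in the table above (a fresh session per attempt plus short exponential backoff on 503s) can be illustrated generically. This is a sketch of the technique only, not the Actor's internal code: `fetch_with_backoff`, its parameters, and the choice of four attempts with a 0.5 s base delay are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[int], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(attempt) until it returns HTTP 200, backing off on 503.

    fetch receives the attempt number so the caller can pick a new proxy
    session for each try, and returns a (status_code, body) pair.
    """
    for attempt in range(max_attempts):
        status, body = fetch(attempt)
        if status == 200:
            return body
        last_try = attempt == max_attempts - 1
        if status != 503 or last_try:
            raise RuntimeError(f"giving up after attempt {attempt + 1}: HTTP {status}")
        # Delay doubles each time (0.5 s, 1 s, 2 s, ...) with up to 50% jitter
        # so that parallel workers do not retry in lockstep.
        delay = base_delay * 2 ** attempt
        sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing the attempt number into `fetch` keeps session rotation in the caller's hands, and injecting `sleep` makes the function easy to exercise without real waiting.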
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ medRxiv Preprint Scraper (this Actor) | Pay only for runs | All 50 subject collections, 210,000+ preprints | On-demand | Keyword, collection, date range, author, sort | None |
| Official PubMed mirrors | Free | Indexed-after-peer-review only | Days to weeks behind preprints | MeSH terms, not preprint fields | API key, query syntax |
| Paid scholarly aggregators | Subscription | Mixed; preprints often partial | Daily | Vendor-specific | Sales contract |
| Manual browser saves | Time | Limited | Manual | Manual | Slow |
| Generic web-scraping tools | DIY | Brittle, breaks on edge updates | DIY | DIY | High |
Most teams already have a citation database. This Actor is the live-preprint layer that sits in front of it.
🚀 How to use
- 🔐 Sign up. Create a free Apify account (no credit card needed for the free tier).
- 🔎 Open the Actor. Search the Apify Store for "medRxiv" or open this page.
- 🧩 Fill in input. Either paste a medRxiv search URL into `startUrl`, or set `searchQuery`, `subjectCollection`, `dateFrom`/`dateTo`, and `author` as needed.
- ▶️ Click Start. The Actor uses Apify Residential US proxies to load and parse search and detail pages.
- 📥 Export. Download the dataset as JSON, CSV, or Excel, or push it directly to Google Sheets, BigQuery, a webhook, or S3.
⏱️ Total time: about 2 minutes from sign-up to first export.
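If you export JSON and want control over how the array fields land in a spreadsheet, a small post-processing step is enough. The sketch below flattens `string[]` fields such as `authors` into a single cell; the column selection, the `"; "` separator, and the `records_to_csv` name are illustrative choices, not part of the Actor.

```python
import csv
import io
import json

# Illustrative subset of the schema; extend it with any other field names.
COLUMNS = ["url", "title", "authors", "publicationDate", "doi", "subjectAreas", "pdfUrl"]


def records_to_csv(records, columns=COLUMNS):
    """Serialise preprint records to CSV text, joining list fields with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for record in records:
        row = {}
        for column in columns:
            value = record.get(column, "")
            if isinstance(value, list):
                # Strings are joined directly; objects fall back to JSON text.
                value = "; ".join(
                    item if isinstance(item, str) else json.dumps(item)
                    for item in value
                )
            row[column] = value
        writer.writerow(row)
    return buffer.getvalue()
```

The `csv` module handles quoting, so titles containing commas or quotes survive the round trip.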
💼 Business use cases
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🔌 Automating medRxiv Preprint Scraper
Trigger this Actor from any code path or scheduler.
- 🟨 Node.js / JavaScript: use the apify-client npm package.
- 🐍 Python: use the apify-client PyPI package.
- 📘 HTTP API: see the Apify API docs for direct REST calls.
Use Apify Schedules to refresh evidence pulls hourly, daily, or weekly. Pipe the dataset to Google Sheets, BigQuery, Snowflake, or your own webhook for downstream analytics.
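For the Python route listed above, a run can be started and its dataset read back in a few lines. This is a minimal sketch: `ACTOR_ID` is a placeholder because the source does not state the Actor's ID, and `run_scraper` is a hypothetical wrapper around the `apify-client` calls `actor(...).call(...)` and `dataset(...).iterate_items()`.

```python
# Placeholder, not the real ID: copy the Actor ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"


def run_scraper(run_input, token, actor_id=ACTOR_ID):
    """Start one Actor run, wait for it to finish, and return its records."""
    # Imported here so the module loads even where apify-client is missing
    # (install it with: pip install apify-client).
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return run details")
    # Each dataset item is one preprint record in the schema shown earlier.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A typical call reads the token from the environment, for example `run_scraper({"searchQuery": "antimicrobial resistance", "maxItems": 10}, token=os.environ["APIFY_TOKEN"])`, which keeps credentials out of the source file.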
❓ Frequently Asked Questions
🔌 Integrate with any app
Pipe medRxiv data into the tools you already use.
- Google Sheets - drop preprint records into a live sheet for analysts.
- BigQuery - warehouse evidence pulls for SQL analytics.
- Webhooks - notify Slack, Zapier, or your own service when a run completes.
- Airbyte - sync to Snowflake, Postgres, or any Airbyte destination.
- Zapier - trigger downstream automations on every new run.
- Make - build no-code workflows around scraped preprints.
🔗 Recommended Actors
- 🧪 bioRxiv & medRxiv Preprint Scraper - sister Actor covering bioRxiv and the broader preprint network.
- 📚 PubMed Citation Scraper - peer-reviewed biomedical citations from PubMed.
- 🌐 OpenAlex Scholarly Works Scraper - cross-disciplinary scholarly metadata at scale.
- 📐 arXiv Preprint Scraper - preprint coverage for physics, math, CS, and quantitative biology.
- 🧠 Semantic Scholar Scraper - citation graph and paper metadata across academic disciplines.
💡 Pro Tip: browse the complete ParseForge collection for more research, scholarly, and health-data scrapers.
🆘 Need Help? Open our contact form and we will reply within one business day.
⚠️ Disclaimer: This Actor extracts publicly accessible preprint data from medRxiv.org for legitimate research, evidence-surveillance, and analytics purposes. It does not collect personal user data beyond what authors and institutions publicly publish on their preprints. Use of this Actor is at your own risk and subject to medRxiv.org's terms of service and the per-preprint Creative Commons or CC0 license. ParseForge is not affiliated with, endorsed by, or sponsored by medRxiv, Cold Spring Harbor Laboratory, BMJ, or Yale University.