Pricing

Pay per event

medRxiv Scraper

Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.

Rating: 0.0 (0)

Developer: ParseForge

Last modified: 6 days ago

🧬 medRxiv Preprint Scraper

🚀 Scrape medRxiv health-science preprints in seconds. Filter by topic, subject collection, date range, or author. No medRxiv API key, no medRxiv registration, no manual CSV wrangling.

medRxiv is the leading open preprint server for the health sciences, hosting clinical research, epidemiology, public health, and biomedical work months before formal peer review. This Actor turns any medRxiv search (keyword, subject collection, posted-date window, or author) into a structured dataset of full preprint records, complete with title, all authors, posting date, DOI, abstract, full text, PDF link, license, funding statement, competing-interest declarations, data-availability statement, and any data or code repository link the authors disclose. The output drops straight into Google Sheets, BigQuery, Postgres, Notion, or any other tool your team already uses.

Preprint data is hard to harvest at scale. medRxiv offers a faceted search interface, but its public API is organized around date intervals and DOIs rather than free-text or author search, and the site sits behind Cloudflare and a Varnish edge that rate-limits direct datacenter traffic. This Actor closes the gap: pick a search query, narrow it with subject collection, date range, or author filters, set how many records you want, and the data lands in your dataset within minutes. Systematic reviewers, clinical research teams, public-health analysts, bibliometric researchers, science journalists, and AI training-set curators use this kind of feed for living reviews, evidence surveillance, citation tracking, and topic modelling on emerging health research.

👥 Target audience	🎯 Primary use case
Systematic reviewers and meta-analysts	Run living searches across health-science preprints
Clinical research and trial teams	Track new evidence in a therapeutic area in near real time
Public-health analysts and epidemiologists	Surveil outbreak, vaccine, and intervention literature as it lands
Bibliometric and science-policy researchers	Build datasets on topic emergence, author networks, and funding patterns
Science journalists and editors	Spot newsworthy preprints in a chosen field within hours of posting
AI/ML and NLP teams	Curate domain-specific corpora of biomedical text for training and evaluation

📋 What the medRxiv Preprint Scraper does

🔍 Any keyword search. Drive results from a free-text query just like the medRxiv search box.
🗂️ 50 subject collections. Restrict to a specific medical specialty (Epidemiology, Cardiovascular Medicine, Infectious Diseases, HIV/AIDS, Public and Global Health, and 45 more).
📅 Date range filter. Slice by posting date with dateFrom and dateTo (YYYY-MM-DD).
👤 Author filter. Narrow to preprints where a given author name appears.
📰 Full record per preprint. Title, full author list, posting date, DOI, abstract, full text, PDF URL, license, funding, competing interests, data-availability statement, and disclosed data or code URLs.
🔁 Sort order. Choose newest first, oldest first, or relevance-ranked.

Each record returns the article URL, title, author list (with affiliation metadata where exposed), posting date, DOI, abstract, full text body, subject area assignment from medRxiv, PDF link, citation string, license text, funding statement, competing-interest statement, author declarations block, data-availability statement, disclosed data or code repository URL, supplementary materials, and the scrape timestamp.

💡 Why it matters: preprints often surface findings weeks or months before journal publication. In fast-moving health topics (pandemics, vaccines, drug repurposing) waiting for the indexed-in-MEDLINE version means working from stale evidence. This Actor lets analysts track the live preprint frontier without manual click-through.

📊 Data fields

Each record includes: abstract, authorDeclarations, authorDetails, authors, citationInformation, competingInterestStatement, dataAvailability, dataCodeUrl, doi, fullText, fundingStatement, licenseInformation, pdfUrl, publicationDate, scrapedTimestamp, subjectAreas, title, url. All 18 field names come from a real production run, so what you see here is what lands in your dataset.
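To make the field list concrete, here is a minimal sketch of a sanity check you could run on exported records. It assumes each record is a plain dictionary keyed by the 18 names above; the value types are not documented on this page, so the check only looks at which keys are present. The helper name check_record and the sample record are illustrative, not part of the Actor.

```python
# The 18 field names listed above. Value types are not documented here,
# so this check only compares key names.
EXPECTED_FIELDS = frozenset({
    "abstract", "authorDeclarations", "authorDetails", "authors",
    "citationInformation", "competingInterestStatement", "dataAvailability",
    "dataCodeUrl", "doi", "fullText", "fundingStatement",
    "licenseInformation", "pdfUrl", "publicationDate", "scrapedTimestamp",
    "subjectAreas", "title", "url",
})


def check_record(record: dict) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) field names for one dataset record."""
    keys = set(record)
    missing = sorted(EXPECTED_FIELDS - keys)
    unexpected = sorted(keys - EXPECTED_FIELDS)
    return missing, unexpected


# Illustrative record with placeholder values, not real Actor output.
sample = {name: None for name in EXPECTED_FIELDS}
sample.pop("dataCodeUrl")      # simulate a field that is absent
sample["extraField"] = "x"     # simulate a field that is not in the list
print(check_record(sample))
```

A check like this is handy before loading records into a fixed-schema destination such as Postgres or BigQuery, where a missing or renamed column would otherwise fail late.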

⚠️ Good to Know: Use either startUrl or searchQuery, not both. When startUrl is set, the searchQuery and orderBy fields are ignored. Free users are automatically limited to 10 items per run.
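The input rules above can be sketched as a small helper that assembles the input object before a run. The field names (startUrl, searchQuery, subjectCollection, dateFrom, dateTo, author, orderBy) come from this page; the helper itself, its argument names, and the choice to reject a reversed date range are assumptions for illustration. The accepted orderBy values are not listed here, so the example leaves that field unset.

```python
from datetime import datetime


def _iso_day(value: str) -> str:
    """Parse a YYYY-MM-DD string and return it zero-padded; raises ValueError otherwise."""
    return datetime.strptime(value, "%Y-%m-%d").date().isoformat()


def build_input(*, start_url=None, search_query=None, subject_collection=None,
                date_from=None, date_to=None, author=None, order_by=None) -> dict:
    """Assemble an Actor input dict (hypothetical helper, not part of the Actor)."""
    if start_url and search_query:
        raise ValueError("set startUrl or searchQuery, not both")
    if start_url:
        # searchQuery and orderBy are ignored when startUrl is given,
        # so they are left out of the payload.
        payload = {"startUrl": start_url}
    else:
        payload = {"searchQuery": search_query, "orderBy": order_by}
    # The page does not say whether these filters combine with startUrl;
    # they are passed through unchanged here.
    payload.update({
        "subjectCollection": subject_collection,
        "author": author,
        "dateFrom": _iso_day(date_from) if date_from else None,
        "dateTo": _iso_day(date_to) if date_to else None,
    })
    # Zero-padded ISO dates compare correctly as strings.
    if payload["dateFrom"] and payload["dateTo"] and payload["dateFrom"] > payload["dateTo"]:
        raise ValueError("dateFrom must not be later than dateTo")
    # Drop unset options so the Actor's own defaults apply.
    return {key: value for key, value in payload.items() if value is not None}


print(build_input(search_query="long covid", date_from="2024-01-01", date_to="2024-06-30"))
```

Validating on the client side catches a malformed date or a conflicting startUrl before a paid run starts.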

🚀 How to use

🔐 Sign up. Create a free Apify account (no credit card needed for the free tier).
🔎 Open the Actor. Search the Apify Store for "medRxiv" or open this page.
🧩 Fill in input. Either paste a medRxiv search URL into startUrl, or set searchQuery, subjectCollection, dateFrom / dateTo, and author as needed.
▶️ Click Start. The Actor uses Apify Residential US proxies to load and parse search and detail pages.
📥 Export. Download the dataset as JSON, CSV, or Excel, or push directly to Google Sheets, BigQuery, a webhook, or S3.

⏱️ Total time: about 2 minutes from sign-up to first export.
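The same steps can also be scripted. Below is a minimal sketch, using only the Python standard library, that prepares a call to Apify's run-sync-get-dataset-items endpoint, which starts an Actor run and returns the dataset items in the response. The Actor ID is a hypothetical placeholder because this page does not state it; replace it with the "username~actor-name" value from the Actor's API tab. The function names are illustrative as well.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Hypothetical Actor ID: replace with the real "username~actor-name" value.
ACTOR_ID = "parseforge~medrxiv-scraper"


def build_run_request(token: str, run_input: dict,
                      actor_id: str = ACTOR_ID) -> urllib.request.Request:
    """Prepare a POST that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items?format=json"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(token: str, run_input: dict) -> list:
    """Send the request and decode the JSON array of records (needs network access)."""
    with urllib.request.urlopen(build_run_request(token, run_input)) as response:
        return json.load(response)
```

With a valid API token, run_actor(token, {"searchQuery": "sepsis"}) would return the records as a list of dictionaries. The synchronous endpoint is subject to a time limit, so larger jobs are usually better started asynchronously and read from the dataset once the run has finished.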

🔗 Recommended Actors

🧪 bioRxiv & medRxiv Preprint Scraper - sister Actor covering bioRxiv and the broader preprint network.
📚 PubMed Citation Scraper - peer-reviewed biomedical citations from PubMed.
🌐 OpenAlex Scholarly Works Scraper - cross-disciplinary scholarly metadata at scale.
📐 arXiv Preprint Scraper - preprint coverage for physics, math, CS, and quantitative biology.
🧠 Semantic Scholar Scraper - citation graph and paper metadata across academic disciplines.

💡 Pro Tip: browse the complete ParseForge collection for more research, scholarly, and health-data scrapers.

⚠️ Disclaimer: This Actor extracts publicly accessible preprint data from medRxiv.org for legitimate research, evidence-surveillance, and analytics purposes. It does not collect personal user data beyond what authors and institutions publicly publish on their preprints. Use of this Actor is at your own risk and subject to medRxiv.org's terms of service and the per-preprint Creative Commons or CC0 license. ParseForge is not affiliated with, endorsed by, or sponsored by medRxiv, Cold Spring Harbor Laboratory, BMJ, or Yale University.

🆘 Need Help?

If you hit a bug, have questions about setup, or need a scraper we haven't built yet, open our contact form or write to parseforge@protonmail.com. We also take on paid custom data projects.

For faster answers, join our Discord. It's the best place to get support and suggest new actors.
