medRxiv Scraper

Pricing

Pay per event

Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.

Developer

ParseForge

Maintained by Community

📚 medRxiv Scraper

🚀 Extract preprint research data from medRxiv in minutes. Filter by search query, sort order, or direct URL. No coding, no API keys required.

🕒 Last updated: 2026-04-16 · 📊 24 fields · 🔄 Runs on Apify cloud or locally · 📁 Export: JSON, CSV, Excel

The medRxiv Scraper collects detailed preprint article data from medRxiv.org, the health sciences preprint server. It extracts 24 structured fields per article, including titles, authors, abstracts, full text, DOIs, publication dates, subject areas, and citation metadata. Whether you need 10 articles for a quick review or 100,000+ records for a large-scale meta-analysis, this tool handles it with parallel fetching and automatic pagination.

Built for researchers conducting systematic reviews, academics tracking emerging health sciences research, pharmaceutical companies monitoring clinical study trends, and data teams building training datasets. The scraper connects directly to medRxiv search, follows pagination automatically, and delivers clean, structured data ready for analysis. No browser required, no rate-limiting headaches, just fast and reliable preprint data collection at any scale.

| Target Audience | Use Cases |
| --- | --- |
| Academic Researchers | Systematic literature reviews, meta-analyses |
| Pharmaceutical Companies | Clinical trial monitoring, drug research tracking |
| Data Scientists | Training dataset creation, NLP corpus building |
| Science Journalists | Trend tracking, breaking research discovery |
| Health Policy Analysts | Public health research aggregation |
| Biotech Startups | Competitive intelligence, R&D monitoring |

📋 What the medRxiv Scraper does

  • 📝 Extracts article titles and abstracts for literature review, keyword analysis, and research organization
  • 👥 Collects full author details including names, affiliations, and corresponding author contact information
  • 📖 Captures full text content and supplementary materials for deep analysis and text mining
  • 🔗 Gathers DOIs, PDF links, and citation metadata for bibliometric analysis and reference management
  • 📊 Pulls subject areas and keywords for topic filtering, categorization, and trend detection
  • 📅 Tracks publication dates and version history to identify emerging research and monitor updates

The scraper processes medRxiv search results page by page, extracting every available data field from each article listing. It handles pagination automatically, respects server limits, and delivers structured output ready for spreadsheets, databases, or analytics pipelines.

💡 Why it matters: medRxiv publishes thousands of health sciences preprints before peer review. Manually collecting this data is slow and error-prone. This scraper automates the entire process, giving you structured, analysis-ready data in minutes instead of days.


🎬 Full Demo

🚧 Coming soon...


⚙️ Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| startUrl | string | No | medRxiv search URL to scrape. Use for custom searches or specific page ranges. |
| searchQuery | string | No | Search term to find articles (e.g., "bacterial infection"). Ignored when startUrl is provided. |
| orderBy | string | No | Sort order: relevance (best match), oldest, or newest. Ignored when startUrl is provided. |
| maxItems | integer | No | Maximum articles to collect. Free users: limited to 10. Paid users: up to 1,000,000. |

Example 1: Search by keyword

{
    "searchQuery": "COVID-19 vaccine",
    "orderBy": "newest",
    "maxItems": 50
}

Example 2: Scrape from a direct URL

{
    "startUrl": "https://www.medrxiv.org/search/asd",
    "maxItems": 100
}

⚠️ Good to Know: You can use either startUrl or searchQuery, but not both at the same time. If you provide a startUrl, the searchQuery and orderBy fields are ignored. Free users are automatically limited to 10 items per run.
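The input rules above can be checked before a run is started. The following is a minimal sketch, not part of the Actor itself: `build_run_input` is a hypothetical helper, and the lower bound of 1 for `maxItems` is an assumption (the input table only states the upper limits).

```python
from typing import Optional

# Values taken from the input table above.
ALLOWED_ORDER = {"relevance", "oldest", "newest"}
MAX_ITEMS_CAP = 1_000_000


def build_run_input(
    search_query: Optional[str] = None,
    start_url: Optional[str] = None,
    order_by: Optional[str] = None,
    max_items: Optional[int] = None,
) -> dict:
    """Hypothetical helper: assemble an input object for the Actor."""
    if search_query and start_url:
        raise ValueError("Use either searchQuery or startUrl, not both")
    if order_by is not None and order_by not in ALLOWED_ORDER:
        raise ValueError(f"orderBy must be one of {sorted(ALLOWED_ORDER)}")
    # Assumption: at least 1 item; the documented cap is 1,000,000.
    if max_items is not None and not 1 <= max_items <= MAX_ITEMS_CAP:
        raise ValueError("maxItems must be between 1 and 1,000,000")

    run_input: dict = {}
    if start_url:
        # orderBy is dropped here because the Actor ignores it with startUrl.
        run_input["startUrl"] = start_url
    else:
        if search_query:
            run_input["searchQuery"] = search_query
        if order_by:
            run_input["orderBy"] = order_by
    if max_items is not None:
        run_input["maxItems"] = max_items
    return run_input
```

Failing early on a mixed `startUrl`/`searchQuery` input makes the conflict visible locally instead of relying on the Actor silently ignoring fields.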


📊 Output

🧾 Schema

| Emoji | Field | Type | Description |
| --- | --- | --- | --- |
| 📝 | title | string | Article title |
| 👥 | authors | array | List of author names |
| 📖 | abstract | string | Full article abstract |
| 📄 | fullText | string | Complete article text content |
| 🔗 | doi | string | Digital Object Identifier |
| 🌐 | url | string | Direct article URL |
| 📅 | publicationDate | string | Date the article was posted |
| 📂 | subjectAreas | array | Research subject categories |
| 🔑 | keywords | array | Article keywords |
| 👤 | correspondingAuthor | string | Primary contact author |
| 🏫 | authorAffiliations | array | Author institutional affiliations |
| 📎 | supplementaryMaterials | array | Links to supplementary data |
| 📚 | citationInfo | string | Citation metadata |
| 📄 | pdfUrl | string | Direct PDF download link |
| 🔄 | versionHistory | array | Article revision history |
| ⚖️ | license | string | License information |
| 💰 | fundingStatement | string | Funding disclosure |
| | competingInterests | string | Competing interests declaration |
| 📊 | dataAvailability | string | Data availability statement |
| 🏷️ | articleType | string | Type of article |
| 📅 | revisionDate | string | Last revision date |
| 🔢 | articleId | string | Unique article identifier |
| 📋 | relatedArticles | array | Links to related articles |
| | error | string | Error message if extraction failed |
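
Because every record can carry an `error` field, it can help to separate failed extractions from usable records before analysis. The sketch below is illustrative only: `split_records` and `missing_fields` are hypothetical helpers, and it assumes `error` is absent or empty on successful records, which the schema does not state explicitly.

```python
from typing import Any, Dict, List, Tuple

# The 24 field names listed in the schema table above.
SCHEMA_FIELDS = [
    "title", "authors", "abstract", "fullText", "doi", "url",
    "publicationDate", "subjectAreas", "keywords", "correspondingAuthor",
    "authorAffiliations", "supplementaryMaterials", "citationInfo", "pdfUrl",
    "versionHistory", "license", "fundingStatement", "competingInterests",
    "dataAvailability", "articleType", "revisionDate", "articleId",
    "relatedArticles", "error",
]

Record = Dict[str, Any]


def split_records(items: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (ok, failed); a record counts as failed if `error` is non-empty."""
    ok: List[Record] = []
    failed: List[Record] = []
    for item in items:
        (failed if item.get("error") else ok).append(item)
    return ok, failed


def missing_fields(item: Record) -> List[str]:
    """List schema fields (other than `error`) that are absent or empty."""
    return [
        name for name in SCHEMA_FIELDS
        if name != "error" and item.get(name) in (None, "", [])
    ]
```

Running `missing_fields` over the successful records gives a quick view of how complete a dataset is before it goes into a review or a training corpus.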

📦 Sample records


✨ Why choose this Actor

| Feature | Details |
| --- | --- |
| 📊 24 structured fields | Titles, authors, abstracts, full text, DOIs, citations, and more |
| ⚡ Parallel fetching | Collects hundreds of articles per minute |
| 🔍 Flexible search | Search by keyword or use any medRxiv search URL |
| 📅 Sort options | Sort by relevance, newest first, or oldest first |
| 📁 Multiple export formats | JSON, CSV, Excel - ready for any workflow |
| 🔄 Automatic pagination | Handles multi-page results without manual intervention |
| 🏗️ Scale to 1M articles | From 10 articles to a million, same simple setup |

📈 Typical performance: Collects 500+ articles per minute with parallel fetching enabled. A 10,000-article dataset takes roughly 20 minutes.
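
The figures above imply a simple back-of-the-envelope estimate: run time is roughly the article count divided by the throughput. A minimal sketch, assuming the stated 500 articles per minute as a steady rate and ignoring startup time (`estimate_minutes` is a hypothetical helper, not part of the Actor):

```python
import math


def estimate_minutes(n_articles: int, articles_per_minute: float = 500.0) -> int:
    """Rough run-time estimate in whole minutes, rounded up."""
    if articles_per_minute <= 0:
        raise ValueError("articles_per_minute must be positive")
    return math.ceil(n_articles / articles_per_minute)


print(estimate_minutes(10_000))  # 20, matching the figure quoted above
```

Actual throughput depends on the query and on the full-text size of each article, so treat the result as a planning number rather than a guarantee.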


📈 How it compares to alternatives

| Feature | This Actor | Manual Collection | Generic Scrapers |
| --- | --- | --- | --- |
| Structured output with 24 fields | | | Partial |
| Full text extraction | | ✅ (slow) | |
| Automatic pagination | | | Partial |
| Export to CSV/JSON/Excel | | | Partial |
| Scales to 1M+ articles | | | |
| No coding required | | N/A | |
| Scheduled runs | | | Partial |

Built specifically for medRxiv, so every field is mapped correctly and every edge case is handled.


🚀 How to use

  1. Create a free Apify account - Sign up on Apify (includes free credits)
  2. Open the medRxiv Scraper - Navigate to the Actor page and click "Start"
  3. Configure your search - Enter a search query like "bacterial infection" or paste a medRxiv search URL
  4. Set your limits - Choose how many articles to collect (free users: up to 10)
  5. Run and download - Click "Start", wait for completion, then export as JSON, CSV, or Excel

⏱️ First results appear in under 30 seconds. A typical run of 100 articles completes in about 2 minutes.


💼 Business use cases

Academic Research

  • Build systematic review datasets
  • Track citation networks across preprints
  • Monitor publication trends in specific fields
  • Collect training data for NLP models

Pharmaceutical & Biotech

  • Monitor clinical trial preprints
  • Track competitor research activity
  • Identify emerging therapeutic targets
  • Build drug discovery literature databases

Health Policy & Journalism

  • Track public health research trends
  • Monitor pandemic-related publications
  • Build evidence bases for policy analysis
  • Discover breaking research stories early

Data Science & AI

  • Create biomedical text corpora
  • Build knowledge graphs from research data
  • Train classification models on article metadata
  • Automate literature monitoring pipelines

🔌 Automating medRxiv Scraper

Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('parseforge/medrxiv-scraper').call({
    searchQuery: 'bacterial infection',
    orderBy: 'newest',
    maxItems: 100,
});

// Fetch the scraped records from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish
run = client.actor("parseforge/medrxiv-scraper").call(run_input={
    "searchQuery": "bacterial infection",
    "orderBy": "newest",
    "maxItems": 100,
})

# Read the scraped records from the run's default dataset
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(items)
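
Once `items` has been fetched as in the Python example above, array fields such as `authors` or `keywords` need flattening before they fit into a spreadsheet. The following is a sketch using only the standard library. `items_to_csv` is a hypothetical helper, and joining list values with "; " is one arbitrary choice among several.

```python
import csv
import io
from typing import Any, Dict, Iterable, List


def items_to_csv(items: Iterable[Dict[str, Any]], fields: List[str]) -> str:
    """Serialize selected fields to CSV text, joining list values with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for item in items:
        row = {}
        for name in fields:
            value = item.get(name, "")
            if isinstance(value, list):
                # Assumption: list entries are strings or have a usable str().
                value = "; ".join(str(v) for v in value)
            row[name] = "" if value is None else value
        writer.writerow(row)
    return buffer.getvalue()
```

For example, `items_to_csv(items, ["title", "authors", "doi"])` returns text that can be written to a `.csv` file or passed to another tool. The platform's built-in CSV and Excel export remains the simpler route when no custom flattening is needed.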

Schedules: Set up automatic runs on a daily, weekly, or monthly basis using Apify Schedules. Monitor new preprints in your field automatically and get notified when new data is available.
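
With scheduled runs, successive datasets usually overlap, so a monitoring pipeline needs to pick out the preprints it has not seen before. The sketch below deduplicates by the `doi` field. `select_new` is a hypothetical helper, and how the set of seen DOIs is persisted between runs (a file, a database, a key-value store) is left open here.

```python
from typing import Any, Dict, Iterable, List, Set


def select_new(items: Iterable[Dict[str, Any]], seen_dois: Set[str]) -> List[Dict[str, Any]]:
    """Return records whose DOI is not yet in seen_dois, and record them as seen."""
    fresh: List[Dict[str, Any]] = []
    for item in items:
        # DOIs are case-insensitive, so compare them in lowercase.
        doi = (item.get("doi") or "").strip().lower()
        if not doi or doi in seen_dois:
            continue  # skip records without a DOI and ones already processed
        seen_dois.add(doi)
        fresh.append(item)
    return fresh
```

Records without a DOI are skipped in this sketch; a pipeline that cannot afford to lose them could fall back to `articleId` or `url` as the key instead.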


❓ Frequently Asked Questions


🔌 Integrate with any app

  • 🔗 Make (Integromat) - Connect medRxiv data to 1,000+ apps with visual workflows
  • 🔗 Zapier - Trigger actions in other tools when new preprint data is collected
  • 🔗 Slack - Send notifications to Slack channels when new articles match your criteria
  • 🔗 Airbyte - Sync scraped data to your data warehouse or database
  • 🔗 GitHub - Automate research data pipelines with GitHub Actions
  • 🔗 Google Drive - Export results directly to Google Sheets or Drive

| Actor | Description |
| --- | --- |
| 📚 PubMed Citation Scraper | Extract citation data and metadata from PubMed biomedical literature |
| 🔬 bioRxiv Scraper | Collect biology preprint data from bioRxiv, medRxiv's sister server |
| 🧪 ChemRxiv Scraper | Scrape chemistry preprint articles from ChemRxiv |
| 🎓 OpenAlex Scraper | Query 250M+ scholarly records from the OpenAlex open catalog |
| 📖 Open Library Scraper | Extract book metadata and availability from Open Library |

💡 Pro Tip: Combine the medRxiv Scraper with the PubMed Citation Scraper to cross-reference preprints with published peer-reviewed articles.


🆘 Need Help? Open our contact form and we will get back to you within 24 hours. For bug reports, feature requests, or integration help, we are here to assist.


Disclaimer: This Actor is provided as-is, without warranty. It is not affiliated with or endorsed by medRxiv or Cold Spring Harbor Laboratory. Use it responsibly and in compliance with applicable terms of service. The authors are not responsible for how the collected data is used. Always verify data accuracy for critical applications.