medRxiv Scraper
Pricing
Pay per event
Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.
Developer
ParseForge

📚 medRxiv Scraper
🚀 Extract preprint research data from medRxiv in minutes. Filter by search query, sort order, or direct URL. No coding, no API keys required.
🕒 Last updated: 2026-04-16 · 📊 24 fields · 🔄 Runs on Apify cloud or locally · 📁 Export: JSON, CSV, Excel
The medRxiv Scraper collects detailed preprint article data from medRxiv.org, the health sciences preprint server. It extracts 24 structured fields per article, including titles, authors, abstracts, full text, DOIs, publication dates, subject areas, and citation metadata. Whether you need 10 articles for a quick review or 100,000+ records for a large-scale meta-analysis, this tool handles it with parallel fetching and automatic pagination.
Built for researchers conducting systematic reviews, academics tracking emerging health sciences research, pharmaceutical companies monitoring clinical study trends, and data teams building training datasets. The scraper connects directly to medRxiv search, follows pagination automatically, and delivers clean, structured data ready for analysis. No browser required, no rate-limiting headaches, just fast and reliable preprint data collection at any scale.
| Target Audience | Use Cases |
|---|---|
| Academic Researchers | Systematic literature reviews, meta-analyses |
| Pharmaceutical Companies | Clinical trial monitoring, drug research tracking |
| Data Scientists | Training dataset creation, NLP corpus building |
| Science Journalists | Trend tracking, breaking research discovery |
| Health Policy Analysts | Public health research aggregation |
| Biotech Startups | Competitive intelligence, R&D monitoring |
📋 What the medRxiv Scraper does
- 📝 Extracts article titles and abstracts for literature review, keyword analysis, and research organization
- 👥 Collects full author details including names, affiliations, and corresponding author contact information
- 📖 Captures full text content and supplementary materials for deep analysis and text mining
- 🔗 Gathers DOIs, PDF links, and citation metadata for bibliometric analysis and reference management
- 📊 Pulls subject areas and keywords for topic filtering, categorization, and trend detection
- 📅 Tracks publication dates and version history to identify emerging research and monitor updates
The scraper processes medRxiv search results page by page, extracting every available data field from each article listing. It handles pagination automatically, respects server limits, and delivers structured output ready for spreadsheets, databases, or analytics pipelines.
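The page-by-page behavior described above can be pictured with a small loop. This is only a sketch of the general pattern, not the Actor's internal code: `paginate` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for a real request to a results page.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int], list[dict]], max_items: int) -> Iterator[dict]:
    """Yield items page by page until a page is empty or max_items is reached."""
    collected = 0
    page = 0
    while collected < max_items:
        items = fetch_page(page)
        if not items:
            return  # no more result pages
        for item in items:
            yield item
            collected += 1
            if collected >= max_items:
                return  # stop mid-page once the limit is hit
        page += 1


def fake_fetch(page: int) -> list[dict]:
    """Hypothetical stand-in for one results page: 25 items, 10 per page."""
    data = [{"articleId": str(i)} for i in range(25)]
    return data[page * 10:(page + 1) * 10]


print(len(list(paginate(fake_fetch, 12))))   # limit reached on the second page
print(len(list(paginate(fake_fetch, 100))))  # results run out before the limit
```

Passing the fetcher in as a function keeps the stopping logic (empty page or item limit) separate from how a page is actually retrieved.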
💡 Why it matters: medRxiv publishes thousands of health sciences preprints before peer review. Manually collecting this data is slow and error-prone. This scraper automates the entire process, giving you structured, analysis-ready data in minutes instead of days.
🎬 Full Demo
🚧 Coming soon...
⚙️ Input
| Field | Type | Required | Description |
|---|---|---|---|
| startUrl | string | No | medRxiv search URL to scrape. Use for custom searches or specific page ranges. |
| searchQuery | string | No | Search term to find articles (e.g., "bacterial infection"). Ignored when startUrl is provided. |
| orderBy | string | No | Sort order: relevance (best match), oldest, or newest. |
| maxItems | integer | No | Maximum articles to collect. Free users: limited to 10. Paid users: up to 1,000,000. |
Example 1: Search by keyword

    {
      "searchQuery": "COVID-19 vaccine",
      "orderBy": "newest",
      "maxItems": 50
    }

Example 2: Scrape from a direct URL

    {
      "startUrl": "https://www.medrxiv.org/search/asd",
      "maxItems": 100
    }
⚠️ Good to Know: You can use either startUrl or searchQuery, but not both at the same time. If you provide a startUrl, the searchQuery and orderBy fields are ignored. Free users are automatically limited to 10 items per run.
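If you build the input in code, it can help to apply these rules before starting a run. The sketch below is a hypothetical client-side helper (`normalize_input` is not part of the Actor or of the Apify client). It assumes that `orderBy` defaults to relevance, that `maxItems` defaults to the plan limit, and that an input with neither `startUrl` nor `searchQuery` is an error; none of these defaults is stated in the table above.

```python
from typing import Any

ALLOWED_ORDER = {"relevance", "oldest", "newest"}
FREE_PLAN_LIMIT = 10          # free users, per the input table
PAID_PLAN_LIMIT = 1_000_000   # paid users, per the input table


def normalize_input(raw: dict[str, Any], paid: bool = False) -> dict[str, Any]:
    """Return the input as the documented rules suggest it is interpreted."""
    cap = PAID_PLAN_LIMIT if paid else FREE_PLAN_LIMIT
    # Assumption: a missing maxItems falls back to the plan limit.
    max_items = min(int(raw.get("maxItems", cap)), cap)
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")

    if raw.get("startUrl"):
        # startUrl takes precedence: searchQuery and orderBy are ignored.
        return {"startUrl": raw["startUrl"], "maxItems": max_items}

    if not raw.get("searchQuery"):
        # Assumption: one of the two fields is needed for a useful run.
        raise ValueError("provide either startUrl or searchQuery")

    # Assumption: relevance is the default sort order.
    order_by = raw.get("orderBy", "relevance")
    if order_by not in ALLOWED_ORDER:
        raise ValueError(f"orderBy must be one of {sorted(ALLOWED_ORDER)}")
    return {"searchQuery": raw["searchQuery"], "orderBy": order_by, "maxItems": max_items}


print(normalize_input({"searchQuery": "COVID-19 vaccine", "orderBy": "newest", "maxItems": 50}))
```

On a free plan, the call above would come back with `maxItems` reduced to 10, which mirrors the automatic limit described in the note.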
📊 Output
🧾 Schema
| Emoji | Field | Type | Description |
|---|---|---|---|
| 📝 | title | string | Article title |
| 👥 | authors | array | List of author names |
| 📖 | abstract | string | Full article abstract |
| 📄 | fullText | string | Complete article text content |
| 🔗 | doi | string | Digital Object Identifier |
| 🌐 | url | string | Direct article URL |
| 📅 | publicationDate | string | Date the article was posted |
| 📂 | subjectAreas | array | Research subject categories |
| 🔑 | keywords | array | Article keywords |
| 👤 | correspondingAuthor | string | Primary contact author |
| 🏫 | authorAffiliations | array | Author institutional affiliations |
| 📎 | supplementaryMaterials | array | Links to supplementary data |
| 📚 | citationInfo | string | Citation metadata |
| 📄 | pdfUrl | string | Direct PDF download link |
| 🔄 | versionHistory | array | Article revision history |
| ⚖️ | license | string | License information |
| 💰 | fundingStatement | string | Funding disclosure |
| ⚡ | competingInterests | string | Competing interests declaration |
| 📊 | dataAvailability | string | Data availability statement |
| 🏷️ | articleType | string | Type of article |
| 📅 | revisionDate | string | Last revision date |
| 🔢 | articleId | string | Unique article identifier |
| 📋 | relatedArticles | array | Links to related articles |
| ❌ | error | string | Error message if extraction failed |
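Because the schema includes an `error` field, a first step after downloading a dataset is often to separate usable records from failed ones. The following is a minimal sketch that assumes `error` is absent or empty on successful records, which the schema does not state explicitly. `split_by_error` is a hypothetical helper, and the records are invented placeholders, not real scraper output.

```python
from typing import Any


def split_by_error(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate usable records from those with a non-empty `error` field."""
    ok: list[dict[str, Any]] = []
    failed: list[dict[str, Any]] = []
    for item in items:
        (failed if item.get("error") else ok).append(item)
    return ok, failed


# Invented placeholder records for illustration only.
sample = [
    {"title": "Example preprint A", "doi": "10.0000/example-a", "authors": ["A. Author"]},
    {"url": "https://www.medrxiv.org/", "error": "extraction failed"},
]
ok, failed = split_by_error(sample)
print(len(ok), len(failed))
```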
📦 Sample records
✨ Why choose this Actor
| Feature | Details |
|---|---|
| 📊 24 structured fields | Titles, authors, abstracts, full text, DOIs, citations, and more |
| ⚡ Parallel fetching | Collects hundreds of articles per minute |
| 🔍 Flexible search | Search by keyword or use any medRxiv search URL |
| 📅 Sort options | Sort by relevance, newest first, or oldest first |
| 📁 Multiple export formats | JSON, CSV, Excel - ready for any workflow |
| 🔄 Automatic pagination | Handles multi-page results without manual intervention |
| 🏗️ Scale to 1M articles | From 10 articles to a million, same simple setup |
📈 Typical performance: Collects 500+ articles per minute with parallel fetching enabled. A 10,000-article dataset takes roughly 20 minutes.
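The 20-minute figure follows directly from the quoted rate: 10,000 articles at 500 articles per minute. A small sketch for estimating other dataset sizes, assuming throughput stays at the quoted rate for the whole run (real runs may vary, and `estimated_minutes` is a hypothetical helper):

```python
def estimated_minutes(n_articles: int, articles_per_minute: float = 500.0) -> float:
    """Rough run time in minutes at a constant collection rate."""
    if articles_per_minute <= 0:
        raise ValueError("articles_per_minute must be positive")
    return n_articles / articles_per_minute


print(estimated_minutes(10_000))  # 20.0
```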
📈 How it compares to alternatives
| Feature | This Actor | Manual Collection | Generic Scrapers |
|---|---|---|---|
| Structured output with 24 fields | ✅ | ❌ | Partial |
| Full text extraction | ✅ | ✅ (slow) | ❌ |
| Automatic pagination | ✅ | ❌ | Partial |
| Export to CSV/JSON/Excel | ✅ | ❌ | Partial |
| Scales to 1M articles | ✅ | ❌ | ❌ |
| No coding required | ✅ | N/A | ❌ |
| Scheduled runs | ✅ | ❌ | Partial |
Built specifically for medRxiv, so every field is mapped correctly and every edge case is handled.
🚀 How to use
- Create a free Apify account - Sign up (includes free credits)
- Open the medRxiv Scraper - Navigate to the Actor page and click "Start"
- Configure your search - Enter a search query like "bacterial infection" or paste a medRxiv search URL
- Set your limits - Choose how many articles to collect (free users: up to 10)
- Run and download - Click "Start", wait for completion, then export as JSON, CSV, or Excel
⏱️ First results appear in under 30 seconds. A typical run of 100 articles completes in about 2 minutes.
💼 Business use cases
- Academic Research
- Pharmaceutical & Biotech
- Health Policy & Journalism
- Data Science & AI
🔌 Automating medRxiv Scraper
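Node.js

    import { ApifyClient } from 'apify-client';

    // Replace the placeholder with your own Apify API token.
    const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

    // Start the Actor and wait for the run to finish.
    const run = await client.actor("parseforge/medrxiv-scraper").call({
        searchQuery: "bacterial infection",
        orderBy: "newest",
        maxItems: 100,
    });

    // Read the scraped articles from the run's default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items);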
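Python

    from apify_client import ApifyClient

    # Replace the placeholder with your own Apify API token.
    client = ApifyClient("YOUR_API_TOKEN")

    # Start the Actor and wait for the run to finish.
    run = client.actor("parseforge/medrxiv-scraper").call(run_input={
        "searchQuery": "bacterial infection",
        "orderBy": "newest",
        "maxItems": 100,
    })

    # Read the scraped articles from the run's default dataset.
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    print(items)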
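Once the items are loaded in Python, you may want a flat CSV for a spreadsheet. The sketch below uses only the standard library and joins array fields such as `authors` into a single cell. `items_to_csv` is a hypothetical helper, and the `"; "` separator is an arbitrary choice rather than the format of Apify's built-in CSV export.

```python
import csv
import io
from typing import Any


def items_to_csv(items: list[dict[str, Any]], fields: list[str]) -> str:
    """Serialize selected fields to CSV text, flattening list values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for item in items:
        row = {}
        for field in fields:
            value = item.get(field, "")
            # Array fields (authors, keywords, ...) become one "; "-joined cell.
            row[field] = "; ".join(map(str, value)) if isinstance(value, list) else value
        writer.writerow(row)
    return buffer.getvalue()


# Invented placeholder record for illustration only.
demo = [{"title": "Example preprint", "authors": ["A. Author", "B. Author"]}]
print(items_to_csv(demo, ["title", "authors", "doi"]))
```

Fields missing from a record are written as empty cells, so partially extracted records still line up with the header.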
Schedules: Set up automatic runs on a daily, weekly, or monthly basis using Apify Schedules. Monitor new preprints in your field automatically and get notified when new data is available.
❓ Frequently Asked Questions
🔌 Integrate with any app
- 🔗 Make (Integromat) - Connect medRxiv data to 1,000+ apps with visual workflows
- 🔗 Zapier - Trigger actions in other tools when new preprint data is collected
- 🔗 Slack - Send notifications to Slack channels when new articles match your criteria
- 🔗 Airbyte - Sync scraped data to your data warehouse or database
- 🔗 GitHub - Automate research data pipelines with GitHub Actions
- 🔗 Google Drive - Export results directly to Google Sheets or Drive
🔗 Recommended Actors
| Actor | Description |
|---|---|
| 📚 PubMed Citation Scraper | Extract citation data and metadata from PubMed biomedical literature |
| 🔬 bioRxiv Scraper | Collect biology preprint data from bioRxiv, medRxiv's sister server |
| 🧪 ChemRxiv Scraper | Scrape chemistry preprint articles from ChemRxiv |
| 🎓 OpenAlex Scraper | Query 250M+ scholarly records from the OpenAlex open catalog |
| 📖 Open Library Scraper | Extract book metadata and availability from Open Library |
💡 Pro Tip: Combine the medRxiv Scraper with the PubMed Citation Scraper to cross-reference preprints with published peer-reviewed articles.
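One way to attempt such a cross-reference is to match records on a normalized title, since a preprint and its published version normally carry different DOIs. This is a rough sketch under stated assumptions: it assumes the PubMed records also expose a `title` field (not confirmed here), `match_by_title` is a hypothetical helper, and exact title matching will miss papers whose titles changed before publication.

```python
import re
from typing import Any


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def match_by_title(
    preprints: list[dict[str, Any]],
    published: list[dict[str, Any]],
) -> list[tuple[dict[str, Any], dict[str, Any]]]:
    """Pair each preprint with a published record that has the same normalized title."""
    index = {normalize_title(p["title"]): p for p in published if p.get("title")}
    pairs = []
    for pre in preprints:
        key = normalize_title(pre.get("title") or "")
        if key and key in index:
            pairs.append((pre, index[key]))
    return pairs


# Invented placeholder records for illustration only.
pre = [{"title": "An Example Study: Results"}, {"title": "Unmatched preprint"}]
pub = [{"title": "An example study - results"}]
print(len(match_by_title(pre, pub)))
```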
🆘 Need Help? Open our contact form and we will get back to you within 24 hours. For bug reports, feature requests, or integration help, we are here to assist.
Disclaimer: This Actor is provided as-is, without warranty. It is not affiliated with or endorsed by medRxiv or Cold Spring Harbor Laboratory. Use it responsibly and in compliance with applicable terms of service. The authors are not responsible for how the collected data is used. Always verify data accuracy for critical applications.