medRxiv Scraper
Pricing
Pay per event
Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.
Developer: ParseForge
📚 medRxiv Scraper
🚀 Extract preprint research data from medRxiv in minutes. Filter by search query, sort order, or direct URL. No coding, no API keys required.
🕒 Last updated: 2026-04-23 · 📊 24 fields · 🔄 Runs on Apify cloud or locally · 📁 Export: JSON, CSV, Excel
The medRxiv Scraper collects detailed preprint article data from medRxiv.org, the health sciences preprint server. It extracts 24 structured fields per article, including titles, authors, abstracts, full text, DOIs, publication dates, subject areas, and citation metadata. Whether you need 10 articles for a quick review or 100,000+ records for a large-scale meta-analysis, this tool handles it with parallel fetching and automatic pagination.
Built for researchers conducting systematic reviews, academics tracking emerging health sciences research, pharmaceutical companies monitoring clinical study trends, and data teams building training datasets. The scraper connects directly to medRxiv search, follows pagination automatically, and delivers clean, structured data ready for analysis. No browser required, no rate-limiting headaches, just fast and reliable preprint data collection at any scale.
| Target Audience | Use Cases |
|---|---|
| Academic Researchers | Systematic literature reviews, meta-analyses |
| Pharmaceutical Companies | Clinical trial monitoring, drug research tracking |
| Data Scientists | Training dataset creation, NLP corpus building |
| Science Journalists | Trend tracking, breaking research discovery |
| Health Policy Analysts | Public health research aggregation |
| Biotech Startups | Competitive intelligence, R&D monitoring |
📋 What the medRxiv Scraper does
- 📝 Extracts article titles and abstracts for literature review, keyword analysis, and research organization
- 👥 Collects full author details including names, affiliations, and corresponding author contact information
- 📖 Captures full text content and supplementary materials for deep analysis and text mining
- 🔗 Gathers DOIs, PDF links, and citation metadata for bibliometric analysis and reference management
- 📊 Pulls subject areas and keywords for topic filtering, categorization, and trend detection
- 📅 Tracks publication dates and version history to identify emerging research and monitor updates
The scraper processes medRxiv search results page by page, extracting every available data field from each article listing. It handles pagination automatically, respects server limits, and delivers structured output ready for spreadsheets, databases, or analytics pipelines.
💡 Why it matters: medRxiv publishes thousands of health sciences preprints before peer review. Manually collecting this data is slow and error-prone. This scraper automates the entire process, giving you structured, analysis-ready data in minutes instead of days.
🎬 Full Demo
🚧 Coming soon...
⚙️ Input
| Field | Type | Required | Description |
|---|---|---|---|
| startUrl | string | No | medRxiv search URL to scrape. Use for custom searches or specific page ranges. |
| searchQuery | string | No | Search term to find articles (e.g., "bacterial infection"). Cannot be used with startUrl. |
| orderBy | string | No | Sort order: relevance (best match), oldest, or newest. |
| maxItems | integer | No | Maximum articles to collect. Free users: limited to 10. Paid users: up to 1,000,000. |
Example 1: Search by keyword
```json
{
  "searchQuery": "COVID-19 vaccine",
  "orderBy": "newest",
  "maxItems": 50
}
```
Example 2: Scrape from a direct URL
```json
{
  "startUrl": "https://www.medrxiv.org/search/asd",
  "maxItems": 100
}
```
⚠️ Good to Know: You can use either startUrl or searchQuery, but not both at the same time. If you provide a startUrl, the searchQuery and orderBy fields are ignored. Free users are automatically limited to 10 items per run.
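The input rules above can be checked on the client side before a run is started. The following is a minimal sketch, not part of the Actor: the helper name `build_run_input` and its `paid_plan` flag are hypothetical, and the limits are the ones stated in the input table.

```python
from typing import Any, Dict, Optional

ALLOWED_ORDER_BY = {"relevance", "oldest", "newest"}
FREE_MAX_ITEMS = 10          # free-plan cap stated in the input table
PAID_MAX_ITEMS = 1_000_000   # paid-plan cap stated in the input table


def build_run_input(
    search_query: Optional[str] = None,
    start_url: Optional[str] = None,
    order_by: Optional[str] = None,
    max_items: Optional[int] = None,
    paid_plan: bool = False,
) -> Dict[str, Any]:
    """Hypothetical client-side helper that assembles an input object."""
    if search_query and start_url:
        raise ValueError("Use either searchQuery or startUrl, not both.")
    if order_by is not None and order_by not in ALLOWED_ORDER_BY:
        raise ValueError(f"orderBy must be one of {sorted(ALLOWED_ORDER_BY)}.")

    run_input: Dict[str, Any] = {}
    if start_url:
        # searchQuery and orderBy are ignored when startUrl is set, so omit them.
        run_input["startUrl"] = start_url
    else:
        if search_query:
            run_input["searchQuery"] = search_query
        if order_by:
            run_input["orderBy"] = order_by

    if max_items is not None:
        if max_items < 1:
            raise ValueError("maxItems must be a positive integer.")
        cap = PAID_MAX_ITEMS if paid_plan else FREE_MAX_ITEMS
        run_input["maxItems"] = min(max_items, cap)
    return run_input
```

Rejecting the combination up front is stricter than the Actor, which simply ignores searchQuery and orderBy when a startUrl is present. It makes a mistaken input visible before any run is started.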
📊 Output
🧾 Schema
| Emoji | Field | Type | Description |
|---|---|---|---|
| 📝 | title | string | Article title |
| 👥 | authors | array | List of author names |
| 📖 | abstract | string | Full article abstract |
| 📄 | fullText | string | Complete article text content |
| 🔗 | doi | string | Digital Object Identifier |
| 🌐 | url | string | Direct article URL |
| 📅 | publicationDate | string | Date the article was posted |
| 📂 | subjectAreas | array | Research subject categories |
| 🔑 | keywords | array | Article keywords |
| 👤 | correspondingAuthor | string | Primary contact author |
| 🏫 | authorAffiliations | array | Author institutional affiliations |
| 📎 | supplementaryMaterials | array | Links to supplementary data |
| 📚 | citationInfo | string | Citation metadata |
| 📄 | pdfUrl | string | Direct PDF download link |
| 🔄 | versionHistory | array | Article revision history |
| ⚖️ | license | string | License information |
| 💰 | fundingStatement | string | Funding disclosure |
| ⚡ | competingInterests | string | Competing interests declaration |
| 📊 | dataAvailability | string | Data availability statement |
| 🏷️ | articleType | string | Type of article |
| 📅 | revisionDate | string | Last revision date |
| 🔢 | articleId | string | Unique article identifier |
| 📋 | relatedArticles | array | Links to related articles |
| ❌ | error | string | Error message if extraction failed |
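When post-processing a dataset, it can help to separate records whose extraction failed from usable ones. The sketch below assumes that the `error` field is empty or absent on successful records, which the schema implies but does not state outright. The helper names `split_records` and `missing_fields` are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

# The 24 field names listed in the schema table.
EXPECTED_FIELDS = [
    "title", "authors", "abstract", "fullText", "doi", "url",
    "publicationDate", "subjectAreas", "keywords", "correspondingAuthor",
    "authorAffiliations", "supplementaryMaterials", "citationInfo", "pdfUrl",
    "versionHistory", "license", "fundingStatement", "competingInterests",
    "dataAvailability", "articleType", "revisionDate", "articleId",
    "relatedArticles", "error",
]

Record = Dict[str, Any]


def split_records(items: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (usable, failed), where failed records carry a non-empty error."""
    usable: List[Record] = []
    failed: List[Record] = []
    for item in items:
        (failed if item.get("error") else usable).append(item)
    return usable, failed


def missing_fields(item: Record) -> List[str]:
    """List schema fields that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in item]
```

Checking `missing_fields` on a few records is a quick way to confirm which fields a given run actually populated before building analysis on top of them.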
📦 Sample records
✨ Why choose this Actor
| Feature | Details |
|---|---|
| 📊 24 structured fields | Titles, authors, abstracts, full text, DOIs, citations, and more |
| ⚡ Parallel fetching | Collects hundreds of articles per minute |
| 🔍 Flexible search | Search by keyword or use any medRxiv search URL |
| 📅 Sort options | Sort by relevance, newest first, or oldest first |
| 📁 Multiple export formats | JSON, CSV, Excel - ready for any workflow |
| 🔄 Automatic pagination | Handles multi-page results without manual intervention |
| 🏗️ Scale to 1M+ articles | From 10 articles to a million, same simple setup |
📈 Typical performance: Collects 500+ articles per minute with parallel fetching enabled. A 10,000-article dataset takes roughly 20 minutes.
📈 How it compares to alternatives
| Feature | This Actor | Manual Collection | Generic Scrapers |
|---|---|---|---|
| Structured output with 24 fields | ✅ | ❌ | Partial |
| Full text extraction | ✅ | ✅ (slow) | ❌ |
| Automatic pagination | ✅ | ❌ | Partial |
| Export to CSV/JSON/Excel | ✅ | ❌ | Partial |
| Scales to 1M+ articles | ✅ | ❌ | ❌ |
| No coding required | ✅ | N/A | ❌ |
| Scheduled runs | ✅ | ❌ | Partial |
Built specifically for medRxiv, so every field is mapped correctly and every edge case is handled.
🚀 How to use
- Create a free Apify account - Sign up here (includes free credits)
- Open the medRxiv Scraper - Navigate to the Actor page and click "Start"
- Configure your search - Enter a search query like "bacterial infection" or paste a medRxiv search URL
- Set your limits - Choose how many articles to collect (free users: up to 10)
- Run and download - Click "Start", wait for completion, then export as JSON, CSV, or Excel
⏱️ First results appear in under 30 seconds. A typical run of 100 articles completes in about 2 minutes.
💼 Business use cases
- Academic Research
- Pharmaceutical & Biotech
- Health Policy & Journalism
- Data Science & AI
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
❓ Frequently Asked Questions
💳 Do I need a paid Apify plan to run this actor?
No. You can start right now on the free Apify plan, which includes $5 in free monthly credit. That is enough to run this actor several times and explore the output before committing to anything. Paid plans unlock higher limits, more concurrent runs, and larger datasets. Create a free Apify account here to get started.
🚨 What happens if my run fails or returns no results?
Failed runs are not charged. If the source site changes, proxies get rate-limited, or a specific input matches nothing, re-run the actor or open our contact form and we will investigate. You can also check the run log in the Apify console to see why the run stopped.
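If you start runs from code, the same triage can be done on the run object returned by the Apify API. This sketch assumes the run is available as a dictionary with a `status` field (for example `SUCCEEDED`, `FAILED`, `TIMED-OUT`, or `ABORTED`) and that you have already counted the items in its dataset. The function name `describe_run` is hypothetical.

```python
from typing import Any, Dict

# Terminal statuses that mean the run did not complete successfully.
FAILURE_STATUSES = {"FAILED", "TIMED-OUT", "ABORTED"}


def describe_run(run: Dict[str, Any], item_count: int) -> str:
    """Turn a run object and its item count into a short diagnosis."""
    status = run.get("status")
    if status in FAILURE_STATUSES:
        return f"Run ended with status {status}; check the run log."
    if status == "SUCCEEDED" and item_count == 0:
        return "Run succeeded but returned no items; the input may match nothing."
    if status == "SUCCEEDED":
        return f"Run succeeded with {item_count} item(s)."
    return f"Run has not finished yet (status {status})."
```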
📏 How many items can I scrape per run?
Free users are limited to 10 items per run so you can preview the output and confirm the actor works for your use case. Paid users can raise maxItems up to 1,000,000 per run. Upgrade here if you need full scale.
🕒 How fresh is the data?
Every run fetches live data at the moment of execution. There is no cache or delay: the records you get reflect what the source returned at that moment. Schedule the actor to maintain a rolling snapshot of the data you need.
🧑‍💻 Can I call this actor from my own code?
Yes. Apify exposes every actor as a REST endpoint and ships first-class SDKs for Node.js and Python. You can start a run, read the dataset, and handle webhooks from your own app in a few lines. All you need is your Apify API token.
📤 How do I export the data?
Every Apify dataset can be downloaded in one click from the console as CSV, JSON, JSONL, Excel, HTML, XML, or RSS. You can also pull results programmatically via the Apify API or stream them into BigQuery, S3, and other destinations through built-in integrations.
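For programmatic export, the Apify API serves dataset items at `/v2/datasets/{datasetId}/items` and selects the output format with a `format` query parameter. The sketch below only builds that URL. The helper name `dataset_export_url` is hypothetical, the dataset ID is a placeholder, and the API token should be sent separately, for example in an `Authorization: Bearer` header.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

# Format labels from the answer above mapped to the API's format values.
EXPORT_FORMATS = {
    "CSV": "csv", "JSON": "json", "JSONL": "jsonl", "Excel": "xlsx",
    "HTML": "html", "XML": "xml", "RSS": "rss",
}


def dataset_export_url(
    dataset_id: str,
    label: str = "JSON",
    fields: Optional[Sequence[str]] = None,
) -> str:
    """Build a dataset download URL, optionally restricted to some fields."""
    params = {"format": EXPORT_FORMATS[label]}
    if fields:
        params["fields"] = ",".join(fields)
    # Keep commas readable in the query string.
    query = urlencode(params, safe=",")
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"
```

Restricting `fields` to something like `title` and `doi` keeps exports small when the full text is not needed.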
📅 Can I schedule the actor to run automatically?
Yes. Use the Apify scheduler to run the actor on any cadence, from hourly to monthly. Results are saved to your dataset and can be delivered to webhooks, email, Slack, cloud storage, or automation tools such as Zapier and Make.
🔌 Automating medRxiv Scraper
Node.js
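```javascript
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish.
const run = await client.actor("parseforge/medrxiv-scraper").call({
    searchQuery: "bacterial infection",
    orderBy: "newest",
    maxItems: 100
});

// Read the scraped records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```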
Python
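```python
from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("parseforge/medrxiv-scraper").call(run_input={
    "searchQuery": "bacterial infection",
    "orderBy": "newest",
    "maxItems": 100,
})

# Read the scraped records from the run's default dataset.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
print(items)
```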
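Where installing an SDK is not an option, a run can also be started through the REST API with only the standard library. The sketch below builds the request without sending it. It assumes the Actor ID `parseforge/medrxiv-scraper`, written with `~` in place of `/` in the URL path as the Apify API expects, and a token passed as a Bearer header. The function name `build_run_request` is hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict

ACTOR_ID = "parseforge~medrxiv-scraper"  # "username~actor-name" form for URLs
RUNS_URL = f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs"


def build_run_request(token: str, run_input: Dict[str, Any]) -> urllib.request.Request:
    """Prepare a POST request that starts a run with the given input."""
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it (requires network access and a valid token):
# with urllib.request.urlopen(build_run_request("YOUR_API_TOKEN", {...})) as resp:
#     run = json.load(resp)["data"]
```

Starting a run this way returns immediately with the run object, so the caller has to poll the run status or use a webhook before reading the dataset.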
Schedules: Set up automatic runs on a daily, weekly, or monthly basis using Apify Schedules. Monitor new preprints in your field automatically and get notified when new data is available.
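Apify Schedules are configured with cron expressions. As a rough sketch, the cadences mentioned here map to five-field cron strings as follows. The helper name `cron_for` and the default run hour are assumptions chosen for illustration.

```python
def cron_for(cadence: str, hour: int = 6) -> str:
    """Return a five-field cron expression (minute hour day month weekday)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23.")
    expressions = {
        "hourly": "0 * * * *",          # at minute 0 of every hour
        "daily": f"0 {hour} * * *",     # every day at the given hour
        "weekly": f"0 {hour} * * 1",    # every Monday at the given hour
        "monthly": f"0 {hour} 1 * *",   # first day of each month
    }
    if cadence not in expressions:
        raise ValueError(f"Unknown cadence: {cadence}")
    return expressions[cadence]
```

A daily schedule sorted by newest is usually enough to keep up with new preprints on a topic.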
🔌 Integrate with any app
- 🔗 Make (Integromat) - Connect medRxiv data to 1,000+ apps with visual workflows
- 🔗 Zapier - Trigger actions in other tools when new preprint data is collected
- 🔗 Slack - Send notifications to Slack channels when new articles match your criteria
- 🔗 Airbyte - Sync scraped data to your data warehouse or database
- 🔗 GitHub - Automate research data pipelines with GitHub Actions
- 🔗 Google Drive - Export results directly to Google Sheets or Drive
🔗 Recommended Actors
| Actor | Description |
|---|---|
| 📚 PubMed Citation Scraper | Extract citation data and metadata from PubMed biomedical literature |
| 🔬 bioRxiv Scraper | Collect biology preprint data from bioRxiv, medRxiv's sister server |
| 🧪 ChemRxiv Scraper | Scrape chemistry preprint articles from ChemRxiv |
| 🎓 OpenAlex Scraper | Query 250M+ scholarly records from the OpenAlex open catalog |
| 📖 Open Library Scraper | Extract book metadata and availability from Open Library |
💡 Pro Tip: Combine the medRxiv Scraper with the PubMed Citation Scraper to cross-reference preprints with published peer-reviewed articles.
🆘 Need Help? Open our contact form and we will get back to you within 24 hours. For bug reports, feature requests, or integration help, we are here to assist.
Disclaimer: This Actor is provided as-is, without warranty. It is not affiliated with or endorsed by medRxiv or Cold Spring Harbor Laboratory. Use it responsibly and in compliance with applicable terms of service. The authors are not responsible for how the collected data is used. Always verify data accuracy for critical applications.