# bioRxiv & medRxiv Preprint Scraper (`crawlergang/biorxiv-medrxiv-scraper`) Actor

Scrape preprints from bioRxiv and medRxiv, the leading open-access preprint servers for biology and medicine. Search by date range, fetch by DOI, or retrieve published journal version information.

- **URL**: https://apify.com/crawlergang/biorxiv-medrxiv-scraper.md
- **Developed by:** [Crawler Gang](https://apify.com/crawlergang) (community)
- **Categories:** Developer tools, Automation, Other
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 11 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $3.00 / 1,000 results

This Actor is paid per event and usage: you are charged a fixed price for specific events plus Apify platform usage.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## bioRxiv & medRxiv Preprint Scraper

Scrape preprints from **[bioRxiv](https://www.biorxiv.org)** and **[medRxiv](https://www.medrxiv.org)** — the leading open-access preprint servers for biology and medicine — powered by the official [bioRxiv/medRxiv API](https://api.biorxiv.org).

No account, no API key, and no proxy required. Works on the Apify free plan.

---

### What It Does

- **Search by date range** — retrieve all preprints posted within a date window of any length (longer ranges are split automatically into 90-day API chunks and paginated)
- **Fetch by DOI** — look up one or more specific preprints using their DOI
- **Published version info** — check whether a preprint has been published in a journal and retrieve the journal DOI and name
- **Filter by category** — narrow results to a specific scientific field (neuroscience, genomics, immunology, etc.)
- **Both servers** — query bioRxiv, medRxiv, or both simultaneously

---

### Use Cases

- Track new preprints in your research field
- Build a literature monitoring or alerting pipeline
- Analyze publishing trends across biomedical disciplines
- Identify preprints that have been formally published in journals
- Aggregate author/institution data for research network analysis

---

### Input Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `mode` | select | `search` | `search` (date range), `byDoi` (DOI lookup), or `published` (journal version info) |
| `server` | select | `biorxiv` | `biorxiv`, `medrxiv`, or `both` |
| `dateFrom` | date | `2024-01-01` | Start date (YYYY-MM-DD). Required for mode=search |
| `dateTo` | date | `2024-01-07` | End date (YYYY-MM-DD). Required for mode=search |
| `dois` | array | — | One or more DOIs to look up (required for mode=byDoi and mode=published) |
| `category` | select | All | Filter to a specific scientific category (mode=search only) |
| `maxItems` | integer | 50 | Maximum number of records to return (1–10000) |

#### Supported Categories

Neuroscience, Bioinformatics, Genomics, Microbiology, Cell Biology, Biochemistry, Evolutionary Biology, Pharmacology and Toxicology, Immunology, Molecular Biology, Genetics, Cancer Biology, Scientific Communication and Education, Pathology, Systems Biology, Ecology, Physiology, Epidemiology, Developmental Biology, Clinical Trials, Bioengineering, Plant Biology, Zoology, Biophysics, Synthetic Biology.

---

### Output Fields

#### search and byDoi Modes

| Field | Type | Description |
|-------|------|-------------|
| `doi` | string | Preprint DOI |
| `title` | string | Preprint title |
| `authors` | string | All authors as a single string |
| `authorList` | array | Authors as an array of strings |
| `correspondingAuthor` | string | Name of the corresponding author |
| `institution` | string | Corresponding author's institution |
| `submittedDate` | string | Date submitted (YYYY-MM-DD) |
| `version` | integer | Version number of the preprint |
| `type` | string | Preprint type (e.g. "new results") |
| `license` | string | License code (e.g. "cc_by", "cc0") |
| `category` | string | Scientific category |
| `server` | string | Source server (biorxiv or medrxiv) |
| `abstractText` | string | Full abstract text |
| `jatsXmlUrl` | string | URL to the JATS/XML version |
| `previewUrl` | string | URL to view the preprint on biorxiv/medrxiv |
| `isPublished` | boolean | Whether the preprint has a journal publication |
| `publishedDoi` | string | Journal publication DOI (if published) |
| `scrapedAt` | string | Timestamp when the record was scraped (ISO-8601) |

#### published Mode

| Field | Type | Description |
|-------|------|-------------|
| `doi` | string | bioRxiv/medRxiv preprint DOI |
| `title` | string | Preprint title |
| `authors` | string | Authors string |
| `category` | string | Scientific category |
| `server` | string | Preprint server |
| `isPublished` | boolean | Whether a journal publication exists |
| `publishedDoi` | string | Journal publication DOI |
| `publishedJournal` | string | Journal name |
| `publishedDate` | string | Journal publication date |
| `preprintDate` | string | Date originally submitted as preprint |
| `preprintDoi` | string | Original preprint DOI |
| `scrapedAt` | string | Timestamp when the record was scraped |

---

### Sample Output

#### Preprint Record

```json
{
  "doi": "10.1101/2024.01.15.575123",
  "title": "A Study of Neural Circuits in the Hippocampus",
  "authors": "Smith J, Jones A, Brown C",
  "authorList": ["Smith J", "Jones A", "Brown C"],
  "correspondingAuthor": "Smith J",
  "institution": "Harvard University",
  "submittedDate": "2024-01-15",
  "version": 1,
  "type": "new results",
  "license": "cc_by",
  "category": "neuroscience",
  "server": "biorxiv",
  "abstractText": "This paper studies hippocampal circuits...",
  "jatsXmlUrl": "https://www.biorxiv.org/content/10.1101/2024.01.15.575123v1.source.xml",
  "previewUrl": "https://www.biorxiv.org/content/10.1101/2024.01.15.575123",
  "isPublished": false,
  "scrapedAt": "2026-05-23T10:00:00+00:00"
}
````

---

### FAQ

**Does this require an API key or account?**
No. The bioRxiv/medRxiv API is completely public and free. No registration required.

**What is the maximum date range I can query?**
The bioRxiv API returns up to 100 preprints per call with a 90-day window. This scraper automatically splits larger date ranges into 90-day chunks and paginates through all of them.
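
The chunking described above can be sketched roughly as follows. This is an illustration only, not the Actor's actual code: the function name `split_into_windows` and the choice of inclusive, non-overlapping windows are assumptions.

```python
from datetime import date, timedelta


def split_into_windows(date_from: str, date_to: str, max_days: int = 90):
    """Split an inclusive date range into consecutive windows of at most max_days days.

    Hypothetical helper: the Actor's real chunking logic may differ.
    """
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_to)
    if start > end:
        raise ValueError("date_from must not be after date_to")

    windows = []
    while start <= end:
        # Each window spans max_days calendar days, clipped to the overall end date.
        window_end = min(start + timedelta(days=max_days - 1), end)
        windows.append((start.isoformat(), window_end.isoformat()))
        # The next window starts the day after this one ends, so windows never overlap.
        start = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    for window in split_into_windows("2024-01-01", "2024-07-15"):
        print(window)
```

A range shorter than 90 days yields a single window, so the same helper would cover both short and long queries.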

**How do I fetch a specific preprint?**
Use `mode=byDoi` and enter the DOI (e.g. `10.1101/2024.01.01.612345`) in the `dois` field.

**Can I check if preprints have been published?**
Yes — use `mode=published` with a list of DOIs to retrieve journal publication information including the journal name and published DOI.
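
As a sketch of how `published`-mode results might be consumed, the snippet below maps each preprint DOI to its journal DOI and journal name. The field names come from the `published` mode output fields; the helper name `journal_links` and the sample records are made up for illustration.

```python
def journal_links(records):
    """Map preprint DOI -> (journal DOI, journal name) for published records only.

    Hypothetical helper; `records` are dataset items from a mode=published run.
    """
    links = {}
    for record in records:
        # Skip records with no journal version or without a journal DOI.
        if not record.get("isPublished") or not record.get("publishedDoi"):
            continue
        links[record["doi"]] = (record["publishedDoi"], record.get("publishedJournal"))
    return links


if __name__ == "__main__":
    # Made-up sample records, not real preprints.
    sample = [
        {
            "doi": "10.1101/2024.01.01.000001",
            "isPublished": True,
            "publishedDoi": "10.1000/example.123",
            "publishedJournal": "Example Journal",
        },
        {"doi": "10.1101/2024.01.01.000002", "isPublished": False},
    ]
    print(journal_links(sample))
```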

**What categories are available?**
bioRxiv covers biological sciences; medRxiv covers health sciences and clinical research. See the category dropdown in the input form for the full list.

**Can I query both bioRxiv and medRxiv at once?**
Yes — set `server=both` and the scraper will query both servers and combine results.

**Why are some preprints missing fields like `institution` or `abstractText`?**
These fields are only included when the data is available in the API response. Records with missing data will simply omit those fields rather than including null values.
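
Because absent fields are omitted rather than set to null, downstream code should avoid indexing them directly. A minimal sketch, assuming records are plain dictionaries as returned by the Apify client; the helper name `summarize` and the placeholder text are illustrative choices.

```python
def summarize(record, placeholder="n/a"):
    """Build a one-line summary that tolerates omitted optional fields."""
    # dict.get() returns a fallback instead of raising KeyError
    # when the Actor left a field out of the record.
    institution = record.get("institution", placeholder)
    abstract = record.get("abstractText", "")
    return f"{record['doi']} | {institution} | abstract: {len(abstract)} chars"
```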

**How many records can I retrieve per run?**
Up to 10,000 records per run. For larger datasets, split the work across several runs, each covering a narrower, non-overlapping date range.

# Actor input Schema

## `mode` (type: `string`):

What to fetch: search by date range, fetch by DOI, or get published journal version info.

## `server` (type: `string`):

Which preprint server to query.

## `dateFrom` (type: `string`):

Start of date range (YYYY-MM-DD). Required for mode=search.

## `dateTo` (type: `string`):

End of date range (YYYY-MM-DD). Maximum 90 days from dateFrom.

## `dois` (type: `array`):

One or more DOIs to fetch (e.g. 10.1101/2024.01.01.000001).

## `category` (type: `string`):

Filter results to a specific scientific category (mode=search only).

## `maxItems` (type: `integer`):

Maximum number of preprints to return.

## Actor input object example

```json
{
  "mode": "search",
  "server": "biorxiv",
  "dateFrom": "2024-01-01",
  "dateTo": "2024-01-07",
  "category": "",
  "maxItems": 50
}
```
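
The per-mode requirements in the input documentation can be checked before starting a run. The following is a sketch only: `validate_input` is a hypothetical helper, and the Actor may apply its own validation with different error messages.

```python
from datetime import date


def validate_input(run_input):
    """Return a list of problems found in an Actor input dict (empty if none)."""
    problems = []
    mode = run_input.get("mode", "search")

    if mode == "search":
        # Both dates are required in search mode and must parse as ISO dates.
        for key in ("dateFrom", "dateTo"):
            value = run_input.get(key)
            if not value:
                problems.append(f"{key} is required for mode=search")
                continue
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{key} must be a valid ISO date (YYYY-MM-DD)")
    elif mode in ("byDoi", "published"):
        # DOI-based modes need at least one DOI.
        if not run_input.get("dois"):
            problems.append(f"dois is required for mode={mode}")
    else:
        problems.append(f"unknown mode: {mode}")

    # The documented range for maxItems is 1 to 10000, defaulting to 50.
    max_items = run_input.get("maxItems", 50)
    if not 1 <= max_items <= 10000:
        problems.append("maxItems must be between 1 and 10000")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue at once.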

# Actor output Schema

## `preprints` (type: `string`):

Dataset containing all scraped preprint records.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "mode": "search",
    "server": "biorxiv",
    "dateFrom": "2024-01-01",
    "dateTo": "2024-01-07",
    "category": "",
    "maxItems": 50
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlergang/biorxiv-medrxiv-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "mode": "search",
    "server": "biorxiv",
    "dateFrom": "2024-01-01",
    "dateTo": "2024-01-07",
    "category": "",
    "maxItems": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlergang/biorxiv-medrxiv-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "mode": "search",
  "server": "biorxiv",
  "dateFrom": "2024-01-01",
  "dateTo": "2024-01-07",
  "category": "",
  "maxItems": 50
}' |
apify call crawlergang/biorxiv-medrxiv-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlergang/biorxiv-medrxiv-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "bioRxiv & medRxiv Preprint Scraper",
        "description": "Scrape preprints from bioRxiv and medRxiv with the leading open-access preprint servers for biology and medicine. Search by date range, fetch by DOI, or retrieve published journal version information.",
        "version": "1.0",
        "x-build-id": "wMJ34teIlizMi0jQz"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlergang~biorxiv-medrxiv-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlergang-biorxiv-medrxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlergang~biorxiv-medrxiv-scraper/runs": {
            "post": {
                "operationId": "runs-sync-crawlergang-biorxiv-medrxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlergang~biorxiv-medrxiv-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-crawlergang-biorxiv-medrxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "mode"
                ],
                "properties": {
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "search",
                            "byDoi",
                            "published"
                        ],
                        "type": "string",
                        "description": "What to fetch: search by date range, fetch by DOI, or get published journal version info.",
                        "default": "search"
                    },
                    "server": {
                        "title": "Server",
                        "enum": [
                            "biorxiv",
                            "medrxiv",
                            "both"
                        ],
                        "type": "string",
                        "description": "Which preprint server to query.",
                        "default": "biorxiv"
                    },
                    "dateFrom": {
                        "title": "Start date",
                        "type": "string",
                        "description": "Start of date range (YYYY-MM-DD). Required for mode=search."
                    },
                    "dateTo": {
                        "title": "End date",
                        "type": "string",
                        "description": "End of date range (YYYY-MM-DD). Maximum 90 days from dateFrom."
                    },
                    "dois": {
                        "title": "DOIs (mode=byDoi or mode=published)",
                        "type": "array",
                        "description": "One or more DOIs to fetch (e.g. 10.1101/2024.01.01.000001).",
                        "items": {
                            "type": "string"
                        }
                    },
                    "category": {
                        "title": "Category filter",
                        "enum": [
                            "",
                            "neuroscience",
                            "bioinformatics",
                            "genomics",
                            "microbiology",
                            "cell biology",
                            "biochemistry",
                            "evolutionary biology",
                            "pharmacology and toxicology",
                            "immunology",
                            "molecular biology",
                            "genetics",
                            "cancer biology",
                            "scientific communication and education",
                            "pathology",
                            "systems biology",
                            "ecology",
                            "physiology",
                            "epidemiology",
                            "developmental biology",
                            "clinical trials",
                            "bioengineering",
                            "plant biology",
                            "zoology",
                            "biophysics",
                            "synthetic biology"
                        ],
                        "type": "string",
                        "description": "Filter results to a specific scientific category (mode=search only).",
                        "default": ""
                    },
                    "maxItems": {
                        "title": "Max items",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of preprints to return.",
                        "default": 50
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
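
For projects without an Apify client library, the `run-sync-get-dataset-items` endpoint from the specification above can be called directly. This is a minimal sketch using only the Python standard library. The helper names `build_request` and `run_actor` are illustrative, the response is assumed to be in the default JSON format, and error handling and timeouts are left out.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/crawlergang~biorxiv-medrxiv-scraper/run-sync-get-dataset-items"


def build_request(token, run_input):
    """Build a POST request with the token as a query parameter and the input as a JSON body."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{BASE_URL}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token, run_input):
    """Send the request and return the parsed response body (performs a network call)."""
    with urllib.request.urlopen(build_request(token, run_input)) as response:
        return json.load(response)
```

Keeping request construction separate from the network call makes the URL and body easy to inspect without spending any Actor runs.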
