# OSF Preprints Scraper (`crawlerbros/osf-preprints-scraper`) Actor

This Actor extracts preprint metadata from OSF's preprint archive, which hosts over 190,000 open-access scholarly works across disciplines including psychology, medicine, social sciences, engineering, and more. It supports filtering by tags, subjects, and provider, as well as direct ID-based lookup.

- **URL**: https://apify.com/crawlerbros/osf-preprints-scraper.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** Automation, Developer tools, Agents
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $3.00 / 1,000 results

This Actor is paid per event and usage: you are charged a fixed price for specific events plus the cost of Apify platform usage.
This Actor supports Apify Store discounts, so the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
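
As a rough illustration of the per-result arithmetic, the sketch below estimates the event charge at the listed starting price of $3.00 per 1,000 results. The function name is hypothetical, and the flat rate is an assumption: platform usage is billed on top, and plan discounts may lower the rate.

```python
from decimal import Decimal

# Listed starting price: $3.00 per 1,000 results (assumed flat for this sketch).
PRICE_PER_1000_RESULTS = Decimal("3.00")


def estimate_result_charge(num_results: int) -> Decimal:
    """Estimate the per-result event charge in USD, excluding platform usage."""
    if num_results < 0:
        raise ValueError("num_results must be non-negative")
    charge = PRICE_PER_1000_RESULTS * num_results / 1000
    # Round to a tenth of a cent for display.
    return charge.quantize(Decimal("0.001"))
```

For example, `estimate_result_charge(50)` corresponds to the default `maxItems` of 50, and `estimate_result_charge(1000)` to the per-run maximum.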

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) or as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## OSF Preprints Scraper

Scrape preprints from the [Open Science Framework (OSF)](https://osf.io/preprints/) using its public REST API — no authentication or proxy required.

### What It Does

This Actor extracts preprint metadata from OSF's preprint archive, which hosts over 190,000 open-access scholarly works across disciplines including psychology, medicine, social sciences, engineering, and more. It supports filtering by tags, subjects, and provider, as well as direct ID-based lookup.

### Key Features

- **No authentication required** — uses the public OSF API
- **Two modes**: search/browse preprints or fetch specific ones by ID
- **Filter by tags, subjects, or provider** (e.g., PsyArXiv, SocArXiv, EarthArXiv)
- **Pagination handled automatically** — retrieves up to 1,000 records per run
- **Clean structured output** with camelCase field names

### Input Fields

| Field | Type | Description |
|-------|------|-------------|
| `mode` | Select | `searchPreprints` (default) or `getById` |
| `searchQuery` | String | Filter preprints by tag (e.g. `machine learning`) |
| `subjectFilter` | String | Filter by subject text (e.g. `Medicine and Health Sciences`) |
| `provider` | String | Filter by provider (e.g. `psyarxiv`, `socarxiv`, `osf`) |
| `preprintIds` | Array | List of OSF preprint IDs (for `getById` mode) |
| `maxItems` | Integer | Max number of results (1–1000, default 50) |
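
The constraints in the table above can be checked before starting a run. The following is a minimal sketch, not the Actor's own validation: `build_input` is a hypothetical helper, and requiring at least one ID in `getById` mode is an assumption, since the schema does not say how an empty list is handled.

```python
from typing import Any, Optional

VALID_MODES = {"searchPreprints", "getById"}


def build_input(
    mode: str = "searchPreprints",
    search_query: Optional[str] = None,
    subject_filter: Optional[str] = None,
    provider: str = "",
    preprint_ids: Optional[list[str]] = None,
    max_items: int = 50,
) -> dict[str, Any]:
    """Build an Actor input dict, checking the limits listed in the table."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItems must be between 1 and 1000")
    # Assumption: an ID lookup without IDs is treated as a caller error.
    if mode == "getById" and not preprint_ids:
        raise ValueError("getById mode needs at least one preprint ID")

    run_input: dict[str, Any] = {"mode": mode, "maxItems": max_items}
    if mode == "getById":
        run_input["preprintIds"] = list(preprint_ids or [])
    else:
        run_input["provider"] = provider
        # Optional filters are only sent when they are non-empty.
        if search_query:
            run_input["searchQuery"] = search_query
        if subject_filter:
            run_input["subjectFilter"] = subject_filter
    return run_input
```

The returned dict can be passed as the run input to any of the clients shown in the API section.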

#### Provider Examples

Popular OSF preprint providers you can filter by (the full list of accepted values is the `provider` enum in the input schema):

| Provider ID | Description |
|-------------|-------------|
| `osf` | General OSF preprints |
| `psyarxiv` | Psychology |
| `socarxiv` | Social sciences |
| `eartharxiv` | Earth sciences |
| `engrxiv` | Engineering |
| `ecsarxiv` | Electrochemical Society |

### Output Fields

Each item in the dataset contains:

| Field | Type | Description |
|-------|------|-------------|
| `preprintId` | String | Unique OSF preprint ID (e.g. `abc12_v2`) |
| `title` | String | Title of the preprint |
| `description` | String | Abstract or summary |
| `doi` | String | Digital Object Identifier |
| `datePublished` | String | Publication date (ISO 8601) |
| `dateCreated` | String | Creation date (ISO 8601) |
| `dateModified` | String | Last modified date (ISO 8601) |
| `tags` | Array | Author-assigned tags |
| `isPublished` | Boolean | Whether the preprint is publicly published |
| `provider` | String | Provider ID (e.g. `psyarxiv`) |
| `subjects` | Array | Subject classifications |
| `license` | String | License name (e.g. `CC-By Attribution 4.0`) |
| `sourceUrl` | String | Direct URL to the preprint on OSF |
| `recordType` | String | Always `"preprint"` |
| `scrapedAt` | String | Timestamp when the record was scraped |

### Example Output

```json
{
  "preprintId": "snveb_v2",
  "title": "Beyond the Resume: Comparing the Predictive Power of Personality Assessments",
  "description": "This study examines employee turnover prediction using machine learning...",
  "doi": "10.31234/osf.io/snveb_v2",
  "datePublished": "2026-05-26T13:58:36.783000Z",
  "dateCreated": "2026-05-25T09:31:34.214181Z",
  "dateModified": "2026-05-26T13:58:36.814700Z",
  "tags": ["Machine learning", "Employee turnover", "Explainable AI"],
  "isPublished": true,
  "provider": "psyarxiv",
  "subjects": ["Industrial and Organizational Psychology", "Quantitative Methods"],
  "sourceUrl": "https://osf.io/preprints/psyarxiv/snveb_v2/",
  "recordType": "preprint",
  "scrapedAt": "2026-05-30T10:00:00.000000+00:00"
}
````
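
For downstream processing it can help to convert dataset items into typed records. The sketch below assumes the field names from the output table; the `Preprint` class and `parse_timestamp` helper are hypothetical, and treating missing fields as `None` or empty lists is an assumption about how the Actor reports absent values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string; a trailing 'Z' is rewritten for older Pythons."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class Preprint:
    preprint_id: str
    title: str
    doi: Optional[str]
    provider: Optional[str]
    date_published: Optional[datetime]
    tags: list[str]
    subjects: list[str]

    @classmethod
    def from_item(cls, item: dict[str, Any]) -> "Preprint":
        """Map one camelCase dataset item onto snake_case attributes."""
        return cls(
            preprint_id=item["preprintId"],
            title=item.get("title") or "",
            doi=item.get("doi"),
            provider=item.get("provider"),
            date_published=parse_timestamp(item.get("datePublished")),
            tags=list(item.get("tags") or []),
            subjects=list(item.get("subjects") or []),
        )
```

Calling `Preprint.from_item(item)` on the example item above yields a record whose `date_published` is a timezone-aware `datetime`.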

### Use Cases

- **Academic research**: Track preprints in specific fields
- **Literature reviews**: Collect papers by subject or tag for systematic reviews
- **Trend analysis**: Monitor publication rates by subject over time
- **Citation tracking**: Gather DOIs for downstream citation analysis
- **Content aggregation**: Build databases of open-access scholarly works
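
The trend-analysis use case can be sketched as a simple count of preprints per subject and month. This is an illustrative helper (the name `count_by_subject_and_month` is hypothetical) that assumes `datePublished` is an ISO 8601 string and `subjects` is a list of strings, as in the output table.

```python
from collections import Counter
from typing import Any, Iterable


def count_by_subject_and_month(items: Iterable[dict[str, Any]]) -> Counter:
    """Count preprints per (subject, "YYYY-MM") pair."""
    counts: Counter = Counter()
    for item in items:
        published = item.get("datePublished")
        if not published:
            continue  # skip records without a publication date
        month = published[:7]  # "YYYY-MM" prefix of the ISO 8601 string
        for subject in item.get("subjects") or []:
            counts[(subject, month)] += 1
    return counts
```

A preprint with several subjects is counted once under each of them, so totals across subjects can exceed the number of preprints.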

### FAQs

**Q: Does this require an API key?**
A: No. The OSF public API is freely accessible without authentication.

**Q: How many results can I get?**
A: Up to 1,000 per run. OSF has 190,000+ preprints total.

**Q: Can I filter by date?**
A: Not directly via this Actor's inputs. Filter by tag and subject, then filter or sort the results by `datePublished` in post-processing.
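
A minimal post-processing sketch for date filtering is shown below. It assumes `datePublished` is an ISO 8601 string as in the example output; `published_between` is a hypothetical helper, and the half-open range (`start` inclusive, `end` exclusive) is a design choice for this example.

```python
from datetime import datetime, timezone
from typing import Any, Iterable


def _parse(value: str) -> datetime:
    # Rewrite a trailing "Z" so fromisoformat also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def published_between(
    items: Iterable[dict[str, Any]], start: datetime, end: datetime
) -> list[dict[str, Any]]:
    """Keep items with start <= datePublished < end, sorted oldest first."""
    dated = [
        (_parse(item["datePublished"]), item)
        for item in items
        if item.get("datePublished")
    ]
    selected = [(when, item) for when, item in dated if start <= when < end]
    selected.sort(key=lambda pair: pair[0])
    return [item for _, item in selected]
```

Pass timezone-aware bounds, for example `datetime(2026, 5, 1, tzinfo=timezone.utc)`, because the parsed timestamps carry a UTC offset and cannot be compared with naive datetimes.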

**Q: What's the difference between providers?**
A: Different academic communities host preprint servers on OSF (e.g., PsyArXiv for psychology). Using the `provider` filter restricts results to that community.

**Q: Are all preprints peer-reviewed?**
A: No — preprints are pre-peer-review. The `isPublished` field indicates OSF server acceptance, not journal peer review.

**Q: How current is the data?**
A: The OSF API returns live data. New preprints appear within hours of submission.

# Actor input Schema

## `mode` (type: `string`):

Scraping mode: search published preprints or fetch specific preprints by ID.

## `searchQuery` (type: `string`):

Filter preprints by tag (e.g. 'machine learning', 'climate change'). Leave empty to get latest published preprints.

## `subjectFilter` (type: `string`):

Filter by subject text (e.g. 'Medicine and Health Sciences', 'Social and Behavioral Sciences').

## `provider` (type: `string`):

Filter by preprint provider/repository.

## `preprintIds` (type: `array`):

List of OSF preprint IDs to fetch (used in 'Get by ID' mode). Example: `['abc12', 'xyz99_v2']`.

## `maxItems` (type: `integer`):

Maximum number of preprints to return.

## Actor input object example

```json
{
  "mode": "searchPreprints",
  "provider": "",
  "maxItems": 10
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "mode": "searchPreprints",
    "provider": "",
    "maxItems": 10
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/osf-preprints-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "mode": "searchPreprints",
    "provider": "",
    "maxItems": 10,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/osf-preprints-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "mode": "searchPreprints",
  "provider": "",
  "maxItems": 10
}' |
apify call crawlerbros/osf-preprints-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/osf-preprints-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "OSF Preprints Scraper",
        "description": "This actor extracts preprint metadata from OSF's preprint archive, which hosts over 190,000 open-access scholarly works across disciplines including psychology, medicine, social sciences, engineering, and more. It supports filtering by tags, subjects, and provider, as well as direct ID-based lookup.",
        "version": "1.0",
        "x-build-id": "aJr4HH5uA0fdfQT1A"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~osf-preprints-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-osf-preprints-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~osf-preprints-scraper/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-osf-preprints-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~osf-preprints-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-osf-preprints-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "searchPreprints",
                            "getById"
                        ],
                        "type": "string",
                        "description": "Scraping mode: search published preprints or fetch specific preprints by ID.",
                        "default": "searchPreprints"
                    },
                    "searchQuery": {
                        "title": "Search Query (Tag Filter)",
                        "type": "string",
                        "description": "Filter preprints by tag (e.g. 'machine learning', 'climate change'). Leave empty to get latest published preprints."
                    },
                    "subjectFilter": {
                        "title": "Subject Filter",
                        "type": "string",
                        "description": "Filter by subject text (e.g. 'Medicine and Health Sciences', 'Social and Behavioral Sciences')."
                    },
                    "provider": {
                        "title": "Provider",
                        "enum": [
                            "",
                            "osf",
                            "psyarxiv",
                            "socarxiv",
                            "eartharxiv",
                            "ecoevorxiv",
                            "engrxiv",
                            "metaarxiv",
                            "bodoarxiv",
                            "africarxiv",
                            "agrixiv",
                            "arabixiv",
                            "biohackrxiv",
                            "coppreprints",
                            "ecsarxiv",
                            "edarxiv",
                            "focusarchive",
                            "frenxiv",
                            "inarxiv",
                            "indiarxiv",
                            "lawarchive",
                            "lawarxiv",
                            "lissa",
                            "marxiv",
                            "mediarxiv",
                            "mindrxiv",
                            "newaddictionsx",
                            "nutrixiv",
                            "paleorxiv",
                            "sportrxiv",
                            "thesiscommons",
                            "acctrt",
                            "livedata"
                        ],
                        "type": "string",
                        "description": "Filter by preprint provider/repository.",
                        "default": ""
                    },
                    "preprintIds": {
                        "title": "Preprint IDs",
                        "type": "array",
                        "description": "List of OSF preprint IDs to fetch (used in 'Get by ID' mode). Example: ['abc12', 'xyz99_v2']."
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of preprints to return.",
                        "default": 50
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
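
For projects without a client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. The sketch below uses only the Python standard library; the helper names are hypothetical, and the assumption that the response body is a JSON array of dataset items is not stated in the specification, which only documents a `200 OK`.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/crawlerbros~osf-preprints-scraper/run-sync-get-dataset-items"


def build_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request: token as a query parameter, input as the JSON body."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the body (assumed to be a JSON array of items)."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `run_sync("<YOUR_API_TOKEN>", {"mode": "searchPreprints", "maxItems": 10})` blocks until the run finishes, so a generous timeout is advisable for larger `maxItems` values.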
