# ArXiv Paper Scraper (`sheshinmcfly/arxiv-paper-scraper`) Actor

Search and extract scientific papers from ArXiv.org across any field. Returns title, authors, full abstract, PDF link, arXiv ID, categories, and submission date. Ideal for AI research monitoring, RAG pipelines, literature reviews, and academic trend analysis. No API key needed.

- **URL**: https://apify.com/sheshinmcfly/arxiv-paper-scraper.md
- **Developed by:** [Sheshinmcfly](https://apify.com/sheshinmcfly) (community)
- **Categories:** News, AI, Automation
- **Stats:** 2 total users, 0 monthly users, 100.0% runs succeeded, 1 bookmark
- **User rating**: No ratings yet

## Pricing

from $2.00 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## ArXiv Paper Scraper

Search and extract **scientific papers from ArXiv.org** — the largest open-access repository of preprints in physics, mathematics, computer science, AI, and more.

Returns full metadata including title, authors, abstract, categories, submission date, and PDF link. Perfect for AI research pipelines, RAG systems, and academic trend monitoring.

---

### What data does it extract?

| Field | Description | Example |
|---|---|---|
| `arxivId` | ArXiv paper ID | `"2604.18584"` |
| `title` | Full paper title | `"MathNet: a Global Multimodal Benchmark..."` |
| `authors` | List of authors | `["Shaden Alshammari", "Kevin Wen"]` |
| `abstract` | Full abstract text | `"Mathematical problem solving remains..."` |
| `categories` | ArXiv subject tags | `["cs.AI", "cs.LG", "cs.IR"]` |
| `primaryCategory` | Primary category | `"cs.AI"` |
| `submittedDate` | Submission date | `"20 April, 2026"` |
| `comments` | Author comments | `"ICLR 2026; 30 pages"` |
| `journalRef` | Journal reference | `"Proceedings of ICLR, 2026"` |
| `pdfUrl` | Direct PDF link | `"https://arxiv.org/pdf/2604.18584"` |
| `url` | ArXiv abstract page | `"https://arxiv.org/abs/2604.18584"` |
| `query` | Search query used | `"large language models"` |
| `extractedAt` | Extraction timestamp | `"2026-04-21T12:00:00Z"` |

---

### Use cases

- **RAG pipelines**: Feed domain-specific papers into retrieval-augmented AI systems
- **AI research monitoring**: Track the latest publications in LLMs, computer vision, NLP
- **Academic trend analysis**: Identify hot topics and emerging research areas
- **Literature review automation**: Collect papers for a specific topic at scale
- **LLM fine-tuning data**: High-quality scientific text for model training
- **Competitive intelligence**: Monitor what research competitors are publishing

---

### How to use

1. Open the Actor and configure:
   - **Search queries**: One or more search terms (e.g. `"diffusion models"`, `"reinforcement learning"`)
   - **Search field**: All fields, title only, abstract only, or author
   - **Sort by**: Newest first or by relevance
   - **Max results**: Number of papers per query
2. Click **Start**
3. Download results as JSON, CSV, or Excel

---

### Example output (JSON)

```json
{
  "arxivId": "2604.18584",
  "title": "MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval",
  "authors": ["Shaden Alshammari", "Kevin Wen", "Antonio Torralba"],
  "abstract": "Mathematical problem solving remains a challenging test of reasoning...",
  "categories": ["cs.AI", "cs.DL", "cs.IR", "cs.LG"],
  "primaryCategory": "cs.AI",
  "submittedDate": "20 April, 2026",
  "comments": "ICLR 2026; Website: http://mathnet.mit.edu",
  "journalRef": "Proceedings of ICLR, 2026",
  "pdfUrl": "https://arxiv.org/pdf/2604.18584",
  "url": "https://arxiv.org/abs/2604.18584",
  "query": "large language models",
  "extractedAt": "2026-04-21T12:00:00.000Z"
}
````
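
For downstream code it can help to turn each item into a typed record. The following is a minimal sketch, assuming the date layouts match the example above (`"20 April, 2026"` for `submittedDate` and an ISO 8601 UTC timestamp for `extractedAt`). The `Paper` class and `parse_paper` helper are hypothetical names used for illustration and are not part of the Actor.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class Paper:
    """Hypothetical typed view of one dataset item."""
    arxiv_id: str
    title: str
    authors: List[str]
    primary_category: str
    submitted: Optional[date]
    extracted_at: Optional[datetime]
    pdf_url: str


def parse_paper(item: dict) -> Paper:
    """Convert one raw item into a Paper, tolerating missing dates."""
    submitted = None
    if item.get("submittedDate"):
        # Assumes the "20 April, 2026" layout shown in the example output.
        submitted = datetime.strptime(item["submittedDate"], "%d %B, %Y").date()

    extracted_at = None
    if item.get("extractedAt"):
        # Swap the trailing "Z" so fromisoformat also works before Python 3.11.
        extracted_at = datetime.fromisoformat(item["extractedAt"].replace("Z", "+00:00"))

    return Paper(
        arxiv_id=item["arxivId"],
        title=item["title"],
        authors=list(item.get("authors", [])),
        primary_category=item.get("primaryCategory", ""),
        submitted=submitted,
        extracted_at=extracted_at,
        pdf_url=item.get("pdfUrl", ""),
    )
```

If the Actor ever emits a different date layout, `strptime` raises `ValueError`, which makes the mismatch visible instead of silently storing a wrong date.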

---

### Pricing

This Actor charges **$0.002 USD per paper extracted**. Extracting 100 papers costs approximately $0.20 USD.
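
As a rough budgeting aid, the arithmetic can be sketched as below. This assumes the per-paper event is the only charge and that every query returns its full quota, so the result is an upper bound. `estimate_max_cost` is a hypothetical helper, not something the Actor provides.

```python
from decimal import Decimal

# Per-paper price taken from the pricing note above.
PRICE_PER_PAPER_USD = Decimal("0.002")


def estimate_max_cost(num_queries: int, max_results_per_query: int) -> Decimal:
    """Upper-bound cost in USD if every query returns its full quota of papers."""
    return PRICE_PER_PAPER_USD * num_queries * max_results_per_query


print(estimate_max_cost(1, 100))  # 0.200
```

`Decimal` is used instead of `float` so that small per-item prices add up without rounding noise.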

---

### Keywords

arxiv scraper, scientific paper extractor, research paper scraper, arxiv API, AI paper scraper, academic data extractor, preprint scraper, NLP research data, LLM training data, arxiv search scraper

---

### Legal Disclaimer

This Actor extracts **publicly available open-access data only** from ArXiv.org, in compliance with Chilean Law 19.628 on the Protection of Private Life (*Ley 19.628 sobre Protección de la Vida Privada*).

ArXiv is an open-access repository operated by Cornell University. All papers and metadata extracted are freely and publicly accessible without authentication.

**What this Actor does NOT collect:**

- Names or personal data of any private individuals
- User accounts, submission portals, or private information
- Any data not freely visible to anonymous visitors

**What this Actor collects:**

- Paper titles, abstracts, and author names (public academic data)
- Subject categories and submission dates
- Public URLs and PDF links

Users are solely responsible for ensuring their use of this data complies with applicable laws and ArXiv's terms of use.

# Actor input Schema

## `queries` (type: `array`):

List of search terms to look up on ArXiv (e.g. 'large language models', 'quantum computing').

## `searchType` (type: `string`):

Which field to search in.

## `sortBy` (type: `string`):

How to order the results.

## `maxResultsPerQuery` (type: `integer`):

Maximum number of papers to extract per search query.

## `proxyConfiguration` (type: `object`):

Datacenter proxies are free and work for most sites. Switch to Residential if you get blocked.

## Actor input object example

```json
{
  "queries": [
    "large language models"
  ],
  "searchType": "all",
  "sortBy": "-announced_date_first",
  "maxResultsPerQuery": 50,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
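
Before starting a run, the input can be checked locally against the constraints listed in the OpenAPI specification further down (the `searchType` and `sortBy` enums, and the 1 to 500 range for `maxResultsPerQuery`). The sketch below is one way to do that; `build_input` is a hypothetical helper, and the Actor itself may validate differently.

```python
# Allowed values copied from the input schema in the OpenAPI specification.
ALLOWED_SEARCH_TYPES = {"all", "title", "abstract", "author"}
ALLOWED_SORT_BY = {"-announced_date_first", "-submitted_date", ""}


def build_input(queries, search_type="all", sort_by="-announced_date_first",
                max_results_per_query=50):
    """Return an input dict for the Actor, raising ValueError on invalid values."""
    queries = [q.strip() for q in queries if q and q.strip()]
    if not queries:
        raise ValueError("at least one non-empty query is required")
    if search_type not in ALLOWED_SEARCH_TYPES:
        raise ValueError(f"searchType must be one of {sorted(ALLOWED_SEARCH_TYPES)}")
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if not 1 <= max_results_per_query <= 500:
        raise ValueError("maxResultsPerQuery must be between 1 and 500")
    return {
        "queries": queries,
        "searchType": search_type,
        "sortBy": sort_by,
        "maxResultsPerQuery": max_results_per_query,
        "proxyConfiguration": {"useApifyProxy": True},
    }
```

Rejecting an empty query list is an extra check added here for convenience; the schema itself only declares a default for `queries`.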

# Actor output Schema

## `dataset` (type: `string`):

All scraped research papers stored in the default dataset.
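
For the RAG use case mentioned in the README, dataset items usually need to be reshaped into whatever record format the retrieval store expects. The sketch below shows one possible shape; the `id` / `text` / `metadata` layout and the `to_rag_document` name are assumptions for illustration, not a format defined by the Actor or by any particular vector store.

```python
def to_rag_document(item: dict) -> dict:
    """Build a hypothetical {id, text, metadata} record from one dataset item."""
    # Title and abstract together form the text to be embedded.
    text = f"{item['title']}\n\n{item['abstract']}"
    return {
        "id": item["arxivId"],
        "text": text,
        "metadata": {
            "authors": item.get("authors", []),
            "categories": item.get("categories", []),
            "url": item.get("url"),
        },
    }
```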

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

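// Prepare Actor input (mirrors the input object example above)
const input = {
    "queries": ["large language models"],
    "searchType": "all",
    "sortBy": "-announced_date_first",
    "maxResultsPerQuery": 50,
    "proxyConfiguration": {
        "useApifyProxy": true
    }
};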

// Run the Actor and wait for it to finish
const run = await client.actor("sheshinmcfly/arxiv-paper-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

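# Prepare the Actor input (mirrors the input object example above)
run_input = {
    "queries": ["large language models"],
    "searchType": "all",
    "sortBy": "-announced_date_first",
    "maxResultsPerQuery": 50,
    "proxyConfiguration": {"useApifyProxy": True},
}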

# Run the Actor and wait for it to finish
run = client.actor("sheshinmcfly/arxiv-paper-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
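
When several queries are passed in one run, the same paper could plausibly appear under more than one query. The source does not say whether the Actor deduplicates, so the sketch below assumes it does not and filters on `arxivId` client-side. `dedupe_by_arxiv_id` is a hypothetical helper.

```python
def dedupe_by_arxiv_id(items):
    """Keep the first item seen for each arxivId, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        key = item.get("arxivId")
        if key is None:
            # Items without an ID cannot be compared, so keep them all.
            unique.append(item)
            continue
        if key in seen:
            continue
        seen.add(key)
        unique.append(item)
    return unique
```

It can be applied to the iterator from the example above, for instance `dedupe_by_arxiv_id(client.dataset(run["defaultDatasetId"]).iterate_items())`.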

## CLI example

```bash
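echo '{
  "queries": ["large language models"],
  "searchType": "all",
  "sortBy": "-announced_date_first",
  "maxResultsPerQuery": 50,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}' |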
apify call sheshinmcfly/arxiv-paper-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=sheshinmcfly/arxiv-paper-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "ArXiv Paper Scraper",
        "description": "Search and extract scientific papers from ArXiv.org across any field. Returns title, authors, full abstract, PDF link, arXiv ID, categories, and submission date. Ideal for AI research monitoring, RAG pipelines, literature reviews, and academic trend analysis. No API key needed.",
        "version": "1.0",
        "x-build-id": "3IOP7YqIcjobAfGbh"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/sheshinmcfly~arxiv-paper-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-sheshinmcfly-arxiv-paper-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/sheshinmcfly~arxiv-paper-scraper/runs": {
            "post": {
                "operationId": "runs-sync-sheshinmcfly-arxiv-paper-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/sheshinmcfly~arxiv-paper-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-sheshinmcfly-arxiv-paper-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "queries": {
                        "title": "Search queries",
                        "type": "array",
                        "description": "List of search terms to look up on ArXiv (e.g. 'large language models', 'quantum computing').",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "large language models"
                        ]
                    },
                    "searchType": {
                        "title": "Search field",
                        "enum": [
                            "all",
                            "title",
                            "abstract",
                            "author"
                        ],
                        "type": "string",
                        "description": "Which field to search in.",
                        "default": "all"
                    },
                    "sortBy": {
                        "title": "Sort by",
                        "enum": [
                            "-announced_date_first",
                            "-submitted_date",
                            ""
                        ],
                        "type": "string",
                        "description": "How to order the results.",
                        "default": "-announced_date_first"
                    },
                    "maxResultsPerQuery": {
                        "title": "Max results per query",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Maximum number of papers to extract per search query.",
                        "default": 50
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Datacenter proxies are free and work for most sites. Switch to Residential if you get blocked."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
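
For projects that do not use the client library, the `run-sync-get-dataset-items` endpoint from the specification above can be called with only the Python standard library. This is a minimal sketch, assuming the response body is a JSON array of dataset items (the specification only declares a `200` response without a schema). `build_request` and `run_sync` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Server URL and path copied from the OpenAPI specification above.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/sheshinmcfly~arxiv-paper-scraper/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request described in the spec."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and return the parsed response body.

    Assumes the body is a JSON array of dataset items.
    """
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the URL and payload easy to inspect or log (with the token redacted) before any paid run is triggered.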
