# arXiv Paper Scraper — Abstracts, Authors & Metadata (`logiover/arxiv-paper-scraper`) Actor

Scrape research paper metadata from arXiv.org, the world's largest open-access repository. Search by keyword across computer science, physics, mathematics, and biology. Returns titles, abstracts, authors, categories, PDF links, and DOIs. No API key required.

- **URL**: https://apify.com/logiover/arxiv-paper-scraper.md
- **Developed by:** [Logiover](https://apify.com/logiover) (community)
- **Categories:** Developer tools, Business
- **Stats:** 2 total users, 1 monthly users, 75.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $3.50 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 📄 arXiv Paper Scraper — Research Metadata, Abstracts & Author Data

Scrape research paper metadata from **arXiv.org**, the world's largest open-access research repository, with over 2.5 million scholarly articles across physics, computer science, mathematics, biology, economics, and more. This Actor queries the **public arXiv API** (no API key required) and returns structured paper data including titles, abstracts, authors, categories, publication dates, PDF links, and DOIs.

### 🎓 Why arXiv Data Is Valuable

arXiv is the primary preprint server for cutting-edge research. Many major AI breakthroughs, from transformers to diffusion models, first appeared on arXiv. Researchers, companies, universities, VCs, and journalists track arXiv to stay ahead of scientific developments.

**What you can extract per paper:**
- **arXiv ID** and direct links to abstract page and PDF
- **Title** and **abstract** (full text)
- **Author list** with all co-authors
- **Categories** (e.g., cs.AI, cs.CL, stat.ML) and primary category
- **Publication date** (original submission) and last updated date
- **DOI** (Digital Object Identifier) and journal reference if published
- **Author comments** (implementation notes, accepted venues, code links)

### 📊 Output Fields

| Field | Description |
|-------|-------------|
| `arxivId` | Unique arXiv paper identifier (e.g., 2401.12345) |
| `title` | Paper title |
| `authors` | Comma-separated author names |
| `abstract` | Full paper abstract text |
| `categories` | All arXiv category codes |
| `primaryCategory` | Primary category |
| `publishedDate` | Original submission date |
| `updatedDate` | Last update date |
| `pdfUrl` | Direct PDF download link |
| `arxivUrl` | Abstract page URL |
| `comment` | Author comments |
| `journalRef` | Journal reference |
| `doi` | Digital Object Identifier |
| `searchQuery` | The query that found this paper |
| `scrapedAt` | Timestamp of when the record was scraped |
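
Since `authors` arrives as a single comma-separated string, downstream code usually wants it as a list. The sketch below shows one way to normalize a dataset item. It assumes `categories` uses the same comma separator, which the table does not state, and the sample values are made up for illustration.

```python
def split_list_field(value):
    """Split a comma-separated string into a list of trimmed, non-empty parts."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def normalize_record(record):
    """Return a copy of a dataset item with list-valued authors and categories."""
    normalized = dict(record)
    normalized["authors"] = split_list_field(record.get("authors"))
    # Assumption: categories are comma-separated, like authors.
    normalized["categories"] = split_list_field(record.get("categories"))
    return normalized


# Hypothetical item shaped like the fields above (values are illustrative only).
sample = {
    "arxivId": "2401.12345",
    "title": "An Example Paper",
    "authors": "Ada Lovelace, Alan Turing",
    "categories": "cs.AI, cs.CL",
}
paper = normalize_record(sample)
```

Author names that themselves contain a comma would be split incorrectly by this approach, so treat the result as a convenience rather than a guaranteed parse.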

### ⚙️ Input Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `searchQueries` | array | `["machine learning"]` | Search terms — paper titles, author names, keywords |
| `categories` | array | `[]` | arXiv category filters (cs.AI, stat.ML, etc.) |
| `maxResults` | integer | 200 | Max papers to return, from 1 to 1000 (the input schema defines this as the total across all search queries) |
| `sortBy` | enum | relevance | relevance / lastUpdatedDate / submittedDate |
| `dateFrom` | string | — | Filter papers after date (YYYY-MM-DD) |
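
It can help to validate input before starting a run, so that a typo does not cost a failed run. The following is a minimal sketch of such a check. `build_input` is a hypothetical helper, not part of any Apify library; the 1 to 1000 range and the required `searchQueries` come from the OpenAPI specification at the end of this page.

```python
from datetime import datetime

ALLOWED_SORT_ORDERS = ("relevance", "lastUpdatedDate", "submittedDate")


def build_input(search_queries, categories=(), max_results=200,
                sort_by="relevance", date_from=""):
    """Check arguments against the documented constraints and return the input object."""
    if not search_queries:
        raise ValueError("searchQueries needs at least one search term")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    if sort_by not in ALLOWED_SORT_ORDERS:
        raise ValueError(f"sortBy must be one of {ALLOWED_SORT_ORDERS}")
    if date_from:
        # Raises ValueError when the string is not a valid year-month-day date.
        datetime.strptime(date_from, "%Y-%m-%d")
    return {
        "searchQueries": list(search_queries),
        "categories": list(categories),
        "maxResults": max_results,
        "sortBy": sort_by,
        "dateFrom": date_from,
    }
```

The returned dictionary can be passed as the run input in the Python example in the API section.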

### 🎯 Use Cases

#### AI & Tech Industry Intelligence
Venture capital firms and corporate strategy teams scrape arXiv daily to identify emerging technologies, track competitor research output, and discover promising startups before they raise funding. The people publishing breakthrough papers today are founding the unicorns of tomorrow.

#### Academic Literature Reviews
PhD students and researchers search across thousands of papers, filter by date and category, and export structured metadata for systematic literature reviews. No more manual copy-pasting from the arXiv website.

#### Recruitment & Talent Sourcing
Recruiters and engineering leaders search arXiv for authors publishing in specific domains. Every paper author is a potential candidate — arXiv gives you their name, research area, and publication track record.

#### Dataset Building for NLP/ML
Machine learning teams build training datasets from arXiv abstracts and titles. The clean XML API response makes it ideal for text classification, topic modeling, and citation graph construction.

#### Competitive Research Monitoring
Companies monitor arXiv categories relevant to their industry and get alerts when competitors or key researchers publish new work. Stay ahead of your competitors' R&D pipeline.

### 💰 Pricing

Pay per event — charged per API request to arXiv. Each query returns up to 100 papers in a single API call. A run with 5 search queries and 200 results each costs approximately $0.05–0.15 in compute units. The arXiv API is free and has no rate limiting beyond a polite delay between requests.
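
As a rough worked example of the request arithmetic above, the sketch below counts arXiv API calls for a run. It assumes each query is paged independently at 100 papers per call, as this paragraph describes, and that every query actually has enough matching papers. It does not try to convert requests into a price.

```python
import math

PAPERS_PER_REQUEST = 100  # page size stated in the paragraph above


def estimate_requests(num_queries, results_per_query, page_size=PAPERS_PER_REQUEST):
    """Estimate API calls, assuming each query is paged on its own."""
    return num_queries * math.ceil(results_per_query / page_size)


# 5 queries with 200 results each: 2 calls per query, 10 calls in total.
calls = estimate_requests(5, 200)
```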

### 🚀 Tips

- **Use specific categories** (e.g., `cs.CL` for NLP, `cs.CV` for computer vision) for more targeted results
- **Combine queries**: run 5–10 related queries to build a comprehensive dataset
- **Filter by date**: use `dateFrom` to only get papers from the last month or year
- **Sort by lastUpdatedDate** to find recently revised papers with new results
- **arXiv rate limit**: the API asks for polite delays between requests (this Actor includes built-in delays)
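
When you combine several related queries, the same paper may plausibly show up under more than one `searchQuery`. A minimal sketch of merging such duplicates by `arxivId` follows. `matchedQueries` is a hypothetical field added here for illustration; it is not part of the Actor's output, and the sketch assumes a paper carries the same `arxivId` string in every result.

```python
def dedupe_by_arxiv_id(items):
    """Keep the first item per arxivId and record every query that matched it."""
    merged = {}
    for item in items:
        key = item.get("arxivId")
        if key is None:
            continue  # skip items without an identifier
        if key not in merged:
            entry = dict(item)  # copy so the input items stay untouched
            entry["matchedQueries"] = []  # hypothetical field, not in the Actor output
            merged[key] = entry
        query = item.get("searchQuery")
        if query and query not in merged[key]["matchedQueries"]:
            merged[key]["matchedQueries"].append(query)
    return list(merged.values())
```

Items come back in the order their `arxivId` was first seen, so a relevance or date ordering from the first query is preserved.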

### ❓ FAQ

**Q: Is this the official arXiv API?**
A: This Actor queries arXiv's official public API at export.arxiv.org. No API key or authentication is required, and there is no rate limiting beyond polite use.

**Q: Can I download the actual PDFs?**
A: The Actor returns a PDF URL for each paper, which you can download separately. Full-text extraction from PDFs is not included.
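
For the separate download step, a small standard-library script is usually enough. The sketch below is one possible approach, not something the Actor ships: `pdf_filename` and `download_pdfs` are hypothetical helpers, and the 3-second pause is an arbitrary polite default rather than a documented requirement.

```python
import time
import urllib.request
from pathlib import Path


def pdf_filename(record):
    """Derive a file name from arxivId; old-style IDs contain a slash."""
    return record["arxivId"].replace("/", "_") + ".pdf"


def download_pdfs(records, out_dir="pdfs", delay_seconds=3.0):
    """Download each record's pdfUrl into out_dir, pausing between requests."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for record in records:
        url = record.get("pdfUrl")
        if not url or not record.get("arxivId"):
            continue  # nothing to download or no way to name the file
        path = target / pdf_filename(record)
        urllib.request.urlretrieve(url, path)
        saved.append(path)
        time.sleep(delay_seconds)  # arbitrary polite pause between downloads
    return saved
```

Calling `download_pdfs(items)` with the dataset items from a run writes one file per paper and returns the saved paths.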

**Q: How many papers can I get in one run?**
A: `maxResults` accepts values up to 1000, and the input schema describes it as the total across all search queries. There is no limit on the number of queries.

**Keywords:** arxiv scraper, research paper api, academic paper data, arxiv metadata extractor, scientific paper scraper, arxiv abstract api, machine learning papers dataset, arxiv search tool, research literature mining, preprint server data, arxiv paper downloader, scholarly article scraper, cs papers data, arxiv api without key, academic research tool

#### How do I build a dataset of recent AI papers from arXiv?

Set `searchQueries` to your AI topics, add the `cs.AI` or `cs.LG` categories, sort by `submittedDate`, and use `dateFrom` to keep only recent preprints in your export.
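
As a sketch of that setup, the helper below builds an input object whose `dateFrom` is a fixed number of days in the past. `recent_ai_input` and its 30-day default are illustrative choices, not part of the Actor.

```python
from datetime import date, timedelta


def recent_ai_input(topics, days=30, today=None):
    """Build an input that keeps only AI preprints from the last `days` days."""
    today = today or date.today()
    return {
        "searchQueries": list(topics),
        "categories": ["cs.AI", "cs.LG"],
        "sortBy": "submittedDate",
        # dateFrom uses the YYYY-MM-DD format from the input parameters.
        "dateFrom": (today - timedelta(days=days)).isoformat(),
    }
```

Passing `today` explicitly makes the result reproducible, which is handy for scheduled runs and tests.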

#### Can I search arXiv by author name?

Yes. Put the author name in `searchQueries`, and the scraper returns matching papers (up to `maxResults`) with the full co-author list, abstract, and PDF link.
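
If a name query also returns papers that merely mention the name, you can narrow the results afterwards. This is a minimal sketch, assuming `authors` is the comma-separated string described in the output fields; `papers_by_author` is a hypothetical helper, and it matches full names only.

```python
def papers_by_author(items, author_name):
    """Return items whose authors field lists author_name exactly (case-insensitive)."""
    wanted = author_name.strip().casefold()
    matches = []
    for item in items:
        names = [name.strip().casefold() for name in (item.get("authors") or "").split(",")]
        if wanted in names:
            matches.append(item)
    return matches
```

Name variants such as initials or middle names will not match, so a looser comparison may suit some datasets better.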

### 📝 Changelog

#### 2026-07-01

- Maintenance pass: re-verified end-to-end on live data and confirmed successful runs within the 5-minute quality window on the default input.
- Sharpened Store metadata (SEO title & description) and expanded the FAQ with high-intent, long-tail questions for easier discovery in Google and Apify Store search.
- Added ready-to-run example tasks that cover common real-world use cases.

# Actor input Schema

## `searchQueries` (type: `array`):

Search terms to query on arXiv. Use author names, paper titles, keywords, or topic areas like 'machine learning', 'quantum computing', 'neural networks'.

## `categories` (type: `array`):

arXiv category codes like cs.AI, cs.CL, cs.CV, stat.ML, physics.optics, q-bio.GN. Leave empty to search all categories.

## `maxResults` (type: `integer`):

Maximum total number of papers to return across all search queries.

## `sortBy` (type: `string`):

Sort order for search results.

## `dateFrom` (type: `string`):

Filter papers submitted after this date (YYYY-MM-DD). Leave empty for no date filter.

## `proxyConfiguration` (type: `object`):

arXiv is generally proxy-friendly. Datacenter proxies work fine.

## Actor input object example

```json
{
  "searchQueries": [
    "machine learning",
    "large language models",
    "computer vision"
  ],
  "categories": [
    "cs.AI",
    "cs.CL",
    "stat.ML"
  ],
  "maxResults": 200,
  "sortBy": "relevance",
  "dateFrom": "",
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
````

# Actor output Schema

## `arxivId` (type: `string`):

arXiv paper identifier

## `title` (type: `string`):

Paper title

## `authors` (type: `string`):

Comma-separated author names

## `abstract` (type: `string`):

Paper abstract text

## `categories` (type: `string`):

arXiv category codes

## `primaryCategory` (type: `string`):

Primary arXiv category

## `publishedDate` (type: `string`):

Original submission date

## `updatedDate` (type: `string`):

Last updated date

## `pdfUrl` (type: `string`):

Direct PDF link

## `arxivUrl` (type: `string`):

Abstract page URL

## `comment` (type: `string`):

Author comments

## `journalRef` (type: `string`):

Journal reference if published

## `doi` (type: `string`):

Digital Object Identifier

## `searchQuery` (type: `string`):

The query that found this paper

## `scrapedAt` (type: `string`):

Timestamp of when the record was scraped

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "searchQueries": [
        "machine learning",
        "large language models",
        "computer vision"
    ],
    "categories": [
        "cs.AI",
        "cs.CL",
        "stat.ML"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("logiover/arxiv-paper-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "searchQueries": [
        "machine learning",
        "large language models",
        "computer vision",
    ],
    "categories": [
        "cs.AI",
        "cs.CL",
        "stat.ML",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("logiover/arxiv-paper-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "searchQueries": [
    "machine learning",
    "large language models",
    "computer vision"
  ],
  "categories": [
    "cs.AI",
    "cs.CL",
    "stat.ML"
  ]
}' |
apify call logiover/arxiv-paper-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=logiover/arxiv-paper-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "arXiv Paper Scraper — Abstracts, Authors & Metadata",
        "description": "Scrape research paper metadata from arXiv.org the worlds largest open-access repository. Search by keyword across computer science physics mathematics biology. Returns titles abstracts authors categories PDF links and DOIs. No API key required.",
        "version": "1.0",
        "x-build-id": "JDmB3pgvyMkPprJY4"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/logiover~arxiv-paper-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-logiover-arxiv-paper-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/logiover~arxiv-paper-scraper/runs": {
            "post": {
                "operationId": "runs-sync-logiover-arxiv-paper-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/logiover~arxiv-paper-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-logiover-arxiv-paper-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "searchQueries"
                ],
                "properties": {
                    "searchQueries": {
                        "title": "Search Queries",
                        "type": "array",
                        "description": "Search terms to query on arXiv. Use author names, paper titles, keywords, or topic areas like 'machine learning', 'quantum computing', 'neural networks'.",
                        "default": [
                            "machine learning"
                        ],
                        "items": {
                            "type": "string"
                        }
                    },
                    "categories": {
                        "title": "arXiv Categories",
                        "type": "array",
                        "description": "arXiv category codes like cs.AI, cs.CL, cs.CV, stat.ML, physics.optics, q-bio.GN. Leave empty to search all categories.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxResults": {
                        "title": "Max Results",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum total number of papers to return across all search queries.",
                        "default": 200
                    },
                    "sortBy": {
                        "title": "Sort By",
                        "enum": [
                            "relevance",
                            "lastUpdatedDate",
                            "submittedDate"
                        ],
                        "type": "string",
                        "description": "Sort order for search results.",
                        "default": "relevance"
                    },
                    "dateFrom": {
                        "title": "Date From",
                        "type": "string",
                        "description": "Filter papers submitted after this date (YYYY-MM-DD). Leave empty for no date filter.",
                        "default": ""
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "arXiv is generally proxy-friendly. Datacenter proxies work fine.",
                        "default": {
                            "useApifyProxy": true
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
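
If you prefer not to install a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library. The sketch below is one way to do it under a few assumptions: the helper names are hypothetical, the 300-second timeout is arbitrary, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
RUN_SYNC_PATH = "/acts/logiover~arxiv-paper-scraper/run-sync-get-dataset-items"


def build_run_request(token, actor_input):
    """Build the POST request described by the OpenAPI specification."""
    query = urllib.parse.urlencode({"token": token})
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}{RUN_SYNC_PATH}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch(token, actor_input, timeout=300):
    """Run the Actor synchronously and return the parsed response body."""
    request = build_run_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the body is a JSON array of dataset items.
        return json.loads(response.read().decode("utf-8"))
```

A call such as `run_and_fetch(token, {"searchQueries": ["machine learning"]})` blocks until the run finishes, so long runs may be better served by the `/runs` path and polling.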
