# arXiv Scraper: Papers, Authors, Categories & Search (`perconey/arxiv-scraper`) Actor

Scrape arxiv.org via the official Atom API. Full-text search, by author / title / category, paper detail by id, latest in any category. Returns title, abstract, authors, DOI, PDF link. No auth, no proxies. Pay only per result item.

- **URL**: https://apify.com/perconey/arxiv-scraper.md
- **Developed by:** [Perconey](https://apify.com/perconey) (community)
- **Categories:** Developer tools, News, AI
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

$1.00 / 1,000 result items

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### What does arXiv Scraper do?

**arXiv Scraper** pulls research papers from [arxiv.org](https://arxiv.org) via the **official Atom API**. It supports the latest papers in any category, free-text search, and lookups by author, title, or id, and returns the full title, abstract, authors, DOI, journal reference, and PDF link. arxiv.org is the canonical preprint server for AI / ML / CS / math / physics / quant-bio, with over 2.4 million papers. The Actor calls the documented public API directly: no browser and no auth. Requests are routed through Apify Proxy by default (see Tips).

Try it instantly: pick **getLatestPapers**, leave category `cs.AI`, click Start. You get the 30 newest AI papers (title, abstract, authors, PDF link) in under 5 seconds for $0.03.

### Why use arXiv Scraper?

- **AI / ML researchers**: Daily digest of new papers in your category. Schedule `getLatestPapers` for cs.AI / cs.CL / cs.LG and never miss a release.
- **Trend analysts**: Track which sub-fields are accelerating. Combine `getPapersByCategory` with `sortBy=submittedDate` to see week-over-week paper-count deltas.
- **Recruiters / scouts**: `getPapersByAuthor` returns everything a researcher published, with publication dates and co-authors. Ideal for hiring pipelines.
- **Content marketers in tech**: Pull abstracts of trending papers and remix into blog content / newsletters. The summary field is rich and license-friendly.
- **AI agent developers**: Wire the actor into your knowledge pipeline so your agent always has the latest research summaries to ground on.
- **Academic librarians**: Bulk-export your institution's authors. The actor paginates politely (3 s between batches per arXiv guidelines) so multi-thousand-result exports are safe.

### How to use arXiv Scraper

1. Open the **Input** tab.
2. Pick an **action** from the dropdown. `getLatestPapers` is the simplest starting point.
3. For getLatestPapers, set **category** (default `cs.AI`). Use any arXiv category code like `cs.CL`, `cs.LG`, `stat.ML`, `math.OC`, `q-bio.QM`.
4. For search / by-author / by-title / by-category / paper-detail actions, fill **queries**.
5. Tune **maxItems** (default 30).
6. Click **Start**.

#### Query format by action

Action | Query format
--- | ---
getLatestPapers | leave empty (use category field)
searchPapers | free-text (e.g. `large language model`)
getPapersByAuthor | author surname (e.g. `Bengio`, `LeCun`, `Hinton`)
getPapersByCategory | arXiv category code (e.g. `cs.AI`)
getPapersByTitle | exact title phrase (e.g. `attention is all you need`)
getPaperDetail | arXiv id (e.g. `2501.00001` or `2501.00001v2`)
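
The table above can be read as a small set of rules for assembling the input object. The sketch below encodes only what this README states; `build_input` is a hypothetical helper for illustration, not part of the Actor or the Apify client, and rejecting queries for `getLatestPapers` is an assumption drawn from the "leave empty" guidance.

```python
# Hypothetical helper: encodes the action/query rules from the table above.
ACTIONS_NEEDING_QUERIES = {
    "searchPapers",
    "getPapersByAuthor",
    "getPapersByCategory",
    "getPapersByTitle",
    "getPaperDetail",
}


def build_input(action, queries=None, category="cs.AI", max_items=30):
    """Return an input dict for the Actor, rejecting combinations the README rules out."""
    queries = list(queries or [])
    if action == "getLatestPapers":
        # Assumption: queries must stay empty; the category field drives this action.
        if queries:
            raise ValueError("getLatestPapers takes a category, not queries")
        return {"action": action, "queries": [], "category": category, "maxItems": max_items}
    if action not in ACTIONS_NEEDING_QUERIES:
        raise ValueError(f"unknown action: {action}")
    if not queries:
        raise ValueError(f"{action} needs at least one entry in queries")
    return {"action": action, "queries": queries, "maxItems": max_items}


latest = build_input("getLatestPapers", category="cs.CL")
by_author = build_input("getPapersByAuthor", ['"Yoshua Bengio"'])
detail = build_input("getPaperDetail", ["2501.00001v2"])
```

The resulting dictionaries can be passed as the run input in the client examples in the API section.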

### Input

Field | Required | Description
--- | --- | ---
`action` | yes | Which lookup. Six options.
`queries` | sometimes | Required for all actions except getLatestPapers.
`category` | no | getLatestPapers only. arXiv category code. Default `cs.AI`.
`maxItems` | no | Max items per query. Default 30. arXiv API caps a single call at 30,000 - we paginate in batches of 100 with the recommended 3 s delay.
`sortBy` | no | `submittedDate` (default), `relevance`, or `lastUpdatedDate`.
`proxyConfiguration` | no | Proxy settings. Default `{ "useApifyProxy": true }`.

### Output

Every item carries `_type=paper` (or `error`) plus `_action`.

```json
{
    "_type": "paper",
    "_action": "getLatestPapers",
    "arxiv_id": "2501.00001v1",
    "version": 1,
    "title": "Toward Foundation Models for Cell-Level Biology",
    "summary": "We present a new family of foundation models for single-cell genomics ...",
    "authors": ["Jane Doe", "John Smith", "Alex Researcher"],
    "author_count": 3,
    "categories": ["q-bio.QM", "cs.LG"],
    "primary_category": "q-bio.QM",
    "published": "2026-01-02T15:30:00Z",
    "updated":   "2026-01-08T09:12:00Z",
    "doi": null,
    "journal_ref": null,
    "comment": "https://github.com/lab/foundation-cells",
    "pdf_url": "https://arxiv.org/pdf/2501.00001v1",
    "abs_url": "https://arxiv.org/abs/2501.00001v1"
}
````

You can download the dataset in JSON, CSV, XML, Excel, RSS or HTML format from the Output tab.
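
Because every item carries `_type`, a consumer can separate papers from errors before further processing. The following is a minimal sketch under stated assumptions: `split_items` and `parse_timestamp` are hypothetical helper names, and since the shape of `error` items is not documented here, the code inspects only `_type`.

```python
from datetime import datetime


def split_items(items):
    """Separate dataset items into papers and errors using the `_type` field."""
    papers = [item for item in items if item.get("_type") == "paper"]
    errors = [item for item in items if item.get("_type") == "error"]
    return papers, errors


def parse_timestamp(value):
    """Parse timestamps such as "2026-01-02T15:30:00Z" into timezone-aware datetimes."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Sample items trimmed to the fields used here; the error item's shape is assumed.
items = [
    {"_type": "paper", "arxiv_id": "2501.00001v1",
     "published": "2026-01-02T15:30:00Z", "updated": "2026-01-08T09:12:00Z"},
    {"_type": "error", "_action": "getPaperDetail"},
]
papers, errors = split_items(items)
first = papers[0]
# Whole days between first submission and the latest revision.
days_between = (parse_timestamp(first["updated"]) - parse_timestamp(first["published"])).days
```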

### Data fields

Type | Key fields
--- | ---
`paper` | `arxiv_id`, `version`, `title`, `summary`, `authors`, `author_count`, `categories`, `primary_category`, `published`, `updated`, `doi`, `journal_ref`, `comment`, `pdf_url`, `abs_url`

### Pricing

**Pay-per-result: $0.001 per paper.** No flat monthly fee.

Cost examples:

- Daily 30 newest cs.AI papers: **$0.03**
- 1,000 papers by an author: **$1.00**
- 5,000 cs.CL papers from the last year for a literature review: **$5.00**
- One paper detail lookup: **$0.001**
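
The cost examples above all follow from the flat per-item price. As a rough sketch, the arithmetic can be reproduced as follows; `estimate_cost` is a hypothetical helper, and `Decimal` is used only to keep the cents exact.

```python
from decimal import Decimal

# From the pricing section: $0.001 per paper ($1.00 per 1,000 result items).
PRICE_PER_ITEM_USD = Decimal("0.001")


def estimate_cost(item_count):
    """Return the pay-per-result cost in USD for a given number of result items."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    return PRICE_PER_ITEM_USD * item_count


for label, count in [("daily cs.AI digest", 30), ("author export", 1000), ("literature review", 5000)]:
    print(f"{label}: ${estimate_cost(count):.2f}")
```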

### Tips

- **Proxy is enabled by default.** arXiv aggressively rate-limits per outbound IP, and the Apify cloud egress pool is shared across many users, so requests from a single shared IP can receive a 429 within seconds. The Actor uses Apify Proxy by default to rotate IPs per request. Disable it via `proxyConfiguration.useApifyProxy: false` only if you are sure your own IP will not be rate-limited.
- **Pagination is rate-limited.** arXiv asks for 3 s between requests, so 30,000 papers take ~15 minutes wall-clock minimum. Plan timeouts accordingly.
- **Category codes are case-sensitive.** Use the arXiv taxonomy: https://arxiv.org/category_taxonomy. Common ones: cs.AI, cs.CL (NLP), cs.CV (Vision), cs.LG (ML), stat.ML.
- **Author search matches surnames.** `Bengio` returns Yoshua + Samy + others. Use full names with quotes for disambiguation: `"Yoshua Bengio"`.
- **Comment field often has GitHub links.** `arxiv:comment` is where authors typically paste their code-repo URL. Useful for crawling implementations.
- **Versions matter.** A paper id like `2501.00001` returns the latest version. Pin to a specific revision with `2501.00001v2`.
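
Two of the tips above lend themselves to small post-processing helpers: splitting a pinned id into base id and version, and pulling GitHub links out of the `comment` field. This is a sketch only; both function names are hypothetical, the id pattern is assumed to cover new-style ids such as `2501.00001v2` (not older ids with a subject prefix), and the URL pattern is a simple heuristic.

```python
import re

# Assumption: new-style arXiv ids only (YYMM.NNNNN with an optional vN suffix).
ARXIV_ID = re.compile(r"^(\d{4}\.\d{4,5})(?:v(\d+))?$")
# Heuristic pattern for github.com/<owner>/<repo> links.
GITHUB_URL = re.compile(r"https?://github\.com/[\w.-]+/[\w.-]+")


def split_arxiv_id(arxiv_id):
    """Return (base_id, version); version is None when the id is not pinned."""
    match = ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv id: {arxiv_id}")
    base, version = match.groups()
    return base, int(version) if version else None


def github_links(comment):
    """Extract GitHub repository URLs from a paper's `comment` field, if any."""
    if not comment:
        return []
    # Strip a trailing full stop that belongs to the sentence, not the URL.
    return [url.rstrip(".") for url in GITHUB_URL.findall(comment)]
```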

### FAQ, disclaimers, support

**Is this legal?** The actor calls arxiv.org's official documented public API, identifies itself with a clear User-Agent, and honors the recommended 3 s inter-request delay. arXiv explicitly supports automated access.

**Why is pagination slow?** arXiv asks API clients to wait 3 s between requests. We honor that. For large pulls, schedule the actor overnight.
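
To plan timeouts, the waiting time can be estimated from the batch size of 100 and the 3 s delay stated in this README. The sketch below gives a lower bound only: it counts the pauses between batches and ignores request and response time, so real runs take longer. `min_pagination_delay` is a hypothetical name.

```python
import math

BATCH_SIZE = 100     # batch size stated in the Input section
DELAY_SECONDS = 3    # delay between batches stated in this README


def min_pagination_delay(max_items):
    """Lower bound on seconds spent waiting between batches for one query."""
    batches = math.ceil(max_items / BATCH_SIZE)
    # One pause between each pair of consecutive batches.
    return max(batches - 1, 0) * DELAY_SECONDS


seconds = min_pagination_delay(30_000)
print(f"{seconds} s, about {seconds / 60:.0f} minutes")
```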

**What about citation counts?** arXiv does not expose citation counts via its API. For citation metrics you would need Semantic Scholar, or Google Scholar (which has no public API). Open an issue if this matters for your use case.

**What about the full paper text?** The actor returns the abstract plus a PDF link. To get the full text, download the PDF via the `pdf_url` field.

**Bug or feature request?** Open an Issue on the actor's Issues tab. I usually respond within a day.

**Need a scraper for Hacker News, Stack Overflow, dev.to, Lemmy, Mastodon, Bluesky, Substack?** See my other actors at https://apify.com/perconey.

# Actor input Schema

## `action` (type: `string`):

Pick the action. searchPapers / getPapersByAuthor / getPapersByCategory / getPapersByTitle / getPaperDetail need at least one entry in queries. getLatestPapers takes an optional category.

## `queries` (type: `array`):

One entry per query. searchPapers: free text. getPapersByAuthor: surname (e.g. Bengio). getPapersByCategory: arXiv category code. getPapersByTitle: phrase to match. getPaperDetail: arXiv id. getLatestPapers: leave empty.

## `category` (type: `string`):

arXiv category code. Examples: cs.AI (AI), cs.CL (NLP), cs.CV (Vision), cs.LG (Machine Learning), stat.ML, math.OC, q-bio.QM. Default: cs.AI.

## `maxItems` (type: `integer`):

Stop after this many items per query. arXiv API caps a single call at 30,000 but we paginate in batches of 100 with a 3 s delay between batches (per arXiv API guidelines).

## `sortBy` (type: `string`):

submittedDate = newest first (default for getLatestPapers). relevance = best match (default for search actions). lastUpdatedDate = most-recently-revised first.

## `proxyConfiguration` (type: `object`):

arxiv aggressively rate-limits per IP. The default (Apify proxy with datacenter group) rotates IPs and bypasses the limit. Disable only if you intend to run from a single trusted IP.

## Actor input object example

```json
{
  "action": "getLatestPapers",
  "queries": [],
  "category": "cs.AI",
  "maxItems": 30,
  "sortBy": "submittedDate",
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
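
Before starting a run, an input object like the one above can be checked locally against the constraints listed in this document's input schema (the `action` and `sortBy` enums and the 0 to 30000 range for `maxItems`). This is a minimal sketch; `validate_input` is a hypothetical helper and the Apify platform performs its own validation.

```python
ALLOWED_ACTIONS = {
    "getLatestPapers", "searchPapers", "getPapersByAuthor",
    "getPapersByCategory", "getPapersByTitle", "getPaperDetail",
}
ALLOWED_SORT = {"submittedDate", "relevance", "lastUpdatedDate"}


def validate_input(actor_input):
    """Return a list of problems; an empty list means the documented constraints hold."""
    problems = []
    if actor_input.get("action") not in ALLOWED_ACTIONS:
        problems.append("action must be one of the six documented values")
    max_items = actor_input.get("maxItems", 30)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(max_items, int) or isinstance(max_items, bool) or not 0 <= max_items <= 30000:
        problems.append("maxItems must be an integer between 0 and 30000")
    if actor_input.get("sortBy", "submittedDate") not in ALLOWED_SORT:
        problems.append("sortBy must be submittedDate, relevance, or lastUpdatedDate")
    if not all(isinstance(query, str) for query in actor_input.get("queries", [])):
        problems.append("queries must be an array of strings")
    return problems
```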

# Actor output Schema

## `dataset` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "action": "getLatestPapers", // required by the input schema
    "category": "cs.AI",
    "queries": [],
    "proxyConfiguration": {
        "useApifyProxy": true
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("perconey/arxiv-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "action": "getLatestPapers",  # required by the input schema
    "category": "cs.AI",
    "queries": [],
    "proxyConfiguration": { "useApifyProxy": True },
}

# Run the Actor and wait for it to finish
run = client.actor("perconey/arxiv-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "action": "getLatestPapers",
  "category": "cs.AI",
  "queries": [],
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}' |
apify call perconey/arxiv-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=perconey/arxiv-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "arXiv Scraper: Papers, Authors, Categories & Search",
        "description": "Scrape arxiv.org via the official Atom API. Full-text search, by author / title / category, paper detail by id, latest in any category. Returns title, abstract, authors, DOI, PDF link. No auth, no proxies. Pay only per result item.",
        "version": "0.1",
        "x-build-id": "yhrD2JnWcUf9lWdtq"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/perconey~arxiv-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-perconey-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/perconey~arxiv-scraper/runs": {
            "post": {
                "operationId": "runs-sync-perconey-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/perconey~arxiv-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-perconey-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "action"
                ],
                "properties": {
                    "action": {
                        "title": "What do you want to scrape?",
                        "enum": [
                            "getLatestPapers",
                            "searchPapers",
                            "getPapersByAuthor",
                            "getPapersByCategory",
                            "getPapersByTitle",
                            "getPaperDetail"
                        ],
                        "type": "string",
                        "description": "Pick the action. searchPapers / getPapersByAuthor / getPapersByCategory / getPapersByTitle / getPaperDetail need at least one entry in queries. getLatestPapers takes an optional category.",
                        "default": "getLatestPapers"
                    },
                    "queries": {
                        "title": "Queries",
                        "type": "array",
                        "description": "One entry per query. searchPapers: free text. getPapersByAuthor: surname (e.g. Bengio). getPapersByCategory: arXiv category code. getPapersByTitle: phrase to match. getPaperDetail: arXiv id. getLatestPapers: leave empty.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "category": {
                        "title": "Category filter (getLatestPapers only)",
                        "type": "string",
                        "description": "arXiv category code. Examples: cs.AI (AI), cs.CL (NLP), cs.CV (Vision), cs.LG (Machine Learning), stat.ML, math.OC, q-bio.QM. Default: cs.AI.",
                        "default": "cs.AI"
                    },
                    "maxItems": {
                        "title": "Max items per query",
                        "minimum": 0,
                        "maximum": 30000,
                        "type": "integer",
                        "description": "Stop after this many items per query. arXiv API caps a single call at 30,000 but we paginate in batches of 100 with a 3 s delay between batches (per arXiv API guidelines).",
                        "default": 30
                    },
                    "sortBy": {
                        "title": "Sort order",
                        "enum": [
                            "submittedDate",
                            "relevance",
                            "lastUpdatedDate"
                        ],
                        "type": "string",
                        "description": "submittedDate = newest first (default for getLatestPapers). relevance = best match (default for search actions). lastUpdatedDate = most-recently-revised first.",
                        "default": "submittedDate"
                    },
                    "proxyConfiguration": {
                        "title": "Proxy",
                        "type": "object",
                        "description": "arxiv aggressively rate-limits per IP. The default (Apify proxy with datacenter group) rotates IPs and bypasses the limit. Disable only if you intend to run from a single trusted IP.",
                        "default": {
                            "useApifyProxy": true
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
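
For projects without an Apify client, the `run-sync-get-dataset-items` path in the specification above can be called over plain HTTP. The sketch below only builds the request using the standard library and does not send it; `build_request` is a hypothetical helper, and the assumption that the response body is a JSON array of dataset items should be checked against the Apify API reference.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/perconey~arxiv-scraper/run-sync-get-dataset-items"


def build_request(token, actor_input):
    """Build (but do not send) the POST request described by the specification above."""
    query = urllib.parse.urlencode({"token": token})
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("<YOUR_API_TOKEN>", {"action": "getLatestPapers", "category": "cs.AI"})
# To execute the run, pass `request` to urllib.request.urlopen and decode the JSON body.
```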
