# arXiv Research Papers Tracker (`wsgcjj/arxiv-papers-scraper`) Actor

Search and extract academic papers from arXiv by category, keyword, and date range. Returns the paper title, authors, abstract, categories, published date, and PDF URL. Ideal for AI/ML research monitoring and training data collection.

- **URL**: https://apify.com/wsgcjj/arxiv-papers-scraper.md
- **Developed by:** [陈俊杰](https://apify.com/wsgcjj) (community)
- **Categories:** Developer tools, AI
- **Stats:** 2 total users, 1 monthly user, 0.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is free to use; you pay only for the Apify platform usage it consumes, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## arXiv Research Papers Tracker

An [Apify Actor](https://apify.com/actors) that searches and extracts academic papers from [arXiv](https://arxiv.org/) by category, keyword, and date range. Ideal for AI/ML research monitoring, literature reviews, and training-data collection.

### Features

- **Category search** — search one or more arXiv categories (e.g. `cs.AI`, `cs.LG`, `stat.ML`).
- **Keyword filtering** — narrow results to papers whose title or abstract contains specific terms.
- **Pagination** — automatically fetches up to 200 results with polite 3-second delays between pages.
- **Rich output** — returns title, authors, abstract, categories, published/updated dates, PDF URL, and arXiv ID.
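
The category, keyword, and pagination features above can be sketched against the public arXiv API as follows. This is a minimal illustration, not the Actor's actual implementation: the page size of 100, the helper names, and the way keywords are combined with `ti:` and `abs:` field prefixes are all assumptions. Only the 200-result cap and the 3-second delay come from this README.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint
PAGE_SIZE = 100     # assumed page size; the Actor's real value is not documented
MAX_TOTAL = 200     # cap stated in this README
DELAY_SECONDS = 3   # delay between pages stated in this README


def build_search_query(categories: str, keywords: str = "") -> str:
    """Combine category and keyword filters into an arXiv search_query string."""
    cats = [c.strip() for c in categories.split(",") if c.strip()]
    query = " OR ".join(f"cat:{c}" for c in cats)
    terms = keywords.split()
    if terms:
        # Assumption: every term must appear in the title (ti:) or abstract (abs:).
        kw = " AND ".join(f"(ti:{t} OR abs:{t})" for t in terms)
        query = f"({query}) AND {kw}"
    return query


def plan_pages(max_results: int, page_size: int = PAGE_SIZE) -> list[tuple[int, int]]:
    """Return (start, count) pairs covering min(max_results, MAX_TOTAL) results."""
    total = max(0, min(max_results, MAX_TOTAL))
    return [(start, min(page_size, total - start)) for start in range(0, total, page_size)]


def page_urls(categories: str, keywords: str, max_results: int, sort_by: str) -> list[str]:
    """Build one request URL per page; nothing is fetched here."""
    query = build_search_query(categories, keywords)
    return [
        ARXIV_API + "?" + urlencode({
            "search_query": query,
            "start": start,
            "max_results": count,
            "sortBy": sort_by,
            "sortOrder": "descending",
        })
        for start, count in plan_pages(max_results)
    ]


for url in page_urls("cs.AI,cs.LG", "diffusion", 150, "submittedDate"):
    print(url)
```

A fetch loop built on this would request each URL in turn and call `time.sleep(DELAY_SECONDS)` between pages.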

### Input

| Field          | Type    | Default                    | Description                                   |
|----------------|---------|----------------------------|-----------------------------------------------|
| `categories`   | string  | `cs.AI,cs.LG,stat.ML`     | Comma-separated arXiv category codes          |
| `keywords`     | string  | *(optional)*               | Space-separated search terms (title/abstract) |
| `max_results`  | integer | 50                         | Maximum number of papers (≤ 200)              |
| `sort_by`      | enum    | `submittedDate`            | `submittedDate` or `relevance`                |

### Output

Each result is a JSON object pushed to the Apify dataset with the following fields:

| Field              | Type     | Description                                      |
|--------------------|----------|--------------------------------------------------|
| `id`               | string   | arXiv identifier (e.g. `2101.12345`)             |
| `url`              | string   | arXiv abstract page URL                          |
| `title`            | string   | Paper title                                      |
| `authors`          | string[] | List of author names                             |
| `abstract`         | string   | Paper abstract / summary                         |
| `categories`       | string   | Comma-separated category codes                   |
| `primary_category` | string   | Primary arXiv category                           |
| `published`        | string   | Original publication date (ISO‑8601)             |
| `updated`          | string   | Last update date (ISO‑8601)                      |
| `pdf_url`          | string   | Direct link to the PDF                           |

### Common arXiv Category Codes

#### Computer Science (cs.*)

| Code               | Description                              |
|--------------------|------------------------------------------|
| `cs.AI`            | Artificial Intelligence                  |
| `cs.AR`            | Hardware Architecture                    |
| `cs.CC`            | Computational Complexity                 |
| `cs.CE`            | Computational Engineering, Finance, and Science |
| `cs.CL`            | Computation and Language (NLP)           |
| `cs.CR`            | Cryptography and Security                |
| `cs.CV`            | Computer Vision and Pattern Recognition  |
| `cs.CY`            | Computers and Society                    |
| `cs.DB`            | Databases                                |
| `cs.DC`            | Distributed, Parallel, and Cluster Computing |
| `cs.DL`            | Digital Libraries                        |
| `cs.DS`            | Data Structures and Algorithms           |
| `cs.ET`            | Emerging Technologies                    |
| `cs.GL`            | General Literature                       |
| `cs.GT`            | Computer Science and Game Theory         |
| `cs.HC`            | Human-Computer Interaction               |
| `cs.IR`            | Information Retrieval                    |
| `cs.IT`            | Information Theory                       |
| `cs.LG`            | Machine Learning                         |
| `cs.LO`            | Logic in Computer Science                |
| `cs.MA`            | Multiagent Systems                       |
| `cs.NE`            | Neural and Evolutionary Computing        |
| `cs.NI`            | Networking and Internet Architecture     |
| `cs.PL`            | Programming Languages                    |
| `cs.RO`            | Robotics                                 |
| `cs.SE`            | Software Engineering                     |
| `cs.SI`            | Social and Information Networks          |
| `cs.SY`            | Systems and Control                      |

#### Statistics (stat.*)

| Code               | Description                              |
|--------------------|------------------------------------------|
| `stat.AP`          | Applications                             |
| `stat.CO`          | Computation                              |
| `stat.ME`          | Methodology                              |
| `stat.ML`          | Machine Learning                         |
| `stat.TH`          | Statistics Theory                         |

#### Mathematics (math.*)

| Code               | Description                              |
|--------------------|------------------------------------------|
| `math.NA`          | Numerical Analysis                       |
| `math.OC`          | Optimization and Control                 |
| `math.PR`          | Probability                              |
| `math.ST`          | Statistics Theory                         |

#### Physics (physics.*) & Other

| Code               | Description                              |
|--------------------|------------------------------------------|
| `physics.*`        | Various physics sub-disciplines          |
| `q-fin.*`          | Quantitative Finance                     |
| `q-bio.*`          | Quantitative Biology                     |
| `eess.*`           | Electrical Engineering and Systems Science |

> See the [full arXiv category list](https://arxiv.org/category_taxonomy).
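
A purely syntactic check can catch typos in category codes before a run. The sketch below is an assumption-laden illustration: the regular expression only tests the general shape of a code (`archive` or `archive.subject`), does not confirm that the category exists, and treats wildcard rows such as `physics.*` as malformed because they are placeholders in the tables above rather than concrete codes.

```python
import re

# Shape check only: lowercase archive, optional dot-separated subject class.
CATEGORY_RE = re.compile(r"^[a-z]+(?:-[a-z]+)*(?:\.[A-Za-z]+(?:-[A-Za-z]+)*)?$")


def split_categories(value: str) -> tuple[list[str], list[str]]:
    """Return (well_formed, malformed) codes from a comma-separated string."""
    well_formed, malformed = [], []
    for code in (c.strip() for c in value.split(",")):
        if not code:
            continue  # ignore empty entries such as a trailing comma
        (well_formed if CATEGORY_RE.match(code) else malformed).append(code)
    return well_formed, malformed


good, bad = split_categories("cs.AI, stat.ML, q-fin.ST, cs AI, physics.*")
print("well-formed:", good)
print("malformed:", bad)
```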

### Local Development

```bash
# Clone / navigate to the project
cd ~/apify-actors/arxiv-papers-scraper

# Install dependencies
pip install -r requirements.txt

# Run the Actor (requires Apify API token when using Apify platform features)
python -m src
````

To run with custom input via the Apify CLI:

```bash
apify run
```
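
For local runs, the Apify CLI conventionally reads the Actor input from `storage/key_value_stores/default/INPUT.json` inside the project directory. The helper below is a hypothetical sketch that writes such a file; confirm the path against the CLI version you have installed.

```python
import json
from pathlib import Path


def write_local_input(project_dir: str, actor_input: dict) -> Path:
    """Write actor_input where `apify run` conventionally looks for local input."""
    path = Path(project_dir) / "storage" / "key_value_stores" / "default" / "INPUT.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(actor_input, indent=2), encoding="utf-8")
    return path
```

Calling `write_local_input(".", {"categories": "cs.AI", "max_results": 10, "sort_by": "submittedDate"})` from the project root and then running `apify run` should pick up that input.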

### License

MIT

# Actor input Schema

## `categories` (type: `string`):

Comma-separated arXiv categories to search (e.g., cs.AI,cs.LG,stat.ML)

## `keywords` (type: `string`):

Optional search terms to filter papers by title or abstract

## `max_results` (type: `integer`):

Maximum number of papers to return (max 200)

## `sort_by` (type: `string`):

Sort results by relevance or submission date

## Actor input object example

```json
{
  "categories": "cs.AI,cs.LG,stat.ML",
  "max_results": 50,
  "sort_by": "submittedDate"
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

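// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input (values taken from the input object example above)
const input = {
    categories: 'cs.AI,cs.LG,stat.ML',
    max_results: 50,
    sort_by: 'submittedDate',
};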

// Run the Actor and wait for it to finish
const run = await client.actor("wsgcjj/arxiv-papers-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

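# Prepare the Actor input (values taken from the input object example above)
run_input = {
    "categories": "cs.AI,cs.LG,stat.ML",
    "max_results": 50,
    "sort_by": "submittedDate",
}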

# Run the Actor and wait for it to finish
run = client.actor("wsgcjj/arxiv-papers-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"categories": "cs.AI,cs.LG,stat.ML", "max_results": 50, "sort_by": "submittedDate"}' |
apify call wsgcjj/arxiv-papers-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=wsgcjj/arxiv-papers-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "arXiv Research Papers Tracker",
        "description": "Search and extract academic papers from arXiv by category, keyword, date range. Returns paper title, authors, abstract, categories, published date, PDF URL. Ideal for AI/ML research monitoring and training data collection.",
        "version": "0.0",
        "x-build-id": "XtpSa2rG8iK1ibc9O"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/wsgcjj~arxiv-papers-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-wsgcjj-arxiv-papers-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/wsgcjj~arxiv-papers-scraper/runs": {
            "post": {
                "operationId": "runs-sync-wsgcjj-arxiv-papers-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/wsgcjj~arxiv-papers-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-wsgcjj-arxiv-papers-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "categories"
                ],
                "properties": {
                    "categories": {
                        "title": "Categories",
                        "type": "string",
                        "description": "Comma-separated arXiv categories to search (e.g., cs.AI,cs.LG,stat.ML)",
                        "default": "cs.AI,cs.LG,stat.ML"
                    },
                    "keywords": {
                        "title": "Keywords",
                        "type": "string",
                        "description": "Optional search terms to filter papers by title or abstract"
                    },
                    "max_results": {
                        "title": "Max Results",
                        "minimum": 1,
                        "maximum": 200,
                        "type": "integer",
                        "description": "Maximum number of papers to return (max 200)",
                        "default": 50
                    },
                    "sort_by": {
                        "title": "Sort By",
                        "enum": [
                            "relevance",
                            "submittedDate"
                        ],
                        "type": "string",
                        "description": "Sort results by relevance or submission date",
                        "default": "submittedDate"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
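
Without a client library, the `run-sync-get-dataset-items` path in the specification above can be called directly over HTTP. The sketch below uses only the Python standard library; the helper names are hypothetical, and it assumes the response body is a JSON array of dataset items, which the specification does not spell out.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/wsgcjj~arxiv-papers-scraper/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request; the token goes in the query string, as in the spec."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the response (assumed to be a JSON array)."""
    with urllib.request.urlopen(build_request(token, actor_input), timeout=timeout) as resp:
        return json.load(resp)
```

A call such as `run_actor("<YOUR_API_TOKEN>", {"categories": "cs.AI", "max_results": 10, "sort_by": "submittedDate"})` blocks until the run finishes, so the timeout should be generous for larger result counts.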
