# arXiv Paper Scraper — Citations, Authors, ORCID, Analytics (`brilliant_gum/arxiv-scraper`) Actor

Scrape academic papers from arXiv via the official Atom API. Filter by category, date, query, or author. Includes citation data, ORCID IDs from Semantic Scholar, citation network graph, and built-in analytics (authors, categories, timeline). Four output formats. Proxies included.

- **URL**: https://apify.com/brilliant\_gum/arxiv-scraper.md
- **Developed by:** [Yuliia Kulakova](https://apify.com/brilliant_gum) (community)
- **Categories:** Developer tools, AI, Other
- **Stats:** 2 total users, 1 monthly user, 100.0% of runs succeeded
- **User rating**: No ratings yet

## Pricing

From $3.00 per 1,000 papers scraped

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## arXiv Paper Scraper — Citations · Authors · ORCID · Analytics

Extract academic papers from [arXiv.org](https://arxiv.org) at scale. Built on the **official arXiv Atom API** — no DOM scraping, no anti-bot games, no breakage. Includes citation data from Semantic Scholar, ORCID author IDs, citation network graphs, and built-in analytics.

![banner](https://i.imgur.com/0Rd7Y8a.png)

> Filter by category, date range, author, or full-text query. Four output formats. Proxies included. Free Apify trial.

---

### What You Get

- **Official Atom API**, not a browser — fast (100 papers in ~6 seconds), stable, won't break when arXiv updates their site
- **Full text search** with field targeting (title, abstract, author, category, or all)
- **150+ subject categories** supported (cs.AI, cs.LG, stat.ML, math.ST, q-bio, physics.*, econ.*, …)
- **Date range filtering** based on arXiv's `submittedDate` (v1 submission)
- **Citation data from Semantic Scholar** (optional): citation count, influential citation count, related papers, **ORCID author IDs**
- **Citation network graph** export when citations are enabled — ready for Gephi, Cytoscape, or NetworkX
- **Four output formats** in one Actor:
  1. **Papers** — one record per paper (default)
  2. **Author analytics** — sorted by paper count and citations, with per-author paper lists
  3. **Category statistics** — paper counts and top 5 papers per arXiv category
  4. **Timeline** — publication counts by month and year
- **Legacy paperId support** — handles both modern (`2606.11125`) and pre-2007 (`astro-ph/0408219`) arXiv ID formats
- **Multi-query batching with deduplication** — pass several queries, dupes by arxivId are removed automatically
- **Built-in retry resilience** — exponential back-off on network blips and arXiv 429 / 503 responses
- **Proxies included** — no setup, works out of the box

---

### Use Cases

#### For Researchers & PhD Students
Track new papers in your niche daily. Set `categories: ["cs.CL"]`, `queries: ["retrieval augmented generation"]`, schedule to run every morning at 7 AM. Connect output to Slack or Google Sheets via Apify Integrations.

#### For R&D Teams & Labs
Build a private literature monitoring pipeline. Combine multiple queries (e.g. `["diffusion models", "flow matching", "score-based generative"]`) with a 30-day window. Output → internal Notion / Airtable.

#### For Analysts & Data Scientists
Measure research trends. Use `outputFormat: "timeline"` with a 5-year date range to chart monthly publication volume in a subject category. Or `outputFormat: "categories_stats"` to see which subfields dominate a query.

#### For Citation Network Analysis
Enable `includeCitations: true` with a Semantic Scholar API key, then load the `CITATION_GRAPH` Key-Value record into Gephi or Cytoscape. Nodes are papers in your scrape, edges are citation relationships filtered to your dataset.

---

### Quick Start

Paste this into the **Input** tab and click **Start**:

```json
{
  "queries": ["large language models"],
  "maxResults": 100
}
````

Results appear in the **Dataset** tab in real time. Analytics (author rankings, category stats, timeline) land in **Storage → Key-Value Store**. Typical default run: **~6 seconds, 100 papers**.

***

### Common Inputs

#### Track new papers in a specific niche

```json
{
  "queries": ["retrieval augmented generation"],
  "categories": ["cs.CL", "cs.IR"],
  "dateFrom": "2026-05-10",
  "maxResults": 200
}
```

#### Find papers by a specific author

```json
{
  "queries": ["Yann LeCun"],
  "searchField": "author",
  "maxResults": 100
}
```

#### Browse a category page

```json
{
  "queries": [""],
  "categories": ["cs.LG"],
  "dateFrom": "2026-06-01",
  "maxResults": 500
}
```

#### Build a citation graph for a research area

```json
{
  "queries": ["attention is all you need"],
  "maxResults": 50,
  "includeCitations": true,
  "semanticScholarApiKey": "YOUR_FREE_KEY_FROM_SEMANTICSCHOLAR_ORG"
}
```

#### Author analytics across a subfield

```json
{
  "queries": ["large language models"],
  "categories": ["cs.CL"],
  "maxResults": 1000,
  "outputFormat": "authors"
}
```

#### 5-year publication timeline for a topic

```json
{
  "queries": ["neural network"],
  "categories": ["cs.LG"],
  "dateFrom": "2021-01-01",
  "dateTo": "2026-06-10",
  "maxResults": 10000,
  "sortOrder": "ascending",
  "outputFormat": "timeline"
}
```

> **Tip for timeline runs:** use `sortOrder: "ascending"` with large date ranges so older papers aren't skipped. Active categories like cs.LG receive 200+ papers per day, so a descending sort with `maxResults: 500` may cover only the most recent days or weeks.

***

### Input Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| `queries` | Array | Search query strings — multi-word queries match all words (e.g. `"attention transformer"`) | `["large language models"]` |
| `searchField` | String | `all`, `title`, `abstract`, `author`, or `category` | `all` |
| `categories` | Array | arXiv subject codes (OR-combined, e.g. `["cs.AI", "cs.LG"]`) | — |
| `dateFrom` | String | ISO 8601 (YYYY-MM-DD), based on v1 submission date | — |
| `dateTo` | String | ISO 8601 (YYYY-MM-DD) | — |
| `maxResults` | Integer | Max papers per query (1–10,000) | `100` |
| `sortBy` | String | `submittedDate`, `relevance`, or `lastUpdatedDate` | `submittedDate` |
| `sortOrder` | String | `descending` (newest first) or `ascending` | `descending` |
| `includeAbstract` | Boolean | Include full abstract text | `true` |
| `includeCitations` | Boolean | Enrich with Semantic Scholar (citation count, ORCID, related papers) | `false` |
| `semanticScholarApiKey` | String (secret) | Optional Semantic Scholar API key — strongly recommended when `includeCitations: true` | — |
| `outputFormat` | String | `papers`, `authors`, `categories_stats`, or `timeline` | `papers` |
| `proxyConfiguration` | Object | Optional — proxies are included automatically | — |

***

### Output — Dataset Fields

Sample paper record (`outputFormat: "papers"`):

```json
{
  "paperId": "2606.11107",
  "arxivId": "2606.11107",
  "title": "Multimodal Brain Tumour Classification Using Feature Fusion",
  "authors": [
    {
      "name": "Wajih ul Islam",
      "affiliation": null,
      "orcid": null
    },
    {
      "name": "Volker Steuber",
      "affiliation": null,
      "orcid": "0000-0003-4683-3173"
    }
  ],
  "abstract": "Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history...",
  "primaryCategory": "eess.IV",
  "categories": ["eess.IV", "cs.CV", "cs.LG"],
  "submittedDate": "2026-06-09",
  "updatedDate": "2026-06-09",
  "pdfUrl": "https://arxiv.org/pdf/2606.11107v1",
  "htmlUrl": null,
  "absUrl": "https://arxiv.org/abs/2606.11107v1",
  "doi": "10.1109/EXAMPLE.2026.12345",
  "journalRef": "Nature Machine Intelligence 8 (2026) 1042-1057",
  "comments": "12 pages, 5 figures",
  "license": "http://creativecommons.org/licenses/by/4.0/",
  "citationCount": 47,
  "influentialCitationCount": 8,
  "relatedPapers": [
    {
      "semanticScholarId": "abc123def456",
      "title": "Attention Is All You Need",
      "citationCount": 95442,
      "isInfluential": true,
      "relationshipType": "cited_by_this"
    }
  ],
  "semanticScholarId": "xyz789uvw012"
}
```
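
As a small illustration of consuming a record of this shape, the sketch below collects the authors that carry an ORCID and builds a profile link for each. The helper name `authors_with_orcid` is hypothetical, and the `https://orcid.org/<id>` link form is the general ORCID convention rather than something the Actor emits.

```python
def authors_with_orcid(paper: dict) -> list[tuple[str, str]]:
    """Return (name, ORCID URL) pairs for authors whose `orcid` field is set."""
    pairs = []
    for author in paper.get("authors") or []:
        orcid = author.get("orcid")
        if orcid:  # null when no ORCID was found for this author
            pairs.append((author["name"], f"https://orcid.org/{orcid}"))
    return pairs


# Trimmed copy of the sample record shown above.
sample = {
    "arxivId": "2606.11107",
    "authors": [
        {"name": "Wajih ul Islam", "affiliation": None, "orcid": None},
        {"name": "Volker Steuber", "affiliation": None, "orcid": "0000-0003-4683-3173"},
    ],
}

for name, url in authors_with_orcid(sample):
    print(name, url)
```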

#### All Fields

| Field | Type | Description |
|---|---|---|
| `paperId` / `arxivId` | String | Canonical arXiv ID (e.g. `2606.11107` or legacy `astro-ph/0408219`) |
| `title` | String | Paper title |
| `authors[]` | Array | Each: `name`, `affiliation` (rare, only when arXiv provides), `orcid` (via Semantic Scholar) |
| `abstract` | String | Full abstract text (null if `includeAbstract: false`) |
| `primaryCategory` | String | Primary arXiv subject code |
| `categories[]` | Array | All assigned arXiv categories |
| `submittedDate` | String | ISO 8601 — when v1 was submitted |
| `updatedDate` | String | ISO 8601 — last revision date |
| `pdfUrl` | String | Direct PDF URL |
| `htmlUrl` | String | HTML version URL (newer papers only, null otherwise) |
| `absUrl` | String | arXiv abstract page URL |
| `doi` | String | Digital Object Identifier (if registered) |
| `journalRef` | String | Journal citation (if peer-reviewed) |
| `comments` | String | Author comments (page count, conference acceptance, etc.) |
| `license` | String | License URL (if specified) |
| `citationCount` | Integer | Total citations from Semantic Scholar (null without `includeCitations`) |
| `influentialCitationCount` | Integer | Subset rated influential by Semantic Scholar |
| `relatedPapers[]` | Array | Up to 5 most-cited related papers (cited by this paper or citing this paper) |
| `semanticScholarId` | String | Semantic Scholar internal ID |

***

### Analytics Report (Key-Value Store)

Every run saves four analytics records to the **Key-Value Store** (available in the run's **Storage** tab), regardless of `outputFormat`:

#### `AUTHOR_ANALYTICS`

List of every unique author across the scrape, with per-author paper count, total citations, ORCID (when available), category distribution, and paper list. Sorted by paper count descending.

#### `CATEGORY_STATS`

Per-category breakdown — paper count, total/average citations, and top 5 most-cited papers in each arXiv category. Sorted by paper count.

#### `TIMELINE`

Publication volume by month and year — for charting research activity over time.
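
If you want to recompute a similar breakdown yourself from `papers` records, a minimal sketch could look like the following. It assumes `submittedDate` is a `YYYY-MM-DD` string, as in the sample record. The function name and the `by_month` / `by_year` output keys are hypothetical and are not the Actor's actual `TIMELINE` schema, and the records are made-up placeholders.

```python
from collections import Counter


def build_timeline(papers: list[dict]) -> dict:
    """Count papers per month (YYYY-MM) and per year (YYYY)."""
    by_month = Counter()
    by_year = Counter()
    for paper in papers:
        submitted = paper.get("submittedDate")
        if not submitted:
            continue  # skip records without a submission date
        by_month[submitted[:7]] += 1
        by_year[submitted[:4]] += 1
    return {
        "by_month": dict(sorted(by_month.items())),
        "by_year": dict(sorted(by_year.items())),
    }


# Placeholder records; only the field used by the sketch is filled in.
papers = [
    {"arxivId": "a", "submittedDate": "2025-12-30"},
    {"arxivId": "b", "submittedDate": "2026-06-09"},
    {"arxivId": "c", "submittedDate": "2026-06-01"},
]
timeline = build_timeline(papers)
print(timeline["by_month"])
print(timeline["by_year"])
```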

#### `RUN_SUMMARY`

High-level run statistics: total papers, total unique authors, total categories, queries run, citation enrichment status, and date span of the scraped corpus.

#### `CITATION_GRAPH` (only when `includeCitations: true`)

A network graph of papers in your scrape:

- **Nodes** = papers (with `arxivId`, `title`, `citationCount`, `submittedDate`, `primaryCategory`)
- **Edges** = citation relationships filtered to papers in your dataset (`source` cites `target`, `isInfluential` flag)

Load directly into Gephi, Cytoscape, or NetworkX for visualization and centrality analysis.
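
For a quick look without a graph library, the standard-library sketch below ranks papers by how often they are cited by other papers in the same scrape (in-degree). It assumes the record is a JSON object with top-level `nodes` and `edges` lists and that `source` / `target` hold `arxivId` values. Those layout details are assumptions, so check them against a real `CITATION_GRAPH` record first. The IDs in the example are made up.

```python
from collections import Counter


def most_cited_within_scrape(graph: dict, top_n: int = 3) -> list[tuple[str, int]]:
    """Rank nodes by in-degree, i.e. citations received from other nodes."""
    node_ids = {node["arxivId"] for node in graph["nodes"]}
    in_degree = Counter({node_id: 0 for node_id in node_ids})
    for edge in graph["edges"]:
        # `source` cites `target`; ignore edges that leave the dataset.
        if edge["source"] in node_ids and edge["target"] in node_ids:
            in_degree[edge["target"]] += 1
    # Highest count first, ties broken by ID for a stable order.
    return sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Hypothetical three-paper graph in the assumed layout.
graph = {
    "nodes": [
        {"arxivId": "2401.00001", "title": "Paper A"},
        {"arxivId": "2401.00002", "title": "Paper B"},
        {"arxivId": "2401.00003", "title": "Paper C"},
    ],
    "edges": [
        {"source": "2401.00002", "target": "2401.00001", "isInfluential": True},
        {"source": "2401.00003", "target": "2401.00001", "isInfluential": False},
        {"source": "2401.00003", "target": "2401.00002", "isInfluential": False},
    ],
}
ranking = most_cited_within_scrape(graph)
print(ranking)
```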

***

### Pricing

Pay-per-result, fully transparent:

| Event | Price |
|---|---|
| Actor start | **$0.01** |
| Per paper scraped | **$0.003** |

#### Examples

| Papers scraped | Total cost |
|---|---|
| 100 | **$0.31** |
| 500 | **$1.51** |
| 1,000 | **$3.01** |
| 5,000 | **$15.01** |
| 10,000 | **$30.01** |

Free Apify trial credit ($5) covers ~1,650 papers for evaluation.
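
The table follows directly from the two event prices. The sketch below reproduces it and estimates how many papers fit in a budget. It covers event charges only; Apify platform usage is billed separately and is not modelled here. The function names are illustrative.

```python
from decimal import Decimal

START_FEE = Decimal("0.01")   # "Actor start" event
PER_PAPER = Decimal("0.003")  # "Per paper scraped" event


def event_cost(papers: int) -> Decimal:
    """Event charges for one run that scrapes `papers` papers."""
    return START_FEE + PER_PAPER * papers


def papers_within_budget(budget: str) -> int:
    """Largest paper count whose event charges fit in `budget` (one run)."""
    remaining = Decimal(budget) - START_FEE
    if remaining < 0:
        return 0
    return int(remaining // PER_PAPER)


for count in (100, 500, 1000, 5000, 10000):
    print(count, event_cost(count))
print(papers_within_budget("5"))
```

Using `Decimal` avoids floating-point rounding noise in the cents. For a $5 budget the event-only bound comes out slightly above the ~1,650 figure quoted above, which leaves some headroom for platform usage.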

***

### Scheduled Runs

Track your niche automatically:

1. Open the Actor → **Schedule** → New schedule
2. Set a cron expression (e.g. `0 7 * * *` for 7 AM daily)
3. Each run writes to a fresh dataset
4. Pipe into **Google Sheets, Airtable, Slack, webhooks** via Apify Integrations

***

### FAQ

**Is this scraper legal to use?**
Yes. It uses the official arXiv Atom API (the one arXiv publishes for programmatic access) and waits 3.5 seconds between paginated requests to stay within arXiv's published rate limit. No DOM scraping, no rate-limit circumvention.

**Do I need to configure a proxy?**
No. Proxies are included and configured automatically. The `proxyConfiguration` field is optional and only used if you want to override with your own.

**How accurate is the citation data?**
Citation counts come from [Semantic Scholar's Academic Graph API](https://www.semanticscholar.org/product/api), which is the same dataset used by major research tools. It covers most papers from 2000 onward; very recent papers (last 1-2 weeks) may not yet be indexed.

**Why does Semantic Scholar enrichment need an API key?**
Without a key, Semantic Scholar's free tier rate-limits aggressively (HTTP 429 within seconds). With a free API key (get one at [semanticscholar.org/product/api](https://www.semanticscholar.org/product/api)), enrichment runs at full speed. For small citation runs (under ~20 papers), you can skip the key.

**How many papers can I scrape per run?**
Up to 10,000 per query (arXiv API hard limit). Run multiple queries to scale further — they're auto-deduplicated by paper ID.
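
The Actor's internal deduplication logic is not documented here. As a rough sketch of the idea, merging several result batches while keeping the first occurrence of each `arxivId` could look like this (the function name and the keep-first policy are assumptions, and the records are placeholders):

```python
def merge_unique(batches: list[list[dict]]) -> list[dict]:
    """Concatenate batches, keeping only the first record per arxivId."""
    seen = set()
    merged = []
    for batch in batches:
        for paper in batch:
            key = paper["arxivId"]
            if key in seen:
                continue  # already returned by an earlier query
            seen.add(key)
            merged.append(paper)
    return merged


# Two hypothetical query results that overlap on one paper.
query_a = [{"arxivId": "2606.11107"}, {"arxivId": "2606.11125"}]
query_b = [{"arxivId": "2606.11125"}, {"arxivId": "astro-ph/0408219"}]
unique = merge_unique([query_a, query_b])
print([paper["arxivId"] for paper in unique])
```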

**Why is my timeline showing only the last month?**
With `sortOrder: descending`, a small `maxResults`, and a wide date range, you get only the **newest** N papers, which in active categories (cs.LG, cs.CL) span just a few weeks. For long-range timelines, use `sortOrder: ascending` and/or increase `maxResults` to 5,000–10,000.

**Does it handle old arXiv ID formats like `astro-ph/0408219`?**
Yes. Both modern (`2606.11107`) and pre-2007 legacy formats are parsed correctly. The slash in `paperId` is preserved verbatim.
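
If you need to tell the two schemes apart in your own code, for example when joining against another dataset, a regular-expression sketch like the one below may help. The patterns reflect the publicly documented arXiv identifier schemes (`YYMM.NNNN` or `YYMM.NNNNN`, and `archive/YYMMNNN` with an optional subject class). They are not taken from the Actor's source, and `parse_arxiv_id` is a hypothetical helper.

```python
import re

# Modern scheme: YYMM.NNNN (older) or YYMM.NNNNN, optional version suffix.
MODERN_ID = re.compile(r"^(?P<id>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$")
# Legacy scheme: archive[.SUBJECT]/YYMMNNN, optional version suffix.
LEGACY_ID = re.compile(
    r"^(?P<id>[a-z][a-z\-]*(?:\.[A-Z]{2})?/\d{7})(?:v(?P<version>\d+))?$"
)


def parse_arxiv_id(raw: str) -> dict:
    """Split an arXiv ID into scheme, bare ID, and optional version number."""
    for scheme, pattern in (("modern", MODERN_ID), ("legacy", LEGACY_ID)):
        match = pattern.match(raw)
        if match:
            version = match.group("version")
            return {
                "scheme": scheme,
                "id": match.group("id"),
                "version": int(version) if version else None,
            }
    raise ValueError(f"not a recognised arXiv ID: {raw!r}")


print(parse_arxiv_id("2606.11107v1"))
print(parse_arxiv_id("astro-ph/0408219"))
```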

**Can I filter by multiple categories?**
Yes. Categories are OR-combined: `["cs.AI", "cs.LG"]` returns papers in either category. Combine with a text query and a date range for narrow scopes.

**What sort orders does arXiv actually support?**
`submittedDate`, `relevance`, and `lastUpdatedDate`. Note that arXiv may prepend featured / new submissions above strict ordering.

**Where are author analytics / citation graph stored?**
Run → **Storage** tab → **Key-Value Store** → click `AUTHOR_ANALYTICS`, `CATEGORY_STATS`, `TIMELINE`, `RUN_SUMMARY`, or `CITATION_GRAPH`.

***

### Support

Found a bug or want a new feature? Open an issue in the **Issues** tab on this Actor's page. Response time is typically under 24 hours.

Maintained by **brilliant\_gum** on Apify.

# Actor input Schema

## `queries` (type: `array`):

List of search query strings. Multiple queries are batched and deduplicated. Multi-word queries (e.g. 'attention mechanism transformer') search for papers containing ALL words. Example: \["attention mechanism transformer", "large language models reasoning"]

## `searchField` (type: `string`):

Which field the query text searches against. 'all' searches title, abstract, author, and comments.

## `categories` (type: `array`):

Restrict results to these arXiv subject categories. Leave empty to search all categories. Multiple categories are OR-combined. Examples: cs.AI, cs.LG, cs.CL, cs.CV, cs.NE, stat.ML, math.ST, q-bio.NC, physics.data-an, econ.EM

## `dateFrom` (type: `string`):

Filter papers submitted on or after this date (based on arXiv v1 submission date). Format: YYYY-MM-DD

## `dateTo` (type: `string`):

Filter papers submitted on or before this date (based on arXiv v1 submission date). Format: YYYY-MM-DD
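
To make the filter semantics above more concrete, here is one way the inputs could map onto an arXiv API `search_query` string. This is a sketch under assumptions, not the Actor's actual implementation. The field prefixes (`all`, `ti`, `abs`, `au`, `cat`), the Boolean operators, and the `submittedDate:[... TO ...]` range form come from the public arXiv API, while the AND-joining of words, the placeholder bounds for open-ended ranges, and the function names are illustrative choices.

```python
from urllib.parse import urlencode

# Actor `searchField` value -> arXiv API field prefix.
FIELD_PREFIX = {"all": "all", "title": "ti", "abstract": "abs", "author": "au", "category": "cat"}
# Placeholder bounds for open-ended date ranges (assumption).
EARLIEST, LATEST = "1991-01-01", "2099-12-31"


def build_search_query(query, search_field="all", categories=(), date_from=None, date_to=None):
    """Combine text, category, and date filters into one query string."""
    prefix = FIELD_PREFIX[search_field]
    clauses = []
    words = query.split()
    if words:  # every word must match
        clauses.append(" AND ".join(f"{prefix}:{word}" for word in words))
    if categories:  # categories are OR-combined
        clauses.append("(" + " OR ".join(f"cat:{cat}" for cat in categories) + ")")
    if date_from or date_to:
        start = (date_from or EARLIEST).replace("-", "") + "0000"
        end = (date_to or LATEST).replace("-", "") + "2359"
        clauses.append(f"submittedDate:[{start} TO {end}]")
    return " AND ".join(clauses)


def build_url(search_query, start=0, max_results=200):
    """URL-encode the query for the arXiv API endpoint."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


q = build_search_query("attention transformer", categories=["cs.AI", "cs.LG"])
print(q)
print(build_url(q))
```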

## `maxResults` (type: `integer`):

Maximum number of results to fetch per query. The actor auto-paginates in batches of 200. Hard cap: 10,000 per query (arXiv API limit). Note: very large result sets (>1,000) will be slow due to arXiv's 3.5-second rate-limit delay between pages.
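
As a rough guide to how that delay adds up, the arithmetic can be sketched as follows. It assumes the 3.5-second pause applies only between pages (not before the first one) and ignores the time each request itself takes, so treat the result as a lower bound.

```python
import math

PAGE_SIZE = 200            # batch size stated above
PAGE_DELAY_SECONDS = 3.5   # pause between paginated requests


def min_pagination_delay(max_results: int) -> float:
    """Lower bound on time spent waiting between pages for one query."""
    pages = math.ceil(max_results / PAGE_SIZE)
    return max(pages - 1, 0) * PAGE_DELAY_SECONDS


for n in (100, 1000, 10000):
    print(n, min_pagination_delay(n))
```

For `maxResults: 10000` this works out to 50 pages and 49 pauses, so close to three minutes of waiting per query before any citation enrichment.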

## `sortBy` (type: `string`):

Field used to sort results returned by the arXiv API.

## `sortOrder` (type: `string`):

Sort order for results. Descending returns newest/most-relevant results first.

## `includeAbstract` (type: `boolean`):

Include the full paper abstract in the output. Disable to reduce dataset size when abstracts are not needed.

## `includeCitations` (type: `boolean`):

Fetch citation count, influential citation count, author ORCID IDs, and top related papers from the Semantic Scholar API. Adds ~1-2 seconds per paper due to rate limiting. A CITATION\_GRAPH is also saved to the Key-Value store.

## `semanticScholarApiKey` (type: `string`):

Optional API key for the Semantic Scholar Academic Graph API. Without a key the free tier allows ~1 request/second. With an API key the rate limit is significantly higher, making citation enrichment much faster for large result sets. Get a free key at: https://www.semanticscholar.org/product/api

## `outputFormat` (type: `string`):

Controls what is pushed to the dataset. 'papers': one record per paper (default). 'authors': aggregated author analytics sorted by paper count. 'categories\_stats': paper counts and top papers per arXiv category. 'timeline': publication counts by month and year.

## `proxyConfiguration` (type: `object`):

Optional. Proxies are included and configured automatically — leave this empty unless you want to override with your own.

## Actor input object example

```json
{
  "queries": [
    "large language models"
  ],
  "searchField": "all",
  "categories": [
    "cs.AI",
    "cs.LG",
    "cs.CL"
  ],
  "dateFrom": "2024-01-01",
  "dateTo": "2024-12-31",
  "maxResults": 100,
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "includeAbstract": true,
  "includeCitations": false,
  "outputFormat": "papers"
}
```
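
Before starting a run from your own code, it can be useful to check an input object against the constraints listed in this schema. The sketch below is a client-side convenience, not part of the Actor. The enum values, the `maxResults` range, and the date format come from the schema, while the `dateFrom` vs. `dateTo` ordering check and the function name are additions for illustration.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
ENUMS = {
    "searchField": {"all", "title", "abstract", "author", "category"},
    "sortBy": {"submittedDate", "relevance", "lastUpdatedDate"},
    "sortOrder": {"descending", "ascending"},
    "outputFormat": {"papers", "authors", "categories_stats", "timeline"},
}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of human-readable problems (empty when input looks fine)."""
    problems = []
    for field, allowed in ENUMS.items():
        value = actor_input.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    max_results = actor_input.get("maxResults", 100)
    if not isinstance(max_results, int) or not 1 <= max_results <= 10000:
        problems.append("maxResults: must be an integer between 1 and 10000")
    parsed = {}
    for field in ("dateFrom", "dateTo"):
        value = actor_input.get(field)
        if value is None:
            continue
        if not DATE_PATTERN.fullmatch(value):
            problems.append(f"{field}: expected YYYY-MM-DD")
            continue
        try:
            parsed[field] = date.fromisoformat(value)
        except ValueError:
            problems.append(f"{field}: not a real calendar date")
    # Extra sanity check, not required by the schema itself.
    if len(parsed) == 2 and parsed["dateFrom"] > parsed["dateTo"]:
        problems.append("dateFrom is after dateTo")
    return problems


example = {
    "queries": ["large language models"],
    "searchField": "all",
    "dateFrom": "2024-01-01",
    "dateTo": "2024-12-31",
    "maxResults": 100,
    "sortBy": "submittedDate",
    "outputFormat": "papers",
}
print(validate_input(example))
```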

# Actor output Schema

## `summary` (type: `string`):

Quick overview of what the run produced.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "queries": [
        "large language models"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("brilliant_gum/arxiv-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "queries": ["large language models"] }

# Run the Actor and wait for it to finish
run = client.actor("brilliant_gum/arxiv-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
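
The example above reads only the dataset. The analytics records described in the README live in the run's default Key-Value Store, and a sketch for collecting them might look like this. It assumes the client's `key_value_store(...).get_record(key)` call returns a dict with a `value` entry (or `None` for a missing key), and that `run` is the dict-style object used in the example above. `fetch_analytics` is a hypothetical helper name.

```python
ANALYTICS_KEYS = (
    "AUTHOR_ANALYTICS",
    "CATEGORY_STATS",
    "TIMELINE",
    "RUN_SUMMARY",
    "CITATION_GRAPH",  # only present when includeCitations was enabled
)


def fetch_analytics(client, run: dict, keys=ANALYTICS_KEYS) -> dict:
    """Collect the analytics records that exist in the run's Key-Value Store."""
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    results = {}
    for key in keys:
        record = store.get_record(key)  # None when the key does not exist
        if record is not None:
            results[key] = record["value"]
    return results


# Usage with the `client` and `run` objects from the example above:
#   analytics = fetch_analytics(client, run)
#   print(analytics.get("RUN_SUMMARY"))
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub object in unit tests.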

## CLI example

```bash
echo '{
  "queries": [
    "large language models"
  ]
}' |
apify call brilliant_gum/arxiv-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=brilliant_gum/arxiv-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "arXiv Paper Scraper — Citations, Authors, ORCID, Analytics",
        "description": "Scrape academic papers from arXiv via the official Atom API. Filter by category, date, query, or author. Includes citation data, ORCID IDs from Semantic Scholar, citation network graph, and built-in analytics (authors, categories, timeline). Four output formats. Proxies included.",
        "version": "1.1",
        "x-build-id": "rWN2p7WuGyyFpWgJ6"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/brilliant_gum~arxiv-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-brilliant_gum-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/brilliant_gum~arxiv-scraper/runs": {
            "post": {
                "operationId": "runs-sync-brilliant_gum-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/brilliant_gum~arxiv-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-brilliant_gum-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "queries": {
                        "title": "Search Queries",
                        "type": "array",
                        "description": "List of search query strings. Multiple queries are batched and deduplicated. Multi-word queries (e.g. 'attention mechanism transformer') search for papers containing ALL words. Example: [\"attention mechanism transformer\", \"large language models reasoning\"]",
                        "items": {
                            "type": "string"
                        },
                        "default": []
                    },
                    "searchField": {
                        "title": "Search Field",
                        "enum": [
                            "all",
                            "title",
                            "abstract",
                            "author",
                            "category"
                        ],
                        "type": "string",
                        "description": "Which field the query text searches against. 'all' searches title, abstract, author, and comments.",
                        "default": "all"
                    },
                    "categories": {
                        "title": "arXiv Categories",
                        "type": "array",
                        "description": "Restrict results to these arXiv subject categories. Leave empty to search all categories. Multiple categories are OR-combined. Examples: cs.AI, cs.LG, cs.CL, cs.CV, cs.NE, stat.ML, math.ST, q-bio.NC, physics.data-an, econ.EM",
                        "items": {
                            "type": "string"
                        },
                        "default": []
                    },
                    "dateFrom": {
                        "title": "Date From (Submitted)",
                        "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
                        "type": "string",
                        "description": "Filter papers submitted on or after this date (based on arXiv v1 submission date). Format: YYYY-MM-DD"
                    },
                    "dateTo": {
                        "title": "Date To (Submitted)",
                        "pattern": "^\\d{4}-\\d{2}-\\d{2}$",
                        "type": "string",
                        "description": "Filter papers submitted on or before this date (based on arXiv v1 submission date). Format: YYYY-MM-DD"
                    },
                    "maxResults": {
                        "title": "Max Results (per query)",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of results to fetch per query. The actor auto-paginates in batches of 200. Hard cap: 10,000 per query (arXiv API limit). Note: very large result sets (>1,000) will be slow due to arXiv's 3.5-second rate-limit delay between pages.",
                        "default": 100
                    },
                    "sortBy": {
                        "title": "Sort By",
                        "enum": [
                            "submittedDate",
                            "relevance",
                            "lastUpdatedDate"
                        ],
                        "type": "string",
                        "description": "Field used to sort results returned by the arXiv API.",
                        "default": "submittedDate"
                    },
                    "sortOrder": {
                        "title": "Sort Order",
                        "enum": [
                            "descending",
                            "ascending"
                        ],
                        "type": "string",
                        "description": "Sort order for results. Descending returns newest/most-relevant results first.",
                        "default": "descending"
                    },
                    "includeAbstract": {
                        "title": "Include Abstract",
                        "type": "boolean",
                        "description": "Include the full paper abstract in the output. Disable to reduce dataset size when abstracts are not needed.",
                        "default": true
                    },
                    "includeCitations": {
                        "title": "Include Citation Data (Semantic Scholar)",
                        "type": "boolean",
                        "description": "Fetch citation count, influential citation count, author ORCID IDs, and top related papers from the Semantic Scholar API. Adds ~1-2 seconds per paper due to rate limiting. A CITATION_GRAPH is also saved to the Key-Value store.",
                        "default": false
                    },
                    "semanticScholarApiKey": {
                        "title": "Semantic Scholar API Key (optional)",
                        "type": "string",
                        "description": "Optional API key for the Semantic Scholar Academic Graph API. Without a key the free tier allows ~1 request/second. With an API key the rate limit is significantly higher, making citation enrichment much faster for large result sets. Get a free key at: https://www.semanticscholar.org/product/api"
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "papers",
                            "authors",
                            "categories_stats",
                            "timeline"
                        ],
                        "type": "string",
                        "description": "Controls what is pushed to the dataset. 'papers': one record per paper (default). 'authors': aggregated author analytics sorted by paper count. 'categories_stats': paper counts and top papers per arXiv category. 'timeline': publication counts by month and year.",
                        "default": "papers"
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Optional. Proxies are included and configured automatically — leave this empty unless you want to override with your own."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
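
The input properties in the schema above (`sortOrder`, `includeAbstract`, `includeCitations`, `semanticScholarApiKey`, `outputFormat`) can be assembled and checked on the client side before a run is started. The following is a minimal sketch, not part of the Actor itself: `build_run_input` is a hypothetical helper name, and only the field names, enum values, and defaults are taken from the schema.

```python
from typing import Any, Optional

# Enum values copied from the input schema above.
SORT_ORDERS = ("descending", "ascending")
OUTPUT_FORMATS = ("papers", "authors", "categories_stats", "timeline")


def build_run_input(
    sort_order: str = "descending",
    include_abstract: bool = True,
    include_citations: bool = False,
    semantic_scholar_api_key: Optional[str] = None,
    output_format: str = "papers",
) -> dict[str, Any]:
    """Hypothetical helper: build a run input dict, rejecting values outside the schema enums."""
    if sort_order not in SORT_ORDERS:
        raise ValueError(f"sortOrder must be one of {SORT_ORDERS}, got {sort_order!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"outputFormat must be one of {OUTPUT_FORMATS}, got {output_format!r}")

    run_input: dict[str, Any] = {
        "sortOrder": sort_order,
        "includeAbstract": include_abstract,
        "includeCitations": include_citations,
        "outputFormat": output_format,
    }
    # The API key is optional in the schema, so it is only sent when provided.
    if semantic_scholar_api_key:
        run_input["semanticScholarApiKey"] = semantic_scholar_api_key
    return run_input


# Example: author analytics with citation enrichment enabled.
example_input = build_run_input(include_citations=True, output_format="authors")
```

The resulting dictionary is meant to be passed as the run input through whichever client you use. Validating the enums locally is a design choice that surfaces typos such as `"category_stats"` before a paid run starts.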

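The `runsResponseSchema` above describes the run object returned under `data`. As a rough illustration of reading it, the sketch below pulls out the status, the default storage IDs, and the per-resource costs in `usageUsd`. `summarize_run` is a hypothetical helper, and `sample_response` is a hand-written payload that reuses the schema's example values with placeholder IDs; it is not output from a real run.

```python
from typing import Any


def summarize_run(response: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: extract the fields most callers need from a run response."""
    data = response["data"]
    usage_usd = data.get("usageUsd", {})
    return {
        "status": data.get("status"),
        "datasetId": data.get("defaultDatasetId"),
        "keyValueStoreId": data.get("defaultKeyValueStoreId"),
        # Sum of the per-resource costs listed under usageUsd.
        "usageUsdSum": sum(usage_usd.values()),
        "usageTotalUsd": data.get("usageTotalUsd"),
    }


# Hand-written sample built from the schema's example values (IDs are placeholders).
sample_response = {
    "data": {
        "id": "RUN_ID_PLACEHOLDER",
        "status": "READY",
        "defaultDatasetId": "DATASET_ID_PLACEHOLDER",
        "defaultKeyValueStoreId": "STORE_ID_PLACEHOLDER",
        "usageTotalUsd": 0.00005,
        "usageUsd": {"DATASET_WRITES": 0, "KEY_VALUE_STORE_WRITES": 0.00005},
    }
}

summary = summarize_run(sample_response)
```

Under this reading of the schema, `datasetId` is where the records selected by `outputFormat` would be found, and `keyValueStoreId` is where the `CITATION_GRAPH` entry would be stored when `includeCitations` is enabled.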