# arXiv Papers Scraper (`devilscrapes/arxiv-papers-scraper`) Actor

Search arXiv by query, category, or author and get structured paper metadata: title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps. We handle pagination, retries, and rate-limit pacing so you get clean typed rows ready for a research pipeline.

- **URL**: https://apify.com/devilscrapes/arxiv-papers-scraper.md
- **Developed by:** [DevilScrapes](https://apify.com/devilscrapes) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours, and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).

# README

## arXiv Papers Scraper

**💰 $1.50 / 1 000 results** · pay only for results · no credit card to try

_We do the dirty work so your dataset stays clean._ 😈

Search arXiv by query, category, or author and get structured paper metadata: title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps. We handle the pagination, retries, and rate-limit pacing so you get clean typed rows ready for a research pipeline.

---

### 🎯 What this scrapes

arXiv's Atom feed at export.arxiv.org/api/query is the canonical source for paper metadata, and a notoriously picky one. This Actor wraps it with a sensible input schema, paces requests so we stay polite to the upstream, paginates through results, and writes one structured row per paper. We absorb the transient errors and rate-limit pushback; you get a dataset that drops into research dashboards, citation tracking, or ML training pipelines.

### 🔥 What we handle for you

- 🛡️ **Browser fingerprint rotation**: `curl-cffi` impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
- 🌐 **Residential proxy rotation** via Apify Proxy: fresh session and exit IP on every block.
- 🔁 **Retries with exponential backoff** on `408 / 429 / 5xx`: up to 5 attempts per page, `Retry-After` honoured.
- 🧱 **Rate-limit-aware pacing**: when the target pushes back, we slow down instead of getting banned.
- 🧊 **Clean, typed dataset rows**: Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
- 💰 **Pay-Per-Event pricing**: you only pay for results that hit your dataset. No data, no charge.

### 💡 Use cases

- **Citation tracking**: schedule weekly runs for `au:` and diff to detect new citations of your work.
- **Trend monitoring**: daily pull from `cat:cs.AI` to feed a research digest.
- **Dataset curation**: extract every paper matching a topic + date range to seed a literature review.
- **Notification pipeline**: pipe into Slack when a new paper matches your saved query.

### ⚙️ How to use it

1. Click **Try for free** at the top of the page.
2. Fill in the input form. Most fields have sensible defaults.
3. Click **Start**. Output streams into the run's dataset.
4. Export from **Storage → Dataset** as JSON, CSV, or Excel, or fetch via the API.
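
The "Retries with exponential backoff" item in the feature list can be pictured with a small Python sketch. This is not the Actor's source code: the function names, the `base` and `cap` values, and the `(status, retry_after, body)` tuple shape are assumptions made for illustration. Only the retryable statuses, the 5-attempt limit, and the `Retry-After` handling come from the list above.

```python
import time
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # "up to 5 attempts per page", per the feature list

# Hypothetical response shape: (HTTP status, Retry-After seconds or None, body)
Response = Tuple[int, Optional[float], str]


def is_retryable(status: int) -> bool:
    """True for the statuses the feature list names: 408, 429 and any 5xx."""
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based).

    A server-supplied Retry-After value is used as-is. Otherwise the delay
    doubles per attempt (base, 2*base, 4*base, ...) up to `cap`.
    `base` and `cap` are illustrative values, not documented settings.
    """
    if retry_after is not None:
        return float(retry_after)
    return min(base * 2 ** (attempt - 1), cap)


def fetch_with_retries(fetch: Callable[[], Response],
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Call `fetch` until it returns 200 or the attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, retry_after, body = fetch()
        if status == 200:
            return body
        # Give up on non-retryable errors or once the last attempt has failed.
        if not is_retryable(status) or attempt == MAX_ATTEMPTS:
            raise RuntimeError(f"request failed with HTTP {status}")
        sleep(backoff_delay(attempt, retry_after))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; a caller would normally leave it at the default `time.sleep`.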

### 📥 Input

| Field | Type | Required | Default | Notes |
|---|---|:--:|---|---|
| `searchQuery` | `string` | **yes** | 'cat:cs.AI' | arXiv search query string. Use field prefixes like ti: (title), au: (author), cat: … |

Built by **[Devil Scrapes](https://apify.com/DevilScrapes)** 😈, a small fleet of opinionated public-data Actors. Honest pricing, real engineering, zero fine print.

# Actor input Schema

## `searchQuery` (type: `string`):

arXiv search query string. Use field prefixes like ti: (title), au: (author), cat: (category). Examples: cat:cs.AI, ti:transformer AND au:vaswani.

## `sortBy` (type: `string`):

Field used to order results.

## `sortOrder` (type: `string`):

Ascending or descending.

## `maxResults` (type: `integer`):

Total papers to fetch across pages. arXiv recommends ≤30000 per query. Default 50.

## `pageSize` (type: `integer`):

Papers per API call. arXiv caps page size at 2000; default 50.

## `proxyConfiguration` (type: `object`):

Apify Proxy is optional; arXiv is fine with direct access. Throttle yourself to stay polite.

## Actor input object example

```json
{
  "searchQuery": "cat:cs.CL AND ti:llm",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxResults": 50,
  "pageSize": 50,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```

# Actor output Schema

## `datasetItems` (type: `string`):

All dataset items as JSON.

## `datasetItemsCsv` (type: `string`):

Same data exported to CSV.

## `datasetView` (type: `string`):

Open the run dataset in the Console.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.
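
For projects that do not use a client library, the `run-sync-get-dataset-items` path listed in this document's OpenAPI specification can be called directly. The following is a minimal sketch using only the Python standard library. The helper names `build_run_request` and `run_sync` are made up for this example, and it assumes the endpoint returns the dataset items as a JSON array in the response body.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"          # "servers" entry in the OpenAPI spec
ACTOR_ID = "devilscrapes~arxiv-papers-scraper"  # path segment used in the spec


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a synchronous run."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def run_sync(token: str, actor_input: dict) -> list:
    """Send the request and decode the response (assumed to be a JSON array).

    Needs a valid Apify token and network access.
    """
    request = build_run_request(token, actor_input)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `run_sync(my_token, {"searchQuery": "cat:cs.AI", "maxResults": 10})` would then return the paper rows, where `my_token` stands for your own Apify API token. Splitting request construction from sending makes the URL and payload easy to inspect before anything goes over the network.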

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "searchQuery": "cat:cs.AI",
    "proxyConfiguration": {
        "useApifyProxy": false
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("devilscrapes/arxiv-papers-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs
```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "searchQuery": "cat:cs.AI",
    "proxyConfiguration": { "useApifyProxy": False },
}

# Run the Actor and wait for it to finish
run = client.actor("devilscrapes/arxiv-papers-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start
```

## CLI example

```bash
echo '{
  "searchQuery": "cat:cs.AI",
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}' |
apify call devilscrapes/arxiv-papers-scraper --silent --output-dataset
```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=devilscrapes/arxiv-papers-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}
```

## OpenAPI specification

```json
{ "openapi": "3.0.1", "info": { "title": "arXiv Papers Scraper", "description": "Search arXiv by query, category, or author and get structured paper metadata — title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps. We handle pagination, retries, and rate-limit pacing so you get clean typed rows ready for a research pipeline.", "version": "0.4", "x-build-id": "IwwaQ17Ua7MULj7DV" }, "servers": [ { "url": "https://api.apify.com/v2" } ], "paths": { "/acts/devilscrapes~arxiv-papers-scraper/run-sync-get-dataset-items": { "post": { "operationId": "run-sync-get-dataset-items-devilscrapes-arxiv-papers-scraper", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } }, "/acts/devilscrapes~arxiv-papers-scraper/runs": { "post": { "operationId": "runs-sync-devilscrapes-arxiv-papers-scraper", "x-openai-isConsequential": false, "summary": "Executes an Actor and returns information about the initiated run in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": {
"schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/runsResponseSchema" } } } } } } }, "/acts/devilscrapes~arxiv-papers-scraper/run-sync": { "post": { "operationId": "run-sync-devilscrapes-arxiv-papers-scraper", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } } }, "components": { "schemas": { "inputSchema": { "type": "object", "required": [ "searchQuery" ], "properties": { "searchQuery": { "title": "arXiv search query", "type": "string", "description": "arXiv search query string. Use field prefixes like ti: (title), au: (author), cat: (category). Examples: cat:cs.AI, ti:transformer AND au:vaswani." }, "sortBy": { "title": "Sort by", "enum": [ "relevance", "lastUpdatedDate", "submittedDate" ], "type": "string", "description": "Field used to order results.", "default": "submittedDate" }, "sortOrder": { "title": "Sort order", "enum": [ "ascending", "descending" ], "type": "string", "description": "Ascending or descending.", "default": "descending" }, "maxResults": { "title": "Max papers", "minimum": 1, "maximum": 5000, "type": "integer", "description": "Total papers to fetch across pages. arXiv recommends ≤30000 per query. 
Default 50.", "default": 50 }, "pageSize": { "title": "Page size", "minimum": 1, "maximum": 2000, "type": "integer", "description": "Papers per API call. arXiv caps page size at 2000; default 50.", "default": 50 }, "proxyConfiguration": { "title": "Proxy configuration", "type": "object", "description": "Apify Proxy is optional — arXiv is fine with direct access. Throttle yourself to stay polite.", "default": { "useApifyProxy": false } } } }, "runsResponseSchema": { "type": "object", "properties": { "data": { "type": "object", "properties": { "id": { "type": "string" }, "actId": { "type": "string" }, "userId": { "type": "string" }, "startedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "finishedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "status": { "type": "string", "example": "READY" }, "meta": { "type": "object", "properties": { "origin": { "type": "string", "example": "API" }, "userAgent": { "type": "string" } } }, "stats": { "type": "object", "properties": { "inputBodyLen": { "type": "integer", "example": 2000 }, "rebootCount": { "type": "integer", "example": 0 }, "restartCount": { "type": "integer", "example": 0 }, "resurrectCount": { "type": "integer", "example": 0 }, "computeUnits": { "type": "integer", "example": 0 } } }, "options": { "type": "object", "properties": { "build": { "type": "string", "example": "latest" }, "timeoutSecs": { "type": "integer", "example": 300 }, "memoryMbytes": { "type": "integer", "example": 1024 }, "diskMbytes": { "type": "integer", "example": 2048 } } }, "buildId": { "type": "string" }, "defaultKeyValueStoreId": { "type": "string" }, "defaultDatasetId": { "type": "string" }, "defaultRequestQueueId": { "type": "string" }, "buildNumber": { "type": "string", "example": "1.0.0" }, "containerUrl": { "type": "string" }, "usage": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { 
"type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "integer", "example": 1 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", "example": 0 } } }, "usageTotalUsd": { "type": "number", "example": 0.00005 }, "usageUsd": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { "type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "number", "example": 0.00005 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", "example": 0 } } } } } } } } } } ```