arXiv Paper Scraper — AI Research, Abstracts & PDF Links
Pricing
from $1.84 / 1,000 arxiv papers
Search arXiv papers by keyword, ID list, or category. Returns title, authors, abstract, categories, PDF URL, DOI, publish dates, and parse_confidence. Official Atom XML API — no proxy, no auth. Pay per result.
Developer: Vitalii Bondarev
Maintained by Community
arXiv Paper Scraper — AI Research, Abstracts & PDF Links | from $1.50/1K
Used by AI/ML researchers building training datasets, literature review automation tools, and academic institutions monitoring emerging research.
Pricing: $1.50 per 1,000 papers (metadata + PDF links). Include full abstracts for an additional $0.50/1k (includeAbstract=true).
Search arXiv research papers by keyword, arXiv ID, or category. Returns structured metadata including title, authors, abstract, categories, PDF URL, DOI, publish dates, and parse_confidence. Uses the official arXiv Atom XML API — no proxy, no auth required. Pay per result.
What is this actor for?
arXiv (arxiv.org) is the world's largest open-access repository for scientific preprints, with over 2.3 million papers in physics, mathematics, computer science, quantitative biology, and more. Researchers, analysts, and ML practitioners use arXiv daily for literature discovery.
This actor provides a clean, structured interface to the official arXiv API — no scraping, no fragility, no proxy costs for buyers.
What data does it return?
Each result row has 17 fields:
| Field | Description |
|---|---|
| `arxiv_id` | arXiv paper ID (e.g. 2501.05032v2) |
| `title` | Paper title |
| `summary` | Full abstract text (toggle off with includeAbstract=false) |
| `authors` | List of author name strings |
| `primary_category` | Primary arXiv category (e.g. cs.CL, cs.LG) |
| `categories` | All categories the paper is cross-listed in |
| `published_at` | Original submission date (ISO 8601 UTC) |
| `updated_at` | Last updated date (ISO 8601 UTC) |
| `pdf_url` | Direct PDF link |
| `abstract_url` | arXiv abstract page URL |
| `doi` | DOI if available (often present for journal-published papers) |
| `comment` | Author comment (conference/workshop note, code link, etc.) |
| `journal_ref` | Journal reference if published |
| `query` | Search query that returned this paper (null for ID lookups) |
| `scraped_at` | Timestamp of the scrape run (ISO 8601 UTC) |
| `parse_confidence` | Quality score 0.0–1.0; 1.0 = all fields parsed correctly |
| `warnings` | List of warning codes when confidence < 1.0 |
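Because every row carries `parse_confidence` and `warnings`, a downstream script can separate fully parsed rows from rows that need review. The following is a minimal Python sketch under stated assumptions: the sample rows and the warning code `example_warning` are hypothetical placeholders (this README does not list the actual codes), and `split_by_confidence` is an illustrative helper, not part of the actor.

```python
from typing import Any


def split_by_confidence(
    records: list[dict[str, Any]], threshold: float = 1.0
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split records into (trusted, flagged) using parse_confidence."""
    trusted: list[dict[str, Any]] = []
    flagged: list[dict[str, Any]] = []
    for record in records:
        # A missing or null score is treated as 0.0 so the row gets reviewed.
        score = float(record.get("parse_confidence") or 0.0)
        (trusted if score >= threshold else flagged).append(record)
    return trusted, flagged


# Hypothetical rows using the field names from the table above.
sample = [
    {"arxiv_id": "2501.05032v2", "parse_confidence": 1.0, "warnings": []},
    {"arxiv_id": "0000.00000v1", "parse_confidence": 0.8,
     "warnings": ["example_warning"]},  # placeholder code, not a real one
]

trusted, flagged = split_by_confidence(sample)
print("trusted:", [r["arxiv_id"] for r in trusted])
print("flagged:", [(r["arxiv_id"], r["warnings"]) for r in flagged])
```

Lowering `threshold` (for example to 0.8) trades strictness for coverage; which value is appropriate depends on which fields your pipeline actually needs.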
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQueries` | string[] | none | Keyword queries. Supports arXiv field prefixes: ti: (title), au: (author), abs: (abstract), cat: (category), all: (all fields). |
| `idList` | string[] | none | Specific arXiv IDs to fetch (e.g. 2501.05032). |
| `category` | string | none | Filter to category (e.g. cs.LG, cs.CL). Applied as cat: prefix when no query given. |
| `sortBy` | enum | relevance | relevance / lastUpdatedDate / submittedDate |
| `maxItems` | integer | 100 | Total results cap. 0 = unlimited. |
| `includeAbstract` | boolean | true | Include full abstract in output. |
| `pageSize` | integer | 50 | Results per API page (1–200). |
| `delaySeconds` | integer | 3 | Delay between API calls. arXiv requires ≥3 seconds. |
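When inputs are generated programmatically, it can help to check them against the ranges in the table before starting a run. The sketch below is one way to do that in Python; `build_run_input` is a hypothetical helper, and whether the actor itself rejects out-of-range values is not stated in this README, so treat the checks as a client-side guard only.

```python
from typing import Any, Optional

SORT_OPTIONS = {"relevance", "lastUpdatedDate", "submittedDate"}


def build_run_input(
    search_queries: Optional[list[str]] = None,
    id_list: Optional[list[str]] = None,
    category: Optional[str] = None,
    sort_by: str = "relevance",
    max_items: int = 100,
    include_abstract: bool = True,
    page_size: int = 50,
    delay_seconds: int = 3,
) -> dict[str, Any]:
    """Build an actor input dict, enforcing the documented ranges."""
    if sort_by not in SORT_OPTIONS:
        raise ValueError(f"sortBy must be one of {sorted(SORT_OPTIONS)}")
    if not 1 <= page_size <= 200:
        raise ValueError("pageSize must be between 1 and 200")
    if delay_seconds < 3:
        raise ValueError("delaySeconds must be at least 3")
    if max_items < 0:
        raise ValueError("maxItems must be 0 (unlimited) or a positive number")

    run_input: dict[str, Any] = {
        "sortBy": sort_by,
        "maxItems": max_items,
        "includeAbstract": include_abstract,
        "pageSize": page_size,
        "delaySeconds": delay_seconds,
    }
    # Optional selectors are only included when the caller sets them.
    if search_queries:
        run_input["searchQueries"] = list(search_queries)
    if id_list:
        run_input["idList"] = list(id_list)
    if category:
        run_input["category"] = category
    return run_input


print(build_run_input(category="cs.LG", sort_by="submittedDate"))
```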
Usage examples
Search for recent LLM papers:
{"searchQueries": ["ti:\"large language models\""], "sortBy": "submittedDate", "maxItems": 50}
Look up specific papers by ID:
{"idList": ["2501.05032", "2402.14679", "1706.03762"]}
Browse a category:
{"category": "cs.LG", "sortBy": "submittedDate", "maxItems": 100}
Combined query + category:
{"searchQueries": ["abs:\"retrieval augmented generation\""], "category": "cs.CL", "sortBy": "submittedDate", "maxItems": 200}
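Any of these inputs can also be submitted from code. The sketch below assumes the `apify-client` Python package and the actor ID `bovi/arxiv-scraper` (taken from the MCP tool URL in this README); it also assumes that `call()` returns a dict containing `defaultDatasetId`, as in the Apify Python client examples, which may differ between client versions. `fetch_papers` and `main` are illustrative names.

```python
from typing import Any

ACTOR_ID = "bovi/arxiv-scraper"  # assumption: ID taken from the MCP URL in this README


def fetch_papers(client: Any, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Run the actor with run_input and return the dataset items as a list."""
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


def main() -> None:
    # Requires `pip install apify-client` and an APIFY_TOKEN environment variable.
    import os

    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    papers = fetch_papers(client, {"idList": ["2501.05032", "1706.03762"]})
    for paper in papers:
        print(paper["arxiv_id"], paper["title"])
```

Call `main()` once the token is set. Passing the client in as a parameter keeps `fetch_papers` easy to exercise with a stub object in tests, without network access.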
arXiv search query syntax
arXiv supports prefix operators for targeted searches:
- `ti:transformer`: papers with "transformer" in the title
- `au:"Yann LeCun"`: papers by Yann LeCun
- `abs:diffusion`: papers with "diffusion" in the abstract
- `cat:cs.CV`: papers in Computer Vision
- `all:"attention mechanism"`: any field contains the phrase
- Combine with `AND`, `OR`, `ANDNOT`
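Query strings like these are easy to get wrong by hand (missing quotes around phrases, a misspelled prefix). A small Python sketch of a builder follows; `field_term` and `combine` are hypothetical helper names, and the sketch does not cover parenthesized grouping.

```python
FIELD_PREFIXES = {"ti", "au", "abs", "cat", "all"}
BOOLEAN_OPERATORS = {"AND", "OR", "ANDNOT"}


def field_term(prefix: str, value: str) -> str:
    """Return a prefixed term, quoting the value when it contains spaces."""
    if prefix not in FIELD_PREFIXES:
        raise ValueError(f"unknown field prefix: {prefix!r}")
    if " " in value:
        return f'{prefix}:"{value}"'
    return f"{prefix}:{value}"


def combine(operator: str, *terms: str) -> str:
    """Join terms with one boolean operator, left to right."""
    if operator not in BOOLEAN_OPERATORS:
        raise ValueError(f"unknown operator: {operator!r}")
    return f" {operator} ".join(terms)


query = combine(
    "AND",
    field_term("abs", "retrieval augmented generation"),
    field_term("cat", "cs.CL"),
)
print(query)  # abs:"retrieval augmented generation" AND cat:cs.CL
```

The resulting string can be passed as one entry of `searchQueries`.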
Pricing examples
| Run | Items | Cost |
|---|---|---|
| 100 papers (metadata, no abstract) | 100 | ~$0.15 |
| 100 papers with full abstracts | 100 | ~$0.20 |
| 1,000 papers (metadata) | 1,000 | ~$1.50 |
| Weekly category monitor (cs.LG, 200 papers) | 800/mo | ~$1.20/mo |
You only pay for papers successfully pushed to the dataset.
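The figures in the table follow from the per-1,000 rates quoted in this README ($1.50 base, plus $0.50 when abstracts are included). A short Python sketch of that arithmetic is below; the constants are copied from the text above and should be treated as assumptions, since listed prices can change.

```python
BASE_USD_PER_1K = 1.50      # metadata + PDF links, per this README
ABSTRACT_USD_PER_1K = 0.50  # surcharge when includeAbstract=true


def estimate_cost(papers: int, include_abstract: bool = False) -> float:
    """Estimate the cost in USD for a given number of delivered papers."""
    rate = BASE_USD_PER_1K + (ABSTRACT_USD_PER_1K if include_abstract else 0.0)
    return round(papers * rate / 1000, 2)


print(estimate_cost(100))                         # 0.15
print(estimate_cost(100, include_abstract=True))  # 0.2
print(estimate_cost(4 * 200))                     # 1.2 (weekly 200-paper monitor)
```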
FAQ
Do I need a proxy or API key?
No. The arXiv API is fully public and requires no authentication. Zero proxy cost for buyers.
What formats can I export results in?
JSON, CSV, JSONL, and Excel, all via Apify dataset export. The `pdf_url` field lets you download the PDFs programmatically.
Can I monitor a category for new papers on a schedule?
Yes. Use Apify Schedules + sortBy: submittedDate to fetch the latest papers daily or weekly. Combine with a webhook to push new papers into Slack, Notion, or a literature database.
What if the actor returns empty results?
Check your searchQueries syntax — arXiv requires the correct field prefix format (e.g. ti:"attention" not title:attention). For ID lookups, verify the ID exists on arxiv.org. The OUTPUT key-value store reports any failed queries with reason codes.
Edge cases & known limits
- Rate limit: arXiv requests ≥3 seconds between API calls. The actor respects this automatically. Large runs may take time, so plan accordingly.
- max_results cap: arXiv API caps single-page results at 2000. The actor paginates automatically for large queries.
- No auth required: The API is publicly accessible with no token or proxy.
- Versioned IDs: arXiv returns versioned IDs (e.g. `2501.05032v2`). The `arxiv_id` field contains the version-qualified ID from the response.
- LaTeX in abstracts: arXiv abstracts may contain LaTeX markup (e.g. `$\textbf{...}$`). The `summary` field returns raw text with no LaTeX rendering.
- Not affiliated with arXiv: This actor uses the official public API. arXiv and Cornell University are not affiliated with this actor.
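Since `arxiv_id` is version-qualified, the same paper can appear under several IDs across scheduled runs. One possible way to handle this downstream is sketched below in Python; `split_arxiv_id` and `latest_versions` are hypothetical helpers, and the regular expression simply assumes the version is a trailing `v` followed by digits.

```python
import re
from typing import Any, Optional

# Base ID, optionally followed by a trailing version suffix such as "v2".
_VERSIONED_ID = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_arxiv_id(arxiv_id: str) -> tuple[str, Optional[int]]:
    """Split '2501.05032v2' into ('2501.05032', 2); version is None if absent."""
    match = _VERSIONED_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"not an arXiv ID: {arxiv_id!r}")
    version = match.group("version")
    return match.group("base"), int(version) if version else None


def latest_versions(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the highest-versioned record for each base ID."""
    latest: dict[str, tuple[int, dict[str, Any]]] = {}
    for record in records:
        base, version = split_arxiv_id(record["arxiv_id"])
        rank = version or 0  # unversioned IDs rank below any explicit version
        if base not in latest or rank > latest[base][0]:
            latest[base] = (rank, record)
    return [record for _, record in latest.values()]


print(split_arxiv_id("2501.05032v2"))  # ('2501.05032', 2)
```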
Why choose this actor?
- parse_confidence on every record: immediately see if parsing succeeded. No silent failures.
- Zero proxy cost: official API, no residential proxies needed — buyers pay nothing for egress.
- All 3 modes: search + ID lookup + category browse in a single actor.
- Full metadata: doi, comment, journal_ref, versioned PDF URLs — fields competitors skip.
- Batch-friendly: handles multiple queries in one run with automatic pagination.
Competitor comparison
| | This actor | scrapestorm/arxiv | Any REST client |
|---|---|---|---|
| Official Atom XML API | Yes | Unknown | Yes (DIY) |
| `parse_confidence` | Yes | No | No |
| Abstract toggle (save cost) | Yes | Always-on | No |
| Category + keyword + ID in one actor | Yes | Partial | Separate calls |
| Proxy needed | No | Unknown | No |
| Price | $1.50/1k | $9.99/1k | Free (DIY infra) |
At $1.50 versus $9.99 per 1,000 papers, this actor costs roughly 6.7× less than scrapestorm for academic paper data.
AI training datasets
Pull all papers in cs.LG since 2020: abstract + title + category make a ready-made AI research corpus. Use includeAbstract=true to include abstracts (the output does not contain the full paper text), or includeAbstract=false for fast metadata-only runs.
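As a rough illustration of the corpus step, the Python sketch below turns result rows into JSONL lines of title plus abstract. The output keys (`id`, `category`, `text`) and the sample row are hypothetical choices, not a format the actor produces; whitespace is collapsed on the assumption that titles and abstracts may contain line breaks.

```python
import json
from typing import Any


def _squash(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def to_corpus_line(record: dict[str, Any]) -> str:
    """Serialize one result row as a single JSONL line."""
    title = _squash(record["title"])
    abstract = _squash(record.get("summary") or "")  # empty if abstracts are off
    text = f"{title}\n\n{abstract}" if abstract else title
    return json.dumps(
        {
            "id": record["arxiv_id"],
            "category": record["primary_category"],
            "text": text,
        },
        ensure_ascii=False,
    )


# Hypothetical placeholder row using the documented field names.
row = {
    "arxiv_id": "0000.00000v1",
    "title": "Example  Title\n  on Two Lines",
    "summary": "Example abstract\ntext.",
    "primary_category": "cs.LG",
}
print(to_corpus_line(row))
```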
Use with AI agents (MCP)
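Ask your AI agent "find me the 10 most relevant papers on RAG from 2024". This actor returns structured paper metadata with abstract and PDF link, ready for downstream summarization. Citation counts are not part of the output, so ranking by citations needs a separate data source.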
Point your MCP client at this tool:
{"mcpServers": {"apify": {"command": "npx","args": ["mcp-remote","https://mcp.apify.com/?tools=bovi/arxiv-scraper","--header","Authorization: Bearer <YOUR_APIFY_TOKEN>"]}}}
Minimal call:
{ "searchQueries": ["large language models"], "maxItems": 20 }
Enrich with crossref-scraper
Need journal metadata, citation counts, or DOIs for published versions? Use our crossref-scraper alongside this actor to enrich arXiv papers with citation data.
Integrations
Built for ML researchers and R&D teams building literature datasets and monitoring new papers by keyword or category — the JSON/dataset output drops into the tools you already run, no glue code:
- n8n / Make / Zapier: trigger a run or pipe every new dataset item into 500+ apps (Google Sheets, Airtable, Slack, HubSpot, your database) with no code.
- Webhooks: fire your own endpoint the moment a run finishes, to push results straight into your pipeline.
- MCP server: expose this actor as a tool to Claude, Cursor, or any MCP client so an AI agent can pull this data mid-conversation.
- API & SDKs: fetch the dataset as JSON, CSV, or Excel through the Apify REST API or the Python / JS SDKs.
See all Apify integrations.