arXiv Paper Scraper — Citations, Authors, ORCID, Analytics
Pricing
from $3.00 / 1,000 papers scraped
Scrape academic papers from arXiv via the official Atom API. Filter by category, date, query, or author. Includes citation data, ORCID IDs from Semantic Scholar, citation network graph, and built-in analytics (authors, categories, timeline). Four output formats. Proxies included.
Developer
Yuliia Kulakova
Maintained by Community
arXiv Paper Scraper — Citations · Authors · ORCID · Analytics
Extract academic papers from arXiv.org at scale. Built on the official arXiv Atom API — no DOM scraping, no anti-bot games, no breakage. Includes citation data from Semantic Scholar, ORCID author IDs, citation network graphs, and built-in analytics.

Filter by category, date range, author, or full-text query. Four output formats. Proxies included. Free Apify trial.
What You Get
- Official Atom API, not a browser — fast (100 papers in ~6 seconds), stable, won't break when arXiv updates their site
- Full text search with field targeting (title, abstract, author, category, or all)
- 150+ subject categories supported (cs.AI, cs.LG, stat.ML, math.ST, q-bio, physics.*, econ.*, …)
- Date range filtering based on arXiv's submittedDate (v1 submission)
- Citation data from Semantic Scholar (optional): citation count, influential citation count, related papers, ORCID author IDs
- Citation network graph export when citations are enabled — ready for Gephi, Cytoscape, or NetworkX
- Four output formats in one actor:
  - Papers: one record per paper (default)
  - Author analytics: sorted by paper count and citations, with per-author paper lists
  - Category statistics: paper counts and top 5 papers per arXiv category
  - Timeline: publication counts by month and year
- Legacy paperId support: handles both modern (2606.11125) and pre-2007 (astro-ph/0408219) arXiv ID formats
- Multi-query batching with deduplication: pass several queries, dupes by arxivId are removed automatically
- Built-in retry resilience — exponential back-off on network blips and arXiv 429 / 503 responses
- Proxies included — no setup, works out of the box
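The retry behaviour named in the list above (exponential back-off on 429 / 503 responses) could look roughly like the following. This is a minimal sketch, not the actor's actual code: the function names, the 1-second base delay, the 60-second cap, and the five-retry limit are all assumptions chosen for illustration, and `fetch` is a hypothetical callable that returns an HTTP status code.

```python
import time
from typing import Callable, Iterator

RETRYABLE_STATUSES = {429, 503}  # the two statuses the listing says are retried


def backoff_delays(max_attempts: int = 5, base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield the wait before each retry: base, 2*base, 4*base, ... capped at `cap`."""
    for attempt in range(max_attempts):
        yield min(cap, base * (2 ** attempt))


def fetch_with_retry(
    fetch: Callable[[], int],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 5,
) -> int:
    """Call `fetch` again, with growing waits, while it returns a retryable status."""
    status = fetch()
    for delay in backoff_delays(max_attempts):
        if status not in RETRYABLE_STATUSES:
            break
        sleep(delay)
        status = fetch()
    return status  # last status seen, retryable or not
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production version would usually also add random jitter to the delays.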
Use Cases
For Researchers & PhD Students
Track new papers in your niche daily. Set categories: ["cs.CL"] and queries: ["retrieval augmented generation"], then schedule the run for 7 AM every morning. Connect output to Slack or Google Sheets via Apify Integrations.
For R&D Teams & Labs
Build a private literature monitoring pipeline. Combine multiple queries (e.g. ["diffusion models", "flow matching", "score-based generative"]) with a 30-day window. Output → internal Notion / Airtable.
For Analysts & Data Scientists
Measure research trends. Use outputFormat: "timeline" with a 5-year date range to chart monthly publication volume in a subject category. Or outputFormat: "categories_stats" to see which subfields dominate a query.
For Citation Network Analysis
Enable includeCitations: true with a Semantic Scholar API key, then load the CITATION_GRAPH Key-Value record into Gephi or Cytoscape. Nodes are papers in your scrape; edges are citation relationships filtered to your dataset.
Quick Start
Paste this into the Input tab and click Start:
{"queries": ["large language models"],"maxResults": 100}
Results appear in the Dataset tab in real time. Analytics (author rankings, category stats, timeline) land in Storage → Key-Value Store. Typical default run: ~6 seconds, 100 papers.
Common Inputs
Track new papers in a specific niche
{"queries": ["retrieval augmented generation"],"categories": ["cs.CL", "cs.IR"],"dateFrom": "2026-05-10","maxResults": 200}
Find papers by a specific author
{"queries": ["Yann LeCun"],"searchField": "author","maxResults": 100}
Browse a category page
{"queries": [""],"categories": ["cs.LG"],"dateFrom": "2026-06-01","maxResults": 500}
Build a citation graph for a research area
{"queries": ["attention is all you need"],"maxResults": 50,"includeCitations": true,"semanticScholarApiKey": "YOUR_FREE_KEY_FROM_SEMANTICSCHOLAR_ORG"}
Author analytics across a subfield
{"queries": ["large language models"],"categories": ["cs.CL"],"maxResults": 1000,"outputFormat": "authors"}
5-year publication timeline for a topic
{"queries": ["neural network"],"categories": ["cs.LG"],"dateFrom": "2021-01-01","dateTo": "2026-06-10","maxResults": 10000,"sortOrder": "ascending","outputFormat": "timeline"}
Tip for timeline runs: use sortOrder: "ascending" with large date ranges so older papers aren't skipped. Active categories like cs.LG generate 200+ papers per day, so descending sort + maxResults: 500 may return only the most recent few days or weeks of a long range.
Input Parameters
| Parameter | Type | Description | Default |
|---|---|---|---|
| queries | Array | Search query strings; multi-word queries match all words (e.g. "attention transformer") | ["large language models"] |
| searchField | String | all, title, abstract, author, or category | all |
| categories | Array | arXiv subject codes (OR-combined, e.g. ["cs.AI", "cs.LG"]) | (none) |
| dateFrom | String | ISO 8601 (YYYY-MM-DD), based on v1 submission date | (none) |
| dateTo | String | ISO 8601 (YYYY-MM-DD) | (none) |
| maxResults | Integer | Max papers per query (1–10,000) | 100 |
| sortBy | String | submittedDate, relevance, or lastUpdatedDate | submittedDate |
| sortOrder | String | descending (newest first) or ascending | descending |
| includeAbstract | Boolean | Include full abstract text | true |
| includeCitations | Boolean | Enrich with Semantic Scholar (citation count, ORCID, related papers) | false |
| semanticScholarApiKey | String (secret) | Optional Semantic Scholar API key; strongly recommended when includeCitations: true | (none) |
| outputFormat | String | papers, authors, categories_stats, or timeline | papers |
| proxyConfiguration | Object | Optional; proxies are included automatically | (none) |
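Before starting a run, the constraints in the table above can be checked locally. The sketch below is an illustration only: `validate_input` and `ALLOWED` are hypothetical names, and the actor may validate its input differently. The allowed values and the 1 to 10,000 range are taken from the table.

```python
from datetime import date

# Allowed values copied from the Input Parameters table.
ALLOWED = {
    "searchField": {"all", "title", "abstract", "author", "category"},
    "sortBy": {"submittedDate", "relevance", "lastUpdatedDate"},
    "sortOrder": {"descending", "ascending"},
    "outputFormat": {"papers", "authors", "categories_stats", "timeline"},
}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the input looks valid."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in run_input and run_input[key] not in allowed:
            problems.append(f"{key}: {run_input[key]!r} is not one of {sorted(allowed)}")
    max_results = run_input.get("maxResults", 100)  # 100 is the documented default
    if not isinstance(max_results, int) or not 1 <= max_results <= 10_000:
        problems.append("maxResults must be an integer from 1 to 10000")
    for key in ("dateFrom", "dateTo"):
        if key in run_input:
            try:
                date.fromisoformat(run_input[key])
            except (TypeError, ValueError):
                problems.append(f"{key} must be a YYYY-MM-DD date")
    return problems
```

For example, `validate_input({"queries": ["large language models"], "maxResults": 100})` returns an empty list, while an input with `"sortOrder": "newest"` produces one problem entry.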
Output — Dataset Fields
Sample paper record (outputFormat: "papers"):
{"paperId": "2606.11107","arxivId": "2606.11107","title": "Multimodal Brain Tumour Classification Using Feature Fusion","authors": [{"name": "Wajih ul Islam","affiliation": null,"orcid": null},{"name": "Volker Steuber","affiliation": null,"orcid": "0000-0003-4683-3173"}],"abstract": "Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history...","primaryCategory": "eess.IV","categories": ["eess.IV", "cs.CV", "cs.LG"],"submittedDate": "2026-06-09","updatedDate": "2026-06-09","pdfUrl": "https://arxiv.org/pdf/2606.11107v1","htmlUrl": null,"absUrl": "https://arxiv.org/abs/2606.11107v1","doi": "10.1109/EXAMPLE.2026.12345","journalRef": "Nature Machine Intelligence 8 (2026) 1042-1057","comments": "12 pages, 5 figures","license": "http://creativecommons.org/licenses/by/4.0/","citationCount": 47,"influentialCitationCount": 8,"relatedPapers": [{"semanticScholarId": "abc123def456","title": "Attention Is All You Need","citationCount": 95442,"isInfluential": true,"relationshipType": "cited_by_this"}],"semanticScholarId": "xyz789uvw012"}
All Fields
| Field | Type | Description |
|---|---|---|
| paperId / arxivId | String | Canonical arXiv ID (e.g. 2606.11107 or legacy astro-ph/0408219) |
| title | String | Paper title |
| authors[] | Array | Each: name, affiliation (rare, only when arXiv provides), orcid (via Semantic Scholar) |
| abstract | String | Full abstract text (null if includeAbstract: false) |
| primaryCategory | String | Primary arXiv subject code |
| categories[] | Array | All assigned arXiv categories |
| submittedDate | String | ISO 8601 date when v1 was submitted |
| updatedDate | String | ISO 8601 date of the last revision |
| pdfUrl | String | Direct PDF URL |
| htmlUrl | String | HTML version URL (newer papers only, null otherwise) |
| absUrl | String | arXiv abstract page URL |
| doi | String | Digital Object Identifier (if registered) |
| journalRef | String | Journal citation (if peer-reviewed) |
| comments | String | Author comments (page count, conference acceptance, etc.) |
| license | String | License URL (if specified) |
| citationCount | Integer | Total citations from Semantic Scholar (null without includeCitations) |
| influentialCitationCount | Integer | Subset rated influential by Semantic Scholar |
| relatedPapers[] | Array | Up to 5 most-cited related papers (cited by this paper or citing this paper) |
| semanticScholarId | String | Semantic Scholar internal ID |
Analytics Report (Key-Value Store)
Every run saves four analytics records to the Key-Value Store (available in the run's Storage tab), regardless of outputFormat:
AUTHOR_ANALYTICS
List of every unique author across the scrape, with per-author paper count, total citations, ORCID (when available), category distribution, and paper list. Sorted by paper count descending.
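The aggregation described above can be reproduced from the paper records themselves. The following is a rough sketch under assumptions: the output keys (`paperCount`, `totalCitations`, `papers`) and the tie-breaking order are hypothetical, and the actual AUTHOR_ANALYTICS record may be shaped differently. The input fields (`authors[].name`, `citationCount`, `arxivId`) come from the dataset schema.

```python
from collections import defaultdict


def author_analytics(papers: list[dict]) -> list[dict]:
    """Group paper records by author name and sort by paper count, descending."""
    stats = defaultdict(lambda: {"paperCount": 0, "totalCitations": 0, "papers": []})
    for paper in papers:
        for author in paper.get("authors", []):
            entry = stats[author["name"]]
            entry["paperCount"] += 1
            # citationCount is null when includeCitations is off, so treat it as 0.
            entry["totalCitations"] += paper.get("citationCount") or 0
            entry["papers"].append(paper["arxivId"])
    rows = [{"name": name, **entry} for name, entry in stats.items()]
    # Assumed tie-break: more citations first, then name, to keep the order stable.
    rows.sort(key=lambda r: (-r["paperCount"], -r["totalCitations"], r["name"]))
    return rows
```

Grouping by name alone merges distinct authors who share a name; where ORCID IDs are available they would make a safer grouping key.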
CATEGORY_STATS
Per-category breakdown — paper count, total/average citations, and top 5 most-cited papers in each arXiv category. Sorted by paper count.
TIMELINE
Publication volume by month and year — for charting research activity over time.
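A timeline of this kind can also be rebuilt from the papers dataset by bucketing on submittedDate. This is a minimal sketch, assuming records carry the YYYY-MM-DD `submittedDate` string shown in the sample record; the function name and the returned dict shape are illustrative and not the actor's TIMELINE format.

```python
from collections import Counter


def publication_counts(papers: list[dict], by: str = "month") -> dict[str, int]:
    """Count papers per month ("YYYY-MM") or per year ("YYYY"), in chronological order."""
    width = 7 if by == "month" else 4  # prefix length of an ISO 8601 date string
    counts = Counter(p["submittedDate"][:width] for p in papers if p.get("submittedDate"))
    return dict(sorted(counts.items()))
```

Because ISO 8601 dates sort lexicographically, sorting the string keys is enough to put the buckets in time order.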
RUN_SUMMARY
High-level run statistics: total papers, total unique authors, total categories, queries run, citation enrichment status, and date span of the scraped corpus.
CITATION_GRAPH (only when includeCitations: true)
A network graph of papers in your scrape:
- Nodes = papers (with arxivId, title, citationCount, submittedDate, primaryCategory)
- Edges = citation relationships filtered to papers in your dataset (source cites target, isInfluential flag)
Load directly into Gephi, Cytoscape, or NetworkX for visualization and centrality analysis.
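As a quick check before opening a graph tool, the most-cited papers within the scrape can be ranked with the standard library alone. The sketch assumes the record is a dict with top-level `nodes` and `edges` lists and that each edge's `source` and `target` hold arxivId values; that layout is inferred from the description above and is not confirmed.

```python
from collections import Counter


def rank_by_in_dataset_citations(graph: dict) -> list[tuple[str, int]]:
    """Rank nodes by how many other papers in the same scrape cite them."""
    # Each edge reads "source cites target", so a node's in-degree is the
    # number of edges that name it as the target.
    in_degree = Counter(edge["target"] for edge in graph["edges"])
    ranked = [(node["arxivId"], in_degree[node["arxivId"]]) for node in graph["nodes"]]
    ranked.sort(key=lambda item: (-item[1], item[0]))  # most cited first, then by ID
    return ranked
```

The same (source, target) pairs can be fed to a directed graph in NetworkX when centrality measures beyond in-degree are needed.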
Pricing
Pay-per-result, fully transparent:
| Event | Price |
|---|---|
| Actor start | $0.01 |
| Per paper scraped | $0.003 |
Examples
| Papers scraped | Total cost |
|---|---|
| 100 | $0.31 |
| 500 | $1.51 |
| 1,000 | $3.01 |
| 5,000 | $15.01 |
| 10,000 | $30.01 |
Free Apify trial credit ($5) covers ~1,650 papers for evaluation.
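The figures above follow from the two listed events. A small sketch of the arithmetic, assuming those two events are the only charges for a run (the function names are illustrative):

```python
from decimal import Decimal

START_FEE = Decimal("0.01")   # "Actor start" event
PER_PAPER = Decimal("0.003")  # "Per paper scraped" event


def run_cost(papers: int) -> Decimal:
    """Cost in USD of a single run that scrapes `papers` papers."""
    return START_FEE + PER_PAPER * papers


def papers_within_budget(budget: str) -> int:
    """Largest number of papers a single run can scrape for `budget` USD."""
    return int((Decimal(budget) - START_FEE) // PER_PAPER)
```

`run_cost(1000)` gives 3.01, matching the table. `papers_within_budget("5")` gives 1663 for one run, so the ~1,650 figure leaves a little headroom for more than one actor start.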
Scheduled Runs
Track your niche automatically:
- Open the actor → Schedule → New schedule
- Set a cron expression (e.g. 0 7 * * * for 7 AM daily)
- Fresh dataset each run
- Pipe into Google Sheets, Airtable, Slack, webhooks via Apify Integrations
FAQ
Is this scraper legal to use?
Yes. It uses the official arXiv Atom API, which arXiv publishes for programmatic access, and waits 3.5 seconds between paginated requests, which stays within arXiv's published guidance of at most one request every three seconds. No DOM scraping, no rate-limit circumvention.
Do I need to configure a proxy?
No. Proxies are included and configured automatically. The proxyConfiguration field is optional and only used if you want to override with your own.
How accurate is the citation data?
Citation counts come from Semantic Scholar's Academic Graph API, which is the same dataset used by major research tools. It covers most papers from 2000 onward; very recent papers (last 1-2 weeks) may not yet be indexed.
Why does Semantic Scholar enrichment need an API key?
Without a key, Semantic Scholar's free tier rate-limits aggressively (HTTP 429 within seconds). With a free API key (get one at semanticscholar.org/product/api), enrichment runs at full speed. For small runs (under ~20 papers), you can skip the key.
How many papers can I scrape per run?
Up to 10,000 per query (arXiv API hard limit). Run multiple queries to scale further; they're auto-deduplicated by paper ID.
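Deduplication by paper ID across several queries can be pictured as follows. This is only a sketch of the idea: the function name is hypothetical, and keeping the first occurrence of each arxivId is an assumption about which copy survives.

```python
def merge_query_results(*batches: list[dict]) -> list[dict]:
    """Merge per-query result lists, keeping the first record seen for each arxivId."""
    seen: dict[str, dict] = {}
    for batch in batches:
        for paper in batch:
            # setdefault stores the record only if this arxivId has not been seen yet.
            seen.setdefault(paper["arxivId"], paper)
    return list(seen.values())  # dicts preserve insertion order
```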
Why is my timeline showing only the last month?
With sortOrder: descending, a small maxResults, and a wide date range, you get only the newest N papers, which in active categories (cs.LG, cs.CL) span no more than a few weeks. For long-range timelines, use sortOrder: ascending and/or increase maxResults to 5,000–10,000.
Does it handle old arXiv ID formats like astro-ph/0408219?
Yes. Both modern (2606.11107) and pre-2007 legacy formats are parsed correctly. The slash in paperId is preserved verbatim.
Can I filter by multiple categories?
Yes. Categories are OR-combined: ["cs.AI", "cs.LG"] returns papers in either category. Combine with a text query and a date range for narrow scopes.
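One plausible way such inputs translate into an arXiv API search_query string is sketched below. The field prefixes (all, ti, abs, au, cat) and the submittedDate range syntax are part of the arXiv API; how this actor actually composes its query is not documented here, so `build_search_query`, the parenthesisation, and the requirement that both dates be present are assumptions.

```python
# Mapping from the actor's searchField values to arXiv API field prefixes.
FIELD_PREFIX = {"all": "all", "title": "ti", "abstract": "abs", "author": "au", "category": "cat"}


def build_search_query(query, search_field="all", categories=(), date_from=None, date_to=None):
    """Compose an arXiv API search_query: words AND-ed, categories OR-ed, optional date range."""
    clauses = []
    words = query.split()
    if words:
        prefix = FIELD_PREFIX[search_field]
        clauses.append("(" + " AND ".join(f"{prefix}:{w}" for w in words) + ")")
    if categories:
        clauses.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    if date_from and date_to:  # this sketch only handles a fully specified range
        start = date_from.replace("-", "") + "0000"  # YYYYMMDD + HHMM
        end = date_to.replace("-", "") + "2359"
        clauses.append(f"submittedDate:[{start} TO {end}]")
    return " AND ".join(clauses)
```

The resulting string would still need URL encoding before being sent as the search_query parameter.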
What sort orders does arXiv actually support?
submittedDate, relevance, and lastUpdatedDate. Note that arXiv may prepend featured / new submissions above strict ordering.
Where are author analytics / citation graph stored?
Run → Storage tab → Key-Value Store → click AUTHOR_ANALYTICS, CATEGORY_STATS, TIMELINE, RUN_SUMMARY, or CITATION_GRAPH.
Support
Found a bug or want a new feature? Open an issue in the Issues tab on this actor's page. Response time typically under 24 hours.
Maintained by brilliant_gum on Apify.