Pricing

from $3.00 / 1,000 papers scraped

arXiv Paper Scraper — Citations, Authors, ORCID, Analytics

Scrape academic papers from arXiv via the official Atom API. Filter by category, date, query, or author. Includes citation data, ORCID IDs from Semantic Scholar, citation network graph, and built-in analytics (authors, categories, timeline). Four output formats. Proxies included.

Developer

Yuliia Kulakova

Last modified

4 days ago

Extract academic papers from arXiv.org at scale. Built on the official arXiv Atom API — no DOM scraping, no anti-bot games, no breakage. Includes citation data from Semantic Scholar, ORCID author IDs, citation network graphs, and built-in analytics.

Filter by category, date range, author, or full-text query. Four output formats. Proxies included. Free Apify trial.

What You Get

Official Atom API, not a browser — fast (100 papers in ~6 seconds), stable, won't break when arXiv updates their site
Full text search with field targeting (title, abstract, author, category, or all)
150+ subject categories supported (cs.AI, cs.LG, stat.ML, math.ST, q-bio, physics.*, econ.*, …)
Date range filtering based on arXiv's submittedDate (v1 submission)
Citation data from Semantic Scholar (optional): citation count, influential citation count, related papers, ORCID author IDs
Citation network graph export when citations are enabled — ready for Gephi, Cytoscape, or NetworkX
Four output formats in one actor:
1. Papers — one record per paper (default)
2. Author analytics — sorted by paper count and citations, with per-author paper lists
3. Category statistics — paper counts and top 5 papers per arXiv category
4. Timeline — publication counts by month and year
Legacy paperId support — handles both modern (2606.11125) and pre-2007 (astro-ph/0408219) arXiv ID formats
Multi-query batching with deduplication — pass several queries, dupes by arxivId are removed automatically
Built-in retry resilience — exponential back-off on network blips and arXiv 429 / 503 responses
Proxies included — no setup, works out of the box
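
The retry behaviour listed above is not documented in detail on this page. A minimal sketch of exponential back-off on 429 / 503 responses, assuming a hypothetical `fetch` callable that returns an HTTP status and body, and assumed defaults for the attempt count and base delay (the actor's real values are not stated here):

```python
import time
from typing import Callable, Tuple

RETRYABLE_STATUSES = {429, 503}  # the statuses named in the feature list


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],  # hypothetical: returns (status, body)
    max_attempts: int = 5,                 # assumed default, not the actor's
    base_delay: float = 1.0,               # assumed default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it succeeds, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUSES:
            raise RuntimeError(f"non-retryable HTTP status {status}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting.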

Use Cases

For Researchers & PhD Students

Track new papers in your niche daily. Set categories: ["cs.CL"] and queries: ["retrieval augmented generation"], then schedule the run for 7 AM every morning. Connect the output to Slack or Google Sheets via Apify Integrations.

For R&D Teams & Labs

Build a private literature monitoring pipeline. Combine multiple queries (e.g. ["diffusion models", "flow matching", "score-based generative"]) with a 30-day window. Output → internal Notion / Airtable.
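
A rolling 30-day window has to be recomputed for each run. A minimal sketch of building such an input, using the documented `queries`, `dateFrom`, `dateTo`, and `maxResults` fields; the helper name `monitoring_input` is hypothetical:

```python
import json
from datetime import date, timedelta


def monitoring_input(queries, today, window_days=30, max_results=100):
    """Build an actor input covering the last `window_days` days up to `today`."""
    return {
        "queries": list(queries),
        "dateFrom": (today - timedelta(days=window_days)).isoformat(),
        "dateTo": today.isoformat(),
        "maxResults": max_results,  # 100 is the documented default
    }


run_input = monitoring_input(
    ["diffusion models", "flow matching", "score-based generative"],
    today=date(2026, 6, 10),
)
print(json.dumps(run_input, indent=2))
```

In a scheduled job, `date.today()` would replace the fixed date used here for illustration.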

For Analysts & Data Scientists

Measure research trends. Use outputFormat: "timeline" with a 5-year date range to chart monthly publication volume in a subject category. Or outputFormat: "categories_stats" to see which subfields dominate a query.

For Citation Network Analysis

Enable includeCitations: true with a Semantic Scholar API key, then load the CITATION_GRAPH Key-Value record into Gephi or Cytoscape. Nodes are the papers in your scrape; edges are citation relationships filtered to your dataset.

Quick Start

Paste this into the Input tab and click Start:

{
  "queries": ["large language models"],
  "maxResults": 100
}

Results appear in the Dataset tab in real time. Analytics (author rankings, category stats, timeline) land in Storage → Key-Value Store. Typical default run: ~6 seconds, 100 papers.
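
The same input can also be sent programmatically. A minimal sketch, assuming the official `apify-client` Python package; the token and actor ID are placeholders because this page does not state them, and `run_scraper` is a hypothetical helper name:

```python
QUICK_START_INPUT = {"queries": ["large language models"], "maxResults": 100}


def run_scraper(client, actor_id, run_input):
    """Start the actor, wait for it to finish, and return the dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())


# Usage (needs `pip install apify-client` and network access):
# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_APIFY_TOKEN>")
# papers = run_scraper(client, "<username>/<actor-name>", QUICK_START_INPUT)
# print(len(papers), papers[0]["title"])
```

Taking the client as an argument keeps the helper independent of how the token is stored.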

Common Inputs

Track new papers in a specific niche

{
  "queries": ["retrieval augmented generation"],
  "categories": ["cs.CL", "cs.IR"],
  "dateFrom": "2026-05-10",
  "maxResults": 200
}

Find papers by a specific author

{
  "queries": ["Yann LeCun"],
  "searchField": "author",
  "maxResults": 100
}

Browse a category page

{
  "queries": [""],
  "categories": ["cs.LG"],
  "dateFrom": "2026-06-01",
  "maxResults": 500
}

Build a citation graph for a research area

{
  "queries": ["attention is all you need"],
  "maxResults": 50,
  "includeCitations": true,
  "semanticScholarApiKey": "YOUR_FREE_KEY_FROM_SEMANTICSCHOLAR_ORG"
}

Author analytics across a subfield

{
  "queries": ["large language models"],
  "categories": ["cs.CL"],
  "maxResults": 1000,
  "outputFormat": "authors"
}

5-year publication timeline for a topic

{
  "queries": ["neural network"],
  "categories": ["cs.LG"],
  "dateFrom": "2021-01-01",
  "dateTo": "2026-06-10",
  "maxResults": 10000,
  "sortOrder": "ascending",
  "outputFormat": "timeline"
}

Tip for timeline runs: use sortOrder: "ascending" with large date ranges so older papers aren't skipped. Active categories like cs.LG generate 200+ papers per day, so descending sort with maxResults: 500 may cover only the most recent days or weeks.
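
A rough back-of-the-envelope check for this tip, assuming a constant number of matching papers per day. The 200/day figure is the whole-category volume quoted above; a narrower text query matches fewer papers per day and therefore reaches further back:

```python
def days_covered(max_results: int, papers_per_day: float) -> float:
    """Approximate span, in days, of the newest `max_results` papers."""
    return max_results / papers_per_day


print(days_covered(500, 200))     # 2.5
print(days_covered(10_000, 200))  # 50.0
```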

Input Parameters

Parameter	Type	Description	Default
`queries`	Array	Search query strings — multi-word queries match all words (e.g. `"attention transformer"`)	`["large language models"]`
`searchField`	String	`all`, `title`, `abstract`, `author`, or `category`	`all`
`categories`	Array	arXiv subject codes (OR-combined, e.g. `["cs.AI", "cs.LG"]`)	—
`dateFrom`	String	ISO 8601 (YYYY-MM-DD), based on v1 submission date	—
`dateTo`	String	ISO 8601 (YYYY-MM-DD)	—
`maxResults`	Integer	Max papers per query (1–10,000)	`100`
`sortBy`	String	`submittedDate`, `relevance`, or `lastUpdatedDate`	`submittedDate`
`sortOrder`	String	`descending` (newest first) or `ascending`	`descending`
`includeAbstract`	Boolean	Include full abstract text	`true`
`includeCitations`	Boolean	Enrich with Semantic Scholar (citation count, ORCID, related papers)	`false`
`semanticScholarApiKey`	String (secret)	Optional Semantic Scholar API key — strongly recommended when `includeCitations: true`	—
`outputFormat`	String	`papers`, `authors`, `categories_stats`, or `timeline`	`papers`
`proxyConfiguration`	Object	Optional — proxies are included automatically	—
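
The constraints in this table can be checked locally before starting a run. A minimal sketch; `validate_input` is a hypothetical helper and only covers the ranges and enumerations listed above, not whatever validation the actor performs itself:

```python
from datetime import date

ALLOWED = {
    "searchField": {"all", "title", "abstract", "author", "category"},
    "sortBy": {"submittedDate", "relevance", "lastUpdatedDate"},
    "sortOrder": {"descending", "ascending"},
    "outputFormat": {"papers", "authors", "categories_stats", "timeline"},
}


def validate_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    max_results = run_input.get("maxResults", 100)
    if not isinstance(max_results, int) or not 1 <= max_results <= 10_000:
        problems.append("maxResults must be an integer from 1 to 10000")
    for key, allowed in ALLOWED.items():
        if key in run_input and run_input[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")
    for key in ("dateFrom", "dateTo"):
        if key in run_input:
            try:
                date.fromisoformat(run_input[key])
            except (TypeError, ValueError):
                problems.append(f"{key} must be a YYYY-MM-DD date")
    return problems


print(validate_input({"queries": ["neural network"], "sortOrder": "newest"}))
```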

Output — Dataset Fields

Sample paper record (outputFormat: "papers"):

{
  "paperId": "2606.11107",
  "arxivId": "2606.11107",
  "title": "Multimodal Brain Tumour Classification Using Feature Fusion",
  "authors": [
    {
      "name": "Wajih ul Islam",
      "affiliation": null,
      "orcid": null
    },
    {
      "name": "Volker Steuber",
      "affiliation": null,
      "orcid": "0000-0003-4683-3173"
    }
  ],
  "abstract": "Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history...",
  "primaryCategory": "eess.IV",
  "categories": ["eess.IV", "cs.CV", "cs.LG"],
  "submittedDate": "2026-06-09",
  "updatedDate": "2026-06-09",
  "pdfUrl": "https://arxiv.org/pdf/2606.11107v1",
  "htmlUrl": null,
  "absUrl": "https://arxiv.org/abs/2606.11107v1",
  "doi": "10.1109/EXAMPLE.2026.12345",
  "journalRef": "Nature Machine Intelligence 8 (2026) 1042-1057",
  "comments": "12 pages, 5 figures",
  "license": "http://creativecommons.org/licenses/by/4.0/",
  "citationCount": 47,
  "influentialCitationCount": 8,
  "relatedPapers": [
    {
      "semanticScholarId": "abc123def456",
      "title": "Attention Is All You Need",
      "citationCount": 95442,
      "isInfluential": true,
      "relationshipType": "cited_by_this"
    }
  ],
  "semanticScholarId": "xyz789uvw012"
}

All Fields

Field	Type	Description
`paperId` / `arxivId`	String	Canonical arXiv ID (e.g. `2606.11107` or legacy `astro-ph/0408219`)
`title`	String	Paper title
`authors[]`	Array	Each: `name`, `affiliation` (rare, only when arXiv provides), `orcid` (via Semantic Scholar)
`abstract`	String	Full abstract text (null if `includeAbstract: false`)
`primaryCategory`	String	Primary arXiv subject code
`categories[]`	Array	All assigned arXiv categories
`submittedDate`	String	ISO 8601 — when v1 was submitted
`updatedDate`	String	ISO 8601 — last revision date
`pdfUrl`	String	Direct PDF URL
`htmlUrl`	String	HTML version URL (newer papers only, null otherwise)
`absUrl`	String	arXiv abstract page URL
`doi`	String	Digital Object Identifier (if registered)
`journalRef`	String	Journal citation (if peer-reviewed)
`comments`	String	Author comments (page count, conference acceptance, etc.)
`license`	String	License URL (if specified)
`citationCount`	Integer	Total citations from Semantic Scholar (null without `includeCitations`)
`influentialCitationCount`	Integer	Subset rated influential by Semantic Scholar
`relatedPapers[]`	Array	Up to 5 most-cited related papers (cited by this paper or citing this paper)
`semanticScholarId`	String	Semantic Scholar internal ID
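
Records with this shape are straightforward to post-process. A minimal sketch that collects every author with an ORCID from a list of paper records, using only the `authors[]`, `name`, and `orcid` fields described above; the helper name is hypothetical and the sample data is trimmed from the record shown earlier:

```python
def authors_with_orcid(papers):
    """Map ORCID -> author name for every author that has an ORCID."""
    found = {}
    for paper in papers:
        for author in paper.get("authors", []):
            if author.get("orcid"):
                found[author["orcid"]] = author["name"]
    return found


sample = [
    {
        "arxivId": "2606.11107",
        "authors": [
            {"name": "Wajih ul Islam", "affiliation": None, "orcid": None},
            {"name": "Volker Steuber", "affiliation": None, "orcid": "0000-0003-4683-3173"},
        ],
    }
]
print(authors_with_orcid(sample))  # {'0000-0003-4683-3173': 'Volker Steuber'}
```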

Analytics Report (Key-Value Store)

Every run saves four analytics records to the Key-Value Store (available in the run's Storage tab), regardless of outputFormat. A fifth record, CITATION_GRAPH, is added when includeCitations is true:

`AUTHOR_ANALYTICS`

List of every unique author across the scrape, with per-author paper count, total citations, ORCID (when available), category distribution, and paper list. Sorted by paper count descending.

`CATEGORY_STATS`

Per-category breakdown — paper count, total/average citations, and top 5 most-cited papers in each arXiv category. Sorted by paper count.

`TIMELINE`

Publication volume by month and year — for charting research activity over time.

`RUN_SUMMARY`

High-level run statistics: total papers, total unique authors, total categories, queries run, citation enrichment status, and date span of the scraped corpus.

`CITATION_GRAPH` (only when `includeCitations: true`)

A network graph of papers in your scrape:

Nodes = papers (with arxivId, title, citationCount, submittedDate, primaryCategory)
Edges = citation relationships filtered to papers in your dataset (source cites target, isInfluential flag)

Load directly into Gephi, Cytoscape, or NetworkX for visualization and centrality analysis.
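
The exact JSON layout of CITATION_GRAPH is not shown on this page. A minimal sketch that ranks papers by how often they are cited within the scrape, assuming top-level `nodes` and `edges` lists and `source` / `target` keys on each edge (all four names are assumptions; the node fields are the ones listed above):

```python
from collections import Counter


def most_cited_in_scrape(graph: dict, top_n: int = 3):
    """Return (title, in-scrape citation count) pairs, most cited first."""
    titles = {node["arxivId"]: node["title"] for node in graph["nodes"]}
    in_degree = Counter(edge["target"] for edge in graph["edges"])
    return [(titles.get(pid, pid), n) for pid, n in in_degree.most_common(top_n)]


# Tiny hand-made graph in the assumed layout (not real data).
graph = {
    "nodes": [
        {"arxivId": "A", "title": "Paper A"},
        {"arxivId": "B", "title": "Paper B"},
        {"arxivId": "C", "title": "Paper C"},
    ],
    "edges": [
        {"source": "A", "target": "B", "isInfluential": True},
        {"source": "C", "target": "B", "isInfluential": False},
        {"source": "A", "target": "C", "isInfluential": False},
    ],
}
print(most_cited_in_scrape(graph))  # [('Paper B', 2), ('Paper C', 1)]
```

Under the same assumptions, the edge list could be handed to NetworkX with `networkx.DiGraph().add_edges_from((e["source"], e["target"]) for e in graph["edges"])`.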

Pricing

Pay-per-result, fully transparent:

Event	Price
Actor start	$0.01
Per paper scraped	$0.003

Examples

Papers scraped	Total cost
100	$0.31
500	$1.51
1,000	$3.01
5,000	$15.01
10,000	$30.01

Free Apify trial credit ($5) covers ~1,650 papers for evaluation.
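
The table follows directly from the two event prices. A minimal sketch of the arithmetic for a single run, using `Decimal` to avoid floating-point rounding; the function names are hypothetical:

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")  # charged once per run
PER_PAPER = Decimal("0.003")   # charged per paper scraped


def run_cost(papers: int) -> Decimal:
    """Total cost in USD of one run that scrapes `papers` papers."""
    return ACTOR_START + PER_PAPER * papers


def papers_for_budget(budget: str) -> int:
    """Largest number of papers a single run can scrape within `budget` USD."""
    return int((Decimal(budget) - ACTOR_START) // PER_PAPER)


for n in (100, 500, 1_000, 5_000, 10_000):
    print(n, run_cost(n))
print(papers_for_budget("5"))  # 1663
```

Splitting the same budget across several runs lowers the total slightly, since each run pays the start fee again.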

Scheduled Runs

Track your niche automatically:

1. Open the actor → Schedule → New schedule.
2. Set a cron expression (e.g. 0 7 * * * for 7 AM daily).
3. Each run writes to a fresh dataset.
4. Pipe the results into Google Sheets, Airtable, Slack, or webhooks via Apify Integrations.

FAQ

Is this scraper legal to use? Yes. It uses the official arXiv Atom API (the one arXiv publishes for programmatic access) and waits 3.5 seconds between paginated requests to stay within arXiv's published rate limit. No DOM scraping, no rate-limit circumvention.

Do I need to configure a proxy? No. Proxies are included and configured automatically. The proxyConfiguration field is optional and only used if you want to override with your own.

How accurate is the citation data? Citation counts come from Semantic Scholar's Academic Graph API, which is the same dataset used by major research tools. It covers most papers from 2000 onward; very recent papers (last 1-2 weeks) may not yet be indexed.

Why does Semantic Scholar enrichment need an API key? Without a key, Semantic Scholar's free tier rate-limits aggressively (HTTP 429 within seconds). With a free API key (get one at semanticscholar.org/product/api), enrichment runs at full speed. For enrichment runs under ~20 papers, you can skip the key.

How many papers can I scrape per run? Up to 10,000 per query (arXiv API hard limit). Run multiple queries to scale further — they're auto-deduplicated by paper ID.

Why is my timeline showing only the last month? With sortOrder: descending, a small maxResults, and a wide date range, you get only the newest N papers, which in active categories (cs.LG, cs.CL) span just a few weeks. For long-range timelines, use sortOrder: ascending and/or increase maxResults to 5,000–10,000.

Does it handle old arXiv ID formats like astro-ph/0408219? Yes. Both modern (2606.11107) and pre-2007 legacy formats are parsed correctly. The slash in paperId is preserved verbatim.
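
For downstream code that receives these IDs, the two formats can be told apart with two regular expressions. A minimal sketch, not the actor's own parser; the patterns and the helper name `split_arxiv_id` are assumptions based on arXiv's published identifier schemes:

```python
import re

# Modern scheme: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
MODERN_ID = re.compile(r"^(\d{4}\.\d{4,5})(v\d+)?$")
# Legacy scheme: archive[.SC]/YYMMNNN, optional version suffix.
LEGACY_ID = re.compile(r"^([a-z-]+(?:\.[A-Z]{2})?/\d{7})(v\d+)?$")


def split_arxiv_id(raw: str):
    """Return (canonical_id, version or None); raise ValueError if unrecognised."""
    for pattern in (MODERN_ID, LEGACY_ID):
        match = pattern.match(raw)
        if match:
            return match.group(1), match.group(2)
    raise ValueError(f"not an arXiv ID: {raw!r}")


print(split_arxiv_id("2606.11107v1"))      # ('2606.11107', 'v1')
print(split_arxiv_id("astro-ph/0408219"))  # ('astro-ph/0408219', None)
```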

Can I filter by multiple categories? Yes. Categories are OR-combined: ["cs.AI", "cs.LG"] returns papers in either category. Combine with a text query and a date range for narrow scopes.

What sort orders does arXiv actually support? submittedDate, relevance, and lastUpdatedDate. Note that arXiv may prepend featured / new submissions above strict ordering.

Where are author analytics / citation graph stored? Run → Storage tab → Key-Value Store → click AUTHOR_ANALYTICS, CATEGORY_STATS, TIMELINE, RUN_SUMMARY, or CITATION_GRAPH.

Support

Found a bug or want a new feature? Open an issue in the Issues tab on this actor's page. Response time is typically under 24 hours.

Maintained by brilliant_gum on Apify.
