arXiv Paper Scraper — Citations, Authors, ORCID, Analytics

Pricing

from $3.00 / 1,000 papers scraped

Scrape academic papers from arXiv via the official Atom API. Filter by category, date, query, or author. Includes citation data, ORCID IDs from Semantic Scholar, citation network graph, and built-in analytics (authors, categories, timeline). Four output formats. Proxies included.

Rating

0.0 (0)

Developer

Yuliia Kulakova

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 4 days ago

arXiv Paper Scraper — Citations · Authors · ORCID · Analytics

Extract academic papers from arXiv.org at scale. Built on the official arXiv Atom API — no DOM scraping, no anti-bot games, no breakage. Includes citation data from Semantic Scholar, ORCID author IDs, citation network graphs, and built-in analytics.

Filter by category, date range, author, or full-text query. Four output formats. Proxies included. Free Apify trial.


What You Get

  • Official Atom API, not a browser — fast (100 papers in ~6 seconds), stable, won't break when arXiv updates their site
  • Full text search with field targeting (title, abstract, author, category, or all)
  • 150+ subject categories supported (cs.AI, cs.LG, stat.ML, math.ST, q-bio, physics.*, econ.*, …)
  • Date range filtering based on arXiv's submittedDate (v1 submission)
  • Citation data from Semantic Scholar (optional): citation count, influential citation count, related papers, ORCID author IDs
  • Citation network graph export when citations are enabled — ready for Gephi, Cytoscape, or NetworkX
  • Four output formats in one actor:
    1. Papers — one record per paper (default)
    2. Author analytics — sorted by paper count and citations, with per-author paper lists
    3. Category statistics — paper counts and top 5 papers per arXiv category
    4. Timeline — publication counts by month and year
  • Legacy paperId support — handles both modern (2606.11125) and pre-2007 (astro-ph/0408219) arXiv ID formats
  • Multi-query batching with deduplication — pass several queries and duplicates (matched by arxivId) are removed automatically
  • Built-in retry resilience — exponential back-off on network blips and arXiv 429 / 503 responses
  • Proxies included — no setup, works out of the box
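
The retry behaviour listed above can be pictured with a small sketch. This is not the actor's own code, which is not published here; it is a minimal illustration of exponential back-off on HTTP 429 / 503 and connection errors, using only the Python standard library. The function names, the retry count, the base delay, and the jitter factor are all assumptions chosen for illustration.

```python
import random
import time
import urllib.error
import urllib.request

RETRYABLE_STATUS = {429, 503}  # the two statuses named in the feature list


def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Un-jittered wait before each retry: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def fetch_with_retry(url, max_retries=5, base=1.0, cap=60.0, sleep=time.sleep):
    """GET `url`, retrying on connection errors and on HTTP 429 / 503 (hypothetical helper)."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Non-retryable status, or retries exhausted: surface the error.
            if err.code not in RETRYABLE_STATUS or attempt == max_retries:
                raise
        except urllib.error.URLError:
            if attempt == max_retries:
                raise
        # Up to 10% random jitter so parallel clients do not retry in lockstep.
        sleep(delays[attempt] * (1 + random.uniform(0, 0.1)))
```

With the defaults, the waits grow as 1, 2, 4, 8, 16 seconds before the error is finally raised.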

Use Cases

For Researchers & PhD Students

Track new papers in your niche daily. Set categories: ["cs.CL"] and queries: ["retrieval augmented generation"], then schedule the actor to run every morning at 7 AM. Connect the output to Slack or Google Sheets via Apify Integrations.

For R&D Teams & Labs

Build a private literature monitoring pipeline. Combine multiple queries (e.g. ["diffusion models", "flow matching", "score-based generative"]) with a 30-day window, and send the output to an internal Notion or Airtable workspace.

For Analysts & Data Scientists

Measure research trends. Use outputFormat: "timeline" with a 5-year date range to chart monthly publication volume in a subject category. Or outputFormat: "categories_stats" to see which subfields dominate a query.

For Citation Network Analysis

Enable includeCitations: true with a Semantic Scholar API key, then load the CITATION_GRAPH Key-Value record into Gephi or Cytoscape. Nodes are the papers in your scrape; edges are citation relationships, restricted to papers in your dataset.


Quick Start

Paste this into the Input tab and click Start:

{
  "queries": ["large language models"],
  "maxResults": 100
}

Results appear in the Dataset tab in real time. Analytics (author rankings, category stats, timeline) land in Storage → Key-Value Store. Typical default run: ~6 seconds, 100 papers.


Common Inputs

Track new papers in a specific niche

{
  "queries": ["retrieval augmented generation"],
  "categories": ["cs.CL", "cs.IR"],
  "dateFrom": "2026-05-10",
  "maxResults": 200
}

Find papers by a specific author

{
  "queries": ["Yann LeCun"],
  "searchField": "author",
  "maxResults": 100
}

Browse a category page

{
  "queries": [""],
  "categories": ["cs.LG"],
  "dateFrom": "2026-06-01",
  "maxResults": 500
}

Build a citation graph for a research area

{
  "queries": ["attention is all you need"],
  "maxResults": 50,
  "includeCitations": true,
  "semanticScholarApiKey": "YOUR_FREE_KEY_FROM_SEMANTICSCHOLAR_ORG"
}

Author analytics across a subfield

{
  "queries": ["large language models"],
  "categories": ["cs.CL"],
  "maxResults": 1000,
  "outputFormat": "authors"
}

5-year publication timeline for a topic

{
  "queries": ["neural network"],
  "categories": ["cs.LG"],
  "dateFrom": "2021-01-01",
  "dateTo": "2026-06-10",
  "maxResults": 10000,
  "sortOrder": "ascending",
  "outputFormat": "timeline"
}

Tip for timeline runs: use sortOrder: "ascending" with large date ranges so older papers aren't skipped. Active categories like cs.LG generate 200+ papers per day, so a descending sort with maxResults: 500 may cover only the most recent days or weeks instead of the full range.


Input Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| queries | Array | Search query strings — multi-word queries match all words (e.g. "attention transformer") | ["large language models"] |
| searchField | String | all, title, abstract, author, or category | all |
| categories | Array | arXiv subject codes (OR-combined, e.g. ["cs.AI", "cs.LG"]) | |
| dateFrom | String | ISO 8601 (YYYY-MM-DD), based on v1 submission date | |
| dateTo | String | ISO 8601 (YYYY-MM-DD) | |
| maxResults | Integer | Max papers per query (1–10,000) | 100 |
| sortBy | String | submittedDate, relevance, or lastUpdatedDate | submittedDate |
| sortOrder | String | descending (newest first) or ascending | descending |
| includeAbstract | Boolean | Include full abstract text | true |
| includeCitations | Boolean | Enrich with Semantic Scholar (citation count, ORCID, related papers) | false |
| semanticScholarApiKey | String (secret) | Optional Semantic Scholar API key — strongly recommended when includeCitations: true | |
| outputFormat | String | papers, authors, categories_stats, or timeline | papers |
| proxyConfiguration | Object | Optional — proxies are included automatically | |
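
Before starting a long run, it can help to check an input object against the constraints in the table above. The sketch below is a hypothetical client-side check, not part of the actor: the function name validate_input and the wording of its messages are assumptions, while the allowed values and the 1–10,000 range are taken from the table.

```python
from datetime import date

# Allowed values copied from the parameter table.
ALLOWED_VALUES = {
    "searchField": {"all", "title", "abstract", "author", "category"},
    "sortBy": {"submittedDate", "relevance", "lastUpdatedDate"},
    "sortOrder": {"descending", "ascending"},
    "outputFormat": {"papers", "authors", "categories_stats", "timeline"},
}


def validate_input(run_input):
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        if key in run_input and run_input[key] not in allowed:
            problems.append(f"{key} must be one of {sorted(allowed)}")

    max_results = run_input.get("maxResults", 100)
    if not isinstance(max_results, int) or not 1 <= max_results <= 10_000:
        problems.append("maxResults must be an integer from 1 to 10000")

    parsed = {}
    for key in ("dateFrom", "dateTo"):
        if key in run_input:
            try:
                parsed[key] = date.fromisoformat(run_input[key])
            except (TypeError, ValueError):
                problems.append(f"{key} must be a YYYY-MM-DD date")
    if len(parsed) == 2 and parsed["dateFrom"] > parsed["dateTo"]:
        problems.append("dateFrom must not be later than dateTo")
    return problems
```

For example, validate_input({"queries": ["large language models"], "maxResults": 100}) returns an empty list, while an input with "maxResults": 0 returns one problem.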

Output — Dataset Fields

Sample paper record (outputFormat: "papers"):

{
  "paperId": "2606.11107",
  "arxivId": "2606.11107",
  "title": "Multimodal Brain Tumour Classification Using Feature Fusion",
  "authors": [
    {
      "name": "Wajih ul Islam",
      "affiliation": null,
      "orcid": null
    },
    {
      "name": "Volker Steuber",
      "affiliation": null,
      "orcid": "0000-0003-4683-3173"
    }
  ],
  "abstract": "Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history...",
  "primaryCategory": "eess.IV",
  "categories": ["eess.IV", "cs.CV", "cs.LG"],
  "submittedDate": "2026-06-09",
  "updatedDate": "2026-06-09",
  "pdfUrl": "https://arxiv.org/pdf/2606.11107v1",
  "htmlUrl": null,
  "absUrl": "https://arxiv.org/abs/2606.11107v1",
  "doi": "10.1109/EXAMPLE.2026.12345",
  "journalRef": "Nature Machine Intelligence 8 (2026) 1042-1057",
  "comments": "12 pages, 5 figures",
  "license": "http://creativecommons.org/licenses/by/4.0/",
  "citationCount": 47,
  "influentialCitationCount": 8,
  "relatedPapers": [
    {
      "semanticScholarId": "abc123def456",
      "title": "Attention Is All You Need",
      "citationCount": 95442,
      "isInfluential": true,
      "relationshipType": "cited_by_this"
    }
  ],
  "semanticScholarId": "xyz789uvw012"
}

All Fields

| Field | Type | Description |
| --- | --- | --- |
| paperId / arxivId | String | Canonical arXiv ID (e.g. 2606.11107 or legacy astro-ph/0408219) |
| title | String | Paper title |
| authors[] | Array | Each: name, affiliation (rare, only when arXiv provides), orcid (via Semantic Scholar) |
| abstract | String | Full abstract text (null if includeAbstract: false) |
| primaryCategory | String | Primary arXiv subject code |
| categories[] | Array | All assigned arXiv categories |
| submittedDate | String | ISO 8601 — when v1 was submitted |
| updatedDate | String | ISO 8601 — last revision date |
| pdfUrl | String | Direct PDF URL |
| htmlUrl | String | HTML version URL (newer papers only, null otherwise) |
| absUrl | String | arXiv abstract page URL |
| doi | String | Digital Object Identifier (if registered) |
| journalRef | String | Journal citation (if peer-reviewed) |
| comments | String | Author comments (page count, conference acceptance, etc.) |
| license | String | License URL (if specified) |
| citationCount | Integer | Total citations from Semantic Scholar (null without includeCitations) |
| influentialCitationCount | Integer | Subset rated influential by Semantic Scholar |
| relatedPapers[] | Array | Up to 5 most-cited related papers (cited by this paper or citing this paper) |
| semanticScholarId | String | Semantic Scholar internal ID |

Analytics Report (Key-Value Store)

Every run saves four analytics records to the Key-Value Store (available in the run's Storage tab), regardless of outputFormat:

AUTHOR_ANALYTICS

List of every unique author across the scrape, with per-author paper count, total citations, ORCID (when available), category distribution, and paper list. Sorted by paper count descending.
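
To make the aggregation concrete, here is a minimal sketch of how a similar author ranking could be derived from paper records in the Dataset. It is an illustration only, not the actor's implementation: the output keys (paperCount, totalCitations, papers) are hypothetical names, and authors are keyed by name, which will merge distinct people who share a name.

```python
from collections import defaultdict


def author_analytics(papers):
    """Aggregate paper records per author, sorted by paper count descending."""
    stats = defaultdict(
        lambda: {"paperCount": 0, "totalCitations": 0, "orcid": None, "papers": []}
    )
    for paper in papers:
        for author in paper.get("authors", []):
            entry = stats[author["name"]]
            entry["paperCount"] += 1
            # citationCount is null when citation enrichment is off.
            entry["totalCitations"] += paper.get("citationCount") or 0
            # Keep the first ORCID seen for this author name.
            entry["orcid"] = entry["orcid"] or author.get("orcid")
            entry["papers"].append(paper["arxivId"])
    rows = [{"name": name, **entry} for name, entry in stats.items()]
    rows.sort(key=lambda r: (-r["paperCount"], -r["totalCitations"], r["name"]))
    return rows
```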

CATEGORY_STATS

Per-category breakdown — paper count, total/average citations, and top 5 most-cited papers in each arXiv category. Sorted by paper count.

TIMELINE

Publication volume by month and year — for charting research activity over time.
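
If you want to recompute a timeline yourself from the paper records, for example after filtering them, a sketch like the following is enough. It assumes only the submittedDate field documented above (YYYY-MM-DD); the byMonth and byYear keys are illustrative and may not match the layout of the stored TIMELINE record.

```python
from collections import Counter


def timeline(papers):
    """Count papers per month (YYYY-MM) and per year (YYYY) from submittedDate."""
    by_month = Counter(
        p["submittedDate"][:7] for p in papers if p.get("submittedDate")
    )
    by_year = Counter()
    for month, count in by_month.items():
        by_year[month[:4]] += count
    # Sort keys so the result is ready for charting in chronological order.
    return {
        "byMonth": dict(sorted(by_month.items())),
        "byYear": dict(sorted(by_year.items())),
    }
```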

RUN_SUMMARY

High-level run statistics: total papers, total unique authors, total categories, queries run, citation enrichment status, and date span of the scraped corpus.

CITATION_GRAPH (only when includeCitations: true)

A network graph of papers in your scrape:

  • Nodes = papers (with arxivId, title, citationCount, submittedDate, primaryCategory)
  • Edges = citation relationships filtered to papers in your dataset (source cites target, isInfluential flag)

Load directly into Gephi, Cytoscape, or NetworkX for visualization and centrality analysis.
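
As a quick check before opening a graph tool, you can rank papers by how often they are cited within the scrape. The sketch below uses only the standard library and assumes the record is a JSON object with top-level "nodes" and "edges" lists, where each edge carries "source" and "target" arXiv IDs. Those key names are inferred from the description above and are not confirmed, so inspect your own CITATION_GRAPH record first.

```python
from collections import Counter


def most_cited_in_dataset(graph, top_n=5):
    """Rank nodes by in-degree, i.e. how many papers in the scrape cite them."""
    titles = {node["arxivId"]: node.get("title", "") for node in graph["nodes"]}
    # An edge means "source cites target", so count occurrences of each target.
    in_degree = Counter(
        edge["target"] for edge in graph["edges"] if edge["target"] in titles
    )
    return [
        (paper_id, titles[paper_id], count)
        for paper_id, count in in_degree.most_common(top_n)
    ]
```

The same edge list can be passed to a graph library for centrality measures beyond simple in-degree.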


Pricing

Pay-per-result, fully transparent:

| Event | Price |
| --- | --- |
| Actor start | $0.01 |
| Per paper scraped | $0.003 |

Examples

| Papers scraped | Total cost |
| --- | --- |
| 100 | $0.31 |
| 500 | $1.51 |
| 1,000 | $3.01 |
| 5,000 | $15.01 |
| 10,000 | $30.01 |

Free Apify trial credit ($5) covers ~1,650 papers for evaluation.
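
The pricing works out to one start fee plus a per-paper fee. The sketch below reproduces that arithmetic for a single run; the function names are illustrative, and Decimal is used so the cent values stay exact.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")   # charged once per run
PER_PAPER = Decimal("0.003")    # charged per paper scraped


def estimate_cost(papers):
    """Cost in USD of one run that scrapes `papers` papers."""
    return ACTOR_START + PER_PAPER * papers


def max_papers_for_budget(budget):
    """Largest number of papers one run can scrape within `budget` USD."""
    remaining = Decimal(str(budget)) - ACTOR_START
    return max(0, int(remaining // PER_PAPER))
```

Under these assumptions, estimate_cost(1000) gives 3.01 and max_papers_for_budget(5) gives 1663, in line with the rounded ~1,650 figure above. Splitting the work across several runs adds one start fee per run.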


Scheduled Runs

Track your niche automatically:

  1. Open the actor → Schedule → New schedule
  2. Set a cron expression (e.g. 0 7 * * * for 7 AM daily)
  3. Fresh dataset each run
  4. Pipe into Google Sheets, Airtable, Slack, webhooks via Apify Integrations

FAQ

Is this scraper legal to use? Yes. It uses the official arXiv Atom API (the same one arXiv publishes for programmatic access) and stays within arXiv's published rate limit by waiting 3.5 seconds between paginated requests. No DOM scraping, no rate-limit circumvention.

Do I need to configure a proxy? No. Proxies are included and configured automatically. The proxyConfiguration field is optional and only used if you want to override with your own.

How accurate is the citation data? Citation counts come from Semantic Scholar's Academic Graph API, which is the same dataset used by major research tools. It covers most papers from 2000 onward; very recent papers (last 1-2 weeks) may not yet be indexed.

Why does Semantic Scholar enrichment need an API key? Without a key, Semantic Scholar's free tier rate-limits aggressively (HTTP 429 within seconds). With a free API key (get one at semanticscholar.org/product/api), enrichment runs at full speed. For small enrichment runs (under ~20 papers), you can skip the key.

How many papers can I scrape per run? Up to 10,000 per query (arXiv API hard limit). Run multiple queries to scale further — they're auto-deduplicated by paper ID.
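
If you merge results from several separate runs yourself, the same idea applies. A minimal sketch of deduplication by arxivId, keeping the first record seen for each ID (the helper name is hypothetical):

```python
def merge_deduplicated(result_sets):
    """Merge lists of paper records, keeping the first record per arxivId."""
    seen = set()
    merged = []
    for papers in result_sets:
        for paper in papers:
            if paper["arxivId"] not in seen:
                seen.add(paper["arxivId"])
                merged.append(paper)
    return merged
```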

Why is my timeline showing only the last month? When sortOrder: descending is combined with a small maxResults and a wide date range, you get the newest N papers, which in active categories (cs.LG, cs.CL) may all fall within a few weeks. For long-range timelines, use sortOrder: ascending and/or increase maxResults to 5,000–10,000.

Does it handle old arXiv ID formats like astro-ph/0408219? Yes. Both modern (2606.11107) and pre-2007 legacy formats are parsed correctly. The slash in paperId is preserved verbatim.
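
For downstream code that needs to normalize IDs itself, the two formats can be recognized with a pair of regular expressions. This is a sketch of one possible parser, not the actor's own; the pattern and function names are illustrative. It accepts a bare ID or an abs/pdf URL and strips any version suffix.

```python
import re

# Modern IDs: YYMM.NNNN or YYMM.NNNNN, optional version (e.g. 2606.11107v1).
MODERN_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?$")
# Legacy IDs: archive[.SUBJECT]/YYMMNNN, optional version (e.g. astro-ph/0408219).
LEGACY_ID = re.compile(r"([a-z-]+(?:\.[A-Z]{2})?/\d{7})(v\d+)?$")


def canonical_arxiv_id(text):
    """Return the version-less arXiv ID found at the end of `text`, or None."""
    text = text.strip()
    if text.endswith(".pdf"):
        text = text[:-4]
    # Try the legacy pattern first so the slash in the ID is preserved.
    for pattern in (LEGACY_ID, MODERN_ID):
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
```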

Can I filter by multiple categories? Yes. Categories are OR-combined: ["cs.AI", "cs.LG"] returns papers in either category. Combine with a text query and a date range for narrow scopes.
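
For readers curious what such a filter looks like at the arXiv API level, the sketch below builds a search_query string with OR-combined categories, AND-combined query words, and a submittedDate range. How the actor builds its request internally is not documented here, so treat this as an assumption-laden illustration: the helper names are hypothetical, and the sketch adds a date clause only when both dates are given.

```python
from urllib.parse import urlencode

# Mapping from this actor's searchField values to arXiv API field prefixes.
FIELD_PREFIX = {"all": "all", "title": "ti", "abstract": "abs", "author": "au", "category": "cat"}


def build_search_query(query="", search_field="all", categories=(), date_from=None, date_to=None):
    """Build an arXiv API search_query string (hypothetical helper)."""
    clauses = []
    words = query.split()
    if words:
        prefix = FIELD_PREFIX[search_field]
        # Every word must match, so the words are AND-combined.
        clauses.append("(" + " AND ".join(f"{prefix}:{w}" for w in words) + ")")
    if categories:
        # A paper in any listed category qualifies, so categories are OR-combined.
        clauses.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    if date_from and date_to:
        start = date_from.replace("-", "") + "0000"  # YYYYMMDDHHMM
        end = date_to.replace("-", "") + "2359"
        clauses.append(f"submittedDate:[{start} TO {end}]")
    return " AND ".join(clauses)


def build_url(search_query, start=0, max_results=100):
    """Full request URL for one page of results."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)
```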

What sort orders does arXiv actually support? submittedDate, relevance, and lastUpdatedDate. Note that arXiv may prepend featured / new submissions above strict ordering.

Where are author analytics / citation graph stored? Run → Storage tab → Key-Value Store → click AUTHOR_ANALYTICS, CATEGORY_STATS, TIMELINE, RUN_SUMMARY, or CITATION_GRAPH.


Support

Found a bug or want a new feature? Open an issue in the Issues tab on this actor's page. Response time is typically under 24 hours.

Maintained by brilliant_gum on Apify.