arXiv Scraper: Papers, Authors, Categories & Search

Pricing

$1.00 / 1,000 result items

Scrape arxiv.org via the official Atom API. Full-text search; lookup by author, title, or category; paper detail by id; latest papers in any category. Returns title, abstract, authors, DOI, and PDF link. No auth required. Pay only per result item.

Developer

Perconey

Maintained by Community

What does arXiv Scraper do?

arXiv Scraper pulls research papers from arxiv.org via the official Atom API. It fetches the latest papers in any category, runs free-text search, and looks up papers by author, title, or id, returning the full title, abstract, authors, DOI, journal reference, and PDF link. arxiv.org is the canonical preprint server for AI, ML, CS, math, physics, and quantitative biology, with over 2.4 million papers. The actor calls the documented public API directly: no browser and no auth. Requests go through the Apify proxy by default (see Tips).

Try it instantly: pick getLatestPapers, leave category cs.AI, click Start. You get the 30 newest AI papers (title, abstract, authors, PDF link) in under 5 seconds for $0.03.

Why use arXiv Scraper?

  • AI / ML researchers: Daily digest of new papers in your category. Schedule getLatestPapers for cs.AI / cs.CL / cs.LG and never miss a release.
  • Trend analysts: Track which sub-fields are accelerating. Combine getPapersByCategory with sortBy=submittedDate to see week-over-week paper-count deltas.
  • Recruiters / scouts: getPapersByAuthor returns everything a researcher has published on arXiv, with publication dates and co-authors. Ideal for hiring pipelines.
  • Content marketers in tech: Pull abstracts of trending papers and remix them into blog content / newsletters. The summary field is rich and license-friendly.
  • AI agent developers: Wire the actor into your knowledge pipeline so your agent always has the latest research summaries to ground on.
  • Academic librarians: Bulk-export your institution's authors. The actor paginates politely (3 s between batches per arXiv guidelines) so multi-thousand-result exports are safe.
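The week-over-week deltas mentioned for trend analysts can be computed from the published field of each result item. The following is a minimal sketch, assuming you have already loaded the dataset items as a list of dicts; the helper names weekly_counts and week_over_week are illustrative and not part of the actor.

```python
from collections import Counter
from datetime import datetime


def weekly_counts(items):
    """Count papers per ISO (year, week) using each item's `published` timestamp."""
    counts = Counter()
    for item in items:
        if item.get("_type") != "paper":
            continue  # skip error records
        # `published` looks like "2026-01-02T15:30:00Z" in the sample output
        ts = datetime.strptime(item["published"], "%Y-%m-%dT%H:%M:%SZ")
        iso = ts.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


def week_over_week(counts):
    """Return (week, count, delta) rows; delta compares with the previous week that has data."""
    rows = []
    previous = None
    for week in sorted(counts):
        delta = None if previous is None else counts[week] - previous
        rows.append((week, counts[week], delta))
        previous = counts[week]
    return rows
```

Weeks with no papers do not appear in the result, so a delta always compares against the previous week that had at least one paper.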

How to use arXiv Scraper

  1. Open the Input tab.
  2. Pick an action from the dropdown. getLatestPapers is the simplest starting point.
  3. For getLatestPapers, set category (default cs.AI). Use any arXiv category code like cs.CL, cs.LG, stat.ML, math.OC, q-bio.QM.
  4. For search / by-author / by-title / by-category / paper-detail actions, fill queries.
  5. Tune maxItems (default 30).
  6. Click Start.

Query format by action

Action | Query format
getLatestPapers | leave empty (use category field)
searchPapers | free-text (e.g. large language model)
getPapersByAuthor | author surname (e.g. Bengio, LeCun, Hinton)
getPapersByCategory | arXiv category code (e.g. cs.AI)
getPapersByTitle | exact title phrase (e.g. attention is all you need)
getPaperDetail | arXiv id (e.g. 2501.00001 or 2501.00001v2)
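For orientation, here is roughly how each action could translate into a request against the public arXiv API. This is a sketch under assumptions, not the actor's actual code: the action-to-prefix mapping and the build_url helper are hypothetical, while the endpoint, the all / au / cat / ti field prefixes, id_list, start, max_results, sortBy and sortOrder are parameters of the arXiv API itself.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"

# Assumed mapping from actor actions to arXiv search-field prefixes.
FIELD_PREFIX = {
    "searchPapers": "all",         # all fields
    "getPapersByAuthor": "au",     # author
    "getPapersByCategory": "cat",  # subject category
    "getPapersByTitle": "ti",      # title
}


def build_url(action, query="", category="cs.AI", start=0,
              max_results=30, sort_by="submittedDate"):
    """Build an arXiv API URL for one action and one query string."""
    params = {"start": start, "max_results": max_results}
    if action == "getPaperDetail":
        # Paper lookups use id_list instead of a search query.
        params["id_list"] = query
    else:
        if action == "getLatestPapers":
            search = f"cat:{category}"
        elif action in FIELD_PREFIX:
            # Assumption: title queries are wrapped in quotes to match a phrase.
            value = f'"{query}"' if action == "getPapersByTitle" else query
            search = f"{FIELD_PREFIX[action]}:{value}"
        else:
            raise ValueError(f"unknown action: {action}")
        params["search_query"] = search
        params["sortBy"] = sort_by
        params["sortOrder"] = "descending"
    return f"{API_URL}?{urlencode(params)}"
```

For example, build_url("getLatestPapers") yields a URL that asks for the 30 most recently submitted cs.AI papers.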

Input

Field | Required | Description
action | yes | Which lookup. Six options.
queries | sometimes | Required for all actions except getLatestPapers.
category | no | getLatestPapers only. arXiv category code. Default cs.AI.
maxItems | no | Max items per query. Default 30. arXiv API caps a single call at 30,000 - we paginate in batches of 100 with the recommended 3 s delay.
sortBy | no | submittedDate (default), relevance, or lastUpdatedDate.
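If you build the input programmatically, the rules in the table can be checked before starting a run. The sketch below is illustrative only: make_input is a hypothetical helper, and it assumes queries is a list of strings, which the table does not state explicitly.

```python
ACTIONS = {
    "getLatestPapers", "searchPapers", "getPapersByAuthor",
    "getPapersByCategory", "getPapersByTitle", "getPaperDetail",
}
SORT_OPTIONS = {"submittedDate", "relevance", "lastUpdatedDate"}


def make_input(action, queries=None, category="cs.AI",
               max_items=30, sort_by="submittedDate"):
    """Return an input dict that follows the field rules in the Input table."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if sort_by not in SORT_OPTIONS:
        raise ValueError(f"unknown sortBy: {sort_by}")
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")

    run_input = {"action": action, "maxItems": max_items, "sortBy": sort_by}
    if action == "getLatestPapers":
        # category only applies to getLatestPapers
        run_input["category"] = category
    else:
        if not queries:
            raise ValueError(f"queries is required for {action}")
        # Assumption: queries is a list of strings, one lookup per entry.
        run_input["queries"] = list(queries)
    return run_input
```

make_input("getPapersByAuthor", ["Bengio"]) produces a dict you can paste into the Input tab as JSON.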

Output

Every item carries _type=paper (or error) plus _action.

{
  "_type": "paper",
  "_action": "getLatestPapers",
  "arxiv_id": "2501.00001v1",
  "version": 1,
  "title": "Toward Foundation Models for Cell-Level Biology",
  "summary": "We present a new family of foundation models for single-cell genomics ...",
  "authors": ["Jane Doe", "John Smith", "Alex Researcher"],
  "author_count": 3,
  "categories": ["q-bio.QM", "cs.LG"],
  "primary_category": "q-bio.QM",
  "published": "2026-01-02T15:30:00Z",
  "updated": "2026-01-08T09:12:00Z",
  "doi": null,
  "journal_ref": null,
  "comment": "https://github.com/lab/foundation-cells",
  "pdf_url": "https://arxiv.org/pdf/2501.00001v1",
  "abs_url": "https://arxiv.org/abs/2501.00001v1"
}

You can download the dataset in JSON, CSV, XML, Excel, RSS or HTML format from the Output tab.
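Because every item carries _type, a downloaded JSON export can be split into papers and errors in a few lines. A minimal sketch, assuming the export is a JSON array of items; split_items is an illustrative name, and no assumption is made about which fields an error item contains beyond _type.

```python
import json


def split_items(raw_json):
    """Split a JSON dataset export into (papers, errors) by the _type field."""
    items = json.loads(raw_json)
    papers = [item for item in items if item.get("_type") == "paper"]
    errors = [item for item in items if item.get("_type") == "error"]
    return papers, errors
```

Checking the errors list before processing the papers makes it easier to spot a query that failed partway through a run.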

Data fields

Type | Key fields
paper | arxiv_id, version, title, summary, authors, author_count, categories, primary_category, published, updated, doi, journal_ref, comment, pdf_url, abs_url

Pricing

Pay-per-result: $0.001 per paper. No flat monthly fee.

Cost examples:

  • Daily 30 newest cs.AI papers: $0.03
  • 1,000 papers by an author: $1.00
  • 5,000 cs.CL papers from the last year for a literature review: $5.00
  • One paper detail lookup: $0.001
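The cost examples above all follow from the per-paper price. As a quick check, here is a small sketch that reproduces them; estimate_cost is an illustrative helper, and it uses Decimal to avoid floating-point rounding in dollar amounts.

```python
from decimal import Decimal

PRICE_PER_PAPER = Decimal("0.001")  # USD, from the pay-per-result price above


def estimate_cost(paper_count):
    """Return the estimated cost in USD for a number of result items."""
    return PRICE_PER_PAPER * paper_count
```

estimate_cost(30) gives the $0.03 daily figure and estimate_cost(5000) gives the $5.00 literature-review figure.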

Tips

  • Proxy is enabled by default. arXiv aggressively rate-limits per outbound IP, and the Apify cloud egress pool is shared across many users, so hitting arXiv from a single IP gets you a 429 within seconds. The actor uses the Apify proxy by default to rotate IPs per request. Disable it via proxyConfiguration.useApifyProxy: false only if you're confident your own IP won't be rate-limited.
  • Pagination is rate-limited. arXiv asks for 3 s between requests, and the actor fetches 100 papers per request, so 30,000 papers need 300 requests and take ~15 minutes wall-clock minimum. Plan timeouts accordingly.
  • Category codes are case-sensitive. Use the arXiv taxonomy: https://arxiv.org/category_taxonomy. Common ones: cs.AI, cs.CL (NLP), cs.CV (Vision), cs.LG (ML), stat.ML.
  • Author search matches surnames. Bengio returns Yoshua + Samy + others. Use full names with quotes for disambiguation: "Yoshua Bengio".
  • Comment field often has GitHub links. arxiv:comment is where authors typically paste their code-repo URL. Useful for crawling implementations.
  • Versions matter. A paper id like 2501.00001 returns the latest version. Pin to a specific revision with 2501.00001v2.

FAQ, disclaimers, support

Is this legal? The actor calls arxiv.org's official documented public API, identifies itself with a clear User-Agent, and honors the recommended 3 s inter-request delay. arXiv explicitly supports automated access.

Why is pagination slow? arXiv asks API clients to wait 3 s between requests. We honor that. For large pulls, schedule the actor overnight.
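To plan a timeout, you can estimate the minimum time spent waiting between requests. The sketch below assumes the batch size of 100 and the 3 s delay stated on this page; it counts only the pauses between requests and ignores the time each request itself takes, so real runs will be somewhat longer.

```python
import math


def min_wait_seconds(total_items, batch_size=100, delay_s=3.0):
    """Lower bound on wall-clock time spent in delays between paginated requests."""
    requests_needed = math.ceil(total_items / batch_size)
    # One delay sits between each pair of consecutive requests.
    return max(requests_needed - 1, 0) * delay_s
```

For 30,000 papers this comes to 897 seconds, just under 15 minutes, while a 30-paper run needs a single request and no waiting.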

What about citation counts? arXiv does not expose citation counts via its API. For citation metrics you would need Semantic Scholar (which has a public API) or Google Scholar (which does not). Open an issue if this matters for your use case.

What about the full paper text? The actor returns the abstract plus a PDF link. To get the full text, download the PDF via the pdf_url field.
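Fetching the PDF behind pdf_url can be done with the standard library alone. A minimal sketch, assuming pdf_url has the shape shown in the sample output; the helper names and the User-Agent string are placeholders, and extracting text from the saved PDF would need a separate PDF library that is not shown here.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def pdf_filename(pdf_url):
    """Derive a local file name such as 2501.00001v1.pdf from a pdf_url."""
    name = urlsplit(pdf_url).path.rsplit("/", 1)[-1]
    return name if name.endswith(".pdf") else name + ".pdf"


def download_pdf(pdf_url, dest_dir="."):
    """Download one PDF into dest_dir and return the path of the saved file."""
    target = Path(dest_dir) / pdf_filename(pdf_url)
    request = urllib.request.Request(
        pdf_url,
        headers={"User-Agent": "example-pdf-fetcher/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        target.write_bytes(response.read())
    return target
```

If you download many PDFs in a row, pause between requests in the same spirit as the 3 s delay used for API calls.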

Bug or feature request? Open an Issue on the actor's Issues tab. I usually respond within a day.

Need a scraper for Hacker News, Stack Overflow, dev.to, Lemmy, Mastodon, Bluesky, Substack? See my other actors at https://apify.com/perconey.