arXiv Research Paper Scraper

Pricing

from $3.00 / 1,000 results

Scrape research papers from arXiv.org - search by query, category, or author; lookup by arXiv ID. Returns title, authors, abstract, PDF URL, DOI, categories, and more. Uses the public arXiv Atom API. No login or proxy required.

Developer

Crawler Bros

Maintained by Community

Scrape research papers from arXiv.org — the world's largest preprint repository with 2M+ papers in physics, mathematics, computer science, quantitative biology, economics, and more.

Uses the official arXiv Atom API (http://export.arxiv.org/api/query). No login, no API key, no proxy required.


Features

  • Search papers by free-text keyword (e.g. "neural networks", "quantum computing")
  • Browse by category — 25+ subject categories (cs.AI, math.CO, physics.optics, etc.)
  • Get papers by author — all papers by a specific researcher
  • Fetch specific papers by arXiv ID (e.g. 1706.03762 for "Attention Is All You Need")
  • Full metadata — title, authors, abstract, categories, PDF URL, DOI, journal reference, comment
  • Pagination — retrieve up to 2,000 papers per run
  • Rate-limited: waits 0.5 s between requests (arXiv's API terms of use ask for no more than one request every 3 seconds)
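The pagination and rate-limiting features can be pictured as a plan of `(start, max_results)` pairs, one per request. The following is a minimal sketch, not the actor's actual code: `page_plan` and the page size of 100 are hypothetical, while `start` and `max_results` are the arXiv API's own paging parameters.

```python
from typing import Iterator, Tuple

ARXIV_MAX_PAGE_SIZE = 2000   # arXiv returns at most 2,000 results per response
REQUEST_DELAY_SECONDS = 0.5  # delay quoted in the feature list (assumed fixed)


def page_plan(max_items: int, page_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (start, max_results) pairs that together cover max_items results.

    page_size=100 is an illustrative choice, not a documented actor setting.
    """
    if not 1 <= max_items <= 2000:
        raise ValueError("max_items must be between 1 and 2000")
    if not 1 <= page_size <= ARXIV_MAX_PAGE_SIZE:
        raise ValueError("page_size must be between 1 and 2000")
    start = 0
    while start < max_items:
        # The last page may be shorter than page_size.
        count = min(page_size, max_items - start)
        yield start, count
        start += count


if __name__ == "__main__":
    for start, count in page_plan(250):
        print(f"start={start} max_results={count}")
```

A fetch loop built on this plan would call `time.sleep(REQUEST_DELAY_SECONDS)` between consecutive pages.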

Input Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| mode | Select | searchPapers, getByCategory, getByAuthor, getById | searchPapers |
| query | String | Free-text query for searchPapers mode | "neural networks" |
| category | Select | Subject category for getByCategory mode | cs.AI |
| authorName | String | Author name for getByAuthor mode | |
| arxivIds | Array | List of arXiv IDs for getById mode | |
| sortBy | Select | relevance, submittedDate, lastUpdatedDate | submittedDate |
| sortOrder | Select | ascending, descending | descending |
| maxItems | Integer | Maximum papers to return (1–2000) | 50 |
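Before starting a run it can help to check an input object against the parameters above. The sketch below applies the listed defaults and allowed values. `normalize_input` is a hypothetical helper, and the rule that each mode requires its matching parameter is an assumption drawn from the descriptions, not a documented behavior of the actor.

```python
from typing import Any, Dict

MODES = {"searchPapers", "getByCategory", "getByAuthor", "getById"}
SORT_BY = {"relevance", "submittedDate", "lastUpdatedDate"}
SORT_ORDER = {"ascending", "descending"}

# Defaults as listed in the parameter table.
DEFAULTS: Dict[str, Any] = {
    "mode": "searchPapers",
    "query": "neural networks",
    "category": "cs.AI",
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "maxItems": 50,
}

# Assumption: each mode needs the parameter its description refers to.
REQUIRED_BY_MODE = {
    "searchPapers": "query",
    "getByCategory": "category",
    "getByAuthor": "authorName",
    "getById": "arxivIds",
}


def normalize_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Merge defaults into raw input and reject values outside the table."""
    merged = {**DEFAULTS, **raw}
    if merged["mode"] not in MODES:
        raise ValueError(f"unknown mode: {merged['mode']!r}")
    if merged["sortBy"] not in SORT_BY:
        raise ValueError(f"unknown sortBy: {merged['sortBy']!r}")
    if merged["sortOrder"] not in SORT_ORDER:
        raise ValueError(f"unknown sortOrder: {merged['sortOrder']!r}")
    required = REQUIRED_BY_MODE[merged["mode"]]
    if not merged.get(required):
        raise ValueError(f"mode {merged['mode']!r} requires {required!r}")
    max_items = merged["maxItems"]
    if isinstance(max_items, bool) or not isinstance(max_items, int):
        raise ValueError("maxItems must be an integer")
    if not 1 <= max_items <= 2000:
        raise ValueError("maxItems must be between 1 and 2000")
    return merged
```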

Output Fields

Each record contains:

| Field | Type | Description |
| --- | --- | --- |
| arxivId | String | arXiv paper ID (e.g. 2401.12345) |
| title | String | Paper title |
| abstract | String | Full abstract |
| authors | Array | List of author names |
| categories | Array | Subject categories (e.g. ["cs.AI", "cs.LG"]) |
| published | String | Original submission date (ISO 8601) |
| updated | String | Last update date (ISO 8601) |
| pdfUrl | String | Direct PDF download link |
| abstractUrl | String | arXiv abstract page URL |
| doi | String | DOI if assigned to published version |
| journalRef | String | Journal reference if published |
| comment | String | Author comment (e.g. "15 pages, 5 figures") |
| scrapedAt | String | ISO 8601 timestamp of when the record was scraped |
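To illustrate where these fields come from, here is a hedged sketch of mapping one Atom `<entry>` from the arXiv API onto a record of this shape. `entry_to_record` is a hypothetical helper, the sample feed is hand-written with placeholder authors and dates, and stripping the version suffix from the ID is an assumption based on the `arxivId` example above. The two XML namespaces are the ones the arXiv API uses.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Any, Dict, Optional

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def _text(node: ET.Element, path: str) -> Optional[str]:
    """Return whitespace-normalized text of a child element, or None."""
    child = node.find(path, NS)
    if child is None or child.text is None:
        return None
    return " ".join(child.text.split())


def entry_to_record(entry: ET.Element) -> Dict[str, Any]:
    abstract_url = _text(entry, "atom:id") or ""
    versioned_id = abstract_url.rsplit("/abs/", 1)[-1]
    pdf_links = [
        link.get("href")
        for link in entry.findall("atom:link", NS)
        if link.get("title") == "pdf"
    ]
    return {
        # Assumption: the record drops the trailing version (v1, v2, ...).
        "arxivId": re.sub(r"v\d+$", "", versioned_id),
        "title": _text(entry, "atom:title"),
        "abstract": _text(entry, "atom:summary"),
        "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "published": _text(entry, "atom:published"),
        "updated": _text(entry, "atom:updated"),
        "pdfUrl": pdf_links[0] if pdf_links else None,
        "abstractUrl": abstract_url or None,
        "doi": _text(entry, "arxiv:doi"),
        "journalRef": _text(entry, "arxiv:journal_ref"),
        "comment": _text(entry, "arxiv:comment"),
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
    }


# Hand-written sample entry; author names, dates, and abstract are placeholders.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v5</id>
    <updated>2001-02-03T04:05:06Z</updated>
    <published>2001-01-01T00:00:00Z</published>
    <title>Attention Is All You Need</title>
    <summary>  A placeholder abstract
      spanning two lines.</summary>
    <author><name>Author One</name></author>
    <author><name>Author Two</name></author>
    <arxiv:comment>15 pages, 5 figures</arxiv:comment>
    <link href="http://arxiv.org/abs/1706.03762v5" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/1706.03762v5" rel="related" type="application/pdf"/>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    first_entry = ET.fromstring(SAMPLE).find("atom:entry", NS)
    print(entry_to_record(first_entry))
```

Fields that arXiv omits for a given paper (commonly `doi` and `journalRef`) come back as `None` in this sketch.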

Supported Categories

Browse all papers in a subject area:

| Category | Description |
| --- | --- |
| cs.AI | Artificial Intelligence |
| cs.LG | Machine Learning |
| cs.CV | Computer Vision |
| cs.CL | Computation & Language (NLP) |
| cs.SE | Software Engineering |
| cs.CR | Cryptography & Security |
| math.CO | Combinatorics |
| math.ST | Statistics Theory |
| physics.optics | Optics |
| q-bio.GN | Genomics |
| econ.EM | Econometrics |
| stat.ML | Machine Learning (Statistics) |
| astro-ph.GA | Astrophysics of Galaxies |
| ...and more | See input schema for full list |

Example Use Cases

Search for recent AI papers

{
  "mode": "searchPapers",
  "query": "large language models",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxItems": 100
}

Browse computer vision papers

{
  "mode": "getByCategory",
  "category": "cs.CV",
  "maxItems": 50
}

Get papers by a specific author

{
  "mode": "getByAuthor",
  "authorName": "LeCun",
  "maxItems": 30
}

Fetch specific papers by ID

{
  "mode": "getById",
  "arxivIds": ["1706.03762", "2005.14165", "2303.08774"]
}
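For readers curious how inputs like these could translate into requests against the arXiv API, the sketch below builds a query URL for each mode. It is a guess at the mapping, not the actor's confirmed behavior: `build_query_url` is hypothetical, and wrapping free text and author names in quotes is an assumption. The parameter names (`search_query`, `id_list`, `start`, `max_results`, `sortBy`, `sortOrder`) and field prefixes (`all:`, `cat:`, `au:`) are the arXiv API's own.

```python
from typing import Any, Dict
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def build_query_url(run_input: Dict[str, Any], start: int = 0,
                    max_results: int = 100) -> str:
    """Translate an actor-style input dict into one arXiv API request URL."""
    mode = run_input.get("mode", "searchPapers")
    params: Dict[str, Any] = {"start": start, "max_results": max_results}

    if mode == "getById":
        # ID lookups use id_list; sorting does not apply here.
        params["id_list"] = ",".join(run_input["arxivIds"])
        return f"{API_URL}?{urlencode(params)}"

    if mode == "searchPapers":
        # Assumption: free text is searched as a phrase across all fields.
        params["search_query"] = f'all:"{run_input["query"]}"'
    elif mode == "getByCategory":
        params["search_query"] = f'cat:{run_input["category"]}'
    elif mode == "getByAuthor":
        params["search_query"] = f'au:"{run_input["authorName"]}"'
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    params["sortBy"] = run_input.get("sortBy", "submittedDate")
    params["sortOrder"] = run_input.get("sortOrder", "descending")
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url({"mode": "getByCategory", "category": "cs.CV"}))
    print(build_query_url({"mode": "getById", "arxivIds": ["1706.03762"]}))
```

`urlencode` handles the escaping of colons, quotes, and spaces, so the search expression can be written in its readable form.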

Data Source

Data is fetched from the official arXiv API (export.arxiv.org), which is freely accessible without registration or authentication.

arXiv is operated by Cornell University and serves as the primary preprint repository for physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, and economics.


FAQs

Does this require an API key? No. The arXiv API is publicly accessible without authentication.

Is there a rate limit? This actor automatically waits 0.5 seconds between requests. Note that arXiv's API terms of use ask clients to make no more than one request every three seconds.

How many papers can I scrape? Up to 2,000 papers per run. arXiv's API supports up to 30,000 results per query, but response times increase with page depth.

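Can I get papers from a specific date range? Use the searchPapers mode with arXiv's query syntax, which expects timestamps in YYYYMMDDHHMM form (GMT): all:"neural networks" AND submittedDate:[202401010000 TO 202412312359]. When placing this in the JSON input, escape the inner double quotes.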
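A small helper can build such a date-filtered query string from `datetime` values. This is a sketch: `date_range_query` is a hypothetical name, and it assumes the actor passes the `query` value through to arXiv unchanged, as the answer above implies. The arXiv API documents `submittedDate` bounds as `YYYYMMDDHHMM` timestamps in GMT.

```python
from datetime import datetime

ARXIV_STAMP = "%Y%m%d%H%M"  # YYYYMMDDHHMM, interpreted by arXiv as GMT


def date_range_query(terms: str, start: datetime, end: datetime) -> str:
    """Combine a phrase search with a submittedDate range filter."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    window = f"[{start.strftime(ARXIV_STAMP)} TO {end.strftime(ARXIV_STAMP)}]"
    return f'all:"{terms}" AND submittedDate:{window}'


if __name__ == "__main__":
    print(date_range_query("neural networks",
                           datetime(2024, 1, 1),
                           datetime(2024, 12, 31, 23, 59)))
```

Passing the result as the `query` field through `json.dumps` takes care of escaping the embedded double quotes.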

Are preprints included? Yes. arXiv is a preprint server, so most papers appear there before or alongside journal publication.

What's the difference between published and updated? published is when the paper was first submitted to arXiv. updated is when the latest version was uploaded.
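One practical use of the two timestamps is spotting papers that were revised after their first submission. The sketch below assumes both fields are ISO 8601 strings of the form arXiv returns (for example `2001-01-01T00:00:00Z`); `was_revised` and the sample records are hypothetical.

```python
from datetime import datetime
from typing import Dict


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    # datetime.fromisoformat rejects 'Z' before Python 3.11, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def was_revised(record: Dict[str, str]) -> bool:
    """True when a later version was uploaded after the first submission."""
    return parse_timestamp(record["updated"]) > parse_timestamp(record["published"])


if __name__ == "__main__":
    # Placeholder records for illustration only.
    original = {"published": "2001-01-01T00:00:00Z", "updated": "2001-01-01T00:00:00Z"}
    revised = {"published": "2001-01-01T00:00:00Z", "updated": "2001-02-03T04:05:06Z"}
    print(was_revised(original), was_revised(revised))
```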