ArXiv Academic Paper Scraper

Pricing

from $1.00 / 1,000 results

Scrape academic papers from arXiv. Extract titles, authors, abstracts, categories, and PDF links. Useful for research and literature reviews.

Developer

Fortuitous Pirate

Maintained by Community

arXiv Papers Scraper

Scrapes academic research papers through the arXiv.org API, which covers 2.3M+ papers in physics, math, CS, and more.

Features

  • Search across 2.3M+ academic papers
  • Filter by category, author, or search terms
  • Configurable sorting and pagination
  • Extracts full metadata including authors, categories, and DOI
  • Respects arXiv rate limits automatically

API Information

| Property | Value |
| --- | --- |
| API Source | http://export.arxiv.org/api/query |
| API Key | Not required |
| Rate Limits | 3 seconds between requests (automatically enforced) |
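
To make the endpoint above concrete, here is a minimal Python sketch of how a request URL for it could be assembled. The query parameter names (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) are those of the public arXiv API; the helper name `build_query_url` is hypothetical, and this is not necessarily how the Actor builds its requests internally.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=100,
                    sort_by="relevance", sort_order="descending"):
    """Return a full arXiv API URL (hypothetical helper, for illustration)."""
    params = {
        "search_query": search_query,  # e.g. 'all:"quantum computing"'
        "start": start,                # zero-based offset of the first result
        "max_results": max_results,    # page size for this request
        "sortBy": sort_by,             # relevance | lastUpdatedDate | submittedDate
        "sortOrder": sort_order,       # ascending | descending
    }
    # urlencode percent-escapes quotes, colons, and spaces in the query.
    return f"{API_URL}?{urlencode(params)}"


url = build_query_url('all:"quantum computing"', max_results=10)
print(url)
```

No API key or authentication header is involved, so the URL alone is enough to fetch an Atom feed of results.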

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searchQuery | string | "machine learning" | Search query (e.g., 'quantum computing', 'neural networks') |
| category | string | - | arXiv category filter (e.g., cs.AI, cs.LG, physics, math, q-bio) |
| author | string | - | Filter by author name |
| sortBy | enum | "relevance" | Sort results by: relevance, lastUpdatedDate, submittedDate |
| sortOrder | enum | "descending" | Sort order: descending, ascending |
| limit | integer | 100 | Maximum number of papers to return (max: 10000) |
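
The defaults and allowed values in the table above can be checked before a run is started. The following Python sketch is one way to do that on the caller's side; `normalize_input` is a hypothetical helper, and the lower bound of 1 for `limit` is an assumption (the table only states the maximum).

```python
ALLOWED_SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
ALLOWED_SORT_ORDER = {"descending", "ascending"}
MAX_LIMIT = 10000

# Defaults as documented in the parameter table.
DEFAULTS = {
    "searchQuery": "machine learning",
    "sortBy": "relevance",
    "sortOrder": "descending",
    "limit": 100,
}


def normalize_input(raw):
    """Fill in documented defaults and reject values outside the documented sets."""
    merged = {**DEFAULTS, **raw}
    if merged["sortBy"] not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if merged["sortOrder"] not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")
    limit = merged["limit"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValueError("limit must be an integer")
    if not 1 <= limit <= MAX_LIMIT:  # lower bound of 1 is an assumption
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    return merged


print(normalize_input({"category": "cs.LG", "limit": 500}))
```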

Example Input

{
  "searchQuery": "transformer neural network",
  "category": "cs.LG",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "limit": 500
}
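
For readers curious how such an input might map onto the arXiv API, the sketch below combines the three filters into a single `search_query` expression using arXiv's field prefixes (`all:`, `cat:`, `au:`) and the `AND` operator. This mapping is an assumption made for illustration; the Actor's actual translation is not documented here, and `build_search_expression` is a hypothetical name.

```python
def build_search_expression(search_query=None, category=None, author=None):
    """Combine the optional filters into one arXiv search_query string."""
    clauses = []
    if search_query:
        # Quoting makes arXiv treat the terms as a phrase (an assumption
        # about the desired behavior, not a documented Actor detail).
        clauses.append(f'all:"{search_query}"')
    if category:
        clauses.append(f"cat:{category}")
    if author:
        clauses.append(f'au:"{author}"')
    return " AND ".join(clauses)


expression = build_search_expression(
    search_query="transformer neural network",
    category="cs.LG",
)
print(expression)
```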

Output Fields

Each scraped paper includes the following fields:

| Field | Type | Description |
| --- | --- | --- |
| arxivId | string | Unique arXiv identifier (e.g., "2301.12345") |
| title | string | Paper title |
| summary | string | Abstract/summary of the paper |
| authors | array | List of author names |
| categories | array | All arXiv categories the paper belongs to |
| primaryCategory | string | Primary arXiv category |
| published | string | Initial publication date (ISO 8601) |
| updated | string | Last update date (ISO 8601) |
| comment | string | Author comments (page count, figures, etc.) |
| journalRef | string | Journal reference if published |
| doi | string | Digital Object Identifier if available |
| pdfUrl | string | Direct link to PDF |
| abstractUrl | string | Link to abstract page |
| scrapedAt | string | Timestamp when the data was scraped |
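
The arXiv API returns results as an Atom feed, so each of the fields above has to be read out of an `<entry>` element. The Python sketch below shows one plausible mapping using only the standard library. The sample entry is hand-written for illustration, `parse_entry` is a hypothetical helper, and details such as stripping the version suffix from the ID are assumptions rather than confirmed Actor behavior.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# XML namespaces used in arXiv's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written sample entry (not real API output).
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/2301.12345v1</id>
  <updated>2023-02-20T08:30:00Z</updated>
  <published>2023-01-15T12:00:00Z</published>
  <title>Attention Is All You Need: A Survey</title>
  <summary>This paper surveys the transformer
    architecture and its applications.</summary>
  <author><name>John Smith</name></author>
  <author><name>Jane Doe</name></author>
  <arxiv:comment>15 pages, 5 figures</arxiv:comment>
  <link title="pdf" href="http://arxiv.org/pdf/2301.12345v1" rel="related"/>
  <arxiv:primary_category term="cs.LG"/>
  <category term="cs.LG"/>
  <category term="cs.AI"/>
</entry>"""


def _text(node, path):
    """Return whitespace-normalized text of a child element, or None."""
    child = node.find(path, NS)
    if child is None or not child.text:
        return None
    return " ".join(child.text.split())


def parse_entry(entry):
    """Map one Atom <entry> element to the documented output fields."""
    abstract_url = _text(entry, "atom:id")
    # Keep the part after /abs/ and drop a trailing version such as "v1".
    arxiv_id = re.sub(r"v\d+$", "", abstract_url.rsplit("/abs/", 1)[-1])
    pdf_link = entry.find("atom:link[@title='pdf']", NS)
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "arxivId": arxiv_id,
        "title": _text(entry, "atom:title"),
        "summary": _text(entry, "atom:summary"),
        "authors": [_text(a, "atom:name") for a in entry.findall("atom:author", NS)],
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "primaryCategory": primary.get("term") if primary is not None else None,
        "published": _text(entry, "atom:published"),
        "updated": _text(entry, "atom:updated"),
        "comment": _text(entry, "arxiv:comment"),
        "journalRef": _text(entry, "arxiv:journal_ref"),
        "doi": _text(entry, "arxiv:doi"),
        # Taken from the feed as-is, so it may carry a version suffix.
        "pdfUrl": pdf_link.get("href") if pdf_link is not None else None,
        "abstractUrl": abstract_url,
        "scrapedAt": datetime.now(timezone.utc).isoformat(),
    }


paper = parse_entry(ET.fromstring(SAMPLE_ENTRY))
print(paper["arxivId"], paper["authors"], paper["primaryCategory"])
```

Optional elements such as the DOI or journal reference are simply absent from many entries, which is why the sketch returns `None` for them instead of raising.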

Example Output

{
  "arxivId": "2301.12345",
  "title": "Attention Is All You Need: A Survey",
  "summary": "This paper surveys the transformer architecture and its applications...",
  "authors": ["John Smith", "Jane Doe"],
  "categories": ["cs.LG", "cs.AI", "cs.CL"],
  "primaryCategory": "cs.LG",
  "published": "2023-01-15T12:00:00Z",
  "updated": "2023-02-20T08:30:00Z",
  "comment": "15 pages, 5 figures",
  "journalRef": "Nature Machine Intelligence, 2023",
  "doi": "10.1234/example.2023.12345",
  "pdfUrl": "http://arxiv.org/pdf/2301.12345",
  "abstractUrl": "http://arxiv.org/abs/2301.12345",
  "scrapedAt": "2024-01-15T10:30:00.000Z"
}

arXiv Categories

Common category codes for filtering:

Computer Science

  • cs.AI - Artificial Intelligence
  • cs.LG - Machine Learning
  • cs.CL - Computation and Language (NLP)
  • cs.CV - Computer Vision
  • cs.NE - Neural and Evolutionary Computing
  • cs.RO - Robotics

Physics

  • physics - All physics
  • hep-th - High Energy Physics - Theory
  • cond-mat - Condensed Matter
  • quant-ph - Quantum Physics

Mathematics

  • math - All mathematics
  • math.ST - Statistics Theory
  • math.OC - Optimization and Control

Other

  • q-bio - Quantitative Biology
  • q-fin - Quantitative Finance
  • stat.ML - Machine Learning (Statistics)
  • eess - Electrical Engineering and Systems Science
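
The list mixes archive-level codes (such as `cs` or `math`) with subject-level codes (such as `cs.LG`). When post-processing scraped records, it can be handy to test whether a paper's `categories` array falls under either kind of code. The Python sketch below does this client-side; `in_category` is a hypothetical helper and says nothing about how the Actor or the arXiv API resolves archive-level filters.

```python
def in_category(paper_categories, wanted):
    """True if any category equals `wanted` or is a subject within it.

    "cs" matches "cs.LG"; "cs.LG" matches only "cs.LG".
    """
    prefix = wanted + "."
    return any(c == wanted or c.startswith(prefix) for c in paper_categories)


papers = [
    {"arxivId": "a", "categories": ["cs.LG", "stat.ML"]},
    {"arxivId": "b", "categories": ["math.OC"]},
    {"arxivId": "c", "categories": ["quant-ph"]},
]

# Keep only the records that belong to any computer science category.
cs_papers = [p["arxivId"] for p in papers if in_category(p["categories"], "cs")]
print(cs_papers)
```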

Usage Notes

  1. Rate Limiting: The scraper automatically waits 3 seconds between API requests as recommended by arXiv.

  2. Batch Size: Results are fetched in batches of 100 papers per request, a conservative page size for the arXiv API.

  3. Search Syntax: The search query is matched against all fields. For more specific searches, you can use arXiv's query syntax directly, for example field prefixes such as ti: (title) or au: (author).

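  4. Large Datasets: For limits above 1000, the 3-second delay between requests adds up. At 100 papers per request, the maximum limit of 10000 needs 100 requests, which means about 5 minutes of waiting alone.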
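
The batching and rate-limit notes above reduce to simple arithmetic. The Python sketch below plans the `(start, max_results)` pairs for a given limit and estimates the minimum time spent waiting between requests. The batch size of 100 and the 3-second delay come from the notes; the function names are hypothetical, and network time is not included.

```python
def plan_batches(limit, batch_size=100):
    """Return (start, max_results) pairs covering `limit` papers."""
    return [
        (start, min(batch_size, limit - start))
        for start in range(0, limit, batch_size)
    ]


def min_wait_seconds(limit, batch_size=100, delay=3.0):
    """Lower bound on waiting time: one delay between consecutive requests."""
    requests_needed = len(plan_batches(limit, batch_size))
    return max(requests_needed - 1, 0) * delay


print(plan_batches(250))        # three requests, the last one partial
print(min_wait_seconds(500))    # 5 requests, 4 waits
print(min_wait_seconds(10000))  # 100 requests, 99 waits
```

With these numbers, a limit of 500 waits at least 12 seconds and the maximum limit of 10000 waits at least 297 seconds, before any request latency is counted.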

Local Development

# Install dependencies
npm install
# Run locally with Apify CLI
apify run -i '{"searchQuery": "deep learning", "limit": 10}'

Resources