ArXiv Academic Paper Scraper
Pricing
from $1.00 / 1,000 results
Scrape academic papers from arXiv. Extract titles, authors, abstracts, categories, and PDF links. Essential for research and literature reviews.
Developer
Fortuitous Pirate
arXiv Papers Scraper
Scrapes academic research papers through the arXiv.org API, which covers 2.3M+ papers in physics, math, CS, and more.
Features
- Search across 2.3M+ academic papers
- Filter by category, author, or search terms
- Configurable sorting and pagination
- Extracts full metadata including authors, categories, and DOI
- Respects arXiv rate limits automatically
API Information
| Property | Value |
|---|---|
| API Source | http://export.arxiv.org/api/query |
| API Key | Not required |
| Rate Limits | 3 seconds between requests (automatically enforced) |
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQuery` | string | "machine learning" | Search query (e.g., 'quantum computing', 'neural networks') |
| `category` | string | - | arXiv category filter (e.g., cs.AI, cs.LG, physics, math, q-bio) |
| `author` | string | - | Filter by author name |
| `sortBy` | enum | "relevance" | Sort results by: relevance, lastUpdatedDate, submittedDate |
| `sortOrder` | enum | "descending" | Sort order: descending, ascending |
| `limit` | integer | 100 | Maximum number of papers to return (max: 10000) |
Example Input
{
  "searchQuery": "transformer neural network",
  "category": "cs.LG",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "limit": 500
}
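How the actor turns these inputs into an API request is not documented here. As a rough sketch, assuming it combines them with arXiv's standard field prefixes (`all:`, `cat:`, `au:`) and paging parameters, the example input could map to a query URL like this. The function `build_query_url` and its argument names are hypothetical and only illustrate the mapping.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"  # endpoint listed under API Information


def build_query_url(search_query="machine learning", category=None, author=None,
                    sort_by="relevance", sort_order="descending",
                    start=0, max_results=100):
    """Hypothetical mapping from the actor's input fields to an arXiv API URL."""
    # arXiv field prefixes: all: (every field), cat: (category), au: (author).
    clauses = [f'all:"{search_query}"']
    if category:
        clauses.append(f"cat:{category}")
    if author:
        clauses.append(f'au:"{author}"')
    query = {
        "search_query": " AND ".join(clauses),
        "start": start,              # offset of the first result in this batch
        "max_results": max_results,  # batch size for this request
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{API_URL}?{urlencode(query)}"


url = build_query_url("transformer neural network", category="cs.LG",
                      sort_by="submittedDate")
print(url)
```

Quoting the search phrase is a design choice in this sketch; whether the actor quotes it or passes the terms through unchanged is an assumption.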
Output Fields
Each scraped paper includes the following fields:
| Field | Type | Description |
|---|---|---|
| `arxivId` | string | Unique arXiv identifier (e.g., "2301.12345") |
| `title` | string | Paper title |
| `summary` | string | Abstract/summary of the paper |
| `authors` | array | List of author names |
| `categories` | array | All arXiv categories the paper belongs to |
| `primaryCategory` | string | Primary arXiv category |
| `published` | string | Initial publication date (ISO 8601) |
| `updated` | string | Last update date (ISO 8601) |
| `comment` | string | Author comments (page count, figures, etc.) |
| `journalRef` | string | Journal reference if published |
| `doi` | string | Digital Object Identifier if available |
| `pdfUrl` | string | Direct link to PDF |
| `abstractUrl` | string | Link to abstract page |
| `scrapedAt` | string | Timestamp when the data was scraped |
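The arXiv API returns an Atom feed, so each output record presumably comes from one `<entry>` element. The sketch below shows one way such an entry might be mapped to some of the fields above. The sample entry is illustrative, the helper names are hypothetical, and stripping the version suffix from the ID is an assumption based on the example identifier "2301.12345".

```python
import re
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Illustrative entry only; real feeds carry more elements and attributes.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/2301.12345v1</id>
  <updated>2023-02-20T08:30:00Z</updated>
  <published>2023-01-15T12:00:00Z</published>
  <title>Attention Is All You Need: A Survey</title>
  <author><name>John Smith</name></author>
  <author><name>Jane Doe</name></author>
  <arxiv:comment>15 pages, 5 figures</arxiv:comment>
  <link title="pdf" href="http://arxiv.org/pdf/2301.12345v1" rel="related"/>
  <arxiv:primary_category term="cs.LG"/>
  <category term="cs.LG"/>
  <category term="cs.AI"/>
</entry>"""


def text_or_none(entry, path):
    """Return the stripped text of the first match, or None if absent."""
    node = entry.find(path, NS)
    return node.text.strip() if node is not None and node.text else None


def entry_to_record(entry):
    """Hypothetical mapping from an Atom entry to a subset of the output fields."""
    abstract_url = text_or_none(entry, "atom:id")
    pdf_link = entry.find("atom:link[@title='pdf']", NS)
    primary = entry.find("arxiv:primary_category", NS)
    return {
        # Assumption: the ID is the part after /abs/ with the version suffix removed.
        "arxivId": re.sub(r"v\d+$", "", abstract_url.split("/abs/", 1)[1]),
        "title": text_or_none(entry, "atom:title"),
        "authors": [a.findtext("atom:name", namespaces=NS)
                    for a in entry.findall("atom:author", NS)],
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "primaryCategory": primary.get("term") if primary is not None else None,
        "published": text_or_none(entry, "atom:published"),
        "updated": text_or_none(entry, "atom:updated"),
        "comment": text_or_none(entry, "arxiv:comment"),
        "doi": text_or_none(entry, "arxiv:doi"),  # optional, often missing
        "pdfUrl": pdf_link.get("href") if pdf_link is not None else None,
        "abstractUrl": abstract_url,
    }


record = entry_to_record(ET.fromstring(SAMPLE_ENTRY))
print(record["arxivId"], record["authors"])
```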
Example Output
{
  "arxivId": "2301.12345",
  "title": "Attention Is All You Need: A Survey",
  "summary": "This paper surveys the transformer architecture and its applications...",
  "authors": ["John Smith", "Jane Doe"],
  "categories": ["cs.LG", "cs.AI", "cs.CL"],
  "primaryCategory": "cs.LG",
  "published": "2023-01-15T12:00:00Z",
  "updated": "2023-02-20T08:30:00Z",
  "comment": "15 pages, 5 figures",
  "journalRef": "Nature Machine Intelligence, 2023",
  "doi": "10.1234/example.2023.12345",
  "pdfUrl": "http://arxiv.org/pdf/2301.12345",
  "abstractUrl": "http://arxiv.org/abs/2301.12345",
  "scrapedAt": "2024-01-15T10:30:00.000Z"
}
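For downstream use, the date fields are ISO 8601 strings and can be parsed with the standard library. A minimal sketch, using a subset of the example record above; the helper `parse_timestamp` is a hypothetical name, not part of the actor.

```python
import json
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as "2023-01-15T12:00:00Z"."""
    # Older Python versions reject a trailing "Z", so swap it for an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


record = json.loads("""
{
  "arxivId": "2301.12345",
  "published": "2023-01-15T12:00:00Z",
  "updated": "2023-02-20T08:30:00Z",
  "scrapedAt": "2024-01-15T10:30:00.000Z"
}
""")

published = parse_timestamp(record["published"])
updated = parse_timestamp(record["updated"])

# Whole days between first submission and the latest revision.
days_until_revision = (updated - published).days
print(days_until_revision)
```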
arXiv Categories
Common category codes for filtering:
Computer Science
- `cs.AI` - Artificial Intelligence
- `cs.LG` - Machine Learning
- `cs.CL` - Computation and Language (NLP)
- `cs.CV` - Computer Vision
- `cs.NE` - Neural and Evolutionary Computing
- `cs.RO` - Robotics
Physics
- `physics` - All physics
- `hep-th` - High Energy Physics - Theory
- `cond-mat` - Condensed Matter
- `quant-ph` - Quantum Physics
Mathematics
- `math` - All mathematics
- `math.ST` - Statistics Theory
- `math.OC` - Optimization and Control
Other
- `q-bio` - Quantitative Biology
- `q-fin` - Quantitative Finance
- `stat.ML` - Machine Learning (Statistics)
- `eess` - Electrical Engineering and Systems Science
Usage Notes
- Rate Limiting: The scraper automatically waits 3 seconds between API requests, as recommended by arXiv.
- Batch Size: Results are fetched in batches of 100 papers per request (arXiv's recommended maximum).
- Search Syntax: The search query searches across all fields. For more specific searches, you can use arXiv's query syntax directly.
- Large Datasets: For limits above 1000, expect the scraper to take several minutes due to rate limiting.
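The batch size and delay above give a lower bound on run time. The sketch below works through that arithmetic, assuming the 3-second wait applies only between consecutive requests and ignoring response time. The function names are hypothetical and do not come from the actor's code.

```python
import math


def plan_batches(limit, batch_size=100):
    """Return (start, max_results) pairs that together cover `limit` papers."""
    return [(start, min(batch_size, limit - start))
            for start in range(0, limit, batch_size)]


def min_wait_seconds(limit, batch_size=100, delay=3.0):
    """Lower bound on time spent waiting between requests."""
    requests = math.ceil(limit / batch_size)
    # Assumption: no wait before the first request, one wait between each pair.
    return max(requests - 1, 0) * delay


print(plan_batches(250))        # three requests, the last one partial
print(min_wait_seconds(1000))   # 10 requests, 9 waits
print(min_wait_seconds(10000))  # 100 requests, 99 waits
```

Under these assumptions, a limit of 1000 spends at least 27 seconds waiting, and the maximum limit of 10000 spends at least 297 seconds (about five minutes) waiting, before counting the time arXiv takes to respond.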
Local Development
# Install dependencies
npm install
# Run locally with Apify CLI
apify run -i '{"searchQuery": "deep learning", "limit": 10}'