arXiv Research Paper Scraper
Pricing
from $4.99 / 1,000 results
Extract comprehensive research paper data from arXiv search results including titles, authors, abstracts, categories, and more.
Developer
Coding Frontned
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 3 days ago
Extract comprehensive research paper metadata from arXiv, the premier open-access preprint server for physics, mathematics, computer science, and more.
Features
- Full paper metadata: arXiv ID, title, authors, abstract, categories, dates
- PDF & abstract links: direct links to papers
- Pagination: automatically iterates through pages to reach `maxItems`
- Deduplication: no duplicate papers across pages
- Flexible search: search by all fields, title, author, abstract, category, etc.
- Sorting: sort by relevance, submission date, or last updated date
- No anti-bot issues: arXiv is an open academic resource
Input Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | (required) | Search query (e.g. "large language models", "quantum computing") |
| `searchType` | string | "all" | Search field: all, ti (title), au (author), abs (abstract), cat (category) |
| `sortBy` | string | "relevance" | Sort by: relevance, lastUpdatedDate, submittedDate |
| `sortOrder` | string | "descending" | Sort order: descending, ascending |
| `maxItems` | integer | 50 | Maximum number of papers to extract (1 to 1000) |
| `proxyConfiguration` | object | - | Apify proxy config |
Example INPUT.json
{
  "query": "large language models",
  "searchType": "all",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxItems": 50
}
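The defaults and allowed values listed in the table can be checked before a run is started. The following is a minimal sketch and not part of the Actor itself: `build_input` is a hypothetical helper, and its validation rules are simply read off the Input Parameters table.

```python
import json

# Allowed values, copied from the Input Parameters table.
SEARCH_TYPES = {"all", "ti", "au", "abs", "cat"}
SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
SORT_ORDER = {"descending", "ascending"}


def build_input(query, search_type="all", sort_by="relevance",
                sort_order="descending", max_items=50):
    """Return an input dict for the Actor (hypothetical helper, not an Apify API)."""
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query is required and must be a non-empty string")
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"searchType must be one of {sorted(SEARCH_TYPES)}")
    if sort_by not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if sort_order not in SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_items, bool) or not isinstance(max_items, int):
        raise ValueError("maxItems must be an integer")
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItems must be between 1 and 1000")
    return {
        "query": query,
        "searchType": search_type,
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "maxItems": max_items,
    }


if __name__ == "__main__":
    # Reproduces the example INPUT.json shown above.
    print(json.dumps(build_input("large language models", sort_by="submittedDate"), indent=2))
```

Validating locally catches a mistyped `searchType` or an out-of-range `maxItems` before any run is started.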
Output Fields
| Field | Type | Description |
|---|---|---|
| `position` | integer | Rank in results (1-based) |
| `arxivId` | string | arXiv paper ID (e.g. 2401.12345) |
| `title` | string | Full paper title |
| `authors` | array | List of author names |
| `abstract` | string | Full paper abstract |
| `primaryCategory` | string | Primary subject category (e.g. cs.AI) |
| `categories` | array | All subject categories |
| `submittedDate` | string | Original submission date |
| `updatedDate` | string | Last updated date |
| `abstractUrl` | string | URL to the abstract page |
| `pdfUrl` | string | Direct link to the PDF |
| `comments` | string | Author comments (e.g. "20 pages, 5 figures") |
| `journalRef` | string | Journal reference if published |
| `doi` | string | DOI if available |
| `reportNumber` | string | Report number if available |
| `searchQuery` | string | Query used for this result |
| `scrapedAt` | string | ISO 8601 timestamp |
Example Output
{
  "position": 1,
  "arxivId": "2501.12345",
  "title": "Scaling Laws for Neural Language Models",
  "authors": ["Jared Kaplan", "Sam McCandlish"],
  "abstract": "We study empirical scaling laws for language model performance...",
  "primaryCategory": "cs.LG",
  "categories": ["cs.LG", "cs.CL", "stat.ML"],
  "submittedDate": "15 January, 2025",
  "updatedDate": null,
  "abstractUrl": "https://arxiv.org/abs/2501.12345",
  "pdfUrl": "https://arxiv.org/pdf/2501.12345",
  "comments": "35 pages, 14 figures",
  "journalRef": null,
  "doi": null,
  "searchQuery": "large language models",
  "scrapedAt": "2025-05-01T12:00:00.000Z"
}
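Downstream code usually wants typed values rather than raw strings. The sketch below turns one output record into a small dataclass. It assumes every record uses the same date formats as the example above ("15 January, 2025" for `submittedDate` and `updatedDate`, ISO 8601 with a trailing `Z` for `scrapedAt`); that is an assumption based on this single example. `Paper`, `parse_arxiv_date`, and `parse_paper` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass
class Paper:
    arxiv_id: str
    title: str
    authors: List[str]
    primary_category: str
    submitted: Optional[date]
    updated: Optional[date]
    scraped_at: datetime


def parse_arxiv_date(value):
    """Parse a date such as "15 January, 2025"; return None for null or empty values."""
    if not value:
        return None
    # %B matches English month names under Python's default (C) locale.
    return datetime.strptime(value, "%d %B, %Y").date()


def parse_paper(item):
    """Convert one output record (a dict) into a Paper."""
    return Paper(
        arxiv_id=item["arxivId"],
        title=item["title"],
        authors=list(item.get("authors") or []),
        primary_category=item["primaryCategory"],
        submitted=parse_arxiv_date(item.get("submittedDate")),
        updated=parse_arxiv_date(item.get("updatedDate")),
        # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
        scraped_at=datetime.fromisoformat(item["scrapedAt"].replace("Z", "+00:00")),
    )
```

Nullable fields such as `updatedDate` map to `None`, so callers can tell "never revised" apart from a parse failure, which raises `ValueError`.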
Pagination
arXiv returns 25 results per page. The scraper automatically navigates through pages using the start offset parameter until maxItems is reached or no more results are available.
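The paging and deduplication behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the Actor's source code: `fetch_page` is a hypothetical callable that takes a `start` offset and returns a list of paper dicts, and the page size of 25 is taken from the paragraph above.

```python
PAGE_SIZE = 25  # results per page, as stated above


def page_offsets(max_items, page_size=PAGE_SIZE):
    """Return the start offsets needed to cover max_items results."""
    return list(range(0, max_items, page_size))


def collect(fetch_page, max_items, page_size=PAGE_SIZE):
    """Walk pages until max_items unique papers are collected or results run out.

    fetch_page is a hypothetical callable: start offset -> list of dicts,
    each with an "arxivId" key.
    """
    seen = set()
    papers = []
    start = 0
    while len(papers) < max_items:
        page = fetch_page(start)
        if not page:
            break  # no more results available
        added = 0
        for paper in page:
            if paper["arxivId"] in seen:
                continue  # skip duplicates across pages
            seen.add(paper["arxivId"])
            papers.append(paper)
            added += 1
            if len(papers) == max_items:
                break
        if added == 0:
            break  # page contained nothing new; stop rather than loop forever
        start += page_size
    return papers
```

With `maxItems` set to 60, `page_offsets(60)` yields the offsets 0, 25, and 50, so three pages are requested and the last one is only partly used.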
Use Cases
- Academic research monitoring: track new papers in your field
- Trend analysis: identify emerging topics and research directions
- Author profiling: collect all papers by specific authors
- Citation database: build reference datasets for research tools
- Competitive intelligence: monitor publications from research groups
- AI/ML dataset creation: collect paper abstracts for NLP training
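As one illustration of the trend-analysis use case, the sketch below counts how often each subject category appears in a set of results. It assumes the records have already been loaded as a list of dicts shaped like the example output; `category_counts` is a hypothetical helper name.

```python
from collections import Counter


def category_counts(papers):
    """Count how many papers list each subject category."""
    counts = Counter()
    for paper in papers:
        # set() ensures a paper is counted once per category even if the
        # list repeats a code; "or []" tolerates a null categories field.
        counts.update(set(paper.get("categories") or []))
    return counts


if __name__ == "__main__":
    sample = [
        {"arxivId": "2501.00001", "categories": ["cs.LG", "cs.CL"]},
        {"arxivId": "2501.00002", "categories": ["cs.LG", "stat.ML"]},
    ]
    for category, n in category_counts(sample).most_common():
        print(category, n)
```

Running the same count on results from successive scheduled runs gives a rough view of which categories are growing for a given query.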
Notes
- arXiv is a free, open-access resource: no authentication needed
- Results may vary slightly based on arXiv's real-time indexing
- The `abstract` field contains the full abstract text
- Use `searchType: "au"` to search by author name (e.g. `"Hinton, Geoffrey"`)
- Use `searchType: "cat"` with category codes like `"cs.AI"`, `"math.CO"`, `"hep-th"`