arXiv Paper & Author Scraper
Under maintenance
Pricing: Pay per usage
Developer: Automly (maintained by Community)
Extract academic papers, abstracts, and author details from arXiv using the official API. This actor is perfect for research monitoring, systematic literature reviews, building academic datasets, and feeding RAG pipelines with the latest scientific publications.
Why use this actor?
- Official API reliability — Uses the arXiv export API for stable, structured data without scraping complexity.
- Research monitoring — Track new papers in specific fields or by keyword.
- Literature reviews — Collect abstracts, authors, and categories for systematic analysis.
- Academic lead generation — Build lists of researchers and their affiliations by topic.
- RAG & AI pipelines — Feed paper abstracts and metadata into vector databases for semantic search.
Features
- Search papers by free-text query or arXiv category codes
- Filter by date range (last week, last month, last year, or custom range)
- Sort by relevance, submission date, or last updated date
- Extract full abstracts and author lists with affiliations
- Output authors as separate records for easy analysis
- Respects arXiv's polite usage policy with built-in rate limiting
Input
| Field | Type | Default | Description |
|---|---|---|---|
| searchQuery | string | — | arXiv search query, e.g. machine learning or cat:cs.AI |
| categories | array | — | List of arXiv category codes, e.g. ["cs.AI", "cs.LG"] |
| dateRange | string | — | lastWeek, lastMonth, lastYear, or YYYY-MM-DD TO YYYY-MM-DD |
| maxResults | integer | 100 | Maximum papers to return (1–500) |
| extractAuthors | boolean | true | Include author records as separate rows |
| extractAbstract | boolean | true | Include paper abstracts |
| sortBy | string | relevance | relevance, lastUpdatedDate, or submittedDate |
| sortOrder | string | descending | ascending or descending |
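The defaults and allowed values in the table above can be checked on the client side before starting a run. The following is a minimal sketch, not the actor's own validation code; the function name `normalize_input` and the error messages are hypothetical.

```python
# Allowed values and defaults, copied from the input table.
ALLOWED_SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
ALLOWED_SORT_ORDER = {"ascending", "descending"}
DEFAULTS = {
    "maxResults": 100,
    "extractAuthors": True,
    "extractAbstract": True,
    "sortBy": "relevance",
    "sortOrder": "descending",
}


def normalize_input(raw: dict) -> dict:
    """Fill in documented defaults and reject values outside the documented ranges."""
    data = {**DEFAULTS, **raw}

    max_results = data["maxResults"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_results, bool) or not isinstance(max_results, int):
        raise ValueError("maxResults must be an integer")
    if not 1 <= max_results <= 500:
        raise ValueError("maxResults must be between 1 and 500")

    if data["sortBy"] not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if data["sortOrder"] not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")
    return data
```

Calling `normalize_input({"searchQuery": "graph neural networks"})` returns the same input with the five defaults filled in.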
Example input
{
  "searchQuery": "large language models",
  "categories": ["cs.CL", "cs.AI"],
  "dateRange": "lastMonth",
  "maxResults": 50,
  "extractAuthors": true,
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}
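To show how such an input could map onto the arXiv export API, here is a rough sketch that builds a request URL using the API's `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters. How the actor combines the fields internally is not documented here, so the AND/OR logic, the preset window lengths (7, 30, 365 days), and the helper names `date_clause` and `build_query_url` are all assumptions.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
# Assumed window lengths for the dateRange presets.
PRESET_DAYS = {"lastWeek": 7, "lastMonth": 30, "lastYear": 365}


def date_clause(date_range: str, today: date) -> str:
    """Turn a dateRange value into an arXiv submittedDate range clause."""
    if date_range in PRESET_DAYS:
        start, end = today - timedelta(days=PRESET_DAYS[date_range]), today
    else:
        first, last = date_range.split(" TO ")
        start = date.fromisoformat(first.strip())
        end = date.fromisoformat(last.strip())
    # arXiv expects YYYYMMDDHHMM timestamps inside the range brackets.
    return f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"


def build_query_url(inp: dict, today: date, start: int = 0, page_size: int = 100) -> str:
    """Build one arXiv API request URL from an actor-style input (assumed mapping)."""
    clauses = []
    query = inp.get("searchQuery")
    if query:
        # Assumption: fielded queries such as cat:cs.AI pass through unchanged,
        # plain text is searched as a phrase across all fields.
        clauses.append(query if ":" in query else f'all:"{query}"')
    if inp.get("categories"):
        clauses.append("(" + " OR ".join(f"cat:{c}" for c in inp["categories"]) + ")")
    if inp.get("dateRange"):
        clauses.append(date_clause(inp["dateRange"], today))
    params = {
        "search_query": " AND ".join(clauses),
        "start": start,
        "max_results": min(page_size, inp.get("maxResults", 100)),
        "sortBy": inp.get("sortBy", "relevance"),
        "sortOrder": inp.get("sortOrder", "descending"),
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```

Passing `today` explicitly keeps the function deterministic, which makes the preset ranges easy to test.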
Output
Each record includes a type field, either paper or author, that identifies which kind of entity it describes.
Paper
| Field | Type | Description |
|---|---|---|
| type | string | paper |
| arxivId | string | arXiv identifier |
| url | string | arXiv abstract page URL |
| pdfUrl | string | Direct PDF URL |
| title | string | Paper title |
| abstract | string | Paper abstract |
| publishedAt | string | ISO 8601 submission date |
| updatedAt | string | ISO 8601 last update date |
| authors | array | List of {name, affiliation} objects |
| categories | array | arXiv category codes |
| primaryCategory | string | Primary arXiv category |
Author
| Field | Type | Description |
|---|---|---|
| type | string | author |
| arxivId | string | Associated paper identifier |
| paperTitle | string | Associated paper title |
| name | string | Author name |
| affiliation | string | Author affiliation |
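Because paper and author records arrive in the same dataset, a consumer usually needs to separate them by the type field and join them on arxivId. A small sketch of that step follows; the helper name `group_records` is hypothetical, and the field names are those from the tables above.

```python
from collections import defaultdict


def group_records(items):
    """Split mixed dataset items into papers and author names keyed by arxivId."""
    papers = {}
    authors_by_paper = defaultdict(list)
    for item in items:
        kind = item.get("type")
        if kind == "paper":
            papers[item["arxivId"]] = item
        elif kind == "author":
            authors_by_paper[item["arxivId"]].append(item["name"])
        # Records with any other type value are ignored.
    return papers, dict(authors_by_paper)
```

The function returns a pair: a dictionary of paper records keyed by arxivId, and a dictionary mapping each arxivId to the list of author names found for it.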
Limits and caveats
- The actor fetches results in pages of up to 100 per arXiv API request and paginates automatically.
- A 3-second delay is enforced between requests to respect arXiv's polite usage policy.
- Only publicly available papers are returned.
- Author affiliations are only available when provided by the submitter.
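The page size and delay listed above give a lower bound on how long a run spends waiting. As a sketch, assuming pages of 100 results and a 3-second pause between consecutive requests (both taken from the caveats; the function names are hypothetical):

```python
def page_plan(max_results: int, page_size: int = 100) -> list[tuple[int, int]]:
    """Return (start, count) pairs covering max_results in page_size chunks."""
    return [
        (start, min(page_size, max_results - start))
        for start in range(0, max_results, page_size)
    ]


def min_delay_seconds(max_results: int, page_size: int = 100, delay: float = 3.0) -> float:
    """Total enforced waiting time: one delay between each pair of consecutive requests."""
    pages = len(page_plan(max_results, page_size))
    return max(pages - 1, 0) * delay
```

For example, a run with maxResults set to 250 needs three requests (100, 100, and 50 results) and therefore waits at least 6 seconds in total, not counting network time.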
Pricing
This actor uses Pay Per Event pricing. You are charged only for successfully extracted data.
| Event | Price | Description |
|---|---|---|
| Paper scraped | $0.003 | Each paper successfully extracted |
| Author scraped | $0.001 | Each author record successfully extracted |
Tiered discounts apply based on your Apify subscription level. A small actor-start fee may also apply.
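A rough cost estimate follows directly from the event prices in the table. The sketch below uses the list prices only; it ignores tiered discounts and any actor-start fee, whose amounts are not stated here, and the function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# List prices per event, from the pricing table.
PRICE_PER_PAPER = Decimal("0.003")
PRICE_PER_AUTHOR = Decimal("0.001")


def estimate_cost(papers: int, authors: int) -> Decimal:
    """Estimated charge in USD before discounts and any actor-start fee."""
    return papers * PRICE_PER_PAPER + authors * PRICE_PER_AUTHOR
```

For instance, 50 papers with 200 author records in total come to 50 × $0.003 + 200 × $0.001 = $0.35. Using `Decimal` avoids the rounding noise that binary floats introduce with prices like 0.003.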
FAQ
Do I need an arXiv account?
No. The arXiv API is completely open and requires no authentication.
Can I download the full PDF?
The actor returns direct PDF URLs in the pdfUrl field. You can download them separately.
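Downloading a PDF from a paper record could look like the sketch below, which uses only the Python standard library. The helper names and the `pdfs` output directory are hypothetical, and the exact format of arxivId values is an assumption; older arXiv identifiers contain a slash (for example hep-th/9901001), so the filename is sanitized first.

```python
import re
from pathlib import Path
from urllib.request import urlopen


def pdf_filename(arxiv_id: str) -> str:
    """Derive a filesystem-safe filename from an arXiv identifier."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", arxiv_id) + ".pdf"


def download_pdf(record: dict, out_dir: str = "pdfs", timeout: float = 30.0) -> Path:
    """Fetch record['pdfUrl'] and save it under out_dir; returns the saved path."""
    target = Path(out_dir) / pdf_filename(record["arxivId"])
    target.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(record["pdfUrl"], timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

When downloading many files, pause between requests in the same spirit as the actor's own rate limiting.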
What categories are available?
arXiv uses codes like cs.AI (Artificial Intelligence), cs.LG (Machine Learning), cs.CL (Computation and Language), physics.gen-ph, math.ST, etc. See the full list at arxiv.org.
How recent is the data?
Data reflects the current arXiv index at the time of the run. arXiv announces new submissions in daily batches, so a new paper typically becomes available after the next announcement, not within minutes of submission.