ArXiv Scraper
Scrape ArXiv research papers — titles, authors, abstracts, subjects, submission dates, and PDF links.
Developer: Stas Persiianenko
Scrape research papers from ArXiv by keyword. Extract titles, authors, abstracts, subjects, submission dates, comments, and PDF links from search results.
What does ArXiv Scraper do?
ArXiv Scraper searches ArXiv for research papers matching your keywords and extracts structured data from the results. It collects complete paper metadata including full abstracts, author lists, subject categories, and direct PDF download links.
The scraper uses ArXiv's search interface and supports sorting by relevance, submission date, or announcement date with configurable result limits and pagination.
Why scrape ArXiv?
ArXiv is the world's largest open-access repository for scientific preprints, hosting over 2.5 million papers across physics, mathematics, computer science, biology, economics, and more. Researchers submit papers to ArXiv before or alongside traditional journal publication.
Key reasons to scrape ArXiv:
- Literature reviews — Collect papers on a topic for systematic reviews
- Research monitoring — Track new papers in your field of study
- Citation analysis — Build datasets of papers for bibliometric research
- ML training data — Gather abstracts and metadata for NLP models
- Competitive intelligence — Monitor research output from specific institutions
Use cases
- Academic researchers tracking publications in their field
- Data scientists building paper recommendation systems
- Research teams doing systematic literature reviews
- AI companies monitoring state-of-the-art research
- PhD students surveying related work for dissertations
- Science journalists tracking breakthroughs across disciplines
How to scrape ArXiv
1. Go to ArXiv Scraper on Apify Store
2. Enter one or more search keywords
3. Choose sort order (relevance, submission date, or announcement date)
4. Set max results per search and max pages
5. Click Start and wait for results
6. Download data as JSON, CSV, or Excel
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQueries` | string[] | (required) | Keywords to search on ArXiv |
| `sortBy` | string | "relevance" | Sort by: `relevance`, `submittedDate`, or `announcedDate` |
| `sortOrder` | string | "descending" | Sort direction: `descending` or `ascending` |
| `maxResultsPerSearch` | integer | 100 | Max papers per keyword |
| `maxSearchPages` | integer | 5 | Max pages per keyword (50 papers/page) |
| `maxRequestRetries` | integer | 3 | Retry attempts for failed requests |
Input example
{
  "searchQueries": ["transformer neural network", "large language model"],
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxResultsPerSearch": 50,
  "maxSearchPages": 2
}
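The parameter table can be turned into a small pre-flight check before a run is started. The sketch below is only an illustration: `build_input` is a hypothetical helper (not part of the Actor or the Apify client), and the allowed values and defaults are taken solely from the table above.

```python
# Allowed values as listed in the input parameter table.
ALLOWED_SORT_BY = {"relevance", "submittedDate", "announcedDate"}
ALLOWED_SORT_ORDER = {"descending", "ascending"}


def build_input(search_queries, sort_by="relevance", sort_order="descending",
                max_results_per_search=100, max_search_pages=5):
    """Return an input dict for the Actor, or raise ValueError on bad values.

    Hypothetical helper: the Actor may perform its own validation as well.
    """
    # Drop empty or whitespace-only keywords.
    queries = [q.strip() for q in search_queries if q and q.strip()]
    if not queries:
        raise ValueError("searchQueries is required and must not be empty")
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if sort_order not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")
    if max_results_per_search < 1 or max_search_pages < 1:
        raise ValueError("limits must be positive integers")
    return {
        "searchQueries": queries,
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "maxResultsPerSearch": max_results_per_search,
        "maxSearchPages": max_search_pages,
    }
```

The returned dictionary has the same shape as the input example and could be passed as `run_input` when calling the Actor through the API.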
Output
Each paper in the dataset contains:
| Field | Type | Description |
|---|---|---|
| `arxivId` | string | ArXiv paper ID (e.g., "2603.00888") |
| `title` | string | Paper title |
| `authors` | string[] | List of author names |
| `abstract` | string | Full abstract text |
| `subjects` | string[] | ArXiv subject categories (e.g., "cs.LG") |
| `submittedDate` | string | Submission date (e.g., "28 February, 2026") |
| `comments` | string | Author comments (page count, conference, etc.) |
| `pdfUrl` | string | Direct link to PDF |
| `abstractUrl` | string | Link to abstract page |
| `scrapedAt` | string | ISO timestamp of extraction |
Output example
{
  "arxivId": "2603.00853",
  "title": "Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration",
  "authors": ["Cong Wang", "Jinshan Pan", "Liyan Wang", "Wei Wang", "Yang Yang"],
  "abstract": "We propose a simple yet effective UHDPromer, a neural discrimination-prompted Transformer, for Ultra-High-Definition (UHD) image restoration and enhancement...",
  "subjects": ["cs.CV"],
  "submittedDate": "28 February, 2026",
  "comments": "Accepted by IJCV'26; code is available at https://github.com/supersupercong/uhdpromer",
  "pdfUrl": "https://arxiv.org/pdf/2603.00853",
  "abstractUrl": "https://arxiv.org/abs/2603.00853",
  "scrapedAt": "2026-03-03T02:40:25.176Z"
}
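Both date fields arrive as strings, so sorting or filtering by date needs a parsing step. The following is a minimal sketch, assuming every record follows the formats shown in the output example ("28 February, 2026" and an ISO timestamp ending in "Z"); the function names are hypothetical, and month-name parsing assumes an English locale.

```python
from datetime import date, datetime, timedelta


def parse_submitted_date(text):
    """Parse a date such as '28 February, 2026' into a datetime.date.

    Assumes the day / full month name / year layout seen in the example.
    """
    return datetime.strptime(text.strip(), "%d %B, %Y").date()


def parse_scraped_at(text):
    """Parse an ISO timestamp such as '2026-03-03T02:40:25.176Z' (UTC)."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat,
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))
```

With parsed dates, a list of papers can be sorted with `sorted(papers, key=lambda p: parse_submitted_date(p["submittedDate"]))`.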
Pricing
ArXiv Scraper uses pay-per-event pricing:
| Event | Price |
|---|---|
| Run started | $0.001 |
| Paper extracted | $0.002 per paper |
Cost examples
| Scenario | Papers | Cost |
|---|---|---|
| Quick search | 50 | $0.101 |
| Medium search | 200 | $0.401 |
| Large survey | 500 | $1.001 |
Platform costs (compute) are minimal — typically under $0.001 per run.
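The cost examples follow directly from the two event prices: one run-start fee plus a per-paper fee. A small sketch of that arithmetic is shown here; `estimate_cost` is a hypothetical helper, it uses only the prices listed above, and it leaves out platform compute costs.

```python
from decimal import Decimal

# Event prices in USD, copied from the pricing table.
RUN_STARTED = Decimal("0.001")
PAPER_EXTRACTED = Decimal("0.002")


def estimate_cost(papers, runs=1):
    """Estimate event charges in USD for a number of papers and runs.

    Decimal is used so the cents add up exactly (no float rounding).
    """
    return RUN_STARTED * runs + PAPER_EXTRACTED * papers


for label, count in [("Quick search", 50), ("Medium search", 200), ("Large survey", 500)]:
    print(f"{label}: {count} papers -> ${estimate_cost(count)}")
```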
Using ArXiv Scraper with the Apify API
Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('automation-lab/arxiv-scraper').call({
    searchQueries: ['attention mechanism'],
    sortBy: 'submittedDate',
    maxResultsPerSearch: 100,
});

// Read the scraped papers from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Found ${items.length} papers`);
items.forEach((paper) => {
    console.log(`${paper.arxivId}: ${paper.title}`);
});
Python
from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')

# Start the Actor and wait for the run to finish
run = client.actor('automation-lab/arxiv-scraper').call(run_input={
    'searchQueries': ['attention mechanism'],
    'sortBy': 'submittedDate',
    'maxResultsPerSearch': 100,
})

# Read the scraped papers from the run's default dataset
dataset = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Found {len(dataset)} papers')
for paper in dataset:
    print(f"{paper['arxivId']}: {paper['title']}")
Integrations
ArXiv Scraper works with all Apify integrations:
- Webhooks — Get notified when a scrape completes
- API — Trigger runs programmatically and fetch results
- Scheduled runs — Monitor ArXiv on a daily or weekly schedule
- Google Sheets — Export papers directly to a spreadsheet
- Slack / Email — Send notifications when new papers match your criteria
Connect ArXiv Scraper to Zapier, Make, or Google Sheets for automated workflows.
Tips
- Use specific keywords for better results: ArXiv's search is broad by default
- Sort by submission date to find the latest papers first
- Combine multiple queries to search across related topics in a single run
- Check subject codes: ArXiv uses category codes such as `cs.LG` (Machine Learning), `cs.CV` (Computer Vision), and `stat.ML` (Statistics, Machine Learning)
- Set reasonable limits: start with 50–100 papers per search and increase if needed
- PDF links work directly: download PDFs programmatically using the `pdfUrl` field
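When several related queries are combined in one run, the same paper may match more than one keyword. Whether the Actor removes such duplicates itself is not stated here, so the sketch below assumes it does not and post-processes the dataset items locally. Both helpers are hypothetical and rely only on the `arxivId` and `subjects` fields from the output table.

```python
def dedupe_by_arxiv_id(papers):
    """Keep the first occurrence of each arxivId, preserving order."""
    seen = set()
    unique = []
    for paper in papers:
        arxiv_id = paper.get("arxivId")
        if arxiv_id in seen:
            continue
        seen.add(arxiv_id)
        unique.append(paper)
    return unique


def filter_by_subject(papers, codes):
    """Keep papers tagged with at least one of the given subject codes."""
    wanted = set(codes)
    return [p for p in papers if wanted.intersection(p.get("subjects", []))]


# Illustrative records only; the second ID and its subjects are made up.
sample = [
    {"arxivId": "2603.00853", "subjects": ["cs.CV"]},
    {"arxivId": "2603.00853", "subjects": ["cs.CV"]},
    {"arxivId": "0000.00001", "subjects": ["cs.LG", "stat.ML"]},
]
unique = dedupe_by_arxiv_id(sample)
print(len(unique), len(filter_by_subject(unique, ["stat.ML"])))
```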
FAQ
How many papers can I scrape?
Each search page returns up to 50 papers. With `maxSearchPages` set to 20, you can get up to 1,000 papers per keyword, provided `maxResultsPerSearch` is also raised to at least 1,000 (its default is 100).
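The page arithmetic behind this answer can be sketched as follows. The sketch assumes that both limits apply and that the lower one wins, which is an inference from the parameter descriptions rather than documented behavior; the helper names are hypothetical.

```python
import math

PAPERS_PER_PAGE = 50  # page size stated in the parameter table


def pages_needed(target_papers):
    """Number of search pages required to reach a target paper count."""
    return math.ceil(target_papers / PAPERS_PER_PAGE)


def max_papers_per_keyword(max_results_per_search, max_search_pages):
    """Upper bound on papers per keyword, assuming the lower limit wins."""
    return min(max_results_per_search, max_search_pages * PAPERS_PER_PAGE)


print(pages_needed(1000))                # pages required for 1,000 papers
print(max_papers_per_keyword(100, 5))    # bound under the default settings
print(max_papers_per_keyword(1000, 20))  # bound with both limits raised
```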
Does it scrape full paper text?
No — it extracts metadata and abstracts from search results. For full paper text, download the PDF using the provided pdfUrl.
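Fetching the PDFs is a separate step from the scrape. The sketch below uses only the Python standard library and the `pdfUrl` field. The helper names, the output directory, and the three-second pause are assumptions chosen for illustration; ArXiv may throttle or reject automated downloads, so treat this as a starting point rather than a tested downloader.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def pdf_filename(pdf_url):
    """Derive a local file name such as '2603.00853.pdf' from a pdfUrl."""
    # Only the last path segment is used, so old-style IDs that contain
    # a slash would lose their archive prefix.
    name = urlparse(pdf_url).path.rstrip("/").rsplit("/", 1)[-1]
    return name if name.endswith(".pdf") else f"{name}.pdf"


def download_pdfs(papers, out_dir="pdfs", delay_seconds=3.0):
    """Download each paper's PDF once, pausing between requests."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for paper in papers:
        path = target / pdf_filename(paper["pdfUrl"])
        if not path.exists():  # skip files fetched in an earlier run
            with urllib.request.urlopen(paper["pdfUrl"], timeout=60) as resp:
                path.write_bytes(resp.read())
            time.sleep(delay_seconds)  # assumed polite delay, not an official value
        saved.append(path)
    return saved
```

Calling `download_pdfs(items)` on the dataset items returns the list of local paths; text extraction from those PDFs would need a separate tool.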
Can I search by author?
The scraper currently uses ArXiv's "all fields" search. Include author names in your search keywords to find papers by specific researchers.
How often is ArXiv updated?
ArXiv receives new submissions daily (excluding weekends). Sort by submission date to see the latest papers.