ArXiv Scraper
Pricing
Pay per event
Scrape ArXiv research papers: titles, authors, abstracts, subjects, submission dates, and PDF links.
Developer
Stas Persiianenko
Scrape research papers from ArXiv by keyword. Extract titles, authors, abstracts, subjects, submission dates, comments, and PDF links from search results.
What does ArXiv Scraper do?
ArXiv Scraper searches ArXiv for research papers matching your keywords and extracts structured data from the results. It collects complete paper metadata including full abstracts, author lists, subject categories, and direct PDF download links.
The scraper uses ArXiv's search interface and supports sorting by relevance, submission date, or announcement date with configurable result limits and pagination.
Who is it for?
- Academic researchers: collecting papers and metadata for literature reviews
- Data scientists: building datasets of research trends and citation patterns
- ML engineers: tracking new model releases and benchmark results
- Science journalists: monitoring breakthroughs across research fields
- R&D teams: staying current with state-of-the-art publications in their domain
Why scrape ArXiv?
ArXiv is the world's largest open-access repository for scientific preprints, hosting over 2.5 million papers across physics, mathematics, computer science, biology, economics, and more. Researchers submit papers to ArXiv before or alongside traditional journal publication.
Key reasons to scrape ArXiv:
- Literature reviews: collect papers on a topic for systematic reviews
- Research monitoring: track new papers in your field of study
- Citation analysis: build datasets of papers for bibliometric research
- ML training data: gather abstracts and metadata for NLP models, LLM fine-tuning, or RAG pipeline knowledge bases
- Competitive intelligence: monitor research output from specific institutions
Use cases
- Academic researchers tracking publications in their field
- Data scientists building paper recommendation systems
- Research teams doing systematic literature reviews
- AI companies monitoring state-of-the-art research
- PhD students surveying related work for dissertations
- Science journalists tracking breakthroughs across disciplines
- AI/ML engineers building RAG pipelines or training datasets from academic literature
How to scrape ArXiv
- Go to ArXiv Scraper on Apify Store
- Enter one or more search keywords
- Choose sort order (relevance, submission date, or announcement date)
- Set max results per search and max pages
- Click Start and wait for results
- Download data as JSON, CSV, or Excel
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| searchQueries | string[] | (required) | Keywords to search on ArXiv |
| sortBy | string | "relevance" | Sort by: relevance, submittedDate, or announcedDate |
| sortOrder | string | "descending" | Sort direction: descending or ascending |
| maxResultsPerSearch | integer | 100 | Max papers per keyword |
| maxSearchPages | integer | 5 | Max pages per keyword (50 papers/page) |
| maxRequestRetries | integer | 3 | Retry attempts for failed requests |
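As a rough illustration of how these parameters fit together, the Python sketch below assembles an input object and rejects values the table does not list. The helper name `build_run_input` is hypothetical and is not part of the actor or the Apify client; the defaults and allowed values are copied from the table above, and the actor may validate its input differently.

```python
# Allowed values as listed in the input parameters table.
ALLOWED_SORT_BY = {"relevance", "submittedDate", "announcedDate"}
ALLOWED_SORT_ORDER = {"descending", "ascending"}


def build_run_input(search_queries, sort_by="relevance", sort_order="descending",
                    max_results_per_search=100, max_search_pages=5,
                    max_request_retries=3):
    """Hypothetical helper: build an actor input dict with the documented defaults."""
    # Drop empty keywords and surrounding whitespace.
    queries = [q.strip() for q in search_queries if q and q.strip()]
    if not queries:
        raise ValueError("searchQueries is required and needs at least one keyword")
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if sort_order not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")
    return {
        "searchQueries": queries,
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "maxResultsPerSearch": max_results_per_search,
        "maxSearchPages": max_search_pages,
        "maxRequestRetries": max_request_retries,
    }
```

The returned dict can be passed as `run_input` when calling the actor through the Apify Python client.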
Input example
    {
        "searchQueries": ["transformer neural network", "large language model"],
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "maxResultsPerSearch": 50,
        "maxSearchPages": 2
    }
Output
Each paper in the dataset contains:
| Field | Type | Description |
|---|---|---|
| arxivId | string | ArXiv paper ID (e.g., "2603.00888") |
| title | string | Paper title |
| authors | string[] | List of author names |
| abstract | string | Full abstract text |
| subjects | string[] | ArXiv subject categories (e.g., "cs.LG") |
| submittedDate | string | Submission date (e.g., "28 February, 2026") |
| comments | string | Author comments (page count, conference, etc.) |
| pdfUrl | string | Direct link to PDF |
| abstractUrl | string | Link to abstract page |
| scrapedAt | string | ISO timestamp of extraction |
Output example
    {
        "arxivId": "2603.00853",
        "title": "Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration",
        "authors": ["Cong Wang", "Jinshan Pan", "Liyan Wang", "Wei Wang", "Yang Yang"],
        "abstract": "We propose a simple yet effective UHDPromer, a neural discrimination-prompted Transformer, for Ultra-High-Definition (UHD) image restoration and enhancement...",
        "subjects": ["cs.CV"],
        "submittedDate": "28 February, 2026",
        "comments": "Accepted by IJCV'26; code is available at https://github.com/supersupercong/uhdpromer",
        "pdfUrl": "https://arxiv.org/pdf/2603.00853",
        "abstractUrl": "https://arxiv.org/abs/2603.00853",
        "scrapedAt": "2026-03-03T02:40:25.176Z"
    }
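Because `submittedDate` arrives as a human-readable string rather than an ISO date, sorting or filtering by date usually needs a parsing step first. The sketch below assumes every value follows the "28 February, 2026" pattern shown above and that month names are in English; the function name `parse_submitted_date` is illustrative only.

```python
from datetime import date, datetime


def parse_submitted_date(value: str) -> date:
    """Parse a string like '28 February, 2026' into a date.

    Assumes the day / full month name / year layout from the output example.
    """
    return datetime.strptime(value.strip(), "%d %B, %Y").date()


def newest_first(items):
    """Return papers sorted by parsed submission date, most recent first."""
    return sorted(items,
                  key=lambda paper: parse_submitted_date(paper["submittedDate"]),
                  reverse=True)
```

A value that does not match the pattern raises `ValueError`, which makes unexpected formats visible instead of silently mis-sorting them.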
How much does it cost to scrape ArXiv?
ArXiv Scraper uses pay-per-event pricing:
| Event | Price |
|---|---|
| Run started | $0.001 |
| Paper extracted | $0.002 per paper |
Cost examples
| Scenario | Papers | Cost |
|---|---|---|
| Quick search | 50 | $0.101 |
| Medium search | 200 | $0.401 |
| Large survey | 500 | $1.001 |
Platform costs (compute) are minimal, typically under $0.001 per run.
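The figures in the cost table follow from one "Run started" event plus one "Paper extracted" event per paper. A minimal sketch of that arithmetic, assuming a single run and ignoring platform compute costs (the function name `estimate_cost` is hypothetical):

```python
from decimal import Decimal

# Event prices taken from the pricing table above (USD).
RUN_STARTED = Decimal("0.001")
PER_PAPER = Decimal("0.002")


def estimate_cost(papers: int) -> Decimal:
    """Estimate the event cost of one run that extracts `papers` papers."""
    if papers < 0:
        raise ValueError("papers must be non-negative")
    return RUN_STARTED + PER_PAPER * papers
```

`Decimal` is used instead of `float` so that small prices add up exactly.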
Using ArXiv Scraper with the Apify API
Node.js
    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

    const run = await client.actor('automation-lab/arxiv-scraper').call({
        searchQueries: ['attention mechanism'],
        sortBy: 'submittedDate',
        maxResultsPerSearch: 100,
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Found ${items.length} papers`);
    items.forEach(paper => {
        console.log(`${paper.arxivId}: ${paper.title}`);
    });
Python
    from apify_client import ApifyClient

    client = ApifyClient('YOUR_API_TOKEN')

    run = client.actor('automation-lab/arxiv-scraper').call(run_input={
        'searchQueries': ['attention mechanism'],
        'sortBy': 'submittedDate',
        'maxResultsPerSearch': 100,
    })

    dataset = client.dataset(run['defaultDatasetId']).list_items().items
    print(f'Found {len(dataset)} papers')
    for paper in dataset:
        print(f"{paper['arxivId']}: {paper['title']}")
Integrations
ArXiv Scraper works with all Apify integrations:
- Webhooks: get notified when a scrape completes
- API: trigger runs programmatically and fetch results
- Scheduled runs: monitor ArXiv on a daily or weekly schedule
- Google Sheets: export papers directly to a spreadsheet
- Slack / Email: send notifications when new papers match your criteria
Connect ArXiv Scraper to Zapier, Make, or Google Sheets for automated workflows.
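For scheduled monitoring, consecutive runs of the same query will usually return overlapping papers, so a notification step typically needs to keep only the ones it has not seen before. The sketch below shows one way to do that by tracking `arxivId` values. It assumes the caller persists the set of seen IDs between runs; nothing here states that the actor deduplicates across runs, and `new_papers` is a hypothetical helper.

```python
def new_papers(current_items, seen_ids):
    """Return papers whose arxivId is not in `seen_ids`, and record them as seen.

    `seen_ids` is a set owned by the caller (for example, loaded from a file
    or key-value store before the run and saved again afterwards).
    """
    fresh = []
    for paper in current_items:
        arxiv_id = paper.get("arxivId")
        if arxiv_id and arxiv_id not in seen_ids:
            fresh.append(paper)
            seen_ids.add(arxiv_id)  # also prevents duplicates within one run
    return fresh
```

Only the papers in the returned list would then be forwarded to Slack, email, or a spreadsheet.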
Tips
- Use specific keywords for better results: ArXiv's search is broad by default
- Sort by submission date to find the latest papers first
- Combine multiple queries to search across related topics in a single run
- Check subject codes: ArXiv uses category codes like `cs.LG` (Machine Learning), `cs.CV` (Computer Vision), `stat.ML` (Statistics ML)
- Set reasonable limits: start with 50 to 100 papers per search and increase if needed
- PDF links work directly: download PDFs programmatically using the `pdfUrl` field
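The last two tips can be combined in a short post-processing step: keep only papers in the categories you care about, then fetch their PDFs. This is a sketch under stated assumptions rather than part of the actor. The helper names, the `pdfs` output directory, and the pause between downloads are all illustrative choices.

```python
import time
import urllib.request
from pathlib import Path


def filter_by_subject(items, codes):
    """Keep papers that list at least one of the given category codes."""
    wanted = set(codes)
    return [p for p in items if wanted & set(p.get("subjects", []))]


def pdf_targets(items, out_dir="pdfs"):
    """Pair each pdfUrl with a local path named after the paper's arxivId."""
    return [(p["pdfUrl"], Path(out_dir) / f"{p['arxivId']}.pdf")
            for p in items if p.get("pdfUrl")]


def download_pdfs(targets, pause_seconds=3.0):
    """Download each PDF in turn, pausing between requests to stay polite."""
    for url, path in targets:
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(path))
        time.sleep(pause_seconds)
```

For example, `download_pdfs(pdf_targets(filter_by_subject(items, ["cs.CV"])))` would save the computer vision papers from a dataset into the `pdfs` directory.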
Use with AI agents via MCP
ArXiv Scraper is available as a tool for AI assistants via the Model Context Protocol (MCP).
Setup for Claude Code
    $ claude mcp add --transport http apify "https://mcp.apify.com"
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
    {
        "mcpServers": {
            "apify": {
                "url": "https://mcp.apify.com"
            }
        }
    }
Example prompts
- "Find recent papers about 'transformer architectures' on ArXiv"
- "Get the top AI papers published this month on ArXiv"
- "Search ArXiv for papers on reinforcement learning from human feedback and summarize the key findings"
cURL
    curl -X POST "https://api.apify.com/v2/acts/automation-lab~arxiv-scraper/runs?token=YOUR_API_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"searchQueries": ["transformer neural network"], "sortBy": "submittedDate", "maxResultsPerSearch": 50}'
Legality
Scraping publicly available data is generally considered legal in the United States under the Ninth Circuit Court of Appeals ruling in hiQ Labs v. LinkedIn. This actor only accesses publicly available information and does not require authentication. Always review and comply with the target website's Terms of Service before scraping. For personal data, ensure compliance with GDPR, CCPA, and other applicable privacy regulations.
FAQ
How many papers can I scrape?
Each search page returns up to 50 papers. With maxSearchPages set to 20, you can get up to 1,000 papers per keyword, provided maxResultsPerSearch (default 100) is raised to at least the same number.
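The relationship between the two limits can be worked out ahead of a run. The sketch below assumes that both `maxSearchPages` and `maxResultsPerSearch` apply and that the lower bound wins, which is an inference from the parameter descriptions rather than documented behavior; the function names are illustrative.

```python
import math

PAPERS_PER_PAGE = 50  # page size stated in the input parameters table


def pages_needed(target_papers: int) -> int:
    """Smallest maxSearchPages value that can cover `target_papers` results."""
    return math.ceil(target_papers / PAPERS_PER_PAGE)


def max_papers_per_keyword(max_search_pages: int, max_results_per_search: int) -> int:
    """Upper bound on papers per keyword, assuming the stricter limit applies."""
    return min(max_search_pages * PAPERS_PER_PAGE, max_results_per_search)
```

Under that assumption the defaults (5 pages, 100 results) cap each keyword at 100 papers, while 20 pages with a 1,000-result limit allows up to 1,000.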
Does it scrape full paper text?
No. It extracts metadata and abstracts from search results. For full paper text, download the PDF using the provided pdfUrl.
Can I search by author?
The scraper currently uses ArXiv's "all fields" search. Include author names in your search keywords to find papers by specific researchers.
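Since an "all fields" search can also match a name that appears in a title or abstract, it may help to narrow the results afterwards using the `authors` field. A minimal sketch, assuming a case-insensitive substring match is good enough for your purposes (`papers_by_author` is a hypothetical helper):

```python
def papers_by_author(items, name):
    """Keep papers where any entry in `authors` contains `name`, ignoring case."""
    needle = name.casefold()
    return [
        paper for paper in items
        if any(needle in author.casefold() for author in paper.get("authors", []))
    ]
```

Substring matching will not reconcile initials or name variants, so treat the result as a first pass.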
How often is ArXiv updated?
ArXiv receives new submissions daily (excluding weekends). Sort by submission date to see the latest papers.
Why am I getting empty results?
ArXiv's search can be strict with certain keyword combinations. Try simpler or broader keywords. Also check that maxSearchPages is high enough, since each page returns at most 50 papers.
The scraper is running slowly or timing out.
ArXiv may throttle requests during peak hours. Increase maxRequestRetries to handle transient failures, and avoid running very large scrapes (500+ papers) during US business hours when ArXiv traffic is highest.
Other research and academic scrapers
- Crossref Scraper: search and extract scholarly article metadata from Crossref.
- OpenAlex Scraper: extract research papers and citation data from OpenAlex.
- Semantic Scholar Scraper: scrape paper metadata and citations from Semantic Scholar.
- Wikipedia Scraper: extract articles and structured data from Wikipedia.