Research Paper Scraper
Pricing
from $10.00 / 1,000 results
Gather information such as paper titles, authors, abstracts, categories, PDF links, DOIs, and additional relevant details. This process does not require an API key for access.
Developer
Jamshaid Arif
arXiv Research Paper Scraper
Scrape academic papers from arXiv.org, the world's largest open-access preprint repository with 2M+ papers in Physics, Mathematics, Computer Science, Biology, Finance, and Statistics.
Extract paper titles, authors, abstracts, categories, PDF links, DOIs, and more. No API key needed.
What does this actor do?
This actor searches arXiv.org and returns structured data for academic papers. It uses the official arXiv API, which is free and open, so no authentication is required.
Use it to:
- Monitor new papers in your research field daily
- Build literature review databases for any topic
- Track a specific author's publication history
- Retrieve full metadata for known papers by their arXiv IDs
- Filter papers by date range, category, or keywords in the abstract
- Analyze research trends by category and year distribution
How to use
1. Choose a scrape mode
The actor supports 4 scraping modes, selected via the scrape_mode input:
| Mode | When to use | Required input |
|---|---|---|
| Keyword Search | Find papers on a topic | search_query |
| Author Search | Find all papers by a researcher | author_name |
| Category Browse | Browse latest papers in a field | category |
| ID Lookup | Fetch specific known papers | paper_ids |
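The mode-to-input mapping in the table can be checked before starting a run. The sketch below is a hypothetical pre-flight helper, not part of the actor: the mode values are the ones used in the input examples in this README, and the helper ignores documented defaults (for example, search_query defaults to "machine learning"), so it is stricter than the actor itself.

```python
from typing import Optional

# Required field per mode, taken from the table above.
REQUIRED_FIELD = {
    "keyword_search": "search_query",
    "author_search": "author_name",
    "category_browse": "category",
    "id_lookup": "paper_ids",
}


def missing_required_field(run_input: dict) -> Optional[str]:
    """Return the name of the missing required field, or None if the input looks usable.

    Hypothetical helper: it deliberately ignores the actor's defaults.
    """
    mode = run_input.get("scrape_mode", "keyword_search")
    if mode not in REQUIRED_FIELD:
        raise ValueError(f"unknown scrape_mode: {mode!r}")
    field = REQUIRED_FIELD[mode]
    value = run_input.get(field)
    if isinstance(value, str) and value.strip():
        return None
    return field
```

Calling `missing_required_field({"scrape_mode": "id_lookup"})` reports `paper_ids` as missing, which is cheaper to catch locally than after a run has started.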
2. Configure your search
Fill in the input fields in Apify Console. Only the fields relevant to your chosen mode are required; everything else is optional.
3. Run and export
Click Start and wait for the results. Export your dataset as JSON, CSV, Excel, or connect it to Google Sheets, Slack, or any other integration.
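The same run-and-export flow can also be scripted instead of clicked through. The following is a minimal sketch that assumes the `apify-client` Python package is installed and that an API token is available in an `APIFY_TOKEN` environment variable. The actor ID is not stated in this README, so it is passed in as a parameter, and `build_run_input` and `run_actor` are illustrative names.

```python
import os


def build_run_input(topic: str, category: str, max_results: int = 20) -> dict:
    """Assemble a Keyword Search input using the field names documented in this README."""
    return {
        "scrape_mode": "keyword_search",
        "search_query": topic,
        "category": category,
        "max_results": max_results,
        "sort_by": "submitted_date",
    }


def run_actor(actor_id: str, run_input: dict) -> list:
    """Start the actor, wait for it to finish, and return all dataset items.

    actor_id is whatever ID the Apify Console shows for this actor (not given here).
    """
    # Imported lazily so build_run_input works without the client installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `run_actor(actor_id, build_run_input("graph neural networks", "cs.LG"))` would return the same records that the Console export produces.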
Input parameters
Core settings
| Field | Type | Default | Description |
|---|---|---|---|
| scrape_mode | Select | keyword_search | Which search mode to use |
| search_query | Text | machine learning | Keywords to search for (Keyword Search mode) |
| search_field | Select | all | Which field to search: all, ti (title), abs (abstract), au (author) |
| author_name | Text | none | Researcher name (Author Search mode). Example: Yoshua Bengio |
| category | Text | none | arXiv category code (Category Browse mode, or as a filter in other modes) |
| paper_ids | Text | none | Comma-separated arXiv IDs (ID Lookup mode). Example: 1706.03762,2005.14165 |
Filters
| Field | Type | Default | Description |
|---|---|---|---|
| date_from | Text | none | Only papers from this date onward (YYYY-MM-DD) |
| date_to | Text | none | Only papers up to this date (YYYY-MM-DD) |
| abstract_contains | Text | none | Abstract must contain ALL of these comma-separated keywords |
| exclude_keywords | Text | none | Exclude papers containing ANY of these keywords in title or abstract |
| min_authors | Number | 0 | Minimum number of authors (0 = no filter) |
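The ALL/ANY semantics of these filters can be illustrated with a small predicate over paper records. This is a sketch of the documented behavior, not the actor's code. It assumes case-insensitive substring matching, inclusive date bounds, and filtering on the `published` field, none of which this README confirms.

```python
from datetime import date


def split_keywords(raw: str) -> list:
    """Turn 'survey, review' into ['survey', 'review'], dropping empty entries."""
    return [k.strip().lower() for k in raw.split(",") if k.strip()]


def passes_filters(paper: dict, *, date_from: str = "", date_to: str = "",
                   abstract_contains: str = "", exclude_keywords: str = "",
                   min_authors: int = 0) -> bool:
    """Return True if the paper record survives every filter that is set."""
    published = date.fromisoformat(paper["published"])
    if date_from and published < date.fromisoformat(date_from):
        return False
    if date_to and published > date.fromisoformat(date_to):
        return False

    title = paper.get("title", "").lower()
    abstract = paper.get("abstract", "").lower()

    # abstract_contains: every keyword must appear in the abstract.
    if not all(k in abstract for k in split_keywords(abstract_contains)):
        return False
    # exclude_keywords: any keyword in the title or abstract rejects the paper.
    if any(k in title or k in abstract for k in split_keywords(exclude_keywords)):
        return False
    # min_authors: 0 means no filter.
    return len(paper.get("authors", [])) >= min_authors
```

The same predicate can be reused to post-filter an exported dataset when a run was made with looser settings.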
Output settings
| Field | Type | Default | Description |
|---|---|---|---|
| max_results | Number | 50 | Number of papers to scrape (1 to 2000) |
| sort_by | Select | relevance | Sort order: relevance, submitted_date, or last_updated |
| include_abstract | Boolean | true | Include full abstract text |
| include_pdf_links | Boolean | true | Include direct PDF download URLs |
Input examples
Example 1: Find recent NLP papers about transformers
{"scrape_mode": "keyword_search","search_query": "transformer attention mechanism","search_field": "all","category": "cs.CL","max_results": 50,"sort_by": "submitted_date","date_from": "2025-01-01"}
Example 2: Get all papers by a specific author
{"scrape_mode": "author_search","author_name": "Geoffrey Hinton","max_results": 100,"sort_by": "submitted_date","include_abstract": true}
Example 3: Browse latest Machine Learning papers
{"scrape_mode": "category_browse","category": "cs.LG","max_results": 30,"sort_by": "submitted_date"}
Example 4: Fetch 5 famous AI papers by ID
{"scrape_mode": "id_lookup","paper_ids": "1706.03762, 2005.14165, 1512.03385, 1406.2661, 2302.13971"}
Example 5: Advanced search for diffusion models in ML from 2024 onward, excluding surveys and reviews
{"scrape_mode": "keyword_search","search_query": "diffusion model","search_field": "ti","category": "cs.LG","max_results": 40,"sort_by": "submitted_date","date_from": "2024-01-01","abstract_contains": "generative","exclude_keywords": "survey, review"}
Output format
The actor produces two types of records in the dataset:
Summary record (first item)
The first record is a summary of the entire scrape run:
{"type": "summary","total_papers": 50,"unique_authors": 187,"top_categories": [{ "category": "cs.CL", "count": 32 },{ "category": "cs.LG", "count": 18 },{ "category": "cs.AI", "count": 9 }],"year_distribution": {"2024": 12,"2025": 38},"query_used": "all:transformer AND all:attention AND cat:cs.CL","scrape_mode": "keyword_search"}
Paper records
Each paper is a single JSON object (most fields are scalars; authors, categories, and links are arrays):
{"rank": 1,"arxiv_id": "2302.13971v1","title": "LLaMA: Open and Efficient Foundation Language Models","authors": ["Hugo Touvron", "Thibaut Lavril", "Gautier Izacard", "..."],"authors_short": "Hugo Touvron, Thibaut Lavril, Gautier Izacard et al.","num_authors": 14,"published": "2023-02-27","updated": "2023-02-27","primary_category": "cs.CL","categories": ["cs.CL"],"category_names": ["Computer Science"],"abstract": "We introduce LLaMA, a collection of foundation language models...","abstract_length": 1247,"arxiv_url": "http://arxiv.org/abs/2302.13971v1","pdf_url": "http://arxiv.org/pdf/2302.13971v1","source_url": "https://arxiv.org/e-print/2302.13971v1","doi": "","journal_ref": "","comment": "Submitted to NeurIPS 2023","links": [{ "title": "pdf", "href": "http://arxiv.org/pdf/2302.13971v1", "type": "application/pdf" }],"scraped_at": "2026-04-04T12:00:00+00:00"}
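Because the dataset mixes one summary record with many paper records, downstream code usually needs to separate them first. The sketch below assumes that only the summary record carries `"type": "summary"`, as in the sample output. `split_records` and `year_distribution` are illustrative helper names.

```python
from collections import Counter


def split_records(items: list) -> tuple:
    """Separate the summary record (if any) from the paper records."""
    summary = next((i for i in items if i.get("type") == "summary"), None)
    papers = [i for i in items if i.get("type") != "summary"]
    return summary, papers


def year_distribution(papers: list) -> dict:
    """Count papers per publication year, using the YYYY prefix of `published`."""
    return dict(Counter(p["published"][:4] for p in papers))
```

Recomputing `year_distribution` locally is a simple way to cross-check the summary record after applying extra filtering of your own.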
arXiv categories quick reference
Use these codes in the category field:
Computer Science
| Code | Field |
|---|---|
| cs.AI | Artificial Intelligence |
| cs.CL | Computation & Language (NLP) |
| cs.CV | Computer Vision |
| cs.LG | Machine Learning |
| cs.NE | Neural & Evolutionary Computing |
| cs.RO | Robotics |
| cs.SE | Software Engineering |
| cs.CR | Cryptography & Security |
| cs.DB | Databases |
| cs.IR | Information Retrieval |
Other popular categories
| Code | Field |
|---|---|
| stat.ML | Statistics: Machine Learning |
| math.OC | Mathematics: Optimization & Control |
| quant-ph | Quantum Physics |
| econ.GN | Economics: General |
| q-bio.NC | Quantitative Biology: Neurons & Cognition |
| q-fin.ST | Quantitative Finance: Statistical Finance |
The full category list is at arxiv.org/category_taxonomy.
Advanced query syntax
In search_query, you can use arXiv's native query syntax for precise searches:
| Syntax | Meaning | Example |
|---|---|---|
| ti:keyword | Search in title only | ti:deep learning |
| abs:keyword | Search in abstract only | abs:reinforcement |
| au:name | Search by author | au:bengio |
| cat:code | Search by category | cat:cs.AI |
| AND | Both terms must match | ti:neural AND ti:network |
| OR | Either term matches | cat:cs.CL OR cat:cs.LG |
| ANDNOT | Exclude a term | ti:transformer ANDNOT ti:survey |
| "phrase" | Exact phrase match | abs:"language model" |
Example: Find papers by Bengio about attention in NLP:
au:bengio AND abs:attention AND cat:cs.CL
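To sanity-check a query string before spending a run on it, the same expression can be sent directly to the public arXiv API. The sketch below only builds the request URL. The endpoint and the `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters are those of the arXiv API, but whether the actor constructs its requests this way is an assumption.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def arxiv_query_url(search_query: str, start: int = 0, max_results: int = 10) -> str:
    """Build an arXiv API URL for a native query string, newest submissions first."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    # urlencode percent-encodes ':' and turns spaces into '+'.
    return f"{ARXIV_API}?{urlencode(params)}"


url = arxiv_query_url("au:bengio AND abs:attention AND cat:cs.CL")
```

Opening the resulting URL in a browser returns an Atom feed, which is enough to confirm that the query matches the papers you expect.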
How much does it cost to run?
This actor is very lightweight because arXiv's API is free and doesn't require browser rendering.
| Papers | Estimated time | Apify platform credits |
|---|---|---|
| 50 | ~30 seconds | < $0.01 |
| 200 | ~2 minutes | ~$0.01 |
| 500 | ~5 minutes | ~$0.02 |
| 2000 | ~20 minutes | ~$0.05 |
arXiv asks API clients to pause about 3 seconds between consecutive requests, and the actor applies this delay automatically.
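The delay between requests explains why run time grows with the number of papers. A paged fetch loop with a fixed pause might look like the sketch below. The page size of 100 is an assumption (the actor's real batch size is not documented here), and `fetch_page` is a caller-supplied function, so no network access is involved.

```python
import time


def page_offsets(total: int, page_size: int):
    """Yield (start, count) pairs that together cover `total` results."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


def fetch_all(fetch_page, total: int, page_size: int = 100,
              delay_seconds: float = 3.0, sleep=time.sleep) -> list:
    """Fetch `total` results page by page, pausing between consecutive requests.

    fetch_page(start, count) must return a list of results for that slice.
    """
    results = []
    for i, (start, count) in enumerate(page_offsets(total, page_size)):
        if i > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results.extend(fetch_page(start, count))
    return results
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting.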
Integrations and scheduling
Schedule daily paper monitoring
Set up a scheduled run in Apify Console to scrape new papers in your field every day. Combine with the Google Sheets or Slack integration to get automatic notifications.
Example: Daily cs.AI papers to Slack
- Create a scheduled task with category_browse mode and cs.AI category
- Sort by submitted_date, limit to max_results: 20
- Connect the Slack integration to post results to your #research channel
Export to Google Sheets
After each run, the dataset can be automatically synced to a Google Sheet for easy sharing with your research team.
Tips and best practices
- Start with a small max_results (10 to 20) to preview your query, then scale up.
- Use category as a filter in Keyword Search mode to narrow results to your field.
- Combine abstract_contains with broad searches to find niche papers. For example, search cat:cs.LG but require "graph neural" in the abstract.
- Use exclude_keywords to filter out survey/review papers if you only want original research.
- arXiv limits results to 2000 per query. For larger datasets, run multiple searches with different date ranges.
- Sort by submitted_date for monitoring new papers; sort by relevance for topic exploration.
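The tip about splitting large jobs by date range can be automated by generating one input per window. This sketch assumes that date_from and date_to are both inclusive. The 30-day default window is arbitrary and should be chosen so that each window stays under 2000 results.

```python
from datetime import date, timedelta


def date_windows(start: str, end: str, days: int) -> list:
    """Split [start, end] into consecutive inclusive windows of at most `days` days."""
    if days < 1:
        raise ValueError("days must be at least 1")
    cur = date.fromisoformat(start)
    last = date.fromisoformat(end)
    windows = []
    while cur <= last:
        stop = min(cur + timedelta(days=days - 1), last)
        windows.append((cur.isoformat(), stop.isoformat()))
        cur = stop + timedelta(days=1)
    return windows


def windowed_inputs(base_input: dict, start: str, end: str, days: int = 30) -> list:
    """Create one actor input per date window, each capped at 2000 results."""
    return [
        {**base_input, "date_from": a, "date_to": b, "max_results": 2000}
        for a, b in date_windows(start, end, days)
    ]
```

Each generated input can be run as a separate job and the resulting datasets merged, deduplicating on `arxiv_id`.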
Changelog
v1.0.0 (2026-04-04)
- Initial release
- 4 scraping modes: keyword search, author search, category browse, ID lookup
- Advanced filtering: date range, abstract keywords, exclude keywords, min authors
- Summary statistics with top categories and year distribution
- Batched dataset output for large result sets