ArXiv Scraper

Scrape ArXiv research papers — titles, authors, abstracts, subjects, submission dates, and PDF links.

Developer

Stas Persiianenko

Maintained by Community

Scrape research papers from ArXiv by keyword. Extract titles, authors, abstracts, subjects, submission dates, comments, and PDF links from search results.

What does ArXiv Scraper do?

ArXiv Scraper searches ArXiv for research papers matching your keywords and extracts structured data from the results. It collects complete paper metadata including full abstracts, author lists, subject categories, and direct PDF download links.

The scraper uses ArXiv's search interface and supports sorting by relevance, submission date, or announcement date with configurable result limits and pagination.

Why scrape ArXiv?

ArXiv is the world's largest open-access repository for scientific preprints, hosting over 2.5 million papers across physics, mathematics, computer science, biology, economics, and more. Researchers submit papers to ArXiv before or alongside traditional journal publication.

Key reasons to scrape ArXiv:

  • Literature reviews — Collect papers on a topic for systematic reviews
  • Research monitoring — Track new papers in your field of study
  • Citation analysis — Build datasets of papers for bibliometric research
  • ML training data — Gather abstracts and metadata for NLP models
  • Competitive intelligence — Monitor research output from specific institutions

Use cases

  • Academic researchers tracking publications in their field
  • Data scientists building paper recommendation systems
  • Research teams doing systematic literature reviews
  • AI companies monitoring state-of-the-art research
  • PhD students surveying related work for dissertations
  • Science journalists tracking breakthroughs across disciplines

How to scrape ArXiv

  1. Go to ArXiv Scraper on Apify Store
  2. Enter one or more search keywords
  3. Choose sort order (relevance, submission date, or announcement date)
  4. Set max results per search and max pages
  5. Click Start and wait for results
  6. Download data as JSON, CSV, or Excel

Input parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| searchQueries | string[] | (required) | Keywords to search on ArXiv |
| sortBy | string | "relevance" | Sort by: relevance, submittedDate, or announcedDate |
| sortOrder | string | "descending" | Sort direction: descending or ascending |
| maxResultsPerSearch | integer | 100 | Max papers per keyword |
| maxSearchPages | integer | 5 | Max pages per keyword (50 papers/page) |
| maxRequestRetries | integer | 3 | Retry attempts for failed requests |
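
As a rough illustration of how these parameters fit together, the sketch below fills in the documented defaults and rejects sort values outside the documented set. The helper name `build_input` is hypothetical and is not part of the actor or the Apify client; the actor does its own validation, which may differ from this.

```python
# Hypothetical helper: not part of ArXiv Scraper or apify-client.
# Defaults and allowed values are copied from the parameter table.
DEFAULTS = {
    "sortBy": "relevance",
    "sortOrder": "descending",
    "maxResultsPerSearch": 100,
    "maxSearchPages": 5,
    "maxRequestRetries": 3,
}
ALLOWED_SORT_BY = {"relevance", "submittedDate", "announcedDate"}
ALLOWED_SORT_ORDER = {"descending", "ascending"}


def build_input(search_queries, **overrides):
    """Return an actor input dict with the documented defaults filled in."""
    if not search_queries:
        raise ValueError("searchQueries is required and must not be empty")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    # Later entries win, so explicit overrides replace the defaults.
    actor_input = {"searchQueries": list(search_queries), **DEFAULTS, **overrides}
    if actor_input["sortBy"] not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if actor_input["sortOrder"] not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")
    return actor_input


example = build_input(["large language model"], sortBy="submittedDate", maxSearchPages=2)
print(example)
```

The resulting dict can be passed as `run_input` when calling the actor through the Apify client.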

Input example

{
    "searchQueries": ["transformer neural network", "large language model"],
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "maxResultsPerSearch": 50,
    "maxSearchPages": 2
}

Output

Each paper in the dataset contains:

| Field | Type | Description |
|---|---|---|
| arxivId | string | ArXiv paper ID (e.g., "2603.00888") |
| title | string | Paper title |
| authors | string[] | List of author names |
| abstract | string | Full abstract text |
| subjects | string[] | ArXiv subject categories (e.g., "cs.LG") |
| submittedDate | string | Submission date (e.g., "28 February, 2026") |
| comments | string | Author comments (page count, conference, etc.) |
| pdfUrl | string | Direct link to PDF |
| abstractUrl | string | Link to abstract page |
| scrapedAt | string | ISO timestamp of extraction |

Output example

{
    "arxivId": "2603.00853",
    "title": "Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration",
    "authors": ["Cong Wang", "Jinshan Pan", "Liyan Wang", "Wei Wang", "Yang Yang"],
    "abstract": "We propose a simple yet effective UHDPromer, a neural discrimination-prompted Transformer, for Ultra-High-Definition (UHD) image restoration and enhancement...",
    "subjects": ["cs.CV"],
    "submittedDate": "28 February, 2026",
    "comments": "Accepted by IJCV'26; code is available at https://github.com/supersupercong/uhdpromer",
    "pdfUrl": "https://arxiv.org/pdf/2603.00853",
    "abstractUrl": "https://arxiv.org/abs/2603.00853",
    "scrapedAt": "2026-03-03T02:40:25.176Z"
}
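
The submittedDate field is a human-readable string rather than an ISO date, so sorting or filtering by date usually needs a parsing step first. A minimal sketch, assuming every record uses the "28 February, 2026" layout shown in the example and that Python runs under an English locale (the `%B` directive is locale-dependent); the function name is illustrative only:

```python
from datetime import date, datetime


def parse_submitted_date(text: str) -> date:
    """Parse a date such as "28 February, 2026" into a datetime.date."""
    # %d = day of month, %B = full month name, %Y = four-digit year.
    return datetime.strptime(text.strip(), "%d %B, %Y").date()


record = {"arxivId": "2603.00853", "submittedDate": "28 February, 2026"}
submitted = parse_submitted_date(record["submittedDate"])
print(submitted.isoformat())  # 2026-02-28
```

If the actor ever emits a different date layout, `strptime` raises `ValueError`, which is a reasonable signal to inspect the raw value instead of guessing.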

Pricing

ArXiv Scraper uses pay-per-event pricing:

| Event | Price |
|---|---|
| Run started | $0.001 |
| Paper extracted | $0.002 per paper |

Cost examples

| Scenario | Papers | Cost |
|---|---|---|
| Quick search | 50 | $0.101 |
| Medium search | 200 | $0.401 |
| Large survey | 500 | $1.001 |

Platform costs (compute) are minimal — typically under $0.001 per run.
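
The cost examples follow from one run-start event plus one event per extracted paper. The sketch below reproduces that arithmetic, assuming a single run and leaving out platform compute costs; the function name is illustrative, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

RUN_START_FEE = Decimal("0.001")  # "Run started" event
PER_PAPER_FEE = Decimal("0.002")  # "Paper extracted" event


def estimate_cost(papers: int) -> Decimal:
    """Estimate the event charges in USD for one run extracting `papers` papers."""
    if papers < 0:
        raise ValueError("papers must be non-negative")
    return RUN_START_FEE + PER_PAPER_FEE * papers


for n in (50, 200, 500):
    print(n, estimate_cost(n))
# 50 0.101
# 200 0.401
# 500 1.001
```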

Using ArXiv Scraper with the Apify API

Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('automation-lab/arxiv-scraper').call({
    searchQueries: ['attention mechanism'],
    sortBy: 'submittedDate',
    maxResultsPerSearch: 100,
});

// Fetch the scraped papers from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Found ${items.length} papers`);
items.forEach((paper) => {
    console.log(`${paper.arxivId}: ${paper.title}`);
});

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient('YOUR_API_TOKEN')

# Start the actor and wait for the run to finish
run = client.actor('automation-lab/arxiv-scraper').call(run_input={
    'searchQueries': ['attention mechanism'],
    'sortBy': 'submittedDate',
    'maxResultsPerSearch': 100,
})

# Fetch the scraped papers from the run's default dataset
dataset = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Found {len(dataset)} papers')
for paper in dataset:
    print(f"{paper['arxivId']}: {paper['title']}")
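
Once the items are in memory, they can also be written to CSV locally instead of downloaded from the Apify console. The following is a sketch using only the standard library; the column selection and the `papers_to_csv` name are choices made for this example, and list-valued fields are joined with "; " because CSV cells are flat.

```python
import csv
import io

# Columns to export; any other keys on a record (e.g. abstract) are ignored.
FIELDS = ["arxivId", "title", "authors", "subjects", "submittedDate", "pdfUrl"]


def papers_to_csv(papers, out):
    """Write paper records to the text stream `out` as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for paper in papers:
        row = dict(paper)
        # Flatten list fields into a single delimited string.
        row["authors"] = "; ".join(paper.get("authors", []))
        row["subjects"] = "; ".join(paper.get("subjects", []))
        writer.writerow(row)


sample = [{
    "arxivId": "2603.00853",
    "title": "Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration",
    "authors": ["Cong Wang", "Jinshan Pan"],
    "subjects": ["cs.CV"],
    "submittedDate": "28 February, 2026",
    "pdfUrl": "https://arxiv.org/pdf/2603.00853",
    "scrapedAt": "2026-03-03T02:40:25.176Z",
}]
buffer = io.StringIO()
papers_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("papers.csv", "w", newline="", encoding="utf-8")`; the `newline=""` argument is what the `csv` module documentation recommends.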

Integrations

ArXiv Scraper works with all Apify integrations:

  • Webhooks — Get notified when a scrape completes
  • API — Trigger runs programmatically and fetch results
  • Scheduled runs — Monitor ArXiv on a daily or weekly schedule
  • Google Sheets — Export papers directly to a spreadsheet
  • Slack / Email — Send notifications when new papers match your criteria

Connect ArXiv Scraper to Zapier, Make, or Google Sheets for automated workflows.

Tips

  • Use specific keywords for better results — ArXiv's search is broad by default
  • Sort by submission date to find the latest papers first
  • Combine multiple queries to search across related topics in a single run
  • Check subject codes — ArXiv uses category codes like cs.LG (Machine Learning), cs.CV (Computer Vision), stat.ML (Statistics ML)
  • Set reasonable limits — Start with 50–100 papers per search and increase if needed
  • PDF links work directly — Download PDFs programmatically using the pdfUrl field
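
Two of these tips lend themselves to a little post-processing. When several related queries run together, the same paper may match more than one of them, and subject codes give a simple way to narrow the combined results. The sketch below assumes duplicates can occur across queries (this document does not say whether the actor removes them) and uses made-up placeholder IDs; both helper names are illustrative.

```python
def dedupe_by_id(papers):
    """Keep the first record seen for each arxivId, preserving order."""
    seen = set()
    unique = []
    for paper in papers:
        if paper["arxivId"] not in seen:
            seen.add(paper["arxivId"])
            unique.append(paper)
    return unique


def filter_by_subject(papers, code):
    """Return the papers whose subjects list contains the given category code."""
    return [p for p in papers if code in p.get("subjects", [])]


# Placeholder records for illustration only, not real papers.
items = [
    {"arxivId": "0000.00001", "subjects": ["cs.LG", "stat.ML"]},
    {"arxivId": "0000.00002", "subjects": ["cs.CV"]},
    {"arxivId": "0000.00001", "subjects": ["cs.LG", "stat.ML"]},  # matched by a second query
]
unique = dedupe_by_id(items)
ml_only = filter_by_subject(unique, "cs.LG")
print(len(unique), [p["arxivId"] for p in ml_only])
```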

FAQ

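How many papers can I scrape? Each search page returns up to 50 papers, so the page limit allows at most maxSearchPages × 50 results per keyword. With maxSearchPages set to 20, you can get up to 1,000 papers per keyword. Because maxResultsPerSearch also limits papers per keyword and defaults to 100, raise it to at least the number of papers you want.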
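
The arithmetic behind this answer can be sketched as follows. The page size of 50 comes from this document; the second function assumes that maxResultsPerSearch and maxSearchPages both cap a keyword's results, so the smaller limit wins, which the parameter table suggests but does not state outright. Both function names are illustrative.

```python
import math

PAPERS_PER_PAGE = 50  # page size stated in this document


def pages_needed(target_papers: int) -> int:
    """Smallest maxSearchPages value that can cover `target_papers` results."""
    return math.ceil(target_papers / PAPERS_PER_PAGE)


def max_papers(max_search_pages: int, max_results_per_search: int) -> int:
    """Upper bound on papers per keyword, assuming both limits apply."""
    return min(max_search_pages * PAPERS_PER_PAGE, max_results_per_search)


print(pages_needed(1000))    # 20
print(max_papers(5, 100))    # 100 (the documented defaults)
print(max_papers(20, 1000))  # 1000
```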

Does it scrape full paper text? No — it extracts metadata and abstracts from search results. For full paper text, download the PDF using the provided pdfUrl.

Can I search by author? The scraper currently uses ArXiv's "all fields" search. Include author names in your search keywords to find papers by specific researchers.

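How often is ArXiv updated? ArXiv accepts submissions continuously, but new papers are announced in batches on a fixed schedule, Sunday through Thursday, with no announcements on Fridays or Saturdays. Sort by submission date to see the latest papers.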