ArXiv Scraper

Scrape ArXiv research papers — titles, authors, abstracts, subjects, submission dates, and PDF links.

Pricing

Pay per event

Developer

Stas Persiianenko

Last modified

10 days ago

What does ArXiv Scraper do?

ArXiv Scraper searches ArXiv for research papers matching your keywords and extracts structured data from the results. It collects complete paper metadata including full abstracts, author lists, subject categories, and direct PDF download links.

The scraper uses ArXiv's search interface and supports sorting by relevance, submission date, or announcement date with configurable result limits and pagination.

Who is it for?

🎓 Academic researchers — collecting papers and metadata for literature reviews
📊 Data scientists — building datasets of research trends and citation patterns
🤖 ML engineers — tracking new model releases and benchmark results
📰 Science journalists — monitoring breakthroughs across research fields
🏢 R&D teams — staying current with state-of-the-art publications in their domain

Why scrape ArXiv?

ArXiv is the world's largest open-access repository for scientific preprints, hosting over 2.5 million papers across physics, mathematics, computer science, biology, economics, and more. Researchers submit papers to ArXiv before or alongside traditional journal publication.

Key reasons to scrape ArXiv:

Literature reviews — Collect papers on a topic for systematic reviews
Research monitoring — Track new papers in your field of study
Citation analysis — Build datasets of papers for bibliometric research
ML training data — Gather abstracts and metadata for NLP models, LLM fine-tuning, or RAG pipeline knowledge bases
Competitive intelligence — Monitor research output from specific institutions

Use cases

Academic researchers tracking publications in their field
Data scientists building paper recommendation systems
Research teams doing systematic literature reviews
AI companies monitoring state-of-the-art research
PhD students surveying related work for dissertations
Science journalists tracking breakthroughs across disciplines
AI/ML engineers building RAG pipelines or training datasets from academic literature

How to scrape ArXiv

Go to ArXiv Scraper on Apify Store
Enter one or more search keywords
Choose sort order (relevance, submission date, or announcement date)
Set max results per search and max pages
Click Start and wait for results
Download data as JSON, CSV, or Excel

Input parameters

Parameter	Type	Default	Description
`searchQueries`	string[]	(required)	Keywords to search on ArXiv
`sortBy`	string	`"relevance"`	Sort by: `relevance`, `submittedDate`, or `announcedDate`
`sortOrder`	string	`"descending"`	Sort direction: `descending` or `ascending`
`maxResultsPerSearch`	integer	`100`	Max papers per keyword
`maxSearchPages`	integer	`5`	Max pages per keyword (50 papers/page)
`maxRequestRetries`	integer	`3`	Retry attempts for failed requests
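
The parameter table can be turned into a small pre-flight check before a run is started. The following is a minimal sketch, not the actor's own validation: `build_run_input` is a hypothetical helper, and the numeric bounds it enforces (limits of at least 1, retries of at least 0) are assumptions, since the table does not state any.

```python
# Allowed values and defaults are copied from the parameter table.
ALLOWED_SORT_BY = ("relevance", "submittedDate", "announcedDate")
ALLOWED_SORT_ORDER = ("descending", "ascending")


def build_run_input(search_queries, sort_by="relevance", sort_order="descending",
                    max_results_per_search=100, max_search_pages=5,
                    max_request_retries=3):
    """Build the actor input dict, raising ValueError on values the table rules out."""
    # Drop empty or whitespace-only keywords; searchQueries is required.
    queries = [q.strip() for q in search_queries if q and q.strip()]
    if not queries:
        raise ValueError("searchQueries needs at least one non-empty keyword")
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {ALLOWED_SORT_BY}")
    if sort_order not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {ALLOWED_SORT_ORDER}")
    # Assumption: limits must be >= 1 and retries >= 0 (the table gives no bounds).
    if min(max_results_per_search, max_search_pages) < 1 or max_request_retries < 0:
        raise ValueError("limits must be positive and retries non-negative")
    return {
        "searchQueries": queries,
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "maxResultsPerSearch": max_results_per_search,
        "maxSearchPages": max_search_pages,
        "maxRequestRetries": max_request_retries,
    }
```

The returned dict can be passed as `run_input` to the Python client shown under API usage.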

Input example

{
    "searchQueries": ["transformer neural network", "large language model"],
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "maxResultsPerSearch": 50,
    "maxSearchPages": 2
}

Output

Each paper in the dataset contains:

Field	Type	Description
`arxivId`	string	ArXiv paper ID (e.g., `"2603.00888"`)
`title`	string	Paper title
`authors`	string[]	List of author names
`abstract`	string	Full abstract text
`subjects`	string[]	ArXiv subject categories (e.g., `"cs.LG"`)
`submittedDate`	string	Submission date (e.g., `"28 February, 2026"`)
`comments`	string	Author comments (page count, conference, etc.)
`pdfUrl`	string	Direct link to PDF
`abstractUrl`	string	Link to abstract page
`scrapedAt`	string	ISO timestamp of extraction

Output example

{
    "arxivId": "2603.00853",
    "title": "Neural Discrimination-Prompted Transformers for Efficient UHD Image Restoration",
    "authors": ["Cong Wang", "Jinshan Pan", "Liyan Wang", "Wei Wang", "Yang Yang"],
    "abstract": "We propose a simple yet effective UHDPromer, a neural discrimination-prompted Transformer, for Ultra-High-Definition (UHD) image restoration and enhancement...",
    "subjects": ["cs.CV"],
    "submittedDate": "28 February, 2026",
    "comments": "Accepted by IJCV'26; code is available at https://github.com/supersupercong/uhdpromer",
    "pdfUrl": "https://arxiv.org/pdf/2603.00853",
    "abstractUrl": "https://arxiv.org/abs/2603.00853",
    "scrapedAt": "2026-03-03T02:40:25.176Z"
}
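
Because `submittedDate` is a human-readable string rather than an ISO date, sorting or filtering by date needs a parsing step first. A minimal sketch, assuming every record uses the same `"28 February, 2026"` layout as the example above (the helper names are hypothetical):

```python
from datetime import date, datetime


def parse_submitted_date(text: str) -> date:
    """Parse a value such as '28 February, 2026' into a datetime.date."""
    # %B matches the full English month name under Python's default "C" locale.
    return datetime.strptime(text.strip(), "%d %B, %Y").date()


def newest_first(papers):
    """Return the papers sorted by parsed submission date, most recent first."""
    return sorted(
        papers,
        key=lambda paper: parse_submitted_date(paper["submittedDate"]),
        reverse=True,
    )
```

If some records turn out to use a different layout, `strptime` raises `ValueError`, which makes the mismatch visible instead of silently mis-sorting.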

How much does it cost to scrape ArXiv?

ArXiv Scraper uses pay-per-event pricing:

Event	Price
Run started	$0.001
Paper extracted	$0.002 per paper

Cost examples

Scenario	Papers	Cost
Quick search	50	$0.101
Medium search	200	$0.401
Large survey	500	$1.001

Platform costs (compute) are minimal — typically under $0.001 per run.
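
The event prices above reduce to a simple formula: one run-start fee plus a per-paper fee. The sketch below reproduces the cost examples; `estimate_cost` is a hypothetical helper, it ignores the small platform compute cost, and it uses `Decimal` so that fractions of a cent are not distorted by floating-point rounding.

```python
from decimal import Decimal

# Prices copied from the pay-per-event table.
RUN_START_PRICE = Decimal("0.001")
PRICE_PER_PAPER = Decimal("0.002")


def estimate_cost(papers: int, runs: int = 1) -> Decimal:
    """Estimate event charges in USD for a number of papers spread over some runs."""
    if papers < 0 or runs < 1:
        raise ValueError("papers must be >= 0 and runs must be >= 1")
    return RUN_START_PRICE * runs + PRICE_PER_PAPER * papers
```

For example, `estimate_cost(200)` gives `Decimal("0.401")`, matching the medium search row.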

API usage

Node.js

import { ApifyClient } from 'apify-client';

// Replace with your own Apify API token
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('automation-lab/arxiv-scraper').call({
    searchQueries: ['attention mechanism'],
    sortBy: 'submittedDate',
    maxResultsPerSearch: 100,
});

// Read the scraped papers from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Found ${items.length} papers`);
items.forEach((paper) => {
    console.log(`${paper.arxivId}: ${paper.title}`);
});

Python

from apify_client import ApifyClient

# Replace with your own Apify API token
client = ApifyClient('YOUR_API_TOKEN')

# Start the actor and wait for the run to finish
run = client.actor('automation-lab/arxiv-scraper').call(run_input={
    'searchQueries': ['attention mechanism'],
    'sortBy': 'submittedDate',
    'maxResultsPerSearch': 100,
})

# Read the scraped papers from the run's default dataset
items = client.dataset(run['defaultDatasetId']).list_items().items
print(f'Found {len(items)} papers')
for paper in items:
    print(f"{paper['arxivId']}: {paper['title']}")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~arxiv-scraper/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "searchQueries": ["transformer neural network"],
    "sortBy": "submittedDate",
    "maxResultsPerSearch": 50
  }'
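
The cURL call above only starts a run; the papers are then read from the run's default dataset in a second request. The sketch below builds both URLs. It assumes the Apify API v2 layout as I understand it (the run response carries the dataset ID under `data.defaultDatasetId`, and items are served from `/v2/datasets/{id}/items`), so check the Apify API reference before relying on it. The helper names are hypothetical.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.apify.com/v2"


def start_run_url(actor_id: str, token: str) -> str:
    """URL for POSTing the input JSON to start a run, as in the cURL example."""
    # The API path uses "~" instead of "/" between the user name and the actor name.
    path_id = actor_id.replace("/", "~")
    return f"{API_BASE}/acts/{path_id}/runs?{urlencode({'token': token})}"


def dataset_items_url(dataset_id: str, token: str, fmt: str = "json") -> str:
    """URL for GETting the items of a dataset in the requested format."""
    query = urlencode({"format": fmt, "token": token})
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/items?{query}"
```

Passing `fmt="csv"` requests the same items as CSV, which corresponds to the export formats listed under "How to scrape ArXiv".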

Integrations

ArXiv Scraper works with all Apify integrations:

Webhooks — Get notified when a scrape completes
API — Trigger runs programmatically and fetch results
Scheduled runs — Monitor ArXiv on a daily or weekly schedule
Google Sheets — Export papers directly to a spreadsheet
Slack / Email — Send notifications when new papers match your criteria

Connect ArXiv Scraper to Zapier, Make, or Google Sheets for automated workflows.
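
For scheduled monitoring, consecutive runs of the same query will mostly return papers that were already seen. One way to notify only on new papers is to track `arxivId` values between runs. A minimal sketch, where `split_new_papers` is a hypothetical helper and persisting the seen IDs (file, key-value store, spreadsheet) is left to the caller:

```python
def split_new_papers(items, seen_ids):
    """Return (new_papers, updated_seen_ids), using arxivId as the identity key."""
    seen = set(seen_ids)
    fresh = []
    for paper in items:
        paper_id = paper.get("arxivId")
        # Skip records without an ID and anything seen in an earlier run or in this one.
        if paper_id and paper_id not in seen:
            seen.add(paper_id)
            fresh.append(paper)
    return fresh, seen
```

Only the `fresh` list then needs to be forwarded to Slack, email, or a spreadsheet.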

Tips

Use specific keywords for better results — ArXiv's search is broad by default
Sort by submission date to find the latest papers first
Combine multiple queries to search across related topics in a single run
Check subject codes — ArXiv uses category codes like `cs.LG` (Machine Learning), `cs.CV` (Computer Vision and Pattern Recognition), and `stat.ML` (Machine Learning, under Statistics)
Set reasonable limits — Start with 50–100 papers per search and increase if needed
PDF links work directly — Download PDFs programmatically using the pdfUrl field
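
Two of the tips above, checking subject codes and downloading PDFs, are easy to apply to the dataset items after a run. The sketch below assumes the record shape from the output example; `filter_by_subject` and `pdf_filename` are hypothetical helper names.

```python
def filter_by_subject(papers, codes):
    """Keep papers that list at least one of the given category codes."""
    wanted = set(codes)
    return [p for p in papers if wanted.intersection(p.get("subjects", []))]


def pdf_filename(paper):
    """Derive a local file name from the paper ID."""
    # Old-style arXiv IDs such as "hep-th/9901001" contain a slash,
    # which is replaced so the result is a valid file name.
    return paper["arxivId"].replace("/", "_") + ".pdf"
```

A download loop could then call `urllib.request.urlretrieve(paper["pdfUrl"], pdf_filename(paper))` for each selected paper, ideally with a pause between requests to stay polite toward ArXiv.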

Use with AI agents via MCP

ArXiv Scraper is available as a tool for AI assistants via the Model Context Protocol (MCP).

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/arxiv-scraper"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com?tools=automation-lab/arxiv-scraper"
        }
    }
}

Example prompts

"Find recent papers about 'transformer architectures' on ArXiv"
"Get the top AI papers published this month on ArXiv"
"Search ArXiv for papers on reinforcement learning from human feedback and summarize the key findings"

Legality

Scraping publicly available data is generally considered legal in the United States, following the Ninth Circuit Court of Appeals ruling in hiQ Labs v. LinkedIn. This actor only accesses publicly available information and does not require authentication. Always review and comply with the target website's Terms of Service before scraping. For personal data, ensure compliance with GDPR, CCPA, and other applicable privacy regulations.

FAQ

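How many papers can I scrape? Each search page returns up to 50 papers, so setting maxSearchPages to 20 allows up to 1,000 papers per keyword. Raise maxResultsPerSearch (default 100) to match, since it also limits the number of papers per keyword.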
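
The arithmetic behind this answer can be sketched as follows. It assumes that both `maxSearchPages` and `maxResultsPerSearch` are enforced and that the smaller limit wins, which is how the parameter descriptions read but is not confirmed here; the function names are hypothetical.

```python
import math

PAPERS_PER_PAGE = 50  # page size stated in the parameter table


def max_papers_per_keyword(max_search_pages: int, max_results_per_search: int) -> int:
    """Upper bound on papers per keyword under the two limits (assumed: smaller wins)."""
    return min(max_results_per_search, max_search_pages * PAPERS_PER_PAGE)


def pages_needed(target_papers: int) -> int:
    """Smallest maxSearchPages value that can cover the target number of papers."""
    return math.ceil(target_papers / PAPERS_PER_PAGE)
```

With the defaults (5 pages, 100 results) the bound is 100 papers; reaching 1,000 papers would need 20 pages and a result limit of at least 1,000.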

Does it scrape full paper text? No — it extracts metadata and abstracts from search results. For full paper text, download the PDF using the provided pdfUrl.

Can I search by author? The scraper currently uses ArXiv's "all fields" search. Include author names in your search keywords to find papers by specific researchers.
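
Since an "all fields" search can also match a name that appears in a title or abstract, it may help to filter the results on the `authors` field afterwards. A minimal sketch, with `papers_by_author` as a hypothetical helper doing a case-insensitive substring match:

```python
def papers_by_author(papers, name):
    """Keep papers where any listed author contains the given name (case-insensitive)."""
    needle = name.casefold()
    return [
        paper for paper in papers
        if any(needle in author.casefold() for author in paper.get("authors", []))
    ]
```

Substring matching is deliberately loose, so a common surname can still match several different people.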

How often is ArXiv updated? ArXiv accepts submissions continuously and announces new papers in batches, typically Sunday through Thursday. Sort by submission date to see the latest papers.

Why am I getting empty results? ArXiv's search can be strict with certain keyword combinations, so try simpler or broader keywords. If you get fewer papers than expected rather than none, check that maxSearchPages is high enough: each page returns at most 50 papers.

Why is the scraper running slowly or timing out? ArXiv may throttle requests during peak hours. Increase maxRequestRetries to handle transient failures, and avoid running very large scrapes (500+ papers) during US business hours when ArXiv traffic is highest.

Other research and academic scrapers

Crossref Scraper — Search and extract scholarly article metadata from Crossref.
OpenAlex Scraper — Extract research papers and citation data from OpenAlex.
Semantic Scholar Scraper — Scrape paper metadata and citations from Semantic Scholar.
Wikipedia Scraper — Extract articles and structured data from Wikipedia.
