arXiv Paper Scraper - Search Research Preprints, No API Key

Pricing

$1.00 / 1,000 papers scraped

Scrape arXiv research papers by keyword, category (cs.AI, cs.LG, quant-ph) or author. Returns titles, abstracts, authors, dates, DOIs & PDF links as clean JSON. No API key. Use it as an MCP server in Claude, ChatGPT & AI agents for research monitoring.

Developer

The Mine Works

Maintained by Community

arXiv Preprint Search — AI, Physics, Math, Biology & More

Search 2.3 million+ preprints from arXiv — the world's leading open-access repository for scientific preprints — directly from Apify. Retrieve titles, abstracts, authors, categories, submission dates, and direct PDF links across computer science, physics, mathematics, quantitative biology, economics, and more. No API key required.

Why This Actor?

arXiv is where cutting-edge research in AI, machine learning, and physics appears months before peer-reviewed publication. Virtually every significant deep learning paper — from the original Transformer to GPT, BERT, Stable Diffusion, and beyond — was posted to arXiv first. For anyone tracking the frontier of science and technology in real time, arXiv is an indispensable source.

This actor wraps the official arXiv Atom API (export.arxiv.org/api/query) and delivers clean structured JSON, one paper per dataset row, with built-in rate limiting that stays within arXiv's published guideline of no more than one request every 3 seconds.

Target buyers and use cases:

  • AI/ML researchers and engineers monitoring cs.AI, cs.LG, and cs.CL daily for new techniques, architectures, and benchmarks
  • Tech companies building literature pipelines that surface relevant research to internal teams, feed competitive intelligence, or trigger alerts on new papers in a domain
  • NLP training data teams collecting abstracts and titles as structured corpora for language model pretraining, fine-tuning, or evaluation
  • Investors and analysts tracking research output as a leading indicator of commercial activity in AI, biotech, quantum computing, and other technology sectors
  • University research groups conducting systematic mapping reviews of a field's preprint landscape

arXiv Categories

arXiv organizes preprints into a taxonomy of subject areas. Common categories include:

| Code | Subject |
| --- | --- |
| cs.AI | Artificial Intelligence |
| cs.LG | Machine Learning |
| cs.CL | Computation and Language (NLP) |
| cs.CV | Computer Vision and Pattern Recognition |
| cs.RO | Robotics |
| quant-ph | Quantum Physics |
| q-bio.GN | Genomics |
| math.AG | Algebraic Geometry |
| econ.GN | General Economics |
| stat.ML | Statistics: Machine Learning |

Leave the category field blank to search across all subject areas simultaneously.

Query Syntax

The actor supports arXiv's native query prefix syntax:

| Prefix | Field | Example |
| --- | --- | --- |
| ti: | Title | ti:attention mechanism |
| au: | Author | au:Vaswani |
| abs: | Abstract | abs:diffusion model |
| cat: | Category | cat:cs.AI |

Combine prefixes with AND, OR, ANDNOT: ti:transformer AND cat:cs.CL ANDNOT abs:image
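
As a rough illustration of the syntax above, the sketch below assembles a query string from prefixed terms. The helper names (`term`, `buildQuery`) are hypothetical and not part of the actor. Wrapping multi-word values in double quotes follows the arXiv API's phrase convention; whether this actor needs or expects the quotes is an assumption.

```javascript
// Hypothetical helper, not part of the actor: builds an arXiv-style query string.
const PREFIXES = ['ti', 'au', 'abs', 'cat'];

// Formats one prefixed term. Multi-word values are quoted so the whole phrase
// stays attached to its prefix (assumption: the actor passes quotes through).
function term(prefix, value) {
  const text = String(value).trim();
  return /\s/.test(text) ? `${prefix}:"${text}"` : `${prefix}:${text}`;
}

// `include` terms are joined with AND; each `exclude` term is appended with ANDNOT.
function buildQuery(include = {}, exclude = {}) {
  const pick = (fields) =>
    PREFIXES.filter((p) => fields[p]).map((p) => term(p, fields[p]));
  const included = pick(include);
  if (included.length === 0) {
    throw new Error('at least one included term is required');
  }
  const excluded = pick(exclude).map((t) => `ANDNOT ${t}`);
  return [included.join(' AND '), ...excluded].join(' ');
}

console.log(buildQuery({ ti: 'transformer', cat: 'cs.CL' }, { abs: 'image' }));
// ti:transformer AND cat:cs.CL ANDNOT abs:image
```

The resulting string would go into the actor's query input field.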

Inputs

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| query | string | Search query with optional prefix syntax | large language models |
| category | string | arXiv category code (e.g. cs.AI, quant-ph). Leave blank for all | |
| sortBy | select | submittedDate, relevance, or lastUpdatedDate | submittedDate |
| dateFrom | string | Filter papers submitted from this date (YYYYMMDD) | |
| maxResults | integer | Maximum papers to return (1–2,000) | 100 |
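
One way to prepare these inputs in code is sketched below: it formats a date as YYYYMMDD for dateFrom and keeps maxResults inside the 1–2,000 range. The function names are hypothetical, and using UTC for the date is an assumption, since the page does not say which time zone the actor applies.

```javascript
// Formats a Date as YYYYMMDD (UTC is an assumption; the time zone is not documented).
function toDateFrom(date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, '0');
  const day = String(date.getUTCDate()).padStart(2, '0');
  return `${year}${month}${day}`;
}

// Keeps maxResults within the documented 1–2,000 range.
function clampMaxResults(n) {
  return Math.min(2000, Math.max(1, Math.trunc(n)));
}

// Hypothetical helper: actor input for "papers submitted in the last `days` days".
function recentPapersInput(query, days, maxResults = 100, now = new Date()) {
  const from = new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  return {
    query,
    sortBy: 'submittedDate',
    dateFrom: toDateFrom(from),
    maxResults: clampMaxResults(maxResults),
  };
}

console.log(toDateFrom(new Date(Date.UTC(2024, 10, 15))));
// 20241115
```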

Output Format

Each paper is stored as one item in the Apify dataset:

{
  "arxiv_id": "2310.06825v2",
  "title": "Mistral 7B",
  "abstract": "We introduce Mistral 7B, a 7-billion-parameter language model engineered for superior performance and efficiency...",
  "authors": ["Albert Q. Jiang", "Alexandre Sablayrolles", "Arthur Mensch"],
  "categories": ["cs.CL", "cs.AI", "cs.LG"],
  "primary_category": "cs.CL",
  "published_date": "2023-10-10T17:27:55Z",
  "updated_date": "2023-10-16T17:27:55Z",
  "doi": null,
  "pdf_url": "https://arxiv.org/pdf/2310.06825v2",
  "url": "https://arxiv.org/abs/2310.06825v2",
  "scraped_at": "2024-11-15T09:22:11.000Z"
}

A summary record is appended at the end with total paper count and run timestamp.
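
Because the dataset mixes paper items with a trailing summary record, downstream code may want to separate the two. The sketch below assumes that only paper items carry a string `arxiv_id`; the shape of the summary record is not documented here, so that test is a guess. It also splits the version suffix from an identifier such as 2310.06825v2. Both function names are hypothetical.

```javascript
// Splits an identifier like "2310.06825v2" into its base ID and version number.
function parseArxivId(arxivId) {
  const match = /^(.+?)(?:v(\d+))?$/.exec(arxivId);
  return { baseId: match[1], version: match[2] ? Number(match[2]) : null };
}

// Separates paper items from anything else in the dataset.
// Assumption: only paper items have a string `arxiv_id` field.
function splitDataset(items) {
  const isPaper = (item) => typeof item.arxiv_id === 'string';
  return {
    papers: items.filter(isPaper),
    other: items.filter((item) => !isPaper(item)),
  };
}

console.log(parseArxivId('2310.06825v2'));
// { baseId: '2310.06825', version: 2 }
```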

Rate Limiting

arXiv explicitly requests that automated access make no more than one request every 3 seconds and avoid peak hours. This actor enforces a 3.1-second delay between paginated API calls and retries automatically with exponential backoff on 429 and 5xx responses. Do not run multiple concurrent instances against arXiv; the actor is designed for sequential, respectful access.
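
For readers unfamiliar with the retry pattern described above, here is a minimal sketch of exponential backoff. The base delay, the cap, and the function names are illustrative assumptions; the actor's actual backoff parameters are not documented on this page.

```javascript
// Retry only on the statuses named above: 429 and any 5xx.
function shouldRetry(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs.
// The defaults are illustrative, not the actor's real settings.
function backoffDelayMs(attempt, baseMs = 3100, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Full list of delays for a given number of retries.
function backoffSchedule(retries, baseMs = 3100, capMs = 60000) {
  return Array.from({ length: retries }, (_, attempt) =>
    backoffDelayMs(attempt, baseMs, capMs));
}

console.log(backoffSchedule(4).join(', '));
// 3100, 6200, 12400, 24800
```

Doubling the wait after each failure gives an overloaded server progressively more room to recover, and the cap keeps a long outage from producing unbounded delays.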

Pricing

First 25 results are free on every Apify account — no charge until you exceed the free tier.

After the free tier: $3 per 1,000 papers (Pay-Per-Event billing). A 1,000-paper run costs $3.00. A 2,000-paper run costs $6.00. You are charged only for papers actually delivered.
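
The arithmetic above can be written as a small estimator. This is only a sketch: the per-1,000 rate is passed in rather than hard-coded, and `freeResults` defaults to 0 because the worked figures above do not subtract the free results. The function name is hypothetical.

```javascript
// Estimated charge in USD for a run, given the rate per 1,000 delivered papers.
// `freeResults` is optional; how the free tier is netted off is an assumption.
function estimateCostUsd(papersDelivered, pricePerThousandUsd, freeResults = 0) {
  const billable = Math.max(0, papersDelivered - freeResults);
  return (billable * pricePerThousandUsd) / 1000;
}

console.log(estimateCostUsd(1000, 3).toFixed(2)); // 3.00
console.log(estimateCostUsd(2000, 3).toFixed(2)); // 6.00
```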

Frequently Asked Questions

Q: Does arXiv require an API key? No. The arXiv Atom feed API is completely open and requires no authentication. This actor works immediately with no setup beyond providing your search query.

Q: How recent are the papers? Papers become available via the API once arXiv announces them, typically the same day or the next business day after authors submit. The actor sorts by submittedDate descending by default, so the newest papers appear first.

Q: Can I get papers from a specific author? Yes. Use the au: prefix in your query, e.g. au:LeCun or au:Hinton Geoffrey. Note that arXiv author names are not normalized — the same author may appear under different name formats across papers. For robust author tracking, consider combining au: with ti: or abs: filters.

Q: What does primary_category mean? Each paper is assigned one primary subject category by the authors at submission time, plus optional cross-listed categories. The primary_category field is the authors' declared main subject area. The categories array contains all subject tags including cross-listings.

Q: Can I retrieve the full PDF content? The actor returns the direct pdf_url for every paper. Downloading PDFs requires a separate HTTP request. For PDF text extraction at scale, pair this actor with a PDF parsing step in your Apify pipeline.

Q: Why is maxResults capped at 2,000? The arXiv API's practical reliability degrades for very large offsets. For retrievals beyond 2,000 papers, partition your search by date range (dateFrom) or category and run the actor multiple times, then deduplicate by arxiv_id.
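
A deduplication step for merged results from several runs might look like the sketch below. Treating different versions of the same paper (for example v1 and v2) as duplicates and keeping the highest version is an assumption that goes beyond "deduplicate by arxiv_id"; drop the version handling if exact-ID matching is enough. The function name is hypothetical.

```javascript
// Merges items from several runs, keeping one entry per paper.
// Assumption: versions of the same base ID are duplicates; the highest version wins.
function dedupeByArxivId(items) {
  const latest = new Map();
  for (const item of items) {
    const [, baseId, v] = /^(.+?)(?:v(\d+))?$/.exec(item.arxiv_id);
    const version = v ? Number(v) : 0;
    const seen = latest.get(baseId);
    if (!seen || version > seen.version) {
      latest.set(baseId, { version, item });
    }
  }
  // Map iteration follows first-insertion order, so the original ordering is kept.
  return [...latest.values()].map((entry) => entry.item);
}

const merged = dedupeByArxivId([
  { arxiv_id: '2310.06825v1' },
  { arxiv_id: '2401.00001v1' },
  { arxiv_id: '2310.06825v2' },
]);
console.log(merged.map((paper) => paper.arxiv_id).join(', '));
// 2310.06825v2, 2401.00001v1
```

Filter out any non-paper items, such as the summary record, before passing the merged list in, since the function expects every item to have an arxiv_id.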

Q: Are the papers peer-reviewed? arXiv is a preprint server. Papers are not peer-reviewed before posting — they are screened for basic relevance and academic quality by arXiv moderators but are not refereed by journals. Many papers are later published in peer-reviewed venues; the doi field, when present, links to the published version.

Use in Claude, ChatGPT & any MCP agent

This actor is also a Model Context Protocol (MCP) server tool — call it directly from Claude, ChatGPT, Cursor, Windsurf, or any MCP-compatible AI agent. The agent only pays for results delivered (same pay-per-result model).

  • Per-actor MCP endpoint: https://mcp.apify.com/?tools=themineworks/arxiv-preprint-search
  • Full Mine Works MCP server (all tools): https://the-mine-works-mcp.hatchable.site/api/mcp

// Call this actor via apify-client (Node.js, ES module with top-level await)
import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });
// Input fields are those listed in the Inputs table above
const run = await client.actor('themineworks/arxiv-preprint-search').call({
  query: 'ti:transformer AND cat:cs.CL',
  sortBy: 'submittedDate',
  maxResults: 100,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);