arXiv Papers Scraper
Pricing: from $1.00 / 1,000 results
Scrape academic preprints from arXiv.org by keyword, author, or category. Returns clean records with title, authors, abstract, categories, PDF URL, and DOI. HTTP-only via the public arXiv API. No login, no proxy.
Rating: 5.0 (20)
Developer: Crawler Bros
Actor stats: 20 bookmarked, 2 total users, 1 monthly active user
Last modified: 4 days ago
Search arXiv.org — the world's largest open-access archive of scientific preprints (2.5M+ papers across CS, math, physics, biology, finance, economics) — and return clean structured records for every match. HTTP-only via the public arXiv API. No login, no cookies, no proxy.
What this actor does
- Queries the arXiv API (`https://export.arxiv.org/api/query`) by keyword, author, and/or category
- Parses the Atom XML response into one structured JSON record per paper
- Filters by date range, DOI presence, abstract length, abstract keyword
- Sorts by relevance, submitted-date, or last-updated-date
- Walks paginated results until `maxItems` is reached
- Respects arXiv's 1-request-per-3-seconds rate limit
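The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the actor's actual source: the function names, the page size of 100, and the way the query string is assembled from `all:` and `cat:` prefixes are assumptions. Only the endpoint and the 3-second delay come from the description.

```python
import re
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_URL = "https://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV = "{http://arxiv.org/schemas/atom}"


def build_query_url(keywords="", categories=(), start=0, page_size=100,
                    sort_by="submittedDate", sort_order="descending"):
    """Build the URL for one page of results (hypothetical helper)."""
    parts = []
    if keywords:
        parts.append(f'all:"{keywords}"')
    if categories:
        parts.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    params = {
        "search_query": " AND ".join(parts),
        "start": start,
        "max_results": page_size,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_text):
    """Convert an Atom feed into a list of dicts, one per <entry>."""
    papers = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        abs_url = entry.findtext(ATOM + "id", "")
        # The entry id looks like http://arxiv.org/abs/2401.12345v2; drop the version.
        arxiv_id = re.sub(r"v\d+$", "", abs_url.rsplit("/abs/", 1)[-1])
        paper = {
            "arxivId": arxiv_id,
            # Titles and abstracts are line-wrapped in the feed; collapse whitespace.
            "title": " ".join(entry.findtext(ATOM + "title", "").split()),
            "abstract": " ".join(entry.findtext(ATOM + "summary", "").split()),
            "authors": [a.findtext(ATOM + "name", "")
                        for a in entry.iter(ATOM + "author")],
            "categories": [c.get("term")
                           for c in entry.findall(ATOM + "category")],
            "submittedAt": entry.findtext(ATOM + "published", ""),
            "htmlUrl": abs_url,
        }
        primary = entry.find(ARXIV + "primary_category")
        if primary is not None:
            paper["primaryCategory"] = primary.get("term")
        papers.append(paper)
    return papers


def fetch_papers(max_items, **query):
    """Walk pages until max_items is reached, pausing 3 seconds between requests."""
    papers, start = [], 0
    while len(papers) < max_items:
        with urllib.request.urlopen(build_query_url(start=start, **query)) as resp:
            page = parse_feed(resp.read())
        if not page:
            break
        papers.extend(page)
        start += len(page)
        time.sleep(3)
    return papers[:max_items]
```

Calling `fetch_papers(100, keywords="large language models", categories=["cs.CL"])` would issue live requests, so it is best tried on a small `max_items` first.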
Output per paper
- `arxivId`: e.g. `2401.12345`
- `title`, `abstract`, `abstractWordCount`
- `authors[]`, `authorCount`, `affiliations[]`
- `categories[]`, `primaryCategory`: e.g. `cs.LG`
- `submittedAt`, `updatedAt`: ISO-8601 UTC
- `doi`: when published in a journal
- `journalRef`: full citation
- `comment`: author's note (e.g. "15 pages, 5 figures")
- `pdfUrl`: direct PDF download link
- `htmlUrl`: abstract page on arXiv.org
- `recordType: "paper"`, `scrapedAt`
Empty fields are omitted (no nulls).
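Because empty fields are left out instead of being set to null, downstream code should read optional keys with `.get()`. The helper below is a hypothetical sketch of how such omission could work. Which values count as "empty" (here `None`, empty strings, and empty lists or dicts) is an assumption, not documented behavior.

```python
def drop_empty(record):
    """Return a copy of record without None, "", [] or {} values.

    Zero and False are kept, since they are real values and not missing data.
    """
    return {k: v for k, v in record.items() if v not in (None, "", [], {})}


raw = {"arxivId": "2401.12345", "doi": None, "affiliations": [], "authorCount": 0}
clean = drop_empty(raw)
# Optional keys may be absent, so read them defensively.
doi = clean.get("doi")
```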
Input
| Field | Type | Default | Description |
|---|---|---|---|
| searchQuery | string | "large language models" | Free-text query against title + abstract + authors |
| categories | array | [] | arXiv subject codes (e.g. cs.LG, stat.ML). 50+ choices in the dropdown |
| authorContains | string | – | Filter by author name substring |
| sortBy | enum | submittedDate | relevance / submittedDate / lastUpdatedDate |
| sortOrder | enum | descending | descending (newest first) / ascending |
| dateRangeFrom | string | – | Drop papers submitted before this ISO date |
| dateRangeTo | string | – | Drop papers submitted after this ISO date |
| maxItems | int | 50 | Hard cap on emitted papers (1–5000) |
| includeDoiOnly | bool | false | Drop papers without a DOI (typically pre-publication) |
| minAbstractLength | int | – | Drop papers with abstracts shorter than N characters |
| abstractContains | string | – | Only emit papers whose abstract contains this substring |
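If you build inputs programmatically, it can help to check them against the table before starting a run. The sketch below is a hypothetical client-side check, not part of the actor. It applies the documented defaults for `sortBy`, `sortOrder`, `maxItems`, `includeDoiOnly`, and `categories`, leaves `searchQuery` untouched, and raises on out-of-range values. Whether the actor itself rejects or clamps such values is not stated, so treat the error handling as an assumption.

```python
from datetime import date

DEFAULTS = {
    "categories": [],
    "sortBy": "submittedDate",
    "sortOrder": "descending",
    "maxItems": 50,
    "includeDoiOnly": False,
}
SORT_BY = {"relevance", "submittedDate", "lastUpdatedDate"}
SORT_ORDER = {"descending", "ascending"}


def normalize_input(raw):
    """Merge defaults into raw and validate enum, range, and date fields."""
    cfg = {**DEFAULTS, **raw}
    if cfg["sortBy"] not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if cfg["sortOrder"] not in SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")
    if not 1 <= cfg["maxItems"] <= 5000:
        raise ValueError("maxItems must be between 1 and 5000")
    for key in ("dateRangeFrom", "dateRangeTo"):
        if key in cfg:
            date.fromisoformat(cfg[key])  # raises ValueError on a malformed date
    return cfg
```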
Example: latest LLM papers
{"searchQuery": "large language models", "categories": ["cs.CL", "cs.LG"], "sortBy": "submittedDate", "maxItems": 100}
Example: papers by a specific author
{"authorContains": "Yann LeCun", "sortBy": "submittedDate", "maxItems": 50}
Example: published papers (DOI required)
{"searchQuery": "transformer", "categories": ["cs.LG"], "includeDoiOnly": true, "minAbstractLength": 200, "dateRangeFrom": "2024-01-01"}
Example: niche query
{"searchQuery": "diffusion model", "categories": ["cs.CV"], "abstractContains": "image generation", "sortBy": "relevance", "maxItems": 25}
Use cases
- AI/ML research tracking: daily run on `cs.LG` + `cs.AI` to surface new methods
- Literature review automation: feed every paper matching your query into your RAG index
- Author following: watch a specific researcher's new submissions
- Trend analysis: count papers per topic over time to chart research interest
- Citation database: pair with Crossref/DOI lookup for full bibliographic records
- Academic content marketing: find papers citing techniques your tool implements
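As an illustration of the trend-analysis use case, the sketch below counts papers per month and primary category from a list of output records. It assumes `submittedAt` is an ISO-8601 string such as `2024-01-22T18:00:00Z`, so the first seven characters give the year and month. The function name is hypothetical.

```python
from collections import Counter


def papers_per_month(records):
    """Count records by (YYYY-MM, primaryCategory)."""
    counts = Counter()
    for rec in records:
        month = rec["submittedAt"][:7]
        # primaryCategory may be omitted, since empty fields are dropped.
        counts[(month, rec.get("primaryCategory", "unknown"))] += 1
    return counts


records = [
    {"submittedAt": "2024-01-22T18:00:00Z", "primaryCategory": "cs.LG"},
    {"submittedAt": "2024-01-30T09:15:00Z", "primaryCategory": "cs.LG"},
    {"submittedAt": "2024-02-02T12:00:00Z", "primaryCategory": "cs.CL"},
]
trend = papers_per_month(records)
```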
FAQ
Does it require a login or cookies? No. arXiv's API is fully public.
Is a proxy needed? No. arXiv accepts requests from any IP. The actor honors arXiv's 3-seconds-between-requests rate limit by default.
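How fresh is the data? Each run queries the live arXiv API, so results reflect whatever arXiv has indexed at that moment. arXiv releases new papers on a daily announcement cycle, so a submission usually becomes searchable within a day or so, not instantly.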
Can I get the full PDF? The actor returns pdfUrl — a direct link to the PDF. Download it with any HTTP client.
Why is doi missing on some papers? Many arXiv preprints have no journal DOI recorded, especially right after upload. Set includeDoiOnly=true to keep only papers that list a DOI, which usually indicates a journal-published version.
What's the difference between searchQuery and abstractContains? searchQuery is sent to arXiv's server-side search (ranks by relevance). abstractContains is a client-side substring filter applied AFTER fetching. Use searchQuery for relevance, abstractContains for narrow keyword filtering on top of that.
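To make the distinction concrete, here is a hypothetical sketch of the client-side stage, applied to records that the server-side search has already returned. Case-insensitive matching and inclusive date bounds are assumptions, since the description does not specify either.

```python
def passes_filters(paper, *, abstract_contains=None, min_abstract_length=None,
                   include_doi_only=False, date_from=None, date_to=None):
    """Return True if a fetched record survives the post-fetch filters."""
    abstract = paper.get("abstract", "")
    if abstract_contains and abstract_contains.lower() not in abstract.lower():
        return False
    if min_abstract_length and len(abstract) < min_abstract_length:
        return False
    if include_doi_only and not paper.get("doi"):
        return False
    # ISO dates compare correctly as strings: "2024-01-22" < "2024-02-01".
    day = paper.get("submittedAt", "")[:10]
    if date_from and day < date_from:
        return False
    if date_to and day > date_to:
        return False
    return True


fetched = [
    {"abstract": "We study image generation with diffusion.", "submittedAt": "2024-03-01T00:00:00Z"},
    {"abstract": "A survey of graph methods.", "submittedAt": "2024-03-02T00:00:00Z"},
]
kept = [p for p in fetched if passes_filters(p, abstract_contains="Image Generation")]
```

Because this stage runs after fetching, a narrow `abstractContains` can leave far fewer papers than the server-side search matched.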
Why limit to 5000 items? arXiv's API allows up to 30k results per query, but pagination beyond a few thousand becomes very slow due to the 3-second rate limit. For larger crawls, split the job into several runs with different dateRangeFrom/dateRangeTo windows.
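One way to plan such a windowed crawl is to generate non-overlapping date ranges and build one input per window. The helper below is a sketch with a hypothetical name. It assumes both bounds are inclusive, which the description does not confirm, so check for duplicates at window edges.

```python
from datetime import date, timedelta


def date_windows(start, end, days):
    """Yield (from, to) ISO date pairs covering start..end in windows of `days` days."""
    cur = date.fromisoformat(start)
    last = date.fromisoformat(end)
    one_day = timedelta(days=1)
    while cur <= last:
        window_end = min(cur + timedelta(days=days) - one_day, last)
        yield cur.isoformat(), window_end.isoformat()
        cur = window_end + one_day


# One actor input per window, each capped at the 5000-item limit.
inputs = [
    {"searchQuery": "transformer", "dateRangeFrom": lo, "dateRangeTo": hi, "maxItems": 5000}
    for lo, hi in date_windows("2024-01-01", "2024-01-10", 4)
]
```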
Can I scrape the PDF text content? Not directly. This actor returns metadata only. Pair it with a downstream PDF-extraction actor if you need the full text.
How are categories specified? Use arXiv's official codes (e.g. cs.LG for ML, stat.ML for stats ML, cs.CL for NLP, q-bio.QM for quantitative biology). The dropdown lists 50+ common codes; the full taxonomy is at arxiv.org/category_taxonomy.