arXiv Papers Scraper
Pricing
Pay per event
Search arXiv by query, category, or author and get structured paper metadata — title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps. We handle pagination, retries, and rate-limit pacing so you get clean typed rows ready for a research pipeline.
Developer
DevilScrapes
Maintained by Community
🎯 What this scrapes
arXiv's Atom feed at export.arxiv.org/api/query is the canonical source for paper metadata — and a notoriously picky one. This Actor wraps it with a sensible input schema, paces requests so we stay polite to the upstream, paginates through results, and writes one structured row per paper. We absorb the transient errors and rate-limit pushback; you get a dataset that drops into research dashboards, citation tracking, or ML training pipelines.
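As a rough illustration of what wrapping the Atom feed involves, the sketch below builds a query URL and parses a feed with the Python standard library. It is not the Actor's code: the helper names `build_query_url` and `parse_feed` are hypothetical, and only a few of the output fields are extracted. The query parameters (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) are those of the public arXiv API.

```python
import urllib.parse
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace prefix


def build_query_url(search_query, start=0, max_results=50,
                    sort_by="submittedDate", sort_order="descending"):
    """Return the arXiv API URL for one page of results."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(atom_xml):
    """Turn an Atom feed string into a list of simplified paper dicts."""
    root = ET.fromstring(atom_xml)
    rows = []
    for entry in root.findall(ATOM + "entry"):
        url = entry.findtext(ATOM + "id", default="").strip()
        title = entry.findtext(ATOM + "title", default="")
        rows.append({
            # The entry <id> is the abstract URL; the arXiv ID follows "/abs/".
            "arxiv_id": url.rsplit("/abs/", 1)[-1],
            "url": url,
            # Collapse line breaks and runs of spaces inside the title.
            "title": " ".join(title.split()),
            "authors": [
                a.findtext(ATOM + "name", default="").strip()
                for a in entry.findall(ATOM + "author")
            ],
            "published": entry.findtext(ATOM + "published", default="").strip(),
        })
    return rows
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so the sketch stays network-free.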
🔥 What we handle for you
- 🛡️ Browser fingerprint rotation: `curl-cffi` impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
- 🌐 Residential proxy rotation via Apify Proxy (when enabled in `proxyConfiguration`): fresh session and exit IP on every block.
- 🔁 Retries with exponential backoff on `408 / 429 / 5xx`: up to 5 attempts per page, `Retry-After` honoured.
- 🧱 Rate-limit-aware pacing: when the target pushes back, we slow down instead of getting banned.
- 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
- 💰 Pay-Per-Event pricing: beyond the small per-run start charge, you only pay for results that hit your dataset.
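The retry behaviour described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's implementation: the 1-second base delay and 60-second cap are made-up values, `fetch` is a hypothetical callable returning `(status, headers, body)`, and only the numeric form of `Retry-After` is handled.

```python
import time

RETRYABLE_STATUSES = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5  # "up to 5 attempts per page"


def is_retryable(status):
    """True for the status codes the list above names: 408, 429, 5xx."""
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after failed attempt number `attempt` (1-based).

    A numeric Retry-After header wins; otherwise the delay doubles
    on each attempt (base, 2*base, 4*base, ...) up to `cap`.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * 2 ** (attempt - 1))


def fetch_with_retries(fetch, sleep=time.sleep):
    """Call `fetch` until it returns a non-retryable status or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, headers, body = fetch()
        if not is_retryable(status):
            return status, body
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    return status, body  # last retryable response after exhausting attempts
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.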
💡 Use cases
- Author tracking: schedule weekly runs for `au:<your-name>` and diff the results to detect newly listed papers under that name. The arXiv API searches paper metadata, so this finds papers by an author, not citations of them.
- Trend monitoring: daily pull from `cat:cs.AI` to feed a research digest.
- Dataset curation: extract every paper matching a topic + date range to seed a literature review.
- Notification pipeline: pipe into Slack when a new paper matches your saved query.
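The "diff between scheduled runs" idea in the use cases above could look something like this. It is a minimal sketch: `base_id` and `new_papers` are hypothetical helper names, and the rows are assumed to be dataset items carrying the `arxiv_id` field. Version suffixes are stripped so that a revision (`v1` to `v2`) does not count as a new paper.

```python
import re

_VERSION_SUFFIX = re.compile(r"v\d+$")


def base_id(arxiv_id):
    """Strip the trailing version, e.g. '2401.12345v2' -> '2401.12345'."""
    return _VERSION_SUFFIX.sub("", arxiv_id)


def new_papers(previous_rows, current_rows):
    """Return rows from the current run whose paper was absent from the previous run."""
    seen = {base_id(row["arxiv_id"]) for row in previous_rows}
    return [row for row in current_rows if base_id(row["arxiv_id"]) not in seen]
```

The returned list is what a notification step would forward, for example to a Slack webhook.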
⚙️ How to use it
- Click Try for free at the top of the page.
- Fill in the input form — most fields have sensible defaults.
- Click Start. Output streams into the run's dataset.
- Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.
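For the "fetch via the API" route, a sketch using the `apify-client` Python package might look like the following. The assumptions: the package is installed (`pip install apify-client`), you supply your own API token, and `actor_id` is a placeholder for this Actor's ID as shown in the Apify Console (it is not given on this page). `build_run_input` and `run_and_fetch` are hypothetical helper names; the input keys come from the Input table.

```python
def build_run_input(search_query="cat:cs.AI", max_results=50, page_size=50):
    """Assemble the Actor input using the field names from the Input table."""
    return {
        "searchQuery": search_query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "maxResults": max_results,
        "pageSize": page_size,
        "proxyConfiguration": {"useApifyProxy": False},
    }


def run_and_fetch(token, actor_id, run_input):
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported here so the input helper above works without the package.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would then read `run_and_fetch(my_token, my_actor_id, build_run_input("au:Doe", 10, 10))`, with both `my_token` and `my_actor_id` supplied by you.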
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `searchQuery` | string | yes | `cat:cs.AI` | arXiv search query string. Use field prefixes like `ti:` (title), `au:` (author), `cat:` (category). |
| `sortBy` | string | no | `submittedDate` | Field used to order results. |
| `sortOrder` | string | no | `descending` | Ascending or descending. |
| `maxResults` | integer | no | `50` | Total papers to fetch across pages. arXiv recommends ≤ 30000 per query. |
| `pageSize` | integer | no | `50` | Papers per API call. arXiv caps page size at 2000. |
| `proxyConfiguration` | object | no | `{"useApifyProxy": false}` | Apify Proxy is optional; arXiv is fine with direct access. Throttle yourself to stay polite. |
Example input
    {
      "searchQuery": "cat:cs.AI",
      "sortBy": "submittedDate",
      "sortOrder": "descending",
      "maxResults": 3,
      "pageSize": 3,
      "proxyConfiguration": {"useApifyProxy": false}
    }
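How `maxResults` and `pageSize` interact can be pictured as a list of `(start, count)` API calls. The sketch below is an assumption about the general shape of such pagination, not the Actor's actual logic; `plan_pages` is a hypothetical name, and the two limits are the ones quoted in the Input table.

```python
ARXIV_MAX_PAGE_SIZE = 2000    # page-size cap quoted in the Input table
ARXIV_MAX_RESULTS = 30000     # per-query limit quoted in the Input table


def plan_pages(max_results, page_size):
    """Return the (start, count) pairs needed to fetch `max_results` papers."""
    if max_results < 1 or page_size < 1:
        raise ValueError("max_results and page_size must be positive")
    page_size = min(page_size, ARXIV_MAX_PAGE_SIZE)
    max_results = min(max_results, ARXIV_MAX_RESULTS)
    pages = []
    start = 0
    while start < max_results:
        # The last page may be shorter than page_size.
        count = min(page_size, max_results - start)
        pages.append((start, count))
        start += count
    return pages
```

With the example input above (`maxResults` 3, `pageSize` 3) this yields a single call; `plan_pages(120, 50)` yields three calls, the last one for 20 papers.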
📤 Output
Every row is one dataset item.
| Field | Type | Notes |
|---|---|---|
| `arxiv_id` | string | arXiv identifier (e.g. `2401.12345v2`). |
| `url` | string | Abstract page URL on arxiv.org. |
| `pdf_url` | string | Direct PDF URL. |
| `title` | string | Paper title (whitespace-normalised). |
| `summary` | string | Abstract. |
| `authors` | array | Author names (preserving order). |
| `primary_category` | string | arXiv primary category slug (e.g. `cs.AI`). |
| `categories` | array | All arXiv categories the paper is tagged with. |
| `doi` | string or null | DOI if set in the metadata. |
| `journal_ref` | string or null | Journal reference if available. |
| `comment` | string or null | Authors' comment (page counts, conference acceptance, etc.). |
| `published` | string | Original submission timestamp (ISO-8601 UTC). |
| `updated` | string | Last revision timestamp (ISO-8601). |
| `scraped_at` | string | When this row was recorded (ISO-8601). |
Example output
    {
      "arxiv_id": "2401.12345v2",
      "url": "https://arxiv.org/abs/2401.12345v2",
      "pdf_url": "https://arxiv.org/pdf/2401.12345v2",
      "title": "Scaling Laws for Sparse Mixture-of-Experts Language Models",
      "authors": ["Alex Doe", "Jamie Smith"],
      "primary_category": "cs.CL",
      "published": "2026-04-12T16:00:00+00:00"
    }
💰 Pricing
Pay-Per-Event — you pay only when these events fire:
| Event | USD | What it is |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.0015 | Per dataset item |
Example: one run returning 1,000 results costs 1,000 × $0.0015 + $0.005 = $1.505. No subscription, no minimum, no card to start; Apify gives every new account $5 of free credit.
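The arithmetic behind the pricing table can be sketched as a small estimator. The two rates are taken from the table above; `estimate_cost` is a hypothetical helper, and it assumes the `actor-start` event is charged once per run, as the table describes.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")   # charged once per run
RESULT_USD = Decimal("0.0015")       # charged per dataset item


def estimate_cost(results, runs=1):
    """Estimated USD cost for `results` dataset items spread over `runs` runs."""
    return ACTOR_START_USD * runs + RESULT_USD * results
```

For instance, `estimate_cost(1000)` gives 1.505 USD, and a daily digest of 50 papers over a week, `estimate_cost(350, runs=7)`, comes to 0.56 USD. `Decimal` is used so sub-cent rates do not pick up floating-point noise.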
🚧 Limitations
We use only the Atom API. Full-text search (over PDF content) is not supported; queries operate on metadata fields. Author disambiguation across name collisions is left to the user, because arXiv does not assign canonical author IDs in the public API.
❓ FAQ
Is this legal?
Yes — arXiv publishes the Atom API expressly for programmatic access. We respect their rate-limit guidance.
Can I download PDFs?
Not in this Actor — we surface the PDF URL, so you can pull the file with a follow-up Actor or curl.
Why don't I see DOIs?
Most preprints don't have one until the paper is journal-published. We surface null when missing.
How fresh is the data?
Within a few hours of arXiv ingest. Newly-submitted papers usually appear within 1-2 hours.
💬 Your feedback
Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.