arXiv Papers Scraper

Pricing

Pay per event

Search arXiv by query, category, or author and get structured paper metadata — title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps. We handle pagination, retries, and rate-limit pacing so you get clean typed rows ready for a research pipeline.

Developer

DevilScrapes

Maintained by Community

🎯 What this scrapes

arXiv's Atom feed at export.arxiv.org/api/query is the canonical source for paper metadata — and a notoriously picky one. This Actor wraps it with a sensible input schema, paces requests so we stay polite to the upstream, paginates through results, and writes one structured row per paper. We absorb the transient errors and rate-limit pushback; you get a dataset that drops into research dashboards, citation tracking, or ML training pipelines.
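
To make the request and response shape concrete, here is a minimal sketch of one page request and its parsing, using only the Python standard library. It is an illustration under our own assumptions, not the Actor's actual code: the helper names build_query_url and parse_feed are hypothetical, and only a handful of output fields are extracted.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"          # Atom namespace
ARXIV = "{http://arxiv.org/schemas/atom}"       # arXiv extension namespace


def build_query_url(search_query, start=0, max_results=50,
                    sort_by="submittedDate", sort_order="descending"):
    """Build the URL for one page of results (hypothetical helper)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_feed(xml_text):
    """Turn an Atom response into a list of dicts, one per paper."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="").strip()
        primary = entry.find(f"{ARXIV}primary_category")
        rows.append({
            # The Atom <id> is the abstract URL; the arXiv ID follows /abs/.
            "arxiv_id": abs_url.rsplit("/abs/", 1)[-1],
            "url": abs_url,
            # Collapse line breaks and runs of spaces inside titles.
            "title": " ".join(entry.findtext(f"{ATOM}title", default="").split()),
            "authors": [a.findtext(f"{ATOM}name", default="").strip()
                        for a in entry.findall(f"{ATOM}author")],
            "primary_category": primary.get("term") if primary is not None else None,
            "published": entry.findtext(f"{ATOM}published"),
        })
    return rows
```

Fetching the URL (for example with urllib.request.urlopen) and passing the response body to parse_feed gives rows shaped like a subset of the output described further down.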

🔥 What we handle for you

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not Python.
  • 🌐 Optional residential proxy rotation via Apify Proxy (off by default): fresh session and exit IP on every block.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx: up to 5 attempts per page, Retry-After honoured.
  • 🧱 Rate-limit-aware pacing: when the target pushes back, we slow down instead of getting banned.
  • 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable IDs, JSON / CSV / Excel export straight from the Apify Console.
  • 💰 Pay-Per-Event pricing: a small start charge per run plus a per-result charge for each item that hits your dataset. No subscription.
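
The retry policy in the list above (exponential backoff on 408 / 429 / 5xx, up to 5 attempts, Retry-After honoured) can be sketched roughly as follows. This is a simplified illustration, not the Actor's implementation: the function names, the 1-second base delay, and the 60-second cap are our assumptions, and fetch stands for any callable that returns (status, headers, body).

```python
import time

RETRYABLE_STATUSES = {408, 429}  # 5xx is handled by a range check below


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # Honour a numeric Retry-After header when the server sends one.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    # Otherwise double the wait each time: 1s, 2s, 4s, 8s, ... up to `cap`.
    return min(cap, base * 2 ** (attempt - 1))


def fetch_with_retries(fetch, url, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until it returns a non-retryable status.

    `fetch` is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = fetch(url)
        retryable = status in RETRYABLE_STATUSES or 500 <= status < 600
        if not retryable:
            return status, body
        if attempt < max_attempts:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError(
        f"gave up on {url} after {max_attempts} attempts (last status {status})"
    )
```

Passing sleep as a parameter keeps the function easy to test without real waiting. A production version would usually add random jitter to the delay as well.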

💡 Use cases

  • Author tracking: schedule weekly runs for au:<your-name> and diff the results to detect new papers or revisions under that name. The API returns paper metadata only, not citation data.
  • Trend monitoring — daily pull from cat:cs.AI to feed a research digest.
  • Dataset curation — extract every paper matching a topic + date range to seed a literature review.
  • Notification pipeline — pipe into Slack when a new paper matches your saved query.
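
The scheduled-run-and-diff pattern behind these use cases can be sketched as below. It assumes you have the rows of two runs loaded as lists of dicts with an arxiv_id field (as in the output described further down); base_id and diff_runs are hypothetical helper names.

```python
import re

_VERSION_SUFFIX = re.compile(r"v\d+$")


def base_id(arxiv_id):
    """Strip the version suffix, e.g. '2401.12345v2' becomes '2401.12345'."""
    return _VERSION_SUFFIX.sub("", arxiv_id)


def diff_runs(previous_rows, current_rows):
    """Compare two runs and return (new_papers, revised_papers).

    A paper is "new" if its base ID was absent from the previous run and
    "revised" if the base ID was present with a different version.
    """
    seen = {base_id(row["arxiv_id"]): row["arxiv_id"] for row in previous_rows}
    new_papers, revised_papers = [], []
    for row in current_rows:
        key = base_id(row["arxiv_id"])
        if key not in seen:
            new_papers.append(row)
        elif seen[key] != row["arxiv_id"]:
            revised_papers.append(row)
    return new_papers, revised_papers
```

The new_papers list is what you would forward to a digest or a Slack notification; revised_papers is useful if you also care about updated versions.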

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Fill in the input form — most fields have sensible defaults.
  3. Click Start. Output streams into the run's dataset.
  4. Export from Storage → Dataset as JSON, CSV, or Excel — or fetch via the API.

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| searchQuery | string | yes | "cat:cs.AI" | arXiv search query string. Use field prefixes like ti: (title), au: (author), cat: (category). |
| sortBy | string | no | "submittedDate" | Field used to order results. |
| sortOrder | string | no | "descending" | Ascending or descending. |
| maxResults | integer | no | 50 | Total papers to fetch across pages. arXiv recommends ≤30000 per query. |
| pageSize | integer | no | 50 | Papers per API call. arXiv caps page size at 2000. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Apify Proxy is optional; arXiv is fine with direct access. Throttle yourself to stay polite. |

Example input

{
  "searchQuery": "cat:cs.AI",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxResults": 3,
  "pageSize": 3,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
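
To show how maxResults and pageSize interact, here is a small sketch of how a total could be split into page requests. It is our illustration of the idea, not the Actor's code: plan_pages is a hypothetical name, and the 2000 and 30000 limits are the ones quoted in the input notes above.

```python
ARXIV_MAX_PAGE_SIZE = 2000     # per-call cap quoted in the input notes
ARXIV_MAX_TOTAL = 30000        # per-query ceiling quoted in the input notes


def plan_pages(max_results, page_size):
    """Return a list of (start, count) pairs covering max_results papers."""
    if max_results < 0 or page_size <= 0:
        raise ValueError("max_results must be >= 0 and page_size must be > 0")
    page_size = min(page_size, ARXIV_MAX_PAGE_SIZE)
    total = min(max_results, ARXIV_MAX_TOTAL)
    pages = []
    start = 0
    while start < total:
        # The last page may be shorter than page_size.
        count = min(page_size, total - start)
        pages.append((start, count))
        start += count
    return pages
```

For example, maxResults of 120 with pageSize of 50 becomes three calls with start offsets 0, 50, and 100. Each pair maps onto the start and max_results parameters of one API call, with a pause between calls to stay polite.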

📤 Output

Every row is one dataset item.

| Field | Type | Notes |
| --- | --- | --- |
| arxiv_id | string | arXiv identifier (e.g. 2401.12345v2). |
| url | string | Abstract page URL on arxiv.org. |
| pdf_url | string | Direct PDF URL. |
| title | string | Paper title (whitespace-normalised). |
| summary | string | Abstract. |
| authors | array | Author names (preserving order). |
| primary_category | string | arXiv primary category slug (e.g. cs.AI). |
| categories | array | All arXiv categories the paper is tagged with. |
| doi | string or null | DOI if set in the metadata. |
| journal_ref | string or null | Journal reference if available. |
| comment | string or null | Authors' comment (page counts, conference acceptance, etc.). |
| published | string | Original submission timestamp (ISO-8601 UTC). |
| updated | string | Last revision timestamp. |
| scraped_at | string | When this row was recorded. |

Example output

{
  "arxiv_id": "2401.12345v2",
  "url": "https://arxiv.org/abs/2401.12345v2",
  "pdf_url": "https://arxiv.org/pdf/2401.12345v2",
  "title": "Scaling Laws for Sparse Mixture-of-Experts Language Models",
  "authors": [
    "Alex Doe",
    "Jamie Smith"
  ],
  "primary_category": "cs.CL",
  "published": "2024-01-12T16:00:00+00:00"
}
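
If you consume rows in Python, a small typed wrapper makes the nullable fields and the timestamp explicit. The sketch below uses a standard-library dataclass as a stand-in; PaperRow and from_item are hypothetical names, only some of the output fields are modelled, and this is not the Pydantic model the Actor uses internally.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PaperRow:
    arxiv_id: str
    url: str
    pdf_url: str
    title: str
    authors: List[str]
    primary_category: str
    published: datetime
    # Nullable in the output schema, so they default to None here.
    doi: Optional[str] = None
    journal_ref: Optional[str] = None
    comment: Optional[str] = None

    @classmethod
    def from_item(cls, item: dict) -> "PaperRow":
        """Build a row from one dataset item (a dict parsed from JSON)."""
        return cls(
            arxiv_id=item["arxiv_id"],
            url=item["url"],
            pdf_url=item["pdf_url"],
            title=item["title"],
            authors=list(item["authors"]),
            primary_category=item["primary_category"],
            # Timestamps with a "+00:00" offset parse directly on Python 3.7+.
            published=datetime.fromisoformat(item["published"]),
            doi=item.get("doi"),
            journal_ref=item.get("journal_ref"),
            comment=item.get("comment"),
        )
```

Calling PaperRow.from_item on each exported item gives you timezone-aware datetimes to sort and filter on, and None wherever a DOI or journal reference is missing.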

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.0015 | Per dataset item |

Example: 1 000 results cost 1 000 × $0.0015 = $1.50, plus the $0.005 actor-start charge, so about $1.51 for the run. No subscription, no minimum, no card to start: Apify gives every new account $5 of free credit.
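
For budgeting scheduled runs, the arithmetic is easy to script. The sketch below hard-codes the two rates from the table above; run_cost is a hypothetical helper, and Decimal is used so the cents add up exactly.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")   # charged once per run
PER_RESULT_USD = Decimal("0.0015")   # charged per dataset item


def run_cost(results, runs=1):
    """Estimated USD cost for `results` items spread over `runs` runs."""
    return ACTOR_START_USD * runs + PER_RESULT_USD * results


# One run with 1 000 results:
single_run = run_cost(1000)
# A daily schedule of 50 results per run over 30 days:
monthly_digest = run_cost(results=50 * 30, runs=30)
print(f"single run: ${single_run}, monthly digest: ${monthly_digest}")
```

The start charge is negligible for large runs but adds up for frequent small ones, so it is worth including when you estimate a schedule.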

🚧 Limitations

We use only the Atom API. Full-text search (over PDF content) is not supported; queries operate on metadata fields. Author disambiguation across name collisions is on the user — arXiv does not assign canonical author IDs in the public API.

❓ FAQ

Is this legal?

Yes — arXiv publishes the Atom API expressly for programmatic access. We respect their rate-limit guidance.

Can I download PDFs?

Not in this Actor — we surface the PDF URL, so you can pull the file with a follow-up Actor or curl.

Why don't I see DOIs?

Most preprints don't have one until the paper is journal-published. We surface null when missing.

How fresh is the data?

Within a few hours of arXiv ingest. Newly-submitted papers usually appear within 1-2 hours.

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.