arXiv Papers Scraper

Pricing

from $1.00 / 1,000 results

Scrape academic preprints from arXiv.org by keyword, author, or category. Returns clean records with title, authors, abstract, categories, PDF URL, and DOI. HTTP-only via the public arXiv API. No login, no proxy.

Rating: 5.0 (20)

Developer: Crawler Bros (Maintained by Community)

Actor stats: 20 bookmarked, 2 total users, 1 monthly active user, last modified 4 days ago

Search arXiv.org — the world's largest open-access archive of scientific preprints (2.5M+ papers across CS, math, physics, biology, finance, economics) — and return clean structured records for every match. HTTP-only via the public arXiv API. No login, no cookies, no proxy.

What this actor does

  • Queries the arXiv API (https://export.arxiv.org/api/query) by keyword, author, and/or category
  • Parses the Atom XML response into one structured JSON record per paper
  • Filters by date range, DOI presence, abstract length, abstract keyword
  • Sorts by relevance, submitted-date, or last-updated-date
  • Walks paginated results until maxItems is reached
  • Respects arXiv's 1-request-per-3-seconds rate limit
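
The first step above, turning the inputs into an arXiv API request, might look roughly like the sketch below. The endpoint and the search_query, sortBy, sortOrder, start, and max_results parameters belong to the public arXiv API; the function name build_query_url, its arguments, and the choice of the all: field prefix for free text are assumptions made for illustration, not the actor's actual code.

```python
from urllib.parse import urlencode

API_URL = "https://export.arxiv.org/api/query"

def build_query_url(search_query="", categories=(), author="",
                    sort_by="submittedDate", sort_order="descending",
                    start=0, page_size=100):
    """Build one arXiv API request URL (hypothetical helper)."""
    clauses = []
    if search_query:
        # Assumption: free text is sent as a quoted phrase over all fields.
        clauses.append(f'all:"{search_query}"')
    if categories:
        # Any of the listed categories may match.
        clauses.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    if author:
        clauses.append(f'au:"{author}"')
    if not clauses:
        raise ValueError("provide a search query, a category, or an author")
    params = {
        "search_query": " AND ".join(clauses),
        "sortBy": sort_by,
        "sortOrder": sort_order,
        "start": start,            # offset of the first result on this page
        "max_results": page_size,  # results per page
    }
    return f"{API_URL}?{urlencode(params)}"

url = build_query_url("large language models", ["cs.CL", "cs.LG"])
```

Requesting the next page means calling the same helper with start advanced by the number of results already received.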

Output per paper

  • arxivId — e.g. 2401.12345
  • title, abstract, abstractWordCount
  • authors[], authorCount, affiliations[]
  • categories[], primaryCategory — e.g. cs.LG
  • submittedAt, updatedAt — ISO-8601 UTC
  • doi — when published in a journal
  • journalRef — full citation
  • comment — author's note (e.g. "15 pages, 5 figures")
  • pdfUrl — direct PDF download link
  • htmlUrl — abstract page on arXiv.org
  • recordType: "paper", scrapedAt

Empty fields are omitted (no nulls).
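
As an illustration of how one Atom entry could map onto the fields listed above, here is a minimal sketch using Python's standard XML parser. The element names and namespaces follow the arXiv API's Atom feed; the helper name parse_entry, the whitespace collapsing, and the stripping of the version suffix from arxivId are assumptions, and scrapedAt is left out for brevity.

```python
import re
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom",
      "arxiv": "http://arxiv.org/schemas/atom"}

def parse_entry(entry):
    """Convert one Atom <entry> element into a flat record (hypothetical helper)."""
    def text(path):
        node = entry.find(path, NS)
        # Collapse line breaks and repeated spaces inside titles and abstracts.
        return " ".join(node.text.split()) if node is not None and node.text else None

    abstract = text("atom:summary")
    authors = [" ".join(n.text.split())
               for n in entry.findall("atom:author/atom:name", NS) if n.text]
    primary = entry.find("arxiv:primary_category", NS)
    pdf = entry.find("atom:link[@title='pdf']", NS)
    html_url = text("atom:id")
    raw_id = html_url.rsplit("/abs/", 1)[-1] if html_url else None
    record = {
        # Assumption: the version suffix ("v2") is dropped from the ID.
        "arxivId": re.sub(r"v\d+$", "", raw_id) if raw_id else None,
        "title": text("atom:title"),
        "abstract": abstract,
        "abstractWordCount": len(abstract.split()) if abstract else None,
        "authors": authors,
        "authorCount": len(authors) or None,
        "affiliations": [a.text.strip() for a in
                         entry.findall("atom:author/arxiv:affiliation", NS) if a.text],
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "primaryCategory": primary.get("term") if primary is not None else None,
        "submittedAt": text("atom:published"),
        "updatedAt": text("atom:updated"),
        "doi": text("arxiv:doi"),
        "journalRef": text("arxiv:journal_ref"),
        "comment": text("arxiv:comment"),
        "pdfUrl": pdf.get("href") if pdf is not None else None,
        "htmlUrl": html_url,
        "recordType": "paper",
    }
    # Omit empty fields rather than emitting nulls or empty lists.
    return {k: v for k, v in record.items() if v not in (None, "", [])}
```

A feed returned by the API can be parsed with ET.fromstring(xml_text) and each element found by findall("atom:entry", NS) passed to this helper.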

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| searchQuery | string | "large language models" | Free-text query against title + abstract + authors |
| categories | array | [] | arXiv subject codes (e.g. cs.LG, stat.ML). 50+ choices in the dropdown |
| authorContains | string | | Filter by author name substring |
| sortBy | enum | submittedDate | relevance / submittedDate / lastUpdatedDate |
| sortOrder | enum | descending | descending (newest first) / ascending |
| dateRangeFrom | string | | Drop papers submitted before this ISO date |
| dateRangeTo | string | | Drop papers submitted after this ISO date |
| maxItems | int | 50 | Hard cap on emitted papers (1–5000) |
| includeDoiOnly | bool | false | Drop papers without a DOI (typically pre-publication) |
| minAbstractLength | int | | Drop papers with abstracts shorter than N characters |
| abstractContains | string | | Only emit papers whose abstract contains this substring |
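
To check an input object before starting a run, a small validator along these lines may help. It is a sketch only: normalize_input is a hypothetical helper, not part of the actor, and it applies just the defaults and ranges stated in the input description. It does not fill in the searchQuery default, because the listing does not say whether that default applies when only authorContains is given.

```python
from datetime import date

SORT_BY = {"relevance", "submittedDate", "lastUpdatedDate"}
SORT_ORDER = {"descending", "ascending"}

def normalize_input(raw):
    """Apply documented defaults and reject out-of-range values (hypothetical)."""
    cfg = dict(raw)
    cfg.setdefault("categories", [])
    cfg.setdefault("sortBy", "submittedDate")
    cfg.setdefault("sortOrder", "descending")
    cfg.setdefault("maxItems", 50)
    cfg.setdefault("includeDoiOnly", False)
    if cfg["sortBy"] not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if cfg["sortOrder"] not in SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")
    if not 1 <= cfg["maxItems"] <= 5000:
        raise ValueError("maxItems must be between 1 and 5000")
    for key in ("dateRangeFrom", "dateRangeTo"):
        if key in cfg:
            # Raises ValueError if the value is not a valid ISO date.
            date.fromisoformat(cfg[key])
    return cfg

cfg = normalize_input({"searchQuery": "transformer", "dateRangeFrom": "2024-01-01"})
```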

Example: latest LLM papers

{
  "searchQuery": "large language models",
  "categories": ["cs.CL", "cs.LG"],
  "sortBy": "submittedDate",
  "maxItems": 100
}

Example: papers by a specific author

{
  "authorContains": "Yann LeCun",
  "sortBy": "submittedDate",
  "maxItems": 50
}

Example: published papers (DOI required)

{
  "searchQuery": "transformer",
  "categories": ["cs.LG"],
  "includeDoiOnly": true,
  "minAbstractLength": 200,
  "dateRangeFrom": "2024-01-01"
}

Example: niche query

{
  "searchQuery": "diffusion model",
  "categories": ["cs.CV"],
  "abstractContains": "image generation",
  "sortBy": "relevance",
  "maxItems": 25
}

Use cases

  • AI/ML research tracking — daily run on cs.LG + cs.AI to surface new methods
  • Literature review automation — feed every paper matching your query into your RAG index
  • Author following — watch a specific researcher's new submissions
  • Trend analysis — count papers per topic over time to chart research interest
  • Citation database — pair with Crossref/DOI lookup for full bibliographic records
  • Academic content marketing — find papers citing techniques your tool implements
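
For the trend-analysis use case, counting papers per month from the exported records can be as short as the sketch below. It assumes each record carries submittedAt as an ISO-8601 string, as described under "Output per paper"; papers_per_month is a hypothetical helper name.

```python
from collections import Counter

def papers_per_month(records):
    """Count records by submission month, keyed as "YYYY-MM"."""
    # An ISO-8601 timestamp starts with "YYYY-MM", so slicing is enough.
    return Counter(r["submittedAt"][:7] for r in records if "submittedAt" in r)

records = [
    {"arxivId": "2401.00001", "submittedAt": "2024-01-03T10:00:00Z"},
    {"arxivId": "2401.00002", "submittedAt": "2024-01-19T08:30:00Z"},
    {"arxivId": "2402.00001", "submittedAt": "2024-02-02T12:00:00Z"},
]
counts = papers_per_month(records)
for month in sorted(counts):
    print(month, counts[month])
```

The arxivId values in the sample are made-up placeholders, not real papers.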

FAQ

Does it require a login or cookies? No. arXiv's API is fully public.

Is a proxy needed? No. arXiv accepts requests from any IP. The actor honors arXiv's 3-seconds-between-requests rate limit by default.
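
A paginated fetch loop that keeps at least 3 seconds between requests could be sketched as follows. This shows the general pattern rather than the actor's implementation: fetch_page is a hypothetical callable that takes (start, max_results) and returns a list of records, and the page size of 100 is an arbitrary choice.

```python
import time

def fetch_all(fetch_page, max_items, page_size=100, min_interval=3.0,
              sleep=time.sleep, clock=time.monotonic):
    """Collect up to max_items records, pausing between page requests."""
    items, start, last_request = [], 0, None
    while len(items) < max_items:
        if last_request is not None:
            # Wait out whatever remains of the minimum interval.
            wait = min_interval - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        page = fetch_page(start, min(page_size, max_items - len(items)))
        if not page:
            break  # the result set is exhausted
        items.extend(page)
        start += len(page)
    return items[:max_items]
```

The sleep and clock parameters are injectable so the loop can be exercised in tests without actually waiting.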

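How fresh is the data? Every run queries the live arXiv API, so results reflect what arXiv exposes at that moment. New papers become available once arXiv has announced them, which typically happens within a day or so of submission.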

Can I get the full PDF? The actor returns pdfUrl — a direct link to the PDF. Download it with any HTTP client.

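Why is doi missing on some papers? The doi field is filled only when a paper's arXiv record lists a DOI, which usually happens once the work has been published in a journal. Set includeDoiOnly=true to keep only papers with a DOI. A DOI is a strong hint of journal publication, not a guarantee of peer review.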

What's the difference between searchQuery and abstractContains? searchQuery is sent to arXiv's server-side search (ranks by relevance). abstractContains is a client-side substring filter applied AFTER fetching. Use searchQuery for relevance, abstractContains for narrow keyword filtering on top of that.
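
A post-fetch filter of this kind can be pictured as a simple predicate over each record. The sketch below is illustrative: passes_filters is a hypothetical helper, the case-insensitive match is an assumption (the listing does not say whether abstractContains is case-sensitive), and treating minAbstractLength and includeDoiOnly as client-side checks is likewise assumed.

```python
def passes_filters(record, abstract_contains=None, min_abstract_length=None,
                   include_doi_only=False):
    """Return True if a fetched record survives the post-fetch filters."""
    abstract = record.get("abstract", "")
    # Assumption: substring matching ignores case.
    if abstract_contains and abstract_contains.lower() not in abstract.lower():
        return False
    # Length is measured in characters, as in the input description.
    if min_abstract_length and len(abstract) < min_abstract_length:
        return False
    if include_doi_only and not record.get("doi"):
        return False
    return True

papers = [
    {"abstract": "We study image generation with diffusion.", "doi": "10.0000/example"},
    {"abstract": "Text only."},
]
kept = [p for p in papers if passes_filters(p, abstract_contains="image generation")]
```

Because filtering happens after fetching, a narrow abstractContains value can leave a run with fewer emitted papers than the number of records pulled from the API.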

Why limit to 5000 items? arXiv's API allows up to 30k results per query but pagination beyond a few thousand becomes very slow due to the 3-second rate limit. For larger crawls, run multiple actor runs with different dateRangeFrom/dateRangeTo windows.
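
One way to prepare such windowed runs is to generate contiguous, non-overlapping date ranges and merge each into the actor input. The sketch below assumes both bounds are inclusive, which matches the "drop papers submitted before/after" wording of the input fields; date_windows and the 30-day window size are illustrative choices.

```python
from datetime import date, timedelta

def date_windows(start, end, days=30):
    """Split [start, end] into consecutive windows of at most `days` days."""
    windows, cursor = [], start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        windows.append({"dateRangeFrom": cursor.isoformat(),
                        "dateRangeTo": window_end.isoformat()})
        # The next window starts the day after this one ends.
        cursor = window_end + timedelta(days=1)
    return windows

base_input = {"searchQuery": "transformer", "categories": ["cs.LG"], "maxItems": 5000}
run_inputs = [{**base_input, **w}
              for w in date_windows(date(2024, 1, 1), date(2024, 3, 1))]
```

If a single window still hits the 5000-item cap, shrinking days for that period avoids silently losing papers.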

Can I scrape the PDF text content? Not directly — this actor returns metadata only. Pair it with a downstream PDF-extraction actor if you need full-text.

How are categories specified? Use arXiv's official codes (e.g. cs.LG for machine learning, stat.ML for statistical machine learning, cs.CL for computation and language (NLP), q-bio.QM for quantitative methods in biology). The dropdown lists 50+ common codes; the full taxonomy is at arxiv.org/category_taxonomy.