Pricing

from $1.00 / 1,000 results

arXiv Papers Scraper

Scrape academic preprints from arXiv.org by keyword, author, or category. Returns clean records with title, authors, abstract, categories, PDF URL, and DOI. Uses plain HTTP requests to the public arXiv API; no login or proxy is needed.

Developer

Crawler Bros

Last modified

2 months ago

What this actor does

Queries the arXiv API (https://export.arxiv.org/api/query) by keyword, author, and/or category
Parses the Atom XML response into one structured JSON record per paper
Filters by date range, DOI presence, abstract length, abstract keyword
Sorts by relevance, submitted-date, or last-updated-date
Walks paginated results until maxItems is reached
Respects arXiv's 1-request-per-3-seconds rate limit
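
The parse and pagination steps above can be sketched roughly as follows. This is an illustration, not the actor's actual source: `parse_feed`, `paginate`, and `arxiv_fetcher` are hypothetical names, the page size of 100 is an assumption, and only a subset of the output fields is extracted. The fetch function is injected so the loop can be tried without network access.

```python
import re
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV = "{http://arxiv.org/schemas/atom}"


def _text(node, tag):
    """Return a child element's text with whitespace collapsed, or '' if absent."""
    return " ".join((node.findtext(tag) or "").split())


def parse_feed(xml_text):
    """Turn one Atom response into a list of paper records."""
    records = []
    for entry in ET.fromstring(xml_text).iter(f"{ATOM}entry"):
        abs_url = _text(entry, f"{ATOM}id")  # e.g. http://arxiv.org/abs/2401.12345v1
        record = {
            # Strip the trailing version suffix (v1, v2, ...) from the ID.
            "arxivId": re.sub(r"v\d+$", "", abs_url.rsplit("/abs/", 1)[-1]),
            "title": _text(entry, f"{ATOM}title"),
            "abstract": _text(entry, f"{ATOM}summary"),
            "authors": [_text(a, f"{ATOM}name") for a in entry.findall(f"{ATOM}author")],
            "categories": [c.get("term") for c in entry.findall(f"{ATOM}category")],
            "submittedAt": _text(entry, f"{ATOM}published"),
            "updatedAt": _text(entry, f"{ATOM}updated"),
            "htmlUrl": abs_url,
        }
        # Optional fields are added only when present, so no nulls are emitted.
        primary = entry.find(f"{ARXIV}primary_category")
        if primary is not None:
            record["primaryCategory"] = primary.get("term")
        doi = _text(entry, f"{ARXIV}doi")
        if doi:
            record["doi"] = doi
        for link in entry.findall(f"{ATOM}link"):
            if link.get("title") == "pdf":
                record["pdfUrl"] = link.get("href")
        records.append(record)
    return records


def paginate(fetch_page, max_items, page_size=100, delay=3.0):
    """Yield records page by page until max_items is reached or results run out."""
    start = emitted = 0
    while emitted < max_items:
        page = parse_feed(fetch_page(start, page_size))
        if not page:
            return
        for record in page:
            yield record
            emitted += 1
            if emitted >= max_items:
                return
        start += page_size
        time.sleep(delay)  # wait between requests to respect the rate limit


def arxiv_fetcher(search_query):
    """Build a fetch_page function that calls the public arXiv API."""
    def fetch_page(start, size):
        params = urlencode({"search_query": search_query, "start": start, "max_results": size})
        with urlopen(f"https://export.arxiv.org/api/query?{params}") as resp:
            return resp.read().decode("utf-8")
    return fetch_page
```

For example, `list(paginate(arxiv_fetcher("cat:cs.LG"), max_items=10))` would request one page and return up to ten records.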

Output per paper

arxivId — e.g. 2401.12345
title, abstract, abstractWordCount
authors[], authorCount, affiliations[]
categories[], primaryCategory — e.g. cs.LG
submittedAt, updatedAt — ISO-8601 UTC
doi — when published in a journal
journalRef — full citation
comment — author's note (e.g. "15 pages, 5 figures")
pdfUrl — direct PDF download link
htmlUrl — abstract page on arXiv.org
recordType: "paper", scrapedAt

Empty fields are omitted (no nulls).
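
As a rough illustration of the "no nulls" rule and the derived count fields, a post-processing step might look like the sketch below. The helper name `finalize` is hypothetical, and counting words by splitting on whitespace is an assumption about how `abstractWordCount` is computed.

```python
def finalize(record):
    """Add derived counts, then drop empty values so no nulls are emitted."""
    out = dict(record)
    if out.get("abstract"):
        # Assumption: words are counted by splitting on whitespace.
        out["abstractWordCount"] = len(out["abstract"].split())
    if out.get("authors"):
        out["authorCount"] = len(out["authors"])
    # Keep 0 and False; drop only None, empty strings, and empty containers.
    return {k: v for k, v in out.items() if v not in (None, "", [], {})}
```

A consumer can therefore test for a field with `"doi" in record` rather than comparing against `None`.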

Input

Field	Type	Default	Description
`searchQuery`	string	`"large language models"`	Free-text query against title + abstract + authors
`categories`	array	`[]`	arXiv subject codes (e.g. `cs.LG`, `stat.ML`). 50+ choices in the dropdown
`authorContains`	string	–	Filter by author name substring
`sortBy`	enum	`submittedDate`	`relevance` / `submittedDate` / `lastUpdatedDate`
`sortOrder`	enum	`descending`	`descending` (newest first) / `ascending`
`dateRangeFrom`	string	–	Drop papers submitted before this ISO date
`dateRangeTo`	string	–	Drop papers submitted after this ISO date
`maxItems`	int	`50`	Hard cap on emitted papers (1–5000)
`includeDoiOnly`	bool	`false`	Drop papers without a DOI (typically pre-publication)
`minAbstractLength`	int	–	Drop papers with abstracts shorter than N characters
`abstractContains`	string	–	Only emit papers whose abstract contains this substring
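
One plausible way these inputs could map onto an arXiv API request is sketched below. The field prefixes `all:`, `cat:`, and `au:` and the `sortBy`, `sortOrder`, `start`, and `max_results` parameters are part of the arXiv API. How the actor actually combines them is an assumption, as are the helper names `build_search_query` and `build_url`. In particular, `au:` matches author names on the server and is only an approximation of a substring filter.

```python
from urllib.parse import urlencode

API_URL = "https://export.arxiv.org/api/query"


def build_search_query(inp):
    """Combine the input fields into one arXiv search_query expression."""
    clauses = []
    if inp.get("searchQuery"):
        # Quote the text so it is searched as a phrase across all fields.
        clauses.append(f'all:"{inp["searchQuery"]}"')
    if inp.get("categories"):
        # A paper may match any of the listed categories.
        cats = " OR ".join(f"cat:{code}" for code in inp["categories"])
        clauses.append(f"({cats})")
    if inp.get("authorContains"):
        clauses.append(f'au:"{inp["authorContains"]}"')
    return " AND ".join(clauses)


def build_url(inp, start=0, page_size=100):
    """Build the request URL for one page of results."""
    params = {
        "search_query": build_search_query(inp),
        "sortBy": inp.get("sortBy", "submittedDate"),
        "sortOrder": inp.get("sortOrder", "descending"),
        "start": start,
        "max_results": min(page_size, inp.get("maxItems", 50)),
    }
    return f"{API_URL}?{urlencode(params)}"
```

With `searchQuery` set to `large language models` and `categories` set to `cs.CL` and `cs.LG`, `build_search_query` returns `all:"large language models" AND (cat:cs.CL OR cat:cs.LG)`.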

Example: latest LLM papers

{
  "searchQuery": "large language models",
  "categories": ["cs.CL", "cs.LG"],
  "sortBy": "submittedDate",
  "maxItems": 100
}

Example: papers by a specific author

{
  "authorContains": "Yann LeCun",
  "sortBy": "submittedDate",
  "maxItems": 50
}

Example: published papers (DOI required)

{
  "searchQuery": "transformer",
  "categories": ["cs.LG"],
  "includeDoiOnly": true,
  "minAbstractLength": 200,
  "dateRangeFrom": "2024-01-01"
}

Example: niche query

{
  "searchQuery": "diffusion model",
  "categories": ["cs.CV"],
  "abstractContains": "image generation",
  "sortBy": "relevance",
  "maxItems": 25
}

Use cases

AI/ML research tracking — daily run on cs.LG + cs.AI to surface new methods
Literature review automation — feed every paper matching your query into your RAG index
Author following — watch a specific researcher's new submissions
Trend analysis — count papers per topic over time to chart research interest
Citation database — pair with Crossref/DOI lookup for full bibliographic records
Academic content marketing — find papers citing techniques your tool implements
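
For the trend-analysis use case, a monthly count over exported records can be sketched in a few lines. The function name `papers_per_month` is hypothetical. It relies only on the `submittedAt` field being an ISO-8601 timestamp, as described under "Output per paper".

```python
from collections import Counter


def papers_per_month(records):
    """Count records per submission month, keyed as 'YYYY-MM'."""
    # The first seven characters of an ISO-8601 timestamp are the year and month.
    return Counter(r["submittedAt"][:7] for r in records if r.get("submittedAt"))
```

Running the same query on a schedule and charting these counts gives a simple view of how interest in a topic changes over time.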

FAQ

Does it require a login or cookies? No. arXiv's API is fully public.

Is a proxy needed? No. arXiv accepts requests from any IP. The actor honors arXiv's 3-seconds-between-requests rate limit by default.

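How fresh is the data? Every run queries the live arXiv API, so results reflect what arXiv exposes at run time. arXiv typically announces new papers within hours to a day of submission, so a very recent submission may not appear immediately.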

Can I get the full PDF? The actor returns pdfUrl — a direct link to the PDF. Download it with any HTTP client.

Why is doi missing on some papers? arXiv preprints don't always have a DOI recorded at the time of upload. Set includeDoiOnly=true to keep only papers that have one, which usually means a journal-published version exists.

What's the difference between searchQuery and abstractContains? searchQuery is sent to arXiv's server-side search (ranks by relevance). abstractContains is a client-side substring filter applied AFTER fetching. Use searchQuery for relevance, abstractContains for narrow keyword filtering on top of that.
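
The post-fetch filtering step can be pictured as a simple predicate applied to each parsed record. This is a sketch under assumptions: `passes_filters` is a hypothetical name, the case-insensitive match is an assumption, and so is the idea that the date, DOI, and length filters run in the same client-side step.

```python
from datetime import date, datetime


def passes_filters(rec, inp):
    """Return True if a fetched record survives the client-side filters."""
    abstract = rec.get("abstract", "")
    if inp.get("includeDoiOnly") and not rec.get("doi"):
        return False
    if len(abstract) < (inp.get("minAbstractLength") or 0):
        return False
    needle = inp.get("abstractContains")
    # Assumption: substring matching ignores case.
    if needle and needle.lower() not in abstract.lower():
        return False
    lo, hi = inp.get("dateRangeFrom"), inp.get("dateRangeTo")
    if lo or hi:
        # submittedAt is ISO-8601 UTC, e.g. 2024-03-01T10:00:00Z
        submitted = datetime.fromisoformat(rec["submittedAt"].replace("Z", "+00:00")).date()
        if lo and submitted < date.fromisoformat(lo):
            return False
        if hi and submitted > date.fromisoformat(hi):
            return False
    return True
```

Because this runs after fetching, a narrow `abstractContains` value can leave far fewer records than the server returned, while `searchQuery` changes which records are fetched in the first place.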

Why limit to 5000 items? arXiv's API allows up to 30k results per query, but paginating beyond a few thousand becomes very slow because of the 3-second rate limit. For larger crawls, split the job into several actor runs, each with its own dateRangeFrom/dateRangeTo window.
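
Splitting a large crawl into date windows could be scripted along these lines. The helpers `date_windows` and `run_inputs` are hypothetical, and the 30-day default is an arbitrary choice. The windows are inclusive on both ends and do not overlap, which matches the "drop papers submitted before/after" wording of the two date inputs.

```python
from datetime import date, timedelta


def date_windows(start, end, days):
    """Yield inclusive (from, to) ISO date pairs covering start..end without overlap."""
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor.isoformat(), window_end.isoformat()
        cursor = window_end + timedelta(days=1)


def run_inputs(base, start, end, days=30):
    """Create one actor input per window by adding the date range to a base input."""
    return [
        {**base, "dateRangeFrom": lo, "dateRangeTo": hi}
        for lo, hi in date_windows(start, end, days)
    ]
```

Each resulting input can then be submitted as its own run. If a window still returns more than 5000 papers, choose a shorter window.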

Can I scrape the PDF text content? Not directly — this actor returns metadata only. Pair it with a downstream PDF-extraction actor if you need full-text.

How are categories specified? Use arXiv's official codes (e.g. cs.LG for machine learning, stat.ML for statistical machine learning, cs.CL for NLP, q-bio.QM for quantitative biology). The dropdown lists 50+ common codes; the full taxonomy is at arxiv.org/category_taxonomy.
