arXiv Scraper

Pricing

from $2.50 / 1,000 results

[💰 $2.5 / 1K] Search arXiv and extract paper metadata: titles, authors, abstracts, subject categories, DOIs, journal references, submission dates, and PDF links. Search by keyword, title, author, or category, or fetch specific papers by arXiv ID.

Developer

SolidCode

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

Search arXiv at scale and pull clean, structured paper metadata: titles, full author lists with affiliations, abstracts, subject categories, DOIs, journal references, submission and revision dates, and direct PDF and abstract-page links. Search by keyword; by title, author, or abstract individually; by subject category; or fetch exact papers by arXiv ID. Built for researchers, data scientists, and librarians who need a ready-to-use arXiv dataset without manual copy-paste or wrestling with raw repository feeds one page at a time.

Why This Scraper?

  • ~40 subject categories across 8 disciplines: pick from a labeled list spanning computer science, statistics, mathematics, physics, quantitative biology, quantitative finance, economics, and electrical engineering. Select cs.LG, stat.ML, math.PR, quant-ph and more with a checkbox, with no codes to memorize.
  • Field-specific search, not just keywords: match words in the title, the author name, or the abstract as separate inputs, then combine them. Find "transformer" in the title by Vaswani in the cs.CL category in one run.
  • Direct arXiv-ID lookup, including legacy IDs: paste a list of IDs to fetch exact papers. Handles both modern (2310.06825) and legacy slash-style (cond-mat/0011267) identifiers, so decades-old preprints come back just as cleanly as last week's.
  • Full author affiliations: every author arrives as a structured record with name and institutional affiliation when the paper lists one, ready for co-author and institution analysis.
  • DOI and journal reference for published-version cross-linking: when authors register a DOI or cite the published venue, both fields land in the row, letting you join preprints to their peer-reviewed counterparts.
  • Direct PDF and abstract-page links on every paper: a pdfUrl for the full text and an absUrl for the human-readable landing page, so downstream tools can fetch or link without rebuilding URLs.
  • Sort by relevance, submission date, or last-updated date: newest-first or oldest-first, so you can surface the freshest preprints or build a chronological corpus.
  • Up to 50,000 papers per run: set the result cap to zero to sweep an entire topic, with a built-in safety ceiling so a broad query never runs away.
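
Before pasting a long list of IDs, it can help to check which style each one uses. The sketch below is not part of the actor. It is a client-side pre-check based on arXiv's public identifier formats, and the function names `classify_arxiv_id` and `split_version` are hypothetical helpers introduced here for illustration.

```python
import re
from typing import Optional, Tuple

# Modern scheme: YYMM.NNNN or YYMM.NNNNN, with an optional vN version suffix.
MODERN_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Legacy scheme: archive[.SUBJECT]/YYMMNNN, with an optional vN version suffix.
LEGACY_ID = re.compile(r"^[a-z]+(-[a-z]+)*(\.[A-Za-z]{2})?/\d{7}(v\d+)?$")


def classify_arxiv_id(raw: str) -> str:
    """Return 'modern', 'legacy', or 'invalid' for a candidate arXiv ID."""
    candidate = raw.strip()
    if MODERN_ID.match(candidate):
        return "modern"
    if LEGACY_ID.match(candidate):
        return "legacy"
    return "invalid"


def split_version(raw: str) -> Tuple[str, Optional[int]]:
    """Split '1706.03762v7' into ('1706.03762', 7); version is None if absent."""
    match = re.match(r"^(.*?)(?:v(\d+))?$", raw.strip())
    base, version = match.group(1), match.group(2)
    return base, int(version) if version else None


if __name__ == "__main__":
    for candidate in ["2310.06825", "1706.03762v7", "cond-mat/0011267", "oops"]:
        print(candidate, classify_arxiv_id(candidate), split_version(candidate))
```

Filtering out anything classified as invalid before a run avoids spending a run start on a typo.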

Use Cases

Academic Literature & Systematic Review

  • Assemble a complete reading list for a topic, sorted by relevance or recency
  • Narrow a survey to a single subject category to cut cross-field noise
  • Pull every preprint by a specific author for a focused author study
  • Track the latest submissions in a field by sorting on submission date

Research Trend & Citation Analysis

  • Measure publication volume in an emerging sub-field over time
  • Map which institutions are most active via author affiliations
  • Detect bursts of activity by sweeping recent submissions in a category
  • Build a chronological corpus to chart how terminology shifts year over year

Competitive R&D Intelligence

  • Monitor what a competing lab or research group is publishing on a topic
  • Benchmark output across institutions using affiliation data
  • Spot new directions before they reach peer-reviewed journals
  • Watch a category daily for the newest preprints in your space

ML & AI Dataset Building

  • Harvest abstracts at scale to train or fine-tune domain models
  • Build a labeled corpus by subject category for classification tasks
  • Collect title-abstract pairs for summarization and retrieval datasets
  • Gather a topic-specific text set for embeddings and semantic search

Bibliographic Database Enrichment

  • Cross-reference preprints to published versions via DOI and journal reference
  • Fill in missing abstracts, categories, and dates in an existing catalog
  • Resolve legacy slash-style IDs to current metadata
  • Enrich a reference manager export with affiliations and revision dates

Grant & Patent Prior-Art Search

  • Surface the earliest preprints describing a technique for prior-art review
  • Document the state of the art in a field for a grant proposal
  • Trace an idea back to its first submission date on arXiv
  • Compile a dated evidence trail across multiple subject categories

Getting Started

The simplest possible run (one topic, 50 papers):

{
  "searchQuery": "large language models",
  "maxResults": 50
}

Field-Specific Search by Category

Find recent papers in the computer-vision (cs.CV) and machine-learning (cs.LG) categories whose title mentions diffusion, newest first:

{
  "title": "diffusion",
  "categories": ["cs.CV", "cs.LG"],
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxResults": 200
}

Fetch Specific Papers by ID

Pull exact papers, modern and legacy IDs together, ignoring all search fields:

{
  "arxivIds": ["2310.06825", "1706.03762", "cond-mat/0011267"]
}

Author and Abstract Search Combined

Every preprint by a given author that mentions reinforcement learning in the abstract:

{
  "author": "Yann LeCun",
  "abstract": "reinforcement learning",
  "categories": ["cs.AI", "cs.LG", "stat.ML"],
  "sortBy": "lastUpdatedDate",
  "maxResults": 500
}

Input Reference

Combine any of these fields, or paste arXiv IDs to fetch exact papers.

| Parameter | Type | Default | Description |
|---|---|---|---|
| searchQuery | string | "large language models" | Free-text search across the whole paper (title, abstract, authors). Advanced users can use field prefixes like ti:, au:, abs:, cat: and boolean operators. |
| title | string | null | Only include papers whose title contains these words. |
| author | string | null | Only include papers by this author (e.g. "Yann LeCun" or "Hinton"). |
| abstract | string | null | Only include papers whose abstract contains these words. |
| categories | array | [] | Restrict results to selected arXiv subject areas. Choose from ~40 labeled categories across 8 disciplines; leave empty to search all subjects. |
| arxivIds | array | [] | Fetch specific papers by arXiv ID (e.g. 2310.06825 or legacy cond-mat/0011267). When set, the search fields above are ignored. |

Results

| Parameter | Type | Default | Description |
|---|---|---|---|
| maxResults | integer | 50 | Maximum papers to return. Set to 0 to fetch all matches, with a safety cap of 50,000 so very broad searches don't run indefinitely. Ignored when fetching by ID. |
| sortBy | select | Relevance | Order results by Relevance, Submission date, or Last updated date. |
| sortOrder | select | Newest first (descending) | Newest first (descending) or Oldest first (ascending). Most useful when sorting by date. |
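
The maxResults rule can be restated as a small function. This is only a reading of the description above, not the actor's code: `effective_limit` is a hypothetical name, and clamping explicit values above 50,000 is an assumption (the text only states the cap for the 0 case and "up to 50,000 papers per run").

```python
from typing import Optional

SAFETY_CAP = 50_000  # ceiling stated in the maxResults description


def effective_limit(max_results: int, fetching_by_id: bool = False) -> Optional[int]:
    """Return the most papers a run could return, or None when the cap is ignored."""
    if fetching_by_id:
        # maxResults is documented as ignored when arxivIds is set.
        return None
    if max_results < 0:
        raise ValueError("maxResults must be zero or a positive integer")
    if max_results == 0:
        # Zero means "fetch all matches", bounded by the safety cap.
        return SAFETY_CAP
    # Assumption: explicit values above the cap are clamped to it.
    return min(max_results, SAFETY_CAP)


if __name__ == "__main__":
    for value in (50, 0, 80_000):
        print(value, "->", effective_limit(value))
```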

Output

Each paper is one row in the dataset, with list-valued fields for authors and categories. Here is a representative result:

{
  "arxivId": "1706.03762",
  "version": 7,
  "title": "Attention Is All You Need",
  "authors": [
    { "name": "Ashish Vaswani", "affiliation": "Google Brain" },
    { "name": "Noam Shazeer", "affiliation": "Google Brain" },
    { "name": "Niki Parmar", "affiliation": "Google Research" }
  ],
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...",
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "publishedDate": "2017-06-12T17:57:34Z",
  "updatedDate": "2023-08-02T00:41:18Z",
  "doi": "10.48550/arXiv.1706.03762",
  "journalRef": "Advances in Neural Information Processing Systems 30 (2017)",
  "comments": "15 pages, 5 figures",
  "pdfUrl": "https://arxiv.org/pdf/1706.03762v7",
  "absUrl": "https://arxiv.org/abs/1706.03762v7"
}

Core Fields

| Field | Type | Description |
|---|---|---|
| title | string | Paper title, whitespace-normalized |
| authors | object[] | One record per author: { name, affiliation } (affiliation included when the paper lists it) |
| abstract | string | Full abstract text |
| primaryCategory | string | Primary arXiv subject code (e.g. cs.CL) |
| categories | string[] | All subject codes on the paper |
| comments | string or null | Author comments (e.g. "15 pages, 5 figures") |

Identifiers & Cross-References

| Field | Type | Description |
|---|---|---|
| arxivId | string | arXiv identifier without version (e.g. 1706.03762) |
| version | integer | Version number (v7 → 7) |
| doi | string or null | DOI when the authors registered one |
| journalRef | string or null | Journal reference / citation when the paper is published |

| Field | Type | Description |
|---|---|---|
| publishedDate | string | First-submitted timestamp (ISO 8601) |
| updatedDate | string | Last-updated timestamp (ISO 8601) |
| pdfUrl | string | Direct link to the full-text PDF |
| absUrl | string | Link to the arXiv abstract landing page |
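
As a rough illustration of working with these fields after export, the sketch below parses the two timestamps and tallies institutions from the authors records. It assumes rows shaped like the representative result above and timestamps in the `YYYY-MM-DDTHH:MM:SSZ` form shown there. The helper names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable


def parse_timestamp(value: str) -> datetime:
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp into an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def days_between_versions(row: dict) -> int:
    """Whole days between first submission and the latest update."""
    delta = parse_timestamp(row["updatedDate"]) - parse_timestamp(row["publishedDate"])
    return delta.days


def affiliation_counts(rows: Iterable[dict]) -> Counter:
    """Count papers per institution, counting each institution once per paper."""
    counts: Counter = Counter()
    for row in rows:
        institutions = {
            author["affiliation"]
            for author in row.get("authors", [])
            if author.get("affiliation")  # affiliation may be missing or null
        }
        counts.update(institutions)
    return counts


if __name__ == "__main__":
    sample = {
        "publishedDate": "2017-06-12T17:57:34Z",
        "updatedDate": "2023-08-02T00:41:18Z",
        "authors": [
            {"name": "Ashish Vaswani", "affiliation": "Google Brain"},
            {"name": "Noam Shazeer", "affiliation": "Google Brain"},
            {"name": "Niki Parmar", "affiliation": "Google Research"},
        ],
    }
    print(days_between_versions(sample))
    print(affiliation_counts([sample]).most_common())
```

Counting each institution once per paper keeps large author lists from a single lab from inflating that lab's total.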

Tips for Best Results

  • Use field prefixes for precision. In searchQuery you can write ti:transformer to match only titles or cat:cs.CL to scope a subject. Power users can build advanced boolean queries like ti:transformer AND abs:translation in a single field.
  • Narrow by category to cut noise. A broad term like "networks" spans biology, physics, and computer science. Selecting one or two subject categories sharpens results dramatically and lowers your result count.
  • Sort by submission date for the freshest preprints. Set sortBy to Submission date with Newest first to surface the very latest work in a field, which is ideal for daily monitoring and trend tracking.
  • Fetch by ID when you know exactly what you want. Pasting arXiv IDs is the fastest, most precise path: it skips search entirely and returns those exact papers, legacy slash-style IDs included.
  • Start small, then scale. Run with maxResults of 25–50 to confirm the data matches your needs, then raise the cap or set it to 0 to sweep a whole topic.
  • Keep DOI and journal reference for cross-linking. When present, these fields let you match a preprint to its peer-reviewed version, which is invaluable for bibliographic enrichment and citation work.
  • Combine title, author, and abstract for laser-focused queries. The three field inputs are AND-joined, so a name in author plus a phrase in abstract returns only papers that satisfy both.
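
To make the AND-joining concrete, here is a minimal sketch of composing an equivalent searchQuery string by hand from the ti:, au:, abs:, and cat: prefixes. How the actor combines its inputs internally is not documented here, so several details are assumptions: OR-joining multiple categories, quoting multi-word phrases, and parenthesized grouping all follow arXiv's public query syntax and have not been confirmed for this actor. `build_search_query` is a hypothetical helper.

```python
from typing import Optional, Sequence


def _term(prefix: str, text: str) -> str:
    """Format one prefixed term, quoting multi-word phrases."""
    text = text.strip()
    return f'{prefix}:"{text}"' if " " in text else f"{prefix}:{text}"


def build_search_query(
    title: Optional[str] = None,
    author: Optional[str] = None,
    abstract: Optional[str] = None,
    categories: Sequence[str] = (),
) -> str:
    """AND-join the field filters; OR-join categories (an assumption)."""
    clauses = []
    if title:
        clauses.append(_term("ti", title))
    if author:
        clauses.append(_term("au", author))
    if abstract:
        clauses.append(_term("abs", abstract))
    if categories:
        joined = " OR ".join(f"cat:{code}" for code in categories)
        # Parenthesize only when there is more than one category.
        clauses.append(f"({joined})" if len(categories) > 1 else joined)
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_search_query(title="transformer", author="Vaswani", categories=["cs.CL"]))
```

A string built this way would go in the searchQuery input in place of the separate title, author, abstract, and categories fields.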

Pricing

From $2.50 per 1,000 results: a per-result rate that undercuts comparable arXiv extractors, with no hidden surcharges. The $2.50 figure is the Gold-tier rate; without a subscription discount the rate is $3.00 per 1,000. Bronze, Silver, and Gold subscribers pay progressively less; the table below shows total cost at each discount tier.

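| Results | No discount | Bronze | Silver | Gold |
|---|---|---|---|---|
| 100 | $0.30 | $0.28 | $0.265 | $0.25 |
| 1,000 | $3.00 | $2.80 | $2.65 | $2.50 |
| 10,000 | $30.00 | $28.00 | $26.50 | $25.00 |
| 100,000 | $300.00 | $280.00 | $265.00 | $250.00 |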

A "result" is any paper row in the output dataset. No compute or time-based charges: you pay per result, plus a small fixed per-run start fee.
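
The table rows follow directly from the per-1,000 rates, so a budget estimate is simple arithmetic. The sketch below uses the rates from the table. The per-run start fee amount is not stated on this page, so `start_fee` is a caller-supplied assumption that defaults to zero, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Price per 1,000 results at each discount tier, taken from the pricing table.
RATE_PER_1000 = {
    "none": Decimal("3.00"),
    "bronze": Decimal("2.80"),
    "silver": Decimal("2.65"),
    "gold": Decimal("2.50"),
}


def estimate_cost(results: int, tier: str = "none", start_fee: Decimal = Decimal("0")) -> Decimal:
    """Estimate run cost in USD: per-result charge plus an optional start fee."""
    if results < 0:
        raise ValueError("results must not be negative")
    rate = RATE_PER_1000[tier.lower()]
    return rate * results / 1000 + start_fee


if __name__ == "__main__":
    for tier in RATE_PER_1000:
        print(tier, estimate_cost(10_000, tier))
```

Decimal is used instead of float so that amounts such as $0.265 compare exactly against the table.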

Integrations

Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:

  • Zapier / Make / n8n: Workflow automation
  • Google Sheets: Direct spreadsheet export
  • Slack / Email: Notifications on new results
  • Webhooks: Trigger custom workflows on run completion
  • Apify API: Full programmatic access
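
For programmatic access, one option is the Apify API's synchronous "run and return dataset items" endpoint. The sketch below only builds the HTTP request with the standard library and sends nothing until you call `urlopen`. This page does not give the actor's API identifier, so `actor_id` is a placeholder you would copy from the actor's API tab in the `username~actor-name` form, and `build_run_request` is a hypothetical helper.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that starts a run and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values: substitute your own actor ID and API token.
    request = build_run_request(
        actor_id="<username>~<actor-name>",
        token="<APIFY_TOKEN>",
        run_input={"searchQuery": "large language models", "maxResults": 50},
    )
    print(request.get_method(), request.full_url)
    # To execute the run (this incurs charges):
    # with urllib.request.urlopen(request) as response:
    #     papers = json.load(response)
```

Synchronous runs suit small result sets; for large sweeps, starting the run asynchronously and reading the dataset afterwards is likely the safer pattern.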

arXiv content is openly accessible, and this actor is designed for legitimate academic research, literature review, bibliometrics, and dataset building. Each paper on arXiv is distributed under its own license chosen by the authors; respect those individual licenses when reusing abstracts or full text. Users are responsible for complying with applicable laws and arXiv's terms of use, including keeping request rates reasonable. Do not use extracted data for spam, harassment, or any illegal purpose.