arXiv Scraper
Pricing
from $2.50 / 1,000 results
Search arXiv and extract paper metadata: titles, authors, abstracts, subject categories, DOIs, journal references, submission dates, and PDF links. Search by keyword, title, author, or category, or fetch specific papers by arXiv ID.
Developer
SolidCode
Maintained by Community
Search arXiv at scale and pull clean, structured paper metadata: titles, full author lists with affiliations, abstracts, subject categories, DOIs, journal references, submission and revision dates, and direct PDF and abstract-page links. Search by free-text keyword, by title, author, or abstract as separate fields, or by subject category, or fetch exact papers by arXiv ID. Built for researchers, data scientists, and librarians who need a ready-to-use arXiv dataset without manual copy-paste or paging through raw repository feeds.
Why This Scraper?
- ~40 subject categories across 8 disciplines: pick from a labeled list spanning computer science, statistics, mathematics, physics, quantitative biology, quantitative finance, economics, and electrical engineering. Select cs.LG, stat.ML, math.PR, quant-ph, and more with a checkbox; there are no codes to memorize.
- Field-specific search, not just keywords: match words in the title, the author name, or the abstract as separate inputs, then combine them. Find "transformer" in the title by Vaswani in the cs.CL category in one run.
- Direct arXiv-ID lookup, including legacy IDs: paste a list of IDs to fetch exact papers. Handles both modern (2310.06825) and legacy slash-style (cond-mat/0011267) identifiers, so decades-old preprints come back just as cleanly as last week's.
- Full author affiliations: every author arrives as a structured record with name and institutional affiliation when the paper lists one, ready for co-author and institution analysis.
- DOI and journal reference for published-version cross-linking: when authors register a DOI or cite the published venue, both fields land in the row, letting you join preprints to their peer-reviewed counterparts.
- Direct PDF and abstract-page links on every paper: a pdfUrl for the full text and an absUrl for the human-readable landing page, so downstream tools can fetch or link without rebuilding URLs.
- Sort by relevance, submission date, or last-updated date: newest-first or oldest-first, so you can surface the freshest preprints or build a chronological corpus.
- Up to 50,000 papers per run: set the result cap to zero to sweep an entire topic, with a built-in safety ceiling so a broad query never runs away.
Use Cases
Academic Literature & Systematic Review
- Assemble a complete reading list for a topic, sorted by relevance or recency
- Narrow a survey to a single subject category to cut cross-field noise
- Pull every preprint by a specific author for a focused author study
- Track the latest submissions in a field by sorting on submission date
Research Trend & Citation Analysis
- Measure publication volume in an emerging sub-field over time
- Map which institutions are most active via author affiliations
- Detect bursts of activity by sweeping recent submissions in a category
- Build a chronological corpus to chart how terminology shifts year over year
Competitive R&D Intelligence
- Monitor what a competing lab or research group is publishing on a topic
- Benchmark output across institutions using affiliation data
- Spot new directions before they reach peer-reviewed journals
- Watch a category daily for the newest preprints in your space
ML & AI Dataset Building
- Harvest abstracts at scale to train or fine-tune domain models
- Build a labeled corpus by subject category for classification tasks
- Collect title-abstract pairs for summarization and retrieval datasets
- Gather a topic-specific text set for embeddings and semantic search
Bibliographic Database Enrichment
- Cross-reference preprints to published versions via DOI and journal reference
- Fill in missing abstracts, categories, and dates in an existing catalog
- Resolve legacy slash-style IDs to current metadata
- Enrich a reference manager export with affiliations and revision dates
Grant & Patent Prior-Art Search
- Surface the earliest preprints describing a technique for prior-art review
- Document the state of the art in a field for a grant proposal
- Trace an idea back to its first submission date on arXiv
- Compile a dated evidence trail across multiple subject categories
Getting Started
Basic Keyword Search
The simplest possible run, with one topic and 50 papers:
{
  "searchQuery": "large language models",
  "maxResults": 50
}
Field-Specific Search by Category
Find recent computer-vision and machine-learning papers whose title mentions diffusion, newest first:
{
  "title": "diffusion",
  "categories": ["cs.CV", "cs.LG"],
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxResults": 200
}
Fetch Specific Papers by ID
Pull exact papers, mixing modern and legacy IDs in one list. All search fields are ignored:
{
  "arxivIds": ["2310.06825", "1706.03762", "cond-mat/0011267"]
}
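Before pasting a long ID list, it can help to check which style each entry uses. The sketch below is a local pre-check only: the two regular expressions are assumptions modeled on the modern and legacy examples shown above, and `classify_arxiv_id` is a hypothetical helper, not part of the scraper.

```python
import re

# Assumed patterns based on the two ID styles shown above; the scraper's
# own validation rules are not documented here.
MODERN_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
LEGACY_ID = re.compile(r"^[a-z][a-z\-]*(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def classify_arxiv_id(raw):
    """Return 'modern', 'legacy', or 'unknown' for one identifier string."""
    candidate = raw.strip()
    if MODERN_ID.match(candidate):
        return "modern"
    if LEGACY_ID.match(candidate):
        return "legacy"
    return "unknown"


ids = ["2310.06825", "1706.03762", "cond-mat/0011267", "not-an-id"]
kinds = {arxiv_id: classify_arxiv_id(arxiv_id) for arxiv_id in ids}
```

Entries that come back as `unknown` are worth fixing by hand before they go into `arxivIds`.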
Author and Abstract Search Combined
Up to 500 preprints by one author whose abstract mentions reinforcement learning, ordered by last-updated date:
{
  "author": "Yann LeCun",
  "abstract": "reinforcement learning",
  "categories": ["cs.AI", "cs.LG", "stat.ML"],
  "sortBy": "lastUpdatedDate",
  "maxResults": 500
}
Input Reference
Search
Combine any of these fields, or paste arXiv IDs to fetch exact papers.
| Parameter | Type | Default | Description |
|---|---|---|---|
| searchQuery | string | "large language models" | Free-text search across paper metadata (title, abstract, authors). Advanced users can use field prefixes like ti:, au:, abs:, cat: and boolean operators. |
| title | string | null | Only include papers whose title contains these words. |
| author | string | null | Only include papers by this author (e.g. "Yann LeCun" or "Hinton"). |
| abstract | string | null | Only include papers whose abstract contains these words. |
| categories | array | [] | Restrict results to selected arXiv subject areas. Choose from ~40 labeled categories across 8 disciplines; leave empty to search all subjects. |
| arxivIds | array | [] | Fetch specific papers by arXiv ID (e.g. 2310.06825 or legacy cond-mat/0011267). When set, the search fields above are ignored. |
Results
| Parameter | Type | Default | Description |
|---|---|---|---|
| maxResults | integer | 50 | Maximum papers to return. Set to 0 to fetch all matches, with a safety cap of 50,000 so very broad searches don't run indefinitely. Ignored when fetching by ID. |
| sortBy | select | Relevance | Order results by Relevance, Submission date, or Last updated date. |
| sortOrder | select | Newest first (descending) | Newest first (descending) or Oldest first (ascending). Most useful when sorting by date. |
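The precedence rules in the two tables above (ID lookup overrides search fields, `maxResults` of 0 means "all" up to the 50,000 cap) can be mirrored locally before a run is started. This is a minimal sketch: `normalize_input` is a hypothetical helper, and clamping values above 50,000 to the cap is an assumption about how the ceiling is applied.

```python
SAFETY_CAP = 50_000   # ceiling stated in the Results table
DEFAULT_MAX = 50      # documented default for maxResults


def normalize_input(run_input):
    """Hypothetical helper that mirrors the documented input rules locally."""
    if run_input.get("arxivIds"):
        # ID lookup wins: search fields and maxResults are ignored.
        return {"arxivIds": list(run_input["arxivIds"])}
    normalized = {k: v for k, v in run_input.items() if k != "arxivIds"}
    requested = normalized.get("maxResults", DEFAULT_MAX)
    # 0 means "all matches", which the docs bound at the safety cap.
    # Clamping larger values to the cap is an assumption.
    if requested == 0:
        normalized["maxResults"] = SAFETY_CAP
    else:
        normalized["maxResults"] = min(requested, SAFETY_CAP)
    return normalized


sweep = normalize_input({"searchQuery": "graph neural networks", "maxResults": 0})
by_id = normalize_input({"arxivIds": ["2310.06825"], "title": "ignored"})
```

Here `sweep["maxResults"]` becomes 50000 and `by_id` keeps only the ID list, which makes the effective request explicit before any results are billed.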
Output
Each paper is one flat row in the dataset. Here is a representative result:
{
  "arxivId": "1706.03762",
  "version": 7,
  "title": "Attention Is All You Need",
  "authors": [
    { "name": "Ashish Vaswani", "affiliation": "Google Brain" },
    { "name": "Noam Shazeer", "affiliation": "Google Brain" },
    { "name": "Niki Parmar", "affiliation": "Google Research" }
  ],
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder...",
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "publishedDate": "2017-06-12T17:57:34Z",
  "updatedDate": "2023-08-02T00:41:18Z",
  "doi": "10.48550/arXiv.1706.03762",
  "journalRef": "Advances in Neural Information Processing Systems 30 (2017)",
  "comments": "15 pages, 5 figures",
  "pdfUrl": "https://arxiv.org/pdf/1706.03762v7",
  "absUrl": "https://arxiv.org/abs/1706.03762v7"
}
Core Fields
| Field | Type | Description |
|---|---|---|
| title | string | Paper title, whitespace-normalized |
| authors | object[] | One record per author: { name, affiliation } (affiliation included when the paper lists it) |
| abstract | string | Full abstract text |
| primaryCategory | string | Primary arXiv subject code (e.g. cs.CL) |
| categories | string[] | All subject codes on the paper |
| comments | string or null | Author comments (e.g. "15 pages, 5 figures") |
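As an illustration of working with the `authors` records, the sketch below counts how many papers list each affiliation. The rows are made-up placeholders shaped like the fields in the table above, and `affiliation_counts` is a hypothetical helper. It tolerates both a missing and a null `affiliation`, since the exact representation of "not listed" is not specified here.

```python
from collections import Counter


def affiliation_counts(rows):
    """Count papers per affiliation, counting each institution once per paper."""
    counts = Counter()
    for row in rows:
        seen = {
            author["affiliation"]
            for author in row.get("authors", [])
            if author.get("affiliation")  # skip authors with no affiliation listed
        }
        counts.update(seen)
    return counts


# Placeholder rows for illustration only, not real scraper output.
rows = [
    {"authors": [{"name": "A", "affiliation": "Lab One"},
                 {"name": "B", "affiliation": "Lab One"},
                 {"name": "C", "affiliation": "Lab Two"}]},
    {"authors": [{"name": "D"},
                 {"name": "E", "affiliation": "Lab One"}]},
]
counts = affiliation_counts(rows)
top = counts.most_common(1)
```

Counting once per paper rather than once per author keeps a single large collaboration from dominating the ranking.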
Identifiers & Cross-References
| Field | Type | Description |
|---|---|---|
| arxivId | string | arXiv identifier without version (e.g. 1706.03762) |
| version | integer | Version number (v7 → 7) |
| doi | string or null | DOI when the authors registered one |
| journalRef | string or null | Journal reference / citation when the paper is published |
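Because `arxivId` and `version` arrive as separate fields, matching rows against versioned identifiers from other sources (such as 1706.03762v7) takes a small conversion step. The helpers below are a hypothetical sketch of that conversion; the regular expression is an assumption that a trailing `v` followed by digits is always a version suffix.

```python
import re
from typing import Optional, Tuple

# Assumption: a trailing "v<digits>" is a version suffix; everything before it is the base ID.
VERSIONED = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


def split_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split '1706.03762v7' into ('1706.03762', 7); version is None when absent."""
    match = VERSIONED.match(identifier.strip())
    version = match.group("version")
    return match.group("base"), int(version) if version else None


def versioned_id(row: dict) -> str:
    """Join the arxivId and version fields of one output row."""
    return f"{row['arxivId']}v{row['version']}"


modern = split_version("1706.03762v7")
legacy = split_version("cond-mat/0011267")
joined = versioned_id({"arxivId": "1706.03762", "version": 7})
```

In the representative result shown earlier, the joined form matches the last path segment of `absUrl` and `pdfUrl`.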
Dates & Links
| Field | Type | Description |
|---|---|---|
| publishedDate | string | First-submitted timestamp (ISO 8601) |
| updatedDate | string | Last-updated timestamp (ISO 8601) |
| pdfUrl | string | Direct link to the full-text PDF |
| absUrl | string | Link to the arXiv abstract landing page |
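For trend work, the two timestamp fields can be parsed with the standard library. This is a minimal sketch with hypothetical helper names and placeholder rows. It assumes timestamps always end in `Z` as in the representative result, and rewrites that suffix because `datetime.fromisoformat` in older Python versions does not accept it.

```python
from collections import Counter
from datetime import datetime


def parse_timestamp(value):
    """Parse an ISO 8601 string such as '2017-06-12T17:57:34Z' into an aware datetime."""
    # Map the trailing "Z" to an explicit UTC offset for older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def papers_per_year(rows):
    """Count rows by the year of first submission."""
    return Counter(parse_timestamp(row["publishedDate"]).year for row in rows)


def days_between_versions(row):
    """Whole days between first submission and the latest update."""
    delta = parse_timestamp(row["updatedDate"]) - parse_timestamp(row["publishedDate"])
    return delta.days


# Placeholder rows for illustration only.
rows = [
    {"publishedDate": "2020-01-01T00:00:00Z", "updatedDate": "2020-01-11T12:00:00Z"},
    {"publishedDate": "2023-03-05T09:30:00Z", "updatedDate": "2023-03-05T09:30:00Z"},
    {"publishedDate": "2023-11-20T18:00:00Z", "updatedDate": "2023-12-01T18:00:00Z"},
]
per_year = papers_per_year(rows)
gap_days = days_between_versions(rows[0])
```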
Tips for Best Results
- Use field prefixes for precision. In searchQuery you can write ti:transformer to match only titles or cat:cs.CL to scope a subject. Power users can build advanced boolean queries like ti:transformer AND abs:translation in a single field.
- Narrow by category to cut noise. A broad term like "networks" spans biology, physics, and computer science. Selecting one or two subject categories sharpens results dramatically and lowers your result count.
- Sort by submission date for the freshest preprints. Set sortBy to Submission date with Newest first to surface the very latest work in a field, ideal for daily monitoring and trend tracking.
- Fetch by ID when you know exactly what you want. Pasting arXiv IDs is the fastest, most precise path: it skips search entirely and returns those exact papers, legacy slash-style IDs included.
- Start small, then scale. Run with maxResults of 25 to 50 to confirm the data matches your needs, then raise the cap or set it to 0 to sweep a whole topic.
- Keep DOI and journal reference for cross-linking. When present, these fields let you match a preprint to its peer-reviewed version, which is invaluable for bibliographic enrichment and citation work.
- Combine title, author, and abstract for laser-focused queries. The three field inputs are AND-joined, so a name in author plus a phrase in abstract returns only papers that satisfy both.
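The prefix syntax and the AND-joining described in these tips can be combined into a single `searchQuery` string. The sketch below is one way to assemble such a string, not the scraper's internal logic: `build_search_query` is a hypothetical helper, and quoting multi-word values and OR-joining multiple categories in parentheses are both assumptions.

```python
def build_search_query(title=None, author=None, abstract=None, categories=()):
    """Assemble a prefixed boolean query from separate field values."""
    clauses = []
    for prefix, value in (("ti", title), ("au", author), ("abs", abstract)):
        if value:
            # Assumption: quote multi-word values so they stay attached to the prefix.
            text = f'"{value}"' if " " in value else value
            clauses.append(f"{prefix}:{text}")
    if categories:
        # Assumption: several categories mean "any of these", so they are OR-joined.
        joined = " OR ".join(f"cat:{code}" for code in categories)
        clauses.append(f"({joined})" if len(categories) > 1 else joined)
    return " AND ".join(clauses)


simple = build_search_query(title="transformer", abstract="translation")
scoped = build_search_query(author="Yann LeCun", categories=["cs.AI", "cs.LG"])
```

With these inputs, `simple` reproduces the ti:transformer AND abs:translation example from the first tip.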
Pricing
From $2.50 per 1,000 results at the Gold tier ($3.00 per 1,000 without a discount). This is a flat per-result rate that undercuts comparable arXiv extractors, with no hidden surcharges. Bronze, Silver, and Gold subscribers pay progressively less; the table below shows total cost at each discount tier.
| Results | No discount | Bronze | Silver | Gold |
|---|---|---|---|---|
| 100 | $0.30 | $0.28 | $0.265 | $0.25 |
| 1,000 | $3.00 | $2.80 | $2.65 | $2.50 |
| 10,000 | $30.00 | $28.00 | $26.50 | $25.00 |
| 100,000 | $300.00 | $280.00 | $265.00 | $250.00 |
A "result" is any paper row in the output dataset. There are no compute or time-based charges: you pay per result, plus a small fixed per-run start fee.
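The table rows follow directly from the per-1,000 rate at each tier. As a rough sketch, the hypothetical `estimate_cost` helper below reproduces that arithmetic with `Decimal` to avoid floating-point rounding. It leaves out the per-run start fee, whose amount is not stated here.

```python
from decimal import Decimal

# Per-1,000-result rates read from the 1,000-results row of the pricing table.
RATE_PER_1000 = {
    "none": Decimal("3.00"),
    "bronze": Decimal("2.80"),
    "silver": Decimal("2.65"),
    "gold": Decimal("2.50"),
}


def estimate_cost(results, tier="none"):
    """Per-result cost in dollars, excluding the per-run start fee."""
    return RATE_PER_1000[tier] * results / 1000


small_silver = estimate_cost(100, "silver")
large_gold = estimate_cost(100_000, "gold")
```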
Integrations
Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:
- Zapier / Make / n8n: Workflow automation
- Google Sheets: Direct spreadsheet export
- Slack / Email: Notifications on new results
- Webhooks: Trigger custom workflows on run completion
- Apify API: Full programmatic access
Legal & Ethical Use
arXiv content is openly accessible, and this actor is designed for legitimate academic research, literature review, bibliometrics, and dataset building. Each paper on arXiv is distributed under its own license chosen by the authors, so respect those individual licenses when reusing abstracts or full text. Users are responsible for complying with applicable laws and arXiv's terms of use, including keeping request rates reasonable. Do not use extracted data for spam, harassment, or any illegal purpose.