arXiv Search & Paper Scraper
Pricing
from $1.00 / 1,000 results
Search arXiv and get clean structured JSON for each paper: title, authors, abstract, categories, DOI, PDF link, and dates. Built for research, datasets, and AI pipelines.
Developer
Nicolas van Arkens
Maintained by Community
arXiv Search & Paper Scraper
Search arXiv and get clean, structured JSON for every paper: title, authors, abstract, categories, DOI, journal reference, PDF link, and dates. The arXiv API returns awkward Atom XML; this actor does the parsing for you and hands back tidy records ready for analysis, datasets, citation management, or feeding papers to an LLM.
Why use it
- Flexible search: by keywords, author, arXiv category, or title
- Authors as a clean list: not a blob of XML
- Categories split out: primary category plus all cross-listed ones
- Direct PDF + abstract links: and DOI / journal reference when available
- Parsed dates: published and last-updated
- Normalized text: abstracts cleaned of the API's messy whitespace
- Sort by relevance, last updated, or submission date
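
To make the points above concrete, here is a minimal sketch of the kind of Atom parsing involved, using only Python's standard library. It is an illustration, not the actor's actual code: the helper names `clean` and `parse_entry` are hypothetical, the sample feed is trimmed by hand, and the output keys simply mirror the sample record in the Output section.

```python
import re
import xml.etree.ElementTree as ET

# XML namespaces used in arXiv's Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text or "").strip()

def parse_entry(entry):
    """Turn one Atom <entry> into a flat dict (hypothetical helper, not the actor's code)."""
    abs_url = clean(entry.findtext("atom:id", default="", namespaces=NS))
    arxiv_id = abs_url.rsplit("/abs/", 1)[-1]
    version = re.search(r"v(\d+)$", arxiv_id)
    primary = entry.find("arxiv:primary_category", NS)
    pdf_url = None
    for link in entry.findall("atom:link", NS):
        if link.get("title") == "pdf":
            pdf_url = link.get("href")
    authors = [clean(a.findtext("atom:name", default="", namespaces=NS))
               for a in entry.findall("atom:author", NS)]
    return {
        "arxivId": arxiv_id,
        "version": int(version.group(1)) if version else None,
        "title": clean(entry.findtext("atom:title", default="", namespaces=NS)),
        "summary": clean(entry.findtext("atom:summary", default="", namespaces=NS)),
        "authors": authors,
        "authorCount": len(authors),
        "primaryCategory": primary.get("term") if primary is not None else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "published": clean(entry.findtext("atom:published", default="", namespaces=NS)),
        "updated": clean(entry.findtext("atom:updated", default="", namespaces=NS)),
        # DOI is only present for some papers, so fall back to None.
        "doi": clean(entry.findtext("arxiv:doi", default="", namespaces=NS)) or None,
        "pdfUrl": pdf_url,
        "absUrl": abs_url,
    }

# Hand-trimmed sample feed with the messy whitespace the API tends to return.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <updated>2023-08-02T00:41:18Z</updated>
    <published>2017-06-12T17:57:34Z</published>
    <title>Attention Is All
  You Need</title>
    <summary>  The dominant sequence transduction models are based on...
</summary>
    <author><name>Ashish Vaswani</name></author>
    <author><name>Noam Shazeer</name></author>
    <link href="http://arxiv.org/abs/1706.03762v7" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/1706.03762v7" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

records = [parse_entry(e) for e in ET.fromstring(SAMPLE).findall("atom:entry", NS)]
```

Every lookup goes through the namespace map because Atom elements and arXiv's extension elements live in different XML namespaces; forgetting this is a common reason a naive parser finds nothing.
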
Use cases
- Literature reviews & research: pull every recent paper in a field
- Building datasets: assemble structured corpora of papers and abstracts
- LLM / RAG pipelines: feed clean abstracts and metadata to models
- Trend monitoring: track new submissions in a category over time
- Citation & reference tooling: grab DOIs and journal refs at scale
Input
| Field | Description |
|---|---|
| Search query | Free-text keywords across all fields. |
| Author | Restrict to an author (phrase match). |
| Category | arXiv code, e.g. cs.LG, cs.CL, stat.ML. |
| Title contains | Restrict by title phrase. |
| Sort by / order | Relevance, last updated, or submitted; asc/desc. |
| Maximum papers | How many to return. |
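
For orientation, the sketch below shows one plausible way the fields in the table could map onto a public arXiv API request. The mapping is an assumption, not the actor's documented behavior: `build_query_url` and its parameter names are hypothetical, while the `all:`, `au:`, `cat:`, and `ti:` prefixes and the `sortBy` / `sortOrder` / `max_results` parameters come from the arXiv API itself.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Friendly names (hypothetical) mapped to the values the arXiv API expects.
SORT_BY = {
    "relevance": "relevance",
    "updated": "lastUpdatedDate",
    "submitted": "submittedDate",
}
SORT_ORDER = {"asc": "ascending", "desc": "descending"}

def build_query_url(query=None, author=None, category=None, title=None,
                    sort_by="relevance", order="desc", max_papers=10):
    """Combine the input fields into a single arXiv API URL (illustrative only)."""
    clauses = []
    if query:
        # One all: clause per keyword, so every keyword must match somewhere.
        clauses.extend(f"all:{word}" for word in query.split())
    if author:
        clauses.append(f'au:"{author}"')   # quoted for a phrase match
    if category:
        clauses.append(f"cat:{category}")
    if title:
        clauses.append(f'ti:"{title}"')    # quoted for a phrase match
    if not clauses:
        raise ValueError("at least one search field is required")
    params = {
        "search_query": " AND ".join(clauses),
        "start": 0,
        "max_results": max_papers,
        "sortBy": SORT_BY[sort_by],
        "sortOrder": SORT_ORDER[order],
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_query_url(query="graph neural", category="cs.LG",
                      sort_by="submitted", max_papers=5)
```

Joining the clauses with `AND` means each extra field narrows the result set, which matches the "Restrict to" wording in the table.
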
Output
{
  "arxivId": "1706.03762v7",
  "version": 7,
  "title": "Attention Is All You Need",
  "summary": "The dominant sequence transduction models are based on...",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"],
  "authorCount": 3,
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "published": "2017-06-12T17:57:34Z",
  "updated": "2023-08-02T00:41:18Z",
  "doi": "10.5555/3295222.3295349",
  "journalRef": "NeurIPS 2017",
  "pdfUrl": "http://arxiv.org/pdf/1706.03762v7",
  "absUrl": "http://arxiv.org/abs/1706.03762v7"
}
Export to JSON, CSV, or Excel, or pull via the Apify API. Connect to Sheets, Notion, Slack, Zapier, or Make.
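
If you post-process the JSON yourself, the list fields and ISO timestamps usually need a little handling. The sketch below assumes records shaped like the sample above; `parse_timestamp`, `flatten`, and `to_csv` are hypothetical helper names and the separators are arbitrary choices.

```python
import csv
import io
from datetime import datetime

def parse_timestamp(value):
    """Parse timestamps like '2017-06-12T17:57:34Z' into timezone-aware datetimes."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def flatten(record):
    """Join list fields into strings so the record fits one spreadsheet row."""
    row = dict(record)
    row["authors"] = "; ".join(record.get("authors", []))
    row["categories"] = " ".join(record.get("categories", []))
    return row

def to_csv(records):
    """Serialize records to CSV text, taking the column order from the first record."""
    rows = [flatten(r) for r in records]
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# A subset of the sample record's fields, copied from the Output section.
record = {
    "arxivId": "1706.03762v7",
    "version": 7,
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"],
    "categories": ["cs.CL", "cs.LG"],
    "published": "2017-06-12T17:57:34Z",
    "updated": "2023-08-02T00:41:18Z",
}

csv_text = to_csv([record])
days_between = (parse_timestamp(record["updated"]) - parse_timestamp(record["published"])).days
```

Here `days_between` is the gap between first submission and the latest revision, the sort of derived field that is easy to compute once the dates are real `datetime` objects.
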
Notes
- Uses the official public arXiv API. Independent tool, not affiliated with arXiv or Cornell University.
- Please be considerate with large jobs; the actor paces requests to respect arXiv's API guidelines.
- arXiv category reference: see arxiv.org/category_taxonomy for the full list of codes.
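
If you call the arXiv API directly alongside this actor, similar pacing is worth applying on your side. The sketch below is a generic throttle, not the actor's implementation; the class name `Throttle` is hypothetical, and the three-second default is an assumption based on arXiv's published API guidance, which you should verify against the current terms.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the class can be tested without real delays.
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `throttle.wait()` immediately before each request; the first call returns at once and later calls sleep only for whatever part of the interval has not already elapsed.
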