arXiv Search & Paper Scraper

Pricing

from $1.00 / 1,000 results

Search arXiv and get clean structured JSON for each paper: title, authors, abstract, categories, DOI, PDF link, and dates. Built for research, datasets, and AI pipelines.

Developer

Nicolas van Arkens

Maintained by Community

Last modified: 2 days ago

arXiv Search & Paper Scraper 📚

Search arXiv and get clean, structured JSON for every paper: title, authors, abstract, categories, DOI, journal reference, PDF link, and dates. The arXiv API returns awkward Atom XML; this actor does the parsing for you and hands back tidy records ready for analysis, datasets, citation management, or feeding papers to an LLM.

Why use it

  • 🔎 Flexible search: by keywords, author, arXiv category, or title
  • 👥 Authors as a clean list, not a blob of XML
  • 🏷️ Categories split out: primary category plus all cross-listed ones
  • 🔗 Direct PDF and abstract links, plus DOI and journal reference when available
  • 📅 Parsed dates: published and last-updated
  • 🧹 Normalized text: abstracts cleaned of the API's messy whitespace
  • ↕️ Sort by relevance, last updated, or submission date
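
To make the parsing step concrete, here is a minimal sketch of how one Atom entry could be flattened into a record with a clean author list, split-out categories, and normalized whitespace. It is not the actor's own code: the helper names (`clean`, `parse_entry`) and the hand-written sample entry are assumptions for illustration, and only a subset of the output fields is covered.

```python
import re
import xml.etree.ElementTree as ET

# XML namespaces used in arXiv API Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written sample entry for illustration; real responses carry more fields.
SAMPLE_ENTRY = """\
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/1706.03762v7</id>
  <published>2017-06-12T17:57:34Z</published>
  <updated>2023-08-02T00:41:18Z</updated>
  <title>Attention Is
  All You Need</title>
  <summary>  The dominant sequence transduction models
are based on...
  </summary>
  <author><name>Ashish Vaswani</name></author>
  <author><name>Noam Shazeer</name></author>
  <arxiv:primary_category term="cs.CL"/>
  <category term="cs.CL"/>
  <category term="cs.LG"/>
</entry>
"""


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text or "").strip()


def parse_entry(entry):
    """Turn one Atom <entry> element into a flat dict (hypothetical helper)."""
    entry_id = entry.findtext("atom:id", default="", namespaces=NS)
    authors = [
        clean(author.findtext("atom:name", namespaces=NS))
        for author in entry.findall("atom:author", NS)
    ]
    primary = entry.find("arxiv:primary_category", NS)
    return {
        "arxivId": entry_id.rsplit("/abs/", 1)[-1],
        "title": clean(entry.findtext("atom:title", namespaces=NS)),
        "summary": clean(entry.findtext("atom:summary", namespaces=NS)),
        "authors": authors,
        "authorCount": len(authors),
        "primaryCategory": primary.get("term") if primary is not None else None,
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
    }


record = parse_entry(ET.fromstring(SAMPLE_ENTRY))
print(record["title"], record["authors"], record["categories"])
```

The whitespace collapse matters because titles and abstracts in the feed are hard-wrapped, so line breaks land in the middle of sentences.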

Use cases

  • Literature reviews & research: pull every recent paper in a field
  • Building datasets: assemble structured corpora of papers and abstracts
  • LLM / RAG pipelines: feed clean abstracts and metadata to models
  • Trend monitoring: track new submissions in a category over time
  • Citation & reference tooling: grab DOIs and journal refs at scale

Input

| Field | Description |
| --- | --- |
| Search query | Free-text keywords across all fields. |
| Author | Restrict to an author (phrase match). |
| Category | arXiv code, e.g. cs.LG, cs.CL, stat.ML. |
| Title contains | Restrict by title phrase. |
| Sort by / order | Relevance, last updated, or submitted; asc/desc. |
| Maximum papers | How many to return. |
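
For orientation, the sketch below shows how inputs like these could translate into a query against the public arXiv API, using its documented field prefixes (`all:`, `au:`, `cat:`, `ti:`) and the `sortBy`, `sortOrder`, `start`, and `max_results` parameters. The actor's actual input keys and internal mapping are not documented here, so the function name, its parameter names, and the `SORT_BY` table are assumptions.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Assumed mapping from the sort options above to arXiv's sortBy values.
SORT_BY = {
    "relevance": "relevance",
    "updated": "lastUpdatedDate",
    "submitted": "submittedDate",
}


def build_query_url(query=None, author=None, category=None, title=None,
                    sort_by="relevance", descending=True, max_results=10):
    """Build an arXiv API URL from actor-style inputs (hypothetical names)."""
    clauses = []
    if query:
        clauses.append(f"all:{query}")
    if author:
        clauses.append(f'au:"{author}"')  # quoted for a phrase match
    if category:
        clauses.append(f"cat:{category}")
    if title:
        clauses.append(f'ti:"{title}"')  # quoted for a phrase match
    if not clauses:
        raise ValueError("at least one search field is required")
    params = {
        "search_query": " AND ".join(clauses),
        "sortBy": SORT_BY[sort_by],
        "sortOrder": "descending" if descending else "ascending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(author="Ashish Vaswani", category="cs.CL", max_results=5)
print(url)
```

Combining the clauses with `AND` means every field that is filled in narrows the result set, which matches the "restrict" wording in the table.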

Output

{
  "arxivId": "1706.03762v7",
  "version": 7,
  "title": "Attention Is All You Need",
  "summary": "The dominant sequence transduction models are based on...",
  "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar"],
  "authorCount": 3,
  "primaryCategory": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "published": "2017-06-12T17:57:34Z",
  "updated": "2023-08-02T00:41:18Z",
  "doi": "10.5555/3295222.3295349",
  "journalRef": "NeurIPS 2017",
  "pdfUrl": "http://arxiv.org/pdf/1706.03762v7",
  "absUrl": "http://arxiv.org/abs/1706.03762v7"
}
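
A record like this is straightforward to post-process. As a small sketch, the snippet below separates the versioned `arxivId` into a base ID and version number and turns the two timestamps into timezone-aware datetimes. The helper names `split_arxiv_id` and `parse_utc` are illustrative and not part of the actor.

```python
import re
from datetime import datetime, timezone

# Subset of the record shown above.
record = {
    "arxivId": "1706.03762v7",
    "published": "2017-06-12T17:57:34Z",
    "updated": "2023-08-02T00:41:18Z",
}


def split_arxiv_id(arxiv_id):
    """Split '1706.03762v7' into ('1706.03762', 7); version is None if absent."""
    match = re.fullmatch(r"(.+?)(?:v(\d+))?", arxiv_id)
    base, version = match.group(1), match.group(2)
    return base, (int(version) if version else None)


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' timestamp as an aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


base_id, version = split_arxiv_id(record["arxivId"])
revision_span = parse_utc(record["updated"]) - parse_utc(record["published"])
print(base_id, version, revision_span.days)
```

Keeping the base ID separate is handy for deduplication, since the same paper can appear under several version suffixes.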

Export to JSON, CSV, or Excel, or pull via the Apify API. Connect to Sheets, Notion, Slack, Zapier, or Make.
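
As one way to pull results programmatically, the sketch below builds a download URL for the Apify dataset items endpoint and fetches the items with the standard library. The dataset ID and token are placeholders you would take from your own run and account, and the helper names are assumptions. The `format` parameter selects the export type, for example `json`, `csv`, or `xlsx`.

```python
import json
import urllib.request
from urllib.parse import urlencode

APIFY_API = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, fmt="json", limit=None):
    """Build the URL for downloading a dataset's items in the given format."""
    params = {"format": fmt, "clean": "true"}
    if limit is not None:
        params["limit"] = limit
    return f"{APIFY_API}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id, token):
    """Download all items as parsed JSON; needs network access and a valid token."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# "YOUR_DATASET_ID" is a placeholder for the dataset ID of a finished run.
print(dataset_items_url("YOUR_DATASET_ID", fmt="csv", limit=100))
```

Sending the token in an `Authorization` header rather than in the query string keeps it out of logs and shared links.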

Notes

  • Uses the official public arXiv API. Independent tool, not affiliated with arXiv or Cornell University.
  • Please be considerate with large jobs; the actor paces requests to respect arXiv's API guidelines.
  • arXiv category reference: see arxiv.org/category_taxonomy for the full list of codes.
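
If you call the arXiv API yourself alongside the actor, the same courtesy applies. The sketch below illustrates one simple pacing scheme: split a large job into pages and pause between requests. The 3-second default follows arXiv's published guidance of roughly one request every three seconds. How the actor paces its own requests is not documented here, so the function and its defaults are assumptions.

```python
import time


def paged_offsets(total, page_size=100, delay_seconds=3.0, sleep=time.sleep):
    """Yield (start, count) pairs, sleeping between pages but not before the first."""
    for start in range(0, total, page_size):
        if start > 0:
            sleep(delay_seconds)
        yield start, min(page_size, total - start)


# Dry run with a stub in place of time.sleep so the example finishes instantly.
pauses = []
pages = list(paged_offsets(250, page_size=100, sleep=pauses.append))
print(pages)   # [(0, 100), (100, 100), (200, 50)]
print(pauses)  # [3.0, 3.0]
```

Each `(start, count)` pair maps onto the API's `start` and `max_results` parameters. Passing the sleep function as an argument makes the pacing easy to test without waiting.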