Arxiv Paper Scraper

Pricing

$1.00 / 1,000 papers

Developer

Technical Dost Solutions

Maintained by Community

ArXiv Paper Scraper — Search & Export Research Papers (Titles, Abstracts, Authors, PDF Links)

ArXiv Paper Scraper is a fast, reliable way to search arXiv.org and export research papers as clean, structured JSON. Give it a search query or arXiv categories (like cs.AI, cs.LG, stat.ML) and the ArXiv Paper Scraper returns titles, abstracts, authors, publication dates, subject categories, DOIs, and direct PDF links — ready for spreadsheets, dashboards, literature reviews, or AI/ML trend monitoring.

Built for researchers, data scientists, ML engineers, librarians, and anyone who needs to scrape arXiv papers in bulk without writing API-parsing code. It uses the official arXiv API under the hood, so the metadata matches what arXiv itself serves and requests stay within arXiv's rate limits.

Features

  • Search by keyword: free-text query across titles, abstracts, and authors (e.g. large language models, diffusion models, protein folding).
  • Filter by arXiv category: narrow results to one or more categories such as cs.AI, cs.LG, stat.ML, physics.gen-ph, math.OC.
  • Bulk export: pull up to 1,000 papers per run with automatic batching and pagination.
  • Sort control: order by submission date, last-updated date, or relevance, ascending or descending.
  • Rich structured output: every paper includes title, abstract/summary, authors, primary category plus the full category list, published/updated timestamps, DOI, journal reference, and PDF/HTML/abstract links.
  • Clean JSON / CSV / Excel: export from the dataset in any of these formats, or pull via the Apify API into Zapier, Make, Google Sheets, and more.

Input

Field        Type     Description
searchQuery  string   Search terms for finding papers (e.g. machine learning).
categories   array    arXiv categories to filter (e.g. cs.AI, cs.LG, stat.ML). Optional.
maxResults   integer  Maximum number of papers to extract (1–1000, default 100).
sortBy       string   submittedDate, lastUpdatedDate, or relevance.
sortOrder    string   descending or ascending.

Example input:

{
  "searchQuery": "large language models",
  "categories": ["cs.CL", "cs.AI"],
  "maxResults": 200,
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}
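
If you build the input programmatically, it can help to validate it before starting a run. The following is a minimal Python sketch, not part of the Actor: the helper name build_input and its default sortBy and sortOrder values are assumptions made for illustration; only the field names, the allowed values, and the 1–1000 range come from the Input table.

```python
from typing import List, Optional

# Allowed values as listed in the Input table.
SORT_BY = {"submittedDate", "lastUpdatedDate", "relevance"}
SORT_ORDER = {"descending", "ascending"}


def build_input(search_query: str = "",
                categories: Optional[List[str]] = None,
                max_results: int = 100,
                sort_by: str = "submittedDate",      # assumed default, not documented
                sort_order: str = "descending") -> dict:  # assumed default, not documented
    """Validate the values and return a dict keyed by the documented field names."""
    cats = list(categories or [])
    if not search_query.strip() and not cats:
        raise ValueError("give a search query, at least one category, or both")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    if sort_by not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if sort_order not in SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")
    return {
        "searchQuery": search_query.strip(),
        "categories": cats,
        "maxResults": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
```

Calling build_input("large language models", ["cs.CL", "cs.AI"], 200) yields the same fields as the example input, and out-of-range values fail locally instead of after a run has started.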

Output

Each paper is stored as one dataset item:

{
  "id": "2406.01234v1",
  "title": "A Survey of Large Language Models for Scientific Discovery",
  "summary": "We review recent advances in applying large language models to...",
  "authors": ["Jane Doe", "John Smith"],
  "published": "2024-06-03T17:59:00Z",
  "updated": "2024-06-05T12:10:00Z",
  "categories": ["cs.CL", "cs.AI"],
  "primaryCategory": "cs.CL",
  "links": {
    "abstract": "http://arxiv.org/abs/2406.01234v1",
    "pdf": "http://arxiv.org/pdf/2406.01234v1",
    "html": "http://arxiv.org/abs/2406.01234v1"
  },
  "doi": null,
  "comment": "12 pages, 4 figures",
  "journalRef": null,
  "scrapedAt": "2024-06-06T09:00:00Z"
}
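
Downstream code usually wants typed values rather than raw strings. Here is a small Python sketch of post-processing one dataset item, assuming items have exactly the shape shown above; the helper names (split_version, parse_timestamp, summarize) are illustrative and not part of the Actor.

```python
import re
from datetime import datetime
from typing import Optional, Tuple


def split_version(arxiv_id: str) -> Tuple[str, Optional[int]]:
    """Split an id such as '2406.01234v1' into ('2406.01234', 1)."""
    match = re.fullmatch(r"(.+?)v(\d+)", arxiv_id)
    if match:
        return match.group(1), int(match.group(2))
    return arxiv_id, None  # no version suffix present


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp ending in 'Z' into a timezone-aware datetime."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(item: dict) -> dict:
    """Reduce one dataset item to a few typed fields for reporting."""
    base_id, version = split_version(item["id"])
    published = parse_timestamp(item["published"])
    updated = parse_timestamp(item["updated"])
    return {
        "base_id": base_id,
        "version": version,
        "published": published,
        "revised": updated > published,  # True if updated after first submission
        "first_author": item["authors"][0] if item["authors"] else None,
        "pdf": item["links"]["pdf"],
    }
```

Grouping by base_id rather than id is one way to avoid counting two versions of the same paper twice when you merge results from several runs.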

How to scrape arXiv papers

  1. Enter a Search Query (e.g. reinforcement learning) and/or one or more Categories (e.g. cs.LG).
  2. Set Maximum Results and choose how to Sort them.
  3. Click Start — the ArXiv Paper Scraper queries the official arXiv API, paginates through results, and writes each paper to the dataset.
  4. Download the results as JSON, CSV, or Excel, or fetch them through the Apify API and pipe them into your tools.

FAQ

Which arXiv categories can I use? Any official arXiv category code, such as cs.AI, cs.LG, cs.CL, cs.CV, stat.ML, math.OC, physics.gen-ph, q-bio.BM, or econ.EM. You can pass several at once.

How many papers can I get per run? Up to 1,000 per run. The scraper batches requests automatically and respects arXiv's rate limits.

Can I get the full PDF text? The scraper returns the direct PDF link for each paper (links.pdf). You can feed those links into a PDF-text extractor if you need the full body.

Is this allowed? Yes — it uses the public, official arXiv API and honors its usage guidelines, including rate limiting between batches.

Can I automate it? Yes. Schedule runs on Apify or trigger them via the API, and connect the output to Zapier, Make, Google Sheets, Slack, or a database.
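
As an illustration of triggering a run through the API, the sketch below uses the client interface of the official apify-client Python package (actor(...).call(run_input=...) followed by dataset(...).iterate_items()). The function name fetch_papers is made up for this example, and the Actor ID you pass in is a placeholder you must copy from this Actor's page; it is not stated here.

```python
from typing import Any, Dict, List


def fetch_papers(client: Any, actor_id: str, run_input: Dict[str, Any]) -> List[dict]:
    """Start a run, wait for it to finish, and return all dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    # call() starts the Actor and blocks until the run finishes.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return a result")
    # Each paper is one item in the run's default dataset.
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())
```

With apify-client installed, a call would look like fetch_papers(ApifyClient("<YOUR_API_TOKEN>"), "<username>/<actor-name>", run_input), where both angle-bracket values are placeholders. Passing the client in as an argument keeps the function easy to test with a stand-in object.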

Pricing

This Actor uses pay-per-event pricing: you pay only for what you run, with no monthly subscription. At the listed rate of $1.00 per 1,000 papers, a 200-paper run comes to about $0.20 in per-paper charges. See the Pricing tab on this Actor's page for current per-event rates.