Arxiv Paper Scraper
Pricing
$1.00 / 1,000 papers
Developer
Technical Dost Solutions
Maintained by Community
ArXiv Paper Scraper — Search & Export Research Papers (Titles, Abstracts, Authors, PDF Links)
ArXiv Paper Scraper searches arXiv.org and exports research papers as clean, structured JSON. Give it a search query, a list of arXiv categories (like cs.AI, cs.LG, stat.ML), or both, and it returns titles, abstracts, authors, publication dates, subject categories, DOIs, and direct PDF links, ready for spreadsheets, dashboards, literature reviews, or AI/ML trend monitoring.
Built for researchers, data scientists, ML engineers, librarians, and anyone who needs to scrape arXiv papers in bulk without writing API-parsing code. It uses the official arXiv API under the hood, so the metadata comes straight from arXiv, and requests are spaced out to respect arXiv's rate limits.
Features
- Search by keyword: full-text query across titles, abstracts, and authors (e.g. large language models, diffusion models, protein folding).
- Filter by arXiv category: narrow results to one or more categories such as cs.AI, cs.LG, stat.ML, physics.gen-ph, math.OC.
- Bulk export: pull up to 1,000 papers per run with automatic batching and pagination.
- Sort control: order by submission date, last-updated date, or relevance, ascending or descending.
- Rich structured output: every paper includes title, abstract/summary, authors, primary and all categories, published/updated timestamps, DOI, journal reference, and PDF/HTML/abstract links.
- Clean JSON / CSV / Excel: export from the dataset in any of these formats, or pull via the Apify API into Zapier, Make, Google Sheets, and more.
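To make the keyword and category filters concrete, here is a minimal sketch of how such a search could be expressed against the public arXiv API. It is not the Actor's actual code: the function name `build_arxiv_url`, the use of the `all:` field prefix for keywords, and the choice to OR the categories together are assumptions made for illustration.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def build_arxiv_url(search_query="", categories=(), start=0, max_results=100,
                    sort_by="submittedDate", sort_order="descending"):
    """Build an arXiv API query URL (hypothetical helper, not the Actor's code)."""
    clauses = []
    if search_query:
        # Assumption: search all fields; quotes keep a multi-word phrase together.
        clauses.append(f'all:"{search_query}"')
    if categories:
        # Assumption: a paper matches if it belongs to any listed category.
        cats = " OR ".join(f"cat:{c}" for c in categories)
        clauses.append(f"({cats})")
    if not clauses:
        raise ValueError("need a search query or at least one category")
    params = {
        "search_query": " AND ".join(clauses),
        "start": start,              # offset of the first result (for paging)
        "max_results": max_results,  # page size
        "sortBy": sort_by,           # submittedDate, lastUpdatedDate, or relevance
        "sortOrder": sort_order,     # ascending or descending
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_arxiv_url("large language models", ["cs.CL", "cs.AI"], max_results=200)
print(url)
```

The sort parameter names and values here are the arXiv API's own, which is presumably why the Actor's sortBy and sortOrder inputs use the same spellings.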
Input
| Field | Type | Description |
|---|---|---|
| searchQuery | string | Search terms for finding papers (e.g. machine learning). |
| categories | array | arXiv categories to filter by (e.g. cs.AI, cs.LG, stat.ML). Optional. |
| maxResults | integer | Maximum number of papers to extract (1–1000, default 100). |
| sortBy | string | submittedDate, lastUpdatedDate, or relevance. |
| sortOrder | string | descending or ascending. |
Example input:
{
  "searchQuery": "large language models",
  "categories": ["cs.CL", "cs.AI"],
  "maxResults": 200,
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}
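Before starting a run, it can help to check an input object against the constraints in the input table. The sketch below does that locally. `validate_input` is a hypothetical helper, and the rule that at least one of searchQuery or categories must be present is an assumption; the Actor may validate differently.

```python
ALLOWED_SORT_BY = {"submittedDate", "lastUpdatedDate", "relevance"}
ALLOWED_SORT_ORDER = {"descending", "ascending"}


def validate_input(run_input):
    """Return a list of problems found in an input dict (empty list means OK)."""
    problems = []
    categories = run_input.get("categories", [])
    if not isinstance(categories, list):
        problems.append("categories must be an array of strings")
    # Assumption: a run needs a query, categories, or both.
    if not run_input.get("searchQuery") and not categories:
        problems.append("provide searchQuery, categories, or both")
    max_results = run_input.get("maxResults", 100)  # documented default
    is_int = isinstance(max_results, int) and not isinstance(max_results, bool)
    if not is_int or not 1 <= max_results <= 1000:
        problems.append("maxResults must be an integer from 1 to 1000")
    if "sortBy" in run_input and run_input["sortBy"] not in ALLOWED_SORT_BY:
        problems.append("sortBy must be submittedDate, lastUpdatedDate, or relevance")
    if "sortOrder" in run_input and run_input["sortOrder"] not in ALLOWED_SORT_ORDER:
        problems.append("sortOrder must be descending or ascending")
    return problems


example = {
    "searchQuery": "large language models",
    "categories": ["cs.CL", "cs.AI"],
    "maxResults": 200,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
}
print(validate_input(example))  # prints []
```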
Output
Each paper is stored as one dataset item:
{
  "id": "2406.01234v1",
  "title": "A Survey of Large Language Models for Scientific Discovery",
  "summary": "We review recent advances in applying large language models to...",
  "authors": ["Jane Doe", "John Smith"],
  "published": "2024-06-03T17:59:00Z",
  "updated": "2024-06-05T12:10:00Z",
  "categories": ["cs.CL", "cs.AI"],
  "primaryCategory": "cs.CL",
  "links": {
    "abstract": "http://arxiv.org/abs/2406.01234v1",
    "pdf": "http://arxiv.org/pdf/2406.01234v1",
    "html": "http://arxiv.org/abs/2406.01234v1"
  },
  "doi": null,
  "comment": "12 pages, 4 figures",
  "journalRef": null,
  "scrapedAt": "2024-06-06T09:00:00Z"
}
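Because authors, categories, and links are nested, a custom spreadsheet export may need the item flattened first. The following sketch shows one way to do that with the standard library. The helper names, the column selection, and the "; " author separator are illustrative choices, not part of the Actor.

```python
import csv
import io


def flatten_paper(item):
    """Turn one dataset item into a flat dict suitable for a CSV row."""
    links = item.get("links") or {}
    return {
        "id": item.get("id"),
        "title": item.get("title"),
        "authors": "; ".join(item.get("authors") or []),
        "primaryCategory": item.get("primaryCategory"),
        "categories": " ".join(item.get("categories") or []),
        "published": item.get("published"),
        "doi": item.get("doi") or "",  # doi may be null in the dataset
        "pdf": links.get("pdf", ""),
    }


def papers_to_csv(items):
    """Render a list of dataset items as CSV text."""
    rows = [flatten_paper(item) for item in items]
    buffer = io.StringIO()
    if rows:
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return buffer.getvalue()


sample = {
    "id": "2406.01234v1",
    "title": "A Survey of Large Language Models for Scientific Discovery",
    "authors": ["Jane Doe", "John Smith"],
    "published": "2024-06-03T17:59:00Z",
    "categories": ["cs.CL", "cs.AI"],
    "primaryCategory": "cs.CL",
    "links": {"pdf": "http://arxiv.org/pdf/2406.01234v1"},
    "doi": None,
}
print(papers_to_csv([sample]))
```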
How to scrape arXiv papers
- Enter a Search Query (e.g. reinforcement learning) and/or one or more Categories (e.g. cs.LG).
- Set Maximum Results and choose how to Sort them.
- Click Start. The ArXiv Paper Scraper queries the official arXiv API, paginates through results, and writes each paper to the dataset.
- Download the results as JSON, CSV, or Excel, or fetch them through the Apify API and pipe them into your tools.
FAQ
Which arXiv categories can I use?
Any official arXiv category code, such as cs.AI, cs.LG, cs.CL, cs.CV, stat.ML, math.OC, physics.gen-ph, q-bio.BM, or econ.EM. You can pass several at once.
How many papers can I get per run?
Up to 1,000 per run. The scraper batches requests automatically and respects arXiv's rate limits.
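As a rough illustration of what batching under a rate limit can look like, the sketch below splits a request into pages and estimates the minimum waiting time between them. The batch size of 100 is a hypothetical value (the Actor's real batch size is not documented here), and the 3-second pause follows the delay suggested in arXiv's API guidance rather than anything this Actor states.

```python
REQUEST_DELAY_SECONDS = 3  # assumption: pause suggested by arXiv's API guidance


def plan_batches(max_results, batch_size=100):
    """Split a request for max_results papers into (start, size) pages."""
    if not 1 <= max_results <= 1000:
        raise ValueError("max_results must be between 1 and 1000")
    batches = []
    start = 0
    while start < max_results:
        size = min(batch_size, max_results - start)  # last page may be short
        batches.append((start, size))
        start += size
    return batches


def min_wait_seconds(batches):
    """Total pause time when sleeping between (not after) batches."""
    return max(0, len(batches) - 1) * REQUEST_DELAY_SECONDS


pages = plan_batches(250)
print(pages)                    # [(0, 100), (100, 100), (200, 50)]
print(min_wait_seconds(pages))  # 6
```

Under these assumptions a full 1,000-paper run needs ten requests and at least 27 seconds of pauses, on top of the time arXiv takes to answer each request.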
Can I get the full PDF text?
The scraper returns the direct PDF link for each paper (links.pdf). You can feed those links into a PDF-text extractor if you need the full body.
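One possible follow-up step is sketched below: collect the links.pdf values from the dataset items, then extract text from a PDF that has already been downloaded to disk. The example uses the third-party pypdf package (`pip install pypdf`) as one option among several; the helper names are hypothetical, and downloading the files is left to your own tooling.

```python
def pdf_links(items):
    """Collect the links.pdf URL from each dataset item that has one."""
    return [
        item["links"]["pdf"]
        for item in items
        if (item.get("links") or {}).get("pdf")
    ]


def extract_pdf_text(path):
    """Extract text from a local PDF file using pypdf (third-party)."""
    from pypdf import PdfReader  # imported lazily so pdf_links works without it

    reader = PdfReader(path)
    # extract_text() can yield nothing for image-only pages, hence the "or".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


items = [
    {"id": "2406.01234v1", "links": {"pdf": "http://arxiv.org/pdf/2406.01234v1"}},
    {"id": "no-pdf-example", "links": {}},
]
print(pdf_links(items))  # ['http://arxiv.org/pdf/2406.01234v1']
```

Text extracted this way is usually good enough for search or keyword analysis, but equations, tables, and multi-column layouts often come out scrambled.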
Is this allowed?
Yes. It uses the public, official arXiv API and honors its usage guidelines, including rate limiting between batches.
Can I automate it?
Yes. Schedule runs on Apify or trigger them via the API, and connect the output to Zapier, Make, Google Sheets, Slack, or a database.
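For API-triggered runs, a minimal sketch using only the Python standard library might look like the following. It targets Apify's synchronous "run Actor and get dataset items" endpoint. The Actor ID shown is a placeholder (this page does not give the real one), the token is assumed to come from your Apify account, and the function names are hypothetical.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Prepare a POST that runs an Actor and returns its dataset items."""
    url = f"{APIFY_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id, token, run_input, timeout=310):
    """Send the request and return the list of papers (makes a network call)."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Placeholder ID in "username~actor-name" form; replace with the real one.
request = build_run_request(
    "your-username~arxiv-paper-scraper",
    "YOUR_APIFY_TOKEN",
    {"searchQuery": "reinforcement learning", "maxResults": 50},
)
print(request.get_method(), request.full_url)
```

The synchronous endpoint only suits runs that finish within a few minutes; for larger or scheduled jobs, starting the run asynchronously and reading the dataset afterwards is likely the safer pattern.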
Pricing
This Actor uses pay-per-event pricing: you pay only for what you run, with no monthly subscription. Most search-and-export jobs cost a fraction of a cent in compute. See the Pricing tab on this Actor's page for current per-event rates.
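For a quick budget estimate, the arithmetic can be sketched as below. It assumes the $1.00 per 1,000 papers rate shown at the top of this listing is the only per-event charge, which may not hold; the Pricing tab remains the authoritative source, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Assumption: rate taken from the listing header; check the Pricing tab.
PRICE_PER_1000_PAPERS = Decimal("1.00")  # USD


def estimate_cost(num_papers):
    """Estimate the per-paper charge in USD for a run of num_papers."""
    cost = PRICE_PER_1000_PAPERS * num_papers / 1000
    return cost.quantize(Decimal("0.001"))  # round to a tenth of a cent


print(estimate_cost(200))   # 0.200
print(estimate_cost(1000))  # 1.000
```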