arXiv Preprint Scraper
Pricing
Pay per event
Export preprints from arXiv.org. Search 2.5M+ open-access papers across physics, mathematics, computer science, biology, economics, and quantitative finance. Query by keyword, author, category, or date range. Pull titles, authors, abstracts, categories, DOIs, journal refs, and PDF links.
Developer
ParseForge
📐 arXiv Preprint Scraper
🚀 Export open-access research papers in seconds. Query 2.5M+ arXiv preprints across physics, math, computer science, biology, finance, statistics, and economics by keyword, author, category, or date range. No API key, no registration, no XML parsing.
🕒 Last updated: 2026-05-27 · 📊 13 fields per record · 📄 2.5M+ preprints · 🧠 8 macro disciplines · 🔖 170+ subject categories
The arXiv Preprint Scraper searches the open-access arXiv archive and returns 13 fields per record, including arXiv ID, title, authors, abstract, primary category, every secondary category, DOI, journal reference, comment, publication dates, and direct links to both the PDF and the abstract page. arXiv has hosted preprints since 1991 and has become the de facto place where physicists, computer scientists, and mathematicians first publish new work.
The catalog covers eight macro disciplines and more than 170 subject categories, from cs.LG (machine learning) and math.AG (algebraic geometry) to q-bio.NC (neurons and cognition) and econ.EM (econometrics). This Actor accepts the full arXiv query syntax, so you can filter by title, abstract, author, category, or any boolean combination, and download the dataset as CSV, Excel, JSON, or XML.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| ML engineers, academic researchers, literature-review teams, science journalists, R&D groups, librarians, citation tools, AI agents | Paper discovery, trend tracking, author monitoring, citation graphs, RAG/training data, alert pipelines, systematic reviews |
📋 What the arXiv Scraper does
Four research workflows in a single run:
- 🔍 Keyword search. Use `all:transformer` or `ti:attention abs:retrieval` to scope to titles or abstracts.
- 👤 Author monitoring. Use `au:hinton` or `au:lecun AND cat:cs.LG` to track an author's output.
- 🧠 Category feeds. Use `cat:cs.LG`, `cat:hep-ph`, or `cat:q-bio.NC` for category-specific firehoses.
- 📅 Recency sort. Order by `submittedDate` or `lastUpdatedDate`, descending or ascending, to surface the latest work first.
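The query strings in these workflows can also be assembled programmatically. The sketch below is a minimal illustration, not part of the Actor: `term` and `combine` are hypothetical helper names, and the only rules they encode are the prefixes, boolean operators, and phrase quoting described in this README.

```python
from typing import Iterable

# Field prefixes and boolean operators as listed in this README.
PREFIXES = {"all", "ti", "abs", "au", "cat", "id"}
OPERATORS = {"AND", "OR", "ANDNOT"}


def term(prefix: str, value: str) -> str:
    """Return one `prefix:value` search term (hypothetical helper)."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    value = value.strip()
    if not value:
        raise ValueError("empty search value")
    # Multi-word phrases are wrapped in double quotes.
    if " " in value:
        value = f'"{value}"'
    return f"{prefix}:{value}"


def combine(operator: str, terms: Iterable[str]) -> str:
    """Join already-built terms with a single boolean operator."""
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator!r}")
    return f" {operator} ".join(terms)


# Author monitoring restricted to one category.
query = combine("AND", [term("au", "lecun"), term("cat", "cs.LG")])
print(query)
```

The resulting string can be passed as `searchQuery`. Nested grouping with parentheses is deliberately left out of this sketch.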
Each record includes the arXiv identifier, full title, every co-author, the abstract, primary and secondary categories, the DOI if assigned, journal reference, author comments, both published and updated timestamps, and ready-to-open PDF and abstract URLs.
💡 Why it matters: scientific output doubles roughly every nine years. Tracking the literature by hand is impossible. Calling the public arXiv interface yourself means writing an XML parser, respecting rate limits, and managing pagination. This Actor turns that into a one-click data pull that returns clean JSON.
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset of papers.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| searchQuery | string | "all:transformer" | arXiv query syntax. Prefixes: `all`, `ti`, `abs`, `au`, `cat`, `id`. Boolean operators `AND`, `OR`, `ANDNOT` are supported. |
| sortBy | string | "relevance" | One of `relevance`, `lastUpdatedDate`, or `submittedDate`. |
| sortOrder | string | "descending" | `descending` (newest first) or `ascending` (oldest first). |
Example: 50 most recent machine-learning preprints.
{"maxItems": 50,"searchQuery": "cat:cs.LG","sortBy": "submittedDate","sortOrder": "descending"}
Example: up to 100 papers by Yann LeCun on neural networks, newest first. Note that `au:lecun` matches any author named LeCun, and `maxItems` caps the result at 100.
{"maxItems": 100,"searchQuery": "au:lecun AND abs:neural","sortBy": "submittedDate","sortOrder": "descending"}
⚠️ Good to Know: arXiv is a preprint server. Most papers are pre-publication and may not yet be peer-reviewed. The `journalRef` field is populated once an author updates the metadata after journal acceptance, and the `doi` field follows the same rule. For systematic reviews, combine this Actor with a peer-review check downstream.
📊 Output
Each preprint record contains 13 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🆔 arxivId | string | "1706.03762v7" |
| 📄 title | string | "Attention Is All You Need" |
| 👥 authors | string[] | ["Ashish Vaswani", "Noam Shazeer", "..."] |
| 📝 summary | string | "The dominant sequence transduction models..." |
| 🧠 primaryCategory | string | "cs.CL" |
| 🏷️ categories | string[] | ["cs.CL", "cs.LG"] |
| 🔗 doi | string \| null | "10.48550/arXiv.1706.03762" |
| 📚 journalRef | string \| null | "NeurIPS 2017" |
| 💬 comment | string \| null | "15 pages, 5 figures" |
| 📅 publishedDate | ISO 8601 | "2017-06-12T17:57:34Z" |
| 🔁 updatedDate | ISO 8601 | "2023-08-02T00:41:18Z" |
| 📥 pdfUrl | string | "https://arxiv.org/pdf/1706.03762" |
| 🔖 abstractUrl | string | "http://arxiv.org/abs/1706.03762v7" |
| 🕒 scrapedAt | ISO 8601 | "2026-05-27T00:00:00.000Z" |
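For typed downstream code, the schema maps naturally onto a small record class. The sketch below assumes the field names and types shown in the schema table. `Preprint`, `parse_iso`, and `from_item` are hypothetical names, and the sample item is assembled from the Example column with the author list and abstract left abbreviated.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


def parse_iso(value: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    # Older Python versions do not accept 'Z' in fromisoformat, so swap it out.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class Preprint:
    arxiv_id: str
    title: str
    authors: List[str]
    summary: str
    primary_category: str
    categories: List[str]
    doi: Optional[str]
    journal_ref: Optional[str]
    comment: Optional[str]
    published: datetime
    updated: datetime
    pdf_url: str
    abstract_url: str
    scraped_at: datetime

    @classmethod
    def from_item(cls, item: dict) -> "Preprint":
        """Build a Preprint from one dataset item; nullable fields use .get()."""
        return cls(
            arxiv_id=item["arxivId"],
            title=item["title"],
            authors=list(item["authors"]),
            summary=item["summary"],
            primary_category=item["primaryCategory"],
            categories=list(item["categories"]),
            doi=item.get("doi"),
            journal_ref=item.get("journalRef"),
            comment=item.get("comment"),
            published=parse_iso(item["publishedDate"]),
            updated=parse_iso(item["updatedDate"]),
            pdf_url=item["pdfUrl"],
            abstract_url=item["abstractUrl"],
            scraped_at=parse_iso(item["scrapedAt"]),
        )


# Sample item built from the schema's Example column (abbreviated).
sample = {
    "arxivId": "1706.03762v7",
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "summary": "The dominant sequence transduction models...",
    "primaryCategory": "cs.CL",
    "categories": ["cs.CL", "cs.LG"],
    "doi": "10.48550/arXiv.1706.03762",
    "journalRef": "NeurIPS 2017",
    "comment": "15 pages, 5 figures",
    "publishedDate": "2017-06-12T17:57:34Z",
    "updatedDate": "2023-08-02T00:41:18Z",
    "pdfUrl": "https://arxiv.org/pdf/1706.03762",
    "abstractUrl": "http://arxiv.org/abs/1706.03762v7",
    "scrapedAt": "2026-05-27T00:00:00.000Z",
}
paper = Preprint.from_item(sample)
print(paper.title, paper.published.date(), paper.categories)
```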
📦 Sample records
✨ Why choose this Actor
| | Capability |
|---|---|
| 📚 | 2.5M+ preprints. Every paper hosted on arXiv across physics, math, CS, statistics, quantitative biology, quantitative finance, economics, and electrical engineering. |
| 🎯 | Full arXiv query syntax. Title, abstract, author, category, ID, and boolean operators all work. |
| 📅 | Recency sort. Sort by submission date or last update for date-bounded discovery. |
| ⚡ | Fast. 100 records per page, fully paginated. 1,000 papers in under two minutes. |
| 🧰 | Ready for downstream pipelines. Clean JSON with arXiv IDs, DOIs, and direct PDF links for RAG, training, or citation graphs. |
| 🔁 | Always fresh. arXiv updates continuously. Every run hits the live archive. |
| 🚫 | No registration. Uses only public open-access metadata. No login or API key required. |
🧠 Many state-of-the-art results in modern AI appeared on arXiv months before they reached a peer-reviewed venue. Skip the lag.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Query power | Setup |
|---|---|---|---|---|---|
| ⭐ arXiv Preprint Scraper (this Actor) | $5 free credit, then pay-per-use | 2.5M+ preprints | Live per run | Full arXiv syntax + sort | ⚡ 2 min |
| Google Scholar scraping | Variable | Broad but noisy | Live | Keyword only | ⏳ Hours, captcha-prone |
| Semantic Scholar API | Free tier | 200M+ papers | Daily | Limited operators | 🐢 Days (API key, quotas) |
| Manual arXiv listing pages | Free | All of arXiv | Live | UI clicks only | 🐢 No automation |
Pick this Actor when you want arXiv-quality metadata with a clean schema and zero parser maintenance.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the arXiv Preprint Scraper page on the Apify Store.
- 🎯 Set input. Write an arXiv query (e.g. `cat:cs.LG`), pick a sort order, and set `maxItems`.
- 🚀 Run it. Click Start and let the Actor collect your data.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating arXiv Scraper
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. A daily run with `searchQuery` set to `cat:cs.LG` and `sortBy` set to `submittedDate` gives you a continuously updated "what's new" feed for that category. Swap in any other category code to follow a different field.
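A programmatic run with the Python client might look like the sketch below. It reflects the `apify-client` usage pattern as I understand it (`ApifyClient`, `actor(...).call(run_input=...)`, `dataset(...).iterate_items()`), so verify it against the client version you install. The Actor ID is not stated in this README, so `actor_id` is a parameter you must fill in from the Store page, and both function names are hypothetical.

```python
def build_daily_feed_input(category: str, max_items: int = 100) -> dict:
    """Input for a 'what's new' feed on one arXiv category, newest first."""
    return {
        "maxItems": max_items,
        "searchQuery": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }


def fetch_feed(actor_id: str, token: str, category: str) -> list:
    """Run the Actor once and return its dataset items as a list of dicts.

    `actor_id` is whatever ID the Store page shows for this Actor;
    it is an assumption left to the caller, not a value from this README.
    """
    # Imported lazily so build_daily_feed_input works without the package.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=build_daily_feed_input(category))
    if run is None:
        raise RuntimeError("the run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example call (needs `pip install apify-client` and a valid API token):
# items = fetch_feed("<actor-id-from-store-page>", "<your-apify-token>", "cs.LG")
print(build_daily_feed_input("cs.LG"))
```

Keeping the input builder separate from the network call makes it easy to reuse the same input in an Apify Schedule.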
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🤖 Ask an AI assistant about this scraper
Open a ready-to-send prompt about this ParseForge actor in the AI of your choice:
- 💬 ChatGPT
- 🧠 Claude
- 🔍 Perplexity
- 🅒 Copilot
❓ Frequently Asked Questions
🧩 How does it work?
You write an arXiv query (e.g. cat:cs.LG), pick a sort order, and click Start. The Actor hits the public arXiv catalog, paginates through results, and emits one clean JSON record per paper. No setup, no captchas.
🔍 What query syntax can I use?
arXiv's full query language. Prefixes include `all`, `ti` (title), `abs` (abstract), `au` (author), `cat` (category), and `id`. Combine with `AND`, `OR`, and `ANDNOT`. Wrap multi-word phrases in quotes, e.g. `ti:"neural radiance fields"`.
📚 Which subject categories are covered?
All of arXiv: physics (hep-ph, cond-mat, gr-qc, etc.), mathematics (math.AG, math.PR, etc.), computer science (cs.LG, cs.CV, etc.), statistics, quantitative biology, quantitative finance, economics, and electrical engineering.
🔁 How often is the data refreshed?
arXiv updates continuously as authors submit new versions. Every run of this Actor fetches the live archive, so your dataset is current at run time.
📅 Can I get only the latest papers?
Yes. Set sortBy to submittedDate and sortOrder to descending. The first records returned will be the most recently submitted preprints.
🔗 Do I get the full PDF?
The dataset includes a direct pdfUrl to the PDF on arXiv. You can download PDFs separately or pipe the URLs into a downloader. Full-text extraction is not part of this Actor.
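Piping `pdfUrl` values into a downloader could be sketched as follows, using only the standard library. `pdf_filename` and `download_pdfs` are hypothetical helpers, and the three-second pause between requests is an assumed, conservative default. Check arXiv's current access guidance before bulk downloading.

```python
import time
import urllib.request
from pathlib import Path
from typing import Iterable, List
from urllib.parse import urlparse


def pdf_filename(pdf_url: str) -> str:
    """Derive a local filename from a pdfUrl, e.g. '1706.03762.pdf'."""
    path = urlparse(pdf_url).path            # e.g. '/pdf/1706.03762'
    stem = path.split("/pdf/", 1)[-1]
    # Old-style IDs contain a slash ('hep-th/9901001'); flatten it.
    stem = stem.strip("/").replace("/", "_")
    if stem.endswith(".pdf"):
        stem = stem[:-4]
    return stem + ".pdf"


def download_pdfs(records: Iterable[dict], out_dir: str = "pdfs",
                  delay_seconds: float = 3.0) -> List[Path]:
    """Download each record's PDF once, pausing between requests."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for record in records:
        target = out / pdf_filename(record["pdfUrl"])
        if not target.exists():              # skip files fetched earlier
            urllib.request.urlretrieve(record["pdfUrl"], str(target))
            time.sleep(delay_seconds)        # assumed polite delay
        saved.append(target)
    return saved


print(pdf_filename("https://arxiv.org/pdf/1706.03762"))
```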
💬 What is the comment field?
It's an author-supplied note attached to the preprint, often "20 pages, 5 figures" or "Accepted at NeurIPS 2024". It's optional, so it may be null on older or minimally annotated papers.
⚖️ Is it legal to use arXiv metadata?
arXiv's terms permit programmatic access for non-commercial and most commercial research uses. The metadata (titles, authors, abstracts) is publicly viewable. Always review the latest arXiv terms for your specific use case.
💳 Do I need a paid Apify plan?
No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and gives you scheduling, higher concurrency, and larger datasets.
🔁 What happens if a run fails?
The Actor automatically retries transient errors and rotates outbound connections. If a run still fails, you can inspect the log, fix the input, and re-run. Partial datasets from failed runs are preserved.
🆘 What if I need help?
Contact us through the Apify platform or use the Tally form linked below.
🔌 Integrate with any app
arXiv Scraper connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Get run notifications in your channels
- Airbyte - Pipe paper records into your warehouse
- GitHub - Trigger runs from commits and releases
- Google Drive - Export datasets straight to Sheets
You can also use webhooks to trigger downstream actions when a run finishes. Push fresh paper metadata into your RAG index, or alert your team in Slack when a watched author posts a new preprint.
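For alerting, a downstream step usually needs to tell new papers from ones it has already seen, including revised versions of the same paper. The sketch below assumes `arxivId` values carry a trailing version suffix as in the schema example (`1706.03762v7`). `base_id` and `new_papers` are hypothetical helpers, and the second ID in the demo data is made up for illustration.

```python
import re
from typing import Iterable, List, Set


def base_id(arxiv_id: str) -> str:
    """Strip a trailing version suffix: '1706.03762v7' -> '1706.03762'."""
    return re.sub(r"v\d+$", "", arxiv_id)


def new_papers(records: Iterable[dict], seen: Set[str]) -> List[dict]:
    """Return records whose base ID is not in `seen`, and add them to it."""
    fresh = []
    for record in records:
        key = base_id(record["arxivId"])
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh


# IDs remembered from earlier runs (persist this set between runs).
seen_ids = {"1706.03762"}
latest_run = [
    {"arxivId": "1706.03762v7"},   # already seen, only the version differs
    {"arxivId": "2401.00001v1"},   # made-up ID for illustration
    {"arxivId": "2401.00001v2"},   # later version of the same made-up paper
]
to_alert = new_papers(latest_run, seen_ids)
print([r["arxivId"] for r in to_alert])
```

Only the records in `to_alert` would then be forwarded to Slack or pushed into an index.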
🔗 Recommended Actors
- ✈️ OurAirports Scraper - Global airport reference database
- 💼 Greenhouse Jobs Scraper - Pull research and engineering job postings
- 📈 LinkedIn Jobs Scraper - Track academic-adjacent industry roles
- 🔍 Monster Scraper - U.S. job market signal for research talent
- 🧑‍💼 Lever Jobs Scraper - Pipeline of startup R&D openings
💡 Pro Tip: browse the complete ParseForge collection for more research and reference-data scrapers.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by arXiv, Cornell University, or any of its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open-access preprint metadata is collected.