arXiv Preprint Scraper

Pricing

Pay per event

Export preprints from arXiv.org. Search 2.5M+ open-access papers across physics, mathematics, computer science, biology, economics, and quantitative finance. Query by keyword, author, category, or date range. Pull titles, authors, abstracts, categories, DOIs, journal refs, and PDF links.

Rating: 5.0 (1)

Developer: ParseForge (maintained by Community)

Actor stats: 0 bookmarked · 15 total users · 2 monthly active users · last modified 3 days ago

📐 arXiv Preprint Scraper

🚀 Export open-access research papers in seconds. Query 2.5M+ arXiv preprints across physics, math, computer science, statistics, quantitative biology, quantitative finance, economics, and electrical engineering by keyword, author, category, or date range. No API key, no registration, no XML parsing.

🕒 Last updated: 2026-05-27 · 📊 13 fields per record · 📄 2.5M+ preprints · 🧠 8 macro disciplines · 🔖 170+ subject categories

The arXiv Preprint Scraper searches the open-access arXiv archive and returns 13 fields per record, including arXiv ID, title, authors, abstract, primary category, every secondary category, DOI, journal reference, comment, publication dates, and direct links to both the PDF and the abstract page. arXiv has hosted preprints since 1991 and has become the de facto place where physicists, computer scientists, and mathematicians first publish new work.

The catalog covers eight macro disciplines and more than 170 subject categories, from cs.LG (machine learning) and math.AG (algebraic geometry) to q-bio.NC (neurons and cognition) and econ.EM (econometrics). This Actor accepts the full arXiv query syntax, so you can filter by title, abstract, author, category, or any boolean combination, and download the dataset as CSV, Excel, JSON, or XML.

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| ML engineers, academic researchers, literature-review teams, science journalists, R&D groups, librarians, citation tools, AI agents | Paper discovery, trend tracking, author monitoring, citation graphs, RAG/training data, alert pipelines, systematic reviews |

📋 What the arXiv Scraper does

Four research workflows in a single run:

  • 🔍 Keyword search. Use all:transformer to search every field, or the ti: and abs: prefixes (e.g. ti:attention, abs:retrieval) to scope to titles or abstracts.
  • 👤 Author monitoring. Use au:hinton or au:lecun AND cat:cs.LG to track an author's output.
  • 🧠 Category feeds. Use cat:cs.LG, cat:hep-ph, or cat:q-bio.NC for category-specific firehoses.
  • 📅 Recency sort. Order by submittedDate or lastUpdatedDate, descending or ascending, to surface the latest work first.

Each record includes the arXiv identifier, full title, every co-author, the abstract, primary and secondary categories, the DOI if assigned, journal reference, author comments, both published and updated timestamps, and ready-to-open PDF and abstract URLs.

💡 Why it matters: scientific output doubles roughly every nine years. Tracking the literature by hand is impossible. Calling the public arXiv interface yourself means writing an XML parser, respecting rate limits, and managing pagination. This Actor turns that into a one-click data pull that returns clean JSON.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset of papers.


⚙️ Input

| Input | Type | Default | Behavior |
|---|---|---|---|
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| searchQuery | string | "all:transformer" | arXiv query syntax. Prefixes: `all`, `ti`, `abs`, `au`, `cat`, `id`. Boolean operators `AND`, `OR`, `ANDNOT` are supported. |
| sortBy | string | "relevance" | One of `relevance`, `lastUpdatedDate`, or `submittedDate`. |
| sortOrder | string | "descending" | `descending` (newest first) or `ascending` (oldest first). |

Example: 50 most recent machine-learning preprints.

{
  "maxItems": 50,
  "searchQuery": "cat:cs.LG",
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}

Example: up to 100 papers by authors matching "lecun" whose abstracts mention "neural", newest first.

{
  "maxItems": 100,
  "searchQuery": "au:lecun AND abs:neural",
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}
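
The input options above can be mirrored in a small helper that builds and checks a run input before it is submitted. This is a minimal sketch: the function name `make_input` and its validation rules are illustrative assumptions derived from the documented inputs and defaults, not code shipped with the Actor.

```python
# Allowed values as documented for the sortBy and sortOrder inputs.
ALLOWED_SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
ALLOWED_SORT_ORDER = {"descending", "ascending"}


def make_input(search_query: str = "all:transformer",
               max_items: int = 10,
               sort_by: str = "relevance",
               sort_order: str = "descending") -> dict:
    """Return a run-input dict, raising ValueError on values the docs do not list."""
    if not search_query.strip():
        raise ValueError("searchQuery must not be empty")
    if not isinstance(max_items, int) or max_items < 1:
        raise ValueError("maxItems must be a positive integer")
    if sort_by not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if sort_order not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")
    return {
        "maxItems": max_items,
        "searchQuery": search_query,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }


# Reproduces the "50 most recent machine-learning preprints" example.
latest_ml = make_input("cat:cs.LG", max_items=50, sort_by="submittedDate")
```

Catching a misspelled sort key locally is cheaper than discovering it after a run has started.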

⚠️ Good to Know: arXiv is a preprint server. Most papers are pre-publication and may not yet be peer-reviewed. The journalRef field is populated once an author updates the metadata after journal acceptance, and the doi field follows the same rule. For systematic reviews, combine this Actor with a peer-review check downstream.
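
One rough way to start the downstream check mentioned above is to separate records that carry a journal reference or DOI from those that do not. This is a sketch under a stated assumption: a populated `journalRef` or `doi` is treated only as a hint that a published version may exist, not as proof of peer review. The helper names are hypothetical.

```python
from typing import Iterable, List, Tuple


def has_publication_signal(record: dict) -> bool:
    """True when journalRef or doi is present and non-empty."""
    return bool(record.get("journalRef")) or bool(record.get("doi"))


def split_by_publication_signal(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Return (records with a signal, records without one), preserving order."""
    with_signal, without_signal = [], []
    for record in records:
        if has_publication_signal(record):
            with_signal.append(record)
        else:
            without_signal.append(record)
    return with_signal, without_signal
```

Records in the first list still need verification against the journal itself before they are counted as peer-reviewed in a systematic review.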


📊 Output

Each preprint record contains 13 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🆔 arxivId | string | "1706.03762v7" |
| 📄 title | string | "Attention Is All You Need" |
| 👥 authors | string[] | ["Ashish Vaswani", "Noam Shazeer", "..."] |
| 📝 summary | string | "The dominant sequence transduction models..." |
| 🧠 primaryCategory | string | "cs.CL" |
| 🏷️ categories | string[] | ["cs.CL", "cs.LG"] |
| 🔗 doi | string \| null | "10.48550/arXiv.1706.03762" |
| 📚 journalRef | string \| null | "NeurIPS 2017" |
| 💬 comment | string \| null | "15 pages, 5 figures" |
| 📅 publishedDate | ISO 8601 | "2017-06-12T17:57:34Z" |
| 🔁 updatedDate | ISO 8601 | "2023-08-02T00:41:18Z" |
| 📥 pdfUrl | string | "https://arxiv.org/pdf/1706.03762" |
| 🔖 abstractUrl | string | "http://arxiv.org/abs/1706.03762v7" |
| 🕒 scrapedAt | ISO 8601 | "2026-05-27T00:00:00.000Z" |
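
When loading records into a pipeline, two fields from the schema usually need light parsing: the version suffix on `arxivId` and the ISO 8601 timestamps. The sketch below assumes the formats shown in the example column; the helper names `split_arxiv_id` and `parse_timestamp` are hypothetical.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# Matches a trailing version marker such as "v7".
_VERSION_SUFFIX = re.compile(r"v(\d+)$")


def split_arxiv_id(arxiv_id: str) -> Tuple[str, Optional[int]]:
    """Split "1706.03762v7" into ("1706.03762", 7); version is None if absent."""
    match = _VERSION_SUFFIX.search(arxiv_id)
    if match is None:
        return arxiv_id, None
    return arxiv_id[:match.start()], int(match.group(1))


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as "2017-06-12T17:57:34Z" into an aware datetime."""
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


base_id, version = split_arxiv_id("1706.03762v7")
published = parse_timestamp("2017-06-12T17:57:34Z")
```

Keying a database on the version-less ID keeps all revisions of a paper together, while the version number tells you which revision a record describes.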

✨ Why choose this Actor

| | Capability |
|---|---|
| 📚 | 2.5M+ preprints. Every paper hosted on arXiv across physics, math, CS, statistics, quantitative biology, quantitative finance, economics, and electrical engineering. |
| 🎯 | Full arXiv query syntax. Title, abstract, author, category, ID, and boolean operators all work. |
| 📅 | Recency sort. Sort by submission date or last update for date-bounded discovery. |
| | Fast. 100 records per page, fully paginated. 1,000 papers in under two minutes. |
| 🧰 | Ready for downstream pipelines. Clean JSON with arXiv IDs, DOIs, and direct PDF links for RAG, training, or citation graphs. |
| 🔁 | Always fresh. arXiv updates continuously. Every run hits the live archive. |
| 🚫 | No registration. Uses only public open-access metadata. No login or API key required. |

🧠 Many state-of-the-art results in modern AI appeared on arXiv months before they reached a peer-reviewed venue. Skip the lag.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Query power | Setup |
|---|---|---|---|---|---|
| ⭐ arXiv Preprint Scraper (this Actor) | $5 free credit, then pay-per-use | 2.5M+ preprints | Live per run | Full arXiv syntax + sort | ⚡ 2 min |
| Google Scholar scraping | Variable | Broad but noisy | Live | Keyword only | ⏳ Hours, captcha-prone |
| Semantic Scholar API | Free tier | 200M+ papers | Daily | Limited operators | 🐢 Days (API key, quotas) |
| Manual arXiv listing pages | Free | All of arXiv | Live | UI clicks only | 🐢 No automation |

Pick this Actor when you want arXiv-quality metadata with a clean schema and zero parser maintenance.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the arXiv Preprint Scraper page on the Apify Store.
  3. 🎯 Set input. Write an arXiv query (e.g. cat:cs.LG), pick a sort order, and set maxItems.
  4. 🚀 Run it. Click Start and let the Actor collect your data.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

🧠 AI / ML R&D teams

  • Daily firehose of new cs.LG, cs.CL, and cs.CV papers
  • Build training corpora and RAG indexes from abstracts
  • Track competitor authors and labs by name
  • Surface state-of-the-art benchmarks via abstract keywords

📊 Investment & VC research

  • Monitor deep-tech preprints from portfolio companies
  • Track quant finance category q-fin.* for new strategies
  • Spot academic spin-out candidates before they raise
  • Build technology landscape reports from preprint clusters

📰 Science journalism & comms

  • Find newly posted physics and quantitative-biology preprints ahead of journal publication
  • Build alerts on senior-author names for explainer pieces
  • Pull abstracts for newsletters and round-ups
  • Cross-reference DOIs with published-version journals

🏥 Pharma & biotech intelligence

  • Track q-bio.BM and q-bio.GN for target-discovery work
  • Author-monitor academic collaborators
  • Build literature dashboards for therapeutic areas
  • Cite-graph upstream papers feeding clinical pipelines

🔌 Automating arXiv Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. A daily refresh on cat:cs.LG plus sortBy: submittedDate gives you a continuously updated "what's new" feed for any subject category.
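
A programmatic run with the Python client might look like the sketch below. It assumes the `apify-client` package, whose `ApifyClient.actor(...).call(run_input=...)` and `dataset(...).iterate_items()` methods are used here, and it assumes the installed client version returns run details as a dictionary. The helper name `fetch_preprints` is hypothetical, and the Actor ID and token are placeholders to fill in from your own Apify account.

```python
from typing import List


def fetch_preprints(client, actor_id: str, run_input: dict) -> List[dict]:
    """Start a run, wait for it to finish, and return the dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage sketch (needs `pip install apify-client` and network access):
#
#   from apify_client import ApifyClient
#
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")   # placeholder token
#   papers = fetch_preprints(
#       client,
#       "<ACTOR_ID>",                            # placeholder, copy from the Actor page
#       {"maxItems": 50, "searchQuery": "cat:cs.LG",
#        "sortBy": "submittedDate", "sortOrder": "descending"},
#   )
#   for paper in papers:
#       print(paper["arxivId"], paper["title"])
```

Passing the client in as an argument keeps the helper easy to test with a stub and leaves token handling to the caller.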


🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Systematic literature reviews with reproducible queries
  • Bibliometric analyses and citation-network construction
  • Course reading-list generation for graduate seminars
  • Thesis-defense literature scans across categories

🎨 Personal and creative

  • Hobby ML newsletter and Substack curation
  • Personal "papers to read" digest powered by RSS
  • Tools that surface arXiv papers to lay readers
  • Art projects visualizing the shape of human knowledge

🤝 Non-profit and civic

  • Public-interest tech evaluations of academic claims
  • Disinformation researchers tracking preprint origins
  • Civic-science explainers for climate and public-health topics
  • Education non-profits building free curricula from open papers

🧪 Experimentation

  • Train topic classifiers and embedding models on abstracts
  • Benchmark retrieval systems against arXiv-scale corpora
  • Prototype academic-search frontends and chat assistants
  • Build agent pipelines that resolve paper IDs to PDFs

❓ Frequently Asked Questions

🧩 How does it work?

You write an arXiv query (e.g. cat:cs.LG), pick a sort order, and click Start. The Actor hits the public arXiv catalog, paginates through results, and emits one clean JSON record per paper. No setup, no captchas.

🔍 What query syntax can I use?

arXiv's full query language. Prefixes include all, ti (title), abs (abstract), au (author), cat (category), and id. Combine with AND, OR, and ANDNOT. Wrap multi-word phrases in quotes, e.g. ti:"neural radiance fields".
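
Queries can also be assembled in code rather than typed by hand. The following is a minimal sketch that joins field prefixes with AND and quotes multi-word phrases as described above; the helper name `build_query` and its parameters are hypothetical, and it covers only a subset of the syntax (no OR, no nesting).

```python
from typing import Optional


def quote_if_phrase(value: str) -> str:
    """Wrap multi-word values in double quotes so they are matched as a phrase."""
    value = value.strip()
    return f'"{value}"' if " " in value else value


def build_query(title: Optional[str] = None,
                abstract: Optional[str] = None,
                author: Optional[str] = None,
                category: Optional[str] = None,
                exclude_category: Optional[str] = None) -> str:
    """Join the supplied filters with AND; append ANDNOT for an excluded category."""
    filters = (("ti", title), ("abs", abstract), ("au", author), ("cat", category))
    clauses = [f"{prefix}:{quote_if_phrase(value)}" for prefix, value in filters if value]
    if not clauses:
        raise ValueError("at least one positive filter is required")
    query = " AND ".join(clauses)
    if exclude_category:
        query += f" ANDNOT cat:{exclude_category}"
    return query


print(build_query(title="neural radiance fields", category="cs.CV"))
# ti:"neural radiance fields" AND cat:cs.CV
```

The resulting string can be passed as the searchQuery input.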

📚 Which subject categories are covered?

All of arXiv: physics (hep-ph, cond-mat, gr-qc, etc.), mathematics (math.AG, math.PR, etc.), computer science (cs.LG, cs.CV, etc.), statistics, quantitative biology, quantitative finance, economics, and electrical engineering.

🔁 How often is the data refreshed?

arXiv updates continuously as authors submit new versions. Every run of this Actor fetches the live archive, so your dataset is current at run time.

📅 Can I get only the latest papers?

Yes. Set sortBy to submittedDate and sortOrder to descending. The first records returned will be the most recently submitted preprints.

🔗 Do I get the full PDF?

The dataset includes a direct pdfUrl to the PDF on arXiv. You can download PDFs separately or pipe the URLs into a downloader. Full-text extraction is not part of this Actor.
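
A simple downloader for those URLs could look like the sketch below. It relies only on the standard library; the helper names, the skip-if-already-present behavior, and the three-second pause between requests are assumptions chosen for politeness rather than documented requirements.

```python
import re
import time
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def pdf_filename(record: dict) -> str:
    """Build a filesystem-safe file name from the record's arxivId."""
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", record["arxivId"])
    return f"{safe_id}.pdf"


def download_pdfs(records: Iterable[dict],
                  dest_dir: str,
                  fetch: Callable[[str, str], object] = urllib.request.urlretrieve,
                  delay_seconds: float = 3.0) -> List[Path]:
    """Save each record's pdfUrl into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved, downloaded = [], 0
    for record in records:
        target = dest / pdf_filename(record)
        if not target.exists():
            if downloaded and delay_seconds > 0:
                time.sleep(delay_seconds)  # pause between consecutive downloads
            fetch(record["pdfUrl"], str(target))
            downloaded += 1
        saved.append(target)
    return saved
```

Because existing files are skipped, the function can be re-run after an interruption without fetching the same PDF twice.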

💬 What is the comment field?

It's an author-supplied note attached to the preprint, often "20 pages, 5 figures" or "Accepted at NeurIPS 2024". It's optional, so it may be null on older or minimally annotated papers.

⚖️ Is programmatic access to arXiv permitted?

arXiv's terms permit programmatic access for non-commercial and most commercial research uses. The metadata (titles, authors, abstracts) is publicly viewable. Always review the latest arXiv terms for your specific use case.

💳 Do I need a paid Apify plan?

No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and gives you scheduling, higher concurrency, and larger datasets.

🔁 What happens if a run fails?

The Actor automatically retries transient errors and rotates outbound connections. If a run still fails, you can inspect the log, fix the input, and re-run. Partial datasets from failed runs are preserved.

🆘 What if I need help?

Contact us through the Apify platform or use the Tally form linked below.


🔌 Integrate with any app

arXiv Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe paper records into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh paper metadata into your RAG index, or alert your team in Slack when a watched author posts a new preprint.
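
The alerting idea above needs a way to tell which records are new since the last run. A minimal sketch, assuming you persist the set of arxivId values you have already handled; the helper names are hypothetical, and because arxivId carries a version suffix in the schema example, a revised version of a known paper counts as new here.

```python
from typing import Iterable, List, Set


def select_unseen(records: Iterable[dict], seen_ids: Set[str]) -> List[dict]:
    """Return records whose arxivId is not in seen_ids, adding them to the set."""
    fresh = []
    for record in records:
        if record["arxivId"] not in seen_ids:
            seen_ids.add(record["arxivId"])
            fresh.append(record)
    return fresh


def format_alert(record: dict) -> str:
    """Build a one-line message with title, up to three authors, and the abstract link."""
    authors = ", ".join(record["authors"][:3])
    if len(record["authors"]) > 3:
        authors += " et al."
    return f'{record["title"]} ({authors}) {record["abstractUrl"]}'
```

The text from `format_alert` can then be sent through whichever integration handles your notifications.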


💡 Pro Tip: browse the complete ParseForge collection for more research and reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by arXiv, Cornell University, or any of its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open-access preprint metadata is collected.