arXiv Preprint Scraper

Pricing

Pay per event

Export preprints from arXiv.org. Search 2.5M+ open-access papers across physics, mathematics, computer science, biology, economics, and quantitative finance. Query by keyword, author, category, or date range. Pull titles, authors, abstracts, categories, DOIs, journal refs, and PDF links.

Rating: 5.0 (1)

Developer: ParseForge (Maintained by Community)

Actor stats: 0 bookmarked · 12 total users · 2 monthly active users · last modified 5 days ago

📚 arXiv Scraper

🚀 Export open-access research in seconds. Query 2M+ preprints from arXiv by keyword, author, or category, and pull titles, abstracts, authors, DOIs, and PDF URLs into a clean dataset. No API key, no registration, no XML parsing.

🕒 Last updated: 2026-04-23 · 📊 14 fields per paper · 📖 2M+ papers · 🔍 Keyword & author search · 📂 All categories · 🚫 No auth required

The arXiv Scraper queries the public arXiv API (export.arxiv.org) and returns 14 fields per paper, including arxivId, title, authors, full abstract, primary and secondary categories, DOI, journal reference, publication and update dates, and a direct PDF URL. arXiv is the world's largest open-access preprint archive for physics, mathematics, computer science, quantitative biology, statistics, and economics.

The archive spans every major quantitative discipline and 2+ million papers going back to 1991. This Actor converts arXiv query syntax into a structured dataset available as CSV, Excel, JSON, or XML, typically in under five minutes for runs of up to about 1,000 papers. Filtering happens server-side in the arXiv API, and the Actor parses the Atom XML response for you, so you never have to write a parser yourself.

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| Academic researchers, ML engineers, data scientists, literature review teams, citation tracking tools, competitive-intelligence analysts, journalists, educators | Literature reviews, citation graphs, trend tracking, paper discovery, LLM training corpora, author profiling, category monitoring |

📋 What the arXiv Scraper does

Three filtering workflows in a single run:

  • 🔍 Keyword search. Queries across title, abstract, and other metadata fields using arXiv query syntax (the body text of the papers is not searched).
  • 👤 Author search. Pull every paper by a given author using the au: prefix.
  • 📂 Category filter. Restrict by arXiv subject category (e.g., cs.LG, math.PR, physics.optics).

Each record includes the arxivId, title, author list, full abstract, primary category and all secondary categories, DOI, journal reference, comment field, publication and update timestamps, plus direct links to the abstract page and the PDF.

💡 Why it matters: arXiv is the default publication channel for machine learning, theoretical physics, and mathematics. Tracking new papers manually is slow, and the official API returns Atom XML that most teams do not want to parse. This Actor returns a flat JSON dataset ready for downstream ingestion.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| maxItems | integer | 10 | Papers to return. Free plan caps at 10, paid plan at 1,000,000. |
| searchQuery | string | "all:transformer" | arXiv query syntax. Examples: all:transformer, ti:attention abs:machine, au:hinton, cat:cs.LG. |
| sortBy | string | "relevance" | One of relevance, lastUpdatedDate, submittedDate. |
| sortOrder | string | "descending" | descending (newest first) or ascending. |

Example: 50 most-recent transformer papers in cs.LG.

{
  "maxItems": 50,
  "searchQuery": "cat:cs.LG AND all:transformer",
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}

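Example: the 200 most recent papers by authors matching "hinton" (for example, Geoffrey Hinton). The au: prefix matches on the name, so other authors called Hinton are included too.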

{
  "maxItems": 200,
  "searchQuery": "au:hinton",
  "sortBy": "submittedDate",
  "sortOrder": "descending"
}
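
The input options above can also be assembled and checked in code before a run is started. This is a minimal sketch, not part of the Actor: `build_input` is a hypothetical helper, and the allowed values are simply the ones listed in the input table.

```python
import json

# Allowed values copied from the input table above.
SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
SORT_ORDER = {"ascending", "descending"}


def build_input(search_query, max_items=10, sort_by="relevance", sort_order="descending"):
    """Return an Actor input dict, raising ValueError on values the table does not list.

    Hypothetical helper: it only catches typos locally before a run is started.
    """
    if not search_query.strip():
        raise ValueError("searchQuery must not be empty")
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")
    if sort_by not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if sort_order not in SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(SORT_ORDER)}")
    return {
        "maxItems": max_items,
        "searchQuery": search_query,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }


if __name__ == "__main__":
    # Same input as the first example: 50 most-recent transformer papers in cs.LG.
    run_input = build_input("cat:cs.LG AND all:transformer", 50, "submittedDate")
    print(json.dumps(run_input, indent=2))
```

The printed JSON can be pasted into the Actor's input form or passed to an API client.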

⚠️ Good to Know: arXiv enforces a rate limit on its public API. The Actor paces requests to stay within policy, so very large runs (10,000+ papers) naturally take longer. Plan accordingly for literature-review pipelines.


📊 Output

Each paper record contains 14 fields. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🆔 arxivId | string | "1706.03762v7" |
| 📝 title | string | "Attention Is All You Need" |
| 👥 authors | string[] | ["Ashish Vaswani", "Noam Shazeer", "..."] |
| 📄 summary | string | "The dominant sequence transduction..." |
| 📂 primaryCategory | string | "cs.CL" |
| 🏷️ categories | string[] | ["cs.CL", "cs.LG"] |
| 🔗 doi | string or null | "10.48550/arXiv.1706.03762" |
| 📰 journalRef | string or null | "NeurIPS 2017" |
| 💬 comment | string or null | "15 pages, 5 figures" |
| 📅 publishedDate | ISO 8601 | "2017-06-12T17:57:34Z" |
| 🔄 updatedDate | ISO 8601 | "2023-08-02T00:41:18Z" |
| 📎 pdfUrl | string | "https://arxiv.org/pdf/1706.03762" |
| 🌐 abstractUrl | string | "http://arxiv.org/abs/1706.03762v7" |
| 🕒 scrapedAt | ISO 8601 | "2026-04-21T00:00:00.000Z" |
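
Downstream code often wants typed values rather than raw strings. The sketch below shows one way to load a record with the schema above into a dataclass, splitting the version suffix off `arxivId` and parsing the ISO 8601 timestamps. The `Paper` class and `to_paper` function are illustrative names, not something the Actor ships, and the sample record only reuses the example values from the schema.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Matches "1706.03762v7" as well as old-style IDs such as "hep-th/9901001v2".
_ID = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")


@dataclass
class Paper:
    base_id: str             # arxivId without the version suffix
    version: Optional[int]   # None when the ID carries no "vN" suffix
    title: str
    authors: List[str]
    primary_category: str
    published: datetime
    updated: datetime
    doi: Optional[str]


def parse_iso(value):
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def to_paper(record):
    """Convert one dataset record (a dict) into a Paper."""
    match = _ID.match(record["arxivId"])
    version = match.group("version")
    return Paper(
        base_id=match.group("base"),
        version=int(version) if version else None,
        title=record["title"],
        authors=list(record["authors"]),
        primary_category=record["primaryCategory"],
        published=parse_iso(record["publishedDate"]),
        updated=parse_iso(record["updatedDate"]),
        doi=record.get("doi"),  # optional field, may be missing or null
    )


# Sample built from the example column of the schema table.
SAMPLE = {
    "arxivId": "1706.03762v7",
    "title": "Attention Is All You Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer"],
    "primaryCategory": "cs.CL",
    "doi": "10.48550/arXiv.1706.03762",
    "publishedDate": "2017-06-12T17:57:34Z",
    "updatedDate": "2023-08-02T00:41:18Z",
}

if __name__ == "__main__":
    print(to_paper(SAMPLE))
```

Keeping the version separate from the base ID makes it easier to treat `1706.03762v6` and `1706.03762v7` as the same paper when deduplicating.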


✨ Why choose this Actor

  • 📖 2M+ paper archive. Every open-access preprint submitted to arXiv since 1991 is reachable.
  • 🔍 Full query syntax. Keyword, title, abstract, author, and category filters combine in one search string.
  • 📂 All categories. Physics, math, CS, quant-bio, stats, econ, q-fin, EESS, and all sub-categories.
  • ⚡ Fast. 10 papers in under 10 seconds, 1,000 papers in about 5 minutes with built-in rate pacing.
  • 🌐 Trusted open-science source. Cited daily by academic, industry, and government research teams.
  • 🔁 Always fresh. Every run hits the live arXiv API, so new submissions show up as soon as they are indexed.
  • 🚫 No authentication. Public API. No login, API key, or institutional access required.

📊 Preprint discovery is the top of the funnel for modern research workflows. Literature reviews, citation networks, and training datasets all start here.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ arXiv Scraper (this Actor) | $5 free credit, then pay-per-use | 2M+ preprints | Live per run | keyword, author, category, date | ⚡ 2 min |
| Manual arXiv website browsing | Free | Full | Manual | Limited web UI | 🐢 Slow |
| Official arXiv API (Atom XML) | Free | Full | Live | Full syntax | 🛠️ Parser required |
| Commercial academic APIs | $$$ / seat | Varies | Varies | Rich | ⏳ Hours |

Pick this Actor when you want the full archive, flat JSON output, and zero XML parsing.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the arXiv Scraper page on the Apify Store.
  3. 🎯 Set input. Enter an arXiv query (e.g., cat:cs.LG AND all:diffusion), pick a sort order, and set maxItems.
  4. 🚀 Run it. Click Start and let the Actor collect your papers.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

🎓 Academic Research

  • Automated literature reviews by topic or author
  • Citation tracking for grant applications
  • Monitoring a sub-field for new submissions
  • Building reference lists for thesis work

🤖 ML & Data Science

  • Training corpora for scientific LLMs
  • Embedding and retrieval pipelines
  • Benchmark paper discovery
  • State-of-the-art tracking by category

📈 Competitive Intelligence

  • Track research output of specific labs or companies
  • Author profiling for hiring and partnership
  • Technology trend reports and market signals
  • Patent prior-art searches

📰 Science Media & Education

  • Weekly newsletters on hot preprints
  • Course reading lists by category
  • Fact-checking for journalism
  • Open-science dashboards for institutions

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🤝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

🔌 Automating arXiv Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Daily or weekly category sweeps keep downstream databases fresh automatically.
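
As a rough illustration of the Python route, the sketch below starts a run with the `apify-client` package, waits for it to finish, and reads the resulting dataset. It makes a few assumptions: `ACTOR_ID` is a placeholder to be replaced with the ID shown on this Actor's Store page, the `APIFY_TOKEN` environment variable name is just a convention chosen here, and the code is written against the dict-returning `call()` of `apify-client` 1.x.

```python
import os

# Placeholder: copy the real Actor ID from the Actor's page on the Apify Store.
ACTOR_ID = "<username>/<actor-name>"


def fetch_papers(run_input, token, actor_id=ACTOR_ID):
    """Start a run, wait for it to finish, and return the dataset items as a list."""
    # Imported lazily so this module loads even where the package is absent.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    token = os.environ.get("APIFY_TOKEN")
    if token is None:
        print("Set APIFY_TOKEN to run this example.")
    else:
        papers = fetch_papers(
            {
                "searchQuery": "cat:cs.LG AND all:diffusion",
                "maxItems": 10,
                "sortBy": "submittedDate",
                "sortOrder": "descending",
            },
            token,
        )
        for paper in papers:
            print(paper["arxivId"], paper["title"])
```

The same function can be called from a cron job or a pipeline step; on the Apify platform itself, Schedules remove the need for an external trigger.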


❓ Frequently Asked Questions

🧩 How does it work?

Enter an arXiv query string in the input form, click Start, and the Actor calls the public arXiv API, parses the Atom XML response, and emits one structured record per paper. No browser automation, no captchas.
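
To make the parsing step concrete, here is a sketch of how one Atom entry can be turned into a flat record with Python's standard library. It is an illustration under stated assumptions, not the Actor's source code: the sample feed is hand-written and trimmed, and the output keys simply mirror the schema described on this page.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written, trimmed sample in the shape of an arXiv API response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <updated>2023-08-02T00:41:18Z</updated>
    <published>2017-06-12T17:57:34Z</published>
    <title>Attention Is All You Need</title>
    <summary>The dominant sequence transduction...</summary>
    <author><name>Ashish Vaswani</name></author>
    <author><name>Noam Shazeer</name></author>
    <arxiv:comment>15 pages, 5 figures</arxiv:comment>
    <link href="http://arxiv.org/abs/1706.03762v7" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/1706.03762v7" rel="related" type="application/pdf"/>
    <arxiv:primary_category term="cs.CL"/>
    <category term="cs.CL"/>
    <category term="cs.LG"/>
  </entry>
</feed>"""


def text(node, path):
    """Return the stripped text of the first match, or None when it is absent."""
    found = node.find(path, NS)
    return found.text.strip() if found is not None and found.text else None


def parse_feed(xml_text):
    """Turn an Atom feed string into a list of flat dicts, one per entry."""
    records = []
    for entry in ET.fromstring(xml_text).findall("atom:entry", NS):
        abstract_url = text(entry, "atom:id")
        pdf = entry.find("atom:link[@title='pdf']", NS)
        records.append({
            "arxivId": abstract_url.rsplit("/abs/", 1)[-1],
            # Titles can wrap across lines in the feed; collapse the whitespace.
            "title": " ".join(text(entry, "atom:title").split()),
            "authors": [text(a, "atom:name") for a in entry.findall("atom:author", NS)],
            "summary": text(entry, "atom:summary"),
            "primaryCategory": entry.find("arxiv:primary_category", NS).get("term"),
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "doi": text(entry, "arxiv:doi"),
            "journalRef": text(entry, "arxiv:journal_ref"),
            "comment": text(entry, "arxiv:comment"),
            "publishedDate": text(entry, "atom:published"),
            "updatedDate": text(entry, "atom:updated"),
            "pdfUrl": pdf.get("href") if pdf is not None else None,
            "abstractUrl": abstract_url,
        })
    return records


if __name__ == "__main__":
    for record in parse_feed(SAMPLE_FEED):
        print(record["arxivId"], record["title"], record["authors"])
```

Optional elements such as `arxiv:doi` are simply absent from an entry when the author did not supply them, which is why they come back as `None` here.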

📏 How accurate is the data?

The dataset comes directly from the arXiv API, which is the canonical source for preprint metadata. Fields like arxivId, title, authors, and categories are always present. Optional fields such as doi, journalRef, and comment are populated only when the author provided them on submission.

🔁 How often is the dataset refreshed?

arXiv publishes new submissions every weekday. Every run of this Actor hits the live API, so your dataset reflects the current state of the archive at run time.

📂 Which categories are supported?

All of them. Physics (astro-ph, cond-mat, gr-qc, hep-*, math-ph, nucl-*, physics, quant-ph), mathematics (math.*), computer science (cs.*), quantitative biology (q-bio.*), statistics (stat.*), electrical engineering (eess.*), economics (econ.*), and quantitative finance (q-fin.*).

🔍 What query syntax can I use?

arXiv supports prefix queries: ti: (title), abs: (abstract), au: (author), cat: (category), all: (any field), plus boolean operators AND, OR, and ANDNOT. See the arXiv API user manual for full details.
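
Longer queries are easier to get right when they are composed from parts. The helpers below are a small sketch with hypothetical names (`field`, `all_of`, `any_of`, `api_url`): they build a query string from the prefixes and operators listed above, and show the equivalent request URL for the public arXiv API. Whether the Actor forwards `searchQuery` to the API unchanged is an assumption.

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"


def field(prefix, value):
    """Build one prefixed term; multi-word values are quoted as a phrase."""
    return f'{prefix}:"{value}"' if " " in value else f"{prefix}:{value}"


def all_of(*terms):
    """Join terms with AND."""
    return " AND ".join(terms)


def any_of(*terms):
    """Join terms with OR, parenthesised so the group survives an outer AND."""
    return "(" + " OR ".join(terms) + ")"


def api_url(search_query, max_results=10, sort_by="submittedDate", sort_order="descending"):
    """Return the arXiv API URL for a query; urlencode handles the escaping."""
    params = {
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    query = all_of(
        any_of(field("cat", "cs.LG"), field("cat", "stat.ML")),
        field("ti", "diffusion model"),
    )
    print(query)           # value for the searchQuery input
    print(api_url(query))  # equivalent raw API request
```

The composed string goes into the `searchQuery` input as-is; the URL is only useful for checking a query directly against the API.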

⏰ Can I schedule regular runs?

Yes. Use Apify Schedules to run this Actor on any cron interval (hourly, daily, weekly) and keep a downstream database or newsletter in sync.
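
When scheduled runs feed a database, consecutive runs will return many of the same papers. One possible way to keep the downstream table in sync is to key rows on the arXiv ID without its version suffix, as in this SQLite sketch. The table layout and the `sync` function are illustrative assumptions, not something the Actor provides.

```python
import sqlite3


def base_id(arxiv_id):
    """Drop a trailing version suffix: "1706.03762v7" becomes "1706.03762"."""
    head, sep, tail = arxiv_id.rpartition("v")
    return head if sep and tail.isdigit() else arxiv_id


def sync(conn, records):
    """Upsert records into the papers table and return the IDs seen for the first time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "id TEXT PRIMARY KEY, title TEXT, updated TEXT)"
    )
    new_ids = []
    for rec in records:
        key = base_id(rec["arxivId"])
        seen = conn.execute("SELECT 1 FROM papers WHERE id = ?", (key,)).fetchone()
        if seen is None:
            new_ids.append(key)
        # Replace the row so revised papers keep their latest title and timestamp.
        conn.execute(
            "INSERT OR REPLACE INTO papers (id, title, updated) VALUES (?, ?, ?)",
            (key, rec["title"], rec["updatedDate"]),
        )
    conn.commit()
    return new_ids


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run = [{
        "arxivId": "1706.03762v7",
        "title": "Attention Is All You Need",
        "updatedDate": "2023-08-02T00:41:18Z",
    }]
    print(sync(conn, run))  # first run: the paper is new
    print(sync(conn, run))  # second run: nothing new
```

The list returned by `sync` is a natural input for a newsletter or alert, since it contains only papers that were not in the table before the run.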

💼 Can I use this data commercially?

arXiv metadata (titles, abstracts, authors, categories) is published openly and is generally reusable, including for research and indexing. Full PDF text is under per-paper licenses, so for redistribution of full-text PDFs, and bulk commercial redistribution in particular, review the arXiv terms of use and each paper's specific license first.

💳 Do I need a paid Apify plan to use this Actor?

No. The free Apify plan is enough for testing and small runs (10 papers per run). A paid plan lifts the limit and gives you access to scheduling, higher concurrency, and larger datasets.

🔁 What happens if a run fails or gets interrupted?

Apify automatically retries transient errors, and the Actor has its own exponential backoff for rate-limit responses. If a run still fails, you can inspect the log in the Runs tab and re-run. Partial datasets are preserved.
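
For readers unfamiliar with the pattern, exponential backoff means waiting roughly twice as long after each consecutive rate-limit response. The sketch below is a generic illustration with made-up defaults; the Actor's actual delays and attempt counts are not documented here, and `RateLimited` and `with_backoff` are hypothetical names.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller's fetch function when the server signals a rate limit."""


def with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch(), retrying on RateLimited with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimited()
        return "ok"

    # A no-op sleep keeps the demo instant.
    print(with_backoff(flaky, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the function easy to test, since a test can record the requested delays instead of actually waiting.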

📥 Does it download the PDF files?

This Actor returns metadata and a direct pdfUrl link for each paper. PDF downloads are intentionally out of scope so that runs stay fast and you stay within arXiv's usage policy. Download PDFs on your side only for the subset you need.
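
If you do fetch PDFs yourself, a short script over the exported records is usually enough. The following sketch selects a subset by `arxivId` and downloads the files one at a time. The function names, the output folder, and the three-second pause are assumptions made for illustration, so check arXiv's current guidance on automated access before running anything at scale.

```python
import time
import urllib.request
from pathlib import Path


def pdf_targets(records, wanted_ids, out_dir="pdfs"):
    """Return (pdfUrl, local path) pairs for the records whose arxivId is wanted."""
    wanted = set(wanted_ids)
    targets = []
    for rec in records:
        if rec["arxivId"] in wanted and rec.get("pdfUrl"):
            # Old-style IDs such as "hep-th/9901001v2" contain a slash.
            name = rec["arxivId"].replace("/", "_") + ".pdf"
            targets.append((rec["pdfUrl"], Path(out_dir) / name))
    return targets


def download(targets, delay_seconds=3.0):
    """Fetch each PDF sequentially, pausing between requests."""
    for url, path in targets:
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(path))
        time.sleep(delay_seconds)  # assumed pause, tune to arXiv's guidance


if __name__ == "__main__":
    records = [
        {"arxivId": "1706.03762v7", "pdfUrl": "https://arxiv.org/pdf/1706.03762"},
    ]
    # Only prints the plan; call download(...) to actually fetch the files.
    for url, path in pdf_targets(records, ["1706.03762v7"]):
        print(url, "->", path)
```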

🆘 What if I need help?

Our support team is here to help. Contact us through the Apify platform or use the Tally form linked below.


🔌 Integrate with any app

arXiv Scraper connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Get run notifications in your channels
  • Airbyte - Pipe paper data into your warehouse
  • GitHub - Trigger runs from commits and releases
  • Google Drive - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh preprint data into your product backend, or alert your team in Slack when a new paper matches your watchlist.


💡 Pro Tip: browse the complete ParseForge collection for more research and reference-data scrapers.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by arXiv, Cornell University, or the Simons Foundation. All trademarks mentioned are the property of their respective owners. Only publicly available open-access preprint metadata is collected.