arXiv Scraper
Pricing
Pay per event
Comprehensive arXiv scraper for extracting scholarly article data across physics, math, CS, biology, finance, statistics, engineering, and economics. Automates access to arXiv’s large preprint archive, providing structured metadata for researchers, academics, and data scientists.
Developer

ParseForge
📄 arXiv Scraper
🚀 Extract comprehensive scholarly article data from arXiv.org - the world's largest open-access repository of preprints. Perfect for researchers, academics, data scientists, and institutions who need automated access to cutting-edge research papers across physics, mathematics, computer science, and more.
The arXiv Scraper collects detailed information from arXiv.org, a free distribution service and open-access archive for nearly 2.4 million scholarly articles. Whether you're building a research database, tracking the latest developments in your field, or analyzing publication trends, this tool delivers complete paper metadata with just a few clicks.
Target Audience: Researchers, academics, data scientists, research institutions, graduate students, librarians, and knowledge management professionals
Primary Use Cases: Academic research, literature reviews, research database building, trend analysis, citation tracking, knowledge discovery
What Does arXiv Scraper Do?
This tool collects comprehensive scholarly article data from arXiv.org, supporting multiple search methods and delivering detailed information about research papers across all scientific disciplines. It delivers:
- Complete Paper Information: Title, authors, abstract, submission dates, and arXiv ID
- Full Metadata: Categories, subject classifications, comments, journal references, DOI
- PDF Access: Direct links to download papers in PDF format
- Research Context: License information, related papers, and citation data
- Date Tracking: Submission dates and last updated dates for version tracking
- Subject Classification: Full category tags and subject classifications for filtering
- And much more
Business Value: Build comprehensive research databases, track the latest developments in your field, automate literature reviews, and discover cutting-edge research without manual browsing and data entry.
How to use the arXiv Scraper - Full Demo
Watch this demo to see how easy it is to get started!
[Demo video coming soon]
Input
To start arXiv web scraping, simply fill in the input form. You can scrape arXiv based on:
- Search Query - Enter any search term (e.g., "machine learning", "quantum computing", "neural networks"). This searches across titles, abstracts, and authors.
- Search For - Select the archive or category to search within (All, Physics, Mathematics, Computer Science, etc.)
- Sort For - Choose how to sort results: Announcement date, Submission date, or Relevance
- Subcategory - Select the field to search within (All Fields, Title, Author, Abstract, etc.)
- Show Abstracts - Toggle to show abstracts in search results
- Sort By - Choose how to sort results: Relevance, Submitted Date, or Last Updated Date
- Max Items - Set the maximum number of papers to collect (optional). Free users must specify this and are limited to 50 items. Paid users can leave this empty for unlimited collection.
- Start URL - Alternatively, you can paste a direct arXiv search URL. This is useful if you've already created a search on the website and want to use that exact URL.
Pro Tip: 💡 You can either use the search query and filters, OR paste a start URL. If you use a start URL, the other filters won't apply.
Here's an example of a filled-out input, written in JSON:
{"maxItems": 10, "searchQuery": "machine learning", "showAbstracts": false, "searchFor": "all", "sortFor": "", "subcategory": "all", "sortBy": "relevance"}
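The rules described above (either a search query with filters or a start URL, and a 50-item cap for free users) can be sketched as a small helper that builds the input object. This is only an illustration: the key `startUrl` and the `free_user` flag are assumptions, since the example JSON does not show how a start URL is passed.

```python
from typing import Any, Optional

FREE_TIER_MAX_ITEMS = 50  # limit for free users, as stated in the Max Items description


def build_input(
    search_query: Optional[str] = None,
    start_url: Optional[str] = None,
    max_items: Optional[int] = None,
    free_user: bool = True,
) -> dict[str, Any]:
    """Build an input object from either a start URL or a search query."""
    if not search_query and not start_url:
        raise ValueError("Provide a search query or a start URL")
    if free_user and max_items is None:
        raise ValueError("Free users must set max_items")
    if max_items is not None and max_items < 1:
        raise ValueError("max_items must be at least 1")
    if free_user and max_items is not None and max_items > FREE_TIER_MAX_ITEMS:
        raise ValueError(f"Free users are limited to {FREE_TIER_MAX_ITEMS} items")

    if start_url:
        # "startUrl" is an assumed key name. The other filters are left out
        # because they do not apply when a start URL is used.
        run_input: dict[str, Any] = {"startUrl": start_url}
    else:
        # Same keys and default values as the JSON example.
        run_input = {
            "searchQuery": search_query,
            "showAbstracts": False,
            "searchFor": "all",
            "sortFor": "",
            "subcategory": "all",
            "sortBy": "relevance",
        }
    if max_items is not None:
        run_input["maxItems"] = max_items
    return run_input
```

Calling `build_input(search_query="machine learning", max_items=10)` yields the same keys and values as the JSON example; check the Actor's input form for the real name of the start URL field before relying on it.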
Output
After the Actor finishes its run, you'll get a dataset with the output. The length of the dataset depends on the number of results you've requested. You can download the results as an Excel, HTML, XML, JSON, or CSV document.
Here's an example of the scraped arXiv data you'll get:

[{"arxivId": "2511.20643","title": "Concept-Aware Batch Sampling Improves Language-Image Pretraining","authors": ["Adhiraj Ghosh","Vishaal Udandarao","Thao Nguyen","Matteo Farina","Mehdi Cherti","Jenia Jitsev","Sewoong Oh","Elisa Ricci","Ludwig Schmidt","Matthias Bethge"],"abstract": "What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. 
Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.CV","cs.LG"],"detailUrl": "https://arxiv.org/abs/2511.20643","pdfUrl": "https://arxiv.org/pdf/2511.20643","comments": "Tech Report","doi": "https://doi.org/10.48550/arXiv.2511.20643","license": "http://creativecommons.org/licenses/by-nc-sa/4.0/","subjectClassifications": ["cs.CV","cs.LG"],"scrapedTimestamp": "2025-11-26T19:50:37.567Z"},{"arxivId": "2511.20641","title": "Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition","authors": ["Wei Tang","Zuo-Zheng Wang","Kun Zhang","Tong Wei","Min-Ling Zhang"],"abstract": "Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. 
To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP's textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.CV","cs.LG"],"detailUrl": "https://arxiv.org/abs/2511.20641","pdfUrl": "https://arxiv.org/pdf/2511.20641","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20641","license": "http://creativecommons.org/licenses/by/4.0/","subjectClassifications": ["cs.CV","cs.LG"],"scrapedTimestamp": "2025-11-26T19:50:37.570Z"},{"arxivId": "2511.20640","title": "MotionV2V: Editing Motion in a Video","authors": ["Ryan Burgert","Charles Herrmann","Forrester Cole","Michael S Ryoo","Neal Wadhwa","Andrey Voynov","Nataniel Ruiz"],"abstract": "While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. 
In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a \"motion edit\" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating \"motion counterfactuals\", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: https://ryanndagreat.github.io/MotionV2V","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.CV","cs.AI","cs.GR","cs.LG"],"detailUrl": "https://arxiv.org/abs/2511.20640","pdfUrl": "https://arxiv.org/pdf/2511.20640","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20640","license": "http://creativecommons.org/licenses/by/4.0/","subjectClassifications": ["cs.CV","cs.AI","cs.GR","cs.LG"],"scrapedTimestamp": "2025-11-26T19:50:37.572Z"},{"arxivId": "2511.20639","title": "Latent Collaboration in Multi-Agent Systems","authors": ["Jiaru Zou","Xiyuan Yang","Ruizhong Qiu","Gaotang Li","Katherine Tieu","Pan Lu","Ke Shen","Hanghang Tong","Yejin Choi","Jingrui He","James Zou","Mengdi Wang","Ling Yang"],"abstract": "Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. 
In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.CL","cs.AI","cs.LG"],"detailUrl": "https://arxiv.org/abs/2511.20639","pdfUrl": "https://arxiv.org/pdf/2511.20639","comments": "Project: this https URL","doi": "https://doi.org/10.48550/arXiv.2511.20639","license": "http://creativecommons.org/licenses/by/4.0/","subjectClassifications": ["cs.CL","cs.AI","cs.LG"],"scrapedTimestamp": "2025-11-26T19:50:37.573Z"},{"arxivId": "2511.20636","title": "Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model","authors": ["Ziyue Wang","Yayati Jadhav","Peter Pak","Amir Barati Farimani"],"abstract": "Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. 
The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.LG"],"detailUrl": "https://arxiv.org/abs/2511.20636","pdfUrl": "https://arxiv.org/pdf/2511.20636","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20636","license": "http://creativecommons.org/licenses/by/4.0/","subjectClassifications": ["cs.LG"],"scrapedTimestamp": "2025-11-26T19:50:37.575Z"},{"arxivId": "2511.20629","title": "MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models","authors": ["Chieh-Yun Chen","Zhonghao Wang","Qi Chen","Zhifan Ye","Min Shi","Yue Zhao","Yinan Zhao","Hui Qu","Wei-An Lin","Yiru Shen","Ajinkya Kale","Irfan Essa","Humphrey Shi"],"abstract": "Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. 
Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.CV","cs.AI","cs.LG"],"detailUrl": "https://arxiv.org/abs/2511.20629","pdfUrl": "https://arxiv.org/pdf/2511.20629","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20629","license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/","subjectClassifications": ["cs.CV","cs.AI","cs.LG"],"scrapedTimestamp": "2025-11-26T19:50:37.575Z"},{"arxivId": "2511.20626","title": "ROOT: Robust Orthogonalized Optimizer for Neural Network Training","authors": ["Wei He","Kai Han","Hang Zhou","Hanting Chen","Zhicheng Liu","Xinghao Chen","Yunhe Wang"],"abstract": "The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. 
Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/ROOT.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.LG","cs.AI"],"detailUrl": "https://arxiv.org/abs/2511.20626","pdfUrl": "https://arxiv.org/pdf/2511.20626","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20626","license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/","subjectClassifications": ["cs.LG","cs.AI"],"scrapedTimestamp": "2025-11-26T19:50:37.576Z"},{"arxivId": "2511.20621","title": "DiFR: Inference Verification Despite Nondeterminism","authors": ["Adam Karvonen","Daniel Reuter","Roy Rinberg","Luke Marks","Adrià Garriga-Alonso","Keri Warr"],"abstract": "As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC $>$ 0.999 within 300 output tokens. 
For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC $>$ 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.LG","cs.AI"],"detailUrl": "https://arxiv.org/abs/2511.20621","pdfUrl": "https://arxiv.org/pdf/2511.20621","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20621","license": "http://creativecommons.org/licenses/by/4.0/","subjectClassifications": ["cs.LG","cs.AI"],"scrapedTimestamp": "2025-11-26T19:50:37.577Z"},{"arxivId": "2511.20613","title": "Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning","authors": ["Panayiotis Danassis","Naman Goel"],"abstract": "The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. 
We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\\sim 40$k matches demonstrate (i) a clear superiority of human(graduate students)-coded agents: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real-world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.LG","cs.AI","cs.MA"],"detailUrl": "https://arxiv.org/abs/2511.20613","pdfUrl": "https://arxiv.org/pdf/2511.20613","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20613","license": "http://creativecommons.org/licenses/by/4.0/","subjectClassifications": ["cs.LG","cs.AI","cs.MA"],"scrapedTimestamp": "2025-11-26T19:50:37.577Z"},{"arxivId": "2511.20612","title": "Sparse-to-Field Reconstruction via Stochastic Neural Dynamic Mode Decomposition","authors": ["Yujin Kim","Sarah Dean"],"abstract": "Many consequential real-world systems, like wind fields and ocean currents, are dynamic and hard to model. Learning their governing dynamics remains a central challenge in scientific machine learning. Dynamic Mode Decomposition (DMD) provides a simple, data-driven approximation, but practical use is limited by sparse/noisy observations from continuous fields, reliance on linear approximations, and the lack of principled uncertainty quantification. 
To address these issues, we introduce Stochastic NODE-DMD, a probabilistic extension of DMD that models continuous-time, nonlinear dynamics while remaining interpretable. Our approach enables continuous spatiotemporal reconstruction at arbitrary coordinates and quantifies predictive uncertainty. Across four benchmarks, a synthetic setting and three physics-based flows, it surpasses a baseline in reconstruction accuracy when trained from only 10% observation density. It further recovers the dynamical structure by aligning learned modes and continuous-time eigenvalues with ground truth. Finally, on datasets with multiple realizations, our method learns a calibrated distribution over latent dynamics that preserves ensemble variability rather than averaging across regimes. Our code is available at: https://github.com/sedan-group/Stochastic-NODE-DMD","submissionDate": "25 Nov 2025","lastUpdatedDate": "25 Nov 2025","categories": ["cs.LG","eess.SY","cs.SY"],"detailUrl": "https://arxiv.org/abs/2511.20612","pdfUrl": "https://arxiv.org/pdf/2511.20612","comments": null,"doi": "https://doi.org/10.48550/arXiv.2511.20612","license": "http://creativecommons.org/licenses/by/4.0/","subjectClassifications": ["cs.LG","eess.SY","cs.SY"],"scrapedTimestamp": "2025-11-26T19:50:37.578Z"}]
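To work with a JSON export like the one above in a spreadsheet, the list-valued fields (`authors`, `categories`, `subjectClassifications`) need to be flattened first. The following is a minimal sketch of one way to do that, not the Actor's own export logic; the `"; "` separator is an arbitrary choice, and the sample record reuses values from the example output with the abstract and a few fields left out for brevity.

```python
import csv
import io
import json

# One record from the example output, with the abstract and some fields omitted.
SAMPLE = json.loads("""
[{"arxivId": "2511.20640",
  "title": "MotionV2V: Editing Motion in a Video",
  "authors": ["Ryan Burgert", "Charles Herrmann", "Forrester Cole",
              "Michael S Ryoo", "Neal Wadhwa", "Andrey Voynov", "Nataniel Ruiz"],
  "submissionDate": "25 Nov 2025",
  "categories": ["cs.CV", "cs.AI", "cs.GR", "cs.LG"],
  "pdfUrl": "https://arxiv.org/pdf/2511.20640",
  "comments": null}]
""")


def flatten(record: dict) -> dict:
    """Join list values with '; ' and replace None with '' so the record fits one row."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            flat[key] = "; ".join(str(v) for v in value)
        elif value is None:
            flat[key] = ""
        else:
            flat[key] = value
    return flat


def to_csv(records: list) -> str:
    """Return the records as CSV text, with columns in first-seen key order."""
    rows = [flatten(r) for r in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(SAMPLE)
```

`csv_text` can then be written to a file or loaded into any tool that reads CSV. Fields such as `comments` are `null` for many papers in the example, so downstream code should treat them as optional.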
What You Get:
- Complete Research Data: Every field needed for comprehensive paper records
- Full Text Access: Direct PDF links for immediate paper access
- Rich Metadata: Authors, abstracts, dates, categories, and classifications
- Citation Information: DOI, journal references, and related papers when available
- Version Tracking: Submission and update dates to track paper revisions
Download Options: CSV, Excel, or JSON formats for easy analysis in spreadsheet software, reference managers, or database systems
Why Choose the arXiv Scraper?
- ⚡ Comprehensive Data Collection: Get complete research paper information in one automated process, saving hours of manual browsing
- 🎯 Advanced Filtering: Search by query, archive category, subcategory, and sorting options - find exactly what you need
- 📚 Research-Grade Data: Perfect for researchers, academics, and institutions building research databases
- 🔄 Automated Workflows: Schedule regular runs to track the latest papers in your field automatically
- 💾 Export Flexibility: Download data in multiple formats (CSV, Excel, JSON) for use in any analysis tool or reference manager
- 📄 PDF Access: Direct links to download papers in PDF format for immediate access
Time Savings: What would take days of manual browsing and data entry can be completed in minutes with automated collection
Efficiency: Collect hundreds of research papers automatically while you focus on reading and analysis
How to Use
- Sign Up: Create a free account with $5 credit (takes 2 minutes)
- Find the Scraper: Visit the arXiv Scraper page on Apify
- Set Input: Add your search query or paste a start URL (we'll show you exactly what to enter)
- Run It: Click "Start" and let it collect your research data
- Download Data: Get your results in the "Dataset" tab as CSV, Excel, or JSON
Total Time: Less than 5 minutes from sign-up to downloaded data
No Technical Skills Required: Everything is point-and-click - just enter your search terms and go
Business Use Cases
Academic Researchers:
- Build comprehensive literature databases for research projects
- Track the latest developments in your field automatically
- Collect data for systematic reviews and meta-analyses
- Monitor specific research topics or authors over time
Research Institutions & Libraries:
- Automate collection of new papers in specific research areas
- Build institutional research databases with complete metadata
- Create subject-specific paper collections and repositories
- Maintain up-to-date research collections
Data Scientists & Analysts:
- Analyze publication trends and patterns across research fields
- Track research activity in emerging technologies
- Build datasets for research recommendation systems
- Conduct bibliometric analysis and citation studies
Graduate Students & PhD Candidates:
- Collect papers for literature reviews efficiently
- Track papers in your research area automatically
- Build personal research databases with complete metadata
- Discover related papers and research connections
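As an illustration of the trend-analysis use case above, here is a small sketch that counts papers per category and per submission month. It assumes records shaped like the example output (a `categories` list and a `submissionDate` such as "25 Nov 2025"); the three records below reuse values from that example with the other fields left out.

```python
from collections import Counter
from datetime import date, datetime

# Values taken from the example output; other fields omitted.
RECORDS = [
    {"arxivId": "2511.20643", "categories": ["cs.CV", "cs.LG"],
     "submissionDate": "25 Nov 2025"},
    {"arxivId": "2511.20640", "categories": ["cs.CV", "cs.AI", "cs.GR", "cs.LG"],
     "submissionDate": "25 Nov 2025"},
    {"arxivId": "2511.20636", "categories": ["cs.LG"],
     "submissionDate": "25 Nov 2025"},
]


def parse_date(text: str) -> date:
    """Parse dates like '25 Nov 2025'. %b expects English month names (C/English locale)."""
    return datetime.strptime(text, "%d %b %Y").date()


def category_counts(records: list) -> Counter:
    """Count how many papers carry each category tag."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.get("categories") or [])
    return counts


def papers_per_month(records: list) -> Counter:
    """Count papers per (year, month) of submission."""
    counts: Counter = Counter()
    for record in records:
        submitted = parse_date(record["submissionDate"])
        counts[(submitted.year, submitted.month)] += 1
    return counts


print(category_counts(RECORDS).most_common(2))  # [('cs.LG', 3), ('cs.CV', 2)]
print(papers_per_month(RECORDS))                # Counter({(2025, 11): 3})
```

A paper with several category tags is counted once per tag, so the category totals can exceed the number of papers.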
Using arXiv Scraper with the Apify API
For advanced users who want to automate this process, you can control the scraper programmatically with the Apify API. This allows you to schedule regular data collection and integrate with your existing research tools.
- Node.js: Install the apify-client NPM package
- Python: Use the apify-client PyPI package
- See the Apify API reference for full details
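A run started from Python might look roughly like the sketch below, which uses the `apify-client` package mentioned above. The Actor ID `parseforge/arxiv-scraper` is an assumption based on the developer name, so copy the actual ID from the Actor page; the input keys follow the JSON example in the Input section.

```python
from typing import Any, Iterator

ACTOR_ID = "parseforge/arxiv-scraper"  # assumed ID; confirm it on the Actor page


def default_run_input(query: str, max_items: int = 10) -> dict[str, Any]:
    """Input object with the same keys as the JSON example in the Input section."""
    return {
        "maxItems": max_items,
        "searchQuery": query,
        "showAbstracts": False,
        "searchFor": "all",
        "sortFor": "",
        "subcategory": "all",
        "sortBy": "relevance",
    }


def run_scraper(token: str, run_input: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Start the Actor, wait for it to finish, and yield the dataset items."""
    # Imported here so the helper above works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("The Actor run did not return any run details")
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()


# Example usage (needs a valid API token and network access):
#
#   import os
#   for paper in run_scraper(os.environ["APIFY_TOKEN"], default_run_input("machine learning")):
#       print(paper["arxivId"], paper["title"])
```

The token is read from an environment variable in the usage comment so that it does not end up in source control; `APIFY_TOKEN` is just a conventional name chosen here.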
Frequently Asked Questions
Q: How does it work?
A: arXiv Scraper is easy to use and requires no technical knowledge. Simply configure your search parameters (query, archive, search field, sort order, etc.) and let the tool collect the data automatically from arXiv.org.
Q: How accurate is the data?
A: The data is extracted directly from arXiv.org, so each field (titles, authors, abstracts, and other metadata) matches what you would see on the website at the time of the run.
Q: Can I filter by research field or category?
A: Absolutely! Use the "Search For" field to filter by archive or category (All, Physics, Mathematics, Computer Science, etc.), and the "Subcategory" field to search within specific fields like Title, Author, Abstract, etc.
Q: Can I schedule regular runs?
A: Yes! You can set up schedules in the Apify Console to automatically collect new papers at regular intervals, keeping your research database up-to-date.
Q: What if I need help?
A: Our support team is here to help you get the most out of this tool. Contact us through the Apify platform for assistance.
Q: Is my data secure?
A: Yes, all data processing happens securely on Apify's platform. Your search queries and collected data are private and secure.
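When runs are scheduled, as described in the FAQ above, consecutive runs will often return papers you already have. One way to handle this, sketched below under the assumption that `arxivId` identifies a paper and that a changed `lastUpdatedDate` signals a revision, is to merge each run into a local store keyed by `arxivId`. The function name and the in-memory dictionary are illustrative choices, not part of the Actor.

```python
from typing import Any


def merge_run(
    known: dict[str, dict[str, Any]],
    new_items: list[dict[str, Any]],
) -> tuple[list[str], list[str]]:
    """Merge one run's items into `known` (keyed by arxivId).

    Returns the IDs that were added and the IDs whose lastUpdatedDate changed.
    """
    added: list[str] = []
    updated: list[str] = []
    for item in new_items:
        key = item["arxivId"]
        previous = known.get(key)
        if previous is None:
            added.append(key)
        elif previous.get("lastUpdatedDate") != item.get("lastUpdatedDate"):
            updated.append(key)
        # Always keep the most recently scraped version of the record.
        known[key] = item
    return added, updated
```

In practice `known` would be loaded from and saved to a file or database between runs; the returned lists make it easy to report only what is new since the previous run.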
Integrate arXiv Scraper with any app and automate your workflow
Last but not least, arXiv Scraper can be connected with almost any cloud service or web app thanks to integrations on the Apify platform.
Alternatively, you can use webhooks to carry out an action whenever an event occurs, e.g. get a notification whenever arXiv Scraper successfully finishes a run.
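On the receiving side, a webhook handler mostly needs to check the event type and find the dataset of the finished run. The sketch below assumes Apify's default webhook payload, where `eventType` is `ACTOR.RUN.SUCCEEDED` for a successful run and `resource` holds the run object including `defaultDatasetId`; if you customize the payload template, adjust the keys accordingly. The function name is hypothetical.

```python
import json
from typing import Optional


def dataset_id_from_webhook(payload_json: str) -> Optional[str]:
    """Return the run's default dataset ID for a successful run, otherwise None.

    Assumes the default payload shape: {"eventType": ..., "resource": {...}}.
    """
    payload = json.loads(payload_json)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    resource = payload.get("resource") or {}
    return resource.get("defaultDatasetId")
```

The returned ID can then be used to download the items, for example with the `apify-client` package.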
🔗 Recommended Actors
Looking for more data collection tools? Check out these related actors:
| Actor | Description | Link |
|---|---|---|
| Hugging Face Model Scraper | Extracts machine learning model data from Hugging Face | https://apify.com/parseforge/hugging-face-model-scraper |
| PR Newswire Scraper | Collects press releases and news content from PR Newswire | https://apify.com/parseforge/pr-newswire-scraper |
| GreatSchools Scraper | Extracts school information and ratings from GreatSchools | https://apify.com/parseforge/greatschools-scraper |
| GSA eLibrary Scraper | Collects government publication data from GSA eLibrary | https://apify.com/parseforge/gsa-elibrary-scraper |
| Open Library Scraper | Extracts book and bibliographic data from Open Library | https://apify.com/parseforge/open-library-scraper |
Pro Tip: 💡 Browse our complete collection of data collection actors to find the perfect tool for your business needs.
Need Help? Our support team is here to help you get the most out of this tool.
⚠️ Disclaimer: This Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by arXiv.org or Cornell University. All trademarks mentioned are the property of their respective owners.