ArXiv Paper Scraper

Pricing: from $2.00 / 1,000 results

Search and extract research papers from ArXiv. Get titles, abstracts, authors, categories, and PDF links.

Rating: 0.0 (0)

Developer: Stephan Corbeil (Maintained by Community)

Actor stats

Bookmarked: 0
Total users: 4
Monthly active users: 2
Last modified: 7 hours ago

ArXiv Paper Scraper: Access 2M+ Research Papers for AI, Machine Learning, and Physics Research

The ArXiv Paper Scraper actor by nexgendata lets researchers programmatically search and extract detailed information from ArXiv, a preprint repository containing over 2 million scientific papers in physics, mathematics, computer science, quantitative biology, quantitative finance, and related fields. Academic researchers, machine learning engineers, and companies building research tools need systematic access to research papers, but ArXiv's API returns XML rather than JSON, can be slow for large result sets, and imposes strict rate limits. This actor provides an alternative that returns paper metadata as structured JSON and supports complex search queries across title, abstract, authors, and categories. With it, researchers can build literature review systems, track emerging research trends, and identify papers relevant to their work without manually searching ArXiv's web interface.

What This Actor Does

The ArXiv Paper Scraper actor connects to ArXiv, executes search queries across the repository, and extracts detailed information about papers matching your search criteria. You provide a search query (for instance, "transformer neural networks" or "quantum computing") and optionally filter by category (artificial intelligence, physics, mathematics, etc.) or date range, and the actor returns metadata for every matching paper. For each paper, the actor extracts the title, the list of all authors with institutional affiliations, and the full abstract summarizing the paper's contribution. It also returns the primary and secondary categories under which ArXiv files the paper (so you can filter results by research domain), a direct link to the PDF, the date the paper was submitted to ArXiv, and the date it was published in a peer-reviewed journal (if applicable). Finally, each record includes the number of citations the paper has received (from papers citing it in ArXiv and published literature) and links to related papers that cite or are cited by the paper.

The actor normalizes ArXiv's response format into clean JSON that's immediately useful for research workflows. You don't need to parse XML feeds or clean inconsistent data formatting. The actor handles pagination automatically, so if your search returns 500 matching papers, the actor can fetch all 500 in a single batch operation, returning them in structured form. Researchers can use this structured data to build custom literature review systems, filter papers by citation count or date, identify prolific authors working on specific topics, or analyze trends in research focus over time.
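To make the normalization step concrete, here is a minimal sketch of flattening one Atom entry, of the kind ArXiv's API returns, into a JSON-ready dictionary. The sample XML is trimmed and made up, the output keys mirror the sample output on this page, and the helper name `entry_to_dict` is hypothetical. The actor's actual implementation is not documented here and may differ.

```python
import re
import xml.etree.ElementTree as ET

# XML namespaces used by ArXiv's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Trimmed, made-up entry for illustration; real feeds carry more fields.
SAMPLE_ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/2312.12345v1</id>
  <title>Scaling Transformer
    Neural Networks</title>
  <summary>  We present a study.  </summary>
  <published>2023-12-18T00:00:00Z</published>
  <author><name>Sarah Chen</name></author>
  <author><name>James Robinson</name></author>
  <arxiv:primary_category term="cs.LG"/>
  <category term="cs.LG"/>
  <category term="cs.AI"/>
</entry>
"""


def entry_to_dict(entry: ET.Element) -> dict:
    """Flatten one Atom <entry> into a JSON-ready dictionary (hypothetical helper)."""

    def text(path: str) -> str:
        node = entry.find(path, NS)
        # Collapse the line breaks and padding that Atom text often contains.
        return " ".join(node.text.split()) if node is not None and node.text else ""

    # "http://arxiv.org/abs/2312.12345v1" -> "2312.12345" (version suffix dropped)
    arxiv_id = re.sub(r"v\d+$", "", text("atom:id").split("/abs/", 1)[-1])
    primary_node = entry.find("arxiv:primary_category", NS)
    primary = primary_node.get("term") if primary_node is not None else None
    categories = [c.get("term") for c in entry.findall("atom:category", NS)]
    return {
        "arxivId": arxiv_id,
        "title": text("atom:title"),
        "abstract": text("atom:summary"),
        "authors": [{"name": " ".join((n.text or "").split())}
                    for n in entry.findall("atom:author/atom:name", NS)],
        "primaryCategory": primary,
        "secondaryCategories": [c for c in categories if c != primary],
        "submittedDate": text("atom:published")[:10],  # keep the date part only
    }


paper = entry_to_dict(ET.fromstring(SAMPLE_ENTRY.strip()))
```

The sketch covers only fields that appear in the Atom entry itself. Citation counts and related papers are not part of this sample, so they are left out.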

Behind the scenes, the actor uses the ArXiv API and works around its main drawbacks. ArXiv's native API returns XML, can be slow for large result sets, and is strictly rate limited to avoid overloading ArXiv's servers. The actor hides these limitations by caching results where possible and optimizing queries, so researchers can run comprehensive searches in minutes rather than hours.

Who Uses This Actor

PhD students and academic researchers use the ArXiv Paper Scraper actor to conduct literature reviews and stay current with rapidly evolving research areas. A doctoral candidate working on robotic manipulation might search for papers containing "robotic manipulation" and "deep learning," then extract all results to build a comprehensive bibliography. The actor enables this research in minutes rather than hours of manual searching and downloading papers individually. Machine learning engineers building products that incorporate recent research use the actor to track papers relevant to their domain, identify promising techniques, and understand the state of the art.

Companies building AI and machine learning research tools use the actor as a data source. A startup building an AI paper recommendation system needs to index millions of papers, and the actor provides efficient access to paper metadata for this application. Research-focused companies like OpenAI, Anthropic, and Hugging Face rely on ArXiv papers to train their systems and stay informed about research progress. Venture capital firms investing in AI and deep tech companies use paper data from the actor to understand emerging technologies, track which companies are advancing specific techniques, and identify promising research areas before they become obvious to the market.

Journalists and science writers covering AI and machine learning developments use paper metadata from the actor to research stories, understand which papers are influential (measured by citation counts), and identify authors who are leaders in specific subfields. Academic librarians and research institutions use the actor to build searchable paper databases, helping researchers at their institution discover relevant papers more easily than searching ArXiv directly. Conference organizers use the actor to track papers in their field, identify emerging topics that should be covered in future conferences, and identify prolific authors to invite as speakers.

What You Get Back

When you run the ArXiv Paper Scraper actor with a search query, you receive structured JSON containing detailed information for every matching paper. Each result includes the paper's unique ArXiv identifier (for instance, 2312.12345, which you can use to construct URLs to the paper), the full paper title, a list of all authors with their listed institutional affiliations, the complete abstract describing the paper's research and contributions, the primary research category under which ArXiv files the paper (such as "cs.AI" for artificial intelligence or "cs.LG" for machine learning), any secondary categories the paper spans (many papers bridge multiple research domains), the date the paper was submitted to ArXiv in ISO format, the date the paper was published in a peer-reviewed journal (if applicable), the number of times other papers have cited this paper, a direct URL to download the PDF, and references to related papers that cite or are cited by this paper.

The actor returns results in paginated JSON arrays, enabling you to process hundreds or thousands of papers programmatically. The metadata is rich enough to build meaningful analysis—you can identify influential papers by citation count, find papers published by specific authors, filter by category to focus on particular subfields, or analyze publication trends by examining the dates of papers matching your criteria. The structured format means you can load results directly into databases, spreadsheets, or machine learning systems without needing to parse or clean data.
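As an illustration of loading a result into application code, the sketch below maps one record onto a typed Python object and builds the abstract-page URL from the ArXiv identifier. The field names come from the sample output on this page. The `Paper` class and its attribute names are hypothetical, and the record is a shortened copy of the sample.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Typed view over one result record (hypothetical helper class)."""
    arxiv_id: str
    title: str
    primary_category: str
    submitted: date
    citation_count: int
    author_names: tuple

    @classmethod
    def from_record(cls, record: dict) -> "Paper":
        return cls(
            arxiv_id=record["arxivId"],
            title=record["title"],
            primary_category=record["primaryCategory"],
            # submittedDate is an ISO date string such as "2023-12-18".
            submitted=date.fromisoformat(record["submittedDate"]),
            citation_count=int(record.get("citationCount", 0)),
            author_names=tuple(a["name"] for a in record.get("authors", [])),
        )

    @property
    def abs_url(self) -> str:
        # Abstract page URL constructed from the identifier.
        return f"https://arxiv.org/abs/{self.arxiv_id}"


# Shortened copy of the first record in the sample output.
record = {
    "arxivId": "2312.12345",
    "title": "Scaling Transformer Neural Networks to Trillion Parameters",
    "authors": [{"name": "Sarah Chen"}, {"name": "James Robinson"}],
    "primaryCategory": "cs.LG",
    "submittedDate": "2023-12-18",
    "citationCount": 247,
}
paper = Paper.from_record(record)
```

Parsing the date into a `date` object up front makes later range filters and sorting straightforward.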

Comparison to Alternatives

Semantic Scholar API is a free service providing access to paper metadata from multiple sources including ArXiv, but it has strict rate limits making it impractical for large-scale research. The API returns results more slowly than the ArXiv Paper Scraper actor and doesn't support complex search queries across multiple fields as effectively. Additionally, Semantic Scholar incorporates papers from published venues (IEEE, ACM, etc.) which have different metadata structures and availability limitations. The ArXiv Paper Scraper actor focuses specifically on ArXiv's 2+ million papers, providing deep searching and complete data extraction optimized for this specific source.

ArXiv's native API is free but has several limitations. It returns results as Atom XML, which many researchers find inconvenient compared to JSON. Large result sets must be retrieved through multiple sequential, paginated requests, which is slow. ArXiv's terms of use ask clients to make no more than one request every three seconds, making it difficult to quickly execute large searches. Boolean queries across multiple fields are supported, but only through ArXiv's own prefix syntax (ti:, au:, abs:, cat:) and operators (AND, OR, ANDNOT). The ArXiv Paper Scraper actor hides these limitations, providing faster access with more flexible search capabilities and JSON responses optimized for integration into modern data pipelines and applications.
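For comparison, this is roughly what paging through ArXiv's native API involves. The sketch only builds the request URLs and makes no network calls. The endpoint and the `search_query`, `start`, and `max_results` parameters belong to ArXiv's public API, while the helper name `page_urls` and the page size of 100 are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def page_urls(search_query: str, total: int, page_size: int = 100):
    """Yield one request URL per page needed to cover `total` results."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
        }
        yield f"{ARXIV_API}?{urlencode(params)}"


# 250 results at 100 per page take three sequential requests.
urls = list(page_urls("cat:cs.LG", total=250, page_size=100))
```

A client fetching these URLs would wait between requests to respect ArXiv's rate limits and then parse each XML response itself.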

Google Scholar provides free paper search but offers no structured API access. Researchers must manually search Scholar's web interface and copy results individually—there's no programmatic way to access Google Scholar data at scale. Paid research paper databases like Scopus and Web of Science provide comprehensive coverage but charge institutional subscriptions starting at $10,000+ annually and focus on published papers rather than preprints. The ArXiv Paper Scraper actor provides focused access to preprint research at commodity pricing, ideal for researchers interested specifically in rapid-publication preprints where cutting-edge AI and machine learning research appears first.

Sample JSON Output

{
  "results": [
    {
      "arxivId": "2312.12345",
      "title": "Scaling Transformer Neural Networks to Trillion Parameters: Architecture and Training Insights",
      "authors": [
        {
          "name": "Sarah Chen",
          "affiliation": "DeepMind, London, UK"
        },
        {
          "name": "James Robinson",
          "affiliation": "Stanford University, USA"
        },
        {
          "name": "Akira Tanaka",
          "affiliation": "Tokyo Institute of Technology, Japan"
        }
      ],
      "abstract": "We present a comprehensive study of training transformer neural networks at unprecedented scale, reaching trillion parameter models. We investigate architectural modifications that improve efficiency, propose novel training strategies that reduce convergence time, and analyze how model capacity scales with compute budget. Our results demonstrate that proper scaling laws continue to hold at extreme scales, and we identify key bottlenecks in training infrastructure. We release code and training details to enable reproducibility.",
      "primaryCategory": "cs.LG",
      "secondaryCategories": ["cs.AI", "cs.CL"],
      "submittedDate": "2023-12-18",
      "publishedDate": "2024-01-15",
      "citationCount": 247,
      "pdfUrl": "https://arxiv.org/pdf/2312.12345.pdf",
      "arxivUrl": "https://arxiv.org/abs/2312.12345",
      "relatedPapers": [
        {
          "arxivId": "2308.01234",
          "title": "Training Efficient Transformers",
          "citationType": "cited_by"
        },
        {
          "arxivId": "2310.54321",
          "title": "Scaling Laws in Neural Networks",
          "citationType": "cites"
        }
      ]
    },
    {
      "arxivId": "2311.98765",
      "title": "Quantization-Aware Training for Language Models: Methods and Benchmarks",
      "authors": [
        {
          "name": "Lisa Wang",
          "affiliation": "Meta AI, Menlo Park, USA"
        },
        {
          "name": "Miguel Santos",
          "affiliation": "University of São Paulo, Brazil"
        }
      ],
      "abstract": "Large language models have become increasingly capable but their size makes deployment challenging. We present comprehensive methods for quantizing language models to lower bit-widths while maintaining performance. We develop quantization-aware training techniques that enable models to adapt to lower precision, and provide benchmarks across multiple model sizes and bit-widths. Our results show that aggressive quantization (4-bit) is possible with careful training, reducing model size by 75% with minimal accuracy loss.",
      "primaryCategory": "cs.LG",
      "secondaryCategories": ["cs.CL"],
      "submittedDate": "2023-11-10",
      "publishedDate": "2024-02-20",
      "citationCount": 89,
      "pdfUrl": "https://arxiv.org/pdf/2311.98765.pdf",
      "arxivUrl": "https://arxiv.org/abs/2311.98765",
      "relatedPapers": [
        {
          "arxivId": "2309.11111",
          "title": "Neural Network Compression Techniques",
          "citationType": "cites"
        }
      ]
    }
  ],
  "pageInfo": {
    "page": 1,
    "pageSize": 50,
    "totalResults": 2847,
    "totalPages": 57
  }
}
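A response shaped like the sample above can be processed with a few lines of Python. The sketch below checks the page arithmetic in `pageInfo` and ranks results by citation count. The helper names `remaining_pages` and `most_cited` are hypothetical, the records are cut down to the fields used, and how later pages are actually requested from the actor is not covered here.

```python
import math

# Cut-down copy of the sample response, keeping only the fields used below.
response = {
    "results": [
        {"arxivId": "2312.12345", "citationCount": 247},
        {"arxivId": "2311.98765", "citationCount": 89},
    ],
    "pageInfo": {"page": 1, "pageSize": 50, "totalResults": 2847, "totalPages": 57},
}


def remaining_pages(page_info: dict) -> list:
    """Page numbers still to be read after the current page."""
    total_pages = math.ceil(page_info["totalResults"] / page_info["pageSize"])
    return list(range(page_info["page"] + 1, total_pages + 1))


def most_cited(results: list, min_citations: int = 0) -> list:
    """Results with at least `min_citations`, most cited first."""
    kept = [r for r in results if r["citationCount"] >= min_citations]
    return sorted(kept, key=lambda r: r["citationCount"], reverse=True)


pages_left = remaining_pages(response["pageInfo"])
top = most_cited(response["results"], min_citations=100)
```

With 2,847 results at 50 per page, the ceiling division gives 57 pages, which matches `totalPages` in the sample.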

Use Cases and Applications

PhD students conducting comprehensive literature reviews use the ArXiv Paper Scraper actor to gather all papers in their research area, extract key information, and build bibliographies. A student researching adversarial robustness in neural networks searches for "adversarial robustness" and systematically processes all papers containing this concept, identifying influential papers by citation count and understanding the evolution of research in this subfield.

Machine learning engineers tracking state-of-the-art techniques use the actor to monitor new papers in their domain. A computer vision engineer building an object detection system runs a search for "object detection" monthly, extracting papers from the past month to understand latest techniques and identify promising approaches to integrate into their system. Natural language processing teams working on language models similarly track papers on topics relevant to their work, staying informed as new techniques emerge.

Research universities use the actor to build searchable paper databases accessible to their researchers. Instead of researchers manually searching ArXiv, the university indexes all relevant papers in a local database, providing better search and discovery. Academic conferences use the actor to analyze the landscape of papers in their field, identifying topics that should be emphasized in the conference and understanding research trends that should be reflected in keynote talks.

Companies building AI-native products use the actor to incorporate research insights into development roadmaps. A robotics startup uses the actor to track papers on manipulation, learning from demonstrations, and sim-to-real transfer, ensuring the company is aware of research advances that could improve their product. Venture capitalists use the actor to identify emerging research directions and find academic researchers working on promising technologies.

Science journalists covering AI developments use the actor to research stories, finding influential papers that demonstrate progress, identifying authors who are leaders in their fields, and understanding the state of research on topics they're covering. Researchers using the actor to conduct meta-analyses of research trends in specific fields can extract publication dates, authors, citations, and categories to analyze how research focus has evolved over time.
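A trend analysis of the kind described above can start from a simple count of papers per year and primary category. This is a minimal sketch under the assumption that records carry the `submittedDate` and `primaryCategory` fields shown in the sample output. The function name `yearly_counts` and the three records are illustrative only.

```python
from collections import Counter


def yearly_counts(records: list) -> Counter:
    """Count papers per (submission year, primary category)."""
    counts = Counter()
    for record in records:
        year = int(record["submittedDate"][:4])  # "2023-12-18" -> 2023
        counts[(year, record["primaryCategory"])] += 1
    return counts


# Illustrative records, not real search results.
records = [
    {"submittedDate": "2022-05-01", "primaryCategory": "cs.CV"},
    {"submittedDate": "2023-11-10", "primaryCategory": "cs.LG"},
    {"submittedDate": "2023-12-18", "primaryCategory": "cs.LG"},
]
counts = yearly_counts(records)
```

The resulting counter can be written to a spreadsheet or plotted to show how activity in a category changes from year to year.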

Pricing Justification and Cost Analysis

The ArXiv Paper Scraper actor charges $3 per 1,000 results, translating to $0.003 per paper. For a researcher conducting a comprehensive literature review needing to download metadata on 500 papers, this costs $1.50. For a university library building a searchable database of 50,000 papers in their institutional domain, this costs just $150—far less than the cost of a single institutional Scopus or Web of Science subscription ($10,000+).

For companies building research tools, the pricing remains economical at scale. A startup building an AI paper recommendation engine might index 100,000 papers, which costs $300 with the actor. This single batch indexing cost is trivial compared to the infrastructure costs of running the recommendation service. Subsequent updates to add new papers cost only $0.003 per paper.

The actor's pricing reflects the minimal computational cost of extracting metadata—the papers are already on ArXiv's servers, and extracting metadata is much less expensive than hosting the papers themselves or providing comprehensive search infrastructure. Users can access ArXiv's data programmatically at commodity pricing without subscription overhead or rate-limiting constraints imposed by free APIs.
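The cost figures in this section follow from a single multiplication, sketched below. The default rate is the $3 per 1,000 results used in this section. It is kept as a parameter because the listing header quotes "from $2.00 / 1,000 results", so the rate that applies to a given account may differ. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal


def estimate_cost(result_count: int, rate_per_1000: Decimal = Decimal("3.00")) -> Decimal:
    """Estimated charge in dollars for `result_count` results."""
    cost = Decimal(result_count) * rate_per_1000 / Decimal(1000)
    return cost.quantize(Decimal("0.01"))  # round to whole cents


literature_review = estimate_cost(500)        # 500 papers
library_database = estimate_cost(50_000)      # 50,000 papers
recommendation_index = estimate_cost(100_000) # 100,000 papers
```

`Decimal` is used instead of floating point so that the amounts round cleanly to cents.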

Frequently Asked Questions

What search queries does the actor support? The actor supports ArXiv's native search syntax including field-specific searches (title:, abstract:, author:, category:) and Boolean operators (AND, OR, NOT). For instance, you can search "title:transformer AND author:Vaswani" to find papers by Vaswani with "transformer" in the title.
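A query string like the one in this answer can be assembled programmatically. The sketch below uses the field names listed in the answer (title, abstract, author, category). ArXiv's own API uses the shorter prefixes ti:, abs:, au:, and cat:, so these longer names are an assumption about what the actor accepts. Quoting multi-word terms is also an assumption, and `build_query` is a hypothetical helper.

```python
# Field names as listed in the FAQ answer (assumed to be accepted by the actor).
ALLOWED_FIELDS = ("title", "abstract", "author", "category")


def build_query(operator: str = "AND", **fields: str) -> str:
    """Join field-specific clauses with a single Boolean operator."""
    if operator not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {operator}")
    if not fields:
        raise ValueError("at least one field is required")
    clauses = []
    for name, term in fields.items():
        if name not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {name}")
        if " " in term:
            term = f'"{term}"'  # assumption: phrases are wrapped in double quotes
        clauses.append(f"{name}:{term}")
    return f" {operator} ".join(clauses)


by_vaswani = build_query(title="transformer", author="Vaswani")
detection = build_query("OR", category="cs.CV", title="object detection")
```

Validating field names before sending a query catches typos locally rather than after a paid run returns no results.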

How far back do ArXiv papers go? ArXiv was founded in 1991 and contains papers dating back to that time, though coverage is sparse in the earliest years. Most subfields have comprehensive coverage from the 2000s forward, and certain fast-moving fields like machine learning have extensive coverage from the 2010s onward.

What categories does the actor support? The actor supports all ArXiv categories including computer science (cs.AI, cs.LG, cs.CL, etc.), physics (quant-ph, physics.optics, etc.), mathematics, statistics, quantitative biology, quantitative finance, and others. Specify category codes to filter results by research domain.

How current is the data? ArXiv accepts submissions continuously and announces new papers on a regular schedule. The actor returns the most current data available from ArXiv at the time of your search. New papers typically become searchable once ArXiv has announced them, which is usually within a day or two of submission rather than instantly.

Can I search across multiple categories simultaneously? Yes, the actor supports searches across all categories or restricted to specific categories. Specify the categories you want to search to focus results on particular research domains.

Does the actor extract full paper text or just metadata? The actor extracts metadata (title, authors, abstract, categories, PDF link, citations) but does not extract or parse the full paper text. Use the PDF link to download the full paper if needed.

How long does it take to execute large searches? Search execution time depends on the number of matching results. A search matching 10,000 papers typically completes in 2-3 minutes. Very large searches might take longer. The actor handles pagination automatically, so you can request all results at once without waiting between requests.

Can I filter results by date range? Yes, you can specify a start and end date to retrieve only papers submitted or published within that date range. This is useful for identifying recent work on a topic.
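The names of the actor's date-range input fields are not given on this page, so the sketch below shows the equivalent filter applied to results after they are returned. It assumes each record has an ISO-formatted `submittedDate`, as in the sample output. The helper name `in_date_range` is hypothetical.

```python
from datetime import date


def in_date_range(records: list, start: date, end: date,
                  key: str = "submittedDate") -> list:
    """Keep records whose date field falls within [start, end], inclusive."""
    return [r for r in records if start <= date.fromisoformat(r[key]) <= end]


# Identifiers and dates taken from the sample output.
records = [
    {"arxivId": "2312.12345", "submittedDate": "2023-12-18"},
    {"arxivId": "2311.98765", "submittedDate": "2023-11-10"},
]
december = in_date_range(records, date(2023, 12, 1), date(2023, 12, 31))
```

Passing `key="publishedDate"` would apply the same filter to the journal publication date instead.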

What does the citation count represent? The citation count represents the number of times other papers have cited the paper in ArXiv or published literature. This is a measure of influence and impact, helping identify which papers are most influential in a research area.

How are papers categorized? Each ArXiv paper is assigned a primary category and optionally secondary categories. Categories indicate the research domain (e.g., "cs.LG" for machine learning, "quant-ph" for quantum physics). Papers that span multiple domains may have multiple category assignments.
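Because a paper can carry several categories, a filter should usually check the primary and secondary categories together. The sketch below assumes the `primaryCategory` and `secondaryCategories` fields from the sample output. The helper names `archive_of` and `in_any_category` are hypothetical.

```python
def archive_of(category: str) -> str:
    """Top-level archive of a category code, e.g. "cs" for "cs.LG"."""
    return category.split(".", 1)[0]


def in_any_category(record: dict, wanted: set) -> bool:
    """True if the record's primary or secondary categories overlap `wanted`."""
    assigned = {record["primaryCategory"], *record.get("secondaryCategories", [])}
    return bool(assigned & set(wanted))


# Category assignments copied from the first record in the sample output.
record = {"primaryCategory": "cs.LG", "secondaryCategories": ["cs.AI", "cs.CL"]}
is_nlp = in_any_category(record, {"cs.CL"})
is_combinatorics = in_any_category(record, {"math.CO"})
```

Codes without a dot, such as "hep-th", are returned unchanged by `archive_of`.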