arXiv Paper Scraper: Abstracts, Authors & Metadata
Pricing
from $3.50 / 1,000 results
Scrape research paper metadata from arXiv.org, the world's largest open-access repository. Search by keyword across computer science, physics, mathematics, and biology. Returns titles, abstracts, authors, categories, PDF links, and DOIs. No API key required.
Developer
Logiover
Maintained by Community
arXiv Paper Scraper: Research Metadata, Abstracts & Author Data
Scrape research paper metadata from arXiv.org, the world's largest open-access research repository with over 2.5 million scholarly articles across physics, computer science, mathematics, biology, economics, and more. This actor queries the public arXiv API (no API key required) and returns structured paper data including titles, abstracts, authors, categories, publication dates, PDF links and DOIs.
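For readers who want to see what a request to the public arXiv API can look like, here is a minimal sketch that only builds the request URL. How this actor assembles its requests internally is not documented here, so the helper `build_query_url` and the way it combines a keyword with a category are illustrative assumptions. The parameter names (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) and the `all:` and `cat:` field prefixes come from the arXiv API itself.

```python
from urllib.parse import urlencode

# Base endpoint of the public arXiv API.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(terms, category=None, start=0, max_results=100,
                    sort_by="submittedDate"):
    """Return an arXiv API URL for a phrase search, optionally limited to one category.

    Hypothetical helper: it illustrates the API's query syntax, not the actor's code.
    """
    search = f'all:"{terms}"'            # search all fields for the exact phrase
    if category:
        search += f" AND cat:{category}"  # restrict to one arXiv category
    params = {
        "search_query": search,
        "start": start,                   # offset of the first result (paging)
        "max_results": max_results,
        "sortBy": sort_by,                # relevance, lastUpdatedDate, or submittedDate
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url("machine learning", category="cs.LG")
print(url)
```

Increasing `start` in steps of `max_results` pages through a larger result set.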
Why arXiv Data Is Valuable
arXiv is the primary preprint server for cutting-edge research. Many major AI breakthroughs, from transformers to diffusion models, appeared on arXiv first. Researchers, companies, universities, VCs, and journalists track arXiv to stay ahead of scientific developments.
What you can extract per paper:
- arXiv ID and direct links to abstract page and PDF
- Title and abstract (full text)
- Author list with all co-authors
- Categories (e.g., cs.AI, cs.CL, stat.ML) and primary category
- Publication date (original submission) and last updated date
- DOI (Digital Object Identifier) and journal reference if published
- Author comments (implementation notes, accepted venues, code links)
Output Fields
| Field | Description |
|---|---|
| `arxivId` | Unique arXiv paper identifier (e.g., 2401.12345) |
| `title` | Paper title |
| `authors` | Comma-separated author names |
| `abstract` | Full paper abstract text |
| `categories` | All arXiv category codes |
| `primaryCategory` | Primary category |
| `publishedDate` | Original submission date |
| `updatedDate` | Last update date |
| `pdfUrl` | Direct PDF download link |
| `arxivUrl` | Abstract page URL |
| `comment` | Author comments |
| `journalRef` | Journal reference |
| `doi` | Digital Object Identifier |
| `searchQuery` | The query that found this paper |
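Because `authors` arrives as a single comma-separated string, downstream code usually wants it as a list. The sketch below shows one way to normalize a record; the `Paper` dataclass, the `parse_record` helper, and the sample values are made up for illustration and are not part of the actor.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    arxiv_id: str
    title: str
    authors: list[str]
    primary_category: str


def parse_record(record: dict) -> Paper:
    """Convert one output record into a typed object (hypothetical helper)."""
    # Split the comma-separated author string and drop empty fragments.
    authors = [name.strip() for name in record.get("authors", "").split(",") if name.strip()]
    return Paper(
        arxiv_id=record["arxivId"],
        # Collapse line breaks and repeated spaces that can appear in titles.
        title=" ".join(record["title"].split()),
        authors=authors,
        primary_category=record.get("primaryCategory", ""),
    )


# Made-up sample record using the field names from the table above.
sample = {
    "arxivId": "2401.12345",
    "title": "An  Example\n Title",
    "authors": "Ada Example, Bob Sample",
    "primaryCategory": "cs.AI",
}
paper = parse_record(sample)
print(paper.authors)
```

A plain comma split will mis-handle any name that itself contains a comma, so treat the result as a best effort.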
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQueries` | array | `["machine learning"]` | Search terms: paper titles, author names, keywords |
| `categories` | array | `[]` | arXiv category filters (cs.AI, stat.ML, etc.) |
| `maxResults` | integer | `200` | Max papers to return per query (up to 1000) |
| `sortBy` | enum | `relevance` | relevance / lastUpdatedDate / submittedDate |
| `dateFrom` | string | (none) | Filter papers after date (YYYY-MM-DD) |
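It can help to check an input object against the table above before starting a run. The following is a minimal sketch of such a check. `validate_input` is a hypothetical helper, and the lower bound of 1 for `maxResults` is an assumption, since the table only states the upper limit.

```python
import re
from datetime import date

ALLOWED_SORT = {"relevance", "lastUpdatedDate", "submittedDate"}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems found in the input; an empty list means it looks valid."""
    problems = []

    queries = actor_input.get("searchQueries", ["machine learning"])
    if not isinstance(queries, list) or not all(
        isinstance(q, str) and q.strip() for q in queries
    ):
        problems.append("searchQueries must be a list of non-empty strings")

    max_results = actor_input.get("maxResults", 200)
    # Lower bound of 1 is an assumption; the documented upper bound is 1000.
    if not isinstance(max_results, int) or not 1 <= max_results <= 1000:
        problems.append("maxResults must be an integer between 1 and 1000")

    if actor_input.get("sortBy", "relevance") not in ALLOWED_SORT:
        problems.append("sortBy must be one of: " + ", ".join(sorted(ALLOWED_SORT)))

    date_from = actor_input.get("dateFrom")
    if date_from is not None:
        valid = isinstance(date_from, str) and re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_from)
        if valid:
            try:
                date.fromisoformat(date_from)  # rejects impossible dates such as 2024-02-30
            except ValueError:
                valid = False
        if not valid:
            problems.append("dateFrom must be a real date in YYYY-MM-DD format")

    return problems


print(validate_input({"maxResults": 5000, "sortBy": "newest"}))
```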
Use Cases
AI & Tech Industry Intelligence
Venture capital firms and corporate strategy teams scrape arXiv daily to identify emerging technologies, track competitor research output, and discover promising startups before they raise funding. The people publishing breakthrough papers today are founding the unicorns of tomorrow.
Academic Literature Reviews
PhD students and researchers search across thousands of papers, filter by date and category, and export structured metadata for systematic literature reviews. No more manual copy-pasting from the arXiv website.
Recruitment & Talent Sourcing
Recruiters and engineering leaders search arXiv for authors publishing in specific domains. Every paper author is a potential candidate: arXiv gives you their name, research area, and publication track record.
Dataset Building for NLP/ML
Machine learning teams build training datasets from arXiv abstracts and titles. The arXiv API returns Atom XML, which the actor converts into structured records well suited to text classification, topic modeling, and co-authorship graph construction.
Competitive Research Monitoring
Companies monitor arXiv categories relevant to their industry and get alerts when competitors or key researchers publish new work. Stay ahead of your competitors' R&D pipeline.
Pricing
Pay per event: you are charged per API request to arXiv. Each API call returns up to 100 papers, so a query with `maxResults` set to 200 takes two calls. A run with 5 search queries and 200 results each costs approximately $0.05 to $0.15 in compute units. The arXiv API itself is free and asks only for a polite delay between requests.
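Since charges are tied to API requests, a quick estimate of the request count is useful when sizing a run. The sketch below assumes the 100-papers-per-call page size stated above; `api_calls_needed` is a hypothetical helper. The per-request price is not given in this section, so the example stops at the call count.

```python
import math

PAGE_SIZE = 100  # papers returned per arXiv API call, as stated in the pricing text


def api_calls_needed(num_queries: int, max_results: int, page_size: int = PAGE_SIZE) -> int:
    """Upper bound on API calls: each query needs ceil(max_results / page_size) pages."""
    return num_queries * math.ceil(max_results / page_size)


# The run described above: 5 queries with 200 results each.
print(api_calls_needed(5, 200))  # 10
```

This is an upper bound: a query that matches fewer papers than `maxResults` should need fewer calls.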
Tips
- Use specific categories (e.g., `cs.CL` for NLP, `cs.CV` for computer vision) for more targeted results
- Combine queries: run 5 to 10 related queries to build a comprehensive dataset
- Filter by date: use `dateFrom` to only get papers from the last month or year
- Sort by `lastUpdatedDate` to find recently revised papers with new results
- arXiv rate limit: the API asks for polite delays (this actor includes built-in delays)
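When several related queries are combined, the same paper often appears under more than one query. The sketch below merges result lists and keeps one record per `arxivId`, noting which queries matched. `merge_results` and the added `matchedQueries` key are hypothetical names, and the sample records are made up.

```python
def merge_results(batches):
    """Merge per-query record lists, keeping one record per arxivId."""
    merged = {}
    for batch in batches:
        for record in batch:
            query = record.get("searchQuery")
            entry = merged.get(record["arxivId"])
            if entry is None:
                # First sighting: copy the record and start its list of matching queries.
                entry = {**record, "matchedQueries": []}
                merged[record["arxivId"]] = entry
            if query is not None and query not in entry["matchedQueries"]:
                entry["matchedQueries"].append(query)
    return list(merged.values())


# Made-up records standing in for the results of two queries.
run_a = [
    {"arxivId": "2401.00001", "title": "Paper A", "searchQuery": "llm agents"},
    {"arxivId": "2401.00002", "title": "Paper B", "searchQuery": "llm agents"},
]
run_b = [
    {"arxivId": "2401.00002", "title": "Paper B", "searchQuery": "tool use"},
]

papers = merge_results([run_a, run_b])
print(len(papers))
```

If the identifiers in your export carry a version suffix, normalize them before merging; whether they do is not stated in the field table.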
FAQ
Q: Is this the official arXiv API? A: Yes. This actor uses arXiv's public API at export.arxiv.org. No API key or authentication is required, and the only usage constraint is polite request pacing.
Q: Can I download the actual PDFs? A: The actor provides PDF URLs. You can download PDFs separately. Full-text extraction from PDFs is not included.
Q: How many papers can I get in one run? A: Up to 1000 papers per query, with no limit on the number of queries.
How do I build a dataset of recent AI papers from arXiv?
Set searchQueries to your AI topics, add cs.AI or cs.LG categories, sort by submittedDate and use dateFrom to keep only recent preprints in your export.
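As a rough illustration of that setup, the sketch below builds an input object whose `dateFrom` is computed relative to a given day. The helper name `recent_ai_input`, the 30-day window, and the example topics are illustrative assumptions; the keys are the input parameters listed earlier.

```python
import json
from datetime import date, timedelta


def recent_ai_input(topics, days_back=30, today=None):
    """Build an actor input for recent AI preprints (hypothetical helper)."""
    today = today or date.today()
    return {
        "searchQueries": list(topics),
        "categories": ["cs.AI", "cs.LG"],
        "maxResults": 200,
        "sortBy": "submittedDate",
        # Keep only papers from the last `days_back` days.
        "dateFrom": (today - timedelta(days=days_back)).isoformat(),
    }


# A fixed date keeps the example reproducible; omit `today` to use the current date.
actor_input = recent_ai_input(
    ["diffusion models", "retrieval augmented generation"],
    today=date(2024, 3, 31),
)
print(json.dumps(actor_input, indent=2))
```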
Can I search arXiv by author name?
Yes. Put the author name in searchQueries and the scraper returns the matching papers (up to maxResults), each with the full co-author list, abstract and PDF link.
Changelog
2026-07-01
- Maintenance pass: re-verified end-to-end on live data and confirmed successful runs within the 5-minute quality window on the default input.
- Sharpened Store metadata (SEO title & description) and expanded the FAQ with high-intent, long-tail questions for easier discovery in Google and Apify Store search.
- Added ready-to-run example tasks that cover common real-world use cases.