ArXiv Paper Scraper — Search and Extract Research Papers (JSON)
Pricing
Pay per usage
Search and scrape arXiv research papers. Get titles, abstracts, authors, categories, and PDF links. Monitor new papers by topic daily.
arXiv Paper Scraper — Extract Research Papers, Abstracts & Metadata
Extract research papers from arXiv.org at scale. Search by keyword or browse by category to get structured data including titles, authors, abstracts, categories, DOI references, and PDF links. Powered by the official arXiv API with built-in rate limiting and pagination.
Features
- Keyword Search — find papers by any search term (e.g., "large language models", "quantum computing")
- Category Browsing — extract papers from specific arXiv categories (cs.AI, cs.LG, math.CO, physics.optics)
- Full Metadata — title, authors list, abstract, DOI, journal reference, and publication dates
- PDF & Abstract Links — direct URLs to PDF downloads and abstract pages
- Automatic Pagination — fetches up to 500 papers per query with configurable limits
- Sorting Options — sort by submission date, last update, or relevance
- Rate-Limited — respects arXiv's 3-second request delay policy
- Default Fallback — returns latest AI papers when no input is provided
Output Example
{
  "arxivId": "2403.12345v1",
  "title": "Scaling Laws for Neural Language Models",
  "authors": ["John Smith", "Jane Doe"],
  "abstract": "We investigate the scaling behavior of Transformer language models...",
  "categories": ["cs.CL", "cs.AI", "cs.LG"],
  "primaryCategory": "cs.CL",
  "published": "2024-03-15T18:00:00Z",
  "updated": "2024-03-16T12:00:00Z",
  "doi": "10.1234/example",
  "journalRef": "Nature 2024",
  "pdfUrl": "http://arxiv.org/pdf/2403.12345v1",
  "abstractUrl": "http://arxiv.org/abs/2403.12345v1",
  "source": "search:large language models",
  "scrapedAt": "2026-03-18T10:00:00.000Z"
}
Use Cases
- AI/ML Research Monitoring — track the latest papers in your field automatically
- Literature Review — bulk-extract papers for systematic academic reviews
- Training Data Collection — gather paper metadata for AI research tools
- Trend Analysis — identify hot topics by analyzing publication volume across categories
- Citation Tracking — monitor new publications from specific research groups
- Knowledge Base Building — feed structured paper data into search engines or databases
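As a rough illustration of the trend-analysis use case, the sketch below counts papers per month and primary category. It relies only on the `published` and `primaryCategory` fields shown in the output example; the function name `monthly_category_counts` and the sample records are made up for this example.

```python
from collections import Counter


def monthly_category_counts(records):
    """Count papers per (YYYY-MM, primaryCategory) pair."""
    counts = Counter()
    for record in records:
        month = record["published"][:7]  # "2024-03-15T18:00:00Z" -> "2024-03"
        counts[(month, record["primaryCategory"])] += 1
    return counts


# Made-up sample records containing only the two fields used here.
records = [
    {"published": "2024-03-15T18:00:00Z", "primaryCategory": "cs.CL"},
    {"published": "2024-03-20T09:30:00Z", "primaryCategory": "cs.CL"},
    {"published": "2024-04-02T11:00:00Z", "primaryCategory": "cs.LG"},
]

for (month, category), n in sorted(monthly_category_counts(records).items()):
    print(month, category, n)
```

In practice the records would come from the scraper's JSON output rather than a hand-written list.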
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| searchQueries | array | [] | Keywords to search (e.g., "transformer architecture") |
| categories | array | [] | arXiv category codes (e.g., "cs.AI", "cs.LG") |
| maxPapersPerSource | integer | 50 | Max papers per query/category (1-500) |
| sortBy | string | submittedDate | Sort by: submittedDate, lastUpdatedDate, relevance |
| sortOrder | string | descending | ascending or descending |
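To make the defaults and ranges concrete, here is a minimal sketch that merges a partial input over the documented defaults and checks the documented limits. The `validate_input` helper is hypothetical and not part of the scraper; it only restates what the table says.

```python
# Hypothetical helper: applies the documented defaults and checks the documented ranges.
ALLOWED_SORT_BY = {"submittedDate", "lastUpdatedDate", "relevance"}
ALLOWED_SORT_ORDER = {"ascending", "descending"}

DEFAULTS = {
    "searchQueries": [],
    "categories": [],
    "maxPapersPerSource": 50,
    "sortBy": "submittedDate",
    "sortOrder": "descending",
}


def validate_input(raw: dict) -> dict:
    """Merge user input over the defaults and reject out-of-range values."""
    unknown = set(raw) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    merged = {**DEFAULTS, **raw}
    if not 1 <= merged["maxPapersPerSource"] <= 500:
        raise ValueError("maxPapersPerSource must be between 1 and 500")
    if merged["sortBy"] not in ALLOWED_SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")
    if merged["sortOrder"] not in ALLOWED_SORT_ORDER:
        raise ValueError(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")
    return merged


# Example input: one keyword search plus two categories, up to 100 papers each.
example_input = validate_input({
    "searchQueries": ["transformer architecture"],
    "categories": ["cs.AI", "cs.LG"],
    "maxPapersPerSource": 100,
})
print(example_input)
```

With this input, each of the three sources (one query, two categories) would be capped at 100 papers, sorted by submission date in descending order.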
How It Works
The scraper uses the official arXiv API to search for papers and retrieve their metadata as Atom XML. It parses each entry into structured fields, including authors, categories, and links, and paginates automatically until it reaches the configured limit per source. A 3-second delay between requests keeps it within arXiv's rate-limiting policy.
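The flow described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the scraper's actual code: the function names, the page size of 100, and the mapping of keyword searches to `all:` queries and categories to `cat:` queries are all assumptions. The endpoint and its `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters belong to the public arXiv API.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

API_URL = "http://export.arxiv.org/api/query"
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}
PAGE_SIZE = 100            # assumed page size, not taken from the scraper
REQUEST_DELAY_SECONDS = 3  # delay stated in the description above


def build_url(search_query, start, max_results,
              sort_by="submittedDate", sort_order="descending"):
    """Build an arXiv API query URL for one page of results."""
    params = {
        "search_query": search_query,  # e.g. 'all:"quantum computing"' or "cat:cs.AI"
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    return f"{API_URL}?{urllib.parse.urlencode(params)}"


def _clean(text):
    """Collapse the line breaks and runs of spaces that appear in feed text."""
    return " ".join((text or "").split())


def parse_feed(xml_data):
    """Turn an Atom feed (str or bytes) into a list of paper dicts."""
    papers = []
    for entry in ET.fromstring(xml_data).findall("atom:entry", NS):
        abs_url = _clean(entry.findtext("atom:id", namespaces=NS))
        primary = entry.find("arxiv:primary_category", NS)
        links = entry.findall("atom:link", NS)
        pdf_url = next((l.get("href") for l in links if l.get("title") == "pdf"), None)
        papers.append({
            "arxivId": abs_url.rsplit("/abs/", 1)[-1],
            "title": _clean(entry.findtext("atom:title", namespaces=NS)),
            "authors": [_clean(a.findtext("atom:name", namespaces=NS))
                        for a in entry.findall("atom:author", NS)],
            "abstract": _clean(entry.findtext("atom:summary", namespaces=NS)),
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            "primaryCategory": primary.get("term") if primary is not None else None,
            "published": entry.findtext("atom:published", namespaces=NS),
            "updated": entry.findtext("atom:updated", namespaces=NS),
            "doi": entry.findtext("arxiv:doi", namespaces=NS),
            "journalRef": entry.findtext("arxiv:journal_ref", namespaces=NS),
            "pdfUrl": pdf_url,
            "abstractUrl": abs_url,
        })
    return papers


def http_get(url):
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_papers(search_query, limit, fetch=http_get, delay=REQUEST_DELAY_SECONDS):
    """Page through results until `limit` papers are collected or results run out."""
    papers = []
    while len(papers) < limit:
        page_size = min(PAGE_SIZE, limit - len(papers))
        page = parse_feed(fetch(build_url(search_query, len(papers), page_size)))
        papers.extend(page)
        if len(page) < page_size:
            break  # a short page means there are no more results
        if len(papers) < limit:
            time.sleep(delay)  # wait between consecutive requests
    return papers[:limit]
```

Calling `fetch_papers("cat:cs.AI", 50)` would issue real requests to the arXiv API. The `fetch` and `delay` arguments exist so the paging logic can be exercised offline with a stub.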