Academic Paper Scraper
Pricing
Pay per usage
Search arXiv and PubMed in one request. Returns unified paper data: titles, authors, abstracts, DOIs, and PDF links. Filter by keywords, authors, categories, and date range. Built-in rate limiting and cross-source deduplication. Export to JSON, CSV, or Excel.
Developer
Quadruped
Academic Research Paper Scraper
An Apify actor that scrapes academic papers from arXiv and PubMed and returns them in a unified output format. Search millions of research papers by keywords, authors, categories, and date ranges.
Features
- Dual Source Support: Search both arXiv and PubMed simultaneously
- Unified Output: Consistent paper format regardless of source
- Smart Deduplication: Remove duplicates by DOI across sources
- Flexible Filtering: Filter by title, author, categories, and date range
- Rate Limit Compliance: Built-in throttling respects API guidelines
- PubMed API Key Support: Optional API key for faster PubMed access
- TypeScript Implementation: Type-safe with Zod schema validation
Quick Start
{
  "searchQuery": "transformer attention mechanism",
  "sources": ["arxiv", "pubmed"],
  "maxResults": 50
}
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| searchQuery | string | Yes | - | Keywords, phrases, or terms to search |
| sources | array | No | ["arxiv", "pubmed"] | Which databases to search |
| titleFilter | string | No | - | Filter papers by title keywords |
| authorFilter | string | No | - | Filter by author name |
| categories | array | No | - | arXiv categories (e.g., cs.AI, physics.quant-ph) |
| dateFrom | string | No | - | Start date (YYYY-MM-DD) |
| dateTo | string | No | - | End date (YYYY-MM-DD) |
| maxResults | integer | No | 100 | Max papers per source (1-10000) |
| sortBy | string | No | relevance | Sort: relevance, date_desc, date_asc |
| pubmedApiKey | string | No | - | NCBI API key for faster rate limits |
| includeAbstract | boolean | No | true | Include full abstract text |
| deduplicateByDoi | boolean | No | true | Remove cross-source duplicates |
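The table above can be read as a typed input with defaults. The following is a minimal sketch of that reading in plain TypeScript; it is not the actor's own Zod schema, and the names `ActorInput`, `ResolvedInput`, and `resolveInput` are hypothetical. The checks mirror only what the table states (required `searchQuery`, `maxResults` between 1 and 10000, `YYYY-MM-DD` dates).

```typescript
type Source = "arxiv" | "pubmed";
type SortBy = "relevance" | "date_desc" | "date_asc";

// Field names and types follow the Input Parameters table.
interface ActorInput {
  searchQuery: string;
  sources?: Source[];
  titleFilter?: string;
  authorFilter?: string;
  categories?: string[];
  dateFrom?: string;
  dateTo?: string;
  maxResults?: number;
  sortBy?: SortBy;
  pubmedApiKey?: string;
  includeAbstract?: boolean;
  deduplicateByDoi?: boolean;
}

// Same input, with every parameter that has a documented default filled in.
type ResolvedInput = ActorInput &
  Required<Pick<ActorInput, "sources" | "maxResults" | "sortBy" | "includeAbstract" | "deduplicateByDoi">>;

const DATE_PATTERN = /^\d{4}-\d{2}-\d{2}$/;

// Hypothetical helper: validates the documented constraints and applies defaults.
function resolveInput(input: ActorInput): ResolvedInput {
  if (input.searchQuery.trim() === "") {
    throw new Error("searchQuery is required");
  }
  const maxResults = input.maxResults ?? 100;
  if (!Number.isInteger(maxResults) || maxResults < 1 || maxResults > 10000) {
    throw new Error("maxResults must be an integer between 1 and 10000");
  }
  for (const date of [input.dateFrom, input.dateTo]) {
    if (date !== undefined && !DATE_PATTERN.test(date)) {
      throw new Error(`Invalid date "${date}", expected YYYY-MM-DD`);
    }
  }
  return {
    ...input,
    sources: input.sources ?? ["arxiv", "pubmed"],
    maxResults,
    sortBy: input.sortBy ?? "relevance",
    includeAbstract: input.includeAbstract ?? true,
    deduplicateByDoi: input.deduplicateByDoi ?? true,
  };
}
```

Calling `resolveInput({ searchQuery: "CRISPR gene editing" })` would yield an input that searches both sources for up to 100 papers each, sorted by relevance.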
Output Format
Each paper in the dataset includes:
{
  "id": "arxiv:2401.12345",
  "source": "arxiv",
  "doi": "10.1234/example",
  "arxivId": "2401.12345",
  "title": "Paper Title",
  "abstract": "Full abstract text...",
  "authors": [{ "name": "John Doe", "affiliation": "University" }],
  "publishedDate": "2024-01-15",
  "updatedDate": "2024-01-20",
  "categories": ["cs.AI", "cs.LG"],
  "journal": "Nature",
  "abstractUrl": "https://arxiv.org/abs/2401.12345",
  "pdfUrl": "https://arxiv.org/pdf/2401.12345.pdf"
}
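For consumers of the dataset, the record above could be typed roughly as follows, and the `deduplicateByDoi` behavior could be approximated client-side. This is a sketch under assumptions: which fields are optional is a guess from the sample, and the DOI normalization (lowercasing, stripping a `doi:` or `https://doi.org/` prefix) is one plausible approach rather than the actor's documented logic.

```typescript
interface Author {
  name: string;
  affiliation?: string;
}

// Field names come from the sample record; optionality is assumed.
interface Paper {
  id: string;
  source: "arxiv" | "pubmed";
  doi?: string;
  arxivId?: string;
  title: string;
  abstract?: string;
  authors: Author[];
  publishedDate: string;
  updatedDate?: string;
  categories?: string[];
  journal?: string;
  abstractUrl: string;
  pdfUrl?: string;
}

// DOIs are case-insensitive, so compare them lowercased and without URL or "doi:" prefixes.
function normalizeDoi(doi: string): string {
  return doi
    .trim()
    .toLowerCase()
    .replace(/^(https?:\/\/(dx\.)?doi\.org\/|doi:)/, "");
}

// Keeps the first paper seen for each DOI; papers without a DOI are always kept.
function deduplicateByDoi(papers: Paper[]): Paper[] {
  const seen = new Set<string>();
  const result: Paper[] = [];
  for (const paper of papers) {
    if (paper.doi === undefined || paper.doi.trim() === "") {
      result.push(paper);
      continue;
    }
    const key = normalizeDoi(paper.doi);
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(paper);
  }
  return result;
}
```

Keeping the first occurrence means the order of the input decides which source wins when a preprint and its published version share a DOI.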
Example Usage
Search for AI papers
{
  "searchQuery": "transformer attention mechanism",
  "sources": ["arxiv", "pubmed"],
  "categories": ["cs.AI", "cs.LG", "cs.CL"],
  "maxResults": 50,
  "sortBy": "date_desc"
}
Search by author
{
  "searchQuery": "deep learning",
  "authorFilter": "Hinton",
  "dateFrom": "2020-01-01",
  "maxResults": 100
}
PubMed only with API key
{
  "searchQuery": "CRISPR gene editing",
  "sources": ["pubmed"],
  "pubmedApiKey": "your-ncbi-api-key",
  "maxResults": 500
}
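Any of the inputs above can be sent to the actor over the Apify REST API. The sketch below only builds the request for the `run-sync-get-dataset-items` endpoint, which runs an actor and returns its dataset items in one call. The helper name `buildRunRequest` and the actor ID `username~actor-name` are placeholders, not values taken from this page; substitute the actor's real ID and your own API token.

```typescript
interface RunRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: describes a synchronous actor run without sending it.
function buildRunRequest(
  actorId: string,
  apiToken: string,
  input: Record<string, unknown>,
): RunRequest {
  return {
    url: `https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/run-sync-get-dataset-items`,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiToken}`,
    },
    body: JSON.stringify(input),
  };
}

// Placeholder actor ID and token, for illustration only.
const request = buildRunRequest("username~actor-name", "YOUR_APIFY_TOKEN", {
  searchQuery: "transformer attention mechanism",
  sources: ["arxiv", "pubmed"],
  maxResults: 50,
});

// To send it from an async context:
//   const { url, ...init } = request;
//   const papers = await (await fetch(url, init)).json();
```

Synchronous runs suit small result sets; for large `maxResults` values an asynchronous run followed by a dataset download is likely the safer choice, since the request can otherwise time out.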
arXiv Categories
Common categories you can filter by:
| Domain | Categories |
|---|---|
| Computer Science | cs.AI, cs.CL, cs.CV, cs.LG, cs.NE, cs.RO |
| Physics | physics.quant-ph, physics.comp-ph |
| Mathematics | math.OC, math.ST |
| Statistics | stat.ML, stat.ME |
| Quantitative Biology | q-bio.BM, q-bio.GN, q-bio.NC |
Full taxonomy: https://arxiv.org/category_taxonomy
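To illustrate how category and field filters could translate into an arXiv API request, here is a hedged sketch. It uses the public arXiv query syntax (`all:`, `ti:`, `au:`, `cat:` prefixes combined with `AND`/`OR`) and its `sortBy`/`sortOrder` parameters. How the actor itself builds queries is not documented here, so the phrase-matching choice and the name `buildArxivUrl` are assumptions.

```typescript
type SortBy = "relevance" | "date_desc" | "date_asc";

interface ArxivSearch {
  searchQuery: string;
  titleFilter?: string;
  authorFilter?: string;
  categories?: string[];
  maxResults?: number;
  sortBy?: SortBy;
}

// Assumed mapping from the actor's sortBy values to arXiv's sortBy/sortOrder pair.
const ARXIV_SORT: Record<SortBy, [string, string]> = {
  relevance: ["relevance", "descending"],
  date_desc: ["submittedDate", "descending"],
  date_asc: ["submittedDate", "ascending"],
};

// Wraps a value in double quotes for phrase matching, dropping embedded quotes.
const quote = (value: string): string => `"${value.replace(/"/g, "").trim()}"`;

// Hypothetical helper: builds an arXiv API query URL from actor-style filters.
function buildArxivUrl(search: ArxivSearch): string {
  const clauses: string[] = [`all:${quote(search.searchQuery)}`];
  if (search.titleFilter) clauses.push(`ti:${quote(search.titleFilter)}`);
  if (search.authorFilter) clauses.push(`au:${quote(search.authorFilter)}`);
  if (search.categories && search.categories.length > 0) {
    // Any of the listed categories may match.
    clauses.push(`(${search.categories.map((c) => `cat:${c}`).join(" OR ")})`);
  }
  const [sortBy, sortOrder] = ARXIV_SORT[search.sortBy ?? "relevance"];
  const params: [string, string][] = [
    ["search_query", clauses.join(" AND ")],
    ["start", "0"],
    ["max_results", String(search.maxResults ?? 100)],
    ["sortBy", sortBy],
    ["sortOrder", sortOrder],
  ];
  // encodeURIComponent leaves parentheses alone, so encode them explicitly.
  const encode = (v: string): string =>
    encodeURIComponent(v).replace(/\(/g, "%28").replace(/\)/g, "%29");
  return "http://export.arxiv.org/api/query?" + params.map(([k, v]) => `${k}=${encode(v)}`).join("&");
}
```

For example, passing `categories: ["cs.AI", "cs.LG"]` adds the clause `(cat:cs.AI OR cat:cs.LG)` to the query, so a paper in either category matches.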
Rate Limits
The actor respects API rate limits:
| Source | Rate Limit | Notes |
|---|---|---|
| arXiv | 3-second delay | Between requests |
| PubMed | 3 req/sec | Without API key |
| PubMed | 10 req/sec | With API key |
Get a free NCBI API key (used for PubMed) from your NCBI account settings: https://www.ncbi.nlm.nih.gov/account/
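The limits in the table translate into a minimum spacing between requests: 3000 ms for arXiv, about 334 ms for PubMed without a key, and 100 ms with one. The sketch below shows one simple way to enforce such spacing. The `Throttle` class and `minIntervalMs` function are illustrative names, not the actor's internals, and the clock is passed in so the logic stays easy to test.

```typescript
type Source = "arxiv" | "pubmed";

// Minimum gap between consecutive requests, derived from the Rate Limits table.
function minIntervalMs(source: Source, hasPubmedApiKey: boolean = false): number {
  if (source === "arxiv") return 3000;
  const requestsPerSecond = hasPubmedApiKey ? 10 : 3;
  return Math.ceil(1000 / requestsPerSecond);
}

// Hypothetical fixed-interval throttle: each call reserves the next free time slot.
class Throttle {
  private readonly intervalMs: number;
  private nextSlot = 0;

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs;
  }

  // Returns how many milliseconds the caller should wait before sending a request.
  reserve(now: number): number {
    const start = Math.max(now, this.nextSlot);
    this.nextSlot = start + this.intervalMs;
    return start - now;
  }
}

// Typical use inside an async function:
//   const throttle = new Throttle(minIntervalMs("pubmed", true));
//   const waitMs = throttle.reserve(Date.now());
//   await new Promise((resolve) => setTimeout(resolve, waitMs));
//   // ...send the request...
```

One throttle per source keeps the slow arXiv interval from holding back PubMed requests.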
Use Cases
- Literature Reviews: Gather papers for systematic reviews
- Research Monitoring: Track new publications in your field
- Citation Analysis: Build datasets for bibliometric studies
- AI Training Data: Collect abstracts for NLP model training
- Trend Analysis: Analyze publication trends over time
Technical Details
- Runtime: Node.js 20+
- Language: TypeScript with ES modules
- Validation: Zod schema validation for inputs
- Architecture: Separate clients for arXiv and PubMed with unified normalizer
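The "separate clients with unified normalizer" design might look roughly like the sketch below: each client yields source-specific records, and small mapping functions turn them into one shared shape. The raw record interfaces (`RawArxivEntry`, `RawPubmedArticle`) and the `pubmed:` ID prefix are invented for illustration; only the `arxiv:` ID format and the arXiv URL patterns are taken from the sample output.

```typescript
// Reduced version of the unified record, enough to show the mapping.
interface NormalizedPaper {
  id: string;
  source: "arxiv" | "pubmed";
  doi?: string;
  title: string;
  authors: { name: string }[];
  publishedDate: string;
  abstractUrl: string;
  pdfUrl?: string;
}

// Hypothetical shapes of what each source client might hand to the normalizer.
interface RawArxivEntry {
  arxivId: string;
  title: string;
  authorNames: string[];
  published: string; // ISO timestamp, e.g. "2024-01-15T12:00:00Z"
  doi?: string;
}

interface RawPubmedArticle {
  pmid: string;
  title: string;
  authorNames: string[];
  pubDate: string; // assumed to be YYYY-MM-DD already
  doi?: string;
}

// Feed titles can contain line breaks and runs of spaces; collapse them.
const collapseWhitespace = (text: string): string => text.replace(/\s+/g, " ").trim();

function fromArxiv(entry: RawArxivEntry): NormalizedPaper {
  return {
    id: `arxiv:${entry.arxivId}`,
    source: "arxiv",
    doi: entry.doi,
    title: collapseWhitespace(entry.title),
    authors: entry.authorNames.map((name) => ({ name })),
    publishedDate: entry.published.slice(0, 10),
    abstractUrl: `https://arxiv.org/abs/${entry.arxivId}`,
    pdfUrl: `https://arxiv.org/pdf/${entry.arxivId}.pdf`,
  };
}

function fromPubmed(article: RawPubmedArticle): NormalizedPaper {
  return {
    id: `pubmed:${article.pmid}`, // assumed ID format
    source: "pubmed",
    doi: article.doi,
    title: collapseWhitespace(article.title),
    authors: article.authorNames.map((name) => ({ name })),
    publishedDate: article.pubDate,
    abstractUrl: `https://pubmed.ncbi.nlm.nih.gov/${article.pmid}/`,
  };
}
```

Keeping the mapping functions separate from the HTTP clients means downstream steps such as filtering and deduplication only ever see one record shape.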
Data Sources
- arXiv API - Open access preprint server
- PubMed E-utilities - Biomedical literature database
License
MIT License