Academic Paper Scraper
Pricing
Pay per usage
Search arXiv and PubMed in one request. Returns unified paper data: titles, authors, abstracts, DOIs, and PDF links. Filter by keywords, authors, categories, and date range. Built-in rate limiting and cross-source deduplication. Export to JSON, CSV, or Excel.
Developer: CQ
Academic Research Paper Scraper
Apify actor that scrapes academic papers from arXiv and PubMed with a unified output format.
Features
- Dual Source Support: Search both arXiv and PubMed simultaneously
- Unified Output: Consistent paper format regardless of source
- Smart Deduplication: Remove duplicates by DOI across sources
- Flexible Filtering: Filter by title, author, categories, and date range
- Rate Limit Compliance: Built-in throttling that follows each API's usage guidelines
- PubMed API Key Support: Optional API key for faster PubMed access
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| searchQuery | string | Yes | - | Keywords, phrases, or terms to search |
| sources | array | No | ["arxiv", "pubmed"] | Which databases to search |
| titleFilter | string | No | - | Filter papers by title keywords |
| authorFilter | string | No | - | Filter by author name |
| categories | array | No | - | arXiv categories (e.g., cs.AI, physics.quant-ph) |
| dateFrom | string | No | - | Start date (YYYY-MM-DD) |
| dateTo | string | No | - | End date (YYYY-MM-DD) |
| maxResults | integer | No | 100 | Max papers per source (1-10000) |
| sortBy | string | No | relevance | Sort order: relevance, date_desc, date_asc |
| pubmedApiKey | string | No | - | NCBI API key for faster rate limits |
| unpaywallEmail | string | No | - | Your email for Unpaywall API (free, no signup) |
| includeAbstract | boolean | No | true | Include full abstract text |
| deduplicateByDoi | boolean | No | true | Remove cross-source duplicates |
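The constraints in the table above can be checked before a run is started. The following is a minimal sketch, not the actor's own validation: `validate_input` is a hypothetical helper, and the rule that `dateFrom` must not be later than `dateTo` is an assumption that the table does not state.

```python
from datetime import date

ALLOWED_SOURCES = {"arxiv", "pubmed"}
ALLOWED_SORTS = {"relevance", "date_desc", "date_asc"}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems found in an actor input dict (empty if none)."""
    problems = []

    # searchQuery is the only required parameter.
    query = run_input.get("searchQuery")
    if not isinstance(query, str) or not query.strip():
        problems.append("searchQuery is required and must be a non-empty string")

    # sources defaults to both databases.
    sources = run_input.get("sources", ["arxiv", "pubmed"])
    unknown = set(sources) - ALLOWED_SOURCES
    if unknown:
        problems.append(f"unknown sources: {sorted(unknown)}")

    # maxResults is documented as 1-10000, default 100.
    max_results = run_input.get("maxResults", 100)
    if not isinstance(max_results, int) or not 1 <= max_results <= 10000:
        problems.append("maxResults must be an integer between 1 and 10000")

    if run_input.get("sortBy", "relevance") not in ALLOWED_SORTS:
        problems.append("sortBy must be relevance, date_desc, or date_asc")

    # Dates are documented as YYYY-MM-DD strings.
    parsed = {}
    for key in ("dateFrom", "dateTo"):
        value = run_input.get(key)
        if value is None:
            continue
        try:
            parsed[key] = date.fromisoformat(value)
        except (TypeError, ValueError):
            problems.append(f"{key} must be a date in YYYY-MM-DD format")

    # Assumption: an inverted date range is treated as an error.
    if len(parsed) == 2 and parsed["dateFrom"] > parsed["dateTo"]:
        problems.append("dateFrom must not be later than dateTo")

    return problems
```

An empty list means the input satisfies the documented constraints; the actor itself may accept or reject inputs differently.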
Output Format
Each paper in the dataset includes:
{
  "id": "arxiv:2401.12345",
  "source": "arxiv",
  "doi": "10.1234/example",
  "arxivId": "2401.12345",
  "title": "Paper Title",
  "abstract": "Full abstract text...",
  "authors": [{ "name": "John Doe", "affiliation": "University" }],
  "publishedDate": "2024-01-15",
  "updatedDate": "2024-01-20",
  "categories": ["cs.AI", "cs.LG"],
  "journal": "Nature",
  "abstractUrl": "https://arxiv.org/abs/2401.12345",
  "pdfUrl": "https://arxiv.org/pdf/2401.12345.pdf"
}
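Cross-source deduplication by DOI can be illustrated on records of this shape. This is a sketch of one plausible approach, not the actor's actual logic: `normalize_doi` and `deduplicate_by_doi` are hypothetical names, and keeping the first record seen (and keeping every record that has no DOI) is an assumption.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes; None if missing."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def deduplicate_by_doi(papers):
    """Keep the first record seen for each DOI; records without a DOI are always kept."""
    seen = set()
    unique = []
    for paper in papers:
        key = normalize_doi(paper.get("doi"))
        if key is not None:
            if key in seen:
                continue  # same DOI already emitted, possibly from another source
            seen.add(key)
        unique.append(paper)
    return unique
```

Normalizing before comparing matters because DOIs are case-insensitive and the same paper may carry a bare DOI in one source and a `https://doi.org/` URL in another.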
Example Usage
Search for AI papers
{
  "searchQuery": "transformer attention mechanism",
  "sources": ["arxiv", "pubmed"],
  "categories": ["cs.AI", "cs.LG", "cs.CL"],
  "maxResults": 50,
  "sortBy": "date_desc"
}
Search by author
{
  "searchQuery": "deep learning",
  "authorFilter": "Hinton",
  "dateFrom": "2020-01-01",
  "maxResults": 100
}
PubMed only with API key
{
  "searchQuery": "CRISPR gene editing",
  "sources": ["pubmed"],
  "pubmedApiKey": "your-ncbi-api-key",
  "maxResults": 500
}
arXiv Categories
Common categories you can filter by:
- Computer Science: cs.AI, cs.CL, cs.CV, cs.LG, cs.NE, cs.RO
- Physics: physics.quant-ph, physics.comp-ph
- Mathematics: math.OC, math.ST
- Statistics: stat.ML, stat.ME
- Quantitative Biology: q-bio.BM, q-bio.GN, q-bio.NC
Full taxonomy: https://arxiv.org/category_taxonomy
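To show how a category filter might apply to an output record, here is a small sketch. `matches_categories` is a hypothetical helper, and the "any requested category matches" semantics are an assumption; the actor may instead pass the categories directly to the arXiv query.

```python
def matches_categories(paper, wanted):
    """True if the paper carries at least one wanted category, or no filter is set."""
    if not wanted:
        return True
    paper_categories = paper.get("categories") or []
    return bool(set(paper_categories) & set(wanted))
```

Under this reading, a paper tagged `cs.AI` and `cs.LG` passes a filter of `["cs.CL", "cs.LG"]` because one category overlaps.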
PDF Link Availability
Whether a paper has a PDF link depends on its source:
| Source | PDF Availability | Notes |
|---|---|---|
| arXiv | ✅ Always available | arXiv is fully open access |
| PubMed with PMCID | ✅ Available | Paper deposited in PubMed Central |
| PubMed without PMCID | ⚠️ Often unavailable | Paywalled journal articles |
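The table can be read as a simple decision rule. The sketch below is illustrative only: `guess_pdf_url` is a hypothetical helper, the `pmcid` field name does not appear in the documented output format, and the PubMed Central URL pattern is an assumption. Only the arXiv pattern is taken from the output example.

```python
def guess_pdf_url(paper):
    """Return a likely PDF URL for a paper record, or None if none can be derived."""
    arxiv_id = paper.get("arxivId")
    if arxiv_id:
        # Same pattern as the pdfUrl in the documented output example.
        return f"https://arxiv.org/pdf/{arxiv_id}.pdf"

    pmcid = paper.get("pmcid")  # hypothetical field name, e.g. "PMC1234567"
    if pmcid:
        # Assumed PubMed Central URL layout.
        return f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/"

    # PubMed record without a PMCID: often paywalled, so no PDF link.
    return None
```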
Unpaywall Integration
When you provide unpaywallEmail, the actor queries Unpaywall to find open access versions of papers that lack PDF links. This can recover PDFs from:
- Institutional repositories
- Author preprint servers
- Publisher open access copies
Limitations:
- New papers (less than roughly 2-4 weeks old): Unpaywall may not have indexed them yet
- Paywalled papers with no OA version: No legal free PDF exists
- Papers without DOI: Cannot be looked up in Unpaywall
For recent papers without PDFs, the abstractUrl field always provides a link to the paper's landing page.
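A DOI lookup against Unpaywall might look roughly like the sketch below. It uses the public Unpaywall v2 endpoint and its `best_oa_location` / `url_for_pdf` response fields; the helper names `unpaywall_url` and `pick_pdf_url` are hypothetical, and how the actor itself calls Unpaywall is not documented here. The code only builds the request URL and reads a response that has already been parsed into a dict, so it makes no network call.

```python
from urllib.parse import quote


def unpaywall_url(doi, email):
    """Build the Unpaywall v2 lookup URL for a DOI; the email parameter is required."""
    return f"https://api.unpaywall.org/v2/{quote(doi, safe='/')}?email={quote(email)}"


def pick_pdf_url(unpaywall_record):
    """Extract the best open access PDF URL from a parsed Unpaywall response, if any."""
    best = unpaywall_record.get("best_oa_location") or {}
    return best.get("url_for_pdf") or None
```

A `None` result corresponds to the limitations listed above: the paper is not indexed yet or has no open access version. Papers without a DOI never reach this step.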
Rate Limits
The actor respects API rate limits:
- arXiv: 3-second delay between requests
- PubMed: 3 requests/second (or 10/second with API key)
- Unpaywall: 10 requests/second
Get a free NCBI API key for PubMed at: https://www.ncbi.nlm.nih.gov/account/
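The limits above translate into a minimum interval between requests for each service. The following is a minimal sketch of a blocking throttle under that reading; `Throttle` and `MIN_INTERVAL_SECONDS` are hypothetical names, and the actor's real throttling may work differently (for example, with a token bucket).

```python
import time

# Minimum seconds between requests, derived from the documented limits.
MIN_INTERVAL_SECONDS = {
    "arxiv": 3.0,               # 3-second delay between requests
    "pubmed": 1 / 3,            # 3 requests/second
    "pubmed_with_key": 1 / 10,  # 10 requests/second with an API key
    "unpaywall": 1 / 10,        # 10 requests/second
}


class Throttle:
    """Blocks so that consecutive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `Throttle(MIN_INTERVAL_SECONDS["arxiv"]).wait()` before each arXiv request would enforce the 3-second spacing; the first call returns immediately.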
Data Sources
- arXiv API - Open access preprint server
- PubMed E-utilities - Biomedical literature database
License
MIT License