Academic Paper Scraper

Pricing

Pay per usage

Search arXiv and PubMed in one request. Returns unified paper data: titles, authors, abstracts, DOIs, and PDF links. Filter by keywords, authors, categories, and date range. Built-in rate limiting and cross-source deduplication. Export to JSON, CSV, or Excel.

Developer

Quadruped

Maintained by Community

Academic Research Paper Scraper

An Apify actor that scrapes academic papers from arXiv and PubMed and returns them in a unified output format. Search millions of research papers by keywords, authors, categories, and date ranges.

Features

  • Dual Source Support: Search both arXiv and PubMed simultaneously
  • Unified Output: Consistent paper format regardless of source
  • Smart Deduplication: Remove duplicates by DOI across sources
  • Flexible Filtering: Filter by title, author, categories, and date range
  • Rate Limit Compliance: Built-in throttling respects API guidelines
  • PubMed API Key Support: Optional API key for faster PubMed access
  • TypeScript Implementation: Type-safe with Zod schema validation
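To illustrate the deduplication feature, here is a minimal sketch of how cross-source duplicates could be removed by DOI. The `PaperRef` type and the `normalizeDoi` and `dedupeByDoi` helpers are hypothetical names introduced for this example; the actor's actual implementation may differ, for instance in which copy of a duplicate it keeps.

```typescript
// Hypothetical minimal shape; the actor's real paper records carry more fields.
interface PaperRef {
  id: string;
  source: "arxiv" | "pubmed";
  doi?: string;
}

// Normalize a DOI so the same paper matches across sources.
// DOIs are case-insensitive, and some records carry a resolver or "doi:" prefix.
function normalizeDoi(doi: string): string {
  return doi
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\/(dx\.)?doi\.org\//, "")
    .replace(/^doi:\s*/, "");
}

// Keep the first paper seen for each DOI. Papers without a DOI are always kept
// (an assumption of this sketch), since there is nothing reliable to compare.
function dedupeByDoi<T extends PaperRef>(papers: T[]): T[] {
  const seen = new Set<string>();
  const result: T[] = [];
  for (const paper of papers) {
    if (!paper.doi) {
      result.push(paper);
      continue;
    }
    const key = normalizeDoi(paper.doi);
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(paper);
  }
  return result;
}
```

Because the first occurrence wins, the order in which sources are merged decides whether the arXiv or the PubMed record survives.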

Quick Start

{
  "searchQuery": "transformer attention mechanism",
  "sources": ["arxiv", "pubmed"],
  "maxResults": 50
}

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| searchQuery | string | Yes | - | Keywords, phrases, or terms to search |
| sources | array | No | ["arxiv", "pubmed"] | Which databases to search |
| titleFilter | string | No | - | Filter papers by title keywords |
| authorFilter | string | No | - | Filter by author name |
| categories | array | No | - | arXiv categories (e.g., cs.AI, quant-ph) |
| dateFrom | string | No | - | Start date (YYYY-MM-DD) |
| dateTo | string | No | - | End date (YYYY-MM-DD) |
| maxResults | integer | No | 100 | Max papers per source (1-10000) |
| sortBy | string | No | relevance | Sort order: relevance, date_desc, or date_asc |
| pubmedApiKey | string | No | - | NCBI API key for a higher PubMed rate limit |
| includeAbstract | boolean | No | true | Include full abstract text |
| deduplicateByDoi | boolean | No | true | Remove cross-source duplicates |
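The parameters above can be expressed as a TypeScript type with defaults and basic checks. The sketch below uses plain TypeScript as a dependency-free stand-in for the Zod schema the actor uses; `ScraperInput`, `INPUT_DEFAULTS`, `withDefaults`, and `validateInput` are hypothetical names, and the exact validation rules are assumptions derived from the table.

```typescript
type Source = "arxiv" | "pubmed";
type SortBy = "relevance" | "date_desc" | "date_asc";

interface ScraperInput {
  searchQuery: string;
  sources?: Source[];
  titleFilter?: string;
  authorFilter?: string;
  categories?: string[];
  dateFrom?: string; // YYYY-MM-DD
  dateTo?: string; // YYYY-MM-DD
  maxResults?: number; // 1-10000, per source
  sortBy?: SortBy;
  pubmedApiKey?: string;
  includeAbstract?: boolean;
  deduplicateByDoi?: boolean;
}

// Defaults taken from the "Default" column of the parameter table.
const INPUT_DEFAULTS = {
  sources: ["arxiv", "pubmed"] as Source[],
  maxResults: 100,
  sortBy: "relevance" as SortBy,
  includeAbstract: true,
  deduplicateByDoi: true,
};

// Fill in defaults for every optional field the caller left out.
function withDefaults(input: ScraperInput) {
  return { ...INPUT_DEFAULTS, ...input };
}

const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Returns a list of problems; an empty list means the input looks valid.
function validateInput(input: ScraperInput): string[] {
  const errors: string[] = [];
  if (input.searchQuery.trim() === "") errors.push("searchQuery is required");
  const max = input.maxResults ?? INPUT_DEFAULTS.maxResults;
  if (!Number.isInteger(max) || max < 1 || max > 10000) {
    errors.push("maxResults must be an integer from 1 to 10000");
  }
  for (const key of ["dateFrom", "dateTo"] as const) {
    const value = input[key];
    if (value !== undefined && !ISO_DATE.test(value)) {
      errors.push(`${key} must be YYYY-MM-DD`);
    }
  }
  // ISO dates compare correctly as plain strings.
  if (input.dateFrom && input.dateTo && input.dateFrom > input.dateTo) {
    errors.push("dateFrom must not be after dateTo");
  }
  return errors;
}
```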

Output Format

Each paper in the dataset includes:

{
  "id": "arxiv:2401.12345",
  "source": "arxiv",
  "doi": "10.1234/example",
  "arxivId": "2401.12345",
  "title": "Paper Title",
  "abstract": "Full abstract text...",
  "authors": [
    { "name": "John Doe", "affiliation": "University" }
  ],
  "publishedDate": "2024-01-15",
  "updatedDate": "2024-01-20",
  "categories": ["cs.AI", "cs.LG"],
  "journal": "Nature",
  "abstractUrl": "https://arxiv.org/abs/2401.12345",
  "pdfUrl": "https://arxiv.org/pdf/2401.12345.pdf"
}
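For consumers of the dataset, the record above could be typed roughly as follows, together with a sketch of how an arXiv entry might be normalized into it. Which fields are optional is an assumption, and `RawArxivEntry` and `normalizeArxivEntry` are hypothetical names for illustration; they are not the actor's internal API.

```typescript
interface Author {
  name: string;
  affiliation?: string;
}

// Unified paper record; optionality of fields is assumed, not documented.
interface Paper {
  id: string;
  source: "arxiv" | "pubmed";
  doi?: string;
  arxivId?: string;
  title: string;
  abstract?: string;
  authors: Author[];
  publishedDate: string; // YYYY-MM-DD
  updatedDate?: string; // YYYY-MM-DD
  categories: string[];
  journal?: string;
  abstractUrl: string;
  pdfUrl?: string;
}

// Hypothetical intermediate shape for an already-parsed arXiv feed entry.
interface RawArxivEntry {
  arxivId: string;
  title: string;
  summary: string;
  authorNames: string[];
  published: string; // ISO 8601 timestamp
  updated?: string; // ISO 8601 timestamp
  categories: string[];
  doi?: string;
}

// Collapse line breaks and runs of spaces that feed text often contains.
function collapseWhitespace(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Keep only the date part of an ISO 8601 timestamp.
function toIsoDate(timestamp: string): string {
  return timestamp.slice(0, 10);
}

function normalizeArxivEntry(entry: RawArxivEntry, includeAbstract = true): Paper {
  return {
    id: `arxiv:${entry.arxivId}`,
    source: "arxiv",
    doi: entry.doi,
    arxivId: entry.arxivId,
    title: collapseWhitespace(entry.title),
    abstract: includeAbstract ? collapseWhitespace(entry.summary) : undefined,
    authors: entry.authorNames.map((name) => ({ name })),
    publishedDate: toIsoDate(entry.published),
    updatedDate: entry.updated ? toIsoDate(entry.updated) : undefined,
    categories: entry.categories,
    abstractUrl: `https://arxiv.org/abs/${entry.arxivId}`,
    pdfUrl: `https://arxiv.org/pdf/${entry.arxivId}.pdf`,
  };
}
```

A PubMed normalizer would produce the same `Paper` shape from its own raw records, which is what keeps the output consistent across sources.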

Example Usage

Search for AI papers

{
  "searchQuery": "transformer attention mechanism",
  "sources": ["arxiv", "pubmed"],
  "categories": ["cs.AI", "cs.LG", "cs.CL"],
  "maxResults": 50,
  "sortBy": "date_desc"
}

Search by author

{
  "searchQuery": "deep learning",
  "authorFilter": "Hinton",
  "dateFrom": "2020-01-01",
  "maxResults": 100
}

PubMed only with API key

{
  "searchQuery": "CRISPR gene editing",
  "sources": ["pubmed"],
  "pubmedApiKey": "your-ncbi-api-key",
  "maxResults": 500
}
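To give a sense of what inputs like these could turn into on the arXiv side, the sketch below maps them onto the public arXiv API query syntax (`all:`, `ti:`, `au:`, and `cat:` field prefixes, plus the `sortBy` and `sortOrder` parameters). The mapping itself is an assumption, not the actor's confirmed behavior: `ArxivQueryOptions`, `buildArxivSearchQuery`, and `buildArxivUrl` are hypothetical names, multi-word terms are treated as exact phrases for simplicity, and date filtering is left out.

```typescript
interface ArxivQueryOptions {
  searchQuery: string;
  titleFilter?: string;
  authorFilter?: string;
  categories?: string[];
  maxResults?: number;
  sortBy?: "relevance" | "date_desc" | "date_asc";
}

// Wrap multi-word terms in double quotes so they are searched as a phrase.
// (A simplification: the actor may instead match the words independently.)
function quoteTerm(term: string): string {
  const trimmed = term.trim();
  return /\s/.test(trimmed) ? `"${trimmed}"` : trimmed;
}

// Combine keyword, title, author, and category filters with AND;
// multiple categories are alternatives, so they are joined with OR.
function buildArxivSearchQuery(options: ArxivQueryOptions): string {
  const clauses: string[] = [`all:${quoteTerm(options.searchQuery)}`];
  if (options.titleFilter) clauses.push(`ti:${quoteTerm(options.titleFilter)}`);
  if (options.authorFilter) clauses.push(`au:${quoteTerm(options.authorFilter)}`);
  if (options.categories && options.categories.length > 0) {
    const cats = options.categories.map((c) => `cat:${c}`).join(" OR ");
    clauses.push(options.categories.length > 1 ? `(${cats})` : cats);
  }
  return clauses.join(" AND ");
}

// Build a full request URL for the first page of results.
function buildArxivUrl(options: ArxivQueryOptions): string {
  const sort = options.sortBy ?? "relevance";
  const params: Array<[string, string]> = [
    ["search_query", buildArxivSearchQuery(options)],
    ["start", "0"],
    ["max_results", String(options.maxResults ?? 100)],
    ["sortBy", sort === "relevance" ? "relevance" : "submittedDate"],
    ["sortOrder", sort === "date_asc" ? "ascending" : "descending"],
  ];
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `https://export.arxiv.org/api/query?${query}`;
}
```

Under these assumptions, the "Search by author" input would produce the search string `all:"deep learning" AND au:Hinton`.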

arXiv Categories

Common categories you can filter by:

| Domain | Categories |
|---|---|
| Computer Science | cs.AI, cs.CL, cs.CV, cs.LG, cs.NE, cs.RO |
| Physics | quant-ph, physics.comp-ph |
| Mathematics | math.OC, math.ST |
| Statistics | stat.ML, stat.ME |
| Quantitative Biology | q-bio.BM, q-bio.GN, q-bio.NC |

Full taxonomy: https://arxiv.org/category_taxonomy

Rate Limits

The actor respects API rate limits:

| Source | Rate Limit | Notes |
|---|---|---|
| arXiv | 3-second delay | Between requests |
| PubMed | 3 req/sec | Without API key |
| PubMed | 10 req/sec | With API key |

Get a free NCBI API key for PubMed from your NCBI account settings: https://www.ncbi.nlm.nih.gov/account/
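The limits above translate into a minimum spacing between consecutive requests. Below is a minimal sketch of such throttling, assuming a simple fixed-interval pacer; `minIntervalMs` and `RequestPacer` are hypothetical names, and the actor's built-in throttling may work differently.

```typescript
type RateLimitedSource = "arxiv" | "pubmed";

// Minimum spacing between consecutive requests, in milliseconds.
// arXiv: one request every 3 seconds. PubMed: 3 req/sec, or 10 req/sec with a key.
function minIntervalMs(source: RateLimitedSource, hasApiKey = false): number {
  if (source === "arxiv") return 3000;
  const requestsPerSecond = hasApiKey ? 10 : 3;
  return Math.ceil(1000 / requestsPerSecond);
}

// Tracks when the next request may be sent. The clock is injectable so the
// pacing logic can be exercised without real waiting.
class RequestPacer {
  private nextAllowedAt = 0;
  private readonly intervalMs: number;
  private readonly now: () => number;

  constructor(intervalMs: number, now: () => number = () => Date.now()) {
    this.intervalMs = intervalMs;
    this.now = now;
  }

  // Reserves the next request slot and returns how many milliseconds
  // the caller should wait before sending the request.
  reserve(): number {
    const current = this.now();
    const startAt = Math.max(current, this.nextAllowedAt);
    this.nextAllowedAt = startAt + this.intervalMs;
    return startAt - current;
  }
}
```

A caller would create one pacer per source, for example `new RequestPacer(minIntervalMs("pubmed", true))`, and sleep for the number of milliseconds returned by `reserve()` before each request.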

Use Cases

  • Literature Reviews: Gather papers for systematic reviews
  • Research Monitoring: Track new publications in your field
  • Citation Analysis: Build datasets for bibliometric studies
  • AI Training Data: Collect abstracts for NLP model training
  • Trend Analysis: Analyze publication trends over time

Technical Details

  • Runtime: Node.js 20+
  • Language: TypeScript with ES modules
  • Validation: Zod schema validation for inputs
  • Architecture: Separate clients for arXiv and PubMed with unified normalizer

Data Sources

  • arXiv
  • PubMed (NCBI)

License

MIT License