Academic Paper Scraper

Pricing

Pay per usage

Search arXiv and PubMed in one request. Returns unified paper data: titles, authors, abstracts, DOIs, and PDF links. Filter by keywords, authors, categories, and date range. Built-in rate limiting and cross-source deduplication. Export to JSON, CSV, or Excel.

Developer: CQ (Maintained by Community)

Academic Research Paper Scraper

An Apify actor that scrapes academic papers from arXiv and PubMed and returns them in a unified output format.

Features

  • Dual Source Support: Search both arXiv and PubMed simultaneously
  • Unified Output: Consistent paper format regardless of source
  • Smart Deduplication: Remove duplicates by DOI across sources
  • Flexible Filtering: Filter by title, author, categories, and date range
  • Rate Limit Compliance: Built-in throttling that follows each API's usage guidelines
  • PubMed API Key Support: Optional API key for faster PubMed access
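
The DOI-based deduplication listed above could work roughly as follows. This is a minimal sketch, not the actor's actual code: the normalization rules (lowercasing, stripping resolver prefixes) and the choice to keep the first record seen are assumptions, and `normalize_doi` and `deduplicate_by_doi` are hypothetical names.

```python
from typing import Iterable, List, Optional

# Prefixes that commonly wrap a bare DOI (assumed list, not taken from the actor).
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Return a comparable form of a DOI, or None when there is no DOI."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def deduplicate_by_doi(papers: Iterable[dict]) -> List[dict]:
    """Keep the first paper seen per DOI; papers without a DOI are always kept."""
    seen = set()
    unique = []
    for paper in papers:
        key = normalize_doi(paper.get("doi"))
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        unique.append(paper)
    return unique
```

Papers without a DOI cannot be matched across sources this way, so the sketch passes them through unchanged.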

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| searchQuery | string | Yes | - | Keywords, phrases, or terms to search |
| sources | array | No | ["arxiv", "pubmed"] | Which databases to search |
| titleFilter | string | No | - | Filter papers by title keywords |
| authorFilter | string | No | - | Filter by author name |
| categories | array | No | - | arXiv categories (e.g., cs.AI, physics.quant-ph) |
| dateFrom | string | No | - | Start date (YYYY-MM-DD) |
| dateTo | string | No | - | End date (YYYY-MM-DD) |
| maxResults | integer | No | 100 | Max papers per source (1-10000) |
| sortBy | string | No | relevance | Sort order: relevance, date_desc, date_asc |
| pubmedApiKey | string | No | - | NCBI API key for faster rate limits |
| unpaywallEmail | string | No | - | Your email for Unpaywall API (free, no signup) |
| includeAbstract | boolean | No | true | Include full abstract text |
| deduplicateByDoi | boolean | No | true | Remove cross-source duplicates |
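
To catch mistakes before starting a run, the constraints in the table can be checked client-side. The sketch below is an assumption about how one might do that; `build_input` is a hypothetical helper, not part of the actor, and it only checks the constraints the table states (allowed sources, the 1-10000 range, sort values, date format).

```python
from datetime import datetime

ALLOWED_SOURCES = {"arxiv", "pubmed"}
ALLOWED_SORT = {"relevance", "date_desc", "date_asc"}


def _parse_date(value, name):
    """Parse a YYYY-MM-DD string; raise ValueError naming the offending field."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError as exc:
        raise ValueError(f"{name} must be YYYY-MM-DD, got {value!r}") from exc


def build_input(search_query, **options):
    """Validate options against the documented table and return the input dict."""
    if not search_query or not search_query.strip():
        raise ValueError("searchQuery is required")

    unknown = set(options.get("sources", [])) - ALLOWED_SOURCES
    if unknown:
        raise ValueError(f"unknown sources: {sorted(unknown)}")

    if not 1 <= options.get("maxResults", 100) <= 10000:
        raise ValueError("maxResults must be between 1 and 10000")

    if options.get("sortBy", "relevance") not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT)}")

    date_from = options.get("dateFrom")
    date_to = options.get("dateTo")
    start = _parse_date(date_from, "dateFrom") if date_from else None
    end = _parse_date(date_to, "dateTo") if date_to else None
    if start and end and start > end:
        raise ValueError("dateFrom must not be later than dateTo")

    return {"searchQuery": search_query.strip(), **options}
```

Options that are not passed are left out of the result, so the actor's own defaults apply.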

Output Format

Each paper in the dataset includes:

{
  "id": "arxiv:2401.12345",
  "source": "arxiv",
  "doi": "10.1234/example",
  "arxivId": "2401.12345",
  "title": "Paper Title",
  "abstract": "Full abstract text...",
  "authors": [
    { "name": "John Doe", "affiliation": "University" }
  ],
  "publishedDate": "2024-01-15",
  "updatedDate": "2024-01-20",
  "categories": ["cs.AI", "cs.LG"],
  "journal": "Nature",
  "abstractUrl": "https://arxiv.org/abs/2401.12345",
  "pdfUrl": "https://arxiv.org/pdf/2401.12345.pdf"
}
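
When consuming the dataset from Python, it can help to map each item onto a typed record. The sketch below mirrors the field names in the example above. Which fields may be missing is not documented, so treating everything except `id`, `source`, and `title` as optional is an assumption; `Paper`, `Author`, and `paper_from_item` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    name: str
    affiliation: Optional[str] = None


@dataclass
class Paper:
    id: str
    source: str
    title: str
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    abstract: Optional[str] = None
    authors: List[Author] = field(default_factory=list)
    published_date: Optional[str] = None
    updated_date: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    journal: Optional[str] = None
    abstract_url: Optional[str] = None
    pdf_url: Optional[str] = None


def paper_from_item(item: dict) -> Paper:
    """Convert one dataset item (camelCase keys) into a Paper record."""
    return Paper(
        id=item["id"],
        source=item["source"],
        title=item["title"],
        doi=item.get("doi"),
        arxiv_id=item.get("arxivId"),
        abstract=item.get("abstract"),
        authors=[
            Author(a["name"], a.get("affiliation"))
            for a in item.get("authors") or []
        ],
        published_date=item.get("publishedDate"),
        updated_date=item.get("updatedDate"),
        categories=list(item.get("categories") or []),
        journal=item.get("journal"),
        abstract_url=item.get("abstractUrl"),
        pdf_url=item.get("pdfUrl"),
    )
```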

Example Usage

Search for AI papers

{
  "searchQuery": "transformer attention mechanism",
  "sources": ["arxiv", "pubmed"],
  "categories": ["cs.AI", "cs.LG", "cs.CL"],
  "maxResults": 50,
  "sortBy": "date_desc"
}

Search by author

{
  "searchQuery": "deep learning",
  "authorFilter": "Hinton",
  "dateFrom": "2020-01-01",
  "maxResults": 100
}

PubMed only with API key

{
  "searchQuery": "CRISPR gene editing",
  "sources": ["pubmed"],
  "pubmedApiKey": "your-ncbi-api-key",
  "maxResults": 500
}
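
Inputs like the ones above can also be submitted from Python. The sketch below assumes the `apify-client` package, whose client exposes `actor(...).call(run_input=...)` and `dataset(...).iterate_items()`. The actor ID is not given on this page, so `"<username>/<actor-name>"` is a placeholder, and `run_search` is a hypothetical helper.

```python
def run_search(client, actor_id, run_input):
    """Start the actor, wait for it to finish, and return all dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return any run information")
    dataset_id = run["defaultDatasetId"]
    return list(client.dataset(dataset_id).iterate_items())


# Usage (requires `pip install apify-client` and an Apify API token):
#
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   papers = run_search(
#       client,
#       "<username>/<actor-name>",  # placeholder, replace with the real actor ID
#       {"searchQuery": "CRISPR gene editing", "sources": ["pubmed"], "maxResults": 500},
#   )
```

Passing the client in as an argument keeps the helper easy to test with a stub.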

arXiv Categories

Common categories you can filter by:

  • Computer Science: cs.AI, cs.CL, cs.CV, cs.LG, cs.NE, cs.RO
  • Physics: physics.quant-ph, physics.comp-ph
  • Mathematics: math.OC, math.ST
  • Statistics: stat.ML, stat.ME
  • Quantitative Biology: q-bio.BM, q-bio.GN, q-bio.NC

Full taxonomy: https://arxiv.org/category_taxonomy
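
For orientation, here is one way keywords, categories, and an author filter could be turned into an arXiv API request. The `all:`, `cat:`, and `au:` field prefixes and the `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` parameters come from the public arXiv API. How this actor actually builds its query is not documented, so AND-ing the keywords and OR-ing the categories is an assumption, and both function names are hypothetical.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_search_query(keywords, categories=(), author=None):
    """Combine keywords (AND), categories (OR), and an optional author clause."""
    clauses = [" AND ".join(f"all:{word}" for word in keywords.split())]
    if categories:
        clauses.append("(" + " OR ".join(f"cat:{c}" for c in categories) + ")")
    if author:
        clauses.append(f'au:"{author}"')
    return " AND ".join(clauses)


def build_arxiv_url(keywords, categories=(), author=None, max_results=100):
    """Return a full arXiv API URL sorted by newest submission first."""
    params = {
        "search_query": build_arxiv_search_query(keywords, categories, author),
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


print(build_arxiv_search_query("transformer attention", ["cs.AI", "cs.LG"]))
# all:transformer AND all:attention AND (cat:cs.AI OR cat:cs.LG)
```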

PDF availability depends on the source, which is why some papers lack PDF links:

| Source | PDF Availability | Notes |
| --- | --- | --- |
| arXiv | ✅ Always available | arXiv is fully open access |
| PubMed with PMCID | ✅ Available | Paper deposited in PubMed Central |
| PubMed without PMCID | ⚠️ Often unavailable | Paywalled journal articles |
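
The availability rules above amount to a small decision function. In the sketch below, the arXiv pattern follows the `pdfUrl` shown in the output example. The PubMed Central URL pattern and the `pmcid` argument are assumptions, since the output example does not show PubMed-specific fields, and `resolve_pdf_url` is a hypothetical name.

```python
from typing import Optional


def resolve_pdf_url(
    source: str,
    arxiv_id: Optional[str] = None,
    pmcid: Optional[str] = None,
) -> Optional[str]:
    """Return a direct PDF link when the source guarantees one, else None."""
    if source == "arxiv" and arxiv_id:
        # arXiv is fully open access, so a PDF always exists.
        return f"https://arxiv.org/pdf/{arxiv_id}.pdf"
    if source == "pubmed" and pmcid:
        # Assumed PubMed Central URL pattern for deposited papers.
        return f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/"
    # PubMed records without a PMCID usually point at paywalled journals.
    return None
```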

Unpaywall Integration

When you provide unpaywallEmail, the actor queries Unpaywall to find open access versions of papers that lack PDF links. This can recover PDFs from:

  • Institutional repositories
  • Author preprint servers
  • Publisher open access copies

Limitations:

  • New papers (published within roughly the last 2-4 weeks): Unpaywall may not have indexed them yet
  • Paywalled papers with no OA version: No legal free PDF exists
  • Papers without DOI: Cannot be looked up in Unpaywall

For recent papers without PDFs, the abstractUrl field always provides a link to the paper's landing page.
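
A lookup of this kind can be sketched against the public Unpaywall REST API, which serves one record per DOI at `https://api.unpaywall.org/v2/{doi}?email=...` and reports open access copies under `best_oa_location` and `oa_locations`. This illustrates the idea rather than the actor's own code; the function names are hypothetical, and the network request itself is left out so the example stays self-contained.

```python
from typing import Optional
from urllib.parse import quote

UNPAYWALL_API = "https://api.unpaywall.org/v2"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI; Unpaywall requires an email parameter."""
    return f"{UNPAYWALL_API}/{quote(doi, safe='/')}?email={quote(email)}"


def pick_pdf_url(record: dict) -> Optional[str]:
    """Return the first PDF link in an Unpaywall record, preferring the best location."""
    best = record.get("best_oa_location") or {}
    if best.get("url_for_pdf"):
        return best["url_for_pdf"]
    for location in record.get("oa_locations") or []:
        if location.get("url_for_pdf"):
            return location["url_for_pdf"]
    # No open access PDF is known for this DOI.
    return None
```

A paper without a DOI never reaches `unpaywall_url`, which matches the third limitation listed above.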

Rate Limits

The actor respects API rate limits:

  • arXiv: 3-second delay between requests
  • PubMed: 3 requests/second (or 10/second with API key)
  • Unpaywall: 10 requests/second

Get a free NCBI API key (used for PubMed) at: https://www.ncbi.nlm.nih.gov/account/
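
The limits above translate into a minimum interval between requests per source. A simple way to enforce such an interval is sketched below. The `Throttle` class and `throttle_for` helper are hypothetical and show the general technique, not the actor's implementation. The clock and sleep functions are injectable so the behavior can be tested without waiting.

```python
import time


class Throttle:
    """Block so that consecutive calls to wait() are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def throttle_for(source, has_pubmed_key=False):
    """Map the documented limits to a minimum interval in seconds."""
    if source == "arxiv":
        return Throttle(3.0)  # 3-second delay between requests
    if source == "pubmed":
        return Throttle(1 / 10 if has_pubmed_key else 1 / 3)
    if source == "unpaywall":
        return Throttle(1 / 10)
    raise ValueError(f"unknown source: {source!r}")
```

Calling `wait()` immediately before each request keeps a sequential client within the limit.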

Data Sources

  • arXiv
  • PubMed (NCBI)
  • Unpaywall (queried only when unpaywallEmail is provided)

License

MIT License