Academic Paper Scraper

Pricing

from $5.00 / 1,000 results

Search MILLIONS of academic papers from Semantic Scholar and arXiv by keyword, DOI, or citation graph. Returns titles, authors, abstracts, citation counts, and open access PDFs as clean JSON. Works as an MCP tool for AI agents.

Developer

Mick

Maintained by Community

Search and retrieve academic papers from Semantic Scholar (226M+ papers) and arXiv. Get titles, authors, abstracts, AI-generated summaries, citation counts, DOIs, and open access PDFs as clean JSON. API-based -- no browser needed, no bot detection issues, fast and reliable. MCP-ready for AI agent integration.

What does it do?

Academic Paper Scraper queries the Semantic Scholar Graph API and arXiv API to find, retrieve, and analyze academic papers. It unifies results into a consistent schema regardless of source.

Three modes:

  • Search -- keyword search across 226M+ papers (Semantic Scholar) or arXiv preprints
  • Get Paper -- look up a specific paper by DOI, arXiv ID, PubMed ID, or Semantic Scholar ID
  • Citations -- traverse the citation graph: find papers that cite a given paper, or papers it references

Use cases

  • Literature review -- quickly find all papers on a topic with citation counts and AI summaries
  • Citation analysis -- map influence networks by traversing citing/cited papers
  • Research monitoring -- track new publications in your field
  • Meta-analysis -- gather paper metadata at scale for systematic reviews
  • AI agent tooling -- give AI agents access to the academic literature via MCP
  • Competitive research intelligence -- track what competitors are publishing

What data does it extract?

Each record represents one academic paper:

| Field | Description |
|---|---|
| title | Paper title |
| authors | List of author names |
| year | Publication year |
| publication_date | Full date (YYYY-MM-DD) |
| venue | Conference or journal name |
| journal | Journal with volume/pages |
| abstract | Full abstract text |
| tldr | AI-generated one-sentence summary (Semantic Scholar, ~40% of papers) |
| doi | Digital Object Identifier |
| arxiv_id | arXiv preprint ID |
| pubmed_id | PubMed ID |
| semantic_scholar_id | Semantic Scholar paper ID |
| fields_of_study | Academic fields (e.g. "Computer Science", "Medicine") |
| publication_types | Type (JournalArticle, Conference, Preprint, etc.) |
| citation_count | Number of citations |
| reference_count | Number of references |
| influential_citation_count | Citations that significantly impacted the citing paper |
| is_open_access | Whether free full-text is available |
| open_access_pdf_url | Direct PDF link (when available) |
| external_urls | Links to Semantic Scholar, arXiv, DOI resolver |
| source | Which API provided this result |

Input

| Field | Type | Default | Description |
|---|---|---|---|
| mode | string | search | search, get_paper, or citations |
| query | string | required | Search keywords or paper ID (DOI, arXiv ID, PMID, S2 ID) |
| source | string | auto | auto, semantic_scholar, or arxiv |
| citationDirection | string | citing | For citations mode: citing or cited_by |
| yearFrom | integer | - | Filter by publication year (from) |
| yearTo | integer | - | Filter by publication year (to) |
| fieldsOfStudy | array | [] | Filter by field (e.g. "Computer Science") |
| openAccessOnly | boolean | false | Only return papers with free PDFs |
| arxivCategories | array | [] | arXiv category filter (e.g. "cs.AI", "physics.hep-th") |
| maxResults | integer | 100 | Max papers to return (1-500) |
| includeAbstract | boolean | true | Include full abstracts |
| includeTldr | boolean | true | Include AI summaries |
| includeCitationCounts | boolean | true | Include citation metrics |
| sortBy | string | relevance | relevance or date |
| requestIntervalSecs | number | 1.5 | Min seconds between API requests |
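
As a rough illustration of the constraints in the input table, the Python sketch below assembles an input object and rejects values the table rules out. The helper name build_run_input is hypothetical, and the checks are inferred from the table alone; the actor's own validation may differ.

```python
from typing import Any

# Allowed values as listed in the input table.
VALID_MODES = {"search", "get_paper", "citations"}
ENUM_FIELDS = {
    "source": {"auto", "semantic_scholar", "arxiv"},
    "citationDirection": {"citing", "cited_by"},
    "sortBy": {"relevance", "date"},
}


def build_run_input(query: str, mode: str = "search", **options: Any) -> dict[str, Any]:
    """Hypothetical helper: build an actor input dict and validate it."""
    if not query or not query.strip():
        raise ValueError("query is required")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")

    run_input: dict[str, Any] = {"mode": mode, "query": query.strip(), **options}

    # Enumerated string fields must use one of the documented values.
    for key, allowed in ENUM_FIELDS.items():
        if key in run_input and run_input[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")

    # maxResults is documented as 1-500 with a default of 100.
    if not 1 <= run_input.get("maxResults", 100) <= 500:
        raise ValueError("maxResults must be between 1 and 500")

    # A year range only makes sense when the bounds are ordered.
    year_from, year_to = run_input.get("yearFrom"), run_input.get("yearTo")
    if year_from is not None and year_to is not None and year_from > year_to:
        raise ValueError("yearFrom must not exceed yearTo")

    return run_input
```

Optional fields are left out unless passed, so the actor's defaults from the table apply.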

Example: Search for papers

{
  "mode": "search",
  "query": "transformer attention mechanism",
  "maxResults": 20,
  "yearFrom": 2020,
  "openAccessOnly": true
}

Example: Look up a paper by DOI

{
  "mode": "get_paper",
  "query": "10.48550/arXiv.1706.03762"
}

Example: Get citations for "Attention Is All You Need"

{
  "mode": "citations",
  "query": "1706.03762",
  "citationDirection": "citing",
  "maxResults": 50,
  "yearFrom": 2023
}

Example: Search arXiv by category

{
  "mode": "search",
  "query": "large language models",
  "source": "arxiv",
  "arxivCategories": ["cs.AI", "cs.CL"],
  "sortBy": "date",
  "maxResults": 30
}
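
For orientation, an arXiv search input like the one above might translate into an arXiv API request along the following lines. The all: and cat: field prefixes and the search_query, start, max_results, sortBy, and sortOrder parameters belong to the public arXiv API. How the actor actually combines them is not documented here, so the mapping is an assumption and build_arxiv_url is a hypothetical helper.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_arxiv_url(query, categories=(), sort_by="relevance", max_results=100, start=0):
    """Hypothetical helper: build an arXiv API query URL from actor-style inputs."""
    # Search all fields for the quoted phrase.
    search = f'all:"{query}"'
    if categories:
        # Any of the listed categories may match, combined with the keywords.
        cats = " OR ".join(f"cat:{c}" for c in categories)
        search = f"{search} AND ({cats})"
    params = {
        "search_query": search,
        "start": start,
        "max_results": max_results,
        # Assumed mapping: the actor's "date" sort becomes arXiv's submittedDate.
        "sortBy": "submittedDate" if sort_by == "date" else "relevance",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_arxiv_url("large language models", ["cs.AI", "cs.CL"], sort_by="date", max_results=30)
```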

Output

Example: Search results (Semantic Scholar)

[
  {
    "schema_version": "1.0",
    "type": "academic_paper",
    "title": "Attention is All you Need",
    "authors": ["Ashish Vaswani", "Noam Shazeer", "Niki Parmar", "Jakob Uszkoreit", "Llion Jones", "Aidan N. Gomez", "Lukasz Kaiser", "Illia Polosukhin"],
    "year": 2017,
    "publication_date": "2017-06-12",
    "venue": "Neural Information Processing Systems",
    "journal": "",
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
    "tldr": "A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, is proposed.",
    "doi": "10.48550/arXiv.1706.03762",
    "arxiv_id": "1706.03762",
    "pubmed_id": "",
    "semantic_scholar_id": "204e3073870fae3d05bcbc2f6a8e263d9b72e776",
    "fields_of_study": ["Computer Science"],
    "publication_types": ["JournalArticle", "Conference"],
    "citation_count": 140000,
    "reference_count": 41,
    "influential_citation_count": 12000,
    "is_open_access": true,
    "open_access_pdf_url": "https://arxiv.org/pdf/1706.03762",
    "external_urls": {
      "semantic_scholar": "https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776",
      "doi_url": "https://doi.org/10.48550/arXiv.1706.03762",
      "arxiv_abs": "https://arxiv.org/abs/1706.03762",
      "arxiv_pdf": "https://arxiv.org/pdf/1706.03762"
    },
    "source": "semantic_scholar",
    "scraped_at": "2025-01-15T12:00:00+00:00"
  }
]

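The abstract in this example is truncated for readability. Each record includes all fields from the schema, plus the schema_version, type, and scraped_at metadata shown in the example.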
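
Records in this shape are straightforward to post-process. The sketch below ranks a list of records by citation_count and optionally keeps only open access papers. The function name top_cited is illustrative, and treating a missing or null count as zero is an assumption made here so that arXiv-sourced records without citation data sort last.

```python
from typing import Any


def top_cited(records: list[dict[str, Any]], n: int = 10,
              open_access_only: bool = False) -> list[dict[str, Any]]:
    """Return the n most-cited records, highest citation_count first."""
    pool = [r for r in records if r.get("is_open_access") or not open_access_only]
    # Missing or null counts are treated as 0 (an assumption, see lead-in).
    return sorted(pool, key=lambda r: r.get("citation_count") or 0, reverse=True)[:n]
```

For example, top_cited(items, n=5, open_access_only=True) would give the five most-cited papers that carry a free full-text flag.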

Data sources

Semantic Scholar (default)

  • 226M+ papers across all academic fields
  • Best for general searches and citation analysis
  • Provides AI-generated TLDR summaries, citation counts, influential citation counts
  • Accepts DOI, arXiv ID, PubMed ID, and Semantic Scholar ID for lookups
  • Rate limit: 1 req/sec (no API key needed)

arXiv

  • 2.4M+ preprints in physics, math, CS, quantitative biology, finance, statistics, and engineering
  • Best for finding the latest preprints and category-specific searches
  • All papers are open access with direct PDF links
  • Rate limit: 3 sec between requests recommended
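
The per-source rate limits above, together with the requestIntervalSecs input, amount to enforcing a minimum gap between consecutive requests. A minimal sketch of that idea follows. MinIntervalLimiter is a hypothetical class, not the actor's code, and it is synchronous for brevity whereas the actor is described as async. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Enforce a minimum number of seconds between successive calls to wait()."""

    def __init__(self, interval_secs: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = interval_secs
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous request was released

    def wait(self) -> float:
        """Block until the interval has passed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling limiter.wait() before each HTTP request with interval_secs=1.5 would mirror the documented default, and interval_secs=3 would match the arXiv recommendation.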

Limitations

  • Semantic Scholar rate limits may slow down large requests. The scraper respects the 1 req/sec limit, but Semantic Scholar (S2) may still throttle requests during peak times.
  • arXiv has no citation data. Citation counts, reference counts, and TLDR summaries are only available via Semantic Scholar.
  • TLDR coverage is ~40%. The AI summary is not available for all papers in Semantic Scholar.
  • Year filtering on arXiv is done client-side (arXiv API does not support year ranges natively), so the scraper may need to fetch extra pages to fill the result set.
  • PubMed direct search is not implemented in v1. However, Semantic Scholar indexes PubMed papers, so PubMed content is searchable via the Semantic Scholar source. Direct PubMed ID lookup via get_paper mode works.
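
The client-side year filtering described for arXiv can be pictured as a loop that keeps fetching pages until enough in-range papers have been collected. The sketch below is an illustration under assumptions: fetch_page is a hypothetical callable returning one page of records (each with a year field), and the max_pages cap is invented here to keep the loop bounded.

```python
from typing import Callable, Optional


def filter_by_year(fetch_page: Callable[[int], list[dict]],
                   year_from: Optional[int] = None,
                   year_to: Optional[int] = None,
                   max_results: int = 100,
                   max_pages: int = 20) -> list[dict]:
    """Collect up to max_results papers whose year falls inside the range."""
    kept: list[dict] = []
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:
            break  # the source has no more results
        for paper in batch:
            year = paper.get("year")
            if year is None:
                continue  # cannot be placed in the range
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
            kept.append(paper)
            if len(kept) == max_results:
                return kept  # stop fetching as soon as the set is full
    return kept
```

Because out-of-range papers are discarded after download, a narrow year range can require several pages to produce a small result set, which is the extra fetching the limitation refers to.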

Cost

This actor uses pay-per-event (PPE) pricing. You pay only for the results you get.

  • $0.50 per 1,000 results ($0.0005 per result)
  • No proxy costs -- API-based, no browser needed
  • Free tier: 25 results per run (no subscription required)

Typical run: searching for 100 papers takes about 10 seconds. Cost: $0.05.
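
The arithmetic behind that figure, written out as a small Python sketch. The function name estimate_cost is illustrative; the default rate is the $0.50 per 1,000 results quoted in this section, kept as a parameter so a different listed price can be substituted. The free tier is not modelled.

```python
from decimal import Decimal


def estimate_cost(results: int, price_per_1000: Decimal = Decimal("0.50")) -> Decimal:
    """Estimate the charge in USD for a number of returned results."""
    if results < 0:
        raise ValueError("results must be non-negative")
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    return price_per_1000 * results / 1000


cost_100 = estimate_cost(100)  # 0.50 * 100 / 1000
```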


MCP Integration

This actor works as an MCP tool through Apify's hosted MCP server. No custom server needed.

  • Endpoint: https://mcp.apify.com?tools=labrat011/academic-paper-scraper
  • Auth: Authorization: Bearer <APIFY_TOKEN>
  • Transport: Streamable HTTP
  • Works with: Claude Desktop, Cursor, VS Code, Windsurf, Warp, Gemini CLI

Example MCP config (Claude Desktop / Cursor):

{
  "mcpServers": {
    "academic-paper-scraper": {
      "url": "https://mcp.apify.com?tools=labrat011/academic-paper-scraper",
      "headers": {
        "Authorization": "Bearer <APIFY_TOKEN>"
      }
    }
  }
}

Agent prompt examples:

  • "Find the 10 most cited papers on CRISPR gene editing published since 2020"
  • "Look up the paper with DOI 10.1038/s41586-021-03819-2 and summarize its findings"
  • "What papers cite 'Attention Is All You Need'? Show me the top 20 by citation count"
  • "Search arXiv for recent papers on quantum error correction in the cs.AI and quant-ph categories"

The agent calls this tool, gets structured JSON with titles, authors, abstracts, citation counts, and AI summaries, and can synthesize literature reviews or identify research trends programmatically.


Technical details

  • Python 3.12, async architecture with httpx.AsyncClient
  • Semantic Scholar Graph API v1 (no API key required)
  • arXiv Atom API with XML parsing (xml.etree.ElementTree)
  • Automatic paper ID detection: DOI, arXiv ID, PubMed ID, Corpus ID, S2 ID
  • Paginated fetching with rate limiting
  • Batch push (25 items) for memory efficiency
  • State persistence for resumable runs
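
To make the arXiv item above concrete, here is a minimal sketch of parsing an arXiv Atom feed with xml.etree.ElementTree into records that use a subset of this README's field names. It is not the actor's implementation: parse_arxiv_feed and the chosen field subset are assumptions, and only elements of the standard Atom namespace are read.

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def _text(element: ET.Element, tag: str) -> str:
    """Whitespace-normalized text of a child element, or '' if absent."""
    return " ".join(element.findtext(f"atom:{tag}", default="", namespaces=ATOM_NS).split())


def parse_arxiv_feed(xml_text: str) -> list[dict]:
    """Convert an arXiv Atom feed into a list of paper dicts."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        published = _text(entry, "published")
        # The entry id is the abstract URL; drop the prefix and version suffix.
        arxiv_id = re.sub(r"v\d+$", "", _text(entry, "id").rsplit("/abs/", 1)[-1])
        # The PDF link is the <link> element carrying title="pdf".
        pdf_url = next((link.get("href", "")
                        for link in entry.findall("atom:link", ATOM_NS)
                        if link.get("title") == "pdf"), "")
        papers.append({
            "title": _text(entry, "title"),
            "authors": [_text(a, "name") for a in entry.findall("atom:author", ATOM_NS)],
            "year": int(published[:4]) if published[:4].isdigit() else None,
            "publication_date": published[:10],
            "abstract": _text(entry, "summary"),
            "arxiv_id": arxiv_id,
            "open_access_pdf_url": pdf_url,
            "source": "arxiv",
        })
    return papers
```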

FAQ

Which source should I use?

Use auto (the default). It picks Semantic Scholar for general searches (largest corpus, richest metadata) and arXiv when you specify arXiv categories. Override to arxiv if you specifically want preprints or need arXiv-specific category filtering.
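
The selection rule described in this answer is small enough to state as code. The sketch below is a paraphrase of that rule, not the actor's source, and resolve_source is a hypothetical name.

```python
from typing import Optional, Sequence


def resolve_source(source: str = "auto",
                   arxiv_categories: Optional[Sequence[str]] = None) -> str:
    """Pick the backend for a search, following the rule described above."""
    if source != "auto":
        return source  # an explicit choice always wins
    # auto: arXiv when categories are given, Semantic Scholar otherwise.
    return "arxiv" if arxiv_categories else "semantic_scholar"
```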

Can I search by author?

Yes. Include the author name in your search query, e.g. "Yann LeCun deep learning". Semantic Scholar's relevance ranking considers author names.

How do I find a specific paper?

Use get_paper mode with any identifier: DOI (10.1038/...), arXiv ID (2301.12345), PubMed ID (PMID:12345678), or Semantic Scholar ID (40-char hex). The scraper auto-detects the ID type.
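
One plausible way to auto-detect the identifier type is pattern matching, sketched below. The regular expressions are assumptions based on the formats listed in this README (plus the CorpusId: prefix used by Semantic Scholar), not the actor's actual rules. Old-style arXiv IDs such as hep-th/9901001 are deliberately not handled.

```python
import re
from typing import Optional

# Checked in order; the first matching pattern wins.
_ID_PATTERNS = [
    ("doi", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("arxiv", re.compile(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", re.IGNORECASE)),
    ("pubmed", re.compile(r"^pmid:\d+$", re.IGNORECASE)),
    ("corpus_id", re.compile(r"^corpusid:\d+$", re.IGNORECASE)),
    ("semantic_scholar", re.compile(r"^[0-9a-f]{40}$", re.IGNORECASE)),
]


def detect_id_type(value: str) -> Optional[str]:
    """Return the identifier kind, or None if the value looks like keywords."""
    candidate = value.strip()
    for name, pattern in _ID_PATTERNS:
        if pattern.match(candidate):
            return name
    return None
```

Returning None for anything unrecognized lets a caller fall back to treating the query as search keywords.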

What's the difference between citing and cited_by in citations mode?

  • citing returns papers that cite your target paper (who referenced it later)
  • cited_by returns papers that your target paper references (its bibliography)
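
In terms of the Semantic Scholar Graph API, these two directions correspond naturally to the paper citations and references endpoints. The sketch below shows that mapping. The endpoints and the prefixed ID forms (such as ARXIV: and DOI:) exist in the public API, but whether the actor maps citationDirection exactly this way is an assumption, and citation_url is a hypothetical helper.

```python
from urllib.parse import quote

S2_GRAPH = "https://api.semanticscholar.org/graph/v1"

# citing   -> papers that cite the target   -> /citations
# cited_by -> papers the target references  -> /references
_ENDPOINTS = {"citing": "citations", "cited_by": "references"}


def citation_url(paper_id: str, direction: str = "citing") -> str:
    """Build the Graph API URL for one direction of the citation graph."""
    if direction not in _ENDPOINTS:
        raise ValueError(f"direction must be one of {sorted(_ENDPOINTS)}")
    # Keep ':' and '/' unescaped so prefixed IDs such as DOI:10.x/y stay readable.
    return f"{S2_GRAPH}/paper/{quote(paper_id, safe=':/')}/{_ENDPOINTS[direction]}"
```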

Can I use this with the Apify API?

Yes. Call the actor via the Apify API and retrieve results programmatically in JSON, CSV, or other formats. Works with the Apify Python and JavaScript clients.
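
A minimal sketch with the Apify Python client (the apify-client package) might look like the following. The actor ID is taken from the MCP endpoint URL in this README. Reading the token from an APIFY_TOKEN environment variable and the run_scraper helper name are assumptions made for illustration.

```python
import os
from typing import Any, Optional

# Actor ID as it appears in the MCP endpoint URL in this README.
ACTOR_ID = "labrat011/academic-paper-scraper"


def run_scraper(run_input: dict[str, Any], token: Optional[str] = None) -> list[dict[str, Any]]:
    """Run the actor, wait for it to finish, and return its dataset items."""
    # Imported here so this module loads even without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    # Assumption: the API token lives in the APIFY_TOKEN environment variable.
    client = ApifyClient(token or os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    # Results are stored in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    papers = run_scraper({"mode": "get_paper", "query": "10.48550/arXiv.1706.03762"})
    for paper in papers:
        print(paper.get("title"), paper.get("citation_count"))
```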


Feedback

Found a bug or have a feature request? Open an issue on the actor's Issues tab in Apify Console.