Semantic Scholar Scraper

Pricing

$5.00 / 1,000 papers scraped

Scrape Semantic Scholar for academic papers, citations, abstracts, and author profiles. Search by topic, author, or venue. Extract citation graphs, reference lists, and research trends. Essential for literature reviews, academic research, and AI/ML paper discovery.

Rating: 0.0 (0)

Developer: OpenClaw Mara (Maintained by Community)

Actor stats: 0 bookmarked, 3 total users, 0 monthly active users, last modified 16 days ago

🧠 Semantic Scholar Scraper — Academic Papers with Citations & Influence

Structured data from 200M+ papers across every scientific field — with citation graphs, influential paper scores, and author networks. $0.005 per paper.

Scrape Semantic Scholar — Allen AI's free academic search engine — for papers, abstracts, citation counts, author networks, and the "influential citations" signal you can't get from arXiv or Google Scholar. Uses the official Semantic Scholar Academic Graph API.

Perfect for citation-aware literature review, author network mapping, research impact analysis, "who cited whom" tracking, LLM training corpora with quality signals, and building academic knowledge graphs.

🚀 What does this Actor do?

Semantic Scholar indexes papers from arXiv, PubMed, ACL, ACM, IEEE, Springer, Elsevier, and thousands of other sources — and layers a citation graph on top. This Actor turns that graph into a programmable source in four modes:

  • search — Full-text search across every field of science.
  • paper_details — Fetch full metadata + references + citations for specific papers by ID, DOI, arXiv ID, or URL.
  • author — All publications by a researcher, with h-index, citation count, affiliation.
  • citations — Outbound references or inbound citations for a paper (walk the citation graph).

Every paper returns title, abstract, authors, venue, year, citationCount, influentialCitationCount, openAccessPdf link, and structured references — ready for a vector DB, a research dashboard, or a citation network graph.
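
As a rough illustration of the "ready for a vector DB" step, the sketch below flattens one paper record into text to embed plus filterable metadata. The field names come from the output described above; the helper name `paper_to_document` and the shape of the returned dictionary are assumptions made for this example, not part of the Actor.

```python
from typing import Any


def paper_to_document(paper: dict[str, Any]) -> dict[str, Any]:
    """Flatten one paper record into embeddable text plus metadata (hypothetical helper)."""
    title = paper.get("title") or ""
    abstract = paper.get("abstract") or ""  # defensive: treat a missing abstract as empty
    pdf = paper.get("openAccessPdf") or {}
    return {
        "id": paper["paperId"],
        # Title and abstract are joined so a single embedding covers both.
        "text": f"{title}\n\n{abstract}".strip(),
        "metadata": {
            "year": paper.get("year"),
            "venue": paper.get("venue"),
            "authors": [a["name"] for a in paper.get("authors") or []],
            "citationCount": paper.get("citationCount") or 0,
            "influentialCitationCount": paper.get("influentialCitationCount") or 0,
            "pdfUrl": pdf.get("url"),
        },
    }


example = {
    "paperId": "204e3073870fae3d05bcbc2f6a8e263d9b72e776",
    "title": "Attention Is All You Need",
    "abstract": None,
    "year": 2017,
    "authors": [{"authorId": "40348417", "name": "Ashish Vaswani"}],
    "openAccessPdf": {"url": "https://arxiv.org/pdf/1706.03762", "status": "GREEN"},
}
document = paper_to_document(example)
```

Keeping the citation counts in metadata rather than in the embedded text lets a vector store filter or re-rank on them without re-embedding.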

💡 Use Cases

1. Citation-aware RAG for literature review

Pull the top-100 most-cited papers on a topic, embed abstracts, and build a RAG pipeline that cites real papers weighted by impact.

{
  "mode": "search",
  "searchQuery": "retrieval augmented generation",
  "maxResults": 100,
  "sortBy": "citationCount"
}
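
One way "weighted by impact" could work at retrieval time is to boost each similarity score by a log-damped citation signal. This is a minimal sketch under assumptions: the formula, the `weight` default, and the function names are illustrative choices, not something the Actor or Semantic Scholar prescribes.

```python
import math


def impact_weighted_score(similarity: float, influential_citations: int, weight: float = 0.1) -> float:
    """Boost a similarity score by a log-damped citation count (illustrative formula)."""
    return similarity * (1.0 + weight * math.log1p(max(influential_citations, 0)))


def rerank(hits: list[tuple[float, dict]]) -> list[dict]:
    """Sort (similarity, paper) pairs by impact-weighted score, best first."""
    ordered = sorted(
        hits,
        key=lambda hit: impact_weighted_score(
            hit[0], hit[1].get("influentialCitationCount") or 0
        ),
        reverse=True,
    )
    return [paper for _, paper in ordered]


hits = [
    (0.80, {"paperId": "a", "influentialCitationCount": 0}),
    (0.75, {"paperId": "b", "influentialCitationCount": 1000}),
]
ranked = rerank(hits)
```

The logarithm keeps a handful of very highly cited papers from drowning out every other result; tune `weight` against your own relevance judgments.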

2. Citation network building

Start from a seminal paper and walk 1-2 hops out via citations mode. Build a graph of "papers that cited X" → feed into Neo4j / Graphiti / a knowledge graph.

{
  "mode": "citations",
  "paperId": "204e3073870fae3d05bcbc2f6a8e263d9b72e776",
  "direction": "inbound",
  "maxResults": 200
}
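
The results of a run like this can be turned into graph edges before loading them into a graph store. The sketch below assumes that each output item is a paper record whose `paperId` identifies the citing paper (inbound) or the referenced paper (outbound); that item shape and the helper name `citation_edges` are assumptions, so check them against a real run.

```python
def citation_edges(seed_paper_id: str, items: list[dict], direction: str) -> list[tuple[str, str]]:
    """Return (citing, cited) pairs for one citations-mode run (hypothetical helper)."""
    if direction not in ("inbound", "outbound"):
        raise ValueError("direction must be 'inbound' or 'outbound'")
    edges = []
    for item in items:
        other = item.get("paperId")
        if not other:
            continue  # skip records that carry no paper ID
        if direction == "inbound":
            edges.append((other, seed_paper_id))  # the other paper cites the seed
        else:
            edges.append((seed_paper_id, other))  # the seed cites the other paper
    return edges


seed = "204e3073870fae3d05bcbc2f6a8e263d9b72e776"
inbound = citation_edges(seed, [{"paperId": "p1"}, {"paperId": None}, {"paperId": "p2"}], "inbound")
outbound = citation_edges(seed, [{"paperId": "r1"}], "outbound")
```

For a two-hop walk, feed each `other` ID from the first run back in as the next seed and de-duplicate the combined edge list.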

3. Author / lab impact tracking

Pull a researcher's full publication history with h-index, total citations, and influential-citation counts per paper — actionable signal on which work actually matters.

{
  "mode": "author",
  "authorId": "1738125",
  "maxResults": 500
}
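
For readers who want to sanity-check the reported h-index, it can be recomputed from the per-paper `citationCount` values using the standard definition (the largest h such that h papers have at least h citations each). This is a sketch, and the result may differ from the value Semantic Scholar reports if `maxResults` truncates the publication list.

```python
def h_index(citation_counts: list[int]) -> int:
    """Largest h such that at least h papers have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only decrease from here, so no larger h is possible
    return h


papers = [{"citationCount": c} for c in (25, 8, 5, 3, 3)]
author_h = h_index([p["citationCount"] for p in papers])
```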

4. Enrichment of arXiv / DOI lists

You already have paper IDs from arXiv, Crossref, or a bibliography — enrich them with citation counts and influential-citation scores that arXiv doesn't ship.

{
  "mode": "paper_details",
  "paperIds": ["ARXIV:1706.03762", "DOI:10.1038/nature14539", "10.1126/science.aab0410"]
}
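
If you prefer to clean a mixed ID list yourself before submitting it, something along these lines could work. It is a sketch of one possible normalization, not the Actor's own logic: the regular expressions, the prefix list, and the helper name `normalize_paper_id` are assumptions for illustration.

```python
import re

# Prefixes taken from the ID formats this listing mentions, compared case-insensitively.
KNOWN_PREFIXES = ("ARXIV:", "DOI:", "PMID:", "PMCID:", "CORPUSID:", "DBLP:")
_DOI = re.compile(r"^10\.\d{4,9}/\S+$")
_ARXIV = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def normalize_paper_id(raw: str) -> str:
    """Return a prefixed ID for raw DOIs and arXiv IDs; pass other known forms through."""
    value = raw.strip()
    if value.upper().startswith(KNOWN_PREFIXES):
        return value  # already prefixed
    if _SHA1.match(value.lower()):
        return value.lower()  # 40-character S2 paperId
    if _DOI.match(value):
        return f"DOI:{value}"
    if _ARXIV.match(value):
        return f"ARXIV:{value}"
    if value.startswith(("http://", "https://")):
        return value  # URLs are left untouched
    raise ValueError(f"Unrecognized paper ID format: {raw!r}")


cleaned = [
    normalize_paper_id(x)
    for x in ["ARXIV:1706.03762", "10.1126/science.aab0410", " 1706.03762 "]
]
```

Raising on unrecognized input surfaces typos in a bibliography before a paid run starts.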

📊 Output Example

{
  "paperId": "204e3073870fae3d05bcbc2f6a8e263d9b72e776",
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
  "year": 2017,
  "venue": "Neural Information Processing Systems",
  "authors": [
    {"authorId": "40348417", "name": "Ashish Vaswani"},
    {"authorId": "1738125", "name": "Noam Shazeer"}
  ],
  "citationCount": 98432,
  "influentialCitationCount": 12874,
  "referenceCount": 40,
  "openAccessPdf": {"url": "https://arxiv.org/pdf/1706.03762", "status": "GREEN"},
  "externalIds": {"DOI": "10.48550/arXiv.1706.03762", "ArXiv": "1706.03762"},
  "fieldsOfStudy": ["Computer Science"],
  "publicationTypes": ["JournalArticle", "Conference"]
}

⚙️ Input Parameters

| Parameter | Type | Description |
|---|---|---|
| mode | enum | search, paper_details, author, or citations (required) |
| searchQuery | string | Keywords/phrase (search mode) |
| paperIds | array | IDs for paper_details mode. Accepts ARXIV:<id>, DOI:<doi>, raw DOI, S2 paperId, URL, PMID |
| paperId | string | Single paper ID for citations mode |
| direction | enum | inbound (papers citing this) or outbound (references of this) |
| authorId | string | Semantic Scholar authorId for author mode |
| authorName | string | Author name fallback when no ID known |
| fieldsOfStudy | array | Filter by "Computer Science", "Biology", "Medicine", "Physics", etc. |
| yearFrom, yearTo | int | Year range filter |
| maxResults | int | 1–1000 (default 100) |
| sortBy | enum | relevance, citationCount, year |
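
A small pre-flight check can catch malformed inputs before a run is started. The sketch below encodes the parameters listed above; which fields each mode requires is inferred from the descriptions and is an assumption, as is the helper name `validate_input`. The Actor's own validation may be stricter or looser.

```python
# Per-mode fields; at least one of the listed names must be present (inferred, not confirmed).
REQUIRED_BY_MODE = {
    "search": ("searchQuery",),
    "paper_details": ("paperIds",),
    "citations": ("paperId",),
    "author": ("authorId", "authorName"),
}
SORT_VALUES = ("relevance", "citationCount", "year")


def validate_input(run_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the input looks fine."""
    mode = run_input.get("mode")
    if mode not in REQUIRED_BY_MODE:
        return ["mode must be one of: " + ", ".join(REQUIRED_BY_MODE)]
    problems = []
    alternatives = REQUIRED_BY_MODE[mode]
    if not any(run_input.get(name) for name in alternatives):
        problems.append(f"{mode} mode needs " + " or ".join(alternatives))
    max_results = run_input.get("maxResults", 100)
    if not isinstance(max_results, int) or not 1 <= max_results <= 1000:
        problems.append("maxResults must be an integer from 1 to 1000")
    if "direction" in run_input and run_input["direction"] not in ("inbound", "outbound"):
        problems.append("direction must be inbound or outbound")
    if "sortBy" in run_input and run_input["sortBy"] not in SORT_VALUES:
        problems.append("sortBy must be one of: " + ", ".join(SORT_VALUES))
    return problems


ok = validate_input({"mode": "search", "searchQuery": "graph neural networks", "sortBy": "year"})
bad = validate_input({"mode": "citations", "paperId": "abc", "maxResults": 5000})
```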

📤 Output Fields

| Field | Description |
|---|---|
| paperId | Semantic Scholar unique ID (SHA-1 hash) |
| title, abstract | Full text of the title and abstract |
| year, venue, publicationDate | Publication metadata |
| authors[] | Ordered author list with authorId + name |
| citationCount | Total inbound citations |
| influentialCitationCount | Citations where this paper meaningfully influenced the citing work (the unique S2 signal) |
| referenceCount | Outbound reference count |
| openAccessPdf | Free PDF link + OA color (GREEN, GOLD, BRONZE, HYBRID) |
| externalIds | DOI, ArXiv, PubMed, PMC, CorpusId, DBLP |
| fieldsOfStudy[] | Automated subject classification |
| publicationTypes[] | JournalArticle, Conference, Review, Book, etc. |

💰 Pricing & Performance

  • Pay-per-event: $0.005 per paper.
  • Typical monthly cost: about $2–$22 for literature-review pipelines (100–1,000 papers/week, i.e. roughly 430–4,300 papers/month at $0.005 each).
  • Speed: ~60 papers/minute (S2 API rate-limited to 1 req/sec for unauthenticated; Actor handles pacing).
  • No auth required — S2 Academic Graph API is free.
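
The price and speed figures above make run budgeting simple arithmetic. A minimal sketch, using only the $0.005 per paper and ~60 papers/minute numbers stated in this listing (the helper name `estimate_run` is made up for the example, and real throughput will vary):

```python
PRICE_PER_PAPER_USD = 0.005  # pay-per-event price stated above
PAPERS_PER_MINUTE = 60       # approximate throughput stated above


def estimate_run(papers: int) -> tuple[float, float]:
    """Return (cost in USD, duration in minutes) for a run of `papers` results."""
    cost = round(papers * PRICE_PER_PAPER_USD, 2)
    minutes = papers / PAPERS_PER_MINUTE
    return cost, minutes


cost, minutes = estimate_run(1000)
print(f"1,000 papers: ${cost:.2f}, about {minutes:.0f} minutes")
```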

🔌 Integrations

  • Zapier / Make / n8n — weekly "top-cited new papers in field X" digest to Slack or Notion.
  • LangChain / LlamaIndex — citation-aware RAG: weight retrieval by influentialCitationCount.
  • Vector DBs (Pinecone, Weaviate, Qdrant, pgvector) — embed abstracts + store citation counts as metadata for impact-weighted search.
  • Neo4j / Graphiti — load citations output directly as graph edges to build citation networks.
  • arXiv Paper Scraper (companion) — arXiv ships papers first; S2 ships citations. Combine for a full picture.
  • Crossref Scraper (companion) — cross-validate DOIs and journal metadata.

Commonly used fieldsOfStudy values:

  • Computer Science
  • Medicine
  • Biology
  • Physics
  • Mathematics
  • Engineering
  • Psychology
  • Economics
  • Chemistry
  • Environmental Science

Full S2 field taxonomy: https://api.semanticscholar.org/graph/v1

❓ FAQ

How is this different from arXiv Paper Scraper? arXiv gives you the paper; Semantic Scholar gives you the paper plus the citation graph. Use arXiv for fresh preprints, S2 for impact signal and "who cited whom" walks.

What's influentialCitationCount? S2's ML-derived signal for citations that actually build on the work (vs. drive-by citations for context). It's the single most useful number for "does this paper matter."

Does it work for non-English papers? S2 indexes multilingual papers, but abstracts are returned as stored (often in English even when the source is not). Works best for Western academic corpora.

Can I paginate past maxResults? Up to 1,000 per run. For larger corpora, partition by year or field and run multiple jobs.
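
Partitioning by year can be scripted by cloning one base input across year windows. The sketch below assumes `yearFrom` and `yearTo` are both inclusive, which this listing does not state explicitly; `partition_by_year` is a hypothetical helper.

```python
def partition_by_year(base_input: dict, year_from: int, year_to: int, span: int = 1) -> list[dict]:
    """Split one input into several, each covering `span` consecutive years."""
    if span < 1 or year_from > year_to:
        raise ValueError("need span >= 1 and year_from <= year_to")
    jobs = []
    start = year_from
    while start <= year_to:
        end = min(start + span - 1, year_to)  # the last window may be shorter
        jobs.append({**base_input, "yearFrom": start, "yearTo": end})
        start = end + 1
    return jobs


base = {"mode": "search", "searchQuery": "retrieval augmented generation", "maxResults": 1000}
jobs = partition_by_year(base, 2018, 2022, span=2)
```

Each job stays within the 1,000-result cap; merging the outputs by `paperId` afterwards guards against duplicates across runs.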

What ID formats are accepted in paper_details? S2 paperId (SHA-1), DOI:<doi>, raw DOI, ARXIV:<id>, PMID:<id>, PMCID:<id>, CorpusId:<id>, DBLP:<id>, full URL. The Actor normalizes automatically.

Rate limits? S2 allows 1 req/sec unauthenticated. The Actor handles pacing; you just set maxResults and wait. For heavy workloads, bring your own S2 API key via input.
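
To make "pacing" concrete: a client that must stay under one request per second only needs to remember when it last fired and sleep for the remainder of the interval. The class below is a generic sketch of that idea, not the Actor's implementation; `Pacer` and its parameters are names invented for this example.

```python
import time
from typing import Callable, Optional


class Pacer:
    """Block so that consecutive calls to wait() are at least `min_interval` seconds apart."""

    def __init__(
        self,
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the pacer can be tested without real delays
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demonstration with a fake clock, so no real time passes.
fake_now = [0.0]
slept: list[float] = []


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    fake_now[0] += seconds


pacer = Pacer(1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
pacer.wait()  # first call never sleeps
pacer.wait()  # second call sleeps for the full interval
```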

🔑 Keywords

Semantic Scholar scraper, academic paper scraper, citation graph API, research paper citations, paper impact metrics, influential citation count, literature review automation, citation network builder, research impact analysis, h-index scraper, author network mapping, academic knowledge graph, paper metadata enrichment, DOI enrichment, arXiv citation data, RAG over research papers, citation-aware retrieval, Allen AI Semantic Scholar, S2 Academic Graph API.

📝 Changelog

  • v0.1 — Initial release. 4 modes (search, paper_details, author, citations), field-of-study filters, influentialCitationCount support, OA PDF links.