Semantic Scholar Scraper

Pricing

from $3.00 / 1,000 results

Scrape Semantic Scholar: 200M+ academic papers and authors with a full citation graph. Search, fetch by paper or author ID, and retrieve citations, references, and recommendations, with abstracts, TLDRs, fields of study, open-access PDFs, h-index, affiliations, and more.

Rating: 5.0 (16)

Developer: Crawler Bros

Maintained by Community

Actor stats: 16 bookmarked, 2 total users, 1 monthly active user, last modified 2 days ago

Scrape Semantic Scholar — Allen Institute for AI's open catalog of 200M+ academic papers and authors with a full citation graph — directly via the official Semantic Scholar Graph API.

What you get

For every paper:

  • paperId, corpusId, externalIds (DOI, arXiv, MAG, PMID, ACL, DBLP)
  • title, abstract, tldr (AI-generated summary)
  • year, publicationDate, venue, publicationVenue, journal
  • authors — list of {authorId, name}, plus primaryAuthor
  • fieldsOfStudy, s2FieldsOfStudy (with source attribution)
  • publicationTypes (Review, JournalArticle, Conference, …)
  • referenceCount, citationCount, influentialCitationCount
  • isOpenAccess, openAccessPdf ({url, status, license})
  • semanticScholarUrl
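
As a rough illustration of consuming these fields, the sketch below condenses one paper record into a short summary. It assumes each dataset item is a plain JSON object keyed by the field names listed above, that optional fields may be absent, and that the DOI sits under the key "DOI" inside externalIds. The helper name summarize_paper and the sample record are hypothetical, not actor output.

```python
from typing import Any, Optional


def summarize_paper(record: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Condense one paper record; every field is treated as optional."""
    external_ids = record.get("externalIds") or {}
    pdf = record.get("openAccessPdf") or {}
    authors = record.get("authors") or []
    return {
        "title": record.get("title"),
        "year": record.get("year"),
        # Assumption: the DOI is stored under the key "DOI".
        "doi": external_ids.get("DOI"),
        "first_author": authors[0].get("name") if authors else None,
        "citations": record.get("citationCount", 0),
        # Only report a PDF link when the record is flagged open access.
        "pdf_url": pdf.get("url") if record.get("isOpenAccess") else None,
    }


# Hypothetical sample record for illustration only.
sample = {
    "paperId": "0123456789abcdef0123456789abcdef01234567",
    "title": "An Example Paper",
    "year": 2020,
    "externalIds": {"DOI": "10.0000/example"},
    "authors": [{"authorId": "1", "name": "A. Author"}],
    "citationCount": 12,
    "isOpenAccess": True,
    "openAccessPdf": {"url": "https://example.org/paper.pdf", "status": "GREEN"},
}

summary = summarize_paper(sample)
```

Reading every field through .get keeps the helper tolerant of records where a field such as abstract or tldr is missing.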

For every author:

  • authorId, name, aliases, affiliations, homepage
  • paperCount, citationCount, hIndex
  • externalIds (ORCID, DBLP)
  • semanticScholarUrl

For citation/reference relations:

  • The full paper record of the citing/cited paper
  • citationContexts (text snippets of where it was cited)
  • citationIntents (background, methodology, result)
  • isInfluentialCitation

Modes

| Mode | What it does |
|---|---|
| searchPaper | Relevance-ranked paper search via /paper/search. Best for "find me the top N papers about X". |
| searchPaperBulk | Bulk paper search via /paper/search/bulk: 1000 results per page, full-corpus pagination. Best for "give me everything about X". |
| byPaper | Look up papers by ID. Accepts the 40-char Semantic Scholar SHA, plus prefixed external IDs: DOI:, ARXIV:, MAG:, PMID:, PMCID:, ACL:, DBLP:. Bare DOIs / arXiv IDs are auto-prefixed. |
| byPaperCitations | All papers that cite the given paper (with citation contexts and intents). |
| byPaperReferences | All papers cited by the given paper. |
| searchAuthor | Search authors by name. |
| byAuthor | Look up authors by Semantic Scholar author ID. |
| byAuthorPapers | All papers authored by the given Semantic Scholar author ID. |
| recommendations | Get related/similar papers via /recommendations/v1/papers/forpaper/{id}. |
| byUrl | Auto-route from Semantic Scholar / DOI / arXiv URLs. |
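
The auto-prefixing that byPaper applies to bare DOIs and arXiv IDs could look roughly like the sketch below. This is not the actor's code: the function name normalize_paper_id and the regular expressions are assumptions made for illustration, and the arXiv pattern only covers new-style IDs such as 1706.03762.

```python
import re

# Prefixes listed for the byPaper mode.
KNOWN_PREFIXES = ("DOI:", "ARXIV:", "MAG:", "PMID:", "PMCID:", "ACL:", "DBLP:")

SHA_RE = re.compile(r"^[0-9a-f]{40}$", re.IGNORECASE)   # 40-char Semantic Scholar SHA
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")               # bare DOI, e.g. 10.1145/3065386
ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")       # new-style arXiv ID only


def normalize_paper_id(raw: str) -> str:
    """Return an ID in prefixed form, guessing the prefix for bare DOIs/arXiv IDs."""
    value = raw.strip()
    if SHA_RE.match(value):
        return value.lower()
    if value.upper().startswith(KNOWN_PREFIXES):
        return value  # already prefixed, pass through unchanged
    if DOI_RE.match(value):
        return f"DOI:{value}"
    if ARXIV_RE.match(value):
        return f"ARXIV:{value}"
    return value  # unknown shape: leave it for the API to accept or reject


ids = [normalize_paper_id(x) for x in ["1706.03762", "10.1145/3065386", "DOI:10.1145/3065386"]]
```

Unrecognized values are returned untouched so that the lookup itself, rather than the helper, decides whether they are valid.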

Filters

Search modes accept:

  • year — single year (2023), open range (2018-, -2010), or closed range (2015-2020)
  • fieldsOfStudy — multi-select: Computer Science, Medicine, Chemistry, Biology, …
  • publicationTypes — multi-select: Review, JournalArticle, Conference, …
  • venues — free-text list (e.g., Nature, NeurIPS)
  • openAccessOnly — drop papers without an open-access PDF
  • minCitationCount — minimum citation count
  • sort (bulk search only) — relevance, citationCount:desc/asc, publicationDate:desc/asc
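
To make the three year forms concrete, here is a minimal sketch that parses a year expression into inclusive bounds. The helper names parse_year_filter and year_matches are hypothetical; the actor may well pass the string to the API unchanged.

```python
from typing import Optional

Bounds = tuple[Optional[int], Optional[int]]


def parse_year_filter(expr: str) -> Bounds:
    """Return (start, end), both inclusive; None means unbounded on that side."""
    text = expr.strip()
    if "-" not in text:
        year = int(text)            # single year, e.g. "2023"
        return year, year
    start, _, end = text.partition("-")
    return (int(start) if start else None, int(end) if end else None)


def year_matches(year: int, bounds: Bounds) -> bool:
    """Check whether a publication year falls inside the parsed bounds."""
    start, end = bounds
    return (start is None or year >= start) and (end is None or year <= end)


parsed = [parse_year_filter(e) for e in ["2023", "2018-", "-2010", "2015-2020"]]
```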

API key (optional)

The Semantic Scholar Graph API is public and free. An API key is not required, but raises rate limits 10x. Free signup: https://www.semanticscholar.org/product/api#api-key-form.

Without a key, the actor enforces a polite delay of about 1.5 s between requests so that a single run stays under the budget of 100 requests per 5 minutes.
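
For readers building their own client against the same budget, note that 100 requests per 5 minutes (300 s) averages one request every 3 s. The sketch below shows a sliding-window limiter that enforces such a budget directly. It is an alternative technique, not the actor's fixed-delay implementation, and the class name WindowBudget and its parameters are hypothetical.

```python
import time
from collections import deque
from typing import Callable


class WindowBudget:
    """Allow at most max_requests within any window_s-second window."""

    def __init__(self, max_requests: int = 100, window_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window_s = window_s
        self._clock = clock
        self._sent: deque[float] = deque()  # timestamps of recent requests

    def wait_time(self) -> float:
        """Seconds to wait before the next request fits the budget."""
        now = self._clock()
        # Forget requests that have left the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_s - (now - self._sent[0])

    def record(self) -> None:
        """Call right after sending a request."""
        self._sent.append(self._clock())
```

A caller would sleep for wait_time() seconds, send the request, then call record(). The injectable clock exists only to make the limiter easy to test.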

Example inputs

Search the literature on attention mechanisms

{
  "mode": "searchPaper",
  "searchQuery": "transformer attention",
  "fieldsOfStudy": ["Computer Science"],
  "year": "2017-",
  "minCitationCount": 50,
  "maxItems": 100
}

Fetch the "Attention Is All You Need" paper

{
  "mode": "byPaper",
  "paperIds": ["ARXIV:1706.03762"],
  "includeReferencesOnPaper": true,
  "maxItems": 1
}

All citations of a foundational paper

{
  "mode": "byPaperCitations",
  "paperIds": ["DOI:10.1145/3065386"],
  "maxItems": 500
}

All papers by Geoffrey Hinton

{
  "mode": "byAuthorPapers",
  "authorIds": ["1741101"],
  "maxItems": 200
}

Recommendations for a paper

{
  "mode": "recommendations",
  "paperIds": ["ARXIV:1706.03762"],
  "maxItems": 50
}
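
Inputs like the ones above can also be submitted programmatically. The following is a sketch using the third-party apify-client package for Python. The ACTOR_ID value is a placeholder because this page does not state the actor's ID, the APIFY_TOKEN environment variable is an assumed convention, and build_search_input and run_actor are hypothetical helper names.

```python
import os
from typing import Any, Optional

# Placeholder: replace with the actor's real ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"


def build_search_input(query: str, year: Optional[str] = None,
                       min_citations: Optional[int] = None,
                       max_items: int = 100) -> dict[str, Any]:
    """Assemble a searchPaper input using the field names from the examples."""
    run_input: dict[str, Any] = {
        "mode": "searchPaper",
        "searchQuery": query,
        "maxItems": max_items,
    }
    if year is not None:
        run_input["year"] = year
    if min_citations is not None:
        run_input["minCitationCount"] = min_citations
    return run_input


def run_actor(run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


search_input = build_search_input("transformer attention", year="2017-", min_citations=50)
```

Calling run_actor(search_input) would then start a billable run, so it is left out of the snippet itself.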

FAQ

How do I find a paper's Semantic Scholar ID? Use the URL on semanticscholar.org — the 40-char hex at the end is the ID. Or use a DOI / arXiv ID with the DOI: / ARXIV: prefix. The byUrl mode accepts any of these URL forms directly.
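
Pulling an ID out of a URL by hand can be sketched as follows. This is an illustrative approximation rather than the actor's byUrl routing: the function name paper_id_from_url, the accepted hosts, and the URL patterns are assumptions, and the Semantic Scholar URL in the example uses a made-up 40-char ID.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def paper_id_from_url(url: str) -> Optional[str]:
    """Map a Semantic Scholar, DOI, or arXiv URL to a paper ID, or None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.rstrip("/")
    if host.endswith("semanticscholar.org"):
        # The ID is the 40-char hex string at the end of the path.
        match = re.search(r"([0-9a-f]{40})$", path, re.IGNORECASE)
        return match.group(1).lower() if match else None
    if host in ("doi.org", "dx.doi.org"):
        doi = path.lstrip("/")
        return f"DOI:{doi}" if doi else None
    if host.endswith("arxiv.org"):
        # New-style IDs only; a version suffix such as v5 is dropped.
        match = re.match(r"^/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$", path)
        return f"ARXIV:{match.group(1)}" if match else None
    return None


FAKE_SHA = "0123456789abcdef0123456789abcdef01234567"  # made-up ID for illustration
example_ids = [
    paper_id_from_url(f"https://www.semanticscholar.org/paper/Some-Title/{FAKE_SHA}"),
    paper_id_from_url("https://doi.org/10.1145/3065386"),
    paper_id_from_url("https://arxiv.org/abs/1706.03762v5"),
]
```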

Why does my run say "0 records emitted"? Either the search query had no matches, or the filter combination was too narrow (e.g., minCitationCount: 100000 will drop almost everything). Loosen filters or check the status message.

Are abstracts always available? No. Older papers and some publishers don't share abstracts via the API. The actor omits the abstract field when missing rather than returning null.

What happens on rate-limit? The actor honours the Retry-After header on 429 responses and retries with exponential backoff. With a key you almost never hit the limit; without a key, large jobs slow to a crawl after 100 requests in any 5-min window.
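
The retry behaviour described above can be approximated as in the sketch below: prefer the server's Retry-After value when it is given in seconds, otherwise fall back to capped exponential backoff. The names retry_delay and fetch_with_retry, the base and cap values, and the injected send callable are all hypothetical. The actor's actual parameters are not documented here.

```python
import time
from typing import Any, Callable, Mapping, Optional

Response = tuple[int, Mapping[str, str], Any]  # (status, headers, body)


def retry_delay(attempt: int, retry_after: Optional[str],
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after a 429 on the given zero-based attempt."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * (2 ** attempt))


def fetch_with_retry(send: Callable[[], Response], max_attempts: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> tuple[int, Any]:
    """Call send() until it returns something other than 429."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError("still rate-limited after all retries")
```

Passing send and sleep in as callables keeps the retry logic independent of any particular HTTP library.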

Can I get reference / citation counts without fetching all the papers? Yes — searchPaper and byPaper already return citationCount, referenceCount, and influentialCitationCount in the default field set.

Are the open-access PDF URLs hotlink-blocked? No. They point at the original publisher / arXiv / preprint server and resolve from a clean shell.

Limitations

  • The recommendations endpoint returns up to 100 recommendations per source paper.
  • The Graph API limits each call to 1000 records max; bulk search can paginate beyond 1000.
  • tldr (AI summary) is only generated for a subset of papers.
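
Paginating past the per-call limit in bulk search can be sketched as below. The sketch assumes each bulk-search response carries a data list and a continuation token that is absent or empty on the last page, which is how I understand the public API to behave. The function name iter_bulk_results and the injected fetch_page callable are hypothetical.

```python
from typing import Any, Callable, Iterator, Optional

Page = dict[str, Any]


def iter_bulk_results(fetch_page: Callable[[Optional[str]], Page],
                      max_items: int) -> Iterator[dict[str, Any]]:
    """Yield up to max_items records, following continuation tokens."""
    token: Optional[str] = None
    emitted = 0
    while emitted < max_items:
        page = fetch_page(token)          # first call passes token=None
        for item in page.get("data") or []:
            if emitted >= max_items:
                return
            yield item
            emitted += 1
        token = page.get("token")
        if not token:                     # no token means no further pages
            return


# Fake two-page result set standing in for real API responses.
fake_pages: dict[Optional[str], Page] = {
    None: {"data": [{"paperId": "a"}, {"paperId": "b"}], "token": "t1"},
    "t1": {"data": [{"paperId": "c"}], "token": None},
}
all_ids = [p["paperId"] for p in iter_bulk_results(lambda tok: fake_pages[tok], 10)]
```

In a real client, fetch_page would issue the HTTP request with the token as a query parameter, with the cap on max_items playing the same role as the actor's maxItems input.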

Source

Data is fetched from the official Semantic Scholar API: https://api.semanticscholar.org/graph/v1. The Allen Institute for AI publishes the API for academic and non-commercial use. See the API terms.