Semantic Scholar Scraper
Pricing
from $3.00 / 1,000 results
Scrape Semantic Scholar: 200M+ academic papers and authors with a full citation graph. Search, fetch by paper or author ID, and get citations, references, and recommendations, with abstracts, TLDRs, fields of study, open-access PDFs, h-index, affiliations, and more.
Rating
5.0
(16)
Developer
Crawler Bros
Maintained by Community
Actor stats
Bookmarked: 16
Total users: 2
Monthly active users: 1
Last modified: 2 days ago
Scrape Semantic Scholar — Allen Institute for AI's open catalog of 200M+ academic papers and authors with a full citation graph — directly via the official Semantic Scholar Graph API.
What you get
For every paper:
- paperId, corpusId, externalIds (DOI, arXiv, MAG, PMID, ACL, DBLP)
- title, abstract, tldr (AI-generated summary)
- year, publicationDate, venue, publicationVenue, journal
- authors: list of {authorId, name}, plus primaryAuthor
- fieldsOfStudy, s2FieldsOfStudy (with source attribution)
- publicationTypes (Review, JournalArticle, Conference, …)
- referenceCount, citationCount, influentialCitationCount
- isOpenAccess, openAccessPdf ({url, status, license})
- semanticScholarUrl
For every author:
- authorId, name, aliases, affiliations, homepage
- paperCount, citationCount, hIndex
- externalIds (ORCID, DBLP)
- semanticScholarUrl
For citation/reference relations:
- The full paper record of the citing/cited paper
- citationContexts (text snippets of where it was cited)
- citationIntents (background, methodology, result)
- isInfluentialCitation
Modes
| Mode | What it does |
|---|---|
| searchPaper | Relevance-ranked paper search via /paper/search. Best for "find me the top N papers about X". |
| searchPaperBulk | Bulk paper search via /paper/search/bulk: 1000 results per page, full-corpus pagination. Best for "give me everything about X". |
| byPaper | Look up papers by ID. Accepts the 40-char Semantic Scholar SHA, plus prefixed external IDs: DOI:, ARXIV:, MAG:, PMID:, PMCID:, ACL:, DBLP:. Bare DOIs / arXiv IDs are auto-prefixed. |
| byPaperCitations | All papers that cite the given paper (with citation contexts and intents). |
| byPaperReferences | All papers cited by the given paper. |
| searchAuthor | Search authors by name. |
| byAuthor | Look up authors by Semantic Scholar author ID. |
| byAuthorPapers | All papers authored by the given Semantic Scholar author ID. |
| recommendations | Get related/similar papers via /recommendations/v1/papers/forpaper/{id}. |
| byUrl | Auto-route from Semantic Scholar / DOI / arXiv URLs. |
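The auto-prefixing described for byPaper could look roughly like the sketch below. The function name normalize_paper_id and the detection patterns are assumptions made for illustration; the actor's real rules are not documented here and may differ.

```python
import re

# Prefixes listed for the byPaper mode.
KNOWN_PREFIXES = ("DOI:", "ARXIV:", "MAG:", "PMID:", "PMCID:", "ACL:", "DBLP:")

# Heuristic patterns (assumptions): a 40-char hex SHA, a DOI starting with "10.",
# and a new-style arXiv ID such as 1706.03762 or 1706.03762v5.
SHA_RE = re.compile(r"^[0-9a-f]{40}$", re.IGNORECASE)
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
ARXIV_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def normalize_paper_id(raw: str) -> str:
    """Return an ID in a form the byPaper mode accepts (hypothetical helper)."""
    value = raw.strip()
    if value.upper().startswith(KNOWN_PREFIXES):
        # Already prefixed: only normalize the prefix's letter case.
        prefix, rest = value.split(":", 1)
        return f"{prefix.upper()}:{rest}"
    if SHA_RE.match(value):
        return value.lower()
    if DOI_RE.match(value):
        return f"DOI:{value}"
    if ARXIV_RE.match(value):
        return f"ARXIV:{value}"
    raise ValueError(f"Unrecognized paper ID: {raw!r}")
```

With these patterns, a bare 10.1145/3065386 becomes DOI:10.1145/3065386 and a bare 1706.03762 becomes ARXIV:1706.03762. Old-style arXiv IDs (for example hep-th/9901001) are not covered by this sketch and would need their prefix written out.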
Filters
Search modes accept:
- year: single year (2023), open range (2018-, -2010), or closed range (2015-2020)
- fieldsOfStudy: multi-select (Computer Science, Medicine, Chemistry, Biology, …)
- publicationTypes: multi-select (Review, JournalArticle, Conference, …)
- venues: free-text list (e.g., Nature, NeurIPS)
- openAccessOnly: drop papers without an open-access PDF
- minCitationCount: minimum citation count
- sort (bulk search only): relevance, citationCount:desc/asc, publicationDate:desc/asc
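To make the year syntax concrete, here is a small sketch of how the four accepted forms could be interpreted. Both parse_year_filter and year_matches are hypothetical helpers written for this illustration, and treating both bounds as inclusive is an assumption.

```python
from typing import Optional, Tuple


def parse_year_filter(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Parse '2023', '2018-', '-2010' or '2015-2020' into (start, end).

    None means the range is open on that side.
    """
    text = spec.strip()
    if "-" not in text:
        year = int(text)
        return year, year
    start_text, end_text = text.split("-", 1)
    start = int(start_text) if start_text else None
    end = int(end_text) if end_text else None
    if start is None and end is None:
        raise ValueError("year filter needs at least one bound")
    if start is not None and end is not None and start > end:
        raise ValueError(f"start year {start} is after end year {end}")
    return start, end


def year_matches(year: Optional[int], spec: str) -> bool:
    """True if a paper's year falls inside the filter (bounds assumed inclusive)."""
    if year is None:
        return False  # records without a year cannot satisfy a year filter
    start, end = parse_year_filter(spec)
    return (start is None or year >= start) and (end is None or year <= end)
```

For example, parse_year_filter("2018-") gives (2018, None), so year_matches(2017, "2018-") is False while year_matches(2024, "2018-") is True.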
API key (optional)
The Semantic Scholar Graph API is public and free. An API key is not required, but raises rate limits 10x. Free signup: https://www.semanticscholar.org/product/api#api-key-form.
Without a key the actor enforces a polite ~1.5s delay between requests so a single run stays under the 100-requests-per-5-minutes budget.
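If you call the API yourself without a key, a similar pacing policy can be sketched as follows. This is not the actor's code: RequestPacer and its parameters are hypothetical, with defaults taken from the figures above (about 1.5 s between requests, 100 requests per 5 minutes).

```python
import time
from collections import deque


class RequestPacer:
    """Space requests by a minimum delay and cap them per sliding window."""

    def __init__(self, min_delay=1.5, max_requests=100, window=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.sent = deque()  # timestamps of requests still inside the window

    def _expire(self, now):
        # Forget requests that have left the sliding window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()

    def wait(self):
        """Block until the next request is allowed, then record it."""
        now = self.clock()
        self._expire(now)
        pause = 0.0
        if self.sent:
            # Keep at least min_delay between consecutive requests.
            pause = max(pause, self.sent[-1] + self.min_delay - now)
        if len(self.sent) >= self.max_requests:
            # Window is full: wait until the oldest request expires.
            pause = max(pause, self.sent[0] + self.window - now)
        if pause > 0:
            self.sleep(pause)
            now = self.clock()
            self._expire(now)
        self.sent.append(now)
```

Call pacer.wait() before each request. The sketch tracks the window explicitly because a fixed 1.5 s spacing on its own would still allow up to 200 requests in five minutes.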
Example inputs
Search the literature on attention mechanisms
{"mode": "searchPaper","searchQuery": "transformer attention","fieldsOfStudy": ["Computer Science"],"year": "2017-","minCitationCount": 50,"maxItems": 100}
Fetch the "Attention Is All You Need" paper
{"mode": "byPaper","paperIds": ["ARXIV:1706.03762"],"includeReferencesOnPaper": true,"maxItems": 1}
All citations of a foundational paper
{"mode": "byPaperCitations","paperIds": ["DOI:10.1145/3065386"],"maxItems": 500}
All papers by Geoffrey Hinton
{"mode": "byAuthorPapers","authorIds": ["1741101"],"maxItems": 200}
Recommendations for a paper
{"mode": "recommendations","paperIds": ["ARXIV:1706.03762"],"maxItems": 50}
FAQ
How do I find a paper's Semantic Scholar ID?
Use the URL on semanticscholar.org — the 40-char hex at the end is the ID. Or use a DOI / arXiv ID with the DOI: / ARXIV: prefix. The byUrl mode accepts any of these URL forms directly.
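If you would rather extract the ID yourself than use byUrl, the mapping can be sketched like this. The helper name paper_id_from_url and the URL patterns are assumptions for illustration and only cover the common paper URL shapes; they are not the actor's actual routing logic.

```python
import re
from urllib.parse import urlparse, unquote

# A 40-char hex ID at the end of the path, optionally followed by a slash.
SHA_AT_END = re.compile(r"([0-9a-f]{40})/?$", re.IGNORECASE)
# arXiv abstract or PDF paths, e.g. /abs/1706.03762 or /pdf/1706.03762.pdf
ARXIV_PATH = re.compile(r"^/(?:abs|pdf)/(.+?)(?:\.pdf)?$")


def paper_id_from_url(url: str) -> str:
    """Map a Semantic Scholar, DOI, or arXiv URL to a paper ID (hypothetical helper)."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = unquote(parsed.path)
    if host == "semanticscholar.org":
        match = SHA_AT_END.search(path)
        if match:
            return match.group(1).lower()
    elif host in ("doi.org", "dx.doi.org"):
        doi = path.lstrip("/")
        if doi:
            return "DOI:" + doi
    elif host == "arxiv.org":
        match = ARXIV_PATH.match(path)
        if match:
            return "ARXIV:" + match.group(1)
    raise ValueError(f"Cannot derive a paper ID from {url!r}")
```

For example, https://doi.org/10.1145/3065386 maps to DOI:10.1145/3065386 and https://arxiv.org/abs/1706.03762 maps to ARXIV:1706.03762.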
Why does my run say "0 records emitted"?
Either the search query had no matches, or the filter combination was too narrow (e.g., minCitationCount: 100000 will drop almost everything). Loosen filters or check the status message.
Are abstracts always available?
No. Older papers and some publishers don't share abstracts via the API. The actor omits the abstract field when missing rather than returning null.
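Because the key can be absent entirely, downstream code should not index record["abstract"] directly. A minimal sketch of defensive handling, assuming each dataset item is a plain dict (split_by_abstract is a hypothetical helper):

```python
from typing import Iterable, List, Tuple


def split_by_abstract(records: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate records that carry an abstract from those that do not."""
    with_abstract: List[dict] = []
    without_abstract: List[dict] = []
    for record in records:
        # .get() returns None when the key is missing, so no KeyError is raised.
        if record.get("abstract"):
            with_abstract.append(record)
        else:
            without_abstract.append(record)
    return with_abstract, without_abstract


# Made-up sample records, for illustration only.
sample = [
    {"paperId": "p1", "title": "First", "abstract": "Some text."},
    {"paperId": "p2", "title": "Second"},
]
have, missing = split_by_abstract(sample)
print(len(have), len(missing))  # prints "1 1"
```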
What happens on rate-limit?
The actor honours the Retry-After header on 429 responses and retries with exponential backoff. With a key you almost never hit the limit; without a key, large jobs slow to a crawl after 100 requests in any 5-min window.
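The same retry policy is easy to reproduce when calling the API directly. The sketch below is an illustration under stated assumptions, not the actor's implementation: call_with_backoff is a hypothetical helper, send is any callable you supply that returns (status_code, headers, body), and only the numeric-seconds form of Retry-After is handled.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it stops returning HTTP 429 or attempts run out.

    headers is assumed to be a dict keyed by the canonical header name.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.strip().isdigit():
            delay = float(retry_after)  # server-specified wait, in seconds
        else:
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay)
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts")
```

Preferring the server's Retry-After value over the computed delay avoids retrying before the server is ready to accept requests again.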
Can I get reference / citation counts without fetching all the papers?
Yes — searchPaper and byPaper already return citationCount, referenceCount, and influentialCitationCount in the default field set.
Are the open-access PDF URLs hotlink-blocked?
No. They point at the original publisher / arXiv / preprint server and resolve from a clean shell.
Limitations
- The recommendations endpoint returns up to 100 recommendations per source paper.
- The Graph API limits each call to 1000 records max; bulk search can paginate beyond 1000.
- tldr (AI summary) is only generated for a subset of papers.
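Paginating past the 1000-record cap in bulk search works by passing a continuation token from one page to the next. A minimal sketch of that loop, assuming a caller-supplied fetch_page(token) that returns a dict shaped like {"data": [...], "token": ...} (iterate_bulk and fetch_page are hypothetical names, and the actor's own pagination code may differ):

```python
def iterate_bulk(fetch_page, max_items=None):
    """Yield records from a token-paginated bulk search.

    fetch_page(token) is called with None for the first page and must return
    a dict with a "data" list and, while more pages exist, a "token" string.
    """
    token = None
    emitted = 0
    while max_items is None or emitted < max_items:
        page = fetch_page(token)
        for record in page.get("data", []):
            yield record
            emitted += 1
            if max_items is not None and emitted >= max_items:
                return  # stop without requesting further pages
        token = page.get("token")
        if not token:
            return  # no continuation token: this was the last page
```

Writing it as a generator means that a maxItems-style cap stops the run as soon as enough records have been produced, without fetching an extra page.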
Source
Data is fetched from the official Semantic Scholar API: https://api.semanticscholar.org/graph/v1. The Allen Institute for AI publishes the API for academic and non-commercial use. See the API terms.