Semantic Scholar Scraper
Pricing
from $6.00 / 1,000 results
[$6 / 1K] Extract academic papers, abstracts, citations, references, authors, and open-access PDF links from Semantic Scholar's 200M+ database. Search by keyword, paper ID/DOI/URL, or author. Filter by year, field, and citations. No API key.
Developer: SolidCode (maintained by Community)
Pull academic papers, author profiles, and full citation graphs from Semantic Scholar's 200M+ paper corpus, complete with abstracts, DOIs, arXiv IDs, h-index metrics, citation counts, and direct open-access PDF links. Search by keyword, fetch an exact paper by DOI or arXiv ID, or look up an author profile in one run. Built for researchers, systematic-review teams, and data scientists who need a clean, structured scholarly dataset across every discipline without stitching together the public API one page at a time.
Why This Scraper?
- Three ways in, one dataset: keyword search across titles and abstracts, direct fetch by Semantic Scholar paper ID / DOI / arXiv ID / CorpusId / PMID / URL, and author lookup by ID or profile URL. Mix all three in a single run.
- 200M+ papers across 23 fields of study: filter to any combination of Computer Science, Medicine, Biology, Physics, Economics, Mathematics, Law, Linguistics, and 15 more. These are exact filters, not fuzzy "suggestions".
- 12 publication-type filters: narrow to peer-reviewed `JournalArticle`, `Review`, `MetaAnalysis`, `ClinicalTrial`, `Conference`, `Dataset`, `Book`, and more for systematic-review-grade precision.
- Citation + reference graph expansion: opt in to pull every paper that cites a work, or every paper it references, as separate rows, capped per paper so even "Attention Is All You Need" stays bounded.
- Author profiles with h-index: name, affiliations, paper count, total citations, h-index, and homepage as first-class records, plus an opt-in full publication list per author.
- Identifier-rich rows: every paper carries its DOI, arXiv ID, native paper ID, author IDs, influential-citation count, and canonical URL, so you can join against PubMed, Crossref, or arXiv downstream.
- Direct open-access PDF links: `openAccessPdfUrl` and an `isOpenAccess` flag surface free full text on every eligible paper, with an open-access-only filter to keep just the downloadable ones.
- High-impact filtering: minimum-citation-count, year-range, and sort-by-citations-or-date controls let you surface the most-cited or most-recent work in a field instantly.
- No API key, no sign-up: go from a keyword or a DOI to a structured dataset of up to 10,000 papers per query.
Use Cases
Literature Reviews & Systematic Reviews
- Assemble a complete, deduplicated reading list for a new topic in minutes
- Filter to `Review` and `MetaAnalysis` types for evidence-synthesis projects
- Restrict to open-access PDFs to build a downloadable full-text corpus
Research Trend Analysis
- Track publication volume in a field across a year range
- Surface the most-cited papers of the last two years with citation sorting
- Detect emerging sub-fields from a burst of recent open-access work
Citation Network Mapping
- Expand a seminal paper's citing-paper graph to find follow-up research
- Pull a paper's reference list to trace its intellectual lineage
- Build directed citation edges between papers for bibliometric graphs
Competitive Research Intelligence
- Monitor what a lab or institution is publishing by author ID
- Benchmark researcher output with h-index, paper count, and total citations
- Quantify a topic's influence with influential-citation counts
Academic Lead Generation
- Find domain experts to quote, interview, or recruit via author profiles
- Pull affiliations and homepages for outreach to corresponding researchers
- Identify rising authors by citation growth in a specific field
Dataset Building for Machine Learning
- Harvest titles + abstracts at scale for NLP and recommendation models
- Build labeled corpora filtered by field of study and publication type
- Collect open-access PDF links for full-text mining pipelines
Getting Started
Basic Keyword Search
The simplest run, with one topic and 100 papers:
{"searchQueries": ["large language models"],"maxResults": 100}
Filtered Search (Year + Field + Open Access)
Narrow to recent, high-impact, open-access computer science work and sort by citations:
{"searchQueries": ["retrieval augmented generation"],"yearFrom": 2023,"yearTo": 2025,"fieldsOfStudy": ["Computer Science"],"publicationTypes": ["JournalArticle", "Conference"],"openAccessOnly": true,"minCitationCount": 25,"sortBy": "citationCount","maxResults": 200}
Direct Fetch with Citation + Reference Graph
Fetch exact papers by DOI and arXiv ID, then pull who cites them and what they reference:
{"paperIds": ["10.1038/nature14539","arXiv:1706.03762","https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776"],"includeCitations": true,"includeReferences": true,"maxCitationsPerPaper": 50}
Author Profile Lookup
Pull author profiles by ID or URL, with their full publication lists:
{"authorIds": ["1741101", "https://www.semanticscholar.org/author/2061296"],"includeAuthorPapers": true,"maxResults": 200}
To find an author ID, open any Semantic Scholar author page and copy the number after /author/ in the URL.
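The actor accepts profile URLs directly, but it can still be handy to normalize a mixed list of IDs and URLs locally, for example to spot duplicates before a run. The sketch below applies the "number after /author/" rule in Python. The helper name `extract_author_id` is hypothetical, and the optional name segment in the URL pattern is an assumption about how some profile URLs are shaped, not something this page documents.

```python
import re

# Assumed URL shapes: .../author/2061296 or .../author/Some-Name/2061296.
# The optional name segment is an assumption, not documented behaviour.
_AUTHOR_URL = re.compile(r"/author/(?:[^/]+/)?(\d+)/?(?:[?#].*)?$")


def extract_author_id(value: str) -> str:
    """Return the numeric author ID from a bare ID or a profile URL."""
    value = value.strip()
    if value.isdigit():
        return value  # already a bare numeric ID
    match = _AUTHOR_URL.search(value)
    if match is None:
        raise ValueError(f"no author ID found in {value!r}")
    return match.group(1)


if __name__ == "__main__":
    inputs = ["1741101", "https://www.semanticscholar.org/author/2061296"]
    print([extract_author_id(item) for item in inputs])
```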
Input Reference
Search & Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| `searchQueries` | string[] | ["large language models"] | Keywords searched across paper titles and abstracts. Each query produces its own result set. |
| `paperIds` | string[] | [] | Fetch exact papers by Semantic Scholar paper ID, DOI, arXiv ID, CorpusId, PMID, or paper URL. One record per paper. |
| `authorIds` | string[] | [] | Author IDs (numeric) or full profile URLs. Returns an author-profile record with name, affiliations, h-index, and citation count. |
| `maxResults` | integer | 100 | Maximum papers per search query, an exact cap on what you are charged. Set to 0 for all available results (capped at 10,000 per query). |
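Because each query produces its own result set, the same paper could in principle appear under more than one query. Whether the actor merges such overlaps is not stated here, so the following Python sketch shows one way to deduplicate exported rows locally. It relies only on the `recordType`, `relation`, and `paperId` output fields; the function name `dedupe_papers` is hypothetical.

```python
from typing import Iterable


def dedupe_papers(rows: Iterable[dict]) -> list[dict]:
    """Keep the first primary paper row seen for each paperId."""
    seen: set[str] = set()
    unique: list[dict] = []
    for row in rows:
        # Skip author profiles and citation/reference/author-paper child rows.
        if row.get("recordType") != "paper" or row.get("relation") is not None:
            continue
        paper_id = row.get("paperId")
        if paper_id is None or paper_id in seen:
            continue
        seen.add(paper_id)
        unique.append(row)
    return unique
```

Keeping the first occurrence preserves the `sourceQuery` of whichever query surfaced the paper first.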
Filters
Filters apply to search queries only, not to directly-fetched papers or authors.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `yearFrom` | integer | null | Only include papers published in this year or later (1900-2100). |
| `yearTo` | integer | null | Only include papers published in this year or earlier (1900-2100). |
| `fieldsOfStudy` | string[] | [] | Restrict to one or more of 23 research fields (Computer Science, Medicine, Biology, Physics, Economics, and more). |
| `publicationTypes` | string[] | [] | Restrict to one or more of 12 types: Review, JournalArticle, CaseReport, ClinicalTrial, Conference, Dataset, Editorial, LettersAndComments, MetaAnalysis, News, Study, Book. |
| `openAccessOnly` | boolean | false | Only return papers with a free, downloadable open-access PDF. |
| `minCitationCount` | integer | null | Only return papers cited at least this many times, ideal for surfacing high-impact work. |
| `sortBy` | string | "relevance" | "Relevance" (default order), "Most cited first" (by citation count), or "Most recent first" (by publication date). |
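When run inputs are generated from a script, it may help to check the filter values against the documented ranges before starting a run. The Python sketch below builds a search input from the parameters in the table above. The function `build_search_input` and its keyword names are hypothetical, and the checks are purely local: how the actor itself reacts to out-of-range values is not documented here.

```python
# The 12 publication types listed in the filter table.
PUBLICATION_TYPES = {
    "Review", "JournalArticle", "CaseReport", "ClinicalTrial", "Conference",
    "Dataset", "Editorial", "LettersAndComments", "MetaAnalysis", "News",
    "Study", "Book",
}


def build_search_input(queries, *, year_from=None, year_to=None,
                       publication_types=(), min_citation_count=None,
                       open_access_only=False, max_results=100):
    """Return a run-input dict, raising ValueError on invalid filters."""
    for label, year in (("yearFrom", year_from), ("yearTo", year_to)):
        if year is not None and not 1900 <= year <= 2100:
            raise ValueError(f"{label} must be between 1900 and 2100, got {year}")
    if year_from is not None and year_to is not None and year_from > year_to:
        raise ValueError("yearFrom must not be later than yearTo")
    unknown = sorted(set(publication_types) - PUBLICATION_TYPES)
    if unknown:
        raise ValueError(f"unknown publication types: {unknown}")

    run_input = {"searchQueries": list(queries), "maxResults": max_results}
    optional = {
        "yearFrom": year_from,
        "yearTo": year_to,
        "publicationTypes": list(publication_types) or None,
        "minCitationCount": min_citation_count,
        "openAccessOnly": open_access_only or None,
    }
    # Only send filters that were actually set, leaving the rest at defaults.
    run_input.update({k: v for k, v in optional.items() if v is not None})
    return run_input
```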
Output Options
The citation, reference, and author-paper expansions each add one row per child item, which multiplies your result count and cost, so leave them off unless you need the full graph.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `includeAbstracts` | boolean | true | Include the abstract text for each paper. Disable to shrink the dataset. |
| `includeReferences` | boolean | false | For each paper, also output the papers it cites (its reference list) as separate records. |
| `includeCitations` | boolean | false | For each paper, also output the papers that cite it as separate records. |
| `maxCitationsPerPaper` | integer | 50 | Caps how many citing/referenced papers are fetched per source paper (1-1000) when expansion is on. |
| `includeAuthorPapers` | boolean | false | When you provide author IDs, also output each author's publications as separate paper records. |
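To get a feel for how quickly the graph expansions multiply rows, here is a rough upper-bound calculation in Python. It assumes `maxCitationsPerPaper` applies separately to the citation list and to the reference list of each source paper; the table does not spell this out, so treat the result as an estimate. The function name `max_rows` is hypothetical.

```python
def max_rows(primary_papers: int,
             include_citations: bool = False,
             include_references: bool = False,
             max_citations_per_paper: int = 50) -> int:
    """Upper bound on output rows for a paper search or direct fetch.

    Assumes the per-paper cap is applied to each enabled expansion
    independently (an assumption, not confirmed by the documentation).
    """
    expansions = int(include_citations) + int(include_references)
    return primary_papers * (1 + expansions * max_citations_per_paper)


if __name__ == "__main__":
    # 10 source papers with both expansions on at the default cap of 50.
    print(max_rows(10, include_citations=True, include_references=True))
```

Actual counts will be lower whenever a paper has fewer citing or referenced papers than the cap.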
Output
Every row carries a `recordType` field, either `paper` or `author`, so you can filter cleanly downstream. The dataset ships with two ready-made views: Papers and Author profiles.
Paper (recordType: "paper")
{"recordType": "paper","paperId": "204e3073870fae3d05bcbc2f6a8e263d9b72e776","title": "Attention Is All You Need","abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...","authors": [{"authorId": "40348417", "name": "Ashish Vaswani"},{"authorId": "1846258", "name": "Noam Shazeer"}],"year": 2017,"publicationDate": "2017-06-12","venue": "Neural Information Processing Systems","publicationTypes": ["JournalArticle", "Conference"],"fieldsOfStudy": ["Computer Science"],"citationCount": 102543,"referenceCount": 41,"influentialCitationCount": 12876,"doi": "10.48550/arXiv.1706.03762","arxivId": "1706.03762","isOpenAccess": true,"openAccessPdfUrl": "https://arxiv.org/pdf/1706.03762.pdf","url": "https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776","sourceQuery": "transformer architecture","parentPaperId": null,"parentAuthorId": null,"relation": null,"scrapedAt": "2026-06-02T10:30:00.000Z"}
Core Fields
| Field | Type | Description |
|---|---|---|
| `recordType` | string | Always "paper" |
| `title` | string | Paper title |
| `abstract` | string | Abstract text (null when abstracts are off or unavailable) |
| `authors` | object[] | Authors, each {authorId, name} |
| `year` | number | Publication year |
| `publicationDate` | string | ISO publication date when available |
| `venue` | string | Journal or conference name |
| `publicationTypes` | string[] | E.g. ["JournalArticle"] |
| `fieldsOfStudy` | string[] | E.g. ["Computer Science"] |
Identifiers
| Field | Type | Description |
|---|---|---|
| `paperId` | string | Native 40-character Semantic Scholar paper ID |
| `doi` | string | DOI when available |
| `arxivId` | string | arXiv ID when available |
| `url` | string | Canonical Semantic Scholar paper page |
Metrics
| Field | Type | Description |
|---|---|---|
| `citationCount` | number | Times this paper has been cited |
| `referenceCount` | number | Number of references in this paper |
| `influentialCitationCount` | number | Semantic Scholar's "influential" citation count |
Open Access & Lineage
| Field | Type | Description |
|---|---|---|
| `isOpenAccess` | boolean | Whether a free open-access PDF exists |
| `openAccessPdfUrl` | string | Direct PDF link when open access |
| `sourceQuery` | string | The search query that produced this row (null for direct fetches) |
| `parentPaperId` | string | Source paper ID on citation/reference child rows (null on primary rows) |
| `parentAuthorId` | string | Source author ID on author-publication child rows (null on primary rows) |
| `relation` | string | "citation", "reference", or "authorPaper" on child rows (null on primary rows) |
| `scrapedAt` | string | ISO 8601 timestamp |
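The lineage fields are enough to rebuild directed citation edges from an exported dataset. The Python sketch below reads the direction off the option descriptions: a `"citation"` child row is a paper that cites its parent, and a `"reference"` child row is a paper the parent cites. The function name `citation_edges` is hypothetical.

```python
from typing import Iterable


def citation_edges(rows: Iterable[dict]) -> list[tuple[str, str]]:
    """Return sorted, unique (citing_paper_id, cited_paper_id) pairs."""
    edges: set[tuple[str, str]] = set()
    for row in rows:
        parent = row.get("parentPaperId")
        child = row.get("paperId")
        if parent is None or child is None:
            continue  # primary rows and author-paper rows carry no paper edge
        relation = row.get("relation")
        if relation == "citation":
            edges.add((child, parent))   # the child row cites the parent
        elif relation == "reference":
            edges.add((parent, child))   # the parent cites the child row
    return sorted(edges)
```

The resulting pairs can be loaded into a graph library or written out as an edge list for bibliometric tools.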
Author Profile (recordType: "author")
{"recordType": "author","authorId": "1741101","name": "Geoffrey E. Hinton","affiliations": ["University of Toronto", "Google"],"homepage": "https://www.cs.toronto.edu/~hinton/","paperCount": 412,"citationCount": 631204,"hIndex": 178,"url": "https://www.semanticscholar.org/author/1741101","scrapedAt": "2026-06-02T10:30:00.000Z"}
| Field | Type | Description |
|---|---|---|
| `recordType` | string | Always "author" |
| `authorId` | string | Numeric Semantic Scholar author ID |
| `name` | string | Author display name |
| `affiliations` | string[] | Listed affiliations |
| `homepage` | string | Homepage URL when available |
| `paperCount` | number | Number of papers attributed to the author |
| `citationCount` | number | Total citations across all papers |
| `hIndex` | number | Author h-index |
| `url` | string | Canonical Semantic Scholar author page |
| `scrapedAt` | string | ISO 8601 timestamp |
When includeAuthorPapers is on, each author's publications are also emitted as paper rows alongside the profile, so an author's full body of work lands in the Papers view ready to filter and sort.
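As a sanity check on the `hIndex` field, the standard h-index (the largest h such that h papers each have at least h citations) can be recomputed from an author's paper rows. This Python sketch uses the `relation`, `parentAuthorId`, and `citationCount` fields; both function names are hypothetical. The recomputed value may differ from the reported one if the publication list was truncated or if Semantic Scholar counts papers differently.

```python
from typing import Iterable


def h_index(citation_counts: Iterable[int]) -> int:
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_index_for_author(rows: Iterable[dict], author_id: str) -> int:
    """Recompute the h-index from one author's author-paper child rows."""
    counts = [
        row.get("citationCount") or 0  # treat missing counts as zero
        for row in rows
        if row.get("relation") == "authorPaper"
        and row.get("parentAuthorId") == author_id
    ]
    return h_index(counts)
```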
Tips for Best Results
- Fetch by DOI or arXiv ID for guaranteed exact matches. Keyword search is fuzzy; a DOI like `10.1038/nature14539` or an arXiv ID like `arXiv:1706.03762` resolves to exactly one paper, every time, which is perfect for verifying a known reference.
- Narrow broad topics with filters. A bare query like `"machine learning"` returns a flood. Add a `yearFrom`, a `fieldsOfStudy` value, and a `minCitationCount` to surface a tight, high-signal set.
- Use `minCitationCount` for impact triage. Set it to 50 or 100 to skip preprints and low-impact work when you only want established, well-cited literature.
- Filter to `Review` and `MetaAnalysis` for evidence synthesis. These publication types are the backbone of systematic reviews and save hours of manual screening.
- Turn off abstracts on large harvests. Setting `includeAbstracts: false` shrinks every row and speeds up runs when you only need metadata and metrics.
- Keep `maxResults` modest when expanding the citation graph. `includeCitations` and `includeReferences` multiply rows per paper, so pair them with a small `maxResults` (5-20) and a sensible `maxCitationsPerPaper` to keep runs predictable.
- Sort by "Most recent first" for monitoring. Re-run a saved query on a schedule with date sorting to catch new publications in your field as they land.
Pricing
From $6 per 1,000 results (the Gold rate; $7.20 per 1,000 without a discount), a low-cost way to pull discipline-spanning academic data with citation graphs and author metrics bundled in. Bronze, Silver, and Gold subscribers pay progressively less; the table below shows total cost at each discount tier.
| Results | No discount | Bronze | Silver | Gold |
|---|---|---|---|---|
| 100 | $0.72 | $0.68 | $0.64 | $0.60 |
| 1,000 | $7.20 | $6.80 | $6.40 | $6.00 |
| 10,000 | $72.00 | $68.00 | $64.00 | $60.00 |
| 100,000 | $720.00 | $680.00 | $640.00 | $600.00 |
No compute or time-based charges β you pay per result, plus a small fixed per-run start fee. A "result" is any row in the output dataset: a paper, an author profile, or a citing/referenced/author-paper row from the opt-in graph expansions (so enabling those expansions increases your result count). Platform fees depend on your Apify plan.
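For budgeting, the per-result part of the bill can be reproduced from the per-1,000 rates in the table. The Python sketch below covers only that part: the per-run start fee and any platform fees are left out because their amounts are not given here. The tier keys and the function name `estimate_result_cost` are hypothetical.

```python
from decimal import Decimal

# Price per 1,000 results, taken from the pricing table above.
PRICE_PER_1000 = {
    "none": Decimal("7.20"),
    "bronze": Decimal("6.80"),
    "silver": Decimal("6.40"),
    "gold": Decimal("6.00"),
}


def estimate_result_cost(results: int, tier: str = "none") -> Decimal:
    """Per-result cost in USD, excluding the start fee and platform fees."""
    if results < 0:
        raise ValueError("results must be non-negative")
    cost = PRICE_PER_1000[tier] * results / 1000
    return cost.quantize(Decimal("0.01"))  # round to whole cents


if __name__ == "__main__":
    for tier in PRICE_PER_1000:
        print(tier, estimate_result_cost(10_000, tier))
```

Remember that every citation, reference, and author-paper row counts as a result, so pass the total row count rather than the number of source papers.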
Integrations
Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:
- Zapier / Make / n8n: Workflow automation
- Google Sheets: Direct spreadsheet export
- Slack / Email: Notifications on new results
- Webhooks: Trigger custom APIs on run completion
- Apify API: Full programmatic access
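As an illustration of programmatic access, the Python sketch below prepares a request to Apify's synchronous "run and return dataset items" endpoint using only the standard library. The endpoint path and Bearer-token header reflect my understanding of the Apify API v2 and should be checked against its reference. The actor ID in the usage example is a placeholder, since the real ID is not given on this page, and both function names are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) a synchronous run request for an actor."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict) -> list:
    """Send the request and return the dataset items as parsed JSON."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (placeholder actor ID; replace it with the real one from the store page):
# items = run_actor("<username>~<actor-name>", "MY_APIFY_TOKEN",
#                   {"searchQueries": ["large language models"], "maxResults": 100})
```

A synchronous call like this is best suited to small runs; for large harvests, starting the run and reading its dataset afterwards is likely the safer pattern.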
Legal & Ethical Use
This actor is designed for legitimate academic research, bibliometrics, literature review, and market intelligence. Users are responsible for complying with applicable laws and Semantic Scholar's terms of service, including making reasonable-rate requests and respecting content usage rules for any papers or PDFs linked from the dataset. Do not use extracted data for spam, harassment, or any illegal purpose.

