Semantic Scholar Scraper

Pricing

from $6.00 / 1,000 results


[💰 $6 / 1K] Extract academic papers, abstracts, citations, references, authors, and open-access PDF links from Semantic Scholar's 200M+ database. Search by keyword, paper ID/DOI/URL, or author. Filter by year, field, and citations. No API key.


Developer

SolidCode


Maintained by Community


Pull academic papers, author profiles, and full citation graphs from Semantic Scholar's 200M+ paper corpus, complete with abstracts, DOIs, arXiv IDs, h-index metrics, citation counts, and direct open-access PDF links. Search by keyword, fetch an exact paper by DOI or arXiv ID, or look up an author profile in one run. Built for researchers, systematic-review teams, and data scientists who need a clean, structured scholarly dataset across every discipline without stitching together the public API one page at a time.

Why This Scraper?

  • Three ways in, one dataset: keyword search across titles and abstracts, direct fetch by Semantic Scholar paper ID / DOI / arXiv ID / CorpusId / PMID / URL, and author lookup by ID or profile URL. Mix all three in a single run.
  • 200M+ papers across 23 fields of study: filter to any combination of Computer Science, Medicine, Biology, Physics, Economics, Mathematics, Law, Linguistics, and 15 more. These are exact filters, not fuzzy "suggestions".
  • 12 publication-type filters: narrow to peer-reviewed JournalArticle, Review, MetaAnalysis, ClinicalTrial, Conference, Dataset, Book, and more for systematic-review-grade precision.
  • Citation + reference graph expansion: opt in to pull every paper that cites a work, or every paper it references, as separate rows, capped per paper so even "Attention Is All You Need" stays bounded.
  • Author profiles with h-index: name, affiliations, paper count, total citations, h-index, and homepage as first-class records, plus an opt-in full publication list per author.
  • Identifier-rich rows: every paper carries its DOI, arXiv ID, native paper ID, author IDs, influential-citation count, and canonical URL, so you can join against PubMed, Crossref, or arXiv downstream.
  • Direct open-access PDF links: openAccessPdfUrl and an isOpenAccess flag surface free full text on every eligible paper, with an open-access-only filter to keep just the downloadable ones.
  • High-impact filtering: minimum-citation-count, year-range, and sort-by-citations-or-date controls let you surface the most-cited or most-recent work in a field instantly.
  • No API key, no sign-up: go from a keyword or a DOI to a structured dataset of up to 10,000 papers per query.

Use Cases

Literature Reviews & Systematic Reviews

  • Assemble a complete, deduplicated reading list for a new topic in minutes
  • Filter to Review and MetaAnalysis types for evidence-synthesis projects
  • Restrict to open-access PDFs to build a downloadable full-text corpus

Research Trend Analysis

  • Track publication volume in a field across a year range
  • Surface the most-cited papers of the last two years with citation sorting
  • Detect emerging sub-fields from a burst of recent open-access work

Citation Network Mapping

  • Expand a seminal paper's citing-paper graph to find follow-up research
  • Pull a paper's reference list to trace its intellectual lineage
  • Build directed citation edges between papers for bibliometric graphs

Competitive Research Intelligence

  • Monitor what a lab or institution is publishing by author ID
  • Benchmark researcher output with h-index, paper count, and total citations
  • Quantify a topic's influence with influential-citation counts

Academic Lead Generation

  • Find domain experts to quote, interview, or recruit via author profiles
  • Pull affiliations and homepages for outreach to corresponding researchers
  • Identify rising authors by citation growth in a specific field

Dataset Building for Machine Learning

  • Harvest titles + abstracts at scale for NLP and recommendation models
  • Build labeled corpora filtered by field of study and publication type
  • Collect open-access PDF links for full-text mining pipelines

Getting Started

The simplest run, with one topic and 100 papers:

{
  "searchQueries": ["large language models"],
  "maxResults": 100
}

Filtered Search (Year + Field + Open Access)

Narrow to recent, high-impact, open-access computer science work and sort by citations:

{
  "searchQueries": ["retrieval augmented generation"],
  "yearFrom": 2023,
  "yearTo": 2025,
  "fieldsOfStudy": ["Computer Science"],
  "publicationTypes": ["JournalArticle", "Conference"],
  "openAccessOnly": true,
  "minCitationCount": 25,
  "sortBy": "citationCount",
  "maxResults": 200
}

Direct Fetch with Citation + Reference Graph

Fetch exact papers by DOI and arXiv ID, then pull who cites them and what they reference:

{
  "paperIds": [
    "10.1038/nature14539",
    "arXiv:1706.03762",
    "https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776"
  ],
  "includeCitations": true,
  "includeReferences": true,
  "maxCitationsPerPaper": 50
}

Author Profile Lookup

Pull author profiles by ID or URL, with their full publication lists:

{
  "authorIds": ["1741101", "https://www.semanticscholar.org/author/2061296"],
  "includeAuthorPapers": true,
  "maxResults": 200
}

To find an author ID, open any Semantic Scholar author page and copy the number after /author/ in the URL.
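
The "copy the number after /author/" step can also be done programmatically when you have a list of profile URLs. The following is a minimal sketch: the helper name extract_author_id is hypothetical, and support for a name slug between /author/ and the number is an assumption about URL shapes you might encounter, not something this page documents.

```python
import re

# Numeric ID after /author/, optionally preceded by a name slug
# (the slug form is an assumption, not documented on this page).
_AUTHOR_URL = re.compile(r"/author/(?:[^/]+/)?(\d+)(?:[/?#]|$)")


def extract_author_id(value: str) -> str:
    """Return the numeric author ID from a bare ID or a profile URL."""
    value = value.strip()
    if value.isdigit():
        # Already a bare numeric ID such as "1741101".
        return value
    match = _AUTHOR_URL.search(value)
    if match is None:
        raise ValueError(f"no author ID found in {value!r}")
    return match.group(1)


if __name__ == "__main__":
    print(extract_author_id("https://www.semanticscholar.org/author/2061296"))
```

The actor accepts full profile URLs directly, so this is only useful when you want to deduplicate or store IDs in a normalized form before a run.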

Input Reference

Search & Input

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| searchQueries | string[] | ["large language models"] | Keywords searched across paper titles and abstracts. Each query produces its own result set. |
| paperIds | string[] | [] | Fetch exact papers by Semantic Scholar paper ID, DOI, arXiv ID, CorpusId, PMID, or paper URL. One record per paper. |
| authorIds | string[] | [] | Author IDs (numeric) or full profile URLs. Returns an author-profile record with name, affiliations, h-index, and citation count. |
| maxResults | integer | 100 | Maximum papers per search query, which is an exact cap on what you are charged. Set to 0 for all available results (capped at 10,000 per query). |

Filters

Filters apply to search queries only, not to directly-fetched papers or authors.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| yearFrom | integer | null | Only include papers published in this year or later (1900–2100). |
| yearTo | integer | null | Only include papers published in this year or earlier (1900–2100). |
| fieldsOfStudy | string[] | [] | Restrict to one or more of 23 research fields (Computer Science, Medicine, Biology, Physics, Economics, and more). |
| publicationTypes | string[] | [] | Restrict to one or more of 12 types: Review, JournalArticle, CaseReport, ClinicalTrial, Conference, Dataset, Editorial, LettersAndComments, MetaAnalysis, News, Study, Book. |
| openAccessOnly | boolean | false | Only return papers with a free, downloadable open-access PDF. |
| minCitationCount | integer | null | Only return papers cited at least this many times. Ideal for surfacing high-impact work. |
| sortBy | string | "relevance" | "Relevance" (default order), "Most cited first" (by citation count), or "Most recent first" (by publication date). |

Output Options

The citation, reference, and author-paper expansions each add one row per child item, which multiplies your result count and cost. Leave them off unless you need the full graph.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| includeAbstracts | boolean | true | Include the abstract text for each paper. Disable to shrink the dataset. |
| includeReferences | boolean | false | For each paper, also output the papers it cites (its reference list) as separate records. |
| includeCitations | boolean | false | For each paper, also output the papers that cite it as separate records. |
| maxCitationsPerPaper | integer | 50 | Caps how many citing/referenced papers are fetched per source paper (1–1000) when expansion is on. |
| includeAuthorPapers | boolean | false | When you provide author IDs, also output each author's publications as separate paper records. |
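
Because each expansion adds child rows, it can help to estimate the worst-case row count before a run. The sketch below is an illustration only: the function name estimate_max_rows is hypothetical, and it assumes maxCitationsPerPaper applies separately to the citation expansion and the reference expansion, which this page does not state explicitly. Author-paper expansion is left out.

```python
def estimate_max_rows(primary_papers: int,
                      include_citations: bool = False,
                      include_references: bool = False,
                      max_citations_per_paper: int = 50) -> int:
    """Upper bound on output rows for a run with graph expansion.

    Assumption: the per-paper cap applies to each enabled expansion
    separately, so two expansions can add up to 2 * cap rows per paper.
    """
    if primary_papers < 0:
        raise ValueError("primary_papers must be non-negative")
    if not 1 <= max_citations_per_paper <= 1000:
        raise ValueError("max_citations_per_paper must be between 1 and 1000")
    expansions = int(include_citations) + int(include_references)
    # One row per primary paper, plus up to `cap` child rows per expansion.
    return primary_papers * (1 + expansions * max_citations_per_paper)


if __name__ == "__main__":
    # 20 primary papers with citations only, at the default cap of 50.
    print(estimate_max_rows(20, include_citations=True))
```

Under these assumptions, 20 primary papers with citations enabled and the default cap of 50 yield at most 1,020 rows; actual counts are lower whenever a paper has fewer citing papers than the cap.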

Output

Every row carries a recordType field (paper or author) so you can filter cleanly downstream. The dataset ships with two ready-made views: Papers and Author profiles.

Paper (recordType: "paper")

{
  "recordType": "paper",
  "paperId": "204e3073870fae3d05bcbc2f6a8e263d9b72e776",
  "title": "Attention Is All You Need",
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks...",
  "authors": [
    {"authorId": "40348417", "name": "Ashish Vaswani"},
    {"authorId": "1846258", "name": "Noam Shazeer"}
  ],
  "year": 2017,
  "publicationDate": "2017-06-12",
  "venue": "Neural Information Processing Systems",
  "publicationTypes": ["JournalArticle", "Conference"],
  "fieldsOfStudy": ["Computer Science"],
  "citationCount": 102543,
  "referenceCount": 41,
  "influentialCitationCount": 12876,
  "doi": "10.48550/arXiv.1706.03762",
  "arxivId": "1706.03762",
  "isOpenAccess": true,
  "openAccessPdfUrl": "https://arxiv.org/pdf/1706.03762.pdf",
  "url": "https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776",
  "sourceQuery": "transformer architecture",
  "parentPaperId": null,
  "parentAuthorId": null,
  "relation": null,
  "scrapedAt": "2026-06-02T10:30:00.000Z"
}

Core Fields

| Field | Type | Description |
| --- | --- | --- |
| recordType | string | Always "paper" |
| title | string | Paper title |
| abstract | string | Abstract text (null when abstracts are off or unavailable) |
| authors | object[] | Authors, each {authorId, name} |
| year | number | Publication year |
| publicationDate | string | ISO publication date when available |
| venue | string | Journal or conference name |
| publicationTypes | string[] | E.g. ["JournalArticle"] |
| fieldsOfStudy | string[] | E.g. ["Computer Science"] |

Identifiers

| Field | Type | Description |
| --- | --- | --- |
| paperId | string | Native 40-character Semantic Scholar paper ID |
| doi | string | DOI when available |
| arxivId | string | arXiv ID when available |
| url | string | Canonical Semantic Scholar paper page |

Metrics

| Field | Type | Description |
| --- | --- | --- |
| citationCount | number | Times this paper has been cited |
| referenceCount | number | Number of references in this paper |
| influentialCitationCount | number | Semantic Scholar's "influential" citation count |

Open Access & Lineage

| Field | Type | Description |
| --- | --- | --- |
| isOpenAccess | boolean | Whether a free open-access PDF exists |
| openAccessPdfUrl | string | Direct PDF link when open access |
| sourceQuery | string | The search query that produced this row (null for direct fetches) |
| parentPaperId | string | Source paper ID on citation/reference child rows (null on primary rows) |
| parentAuthorId | string | Source author ID on author-publication child rows (null on primary rows) |
| relation | string | "citation", "reference", or "authorPaper" on child rows (null on primary rows) |
| scrapedAt | string | ISO 8601 timestamp |
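
The lineage fields are enough to rebuild directed citation edges from an exported dataset. The following sketch assumes the rows have already been loaded as a list of dicts (for example from a JSON export); the function name citation_edges is hypothetical. Edge direction follows the option descriptions: a "citation" child row is a paper that cites its parent, and a "reference" child row is a paper that its parent cites.

```python
from typing import Iterable, List, Tuple


def citation_edges(rows: Iterable[dict]) -> List[Tuple[str, str]]:
    """Return (citing_paper_id, cited_paper_id) pairs from dataset rows."""
    edges = []
    for row in rows:
        if row.get("recordType") != "paper":
            continue  # skip author-profile rows
        parent = row.get("parentPaperId")
        child = row.get("paperId")
        relation = row.get("relation")
        if not parent or not child:
            continue  # primary rows and author-paper rows carry no paper edge
        if relation == "citation":
            edges.append((child, parent))   # child cites the source paper
        elif relation == "reference":
            edges.append((parent, child))   # source paper cites the child
    return edges


if __name__ == "__main__":
    sample = [
        {"recordType": "paper", "paperId": "A", "parentPaperId": None, "relation": None},
        {"recordType": "paper", "paperId": "B", "parentPaperId": "A", "relation": "citation"},
        {"recordType": "paper", "paperId": "C", "parentPaperId": "A", "relation": "reference"},
    ]
    print(citation_edges(sample))
```

The resulting pairs can be passed to a graph library of your choice as a directed edge list.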

Author Profile (recordType: "author")

{
  "recordType": "author",
  "authorId": "1741101",
  "name": "Geoffrey E. Hinton",
  "affiliations": ["University of Toronto", "Google"],
  "homepage": "https://www.cs.toronto.edu/~hinton/",
  "paperCount": 412,
  "citationCount": 631204,
  "hIndex": 178,
  "url": "https://www.semanticscholar.org/author/1741101",
  "scrapedAt": "2026-06-02T10:30:00.000Z"
}

| Field | Type | Description |
| --- | --- | --- |
| recordType | string | Always "author" |
| authorId | string | Numeric Semantic Scholar author ID |
| name | string | Author display name |
| affiliations | string[] | Listed affiliations |
| homepage | string | Homepage URL when available |
| paperCount | number | Number of papers attributed to the author |
| citationCount | number | Total citations across all papers |
| hIndex | number | Author h-index |
| url | string | Canonical Semantic Scholar author page |
| scrapedAt | string | ISO 8601 timestamp |

When includeAuthorPapers is on, each author's publications are also emitted as paper rows alongside the profile, so an author's full body of work lands in the Papers view ready to filter and sort.

Tips for Best Results

  • Fetch by DOI or arXiv ID for guaranteed exact matches. Keyword search is fuzzy; a DOI like 10.1038/nature14539 or arXiv:1706.03762 resolves to exactly one paper, every time, which is ideal for verifying a known reference.
  • Narrow broad topics with filters. A bare query like "machine learning" returns a flood. Add a yearFrom, a fieldsOfStudy value, and a minCitationCount to surface a tight, high-signal set.
  • Use minCitationCount for impact triage. Set it to 50 or 100 to skip preprints and low-impact work when you only want established, well-cited literature.
  • Filter to Review and MetaAnalysis for evidence synthesis. These publication types are the backbone of systematic reviews and save hours of manual screening.
  • Turn off abstracts on large harvests. Setting includeAbstracts: false shrinks every row and speeds up runs when you only need metadata and metrics.
  • Keep maxResults modest when expanding the citation graph. includeCitations and includeReferences multiply rows per paper. Pair them with a small maxResults (5–20) and a sensible maxCitationsPerPaper to keep runs predictable.
  • Sort by "Most recent first" for monitoring. Re-run a saved query on a schedule with date sorting to catch new publications in your field as they land.

Pricing

From $6 per 1,000 results for discipline-spanning academic data, with citation graphs and author metrics bundled in. Bronze, Silver, and Gold subscribers pay progressively less; the table below shows total cost at each discount tier.

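| Results | No discount | Bronze | Silver | Gold |
| --- | --- | --- | --- | --- |
| 100 | $0.72 | $0.68 | $0.64 | $0.60 |
| 1,000 | $7.20 | $6.80 | $6.40 | $6.00 |
| 10,000 | $72.00 | $68.00 | $64.00 | $60.00 |
| 100,000 | $720.00 | $680.00 | $640.00 | $600.00 |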

No compute or time-based charges: you pay per result, plus a small fixed per-run start fee. A "result" is any row in the output dataset: a paper, an author profile, or a citing/referenced/author-paper row from the opt-in graph expansions (so enabling those expansions increases your result count). Platform fees depend on your Apify plan.
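
The table's per-tier totals are linear in the result count, so a budget estimate is a single multiplication. The sketch below derives per-1,000 rates from the 1,000-result row of the table; the function name estimate_cost is hypothetical, and because the per-run start fee amount is not stated on this page, it is a parameter that defaults to zero. Platform fees are not included.

```python
from decimal import Decimal

# Price per 1,000 results at each discount tier, taken from the pricing table.
PRICE_PER_1000 = {
    "none": Decimal("7.20"),
    "bronze": Decimal("6.80"),
    "silver": Decimal("6.40"),
    "gold": Decimal("6.00"),
}


def estimate_cost(results: int, tier: str = "none",
                  start_fee: Decimal = Decimal("0")) -> Decimal:
    """Estimated result charges in USD, rounded to cents.

    start_fee is an assumption-driven placeholder: the page mentions a
    fixed per-run fee but does not give its amount.
    """
    if results < 0:
        raise ValueError("results must be non-negative")
    total = PRICE_PER_1000[tier] * results / 1000 + start_fee
    return total.quantize(Decimal("0.01"))


if __name__ == "__main__":
    print(estimate_cost(1000, "gold"))
```

Remember that rows from the citation, reference, and author-paper expansions count as results, so pass the total expected row count rather than the number of primary papers.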

Integrations

Export data in JSON, CSV, Excel, XML, or RSS. Connect to 1,500+ apps via:

  • Zapier / Make / n8n: workflow automation
  • Google Sheets: direct spreadsheet export
  • Slack / Email: notifications on new results
  • Webhooks: trigger custom APIs on run completion
  • Apify API: full programmatic access

This actor is designed for legitimate academic research, bibliometrics, literature review, and market intelligence. Users are responsible for complying with applicable laws and Semantic Scholar's terms of service, including making reasonable-rate requests and respecting content usage rules for any papers or PDFs linked from the dataset. Do not use extracted data for spam, harassment, or any illegal purpose.