Pricing

from $2.00 / 1,000 papers fetched

Semantic Scholar Paper Search

Search and extract data from 200M+ academic papers via the Semantic Scholar API. Filter by keyword, year, venue, field of study, citation count, and open access. Returns titles, abstracts, AI summaries (TLDR), authors, DOIs, ArXiv IDs, and PDF links. No API key required.

Developer

ryan clinton

What does Semantic Scholar Paper Search do?

Semantic Scholar Paper Search is an Apify actor that queries the Semantic Scholar Academic Graph API to find and extract research paper data at scale. Built by the Allen Institute for AI (AI2), Semantic Scholar indexes over 200 million academic papers across every major discipline -- from computer science and medicine to economics and sociology.

Enter a search query and the actor returns comprehensive, structured JSON for every matching paper: title, authors, abstract, AI-generated TLDR summary, citation count, influential citation count, reference count, publication date, venue, journal details, DOI, ArXiv ID, PubMed ID, fields of study, publication types, open access PDF link, and a direct URL to the Semantic Scholar page.

Use it for systematic literature reviews, citation trend analysis, research monitoring pipelines, academic meta-analysis, or gathering training data for scientific AI tools.

Why use Semantic Scholar Paper Search on Apify?

No API key required -- uses the free public Semantic Scholar API tier, so you can start searching immediately without registration or credentials.
AI-generated TLDR summaries -- Semantic Scholar's machine learning model produces one-sentence paper summaries, letting you scan hundreds of results without reading full abstracts.
Influential citation tracking -- goes beyond raw citation counts with Semantic Scholar's influential citation metric, which identifies citations where the cited work meaningfully shaped the citing paper.
Cross-database identifiers -- every paper includes DOI, ArXiv ID, and PubMed ID when available, making it trivial to cross-reference results with other academic databases.
Built-in rate limiting and retry -- automatically handles the 1 request/second public rate limit and retries 429 responses after a 5-second wait.
Pagination handled automatically -- request up to 1,000 papers in a single run; the actor pages through results behind the scenes.
Scheduled runs -- set up recurring searches on Apify to monitor new publications on a daily or weekly basis.
Cloud execution -- runs on Apify infrastructure with no local setup, and integrates with webhooks, APIs, and 1,600+ apps via Zapier or Make.

Key features

Full-text search across paper titles and abstracts using Semantic Scholar's relevance ranking
AI-generated TLDR summaries -- machine-generated one-sentence paper summaries available for many papers in the index
Influential citation counts -- a quality-weighted citation metric that counts only papers where the citation had a significant methodological or conceptual impact
Multi-ID cross-referencing -- every paper exports DOI, ArXiv ID, and PubMed ID, enabling seamless cross-database lookups
Year range filtering with flexible syntax (from year, to year, or bounded range)
Venue filtering by journal or conference name (Nature, NeurIPS, ICML, ArXiv, etc.)
Field of study filtering across 10 disciplines: Computer Science, Medicine, Biology, Physics, Chemistry, Mathematics, Engineering, Economics, Psychology, Sociology
Open access filter to retrieve only papers with free PDF downloads
Minimum citation threshold to surface only well-cited papers
Three sort modes -- relevance (default), citation count (most cited), or publication date (newest first)
Direct open access PDF links when available

How to use Semantic Scholar Paper Search

Navigate to the Semantic Scholar Paper Search actor on the Apify Store.
Click Try for free to open the actor in Apify Console.
Enter your Search Query -- for example, large language models, CRISPR gene editing, or climate change mitigation.
Optionally set filters: year range, venue, field of study, open access only, minimum citations.
Choose a sort order: relevance (default), most cited, or newest first.
Set the maximum number of results (1 to 1,000).
Click Start to run the actor.
When the run finishes, download results as JSON, CSV, or Excel from the Dataset tab.

Input parameters

Parameter	Type	Required	Default	Description
`query`	String	Yes	`large language models`	Search query matching paper titles and abstracts
`yearFrom`	Integer	No	`2023`	Earliest publication year to include
`yearTo`	Integer	No	--	Latest publication year to include
`venue`	String	No	--	Filter by journal or conference name (e.g., `Nature`, `NeurIPS`, `ArXiv`)
`fieldsOfStudy`	String	No	--	Academic field: Computer Science, Medicine, Biology, Physics, Chemistry, Mathematics, Engineering, Economics, Psychology, or Sociology
`openAccessOnly`	Boolean	No	`false`	When enabled, only returns papers with free PDF downloads
`minCitations`	Integer	No	--	Minimum number of citations a paper must have
`sortBy`	String	No	`relevance`	Sort order: `relevance`, `citationCount` (most cited), or `publicationDate` (newest first)
`maxResults`	Integer	No	`50`	Maximum number of papers to return (1 to 1,000)

Input examples

Find highly-cited LLM papers from top conferences:

{
    "query": "large language models",
    "yearFrom": 2023,
    "venue": "NeurIPS",
    "minCitations": 50,
    "sortBy": "citationCount",
    "maxResults": 100
}

Search for open access biomedical research:

{
    "query": "CRISPR gene therapy clinical trials",
    "fieldsOfStudy": "Medicine",
    "openAccessOnly": true,
    "yearFrom": 2022,
    "yearTo": 2025,
    "maxResults": 200
}

Get the newest climate science publications:

{
    "query": "climate change mitigation renewable energy",
    "sortBy": "publicationDate",
    "yearFrom": 2025,
    "maxResults": 50
}

Find influential machine learning survey papers:

{
    "query": "survey transformer architecture",
    "fieldsOfStudy": "Computer Science",
    "minCitations": 100,
    "sortBy": "citationCount",
    "maxResults": 50
}

Tips for best results

Use specific search terms -- Semantic Scholar searches across titles and abstracts. More specific queries like transformer architecture self-attention return more targeted results than broad terms like AI.
Combine filters -- pair a keyword search with a year range and minimum citation count to find highly-cited recent papers in your area.
Use the venue filter -- if you only want papers from NeurIPS, ICML, Nature, or The Lancet, set the venue filter to narrow results significantly.
Sort by citations for impact -- sorting by citationCount surfaces the most influential papers in any research area.
Sort by date for recency -- sorting by publicationDate finds the latest preprints and publications.
Filter open access only -- when you need downloadable PDFs for text mining or corpus building, enable the open access filter.
Check the TLDR field -- AI-generated summaries are available for many papers, saving significant time when scanning large result sets.
Check influential citations -- a paper with 50 influential citations may be more important to a field than one with 500 total citations that are mostly superficial mentions.
Schedule weekly runs -- set up a recurring Apify schedule to monitor new publications matching your query automatically.

Programmatic access

You can call Semantic Scholar Paper Search programmatically using the Apify API. Here are examples in Python, JavaScript, and cURL.

Python:

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/semantic-scholar-search").call(run_input={
    "query": "large language models",
    "yearFrom": 2023,
    "minCitations": 50,
    "sortBy": "citationCount",
    "maxResults": 100,
})

for paper in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{paper['title']} ({paper['citationCount']} citations)")
    if paper.get("tldr"):
        print(f"  TLDR: {paper['tldr']}")

JavaScript:

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/semantic-scholar-search").call({
    query: "large language models",
    yearFrom: 2023,
    minCitations: 50,
    sortBy: "citationCount",
    maxResults: 100,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((paper) => {
    console.log(`${paper.title} (${paper.citationCount} citations)`);
    if (paper.tldr) console.log(`  TLDR: ${paper.tldr}`);
});

cURL:

curl "https://api.apify.com/v2/acts/ryanclinton~semantic-scholar-search/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "query": "large language models",
    "yearFrom": 2023,
    "minCitations": 50,
    "sortBy": "citationCount",
    "maxResults": 100
  }'

Output example

Each paper in the output dataset contains the following structure:

{
    "paperId": "204e3073870fae3d05bcbc2f6a8e263d9b72e776",
    "title": "Attention Is All You Need",
    "year": 2017,
    "publicationDate": "2017-06-12",
    "citationCount": 124500,
    "referenceCount": 40,
    "influentialCitationCount": 15230,
    "isOpenAccess": true,
    "openAccessPdfUrl": "https://arxiv.org/pdf/1706.03762.pdf",
    "doi": "10.48550/arXiv.1706.03762",
    "arxivId": "1706.03762",
    "pmid": null,
    "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin",
    "authorIds": ["1846258", "1857797", "47269835", "2516777", "144783904", "1857998", "1741101", "47558326"],
    "venue": "Neural Information Processing Systems",
    "journalName": null,
    "journalVolume": null,
    "journalPages": null,
    "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms...",
    "tldr": "A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, is proposed and achieves state-of-the-art results on English-to-German and English-to-French translation tasks.",
    "fieldsOfStudy": ["Computer Science"],
    "publicationTypes": ["Conference", "JournalArticle"],
    "semanticScholarUrl": "https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776",
    "extractedAt": "2026-02-17T10:30:00.000Z"
}

Output fields reference

Field	Type	Description
`paperId`	String	Semantic Scholar unique paper identifier (40-character hash)
`title`	String	Full paper title
`year`	Integer	Publication year (may be `null` for preprints)
`publicationDate`	String	ISO date string (e.g., `2023-06-15`), `null` if unknown
`citationCount`	Integer	Total number of citing papers in Semantic Scholar
`referenceCount`	Integer	Number of papers cited by this paper
`influentialCitationCount`	Integer	Citations where this paper significantly influenced the citing work
`isOpenAccess`	Boolean	Whether a free PDF is available
`openAccessPdfUrl`	String	Direct URL to the open access PDF, `null` if not available
`doi`	String	Digital Object Identifier, `null` if not assigned
`arxivId`	String	ArXiv preprint identifier (e.g., `2301.12345`), `null` if not on ArXiv
`pmid`	String	PubMed identifier, `null` if not indexed in PubMed
`authors`	String	Comma-separated list of author names
`authorIds`	Array	Semantic Scholar author IDs for programmatic author lookups
`venue`	String	Publication venue name (conference or journal), `null` if unknown
`journalName`	String	Journal name if published in a journal, `null` otherwise
`journalVolume`	String	Journal volume number, `null` if not applicable
`journalPages`	String	Page range in the journal, `null` if not applicable
`abstract`	String	Full paper abstract, `null` if not available
`tldr`	String	AI-generated one-sentence summary from Semantic Scholar, `null` if not generated
`fieldsOfStudy`	Array	Academic disciplines (e.g., `["Computer Science", "Mathematics"]`)
`publicationTypes`	Array	Publication types (e.g., `["Conference"]`, `["JournalArticle"]`, `["Review"]`)
`semanticScholarUrl`	String	Direct link to the paper's Semantic Scholar page
`extractedAt`	String	ISO timestamp of when the data was extracted

How it works

The actor follows a straightforward pipeline to search, paginate, transform, and output paper data:

                    Semantic Scholar Academic Graph API
                    ===================================

  [Input Query + Filters]
           |
           v
  +------------------+     offset=0      +---------------------------+
  | Build URL with   | ----------------> | api.semanticscholar.org   |
  | 17 explicit      |     100/page      | /graph/v1/paper/search    |
  | field params     | <---------------- | (free, no key required)   |
  +------------------+     JSON page     +---------------------------+
           |                                        ^
           |  1.1s delay between pages              |
           |  5s wait + retry on 429                |
           +--------- next page? --------> offset += 100
           |           (until maxResults or offset >= 1000)
           v
  +------------------+
  | Transform:       |
  | - Flatten IDs    |  DOI, ArXiv, PubMed extracted from externalIds
  | - Extract TLDR   |  AI summary from tldr.text
  | - Format authors |  Joined names + separate ID array
  | - Build S2 URL   |  Direct link to paper page
  +------------------+
           |
           v
  +------------------+
  | Push to Apify    |  Flat JSON objects, one per paper
  | Dataset          |  + citation/field/TLDR summary stats in log
  +------------------+
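
The paging loop in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's source code: `fetch_page` is a hypothetical callable standing in for the HTTP request, and the `data` list in the response is assumed from the Semantic Scholar search response shape. The page size, offset cap, and delay are the values shown in the diagram.

```python
import time
from typing import Callable

PAGE_SIZE = 100      # papers requested per page (from the diagram)
MAX_OFFSET = 1000    # paging stops once the offset reaches this value
PAGE_DELAY_S = 1.1   # pause between page requests


def collect_papers(
    fetch_page: Callable[[int, int], dict],
    max_results: int,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Page through results until max_results or the offset cap is reached.

    fetch_page(offset, limit) is a hypothetical helper returning one decoded
    JSON page; a "data" list of papers is assumed.
    """
    papers: list[dict] = []
    offset = 0
    while len(papers) < max_results and offset < MAX_OFFSET:
        limit = min(PAGE_SIZE, max_results - len(papers))
        batch = fetch_page(offset, limit).get("data") or []
        papers.extend(batch)
        if len(batch) < limit:
            break  # short page: no further matches
        offset += PAGE_SIZE
        if len(papers) < max_results and offset < MAX_OFFSET:
            sleep(PAGE_DELAY_S)  # stay under 1 request per second
    return papers[:max_results]
```

Passing `sleep` as a parameter is only a convenience for testing the loop without real delays.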

Field selection

The actor requests 17 specific data fields from the Semantic Scholar API in a single fields parameter. This explicit field selection ensures you get the maximum available metadata per paper without making additional per-paper API calls. The requested fields include title, year, citationCount, referenceCount, influentialCitationCount, isOpenAccess, openAccessPdf, externalIds, publicationTypes, journal, authors, abstract, fieldsOfStudy, s2FieldsOfStudy, publicationVenue, publicationDate, and tldr.
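
As a rough sketch of what such a request URL might look like, the snippet below joins the 17 field names listed above into one `fields` parameter. The endpoint path comes from the pipeline diagram; the `query`, `offset`, and `limit` parameter names follow the public Semantic Scholar search API, and the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint shown in the pipeline diagram.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

# The 17 fields named in the paragraph above.
FIELDS = [
    "title", "year", "citationCount", "referenceCount",
    "influentialCitationCount", "isOpenAccess", "openAccessPdf",
    "externalIds", "publicationTypes", "journal", "authors", "abstract",
    "fieldsOfStudy", "s2FieldsOfStudy", "publicationVenue",
    "publicationDate", "tldr",
]


def build_search_url(query: str, offset: int = 0, limit: int = 100) -> str:
    """Return a search URL that asks for every field in one request."""
    params = {
        "query": query,
        "offset": offset,
        "limit": limit,
        "fields": ",".join(FIELDS),  # single comma-separated parameter
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url("large language models"))
```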

Rate limiting and 429 retry

The Semantic Scholar public API allows 1 request per second without an API key. The actor enforces a 1.1-second delay between page requests to stay within this limit. If the API returns a 429 (Too Many Requests) response, the actor waits 5 seconds before retrying the same request. This retry loop continues until the request succeeds, so transient rate limit hits never cause the run to fail.
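
The retry behaviour described above could look something like this. It is a sketch, not the actor's code: `send` is a hypothetical callable that performs one request and returns a status code with the decoded body, and raising on other non-200 statuses is an assumption, since the text only describes the 429 case.

```python
import time
from typing import Callable, Tuple

RETRY_WAIT_S = 5.0  # wait after a 429 before retrying the same request


def fetch_with_retry(
    send: Callable[[], Tuple[int, dict]],
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Repeat the same request until it is no longer rate limited.

    send() is a hypothetical helper returning (status_code, decoded_json).
    """
    while True:
        status, body = send()
        if status == 429:
            sleep(RETRY_WAIT_S)  # rate limited: wait, then retry
            continue
        if status != 200:
            # Assumption: other errors are not retried.
            raise RuntimeError(f"Unexpected HTTP status {status}")
        return body
```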

Year filter syntax

The Semantic Scholar API accepts year ranges in three formats:

2023-2025 -- papers published between 2023 and 2025 inclusive
2023- -- papers published from 2023 onward (open-ended upper bound)
-2025 -- papers published up to and including 2025 (open-ended lower bound)

The actor constructs the correct format automatically based on which of yearFrom and yearTo you provide.
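
A minimal sketch of that mapping, assuming a hypothetical helper named `build_year_filter`. The check that rejects a reversed range is an added assumption; the source does not say how the actor handles that case.

```python
from typing import Optional


def build_year_filter(
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> Optional[str]:
    """Map yearFrom / yearTo onto the three range formats listed above."""
    if year_from is None and year_to is None:
        return None  # no year filter requested
    if year_from is not None and year_to is not None:
        if year_from > year_to:
            # Assumption: a reversed range is treated as an input error.
            raise ValueError("yearFrom must not be later than yearTo")
        return f"{year_from}-{year_to}"  # bounded range
    if year_from is not None:
        return f"{year_from}-"  # open-ended upper bound
    return f"-{year_to}"  # open-ended lower bound


print(build_year_filter(2023, 2025))
print(build_year_filter(year_from=2023))
print(build_year_filter(year_to=2025))
```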

External ID extraction

Each paper from the API may include an externalIds object containing DOI, ArXiv, PubMed, and other identifiers. The actor flattens these into top-level doi, arxivId, and pmid fields so you can directly cross-reference results with other databases (Crossref, ArXiv, PubMed) without nested object parsing.
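
The flattening step, together with the other transform steps named in the pipeline diagram, might be sketched as below. The output keys match the output fields reference. The raw key names (`DOI`, `ArXiv`, `PubMed`, `tldr.text`, `openAccessPdf.url`, and `name` / `authorId` on authors) are assumed from the Semantic Scholar response format, and `flatten_paper` is a hypothetical name covering only a subset of the output fields.

```python
from typing import Any


def flatten_paper(raw: dict[str, Any]) -> dict[str, Any]:
    """Turn one nested API record into a flat, dataset-style object."""
    ids = raw.get("externalIds") or {}      # may be missing or null
    tldr = raw.get("tldr") or {}
    pdf = raw.get("openAccessPdf") or {}
    authors = raw.get("authors") or []
    paper_id = raw.get("paperId")
    return {
        "paperId": paper_id,
        "title": raw.get("title"),
        "doi": ids.get("DOI"),
        "arxivId": ids.get("ArXiv"),
        "pmid": ids.get("PubMed"),
        "tldr": tldr.get("text"),
        "openAccessPdfUrl": pdf.get("url"),
        # Joined names plus a separate ID array, as in the diagram.
        "authors": ", ".join(a["name"] for a in authors if a.get("name")),
        "authorIds": [a["authorId"] for a in authors if a.get("authorId")],
        "semanticScholarUrl": (
            f"https://www.semanticscholar.org/paper/{paper_id}"
            if paper_id else None
        ),
    }
```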

TLDR generation

Semantic Scholar uses a trained machine learning model (SciTLDR) to generate one-sentence summaries for papers in its index. These are returned in the tldr field. Not every paper has a TLDR -- the model needs sufficient abstract text to generate a summary. The actor reports how many papers in the result set include a TLDR in the run log.

Influential vs. total citations

Total citationCount includes every paper that references the work, including superficial mentions. The influentialCitationCount metric, unique to Semantic Scholar, uses a trained classifier to identify citations where the cited paper had a significant impact on the citing paper's methodology, experiments, or conclusions. A paper with a high influential citation ratio relative to its total citations is generally considered more foundational to its field.
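
One way to compare papers on this basis is to compute the ratio directly from the two output fields. The helper name `influential_ratio` is hypothetical, and returning `None` for papers without citations is a design choice made for this sketch.

```python
from typing import Optional


def influential_ratio(paper: dict) -> Optional[float]:
    """Share of a paper's citations that Semantic Scholar marks influential."""
    total = paper.get("citationCount") or 0
    influential = paper.get("influentialCitationCount") or 0
    if total == 0:
        return None  # ratio is undefined without citations
    return influential / total


papers = [
    {"title": "A", "citationCount": 500, "influentialCitationCount": 10},
    {"title": "B", "citationCount": 200, "influentialCitationCount": 80},
]
# Rank by ratio, highest first; papers without citations sort last.
ranked = sorted(papers, key=lambda p: influential_ratio(p) or 0.0, reverse=True)
for p in ranked:
    print(p["title"], influential_ratio(p))
```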

How much does it cost to run?

Semantic Scholar Paper Search is lightweight -- it uses only 256 MB of memory and makes HTTP API calls without any browser rendering. The Semantic Scholar API itself is completely free (no API key or subscription required).

Scenario	Papers	Run time	Apify cost (approx.)
Quick search	50	~60 seconds	$0.001 -- $0.005
Medium batch	200	~3 minutes	$0.005 -- $0.01
Full extraction	1,000	~12 minutes	$0.01 -- $0.03

Run times scale linearly with result count due to the 1-request-per-second rate limit (100 papers per page, 1.1 seconds between pages). The majority of the cost comes from the Apify platform compute time at 256 MB memory.

Limitations and responsible use

1,000 paper maximum per run -- the Semantic Scholar API enforces a maximum offset of 1,000. To retrieve more papers on a broad topic, run multiple searches with non-overlapping year ranges or additional filters.
Search query is required -- unlike some academic APIs, Semantic Scholar's search endpoint requires a query string. You cannot browse all papers without a search term.
Rate limiting -- the public API tier allows 1 request per second. The actor respects this limit automatically, but run times scale linearly with result count.
TLDR availability -- AI-generated summaries are not available for every paper. Older papers and those with very short abstracts may lack a TLDR.
Field of study coverage -- filtering supports 10 top-level disciplines. More granular sub-field filtering is not available through this endpoint.
Data freshness -- Semantic Scholar continuously indexes new papers, but there may be a delay of days to weeks before very recent publications appear in search results.
Respect the API -- this actor is designed for legitimate research and data analysis. Avoid scheduling extremely frequent runs with maximum result counts, as this consumes shared public API resources.
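
To work around the 1,000-paper cap with non-overlapping year ranges, one option is to generate one actor input per year and merge the resulting datasets by `paperId`. The sketch below uses the input parameter names from the table above; `yearly_inputs` and `merge_unique` are hypothetical helper names, and each input would be passed to a separate actor run.

```python
def yearly_inputs(query: str, first_year: int, last_year: int,
                  max_results: int = 1000) -> list[dict]:
    """Build one actor input per publication year (non-overlapping ranges)."""
    return [
        {"query": query, "yearFrom": y, "yearTo": y, "maxResults": max_results}
        for y in range(first_year, last_year + 1)
    ]


def merge_unique(*result_sets: list[dict]) -> list[dict]:
    """Concatenate result sets, keeping the first record per paperId."""
    seen: set[str] = set()
    merged: list[dict] = []
    for items in result_sets:
        for paper in items:
            pid = paper.get("paperId")
            if pid is not None and pid in seen:
                continue  # already collected by an earlier run
            if pid is not None:
                seen.add(pid)
            merged.append(paper)
    return merged


for run_input in yearly_inputs("large language models", 2023, 2025):
    print(run_input)
```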

FAQ

Do I need a Semantic Scholar API key to use this actor?

No. The actor uses the free public API tier, which does not require any API key or authentication. It automatically respects the public rate limit of 1 request per second and handles 429 responses with retry logic.

What is the maximum number of papers I can retrieve in one run?

You can retrieve up to 1,000 papers per run. This is a hard limit of the Semantic Scholar API's offset parameter. To cover more ground, run multiple searches with different year ranges, venues, or field-of-study filters.

What are "influential citations" and how are they different from regular citations?

Influential citation count is a Semantic Scholar metric computed by a trained classifier. It identifies citations where the cited paper had a significant impact on the citing paper's methodology, experiments, or conclusions -- as opposed to superficial mentions in related-work sections. A paper with 200 total citations and 80 influential citations is likely more foundational than one with 500 total citations and only 10 influential citations.

What does the TLDR field contain?

The tldr field contains an AI-generated one-sentence summary produced by Semantic Scholar's SciTLDR model. It distills the paper's main contribution or finding into a single sentence. Not every paper has a TLDR -- it depends on whether the model could generate a quality summary from the abstract.

Can I search for a specific author's papers?

This actor searches by keyword across titles and abstracts, not by author ID. You can include an author name in the query (e.g., "Yoshua Bengio" deep learning) to find papers mentioning that author, but for comprehensive author-based retrieval, the Semantic Scholar Author API endpoint would be more appropriate.

How do I cross-reference results with other academic databases?

Each paper includes doi, arxivId, and pmid fields when available. Use the DOI to look up the paper in Crossref or the publisher's site, the ArXiv ID to find it on arxiv.org, and the PubMed ID to locate it in PubMed/MEDLINE. These identifiers make it straightforward to merge Semantic Scholar data with results from other actors in this suite.

Actor	Database	Coverage	Best for
OpenAlex Research Search	OpenAlex	250M+ works, fully open metadata	Broad bibliometric analysis with open data
Crossref Academic Paper Search	Crossref	150M+ DOI records	DOI metadata, publisher information, citation links
PubMed Biomedical Literature Search	PubMed/MEDLINE	36M+ biomedical citations	Medical and life science research
ArXiv Preprint Paper Search	ArXiv	2.4M+ preprints	Pre-publication CS, physics, math papers
CORE Open Access Papers	CORE	300M+ metadata records	Open access full-text aggregation
Europe PMC Literature Search	Europe PMC	45M+ life science records	European biomedical and life science literature
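
For cross-referencing, the `doi`, `arxivId`, and `pmid` output fields can be turned into lookup links. This sketch assumes the usual public URL patterns for doi.org, arxiv.org, and PubMed; `lookup_urls` is a hypothetical helper name.

```python
def lookup_urls(paper: dict) -> dict[str, str]:
    """Build external lookup links from whichever identifiers are present."""
    urls: dict[str, str] = {}
    if paper.get("doi"):
        urls["doi"] = f"https://doi.org/{paper['doi']}"
    if paper.get("arxivId"):
        urls["arxiv"] = f"https://arxiv.org/abs/{paper['arxivId']}"
    if paper.get("pmid"):
        urls["pubmed"] = f"https://pubmed.ncbi.nlm.nih.gov/{paper['pmid']}/"
    return urls


paper = {"doi": "10.48550/arXiv.1706.03762", "arxivId": "1706.03762", "pmid": None}
print(lookup_urls(paper))
```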
