Semantic Scholar Paper Search
Pricing
from $2.00 / 1,000 papers fetched
Search and extract data from 200M+ academic papers via Semantic Scholar API. Filter by keyword, year, venue, field of study, citation count, and open access. Returns titles, abstracts, AI summaries (TLDR), authors, DOIs, ArXiv IDs, and PDF links. No API key required.
Developer

ryan clinton
Search and extract academic research papers from Semantic Scholar's database of over 200 million publications. Retrieve structured metadata including titles, abstracts, authors, citation counts, DOIs, ArXiv IDs, open access PDF links, and AI-generated TLDR summaries -- all without needing an API key.
What does Semantic Scholar Paper Search do?
Semantic Scholar Paper Search is an Apify actor that queries the Semantic Scholar Academic Graph API to find and extract research paper data at scale. Built by the Allen Institute for AI (AI2), Semantic Scholar indexes over 200 million academic papers across every major discipline -- from computer science and medicine to economics and sociology.
Enter a search query and the actor returns comprehensive, structured JSON for every matching paper: title, authors, abstract, AI-generated TLDR summary, citation count, influential citation count, reference count, publication date, venue, journal details, DOI, ArXiv ID, PubMed ID, fields of study, publication types, open access PDF link, and a direct URL to the Semantic Scholar page.
Use it for systematic literature reviews, citation trend analysis, research monitoring pipelines, academic meta-analysis, or gathering training data for scientific AI tools.
Why use Semantic Scholar Paper Search on Apify?
- No API key required -- uses the free public Semantic Scholar API tier, so you can start searching immediately without registration or credentials.
- AI-generated TLDR summaries -- Semantic Scholar's machine learning model produces one-sentence paper summaries, letting you scan hundreds of results without reading full abstracts.
- Influential citation tracking -- goes beyond raw citation counts with Semantic Scholar's influential citation metric, which identifies citations where the cited work meaningfully shaped the citing paper.
- Cross-database identifiers -- every paper includes DOI, ArXiv ID, and PubMed ID when available, making it trivial to cross-reference results with other academic databases.
- Built-in rate limiting and retry -- automatically handles the 1 request/second public rate limit and retries on 429 responses after a fixed 5-second wait.
- Pagination handled automatically -- request up to 1,000 papers in a single run; the actor pages through results behind the scenes.
- Scheduled runs -- set up recurring searches on Apify to monitor new publications on a daily or weekly basis.
- Cloud execution -- runs on Apify infrastructure with no local setup, and integrates with webhooks, APIs, and 1,600+ apps via Zapier or Make.
Key features
- Keyword search across paper titles and abstracts, ranked by Semantic Scholar's relevance scoring
- AI-generated TLDR summaries -- machine-generated one-sentence paper summaries available for many papers in the index
- Influential citation counts -- a quality-weighted citation metric that counts only papers where the citation had a significant methodological or conceptual impact
- Multi-ID cross-referencing -- every paper exports DOI, ArXiv ID, and PubMed ID when available, enabling cross-database lookups
- Year range filtering with flexible syntax (from year, to year, or bounded range)
- Venue filtering by journal or conference name (Nature, NeurIPS, ICML, ArXiv, etc.)
- Field of study filtering across 10 disciplines: Computer Science, Medicine, Biology, Physics, Chemistry, Mathematics, Engineering, Economics, Psychology, Sociology
- Open access filter to retrieve only papers with free PDF downloads
- Minimum citation threshold to surface only well-cited papers
- Three sort modes -- relevance (default), citation count (most cited), or publication date (newest first)
- Direct open access PDF links when available
How to use Semantic Scholar Paper Search
- Navigate to the Semantic Scholar Paper Search actor on the Apify Store.
- Click Try for free to open the actor in Apify Console.
- Enter your Search Query -- for example, `large language models`, `CRISPR gene editing`, or `climate change mitigation`.
- Optionally set filters: year range, venue, field of study, open access only, minimum citations.
- Choose a sort order: relevance (default), most cited, or newest first.
- Set the maximum number of results (1 to 1,000).
- Click Start to run the actor.
- When the run finishes, download results as JSON, CSV, or Excel from the Dataset tab.
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | String | Yes | large language models | Search query matching paper titles and abstracts |
| `yearFrom` | Integer | No | 2023 | Earliest publication year to include |
| `yearTo` | Integer | No | -- | Latest publication year to include |
| `venue` | String | No | -- | Filter by journal or conference name (e.g., Nature, NeurIPS, ArXiv) |
| `fieldsOfStudy` | String | No | -- | Academic field: Computer Science, Medicine, Biology, Physics, Chemistry, Mathematics, Engineering, Economics, Psychology, or Sociology |
| `openAccessOnly` | Boolean | No | false | When enabled, only returns papers with free PDF downloads |
| `minCitations` | Integer | No | -- | Minimum number of citations a paper must have |
| `sortBy` | String | No | relevance | Sort order: relevance, citationCount (most cited), or publicationDate (newest first) |
| `maxResults` | Integer | No | 50 | Maximum number of papers to return (1 to 1,000) |
Input examples
Find highly-cited LLM papers from top conferences:
```json
{
  "query": "large language models",
  "yearFrom": 2023,
  "venue": "NeurIPS",
  "minCitations": 50,
  "sortBy": "citationCount",
  "maxResults": 100
}
```
Search for open access biomedical research:
```json
{
  "query": "CRISPR gene therapy clinical trials",
  "fieldsOfStudy": "Medicine",
  "openAccessOnly": true,
  "yearFrom": 2022,
  "yearTo": 2025,
  "maxResults": 200
}
```
Get the newest climate science publications:
```json
{
  "query": "climate change mitigation renewable energy",
  "sortBy": "publicationDate",
  "yearFrom": 2025,
  "maxResults": 50
}
```
Find influential machine learning survey papers:
```json
{
  "query": "survey transformer architecture",
  "fieldsOfStudy": "Computer Science",
  "minCitations": 100,
  "sortBy": "citationCount",
  "maxResults": 50
}
```
Tips for best results
- Use specific search terms -- Semantic Scholar searches across titles and abstracts. More specific queries like `transformer architecture self-attention` return more targeted results than broad terms like `AI`.
- Combine filters -- pair a keyword search with a year range and minimum citation count to find highly-cited recent papers in your area.
- Use the venue filter -- if you only want papers from NeurIPS, ICML, Nature, or The Lancet, set the venue filter to narrow results significantly.
- Sort by citations for impact -- sorting by `citationCount` surfaces the most influential papers in any research area.
- Sort by date for recency -- sorting by `publicationDate` finds the latest preprints and publications.
- Filter open access only -- when you need downloadable PDFs for text mining or corpus building, enable the open access filter.
- Check the TLDR field -- AI-generated summaries are available for many papers, saving significant time when scanning large result sets.
- Check influential citations -- a paper with 50 influential citations may be more important to a field than one with 500 total citations that are mostly superficial mentions.
- Schedule weekly runs -- set up a recurring Apify schedule to monitor new publications matching your query automatically.
Programmatic access
You can call Semantic Scholar Paper Search programmatically using the Apify API. Here are examples in Python, JavaScript, and cURL.
Python:
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("ryanclinton/semantic-scholar-search").call(
    run_input={
        "query": "large language models",
        "yearFrom": 2023,
        "minCitations": 50,
        "sortBy": "citationCount",
        "maxResults": 100,
    }
)

# Read the results from the run's default dataset.
for paper in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{paper['title']} ({paper['citationCount']} citations)")
    if paper.get("tldr"):
        print(f"  TLDR: {paper['tldr']}")
```
JavaScript:
```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

// Start the actor and wait for the run to finish.
const run = await client.actor("ryanclinton/semantic-scholar-search").call({
    query: "large language models",
    yearFrom: 2023,
    minCitations: 50,
    sortBy: "citationCount",
    maxResults: 100,
});

// Read the results from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((paper) => {
    console.log(`${paper.title} (${paper.citationCount} citations)`);
    if (paper.tldr) console.log(`  TLDR: ${paper.tldr}`);
});
```
cURL:
```bash
curl "https://api.apify.com/v2/acts/ryanclinton~semantic-scholar-search/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "query": "large language models",
    "yearFrom": 2023,
    "minCitations": 50,
    "sortBy": "citationCount",
    "maxResults": 100
  }'
```
Output example
Each paper in the output dataset contains the following structure:
```json
{
  "paperId": "204e3073870fae3d05bcbc2f6a8e263d9b72e776",
  "title": "Attention Is All You Need",
  "year": 2017,
  "publicationDate": "2017-06-12",
  "citationCount": 124500,
  "referenceCount": 40,
  "influentialCitationCount": 15230,
  "isOpenAccess": true,
  "openAccessPdfUrl": "https://arxiv.org/pdf/1706.03762.pdf",
  "doi": "10.48550/arXiv.1706.03762",
  "arxivId": "1706.03762",
  "pmid": null,
  "authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin",
  "authorIds": ["1846258", "1857797", "47269835", "2516777", "144783904", "1857998", "1741101", "47558326"],
  "venue": "Neural Information Processing Systems",
  "journalName": null,
  "journalVolume": null,
  "journalPages": null,
  "abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms...",
  "tldr": "A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, is proposed and achieves state-of-the-art results on English-to-German and English-to-French translation tasks.",
  "fieldsOfStudy": ["Computer Science"],
  "publicationTypes": ["Conference", "JournalArticle"],
  "semanticScholarUrl": "https://www.semanticscholar.org/paper/204e3073870fae3d05bcbc2f6a8e263d9b72e776",
  "extractedAt": "2026-02-17T10:30:00.000Z"
}
```
Output fields reference
| Field | Type | Description |
|---|---|---|
| `paperId` | String | Semantic Scholar unique paper identifier (40-character hash) |
| `title` | String | Full paper title |
| `year` | Integer | Publication year (may be null for preprints) |
| `publicationDate` | String | ISO date string (e.g., 2023-06-15), null if unknown |
| `citationCount` | Integer | Total number of citing papers in Semantic Scholar |
| `referenceCount` | Integer | Number of papers cited by this paper |
| `influentialCitationCount` | Integer | Citations where this paper significantly influenced the citing work |
| `isOpenAccess` | Boolean | Whether a free PDF is available |
| `openAccessPdfUrl` | String | Direct URL to the open access PDF, null if not available |
| `doi` | String | Digital Object Identifier, null if not assigned |
| `arxivId` | String | ArXiv preprint identifier (e.g., 2301.12345), null if not on ArXiv |
| `pmid` | String | PubMed identifier, null if not indexed in PubMed |
| `authors` | String | Comma-separated list of author names |
| `authorIds` | Array | Semantic Scholar author IDs for programmatic author lookups |
| `venue` | String | Publication venue name (conference or journal), null if unknown |
| `journalName` | String | Journal name if published in a journal, null otherwise |
| `journalVolume` | String | Journal volume number, null if not applicable |
| `journalPages` | String | Page range in the journal, null if not applicable |
| `abstract` | String | Full paper abstract, null if not available |
| `tldr` | String | AI-generated one-sentence summary from Semantic Scholar, null if not generated |
| `fieldsOfStudy` | Array | Academic disciplines (e.g., ["Computer Science", "Mathematics"]) |
| `publicationTypes` | Array | Publication types (e.g., ["Conference"], ["JournalArticle"], ["Review"]) |
| `semanticScholarUrl` | String | Direct link to the paper's Semantic Scholar page |
| `extractedAt` | String | ISO timestamp of when the data was extracted |
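Because `authors` is a comma-separated string while `authorIds` is an array, pairing a name with its ID takes one extra step. The following is a minimal sketch, assuming the names and IDs appear in the same order and that no author name contains a comma; neither is guaranteed by this README, and `pair_authors` is a hypothetical helper, not part of the actor.

```python
from typing import List, Optional, Tuple


def pair_authors(paper: dict) -> List[Tuple[str, Optional[str]]]:
    """Pair each author name with its Semantic Scholar author ID.

    Assumption: names and IDs are listed in the same order. If the two
    lists differ in length, the IDs are dropped rather than guessed.
    """
    names = [n.strip() for n in (paper.get("authors") or "").split(",") if n.strip()]
    ids = paper.get("authorIds") or []
    if len(names) != len(ids):
        return [(name, None) for name in names]
    return list(zip(names, ids))


sample = {
    "authors": "Ashish Vaswani, Noam Shazeer",
    "authorIds": ["1846258", "1857797"],
}
for name, author_id in pair_authors(sample):
    print(name, author_id)
```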
How it works
The actor follows a straightforward pipeline to search, paginate, transform, and output paper data:
```
                  Semantic Scholar Academic Graph API
                  ===================================

[Input Query + Filters]
         |
         v
+------------------+     offset=0      +---------------------------+
| Build URL with   | ----------------> | api.semanticscholar.org   |
| 17 explicit      |     100/page      | /graph/v1/paper/search    |
| field params     | <---------------- | (free, no key required)   |
+------------------+     JSON page     +---------------------------+
         |                                          ^
         |  1.1s delay between pages                |
         |  5s wait + retry on 429                  |
         +--------- next page? --------> offset += 100
         |          (until maxResults or offset >= 1000)
         v
+------------------+
| Transform:       |
| - Flatten IDs    |  DOI, ArXiv, PubMed extracted from externalIds
| - Extract TLDR   |  AI summary from tldr.text
| - Format authors |  Joined names + separate ID array
| - Build S2 URL   |  Direct link to paper page
+------------------+
         |
         v
+------------------+
| Push to Apify    |  Flat JSON objects, one per paper
| Dataset          |  + citation/field/TLDR summary stats in log
+------------------+
```
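The pagination step in the pipeline above can be sketched as a small generator. This is an illustration only, assuming the 100-papers-per-page size and the 1,000 offset cap described in this README; `plan_pages` is a hypothetical name, and whether the actor requests a shorter final page or truncates a full one is not stated here.

```python
from typing import Iterator, Tuple

PAGE_SIZE = 100    # papers per page, per the pipeline description
MAX_OFFSET = 1000  # offset cap described in this README


def plan_pages(max_results: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, limit) pairs covering up to max_results papers."""
    target = min(max_results, MAX_OFFSET)
    offset = 0
    while offset < target:
        # Assumption: the last page asks only for the remaining papers.
        limit = min(PAGE_SIZE, target - offset)
        yield offset, limit
        offset += limit


print(list(plan_pages(250)))  # [(0, 100), (100, 100), (200, 50)]
```

Requesting more than 1,000 results yields only ten pages, which mirrors the per-run cap.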
Field selection
The actor requests 17 specific data fields from the Semantic Scholar API in a single fields parameter. This explicit field selection ensures you get the maximum available metadata per paper without making additional per-paper API calls. The requested fields include title, year, citationCount, referenceCount, influentialCitationCount, isOpenAccess, openAccessPdf, externalIds, publicationTypes, journal, authors, abstract, fieldsOfStudy, s2FieldsOfStudy, publicationVenue, publicationDate, and tldr.
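As an illustration of what such a request might look like, here is a minimal sketch that joins the 17 field names listed above into one `fields` parameter. The endpoint path comes from the pipeline description; the `query`, `offset`, and `limit` parameter names and the `build_search_url` helper are assumptions for this example, not the actor's confirmed implementation.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"

# The 17 fields named in this section.
FIELDS = [
    "title", "year", "citationCount", "referenceCount",
    "influentialCitationCount", "isOpenAccess", "openAccessPdf",
    "externalIds", "publicationTypes", "journal", "authors", "abstract",
    "fieldsOfStudy", "s2FieldsOfStudy", "publicationVenue",
    "publicationDate", "tldr",
]


def build_search_url(query: str, offset: int = 0, limit: int = 100) -> str:
    """Build a search URL that requests every field in one call."""
    params = {
        "query": query,
        "offset": offset,
        "limit": limit,
        "fields": ",".join(FIELDS),  # one comma-separated parameter
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(build_search_url("large language models"))
```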
Rate limiting and 429 retry
The Semantic Scholar public API allows 1 request per second without an API key. The actor enforces a 1.1-second delay between page requests to stay within this limit. If the API returns a 429 (Too Many Requests) response, the actor waits 5 seconds before retrying the same request. This retry loop continues until the request succeeds, so transient rate limit hits never cause the run to fail.
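The delay-and-retry behavior described above might look roughly like the sketch below. It is not the actor's source: `fetch_with_retry` and `fetch_pages` are hypothetical helpers, and the HTTP call is abstracted as a callable returning `(status, body)` so the timing logic can be read (and tested) without network access.

```python
import time
from typing import Callable, Iterable, List, Tuple

PAGE_DELAY_S = 1.1  # pause between page requests
RETRY_WAIT_S = 5.0  # pause after a 429 response

Fetch = Callable[[], Tuple[int, dict]]


def fetch_with_retry(fetch: Fetch, sleep: Callable[[float], None] = time.sleep) -> dict:
    """Repeat fetch() until it stops returning 429, waiting between tries."""
    while True:
        status, body = fetch()
        if status == 429:
            sleep(RETRY_WAIT_S)  # fixed wait, then retry the same request
            continue
        if status != 200:
            raise RuntimeError(f"unexpected HTTP status {status}")
        return body


def fetch_pages(fetchers: Iterable[Fetch], sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Fetch pages in order, pausing between consecutive page requests."""
    pages = []
    for index, fetch in enumerate(fetchers):
        if index > 0:
            sleep(PAGE_DELAY_S)
        pages.append(fetch_with_retry(fetch, sleep))
    return pages
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets a caller substitute a recorder in tests. Because the loop is unbounded, as described above, a production variant may want a cap on attempts.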
Year filter syntax
The Semantic Scholar API accepts year ranges in three formats:
- `2023-2025` -- papers published between 2023 and 2025 inclusive
- `2023-` -- papers published from 2023 onward (open-ended upper bound)
- `-2025` -- papers published up to and including 2025 (open-ended lower bound)
The actor constructs the correct format automatically based on which of yearFrom and yearTo you provide.
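A minimal sketch of that mapping, assuming both inputs are optional integers; `build_year_filter` is a hypothetical helper name and the validation of reversed ranges is an addition for the example, not documented actor behavior.

```python
from typing import Optional


def build_year_filter(year_from: Optional[int] = None,
                      year_to: Optional[int] = None) -> Optional[str]:
    """Return the year range string, or None when no bound is given."""
    if year_from is None and year_to is None:
        return None
    if year_from is not None and year_to is not None:
        if year_from > year_to:
            raise ValueError("yearFrom must not be later than yearTo")
        return f"{year_from}-{year_to}"  # bounded range
    if year_from is not None:
        return f"{year_from}-"           # open-ended upper bound
    return f"-{year_to}"                 # open-ended lower bound


print(build_year_filter(2023, 2025))  # 2023-2025
print(build_year_filter(2023))        # 2023-
print(build_year_filter(None, 2025))  # -2025
```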
External ID extraction
Each paper from the API may include an externalIds object containing DOI, ArXiv, PubMed, and other identifiers. The actor flattens these into top-level doi, arxivId, and pmid fields so you can directly cross-reference results with other databases (Crossref, ArXiv, PubMed) without nested object parsing.
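The flattening step could be sketched as follows. The key names `DOI`, `ArXiv`, and `PubMed` inside `externalIds` are assumed from the identifiers this section names, and `flatten_external_ids` is a hypothetical helper rather than the actor's code.

```python
from typing import Dict, Optional


def flatten_external_ids(paper: dict) -> Dict[str, Optional[str]]:
    """Lift DOI, ArXiv, and PubMed IDs to top-level fields (None if absent)."""
    # externalIds may be missing or null, so fall back to an empty dict.
    external = paper.get("externalIds") or {}
    return {
        "doi": external.get("DOI"),
        "arxivId": external.get("ArXiv"),
        "pmid": external.get("PubMed"),
    }


raw = {"externalIds": {"DOI": "10.48550/arXiv.1706.03762", "ArXiv": "1706.03762"}}
print(flatten_external_ids(raw))
```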
TLDR generation
Semantic Scholar uses a trained machine learning model (SciTLDR) to generate one-sentence summaries for papers in its index. These are returned in the tldr field. Not every paper has a TLDR -- the model needs sufficient abstract text to generate a summary. The actor reports how many papers in the result set include a TLDR in the run log.
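A rough sketch of the extraction and the coverage count, assuming the API returns the summary as a nested object with a `text` key (as the pipeline description suggests) or as null; both function names are hypothetical.

```python
from typing import Iterable, Optional


def extract_tldr(paper: dict) -> Optional[str]:
    """Return the summary text from a raw API paper, or None if absent."""
    tldr = paper.get("tldr")
    if isinstance(tldr, dict):
        return tldr.get("text")
    return None


def count_with_tldr(papers: Iterable[dict]) -> int:
    """Count the papers that carry a non-empty summary."""
    return sum(1 for paper in papers if extract_tldr(paper))


raw_papers = [
    {"title": "A", "tldr": {"text": "One-sentence summary."}},
    {"title": "B", "tldr": None},
    {"title": "C"},
]
print(f"{count_with_tldr(raw_papers)} of {len(raw_papers)} papers have a TLDR")
```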
Influential vs. total citations
Total citationCount includes every paper that references the work, including superficial mentions. The influentialCitationCount metric, unique to Semantic Scholar, uses a trained classifier to identify citations where the cited paper had a significant impact on the citing paper's methodology, experiments, or conclusions. A paper with a high influential citation ratio relative to its total citations is generally considered more foundational to its field.
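To compare papers on this basis, the ratio can be computed directly from the two output fields. This is a small sketch with made-up citation numbers; `influential_ratio` is a hypothetical helper and the ratio itself is not a field the actor returns.

```python
from typing import Optional


def influential_ratio(paper: dict) -> Optional[float]:
    """Share of citations that are influential, or None with no citations."""
    total = paper.get("citationCount") or 0
    if total == 0:
        return None  # avoid dividing by zero for uncited papers
    return (paper.get("influentialCitationCount") or 0) / total


# Illustrative numbers only, not real papers.
papers = [
    {"title": "Widely mentioned", "citationCount": 500, "influentialCitationCount": 10},
    {"title": "Foundational", "citationCount": 200, "influentialCitationCount": 80},
]
ranked = sorted(papers, key=lambda p: influential_ratio(p) or 0.0, reverse=True)
for paper in ranked:
    print(paper["title"], influential_ratio(paper))
```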
How much does it cost to run?
Semantic Scholar Paper Search is lightweight -- it uses only 256 MB of memory and makes HTTP API calls without any browser rendering. The Semantic Scholar API itself is completely free (no API key or subscription required).
| Scenario | Papers | Run time | Apify cost (approx.) |
|---|---|---|---|
| Quick search | 50 | ~60 seconds | $0.001 -- $0.005 |
| Medium batch | 200 | ~3 minutes | $0.005 -- $0.01 |
| Full extraction | 1,000 | ~12 minutes | $0.01 -- $0.03 |
Run times scale linearly with result count due to the 1-request-per-second rate limit (100 papers per page, 1.1 seconds between pages). The majority of the cost comes from the Apify platform compute time at 256 MB memory.
Limitations and responsible use
- 1,000 paper maximum per run -- the Semantic Scholar API enforces a maximum offset of 1,000. To retrieve more papers on a broad topic, run multiple searches with non-overlapping year ranges or additional filters.
- Search query is required -- unlike some academic APIs, Semantic Scholar's search endpoint requires a query string. You cannot browse all papers without a search term.
- Rate limiting -- the public API tier allows 1 request per second. The actor respects this limit automatically, but run times scale linearly with result count.
- TLDR availability -- AI-generated summaries are not available for every paper. Older papers and those with very short abstracts may lack a TLDR.
- Field of study coverage -- filtering supports 10 top-level disciplines. More granular sub-field filtering is not available through this endpoint.
- Data freshness -- Semantic Scholar continuously indexes new papers, but there may be a delay of days to weeks before very recent publications appear in search results.
- Respect the API -- this actor is designed for legitimate research and data analysis. Avoid scheduling extremely frequent runs with maximum result counts, as this consumes shared public API resources.
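The workaround for the 1,000-paper cap (multiple searches with non-overlapping year ranges) can be sketched as below. The input keys come from the input parameters table; `split_by_year` and `merge_unique` are hypothetical helpers, and one run per calendar year is just one possible way to partition a topic.

```python
from typing import Dict, Iterable, List


def split_by_year(base_input: dict, year_from: int, year_to: int) -> List[dict]:
    """Create one actor input per calendar year in the inclusive range."""
    runs = []
    for year in range(year_from, year_to + 1):
        run_input = dict(base_input)  # copy so the base input is untouched
        run_input.update({"yearFrom": year, "yearTo": year, "maxResults": 1000})
        runs.append(run_input)
    return runs


def merge_unique(batches: Iterable[Iterable[dict]]) -> List[dict]:
    """Merge result batches, keeping the first record seen per paperId."""
    seen: Dict[str, dict] = {}
    for batch in batches:
        for paper in batch:
            seen.setdefault(paper["paperId"], paper)
    return list(seen.values())


for run_input in split_by_year({"query": "large language models"}, 2022, 2024):
    print(run_input)
```

Each generated input can then be passed to a separate actor run, spaced out in time in keeping with the "Respect the API" note above.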
FAQ
Do I need a Semantic Scholar API key to use this actor?
No. The actor uses the free public API tier, which does not require any API key or authentication. It automatically respects the public rate limit of 1 request per second and handles 429 responses with retry logic.
What is the maximum number of papers I can retrieve in one run?
You can retrieve up to 1,000 papers per run. This is a hard limit of the Semantic Scholar API's offset parameter. To cover more ground, run multiple searches with different year ranges, venues, or field-of-study filters.
What are "influential citations" and how are they different from regular citations?
Influential citation count is a Semantic Scholar metric computed by a trained classifier. It identifies citations where the cited paper had a significant impact on the citing paper's methodology, experiments, or conclusions -- as opposed to superficial mentions in related-work sections. A paper with 200 total citations and 80 influential citations is likely more foundational than one with 500 total citations and only 10 influential citations.
What does the TLDR field contain?
The tldr field contains an AI-generated one-sentence summary produced by Semantic Scholar's SciTLDR model. It distills the paper's main contribution or finding into a single sentence. Not every paper has a TLDR -- it depends on whether the model could generate a quality summary from the abstract.
Can I search for a specific author's papers?
This actor searches by keyword across titles and abstracts, not by author ID. You can include an author name in the query (e.g., "Yoshua Bengio" deep learning) to find papers mentioning that author, but for comprehensive author-based retrieval, the Semantic Scholar Author API endpoint would be more appropriate.
How do I cross-reference results with other academic databases?
Each paper includes doi, arxivId, and pmid fields when available. Use the DOI to look up the paper in Crossref or the publisher's site, the ArXiv ID to find it on arxiv.org, and the PubMed ID to locate it in PubMed/MEDLINE. These identifiers make it straightforward to merge Semantic Scholar data with results from other actors in this suite.
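As a small illustration, the three identifiers can be turned into lookup links. The URL patterns for doi.org, arxiv.org, and PubMed are the commonly used public ones and are assumptions of this sketch rather than something the actor outputs; `cross_reference_urls` is a hypothetical helper.

```python
from typing import Dict


def cross_reference_urls(paper: dict) -> Dict[str, str]:
    """Build lookup URLs for whichever identifiers the paper has."""
    urls = {}
    if paper.get("doi"):
        urls["doi"] = f"https://doi.org/{paper['doi']}"
    if paper.get("arxivId"):
        urls["arxiv"] = f"https://arxiv.org/abs/{paper['arxivId']}"
    if paper.get("pmid"):
        urls["pubmed"] = f"https://pubmed.ncbi.nlm.nih.gov/{paper['pmid']}/"
    return urls


paper = {"doi": "10.48550/arXiv.1706.03762", "arxivId": "1706.03762", "pmid": None}
print(cross_reference_urls(paper))
```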
Related actors
| Actor | Database | Coverage | Best for |
|---|---|---|---|
| OpenAlex Research Search | OpenAlex | 250M+ works, fully open metadata | Broad bibliometric analysis with open data |
| Crossref Academic Paper Search | Crossref | 150M+ DOI records | DOI metadata, publisher information, citation links |
| PubMed Biomedical Literature Search | PubMed/MEDLINE | 36M+ biomedical citations | Medical and life science research |
| ArXiv Preprint Paper Search | ArXiv | 2.4M+ preprints | Pre-publication CS, physics, math papers |
| CORE Open Access Papers | CORE | 300M+ metadata records | Open access full-text aggregation |
| Europe PMC Literature Search | Europe PMC | 45M+ life science records | European biomedical and life science literature |

