Semantic Scholar Scraper
Extract detailed academic paper data from Semantic Scholar, including abstracts, citations, authors, and publication details. Ideal for researchers, academics, and analysts who need structured scholarly data for literature reviews, research workflows, and large-scale academic analysis.

Pricing: Pay per event

Rating: 5.0 (1 review)

Developer: ParseForge (Maintained by Community)

Actor stats: 0 bookmarks, 2 total users, 1 monthly active user, last modified 3 days ago

πŸ“š Semantic Scholar Scraper

πŸš€ Supercharge your academic research with our comprehensive Semantic Scholar scraper! Automate collection of detailed academic paper data with advanced filtering capabilities.

Extract comprehensive academic paper data from Semantic Scholar - one of the world's largest academic search engines. Search and collect detailed information about research papers including abstracts, citations, authors, publication details, and more. Perfect for researchers, academics, and data analysts who need structured academic data for literature reviews, research analysis, and academic intelligence gathering.

Target Audience: Researchers, academics, data analysts, students, librarians, research institutions
Primary Use Cases: Literature reviews, research analysis, academic intelligence gathering, citation analysis, publication tracking

What Does Semantic Scholar Scraper Do?

This tool collects detailed academic paper data from Semantic Scholar, supporting both search query-based scraping and custom URL scraping. It delivers:

  • Complete paper metadata - Titles, abstracts, TLDR summaries, publication dates, and years
  • Author information - Author names, IDs, affiliations, and author profile links
  • Citation metrics - Citation counts, reference counts, and influential citation counts
  • Publication details - Venues, journals, publication types, and open access status
  • Access information - Direct PDF links, paper URLs, DOI identifiers, and external IDs (ArXiv, PubMed, etc.)
  • Research classifications - Fields of study and research areas
  • Journal information - Journal names, volumes, and page numbers
  • And much more - Comprehensive academic intelligence in one scrape

Business Value: Make informed research decisions, track academic trends, and identify relevant papers with comprehensive, up-to-date academic intelligence that saves hours of manual research.

How to use the Semantic Scholar Scraper - Full Demo

[YouTube video embed or link]

Watch this 3-minute demo to see how easy it is to get started!

Input

To start Semantic Scholar web scraping, simply fill in the input form. You can scrape Semantic Scholar using two different methods (choose one):

Method 1: Search Query-Based Scraping πŸ”

  • searchQuery - Enter a research topic or paper title (e.g., "machine learning", "neural networks", "quantum computing")
    • Required if startUrl is not provided
    • Prefill value: "machine learning"
  • yearMin - Filter papers published on or after this year (optional)
    • Example: 2020
  • yearMax - Filter papers published on or before this year (optional)
    • Example: 2024
  • hasPdf - Only include papers that have an open access PDF available (optional)
    • Checkbox option, default: false
  • maxItems - Set the maximum number of papers to collect (up to 1,000,000). Leave empty for unlimited. Prefill value: 10.

Suggestion-Based Filters (Note: These are treated as suggestions by the API, not strict filters):

  • author - Filter papers by author name (optional)
    • Example: "John Smith"
    • Note: Results may include papers that match the search query but may not have the specified author
  • venues - Filter papers by publication venue (journal or conference) (optional)
    • Example: "Nature", "IEEE"
    • Note: Results may include papers that match the search query but may not match the specified venue

The scraper will automatically search Semantic Scholar and collect all matching papers.

Method 2: Custom URL Scraping πŸ”—

  • startUrl - Use Semantic Scholar search URLs in this format:
    • Required if searchQuery is not provided
    • Cannot be used together with searchQuery or any other API filters
    • Example: https://www.semanticscholar.org/search?q=machine+learning&sort=relevance
  • maxItems - Set the maximum number of papers to collect (up to 1,000,000). Leave empty for unlimited. Prefill value: 10.

βœ… Supported URL Formats:

Search Results Pages:

  • https://www.semanticscholar.org/search?q=machine+learning&sort=relevance
  • https://www.semanticscholar.org/search?q=neural+networks&year=2020-2024
  • https://www.semanticscholar.org/search?q=quantum+computing&openAccessPdf=

⚠️ Important Input Rules:

  1. Choose One Method: You must use either search query-based scraping OR custom URL scraping, not both
  2. Required Fields:
    • Either searchQuery OR startUrl must be provided
  3. Mutual Exclusivity:
    • If using startUrl, you cannot use searchQuery or any other API filters
    • If using searchQuery, you cannot use startUrl
  4. Suggestion-Based Filters: author and venues are treated as suggestions by the API and may not strictly filter results
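
For programmatic use, these rules can be checked before starting a run. Below is a minimal sketch; `validateInput` is a hypothetical helper for illustration, not part of the Actor itself:

```javascript
// Hypothetical pre-flight check of the input rules above.
// Returns a list of human-readable problems; an empty list means the input is valid.
function validateInput(input) {
    const errors = [];
    const hasQuery = Boolean(input.searchQuery);
    const hasUrl = Boolean(input.startUrl);

    // Rule 2: either searchQuery or startUrl must be provided.
    if (!hasQuery && !hasUrl) {
        errors.push('Provide either searchQuery or startUrl.');
    }

    // Rules 1 & 3: the two methods are mutually exclusive.
    if (hasQuery && hasUrl) {
        errors.push('searchQuery and startUrl cannot be used together.');
    }

    // Rule 3: startUrl cannot be combined with any other API filter.
    const apiFilters = ['yearMin', 'yearMax', 'hasPdf', 'author', 'venues'];
    if (hasUrl && apiFilters.some((key) => input[key] !== undefined)) {
        errors.push('startUrl cannot be combined with other API filters.');
    }

    return errors;
}
```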

Here's what the filled-out input configuration looks like in JSON:

Example 1: Search Query-Based Scraping

{
  "searchQuery": "machine learning",
  "yearMin": 2020,
  "yearMax": 2024,
  "hasPdf": true,
  "maxItems": 50
}

{
  "searchQuery": "neural networks",
  "yearMin": 2020,
  "yearMax": 2024,
  "hasPdf": true,
  "maxItems": 100
}

Example 2: Custom URL Scraping

{
  "startUrl": "https://www.semanticscholar.org/search?q=machine+learning&sort=relevance",
  "maxItems": 50
}

Example 3: Advanced Search with Filters

{
  "searchQuery": "quantum computing",
  "yearMin": 2022,
  "hasPdf": true,
  "author": "John Smith",
  "maxItems": 200
}

Pro Tips:

For Search Query-Based Scraping (Recommended):

  1. 🎯 Be specific with queries - Use precise research terms for best results
  2. πŸ“… Filter by year range - Focus on recent papers or specific time periods
  3. πŸ“„ Use PDF filter - Get only papers with available PDFs for easier access
  4. ⚑ Faster than manual search - No need to browse through multiple pages manually

For Custom URL Scraping:

  1. Go to Semantic Scholar
  2. Use the search functionality to find papers on your topic
  3. Apply any filters you want (year, open access, etc.)
  4. Copy the URL and paste it into the startUrl field

Output

After the Actor finishes its run, you'll get a dataset with the output. The size of the dataset depends on the number of results you've set. You can download the results as an Excel, HTML, XML, JSON, or CSV document.

Here's an example of scraped Semantic Scholar data you'll get if you decide to scrape academic papers:

{
  "paperId": "1234567890",
  "title": "Deep Learning for Natural Language Processing: A Comprehensive Survey",
  "authors": [
    {
      "name": "John Smith",
      "url": "https://www.semanticscholar.org/author/123456",
      "authorId": "123456",
      "affiliations": ["Stanford University"]
    },
    {
      "name": "Jane Doe",
      "url": "https://www.semanticscholar.org/author/789012",
      "authorId": "789012",
      "affiliations": ["MIT"]
    }
  ],
  "year": 2023,
  "publicationVenue": "Nature Machine Intelligence",
  "publicationDate": "2023-05-15",
  "abstract": "This paper presents a comprehensive survey of deep learning techniques for natural language processing...",
  "tldr": "A survey of deep learning methods for NLP tasks including transformers, attention mechanisms, and pre-trained models.",
  "citationCount": 245,
  "referenceCount": 89,
  "influentialCitationCount": 12,
  "isOpenAccess": true,
  "hasPdf": true,
  "detailUrl": "https://www.semanticscholar.org/paper/1234567890",
  "pdfUrl": "https://example.com/paper.pdf",
  "doi": "10.1038/s42256-023-00123-4",
  "corpusId": "1234567890",
  "externalIds": {
    "DOI": "10.1038/s42256-023-00123-4",
    "ArXiv": "2305.12345",
    "PubMed": "12345678",
    "PubMedCentral": "PMC1234567",
    "MAG": "123456789",
    "ACL": "2023.acl-main.123",
    "DBLP": "conf/nature/2023",
    "CorpusId": "1234567890"
  },
  "fieldsOfStudy": ["Computer Science", "Machine Learning", "Natural Language Processing"],
  "s2FieldsOfStudy": ["Computer Science"],
  "publicationTypes": ["JournalArticle"],
  "journal": {
    "name": "Nature Machine Intelligence",
    "volume": "5",
    "pages": "123-145"
  },
  "scrapedTimestamp": "2025-01-12T23:29:22.172Z"
}

What You Get:

  • πŸ“„ Complete Paper Information - Titles, abstracts, and TLDR summaries for quick understanding
  • πŸ‘₯ Detailed Author Data - Author names, IDs, affiliations, and profile links
  • πŸ“Š Citation Metrics - Total citations, references, and influential citation counts
  • πŸ”— Access Links - Direct PDF links, paper URLs, and DOI identifiers
  • πŸ›οΈ Publication Details - Venues, journals, publication types, and open access status
  • πŸ” Research Classifications - Fields of study and research areas
  • πŸ“š External Identifiers - ArXiv, PubMed, ACL, DBLP, and other database IDs
  • πŸ“… Publication Metadata - Years, dates, journal volumes, and page numbers

Download Options: CSV, Excel, or JSON formats for easy analysis in your research tools
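
Once exported as JSON, the records are easy to post-process with a few lines of code. A minimal sketch, assuming the field names shown in the output example above (`topCitedOpenAccess` is a hypothetical helper, not part of the Actor):

```javascript
// Hypothetical post-processing of downloaded JSON results:
// keep open-access papers with a PDF link and rank them by citation count.
function topCitedOpenAccess(papers, limit = 10) {
    return papers
        .filter((p) => p.isOpenAccess && p.pdfUrl)
        .sort((a, b) => b.citationCount - a.citationCount)
        .slice(0, limit)
        .map((p) => ({ title: p.title, citations: p.citationCount, pdf: p.pdfUrl }));
}
```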

Why Choose the Semantic Scholar Scraper?

  • 🎯 Comprehensive Data: Get all available paper information in one scrape - citations, abstracts, authors, and more
  • πŸ” Flexible Search: Search by query or use custom URLs with advanced filtering options
  • πŸ“… Year Filtering: Filter papers by publication year range for targeted research
  • πŸ“„ PDF Access: Filter for papers with available PDFs for easier access
  • πŸ‘₯ Author Information: Get complete author details including affiliations and profile links
  • πŸ“Š Citation Metrics: Access citation counts, reference counts, and influential citation metrics
  • πŸ”— Multiple Identifiers: Get DOI, ArXiv, PubMed, and other external database IDs
  • 🚫 No Duplicates: Automatically skips papers already in your dataset
  • ⚑ User-Friendly: No coding needed - just input your search query and go
  • πŸ”„ Sequential Processing: Processes papers one by one for maximum data quality

Time Savings: Save 4-6 hours per week compared to manual paper research
Cost Efficiency: Fraction of the cost of hiring a research assistant or using expensive academic databases

How to Use

  1. Sign Up: Create a free account with $5 in credit (takes 2 minutes)
  2. Find the Scraper: Visit the Semantic Scholar Scraper page
  3. Set Input:
    • Free users: Can process up to 50 items (maxItems required, maximum value: 50)
    • Paid users:
      • Option A (Recommended): Enter a search query and apply filters (year range, PDF availability, etc.)
      • Option B: Add your custom Semantic Scholar search URL
      • Set max items (optional, prefill value: 10, up to 1,000,000)
  4. Run It: Click "Start" and let it collect your data
  5. Download Data: Get your results in the "Dataset" tab as CSV, Excel, or JSON

Total Time: 3 minutes setup, 10-30 minutes for data collection
No Technical Skills Required: Everything is point-and-click

Business Use Cases

πŸ”¬ Researchers:

  • Conduct comprehensive literature reviews
  • Track citations and research impact
  • Find relevant papers for research projects
  • Monitor new publications in your field

πŸ‘¨β€πŸ« Academics:

  • Build reference databases for courses
  • Track publication trends in your discipline
  • Identify collaboration opportunities
  • Analyze research impact metrics

πŸ“š Librarians:

  • Build comprehensive paper collections
  • Support researchers with data access
  • Track publication trends and patterns
  • Create subject-specific databases

πŸ“Š Data Analysts:

  • Analyze academic publication trends
  • Build research intelligence databases
  • Track citation networks
  • Support policy decisions with data

πŸŽ“ Students:

  • Find papers for thesis and dissertation research
  • Build comprehensive reference lists
  • Track citations for academic writing
  • Discover relevant research in your field

Using Semantic Scholar Scraper with the Apify API

For advanced users who want to automate this process, you can control the scraper programmatically with the Apify API. This allows you to schedule regular data collection and integrate with your existing research tools.

Example API Usage:

// Node.js example using the apify-client package
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_API_TOKEN',
});

// Run with a search query
await client.actor('YOUR_ACTOR_ID').call({
    searchQuery: 'machine learning',
    yearMin: 2020,
    yearMax: 2024,
    hasPdf: true,
    maxItems: 100,
});

// Run with a custom URL
await client.actor('YOUR_ACTOR_ID').call({
    startUrl: 'https://www.semanticscholar.org/search?q=machine+learning&sort=relevance',
    maxItems: 50,
});

  • Node.js: Install the apify-client NPM package
  • Python: Use the apify-client PyPI package
  • See the Apify API reference for full details

Frequently Asked Questions

Q: How does it work? A: Semantic Scholar Scraper is easy to use and requires no technical knowledge. Simply enter your search query or paste a Semantic Scholar URL, configure your filters, and let the tool collect the data automatically.

Q: How accurate is the data? A: We collect data directly from Semantic Scholar's official API in real-time, ensuring the most up-to-date and accurate academic paper information available.

Q: Can I filter by specific authors or venues? A: Yes! You can use the author and venues filters. Note that these are treated as suggestions by the Semantic Scholar API, so results may include papers that match your search query but may not strictly match these filters.

Q: What URL formats are supported? A: We support Semantic Scholar search URLs. See the Input section for specific examples.

Q: Can I schedule regular runs? A: Yes! Use the Apify API to schedule daily, weekly, or monthly runs automatically. Perfect for ongoing research monitoring and publication tracking.

Q: What if I need help? A: Our support team is available 24/7. Contact us through the Apify platform.

Q: Is my data secure? A: Absolutely. All data is encrypted in transit and at rest. We never share your data with third parties.

Q: Are there limits for free users? A: Free users can process up to 50 items per run (maxItems is required and must be 50 or less). Paid users can process up to 1,000,000 items per run.

Integrate Semantic Scholar Scraper with any app and automate your workflow

Last but not least, Semantic Scholar Scraper can be connected with almost any cloud service or web app thanks to integrations on the Apify platform.

Alternatively, you can use webhooks to carry out an action whenever an event occurs, e.g. get a notification whenever Semantic Scholar Scraper successfully finishes a run.
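
As a sketch of what such a webhook might look like, the fragment below follows Apify's webhook fields (eventTypes, condition, requestUrl); the requestUrl is a placeholder you would replace with your own endpoint:

```json
{
  "eventTypes": ["ACTOR.RUN.SUCCEEDED"],
  "condition": { "actorId": "YOUR_ACTOR_ID" },
  "requestUrl": "https://example.com/apify-notifications"
}
```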

Looking for more data collection tools? Check out these related actors:

  • GSA eLibrary Scraper - Collects government publication data from GSA eLibrary - https://apify.com/parseforge/gsa-elibrary-scraper
  • PR Newswire Scraper - Extracts press releases and news data from PR Newswire - https://apify.com/parseforge/pr-newswire-scraper
  • Hugging Face Model Scraper - Collects AI model data from Hugging Face - https://apify.com/parseforge/hugging-face-model-scraper
  • Hubspot Marketplace Scraper - Extracts business app data from HubSpot marketplace - https://apify.com/parseforge/hubspot-marketplace-scraper
  • Smart Apify Actor Scraper - Collects comprehensive actor data from Apify with quality metrics - https://apify.com/parseforge/smart-apify-actor-scraper

Pro Tip: πŸ’‘ Browse our complete collection of data collection actors to find the perfect tool for your business needs.

Need Help? Our support team is here to help you get the most out of this tool.


⚠️ Disclaimer: This Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by Semantic Scholar or any of its subsidiaries. All trademarks mentioned are the property of their respective owners.