Pricing

from $1.00 / 1,000 results

Thesis Literature Review Scraper

Turn any research topic into a clean reading list of peer-reviewed academic papers from OpenAlex, Semantic Scholar, and Crossref in one run. Includes citation-manager and spreadsheet exports, plus LLM-ready Markdown you can paste into ChatGPT or Claude for instant literature synthesis.

Rating

5.0

(20)

Developer

Leafy

Last modified

13 days ago

Thesis Literature Review Scraper — Multi-Source Academic Papers with Citations & LLM-Ready Output

Paste a research topic → get a de-duplicated, structured, LLM-ready list of peer-reviewed papers. Perfect for thesis literature reviews, RAG pipelines, and AI research assistants.

What it does

Given a research topic or keywords, this Actor queries up to three free scholarly databases in parallel (OpenAlex and Crossref by default, plus Semantic Scholar if you enable it), de-duplicates the results by DOI, merges enriched fields across sources, and returns a single clean dataset plus LLM-ready exports. The input panel comes pre-filled with a sample topic, so you can hit Run straight away to see how it works.
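The de-duplicate-and-merge step can be pictured with a minimal Python sketch. It is an illustration only, not the Actor's code: the `source` key on each raw record, the helper names, and the "first non-empty value wins" merge rule are all assumptions.

```python
from __future__ import annotations

from typing import Any, Optional

EMPTY = (None, "", [])


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip common prefixes so variants compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def merge_by_doi(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collapse records sharing a DOI; fill empty fields from later sources.

    Each raw record is assumed to carry a hypothetical "source" key naming
    the database it came from. Records without a DOI are passed through.
    """
    merged: dict[str, dict[str, Any]] = {}
    without_doi: list[dict[str, Any]] = []
    for raw in records:
        rec = dict(raw)
        source = rec.pop("source")
        key = normalize_doi(rec.get("doi"))
        if key is None:
            rec["sources"] = [source]
            without_doi.append(rec)
            continue
        if key not in merged:
            rec["doi"] = key
            rec["sources"] = [source]
            merged[key] = rec
            continue
        target = merged[key]
        if source not in target["sources"]:
            target["sources"].append(source)
        for field, value in rec.items():
            if field == "doi":
                continue
            # Assumed rule: keep the first non-empty value seen for a field.
            if target.get(field) in EMPTY and value not in EMPTY:
                target[field] = value
    return list(merged.values()) + without_doi
```

With this rule, a paper found in both OpenAlex and Crossref ends up as one record whose `sources` list names both databases.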

Why this exists

Writing a thesis lit review manually takes days of copy-paste across different sites like Google Scholar. Building a RAG chatbot over academic papers requires the same legwork. This Actor does it in one run.

Data sources

Source	Size	Auth	License
OpenAlex	250M+ works	none	CC0
Semantic Scholar	~200M papers	optional API key	ODC-BY
Crossref	160M+ records	none	CC0 metadata

All three are public, free JSON APIs that work without authentication (Semantic Scholar accepts an optional API key). We do not scrape Google Scholar.

How the results stay relevant

The Actor usually fetches a few thousand raw papers from the sources before de-duplicating. To make sure your final results are the most on-topic ones, every paper is then scored against your research topic using a classic text-relevance algorithm (BM25) over its title and abstract. Common words like "the" or "study" are ignored, rarer words from your query count for more, and query terms that appear in the title weigh heavier than the same terms in the abstract.

When sortBy is relevance (the default), the highest-scoring papers bubble to the top before the maxResults cap is applied. Pick citations or date instead and the ranking switches to those signals.
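To make the scoring idea concrete, here is a small BM25 sketch in Python. It is not the Actor's implementation: the stopword list, the constants `K1` and `B`, the `TITLE_WEIGHT` of 2.0, and the way title hits are folded into a weighted term frequency are all assumptions chosen for illustration.

```python
from __future__ import annotations

import math
import re
from collections import Counter

# Assumed values; the Actor's real stopword list and constants are not documented.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "for", "to", "study"}
K1, B, TITLE_WEIGHT = 1.5, 0.75, 2.0


def tokenize(text):
    """Lower-case, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", (text or "").lower())
            if t not in STOPWORDS]


def weighted_counts(paper):
    """Term frequencies where a title hit counts TITLE_WEIGHT times."""
    counts = Counter()
    for tok in tokenize(paper.get("title")):
        counts[tok] += TITLE_WEIGHT
    for tok in tokenize(paper.get("abstract")):
        counts[tok] += 1.0
    return counts


def bm25_rank(query, papers, max_results):
    """Score every paper against the query, sort, then apply the cap."""
    docs = [weighted_counts(p) for p in papers]
    lengths = [sum(d.values()) for d in docs]
    avg_len = (sum(lengths) / len(lengths)) if lengths else 1.0
    avg_len = avg_len or 1.0
    n = len(docs)
    terms = set(tokenize(query))
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scored = []
    for paper, doc, length in zip(papers, docs, lengths):
        score = 0.0
        for t in terms:
            tf = doc.get(t, 0.0)
            if tf == 0:
                continue
            # Rarer query terms get a larger inverse document frequency.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf + K1 * (1 - B + B * length / avg_len)
            score += idf * tf * (K1 + 1) / norm
        scored.append((score, paper))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [dict(p, relevanceScore=round(s, 3)) for s, p in scored[:max_results]]
```

The cap is applied only after sorting, which mirrors the description above: the highest-scoring papers survive the `maxResults` cut.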

Input

Field	Type	Default	Description
`query`	string (required)	pre-filled sample	Research topic or keywords (3–500 chars). Pre-filled with `artificial intelligence in higher education` so you can test-run without editing.
`yearFrom`	integer	2015	Earliest publication year.
`yearTo`	integer	current year	Latest publication year.
`maxResults`	integer	100	Total de-duplicated papers to return (10–1000).
`sources`	array	`["openalex", "crossref"]`	Which databases to query. Semantic Scholar is off by default because of its rate limits; enable it if you need it, ideally with the free API key described below.
`minCitations`	integer	0	Filter out papers with fewer citations.
`openAccessOnly`	boolean	false	Return only open-access papers.
`sortBy`	enum	`relevance`	`relevance` / `citations` / `date`.
`outputFormat`	array	`["bibtex", "markdown"]`	Extra export formats on top of JSON.
`contactEmail`	string	—	Optional — enables OpenAlex / Crossref "polite pool" for faster, prioritized access.
`semanticScholarApiKey`	string (secret)	—	Optional — bypasses Semantic Scholar's shared rate limit.
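One way to read the filter and sort fields together is the Python sketch below. It illustrates the documented behaviour under stated assumptions and is not the Actor's code: papers with no `year` are kept, missing citation counts are treated as 0, and `date` is assumed to mean newest first.

```python
from __future__ import annotations

import datetime
from typing import Any, Optional


def apply_filters(
    papers: list[dict[str, Any]],
    *,
    year_from: int = 2015,
    year_to: Optional[int] = None,
    min_citations: int = 0,
    open_access_only: bool = False,
    sort_by: str = "relevance",
    max_results: int = 100,
) -> list[dict[str, Any]]:
    """Filter by year, citations and open access, then sort and cap."""
    if year_to is None:
        year_to = datetime.date.today().year
    kept = []
    for paper in papers:
        year = paper.get("year")
        # Assumption: a paper with an unknown year is not filtered out.
        if year is not None and not (year_from <= year <= year_to):
            continue
        if (paper.get("citationCount") or 0) < min_citations:
            continue
        if open_access_only and not paper.get("isOpenAccess"):
            continue
        kept.append(paper)
    if sort_by == "citations":
        kept.sort(key=lambda p: p.get("citationCount") or 0, reverse=True)
    elif sort_by == "date":
        # ISO dates such as "2021-09-17" sort correctly as plain strings.
        kept.sort(key=lambda p: p.get("publicationDate") or "", reverse=True)
    else:
        kept.sort(key=lambda p: p.get("relevanceScore") or 0.0, reverse=True)
    return kept[:max_results]
```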

Example input

{
    "query": "impact of social media on adolescent mental health",
    "yearFrom": 2018,
    "yearTo": 2026,
    "maxResults": 75,
    "sortBy": "citations",
    "outputFormat": ["markdown", "bibtex", "csv"]
}
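If you start runs from code rather than the console, a sketch like the following may help. It assumes the official `apify-client` Python package is installed; `build_run_input` and `run_actor` are hypothetical helper names, and the Actor ID must be copied from the Store listing because it is not stated here. The validation limits come from the input table above.

```python
from __future__ import annotations

from typing import Any, Optional, Sequence

ALLOWED_SORT = {"relevance", "citations", "date"}


def build_run_input(
    query: str,
    *,
    year_from: int = 2015,
    year_to: Optional[int] = None,
    max_results: int = 100,
    sort_by: str = "relevance",
    output_format: Sequence[str] = ("bibtex", "markdown"),
) -> dict[str, Any]:
    """Build the Actor input, checking the limits listed in the input table."""
    if not 3 <= len(query) <= 500:
        raise ValueError("query must be 3-500 characters")
    if not 10 <= max_results <= 1000:
        raise ValueError("maxResults must be between 10 and 1000")
    if sort_by not in ALLOWED_SORT:
        raise ValueError(f"sortBy must be one of {sorted(ALLOWED_SORT)}")
    run_input: dict[str, Any] = {
        "query": query,
        "yearFrom": year_from,
        "maxResults": max_results,
        "sortBy": sort_by,
        "outputFormat": list(output_format),
    }
    if year_to is not None:
        run_input["yearTo"] = year_to
    return run_input


def run_actor(token: str, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset items.

    actor_id is a placeholder: use the ID shown on the Store listing.
    """
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `build_run_input("impact of social media on adolescent mental health", year_from=2018, year_to=2026, max_results=75, sort_by="citations", output_format=["markdown", "bibtex", "csv"])` reproduces the example input above.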

Output

1. Dataset

One record per paper. The console shows two views:

All fields (default) — every field in a flat, spreadsheet-friendly order. Best for exporting to Excel.
Papers — curated subset (title, authors, year, venue, citations, DOI, OA, sources). Best for quickly scanning results.

Key fields per record:

Group	Fields
Identity	`doi`, `openAlexId`, `semanticScholarId`, `pmid`, `arxivId`
Metadata	`title`, `abstract`, `authors`, `authorsDisplay`, `firstAuthor`, `authorCount`, `year`, `venue`, `publisher`
Metrics	`citationCount`, `influentialCitationCount`, `referenceCount`, `fieldsOfStudy`
Access	`isOpenAccess`, `openAccessUrl`, `landingPageUrl`
Provenance	`sources`, `primarySource`
LLM payload	`llmSummary` (per record); `literature-review.md` (Key-Value Store file, see below)

2. Key-Value Store files

Depending on outputFormat:

File	Use it for
`literature-review.md`	LLM synthesis — paste into ChatGPT / Claude / Gemini.
`references.bib`	BibTeX — import into Overleaf / LaTeX / Zotero BibTeX library.
`references.ris`	RIS — import into Zotero, Mendeley, EndNote, or Citavi.
`papers.csv`	Excel / Google Sheets / Numbers.
`METADATA`	Per-source fetch status, total citations, de-dupe counts, run timestamp.

Example output

Abridged excerpts from a run with query: "artificial intelligence in higher education".

Dataset record (JSON) — one complete paper record as returned in the dataset:

{
    "title": "Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions",
    "authors": [
        {
            "name": "Valentin Kuleto",
            "orcid": "https://orcid.org/0000-0002-7811-5436",
            "affiliation": "University Business Academy in Novi Sad"
        },
        {
            "name": "Milena P. Ilić",
            "orcid": "https://orcid.org/0000-0002-2656-1449",
            "affiliation": "Information Technology School, Belgrade"
        },
        {
            "name": "Mihail Dumangiu",
            "orcid": null,
            "affiliation": null
        },
        {
            "name": "Marko Ranković",
            "orcid": null,
            "affiliation": null
        }
    ],
    "authorsDisplay": "Valentin Kuleto; Milena P. Ilić; Mihail Dumangiu; Marko Ranković",
    "firstAuthor": "Valentin Kuleto",
    "authorCount": 4,
    "year": 2021,
    "publicationDate": "2021-09-17",
    "venue": "Sustainability",
    "venueType": "journal",
    "publisher": "MDPI AG",
    "abstract": "Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.",
    "citationCount": 412,
    "referenceCount": 58,
    "influentialCitationCount": 38,
    "fieldsOfStudy": ["Computer Science", "Education", "Sustainability"],
    "isOpenAccess": true,
    "openAccessUrl": "https://www.mdpi.com/2071-1050/13/18/10424/pdf",
    "landingPageUrl": "https://doi.org/10.3390/su131810424",
    "doi": "10.3390/su131810424",
    "openAlexId": "W3199263016",
    "semanticScholarId": null,
    "pmid": null,
    "arxivId": null,
    "sources": ["openalex", "crossref"],
    "primarySource": "openalex",
    "llmSummary": "## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions\n\n**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)\n**Venue:** Sustainability (journal)\n**Citations:** 412 (influential: 38) | **Open Access:** yes\n**DOI:** 10.3390/su131810424\n**Fields:** Computer Science, Education, Sustainability\n**Sources:** openalex, crossref\n\n**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate…",
    "relevanceScore": 18.4
}
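If you only need the JSON dataset and want to build citations yourself, a record like the one above can be turned into a BibTeX entry with a few lines of Python. This is a rough sketch, not the format the Actor writes to `references.bib`: the key scheme (`Kuleto2021`), the choice of entry type, and the lack of escaping for special LaTeX characters are simplifying assumptions.

```python
from __future__ import annotations

import re
from typing import Any


def bibtex_key(record: dict[str, Any]) -> str:
    """Build a citation key such as 'Kuleto2021' from firstAuthor and year."""
    parts = (record.get("firstAuthor") or "anon").split()
    last_name = re.sub(r"[^A-Za-z0-9]", "", parts[-1]) if parts else ""
    return f"{last_name or 'anon'}{record.get('year') or ''}"


def to_bibtex(record: dict[str, Any]) -> str:
    """Render one dataset record as a BibTeX entry (no LaTeX escaping)."""
    is_journal = record.get("venueType") == "journal"
    entry_type = "article" if is_journal else "misc"
    authors = " and ".join(
        a["name"] for a in (record.get("authors") or []) if a.get("name")
    )
    fields = [
        ("title", record.get("title")),
        ("author", authors),
        ("year", record.get("year")),
        ("journal" if is_journal else "howpublished", record.get("venue")),
        ("publisher", record.get("publisher")),
        ("doi", record.get("doi")),
        ("url", record.get("landingPageUrl")),
    ]
    lines = [f"@{entry_type}{{{bibtex_key(record)},"]
    for name, value in fields:
        if value not in (None, ""):
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)
```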

literature-review.md (LLM-ready Markdown) — first three papers of a run:

# Literature Review: artificial intelligence in higher education
*Generated 2026-04-19T22:00:00.000Z | 75 papers from 2 sources*

## Summary Stats
- Date range: 2020–2026
- Total citations across corpus: 18,423
- Open access: 47/75
- Top venues: Sustainability (6); Computers & Education (4); IEEE Access (3); International Journal of Educational Technology in Higher Education (3)

## Source Status
- **openalex**: ok (600 fetched)
- **semanticscholar**: failed (0 fetched, error: not requested)
- **crossref**: ok (200 fetched)

## Papers

## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions

**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)
**Venue:** Sustainability (journal)
**Citations:** 412 (influential: 38) | **Open Access:** yes
**DOI:** 10.3390/su131810424
**Fields:** Computer Science, Education, Sustainability
**Sources:** openalex, crossref

**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.

---

## Artificial intelligence in higher education: the state of the field

**Authors:** Helen Crompton, Diane Burke (2023)
**Venue:** International Journal of Educational Technology in Higher Education (journal)
**Citations:** 287 (influential: 29) | **Open Access:** yes
**DOI:** 10.1186/s41239-023-00392-8
**Fields:** Education, Computer Science
**Sources:** openalex, crossref

**Abstract:** This systematic review examines the state of artificial intelligence (AI) research in higher education. Drawing from 138 empirical studies published between 2016 and 2022, we map the field across five dimensions: AI application types, pedagogical goals, student populations, methodological approaches, and reported outcomes. Findings indicate a heavy concentration on adaptive learning systems and intelligent tutoring, with under-representation of equity, ethics, and faculty-perspective research. We propose a research agenda for the next phase of AI-in-higher-ed scholarship.

---

## ChatGPT for good? On opportunities and challenges of large language models for education

**Authors:** Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, et al. (2023)
**Venue:** Learning and Individual Differences (journal)
**Citations:** 1,942 (influential: 184) | **Open Access:** yes
**DOI:** 10.1016/j.lindif.2023.102274
**Fields:** Education, Computer Science, Linguistics
**Sources:** openalex, crossref

**Abstract:** Large language models (LLMs) such as ChatGPT are transforming how students access information and produce academic work. This position paper surveys the opportunities LLMs create for educators (personalized feedback, lesson planning, accessibility support) alongside the challenges they raise (academic integrity, factual reliability, equity of access, and assessment redesign). We outline an actionable framework for institutions considering LLM integration, covering policy, pedagogy, and tooling.

---

## [next paper] …

LLM-ready Markdown — paste it into your favorite AI model

When outputFormat includes markdown, the Actor writes a single concatenated literature-review.md file to the Key-Value Store, with one detailed section per paper. A single file is easier for an AI model to digest than details pasted piecemeal into a long chat thread, which free-tier ChatGPT sometimes cannot process.

How to use it:

Run the Actor.
Open the run → Storage → Key-Value Store.
Download literature-review.md.
Paste or attach it into ChatGPT / Claude / Gemini and ask for a synthesis.

A few prompts that work well:

"Write a 1,500-word literature review organized by theme. Cite every claim inline using (AuthorLastName, Year). Include a Research gaps section. Use only the papers provided."
"For each paper, give me a row: research question, method, sample, key finding, limitation. Output as a Markdown table."
"Group these papers into 4–6 clusters by approach. Name each cluster, list its papers by DOI, and write a 3-sentence synthesis per cluster."

Pair this with the BibTeX / RIS export and any DOI the LLM cites is already in your Zotero or Overleaf bibliography.

Use cases

Thesis / dissertation lit review — seed your chapter with 100+ relevant papers in one run.
RAG pipelines over academic content — ingest the Markdown or CSV into a vector store.
Citation-graph sanity checks — verify a paper is findable in multiple databases.
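For the RAG use case, one simple starting point is to split `literature-review.md` into one chunk per paper before embedding. The Python sketch below assumes the layout shown in the example output (a `## Papers` heading, one `## ` title per paper, `---` separators, and a `**DOI:**` line); the function name and the chunk dictionary shape are hypothetical.

```python
from __future__ import annotations

import re
from typing import Any


def split_papers(markdown: str) -> list[dict[str, Any]]:
    """Split the concatenated Markdown into one chunk per paper section."""
    # Skip the summary header if the "## Papers" heading is present.
    _, found, body = markdown.partition("\n## Papers\n")
    if not found:
        body = markdown
    chunks = []
    for section in re.split(r"^---[ \t]*$", body, flags=re.MULTILINE):
        section = section.strip()
        if not section.startswith("## "):
            continue  # not a paper section (header text or empty tail)
        title = section.splitlines()[0][3:].strip()
        doi = re.search(r"^\*\*DOI:\*\*\s*(\S+)", section, flags=re.MULTILINE)
        chunks.append({
            "title": title,
            "doi": doi.group(1) if doi else None,
            "text": section,
        })
    return chunks
```

Each chunk keeps its DOI, so answers retrieved from a vector store can be traced back to the matching entry in the BibTeX or RIS export.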

Limitations (V1)

Metadata only. No full-text PDF download. Use openAccessUrl if you want to fetch PDFs yourself.
Max 1000 papers per run (de-duplicated). Run multiple queries for broader coverage.
Semantic Scholar's anonymous pool can return HTTP 429 (rate limited). If you need reliable Semantic Scholar results at scale, provide a free API key.
Crossref abstract coverage is sparse (~20%). Primary abstract source is Semantic Scholar, then OpenAlex.
No Google Scholar / Scopus / Web of Science. Google Scholar has no public API, and Scopus and Web of Science require commercial licenses.
English-biased. The sources cover multiple languages but keyword matching works best in English.

Legal & licensing

Output is metadata only — no full-text reproduction. Respects copyright and each source's terms.
You are responsible for citing sources appropriately in your own work.
OpenAlex data is CC0. Crossref metadata is CC0. Semantic Scholar data is ODC-BY.
This Actor does not scrape Google Scholar — explicitly avoided due to their ToS and anti-bot measures.
No personal data is collected. contactEmail, if provided, is sent only to OpenAlex / Crossref as a polite-pool identifier per their API conventions; it is not stored by this Actor.

Roadmap

V2 (after V1 traction):

MCP Standby mode (expose as a tool for Claude / Cursor agents)
PubMed / arXiv / CORE as additional sources
Token-aware RAG-ready chunking

Contact / feedback

Bug reports, feature requests, and use-case tips are welcome: send a message to leafydevjr@gmail.com or leave a review on the Apify Store listing.
