Thesis Literature Review Scraper

Pricing

from $1.00 / 1,000 results

Turn any research topic into a clean reading list of peer-reviewed academic papers from OpenAlex, Semantic Scholar, and Crossref in one run. Includes citation-manager and spreadsheet exports, plus LLM-ready Markdown you can paste into ChatGPT or Claude for instant literature synthesis.

Rating

5.0

(20)

Developer

Leafy

Maintained by Community

Actor stats

7

Bookmarked

24

Total users

4

Monthly active users

13 days ago

Last modified

Thesis Literature Review Scraper — Multi-Source Academic Papers with Citations & LLM-Ready Output

Paste a research topic → get a de-duplicated, structured, LLM-ready list of peer-reviewed papers. Perfect for thesis literature reviews, RAG pipelines, and AI research assistants.


What it does

Given a research topic or keywords, this Actor queries up to three major free scholarly databases in parallel (OpenAlex and Crossref by default, Semantic Scholar on request), de-duplicates the results by DOI, merges enriched fields across sources, and returns a single clean dataset plus LLM-ready exports. The input panel comes pre-filled with a sample topic so you can hit Run straight away to see how it works.
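To make the de-duplicate-and-merge step concrete, here is a minimal sketch of how records from several sources could be collapsed by DOI. It is not the Actor's actual code: the function names, the "first non-empty value wins" rule, and the handling of records without a DOI are assumptions for illustration. The abstract priority follows the order given under Limitations.

```python
import re

# Assumed priority when several sources supply an abstract
# (Semantic Scholar, then OpenAlex, then Crossref).
ABSTRACT_PRIORITY = ["semanticscholar", "openalex", "crossref"]


def normalize_doi(doi):
    """Lower-case a DOI and strip URL or 'doi:' prefixes so variants compare equal."""
    if not doi:
        return None
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi.strip().lower())
    return doi or None


def merge_records(records):
    """Collapse raw per-source records into one merged record per DOI."""
    merged = {}
    for rec in records:
        key = normalize_doi(rec.get("doi"))
        if key is None:
            # Hypothetical fallback: records without a DOI are keyed by source and title.
            key = (rec["source"], (rec.get("title") or "").lower())
        target = merged.setdefault(key, {"sources": [], "_abstracts": {}})
        if rec["source"] not in target["sources"]:
            target["sources"].append(rec["source"])
        if rec.get("abstract"):
            target["_abstracts"][rec["source"]] = rec["abstract"]
        for field, value in rec.items():
            if field in ("source", "abstract"):
                continue
            # First non-empty value wins; later sources only fill gaps.
            if target.get(field) in (None, "", []) and value not in (None, "", []):
                target[field] = value
    for target in merged.values():
        abstracts = target.pop("_abstracts")
        target["abstract"] = next(
            (abstracts[s] for s in ABSTRACT_PRIORITY if s in abstracts), None
        )
        if target.get("doi"):
            target["doi"] = normalize_doi(target["doi"])
    return list(merged.values())
```

Feeding it one Crossref record and one OpenAlex record for the same DOI yields a single record whose `sources` list names both databases.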

Why this exists

Writing a thesis lit review manually takes days of copy-paste across different sites like Google Scholar. Building a RAG chatbot over academic papers requires the same legwork. This Actor does it in one run.


Data sources

| Source | Size | Auth | License |
| --- | --- | --- | --- |
| OpenAlex | 250M+ works | none | CC0 |
| Semantic Scholar | ~200M papers | optional API key | ODC-BY |
| Crossref | 160M+ records | none | CC0 metadata |

All three are public, free JSON APIs that work without authentication; Semantic Scholar additionally accepts an optional API key. We do not scrape Google Scholar.
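For readers who want to see what a query against these APIs might look like, the sketch below builds request URLs with the standard library only. The endpoints and parameter names (`search`, `filter`, `per-page`, `mailto` for OpenAlex; `query`, `filter`, `rows`, `mailto` for Crossref; `query`, `year`, `limit`, `fields` and the `x-api-key` header for Semantic Scholar) follow the public API documentation as best understood here and should be checked against it. Whether the Actor builds its requests this way is an assumption, and the function names are hypothetical.

```python
from urllib.parse import urlencode


def openalex_url(query, year_from, year_to, per_page=200, contact_email=None):
    """Build an OpenAlex works search URL; 'mailto' opts into the polite pool."""
    params = {
        "search": query,
        "filter": f"from_publication_date:{year_from}-01-01,"
                  f"to_publication_date:{year_to}-12-31",
        "per-page": per_page,
    }
    if contact_email:
        params["mailto"] = contact_email
    return "https://api.openalex.org/works?" + urlencode(params)


def crossref_url(query, year_from, year_to, rows=100, contact_email=None):
    """Build a Crossref works search URL; 'mailto' opts into the polite pool."""
    params = {
        "query": query,
        "filter": f"from-pub-date:{year_from},until-pub-date:{year_to}",
        "rows": rows,
    }
    if contact_email:
        params["mailto"] = contact_email
    return "https://api.crossref.org/works?" + urlencode(params)


def semantic_scholar_request(query, year_from, year_to, limit=100, api_key=None):
    """Return (url, headers) for a Semantic Scholar paper search."""
    params = {
        "query": query,
        "year": f"{year_from}-{year_to}",
        "limit": limit,
        "fields": "title,abstract,year,externalIds,citationCount",
    }
    headers = {"x-api-key": api_key} if api_key else {}
    url = "https://api.semanticscholar.org/graph/v1/paper/search?" + urlencode(params)
    return url, headers
```

The three functions only build strings, so they can be inspected or logged before any network call is made.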


How the results stay relevant

The Actor usually fetches a few thousand raw papers from the sources before de-duplicating. To make sure your final results are the most on-topic ones, every paper is then scored against your research topic using a classic text-relevance algorithm (BM25) over its title and abstract. Common words like "the" or "study" are ignored, rarer words from your query count for more, and query terms that appear in the title carry more weight than the same terms in the abstract.
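The scoring described above can be sketched with a small, self-contained BM25 implementation. This is an illustration under stated assumptions, not the Actor's code: the stop-word list, the title weight of 3 (applied by repeating title tokens), and the parameters `K1` and `B` are all hypothetical choices.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "for", "to", "study"}  # illustrative
TITLE_WEIGHT = 3   # assumed: each title token counts three times
K1, B = 1.5, 0.75  # common BM25 defaults, assumed here


def tokenize(text):
    """Lower-case, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", (text or "").lower())
            if t not in STOPWORDS]


def paper_tokens(paper):
    """Repeat title tokens so title matches weigh more than abstract matches."""
    return tokenize(paper.get("title")) * TITLE_WEIGHT + tokenize(paper.get("abstract"))


def bm25_scores(query, papers):
    """Return one BM25 score per paper, in the same order as `papers`."""
    docs = [paper_tokens(p) for p in papers]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rarer terms get a larger inverse document frequency.
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + K1 * (1 - B + B * len(d) / avg_len)
            score += idf * tf[term] * (K1 + 1) / norm
        scores.append(score)
    return scores
```

A paper whose title contains the query terms scores higher than one that only mentions them in passing in the abstract, and a paper with no query terms scores zero.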

When sortBy is relevance (the default), the highest-scoring papers bubble to the top before the maxResults cap is applied. Pick citations or date instead and the ranking switches to those signals.
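The order of operations matters here: papers are ranked first and the `maxResults` cap is applied afterwards. A minimal sketch of that step follows, assuming each record carries the `relevanceScore`, `citationCount`, and `publicationDate` fields shown in the example output; the function name and the handling of missing values are hypothetical.

```python
def rank_and_cap(papers, sort_by="relevance", max_results=100):
    """Sort papers by the chosen signal (highest first), then apply the cap."""
    sort_keys = {
        "relevance": lambda p: p.get("relevanceScore") or 0.0,
        "citations": lambda p: p.get("citationCount") or 0,
        # ISO dates sort correctly as strings; fall back to the year if no date.
        "date": lambda p: p.get("publicationDate") or str(p.get("year") or ""),
    }
    if sort_by not in sort_keys:
        raise ValueError(f"sortBy must be one of {sorted(sort_keys)}")
    ranked = sorted(papers, key=sort_keys[sort_by], reverse=True)
    return ranked[:max_results]
```

Capping before sorting would keep whichever papers happened to be fetched first, which is why the sort comes first.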


Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| query | string (required) | pre-filled sample | Research topic or keywords (3–500 chars). Pre-filled with "artificial intelligence in higher education" so you can test-run without editing. |
| yearFrom | integer | 2015 | Earliest publication year. |
| yearTo | integer | current year | Latest publication year. |
| maxResults | integer | 100 | Total de-duplicated papers to return (10–1000). |
| sources | array | ["openalex", "crossref"] | Which databases to query. Semantic Scholar is off by default because of its rate limits. Enable it if you need it, ideally together with a free API key (see semanticScholarApiKey below). |
| minCitations | integer | 0 | Filter out papers with fewer citations. |
| openAccessOnly | boolean | false | Return only open-access papers. |
| sortBy | enum | relevance | relevance / citations / date. |
| outputFormat | array | ["bibtex", "markdown"] | Extra export formats on top of JSON. |
| contactEmail | string | | Optional. Enables the OpenAlex / Crossref "polite pool" for faster, prioritized access. |
| semanticScholarApiKey | string (secret) | | Optional. Bypasses Semantic Scholar's shared rate limit. |

Example input

{
  "query": "impact of social media on adolescent mental health",
  "yearFrom": 2018,
  "yearTo": 2026,
  "maxResults": 75,
  "sortBy": "citations",
  "outputFormat": ["markdown", "bibtex", "csv"]
}
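If you generate inputs programmatically, it can help to apply the documented defaults and limits before starting a run. The sketch below is a hypothetical helper, not part of the Actor: it fills in the defaults from the input table and checks the stated ranges (3–500 characters for `query`, 10–1000 for `maxResults`). The check that `yearFrom` does not exceed `yearTo` is an added assumption.

```python
from datetime import date

# Defaults copied from the input table; yearTo defaults to the current year.
DEFAULTS = {
    "yearFrom": 2015,
    "maxResults": 100,
    "sources": ["openalex", "crossref"],
    "minCitations": 0,
    "openAccessOnly": False,
    "sortBy": "relevance",
    "outputFormat": ["bibtex", "markdown"],
}


def build_run_input(**overrides):
    """Merge overrides onto the defaults and validate the documented limits."""
    run_input = {**DEFAULTS, "yearTo": date.today().year, **overrides}
    query = run_input.get("query") or ""
    if not 3 <= len(query) <= 500:
        raise ValueError("query must be 3-500 characters long")
    if not 10 <= run_input["maxResults"] <= 1000:
        raise ValueError("maxResults must be between 10 and 1000")
    if run_input["sortBy"] not in {"relevance", "citations", "date"}:
        raise ValueError("sortBy must be relevance, citations, or date")
    if run_input["yearFrom"] > run_input["yearTo"]:  # assumed sanity check
        raise ValueError("yearFrom must not be later than yearTo")
    return run_input
```

Calling `build_run_input(query="impact of social media on adolescent mental health", yearFrom=2018, yearTo=2026, maxResults=75, sortBy="citations")` returns a dictionary with the same shape as the example input, with the remaining fields set to their defaults.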

Output

1. Dataset

One record per paper. The console shows two views:

  • All fields (default) — every field in a flat, spreadsheet-friendly order. Best for exporting to Excel.
  • Papers — curated subset (title, authors, year, venue, citations, DOI, OA, sources). Best for quickly scanning results.

Key fields per record:

| Group | Fields |
| --- | --- |
| Identity | doi, openAlexId, semanticScholarId, pmid, arxivId |
| Metadata | title, abstract, authors, authorsDisplay, firstAuthor, authorCount, year, venue, publisher |
| Metrics | citationCount, influentialCitationCount, referenceCount, fieldsOfStudy |
| Access | isOpenAccess, openAccessUrl, landingPageUrl |
| Provenance | sources, primarySource |
| LLM payload | llmSummary (per-paper Markdown; the literature-review.md file is described under Key-Value Store files) |

2. Key-Value Store files

Depending on outputFormat:

| File | Use it for |
| --- | --- |
| literature-review.md | LLM synthesis: paste into ChatGPT / Claude / Gemini. |
| references.bib | BibTeX: import into Overleaf / LaTeX / Zotero BibTeX library. |
| references.ris | RIS: import into Zotero, Mendeley, EndNote, or Citavi. |
| papers.csv | Excel / Google Sheets / Numbers. |
| METADATA | Per-source fetch status, total citations, de-dupe counts, run timestamp. |
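If you prefer to build citation-manager files yourself from the dataset records, the conversion is straightforward. The sketch below turns one record into a BibTeX entry and an RIS entry. It is a simplified illustration, not the Actor's exporter: the citation-key scheme, the choice of entry types, and the field mapping are assumptions, and special characters are not escaped for LaTeX.

```python
def bibtex_key(paper):
    """Build a citation key such as 'Kuleto2021' (hypothetical naming scheme)."""
    parts = (paper.get("firstAuthor") or "anon").split()
    last = parts[-1] if parts else "anon"
    ascii_last = "".join(ch for ch in last if ch.isascii() and ch.isalnum())
    return f"{ascii_last or 'anon'}{paper.get('year') or ''}"


def to_bibtex(paper):
    """Render one dataset record as a BibTeX entry, skipping empty fields."""
    is_journal = paper.get("venueType") == "journal"
    fields = {
        "title": paper.get("title"),
        "author": " and ".join(a["name"] for a in paper.get("authors") or []),
        "year": paper.get("year"),
        "journal": paper.get("venue") if is_journal else None,
        "publisher": paper.get("publisher"),
        "doi": paper.get("doi"),
        "url": paper.get("landingPageUrl"),
    }
    lines = ["@%s{%s," % ("article" if is_journal else "misc", bibtex_key(paper))]
    lines += [f"  {name} = {{{value}}}," for name, value in fields.items() if value]
    lines.append("}")
    return "\n".join(lines)


def to_ris(paper):
    """Render one dataset record as an RIS entry (tag, two spaces, hyphen, value)."""
    lines = ["TY  - JOUR" if paper.get("venueType") == "journal" else "TY  - GEN"]
    lines += [f"AU  - {a['name']}" for a in paper.get("authors") or []]
    tagged = [("TI", "title"), ("PY", "year"), ("JO", "venue"),
              ("DO", "doi"), ("UR", "landingPageUrl"), ("AB", "abstract")]
    lines += [f"{tag}  - {paper[field]}" for tag, field in tagged if paper.get(field)]
    lines.append("ER  - ")
    return "\n".join(lines)
```

Joining the per-record strings with blank lines gives a file that reference managers can typically import in one step.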

Example output

Abridged excerpts from a run with query: "artificial intelligence in higher education".

Dataset record (JSON) — one complete paper record as returned in the dataset:

{
  "title": "Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions",
  "authors": [
    {
      "name": "Valentin Kuleto",
      "orcid": "https://orcid.org/0000-0002-7811-5436",
      "affiliation": "University Business Academy in Novi Sad"
    },
    {
      "name": "Milena P. Ilić",
      "orcid": "https://orcid.org/0000-0002-2656-1449",
      "affiliation": "Information Technology School, Belgrade"
    },
    {
      "name": "Mihail Dumangiu",
      "orcid": null,
      "affiliation": null
    },
    {
      "name": "Marko Ranković",
      "orcid": null,
      "affiliation": null
    }
  ],
  "authorsDisplay": "Valentin Kuleto; Milena P. Ilić; Mihail Dumangiu; Marko Ranković",
  "firstAuthor": "Valentin Kuleto",
  "authorCount": 4,
  "year": 2021,
  "publicationDate": "2021-09-17",
  "venue": "Sustainability",
  "venueType": "journal",
  "publisher": "MDPI AG",
  "abstract": "Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.",
  "citationCount": 412,
  "referenceCount": 58,
  "influentialCitationCount": 38,
  "fieldsOfStudy": ["Computer Science", "Education", "Sustainability"],
  "isOpenAccess": true,
  "openAccessUrl": "https://www.mdpi.com/2071-1050/13/18/10424/pdf",
  "landingPageUrl": "https://doi.org/10.3390/su131810424",
  "doi": "10.3390/su131810424",
  "openAlexId": "W3199263016",
  "semanticScholarId": null,
  "pmid": null,
  "arxivId": null,
  "sources": ["openalex", "crossref"],
  "primarySource": "openalex",
  "llmSummary": "## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions\n\n**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)\n**Venue:** Sustainability (journal)\n**Citations:** 412 (influential: 38) | **Open Access:** yes\n**DOI:** 10.3390/su131810424\n**Fields:** Computer Science, Education, Sustainability\n**Sources:** openalex, crossref\n\n**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate…",
  "relevanceScore": 18.4
}
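The `llmSummary` field follows a regular layout, so a similar block can be rebuilt from the other fields of a record. The sketch below reproduces the layout seen in the example. The rule of listing at most three authors before "et al." is inferred from the sample output, and the placeholders used for missing values ("unknown", "no", "n/a") are assumptions.

```python
def format_authors(paper, limit=3):
    """List up to `limit` author names, adding 'et al.' when more exist."""
    names = [a["name"] for a in paper.get("authors") or []]
    shown = ", ".join(names[:limit])
    if len(names) > limit:
        shown += ", et al."
    return shown


def render_summary(paper):
    """Render one record as a Markdown block in the llmSummary layout."""
    venue = paper.get("venue") or "unknown"
    if paper.get("venueType"):
        venue += f" ({paper['venueType']})"
    citations = paper.get("citationCount") or 0
    influential = paper.get("influentialCitationCount") or 0
    open_access = "yes" if paper.get("isOpenAccess") else "no"
    header = [
        f"**Authors:** {format_authors(paper)} ({paper.get('year')})",
        f"**Venue:** {venue}",
        f"**Citations:** {citations:,} (influential: {influential:,})"
        f" | **Open Access:** {open_access}",
        f"**DOI:** {paper.get('doi')}",
        f"**Fields:** {', '.join(paper.get('fieldsOfStudy') or [])}",
        f"**Sources:** {', '.join(paper.get('sources') or [])}",
    ]
    abstract = paper.get("abstract") or "n/a"
    return (f"## {paper['title']}\n\n" + "\n".join(header)
            + f"\n\n**Abstract:** {abstract}")
```

Concatenating the rendered blocks with `---` separators gives a document shaped like the literature-review.md excerpt shown next.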

literature-review.md (LLM-ready Markdown) — first three papers of a run:

# Literature Review: artificial intelligence in higher education
*Generated 2026-04-19T22:00:00.000Z | 75 papers from 2 sources*
## Summary Stats
- Date range: 2020–2026
- Total citations across corpus: 18,423
- Open access: 47/75
- Top venues: Sustainability (6); Computers & Education (4); IEEE Access (3); International Journal of Educational Technology in Higher Education (3)
## Source Status
- **openalex**: ok (600 fetched)
- **semanticscholar**: failed (0 fetched, error: not requested)
- **crossref**: ok (200 fetched)
## Papers
## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions
**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)
**Venue:** Sustainability (journal)
**Citations:** 412 (influential: 38) | **Open Access:** yes
**DOI:** 10.3390/su131810424
**Fields:** Computer Science, Education, Sustainability
**Sources:** openalex, crossref
**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.
---
## Artificial intelligence in higher education: the state of the field
**Authors:** Helen Crompton, Diane Burke (2023)
**Venue:** International Journal of Educational Technology in Higher Education (journal)
**Citations:** 287 (influential: 29) | **Open Access:** yes
**DOI:** 10.1186/s41239-023-00392-8
**Fields:** Education, Computer Science
**Sources:** openalex, crossref
**Abstract:** This systematic review examines the state of artificial intelligence (AI) research in higher education. Drawing from 138 empirical studies published between 2016 and 2022, we map the field across five dimensions: AI application types, pedagogical goals, student populations, methodological approaches, and reported outcomes. Findings indicate a heavy concentration on adaptive learning systems and intelligent tutoring, with under-representation of equity, ethics, and faculty-perspective research. We propose a research agenda for the next phase of AI-in-higher-ed scholarship.
---
## ChatGPT for good? On opportunities and challenges of large language models for education
**Authors:** Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, et al. (2023)
**Venue:** Learning and Individual Differences (journal)
**Citations:** 1,942 (influential: 184) | **Open Access:** yes
**DOI:** 10.1016/j.lindif.2023.102274
**Fields:** Education, Computer Science, Linguistics
**Sources:** openalex, crossref
**Abstract:** Large language models (LLMs) such as ChatGPT are transforming how students access information and produce academic work. This position paper surveys the opportunities LLMs create for educators (personalized feedback, lesson planning, accessibility support) alongside the challenges they raise (academic integrity, factual reliability, equity of access, and assessment redesign). We outline an actionable framework for institutions considering LLM integration, covering policy, pedagogy, and tooling.
---
## [next paper] …

LLM-ready Markdown — paste it into your favorite AI model

When outputFormat includes markdown, the Actor writes a single concatenated literature-review.md file to the Key-Value Store. Each paper becomes its own section with full details. One structured file is easier for an AI model to digest than details copy-pasted piece by piece into a long thread, which a free ChatGPT plan sometimes cannot process.

How to use it:

  1. Run the Actor.
  2. Open the run → Storage → Key-Value Store.
  3. Download literature-review.md.
  4. Paste or attach it into ChatGPT / Claude / Gemini and ask for a synthesis.

A few prompts that work well:

  • "Write a 1,500-word literature review organized by theme. Cite every claim inline using (AuthorLastName, Year). Include a Research gaps section. Use only the papers provided."
  • "For each paper, give me a row: research question, method, sample, key finding, limitation. Output as a Markdown table."
  • "Group these papers into 4–6 clusters by approach. Name each cluster, list its papers by DOI, and write a 3-sentence synthesis per cluster."

Pair this with the BibTeX / RIS export and any DOI the LLM cites is already in your Zotero or Overleaf bibliography.


Use cases

  1. Thesis / dissertation lit review — seed your chapter with 100+ relevant papers in one run.
  2. RAG pipelines over academic content — ingest the Markdown or CSV into a vector store.
  3. Citation-graph sanity checks — verify a paper is findable in multiple databases.
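For the RAG use case, a natural first step is to split literature-review.md into one chunk per paper before embedding. The sketch below assumes the layout shown in the example output (a "## Papers" heading followed by paper sections separated by `---` lines); the function name and the shape of the returned dictionaries are hypothetical, and embedding or vector-store code is deliberately left out.

```python
import re


def split_papers(markdown):
    """Split literature-review.md into one chunk per paper section."""
    # Everything before the "## Papers" heading is summary material, not papers.
    _, _, body = markdown.partition("## Papers\n")
    chunks = []
    for section in re.split(r"^---\s*$", body, flags=re.MULTILINE):
        section = section.strip()
        if not section.startswith("## "):
            continue
        title = section.splitlines()[0][3:].strip()
        doi = re.search(r"^\*\*DOI:\*\* (\S+)", section, flags=re.MULTILINE)
        chunks.append({
            "title": title,
            "doi": doi.group(1) if doi else None,
            "text": section,
        })
    return chunks
```

Keeping the DOI alongside each chunk makes it possible to trace a retrieved passage back to its entry in the BibTeX or RIS export.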

Limitations (V1)

  • Metadata only. No full-text PDF download. Use openAccessUrl if you want to fetch PDFs yourself.
  • Max 1000 papers per run (de-duplicated). Run multiple queries for broader coverage.
  • Semantic Scholar's anonymous pool can return HTTP 429 (rate-limited) responses. If you need reliable Semantic Scholar results at scale, provide a free API key.
  • Crossref abstract coverage is sparse (~20%). Primary abstract source is Semantic Scholar, then OpenAlex.
  • No Google Scholar / Scopus / Web of Science. These require commercial licenses.
  • English-biased. The sources cover multiple languages but keyword matching works best in English.

  • Output is metadata only — no full-text reproduction. Respects copyright and each source's terms.
  • You are responsible for citing sources appropriately in your own work.
  • OpenAlex data is CC0. Crossref metadata is CC0. Semantic Scholar data is ODC-BY.
  • This Actor does not scrape Google Scholar — explicitly avoided due to their ToS and anti-bot measures.
  • No personal data is collected. contactEmail, if provided, is sent only to OpenAlex / Crossref as a polite-pool identifier per their API conventions; it is not stored by this Actor.

Roadmap

V2 (after V1 traction):

  • MCP Standby mode (expose as a tool for Claude / Cursor agents)
  • PubMed / arXiv / CORE as additional sources
  • Token-aware RAG-ready chunking

Contact / feedback

Bug reports, feature requests, and use-case tips welcome — send a message at leafydevjr@gmail.com or leave a review on the Apify Store listing.