Thesis Literature Review Scraper

Pricing

from $1.00 / 1,000 results

Turn any research question into a clean reading list of peer-reviewed academic papers from OpenAlex, Semantic Scholar, and Crossref in one run. Includes citation-manager and spreadsheet exports, plus LLM-ready Markdown you can paste into ChatGPT or Claude for instant literature synthesis.

Developer

Leafy Dev Jr

Maintained by Community

Thesis Literature Review Scraper — Multi-Source Academic Papers with Citations & LLM-Ready Output

Paste a research question → get a de-duplicated, structured, LLM-ready list of peer-reviewed papers from OpenAlex, Semantic Scholar, and Crossref. Perfect for thesis literature reviews, RAG pipelines, and AI research assistants.


What it does

Given a research question or topic keywords, this Actor queries up to three major free scholarly databases in parallel (OpenAlex and Crossref by default, Semantic Scholar on request), de-duplicates the results by DOI (with fuzzy-title fallback), merges enriched fields across sources, and returns a single clean dataset plus LLM-ready exports.
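
The de-duplication and merge step described above could look roughly like the sketch below. It is an illustration, not the Actor's actual code: the 0.9 title-similarity threshold, the helper names, and the "first non-empty value wins" merge policy are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize_doi(doi):
    """Lower-case a DOI and strip a leading resolver URL; None if missing."""
    if not doi:
        return None
    return re.sub(r"^https?://(dx\.)?doi\.org/", "", doi.strip().lower())


def normalize_title(title):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", (title or "").lower()).split())


def same_paper(a, b, threshold=0.9):
    """Compare DOIs when both records have one, else fall back to fuzzy titles."""
    doi_a, doi_b = normalize_doi(a.get("doi")), normalize_doi(b.get("doi"))
    if doi_a and doi_b:
        return doi_a == doi_b
    title_a, title_b = normalize_title(a.get("title")), normalize_title(b.get("title"))
    if not title_a or not title_b:
        return False
    # threshold is an assumed value, not taken from the Actor
    return SequenceMatcher(None, title_a, title_b).ratio() >= threshold


def merge(a, b):
    """Keep a's values, fill its empty fields from b, and union the sources."""
    merged = dict(a)
    for key, value in b.items():
        if merged.get(key) in (None, "", []):
            merged[key] = value
    merged["sources"] = sorted(set(a.get("sources", [])) | set(b.get("sources", [])))
    return merged


def deduplicate(records):
    """Return the records with duplicates merged (quadratic in the input size)."""
    unique = []
    for record in records:
        for i, kept in enumerate(unique):
            if same_paper(kept, record):
                unique[i] = merge(kept, record)
                break
        else:
            unique.append(record)
    return unique
```

The pairwise comparison is quadratic, which is acceptable at the documented ceiling of 1000 papers; a larger corpus would want an index keyed on the normalized DOI.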

Why this exists

Writing a thesis lit review manually takes days of copy-paste across Google Scholar, Scopus, Web of Science, and PubMed. Building a RAG chatbot over academic papers requires the same legwork. This Actor does it in one run.


Data sources

| Source | Size | Auth | License |
|---|---|---|---|
| OpenAlex | 250M+ works | none | CC0 |
| Semantic Scholar | ~200M papers | optional API key | ODC-BY |
| Crossref | 160M+ records | none | CC0 metadata |

All three are public, free JSON APIs that work without authentication; Semantic Scholar additionally accepts an optional API key. We do not scrape Google Scholar: it is explicitly out of scope, to respect its Terms of Service.
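
For readers who want to see what a request to two of these APIs looks like, here is a minimal sketch that only builds the URLs. The function names and page sizes are illustrative assumptions, and the Actor's own requests may differ. The query parameters (`search`, `filter`, `per-page`, and `mailto` for OpenAlex; `query`, `filter`, `rows`, and `mailto` for Crossref) are the ones those public APIs document.

```python
from urllib.parse import urlencode


def openalex_url(query, year_from, year_to, per_page=25, contact_email=None):
    """Build an OpenAlex /works search URL restricted to a publication-date window."""
    params = {
        "search": query,
        "filter": (
            f"from_publication_date:{year_from}-01-01,"
            f"to_publication_date:{year_to}-12-31"
        ),
        "per-page": per_page,
    }
    if contact_email:
        # Identifying yourself opts the request into the "polite pool".
        params["mailto"] = contact_email
    return "https://api.openalex.org/works?" + urlencode(params)


def crossref_url(query, year_from, year_to, rows=20, contact_email=None):
    """Build a Crossref /works query URL restricted to a publication-year window."""
    params = {
        "query": query,
        "filter": f"from-pub-date:{year_from},until-pub-date:{year_to}",
        "rows": rows,
    }
    if contact_email:
        params["mailto"] = contact_email
    return "https://api.crossref.org/works?" + urlencode(params)
```

Either URL can be opened in a browser or fetched with any HTTP client to inspect the raw JSON.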


Input

| Field | Type | Default | Description |
|---|---|---|---|
| query | string | (required) | Research question or keywords (3–500 chars). |
| yearFrom | integer | 2015 | Earliest publication year. |
| yearTo | integer | current year | Latest publication year. |
| maxResults | integer | 100 | Total de-duplicated papers to return (10–1000). |
| sources | array | ["openalex", "crossref"] | Which databases to query. Semantic Scholar is off by default; enable it if you need its abstracts / influentialCitationCount, ideally with a free API key (see semanticScholarApiKey below). |
| minCitations | integer | 0 | Filter out papers with fewer citations. |
| openAccessOnly | boolean | false | Return only open-access papers. |
| sortBy | enum | relevance | relevance / citations / date. |
| outputFormat | array | ["bibtex", "markdown"] | Extra export formats on top of JSON. |
| contactEmail | string | blank | Optional. Enables the OpenAlex / Crossref "polite pool" for faster, prioritized access. |
| semanticScholarApiKey | string (secret) | blank | Optional. Bypasses Semantic Scholar's shared rate limit. |
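
The defaults and ranges in the table can be checked before a run is started. The sketch below is a hypothetical client-side validator, not part of the Actor; in particular, the source identifier `semanticscholar` is assumed from the Source Status section of the sample output.

```python
from datetime import date

# Assumed identifiers; only "openalex" and "crossref" appear in the input table.
ALLOWED_SOURCES = {"openalex", "semanticscholar", "crossref"}
ALLOWED_SORT = {"relevance", "citations", "date"}


def validate_input(raw):
    """Apply the documented defaults and range checks; raise ValueError on bad input."""
    cfg = {
        "yearFrom": 2015,
        "yearTo": date.today().year,
        "maxResults": 100,
        "sources": ["openalex", "crossref"],
        "minCitations": 0,
        "openAccessOnly": False,
        "sortBy": "relevance",
        "outputFormat": ["bibtex", "markdown"],
    }
    cfg.update(raw)
    cfg["query"] = str(cfg.get("query", "")).strip()
    if not 3 <= len(cfg["query"]) <= 500:
        raise ValueError("query must be 3-500 characters")
    if not 10 <= cfg["maxResults"] <= 1000:
        raise ValueError("maxResults must be between 10 and 1000")
    if cfg["yearFrom"] > cfg["yearTo"]:
        raise ValueError("yearFrom must not be later than yearTo")
    if cfg["sortBy"] not in ALLOWED_SORT:
        raise ValueError("sortBy must be relevance, citations, or date")
    if not cfg["sources"] or set(cfg["sources"]) - ALLOWED_SOURCES:
        raise ValueError("sources must be a non-empty list of known databases")
    return cfg
```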

Example input

{
  "query": "impact of social media on adolescent mental health",
  "yearFrom": 2018,
  "yearTo": 2026,
  "maxResults": 75,
  "sortBy": "citations",
  "outputFormat": ["markdown", "bibtex", "csv"]
}
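
An input like the one above can also be submitted programmatically. The sketch below assumes the official `apify-client` Python package and its `actor(...).call(run_input=...)` and `dataset(...).iterate_items()` methods. The helper name `fetch_papers` is hypothetical, and the Actor ID is a placeholder because this page does not state it.

```python
def fetch_papers(client, actor_id, run_input):
    """Start the Actor, wait for the run to finish, and return its dataset items."""
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires `pip install apify-client` and an Apify API token):
#
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   papers = fetch_papers(
#       client,
#       "<username>/<actor-name>",  # placeholder: copy the real ID from the Store page
#       {"query": "impact of social media on adolescent mental health",
#        "yearFrom": 2018, "maxResults": 75, "sortBy": "citations"},
#   )
#   print(len(papers), "papers")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub in tests.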

Output

1. Dataset

One record per paper. The console shows two views:

  • Papers — curated view (title, authors, year, venue, citations, DOI, OA, sources). Best for browsing.
  • All fields — every field in a flat, spreadsheet-friendly order. Best for exporting to Excel.

Key fields per record:

| Group | Fields |
|---|---|
| Identity | doi, openAlexId, semanticScholarId, pmid, arxivId |
| Metadata | title, abstract, authors (array), authorsDisplay (joined string), firstAuthor, authorCount, year, venue, publisher |
| Metrics | citationCount, influentialCitationCount, referenceCount, fieldsOfStudy |
| Access | isOpenAccess, openAccessUrl, landingPageUrl |
| Provenance | sources (which APIs returned it), primarySource |
| LLM payload | llmSummary: the formatted Markdown block used to build literature-review.md |

Both the structured authors array and the flat authorsDisplay / firstAuthor / authorCount columns are included — so JSON / RAG consumers get the full shape, and spreadsheet users get a single clean authors column.
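
As an illustration of how the flat columns relate to the structured array, the following sketch derives them from `authors`. The function name is hypothetical; the "; " separator matches the example record and the CSV description on this page.

```python
def flatten_authors(authors, separator="; "):
    """Derive the spreadsheet-friendly author columns from the structured array."""
    names = [a["name"] for a in authors if a.get("name")]
    return {
        "authorsDisplay": separator.join(names),
        "firstAuthor": names[0] if names else None,
        "authorCount": len(names),
    }
```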

2. Key-Value Store files

Depending on outputFormat:

| File | Use it for |
|---|---|
| literature-review.md | LLM synthesis: paste into ChatGPT / Claude / Gemini (see below). |
| references.bib | BibTeX: import into Overleaf / LaTeX / Zotero BibTeX library. |
| references.ris | RIS: import into Zotero, Mendeley, EndNote, or Citavi (File → Import). |
| papers.csv | Excel / Google Sheets / Numbers. Authors are already joined with "; ", so the column contains no JSON blob. |
| METADATA | Per-source fetch status, total citations, de-dupe counts, run timestamp. |
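
To give a sense of what a references.bib entry contains, here is a minimal sketch that turns one dataset record into a BibTeX entry. The citation-key scheme (last name plus year), the fixed `@article` type, and the chosen fields are assumptions; the Actor's real export may differ, and this sketch does not escape LaTeX special characters such as `&` or `%`.

```python
import re


def bibtex_key(record):
    """Build a citation key such as kuleto2021 (assumed scheme; ASCII letters only)."""
    last_name = (record.get("firstAuthor") or "anon").split()[-1]
    return re.sub(r"[^a-z0-9]", "", last_name.lower()) + str(record.get("year") or "")


def to_bibtex(record):
    """Render one dataset record as an @article entry, skipping empty fields."""
    fields = {
        "title": record.get("title"),
        "author": " and ".join(a["name"] for a in record.get("authors", [])),
        "journal": record.get("venue"),
        "year": record.get("year"),
        "publisher": record.get("publisher"),
        "doi": record.get("doi"),
        "url": record.get("landingPageUrl"),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@article{{{bibtex_key(record)},\n{body}\n}}"
```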

Example output

Abridged excerpts from a run with query: "artificial intelligence in higher education".

Dataset record (JSON) — one complete paper record as returned in the dataset:

{
  "title": "Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions",
  "authors": [
    {
      "name": "Valentin Kuleto",
      "orcid": "https://orcid.org/0000-0002-7811-5436",
      "affiliation": "University Business Academy in Novi Sad"
    },
    {
      "name": "Milena P. Ilić",
      "orcid": "https://orcid.org/0000-0002-2656-1449",
      "affiliation": "Information Technology School, Belgrade"
    },
    {
      "name": "Mihail Dumangiu",
      "orcid": null,
      "affiliation": null
    },
    {
      "name": "Marko Ranković",
      "orcid": null,
      "affiliation": null
    }
  ],
  "authorsDisplay": "Valentin Kuleto; Milena P. Ilić; Mihail Dumangiu; Marko Ranković",
  "firstAuthor": "Valentin Kuleto",
  "authorCount": 4,
  "year": 2021,
  "publicationDate": "2021-09-17",
  "venue": "Sustainability",
  "venueType": "journal",
  "publisher": "MDPI AG",
  "abstract": "Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.",
  "citationCount": 412,
  "referenceCount": 58,
  "influentialCitationCount": 38,
  "fieldsOfStudy": ["Computer Science", "Education", "Sustainability"],
  "isOpenAccess": true,
  "openAccessUrl": "https://www.mdpi.com/2071-1050/13/18/10424/pdf",
  "landingPageUrl": "https://doi.org/10.3390/su131810424",
  "doi": "10.3390/su131810424",
  "openAlexId": "W3199263016",
  "semanticScholarId": null,
  "pmid": null,
  "arxivId": null,
  "sources": ["openalex", "crossref"],
  "primarySource": "openalex",
  "llmSummary": "## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions\n\n**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)\n**Venue:** Sustainability (journal)\n**Citations:** 412 (influential: 38) | **Open Access:** yes\n**DOI:** 10.3390/su131810424\n**Fields:** Computer Science, Education, Sustainability\n**Sources:** openalex, crossref\n\n**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate…",
  "relevanceScore": 18.4
}

literature-review.md (LLM-ready Markdown) — first three papers of a run:

# Literature Review: artificial intelligence in higher education
*Generated 2026-04-19T22:00:00.000Z | 75 papers from 2 sources*
## Summary Stats
- Date range: 2020–2026
- Total citations across corpus: 18,423
- Open access: 47/75
- Top venues: Sustainability (6); Computers & Education (4); IEEE Access (3); International Journal of Educational Technology in Higher Education (3)
## Source Status
- **openalex**: ok (600 fetched)
- **semanticscholar**: failed (0 fetched, error: not requested)
- **crossref**: ok (200 fetched)
## Papers
## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions
**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)
**Venue:** Sustainability (journal)
**Citations:** 412 (influential: 38) | **Open Access:** yes
**DOI:** 10.3390/su131810424
**Fields:** Computer Science, Education, Sustainability
**Sources:** openalex, crossref
**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.
---
## Artificial intelligence in higher education: the state of the field
**Authors:** Helen Crompton, Diane Burke (2023)
**Venue:** International Journal of Educational Technology in Higher Education (journal)
**Citations:** 287 (influential: 29) | **Open Access:** yes
**DOI:** 10.1186/s41239-023-00392-8
**Fields:** Education, Computer Science
**Sources:** openalex, crossref
**Abstract:** This systematic review examines the state of artificial intelligence (AI) research in higher education. Drawing from 138 empirical studies published between 2016 and 2022, we map the field across five dimensions: AI application types, pedagogical goals, student populations, methodological approaches, and reported outcomes. Findings indicate a heavy concentration on adaptive learning systems and intelligent tutoring, with under-representation of equity, ethics, and faculty-perspective research. We propose a research agenda for the next phase of AI-in-higher-ed scholarship.
---
## ChatGPT for good? On opportunities and challenges of large language models for education
**Authors:** Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, et al. (2023)
**Venue:** Learning and Individual Differences (journal)
**Citations:** 1,942 (influential: 184) | **Open Access:** yes
**DOI:** 10.1016/j.lindif.2023.102274
**Fields:** Education, Computer Science, Linguistics
**Sources:** openalex, crossref
**Abstract:** Large language models (LLMs) such as ChatGPT are transforming how students access information and produce academic work. This position paper surveys the opportunities LLMs create for educators (personalized feedback, lesson planning, accessibility support) alongside the challenges they raise (academic integrity, factual reliability, equity of access, and assessment redesign). We outline an actionable framework for institutions considering LLM integration, covering policy, pedagogy, and tooling.
---
## [next paper] …

LLM-ready Markdown — paste it into your favorite AI model

When outputFormat includes markdown, the Actor writes a single concatenated literature-review.md file to the Key-Value Store. Each paper becomes its own section with title, authors, year, venue, citations, DOI, and abstract — clean prose, no HTML, no source-specific boilerplate.
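
The per-paper section layout can be reproduced from a dataset record, which is handy if you want to regenerate the Markdown after filtering. The sketch below follows the layout of the llmSummary example on this page; the cutoff of three named authors before "et al." is inferred from that example, and the function name is hypothetical.

```python
def render_paper(record, max_authors=3):
    """Render one record as a Markdown section in the layout of llmSummary."""
    names = [a["name"] for a in record.get("authors", [])]
    shown = ", ".join(names[:max_authors])
    if len(names) > max_authors:
        shown += ", et al."  # assumed cutoff, inferred from the example output
    citations = record.get("citationCount") or 0
    influential = record.get("influentialCitationCount") or 0
    open_access = "yes" if record.get("isOpenAccess") else "no"
    fields = ", ".join(record.get("fieldsOfStudy") or [])
    sources = ", ".join(record.get("sources") or [])
    lines = [
        f"## {record['title']}",
        "",
        f"**Authors:** {shown} ({record.get('year')})",
        f"**Venue:** {record.get('venue')} ({record.get('venueType')})",
        f"**Citations:** {citations:,} (influential: {influential:,})"
        f" | **Open Access:** {open_access}",
        f"**DOI:** {record.get('doi')}",
        f"**Fields:** {fields}",
        f"**Sources:** {sources}",
        "",
        f"**Abstract:** {record.get('abstract')}",
    ]
    return "\n".join(lines)
```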

How to use it:

  1. Run the Actor.
  2. Open the run → Storage → Key-Value Store.
  3. Download literature-review.md.
  4. Paste or attach it into ChatGPT / Claude / Gemini and ask for a synthesis.

A few prompts that work well:

  • "Write a 1,500-word literature review organized by theme. Cite every claim inline using (AuthorLastName, Year). Include a Research gaps section. Use only the papers provided."
  • "For each paper, give me a row: research question, method, sample, key finding, limitation. Output as a Markdown table."
  • "Group these papers into 4–6 clusters by approach. Name each cluster, list its papers by DOI, and write a 3-sentence synthesis per cluster."

Pair this with the BibTeX / RIS export: once that file is imported, every DOI the LLM cites from the provided papers is already in your Zotero or Overleaf bibliography.
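
Because a model may occasionally cite a DOI that was not in the provided file, a quick check against the dataset can be worthwhile. The sketch below is a heuristic under stated assumptions: the regular expression is a common approximation of DOI syntax rather than a full parser (it stops at a closing parenthesis, which a few legitimate DOIs contain), and both function names are hypothetical.

```python
import re

# Approximate DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"'<>)\]]+")


def cited_dois(text):
    """Extract lower-cased DOIs from free text, trimming trailing punctuation."""
    return {match.rstrip(".,;").lower() for match in DOI_PATTERN.findall(text)}


def unknown_dois(llm_output, records):
    """Return DOIs that the LLM cited but that are absent from the dataset."""
    known = {r["doi"].lower() for r in records if r.get("doi")}
    return sorted(cited_dois(llm_output) - known)
```

An empty result means every cited DOI came from the run; anything else deserves a manual look before it reaches your bibliography.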


Use cases

  1. Thesis / dissertation lit review — seed your chapter with 100+ relevant papers in one run.
  2. RAG pipelines over academic content — ingest the Markdown or CSV into a vector store.
  3. AI research assistants — pre-index a corpus for a domain-specific chatbot.
  4. Systematic reviews — starting-point screening before manual PRISMA filtering.
  5. Citation-graph sanity checks — verify a paper is findable in multiple databases.
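
For the RAG use case, one straightforward approach is to split literature-review.md into one chunk per paper before embedding. The sketch below assumes the layout shown in the example output: paper sections follow a "## Papers" heading and are separated by lines containing only "---". The function name and the returned dictionary shape are hypothetical.

```python
import re


def split_papers(markdown):
    """Return one chunk per paper as {'title', 'doi', 'text'} dictionaries."""
    # Drop the summary header; everything after "## Papers" is per-paper content.
    body = markdown.split("\n## Papers\n", 1)[-1]
    chunks = []
    for block in re.split(r"^---[ \t]*$", body, flags=re.MULTILINE):
        block = block.strip()
        title = re.match(r"## (.+)", block)
        doi = re.search(r"^\*\*DOI:\*\* (\S+)", block, flags=re.MULTILINE)
        if title and doi:  # skips empty blocks and placeholders without a DOI
            chunks.append(
                {"title": title.group(1).strip(), "doi": doi.group(1), "text": block}
            )
    return chunks
```

Each chunk's `text` can be embedded as-is, with `doi` stored as metadata so retrieved passages map back to a citable record.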

Limitations (V1)

  • Metadata only. No full-text PDF download. Use openAccessUrl if you want to fetch PDFs yourself.
  • Max 1000 papers per run (de-duplicated). Run multiple queries for broader coverage.
  • The Semantic Scholar anonymous pool can return HTTP 429 (rate limited). If you need reliable Semantic Scholar results at scale, provide a free API key.
  • Crossref abstract coverage is sparse (~20%). Abstracts are taken from Semantic Scholar first (when it is enabled), then from OpenAlex.
  • No Google Scholar / Scopus / Web of Science. These either require commercial licenses or prohibit scraping in their ToS.
  • English-biased. The sources cover multiple languages but keyword matching works best in English.
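
If you call Semantic Scholar yourself without a key, the 429 responses mentioned above are usually handled with retries and exponential backoff. The sketch below shows the general pattern only, not the Actor's implementation: `RateLimited` and `with_backoff` are hypothetical names, and the caller's fetch function is assumed to raise `RateLimited` when it sees a 429.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's fetch function on an HTTP 429 response."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from a Retry-After header, if any


def with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            sleep(delay)
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.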

Legal and compliance

  • Output is metadata only, with no full-text reproduction. This respects copyright and each source's terms.
  • You are responsible for citing sources appropriately in your own work.
  • OpenAlex data is CC0. Crossref metadata is CC0. Semantic Scholar data is ODC-BY.
  • This Actor does not scrape Google Scholar. It is explicitly avoided because of Google's ToS and anti-bot measures.
  • No personal data is collected. contactEmail, if provided, is sent only to OpenAlex / Crossref as a polite-pool identifier per their API conventions; it is not stored by this Actor.

Roadmap

V2 (after V1 traction):

  • MCP Standby mode (expose as a tool for Claude / Cursor agents)
  • Open-access PDF fetching via Unpaywall
  • PubMed / arXiv / CORE as additional sources
  • Token-aware RAG-ready chunking
  • Citation-graph traversal
  • Scheduled re-runs for literature monitoring

Contact / feedback

Bug reports, feature requests, and use-case tips welcome — send a message at leafydevjr@gmail.com or leave a review on the Apify Store listing.