Thesis Literature Review Scraper
Pricing
from $1.00 / 1,000 results
Turn any research topic into a clean reading list of peer-reviewed papers from multiple data sources in one run. Includes citation-manager exports, spreadsheets, and an LLM-ready Markdown bundle for contextualizing AI queries with real academic literature.
Rating: 5.0 (20)
Developer: Leafy
Maintained by Community
Actor stats: 7 bookmarked, 27 total users, 3 monthly active users
Last modified: 19 days ago
Thesis Literature Review Scraper — Multi-Source Academic Papers with Citations & LLM-Ready Output
Paste a research topic → get a de-duplicated, structured, LLM-ready list of peer-reviewed papers. Perfect for thesis literature reviews, RAG pipelines, and AI research assistants.

What it does
Given a research topic or keywords, this Actor queries the selected free scholarly databases in parallel, de-duplicates the results by DOI, merges enriched fields across sources, and returns a single clean dataset plus LLM-ready exports. The default source pool is OpenAlex, Crossref, and PubMed; Semantic Scholar, arXiv, Europe PMC, and citation snowballing are opt-in.
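As a rough illustration of the de-duplicate-and-merge step described above, the sketch below collapses records that share a DOI and fills empty fields from the other copy. It is not the Actor's actual code: the helper names (`normalize_doi`, `title_key`, `merge_records`, `deduplicate`) are hypothetical, and the title fingerprint is a crude stand-in for whatever fuzzy matching the Actor really uses.

```python
import re

def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix; return None if empty."""
    if not doi:
        return None
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi.strip().lower())
    return doi or None

def title_key(title):
    """Crude title fingerprint (assumption): lower-case alphanumerics only."""
    return re.sub(r"[^a-z0-9]+", "", (title or "").lower())

def merge_records(a, b):
    """Keep a's values, fill empty fields from b, and union the source lists."""
    merged = dict(a)
    for key, value in b.items():
        if merged.get(key) in (None, "", []):
            merged[key] = value
    merged["sources"] = sorted(set(a.get("sources", [])) | set(b.get("sources", [])))
    return merged

def deduplicate(records):
    """Collapse records that share a DOI, or a title when one side has no DOI."""
    out = []       # merged records, in first-seen order
    by_doi = {}    # normalized DOI -> index into out
    by_title = {}  # title fingerprint -> index into out
    for rec in records:
        doi = normalize_doi(rec.get("doi"))
        tkey = title_key(rec.get("title"))
        idx = by_doi.get(doi) if doi else None
        if idx is None and tkey and tkey in by_title:
            cand = by_title[tkey]
            cand_doi = normalize_doi(out[cand].get("doi"))
            # Use the title fallback only when the two DOIs cannot contradict.
            if doi is None or cand_doi is None:
                idx = cand
        if idx is None:
            out.append(dict(rec))
            idx = len(out) - 1
        else:
            out[idx] = merge_records(out[idx], rec)
        if doi:
            by_doi[doi] = idx
        if tkey:
            by_title.setdefault(tkey, idx)
    return out
```

Keeping the first-seen record as the base and only filling gaps is one simple merge policy; a real pipeline might instead prefer a particular source per field.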
Why this exists
Writing a thesis lit review manually takes days of copy-paste across different sites like Google Scholar. Building a RAG chatbot over academic papers requires the same legwork. This Actor does it in one run.
Data sources
| Source | Size | Best for |
|---|---|---|
| OpenAlex | 250M+ works | Broad coverage across all disciplines; rich concept tagging |
| Semantic Scholar | ~200M papers | CS / ML / AI; influential-citation ranking; clean abstracts |
| Crossref | 160M+ records | Cross-publisher DOI metadata; humanities & journal articles |
| PubMed | 40M citations | Biomedical, clinical, life sciences |
| arXiv (new) | 2.4M+ preprints | Cutting-edge CS, ML, physics, math, statistics - months before journal publication |
| Europe PMC (new) | 43M+ records | Life sciences superset of PubMed; includes preprints, agricultural research, and patents |
All sources are public APIs with polite rate limits. arXiv follows the same opt-in pattern as Semantic Scholar: enable it only when preprint coverage matters for your topic. We do not scrape Google Scholar.
How the results stay relevant
The Actor builds a live result pool from the databases you select. By default that pool comes from OpenAlex, Crossref, and PubMed. Optional sources such as Semantic Scholar, arXiv, and Europe PMC add coverage when you explicitly select them; they are not queried by default.
After the live source pool is fetched, the Actor de-duplicates papers by DOI with fuzzy-title fallback. To make sure your final results are the most on-topic ones, every paper is scored against your research topic using a classic text-relevance algorithm (BM25) over its title and abstract. Common words like "the" or "study" are ignored, rarer words from your query count for more, and query terms that appear in the title weigh heavier than the same terms in the abstract.
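To make the scoring idea concrete, here is a minimal BM25 sketch in which title matches count more than abstract matches. Everything specific in it is an assumption for illustration: the stop-word list, the parameters `k1`, `b`, and `title_weight`, and the choice to boost titles by inflating their term frequency. The Actor's real weights and tokenizer are not documented here.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; the real list is not documented.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "for", "to", "study"}

def tokenize(text):
    """Lower-case, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", (text or "").lower())
            if t not in STOPWORDS]

def bm25_scores(query, papers, k1=1.5, b=0.75, title_weight=3.0):
    """Return one BM25 score per paper; a title hit counts title_weight times."""
    docs = []
    for paper in papers:
        tf = Counter()
        for tok in tokenize(paper.get("title")):
            tf[tok] += title_weight
        for tok in tokenize(paper.get("abstract")):
            tf[tok] += 1.0
        docs.append(tf)
    n = len(docs)
    lengths = [sum(tf.values()) for tf in docs]
    avg_len = sum(lengths) / n if n else 0.0
    q_terms = set(tokenize(query))
    # Document frequency: the number of papers containing each query term.
    df = {t: sum(1 for tf in docs if t in tf) for t in q_terms}
    scores = []
    for tf, length in zip(docs, lengths):
        score = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * length / avg_len)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores
```

Calling `bm25_scores(query, papers)` returns scores in the same order as `papers`, so they can be attached to each record before sorting.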
If enableCitationSnowballing is turned on, the Actor adds a second optional pool from OpenAlex references and forward citations of the top relevance-ranked seed papers. This can surface seminal or adjacent work that keyword search missed, but it is off by default because it adds runtime and broadens the result pool.
When sortBy is relevance (the default), the highest-scoring papers bubble to the top before the maxResults cap is applied. Pick citations or date instead and the ranking switches to those signals.
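The sort-then-cap order described above can be sketched as follows. The function name `rank_and_cap` is hypothetical; the field names come from the example dataset record, and the way missing values and date ties are handled is an assumption.

```python
def rank_and_cap(papers, sort_by="relevance", max_results=100):
    """Sort papers by the chosen signal (descending), then apply the cap."""
    keys = {
        "relevance": lambda p: p.get("relevanceScore") or 0.0,
        "citations": lambda p: p.get("citationCount") or 0,
        # ISO dates sort correctly as strings; fall back to the year if needed.
        "date": lambda p: p.get("publicationDate") or str(p.get("year") or ""),
    }
    if sort_by not in keys:
        raise ValueError(f"unknown sortBy value: {sort_by!r}")
    ranked = sorted(papers, key=keys[sort_by], reverse=True)
    return ranked[:max_results]
```

Sorting before slicing is what guarantees that the cap drops the weakest papers rather than an arbitrary tail of the fetch order.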
Input
| Field | Type | Default | Description |
|---|---|---|---|
| query | string (required) | pre-filled sample | Research topic or keywords (3–500 chars). Pre-filled with "artificial intelligence in higher education" so you can test-run without editing. |
yearFrom | integer | 2015 | Earliest publication year. |
yearTo | integer | current year | Latest publication year. |
maxResults | integer | 100 | Total de-duplicated papers to return (10–1000). |
sources | array | ["openalex", "crossref", "pubmed"] | Which databases to query: openalex, semanticscholar, crossref, pubmed, arxiv, europepmc. PubMed is included by default for biomedical/life-science coverage. Semantic Scholar, arXiv, and Europe PMC are opt-in. |
minCitations | integer | 0 | Filter out papers with fewer citations. |
openAccessOnly | boolean | false | Return only open-access papers. |
enableCitationSnowballing | boolean | false | Optional OpenAlex citation snowballing. Expands coverage by following references and forward citations of top relevance-ranked seed papers. Best for niche topics where keyword search misses seminal work; adds about 20-60 seconds to runtime. |
sortBy | enum | relevance | relevance / citations / date. |
outputFormat | array | ["bibtex", "markdown"] | Extra export formats on top of JSON. |
contactEmail | string | — | Optional — sent as a polite API identifier when a source supports it. |
semanticScholarApiKey | string (secret) | — | Optional — bypasses Semantic Scholar's shared rate limit. |
ncbiApiKey | string (secret) | — | Optional — allows higher PubMed E-utilities request rates. |
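If you build the input programmatically, a small pre-flight check against the limits in the table above can catch mistakes before a run is started. This is a hypothetical client-side helper, not part of the Actor; the `yearFrom <= yearTo` rule is an assumption, since the table does not state it.

```python
from datetime import date

ALLOWED_SOURCES = {"openalex", "semanticscholar", "crossref",
                   "pubmed", "arxiv", "europepmc"}
ALLOWED_SORT = {"relevance", "citations", "date"}

def validate_input(run_input):
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    query = run_input.get("query")
    if not isinstance(query, str) or not 3 <= len(query) <= 500:
        problems.append("query must be a string of 3-500 characters")
    max_results = run_input.get("maxResults", 100)
    if not isinstance(max_results, int) or not 10 <= max_results <= 1000:
        problems.append("maxResults must be an integer between 10 and 1000")
    sources = run_input.get("sources", ["openalex", "crossref", "pubmed"])
    unknown = sorted(set(sources) - ALLOWED_SOURCES)
    if unknown:
        problems.append("unknown sources: " + ", ".join(unknown))
    if run_input.get("sortBy", "relevance") not in ALLOWED_SORT:
        problems.append("sortBy must be relevance, citations, or date")
    # Assumption: a reversed year range is treated as an error.
    year_from = run_input.get("yearFrom", 2015)
    year_to = run_input.get("yearTo", date.today().year)
    if year_from > year_to:
        problems.append("yearFrom must not be later than yearTo")
    return problems
```

Returning a list of messages instead of raising on the first error lets a caller report every problem at once.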
Example input
    {
      "query": "impact of social media on adolescent mental health",
      "yearFrom": 2018,
      "yearTo": 2026,
      "maxResults": 75,
      "sources": ["openalex", "crossref", "pubmed"],
      "sortBy": "citations",
      "outputFormat": ["markdown", "bibtex", "csv"]
    }
Output
1. Dataset
One record per paper. The console shows two views:
- All fields (default) — every field in a flat, spreadsheet-friendly order. Best for exporting to Excel.
- Papers — curated subset (title, authors, year, venue, citations, DOI, OA, sources). Best for quickly scanning results.
Key fields per record:
| Group | Fields |
|---|---|
| Identity | doi, openAlexId, semanticScholarId, pmid, arxivId |
| Metadata | title, abstract, authors, authorsDisplay, firstAuthor, authorCount, year, venue, publisher |
| Metrics | citationCount, influentialCitationCount, referenceCount, fieldsOfStudy |
| Access | isOpenAccess, openAccessUrl, landingPageUrl |
| Provenance | sources, primarySource |
| LLM payload | llmSummary, literature-review.md |
2. Key-Value Store files
Depending on outputFormat:
| File | Use it for |
|---|---|
| literature-review.md | LLM synthesis: attach to ChatGPT / Claude / Gemini. |
| references.bib | BibTeX: import into Overleaf / LaTeX / Zotero BibTeX library. |
| references.ris | RIS: import into Zotero, Mendeley, EndNote, or Citavi. |
| papers.csv | Excel / Google Sheets / Numbers. |
| METADATA | Per-source fetch status, total citations, de-dupe counts, run timestamp. |
Example output
Abridged excerpts from a run with query: "artificial intelligence in higher education".
Dataset record (JSON) — one complete paper record as returned in the dataset:
    {
      "title": "Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions",
      "authors": [
        {"name": "Valentin Kuleto", "orcid": "https://orcid.org/0000-0002-7811-5436", "affiliation": "University Business Academy in Novi Sad"},
        {"name": "Milena P. Ilić", "orcid": "https://orcid.org/0000-0002-2656-1449", "affiliation": "Information Technology School, Belgrade"},
        {"name": "Mihail Dumangiu", "orcid": null, "affiliation": null},
        {"name": "Marko Ranković", "orcid": null, "affiliation": null}
      ],
      "authorsDisplay": "Valentin Kuleto; Milena P. Ilić; Mihail Dumangiu; Marko Ranković",
      "firstAuthor": "Valentin Kuleto",
      "authorCount": 4,
      "year": 2021,
      "publicationDate": "2021-09-17",
      "venue": "Sustainability",
      "venueType": "journal",
      "publisher": "MDPI AG",
      "abstract": "Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.",
      "citationCount": 412,
      "referenceCount": 58,
      "influentialCitationCount": 38,
      "fieldsOfStudy": ["Computer Science", "Education", "Sustainability"],
      "isOpenAccess": true,
      "openAccessUrl": "https://www.mdpi.com/2071-1050/13/18/10424/pdf",
      "landingPageUrl": "https://doi.org/10.3390/su131810424",
      "doi": "10.3390/su131810424",
      "openAlexId": "W3199263016",
      "semanticScholarId": null,
      "pmid": null,
      "arxivId": null,
      "sources": ["openalex", "crossref"],
      "primarySource": "openalex",
      "llmSummary": "## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions\n\n**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)\n**Venue:** Sustainability (journal)\n**Citations:** 412 (influential: 38) | **Open Access:** yes\n**DOI:** 10.3390/su131810424\n**Fields:** Computer Science, Education, Sustainability\n**Sources:** openalex, crossref\n\n**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate…",
      "relevanceScore": 18.4
    }
literature-review.md (LLM-ready Markdown) — first three papers of a run:
    # Literature Review: artificial intelligence in higher education

    *Generated 2026-04-19T22:00:00.000Z | 75 papers from 2 sources*

    ## Summary Stats
    - Date range: 2020–2026
    - Total citations across corpus: 18,423
    - Open access: 47/75
    - Top venues: Sustainability (6); Computers & Education (4); IEEE Access (3); International Journal of Educational Technology in Higher Education (3)

    ## Source Status
    - **openalex**: ok (600 fetched)
    - **semanticscholar**: failed (0 fetched, error: not requested)
    - **crossref**: ok (200 fetched)
    - **pubmed**: ok (100 fetched)
    - **arxiv**: failed (0 fetched, error: not requested)
    - **europepmc**: failed (0 fetched, error: not requested)

    ## Papers

    ## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions

    **Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)
    **Venue:** Sustainability (journal)
    **Citations:** 412 (influential: 38) | **Open Access:** yes
    **DOI:** 10.3390/su131810424
    **Fields:** Computer Science, Education, Sustainability
    **Sources:** openalex, crossref

    **Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.

    ---

    ## Artificial intelligence in higher education: the state of the field

    **Authors:** Helen Crompton, Diane Burke (2023)
    **Venue:** International Journal of Educational Technology in Higher Education (journal)
    **Citations:** 287 (influential: 29) | **Open Access:** yes
    **DOI:** 10.1186/s41239-023-00392-8
    **Fields:** Education, Computer Science
    **Sources:** openalex, crossref

    **Abstract:** This systematic review examines the state of artificial intelligence (AI) research in higher education. Drawing from 138 empirical studies published between 2016 and 2022, we map the field across five dimensions: AI application types, pedagogical goals, student populations, methodological approaches, and reported outcomes. Findings indicate a heavy concentration on adaptive learning systems and intelligent tutoring, with under-representation of equity, ethics, and faculty-perspective research. We propose a research agenda for the next phase of AI-in-higher-ed scholarship.

    ---

    ## ChatGPT for good? On opportunities and challenges of large language models for education

    **Authors:** Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, et al. (2023)
    **Venue:** Learning and Individual Differences (journal)
    **Citations:** 1,942 (influential: 184) | **Open Access:** yes
    **DOI:** 10.1016/j.lindif.2023.102274
    **Fields:** Education, Computer Science, Linguistics
    **Sources:** openalex, crossref

    **Abstract:** Large language models (LLMs) such as ChatGPT are transforming how students access information and produce academic work. This position paper surveys the opportunities LLMs create for educators (personalized feedback, lesson planning, accessibility support) alongside the challenges they raise (academic integrity, factual reliability, equity of access, and assessment redesign). We outline an actionable framework for institutions considering LLM integration, covering policy, pedagogy, and tooling.

    ---

    ## [next paper] …
LLM-ready Markdown — attach it to your AI model
When outputFormat includes markdown, the Actor generates a single literature-review.md file in the Key-Value Store. Each paper is structured into a clean, consistent format with metadata, abstract, DOI, and extracted key insights.
Instead of pasting individual papers into a long chat, you can attach this file directly to ChatGPT, Claude, Gemini, or any LLM that supports document input. This allows the model to work directly from the full set of curated academic sources.
This makes it easier to:
- Analyze papers using real academic sources
- Compare methods, findings, and arguments across studies
- Identify research gaps and underexplored areas
- Decide which papers are worth reading in full
- Generate structured summaries or literature reviews
How to use it:
- Run the Actor.
- Open the run → Storage → Key-Value Store.
- Download literature-review.md.
- Attach it to your AI model.
- Ask it to analyze only the provided papers.
Suggested prompts
- "Summarize the main themes across these papers and group them accordingly."
- "What research gaps or missing areas appear across the literature?"
- "Which papers should I read first for a strong foundational understanding?"
- "Generate a literature synthesis in APA style."
- "Create a structured literature matrix (question, method, findings, limitations)."
Pair this with BibTeX or RIS export so every referenced paper can be directly imported into Zotero, Overleaf, or other citation tools.
Use cases
- Thesis / dissertation lit review — seed your chapter with 100+ relevant papers in one run.
- RAG pipelines over academic content — ingest the Markdown or CSV into a vector store.
- Citation-graph sanity checks — verify a paper is findable in multiple databases.
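For the RAG use case above, one plausible first step is to split literature-review.md into one chunk per paper before embedding. The sketch below assumes the file layout matches the example excerpt (a `## Papers` heading, then one `## Title` block per paper separated by `---` lines); `split_papers` is a hypothetical helper, not something the Actor ships.

```python
import re

def split_papers(markdown_text):
    """Split the Markdown bundle into one dict per paper (title, doi, text)."""
    # Everything before the "## Papers" heading is run-level summary material.
    _, _, body = markdown_text.partition("\n## Papers\n")
    chunks = []
    for block in re.split(r"^---[ \t]*$", body, flags=re.MULTILINE):
        block = block.strip()
        if not block.startswith("## "):
            continue
        title = block.splitlines()[0][3:].strip()
        doi_match = re.search(r"^\*\*DOI:\*\*\s*(\S+)", block, flags=re.MULTILINE)
        chunks.append({
            "title": title,
            "doi": doi_match.group(1) if doi_match else None,
            "text": block,
        })
    return chunks
```

Each returned `text` value can be embedded as-is, with `title` and `doi` stored as metadata so retrieved chunks can be traced back to a citable paper.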
Limitations (V1)
- Metadata only. No full-text PDF download. Use openAccessUrl if you want to fetch PDFs yourself.
- Max 1000 papers per run (de-duplicated). Run multiple queries for broader coverage.
- Semantic Scholar's anonymous pool can return HTTP 429 (rate-limited) responses. If you need reliable Semantic Scholar results at scale, provide a free API key.
- Crossref abstract coverage is sparse (~20%). Primary abstract sources are Semantic Scholar, Europe PMC, OpenAlex, and arXiv.
- Source fit varies by domain. PubMed and Europe PMC are strongest for biomedical and life-science topics; arXiv is strongest for preprints in CS, ML, physics, math, and statistics.
- No Google Scholar / Scopus / Web of Science. These require commercial licenses.
- English-biased. The sources cover multiple languages but keyword matching works best in English.
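Since the Actor returns metadata only, fetching PDFs is left to you. As a starting point, the sketch below picks out the open-access records that carry a URL and derives a safe file name for each. `pdf_targets` is a hypothetical helper; it relies on the `isOpenAccess`, `openAccessUrl`, `doi`, and `openAlexId` fields shown in the example record, and it assumes (without checking) that the URL points at a PDF.

```python
import re

def pdf_targets(records):
    """Return (filename, url) pairs for open-access records that carry a URL."""
    targets = []
    for rec in records:
        url = rec.get("openAccessUrl")
        if not (rec.get("isOpenAccess") and url):
            continue
        # Prefer the DOI as a stable name; fall back to other identifiers.
        stem = rec.get("doi") or rec.get("openAlexId") or f"paper-{len(targets) + 1}"
        filename = re.sub(r"[^A-Za-z0-9._-]+", "_", stem) + ".pdf"
        targets.append((filename, url))
    return targets
```

Each pair can then be handed to a downloader of your choice, for example `urllib.request.urlretrieve(url, filename)`, ideally with a delay between requests and a check that the publisher's terms allow it.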
Legal & licensing
- Output is metadata only — no full-text reproduction. Respects copyright and each source's terms.
- You are responsible for citing sources appropriately in your own work.
- OpenAlex data is CC0. Crossref metadata is CC0. Semantic Scholar data is ODC-BY. PubMed metadata is accessed through NCBI E-utilities.
- This Actor does not scrape Google Scholar — explicitly avoided due to their ToS and anti-bot measures.
- No personal data is collected. contactEmail, if provided, is sent only as a polite-pool/API identifier per source conventions; it is not stored by this Actor.
Roadmap
V2 (after V1 traction):
- MCP Standby mode (expose as a tool for Claude / Cursor agents)
- Token-aware RAG-ready chunking
Contact / feedback
Bug reports, feature requests, and use-case tips welcome - leave a review on the Apify Store listing.
