Pricing

from $1.00 / 1,000 results

Thesis Literature Review Scraper

Turn any research topic into a clean reading list of peer-reviewed papers from multiple data sources in one run. Includes citation-manager exports, spreadsheets, and an LLM-ready Markdown bundle for contextualizing AI queries with real academic literature.

Rating: 5.0 (20)

Developer: Leafy

Last modified: 18 days ago

Thesis Literature Review Scraper — Multi-Source Academic Papers with Citations & LLM-Ready Output

Paste a research topic → get a de-duplicated, structured, LLM-ready list of peer-reviewed papers. Perfect for thesis literature reviews, RAG pipelines, and AI research assistants.

What it does

Given a research topic or keywords, this Actor queries the selected free scholarly databases in parallel, de-duplicates the results by DOI, merges enriched fields across sources, and returns a single clean dataset plus LLM-ready exports. The default source pool is OpenAlex, Crossref, and PubMed; Semantic Scholar, arXiv, Europe PMC, and citation snowballing are opt-in.
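The de-duplication and merge step described above can be sketched as follows. This is a minimal illustration, not the Actor's actual code: the helper names, the `sources` list field, and the 0.95 title-similarity threshold are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher


def normalize_doi(doi):
    """Lower-case a DOI and strip a resolver prefix; return None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi) or None


def normalize_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", (title or "").lower()).strip()


def merge_records(base, other):
    """Fill empty fields in `base` from `other` and union the source lists."""
    merged = dict(base)
    for key, value in other.items():
        if key != "sources" and merged.get(key) in (None, "", []):
            merged[key] = value
    merged["sources"] = sorted(
        set(base.get("sources", [])) | set(other.get("sources", []))
    )
    return merged


def deduplicate(records, title_threshold=0.95):
    """Merge records that share a DOI; use fuzzy title matching when a DOI is missing."""
    unique = []
    for record in records:
        doi = normalize_doi(record.get("doi"))
        title = normalize_title(record.get("title"))
        match_index = None
        for i, seen in enumerate(unique):
            seen_doi = normalize_doi(seen.get("doi"))
            if doi and seen_doi:
                if doi == seen_doi:
                    match_index = i
                    break
                continue  # two different DOIs are never merged on title alone
            ratio = SequenceMatcher(
                None, title, normalize_title(seen.get("title"))
            ).ratio()
            if title and ratio >= title_threshold:
                match_index = i
                break
        if match_index is None:
            unique.append(dict(record))
        else:
            unique[match_index] = merge_records(unique[match_index], record)
    return unique
```

Comparing DOIs first keeps the fuzzy title comparison as a fallback only, so two distinct papers with near-identical titles are not merged when both carry a DOI.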

Why this exists

Writing a thesis lit review manually takes days of copy-paste across different sites like Google Scholar. Building a RAG chatbot over academic papers requires the same legwork. This Actor does it in one run.

Data sources

Source	Size	Best for
OpenAlex	250M+ works	Broad coverage across all disciplines; rich concept tagging
Semantic Scholar	~200M papers	CS / ML / AI; influential-citation ranking; clean abstracts
Crossref	160M+ records	Cross-publisher DOI metadata; humanities & journal articles
PubMed	40M citations	Biomedical, clinical, life sciences
arXiv (new)	2.4M+ preprints	Cutting-edge CS, ML, physics, math, statistics - months before journal publication
Europe PMC (new)	43M+ records	Life sciences superset of PubMed; includes preprints, agricultural research, and patents

All sources are public APIs with polite rate limits. arXiv follows the same opt-in pattern as Semantic Scholar: enable it only when preprint coverage matters for your topic. We do not scrape Google Scholar.

How the results stay relevant

The Actor builds a live result pool from the databases you select. By default that pool comes from OpenAlex, Crossref, and PubMed. Optional sources such as Semantic Scholar, arXiv, and Europe PMC add coverage when you explicitly select them; they are not queried by default.

After the live source pool is fetched, the Actor de-duplicates papers by DOI with fuzzy-title fallback. To make sure your final results are the most on-topic ones, every paper is scored against your research topic using a classic text-relevance algorithm (BM25) over its title and abstract. Common words like "the" or "study" are ignored, rarer words from your query count for more, and query terms that appear in the title weigh heavier than the same terms in the abstract.
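To make the scoring idea concrete, here is a small BM25 sketch with title weighting. It is an illustration under stated assumptions, not the Actor's implementation: the stop-word list, the title weight of 3, and the `k1` and `b` values are hypothetical choices.

```python
import math
import re
from collections import Counter

# Assumptions for illustration only: a tiny stop-word list and a 3x title weight.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "for", "to", "study"}
TITLE_WEIGHT = 3


def tokenize(text):
    """Split text into lower-case word tokens, dropping stop words."""
    tokens = re.findall(r"[a-z0-9]+", (text or "").lower())
    return [t for t in tokens if t not in STOP_WORDS]


def weighted_term_counts(paper):
    """Count abstract terms once and title terms TITLE_WEIGHT times."""
    counts = Counter(tokenize(paper.get("abstract")))
    for term in tokenize(paper.get("title")):
        counts[term] += TITLE_WEIGHT
    return counts


def bm25_scores(query, papers, k1=1.5, b=0.75):
    """Return one BM25 score per paper for the given query."""
    docs = [weighted_term_counts(p) for p in papers]
    lengths = [sum(d.values()) for d in docs]
    avg_len = sum(lengths) / len(docs) if docs else 0.0
    terms = set(tokenize(query))
    doc_freq = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for doc, length in zip(docs, lengths):
        score = 0.0
        for term in terms:
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            # Rarer query terms get a larger inverse document frequency.
            idf = math.log(
                1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            norm = tf + k1 * (1 - b + b * length / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With this weighting, a paper whose title contains the query terms outranks one that only mentions them in the abstract, and a paper with no query terms scores zero.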

If enableCitationSnowballing is turned on, the Actor adds a second optional pool from OpenAlex references and forward citations of the top relevance-ranked seed papers. This can surface seminal or adjacent work that keyword search missed, but it is off by default because it adds runtime and broadens the result pool.
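As a rough sketch of what snowballing involves, the function below builds OpenAlex query URLs for the references and forward citations of a set of seed works. It relies on the public OpenAlex `cited_by` and `cites` work filters; whether the Actor uses these exact filters is not documented here, and the function name and page size are assumptions.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def snowball_urls(seed_ids, per_page=50, mailto=None):
    """Build (seed, direction, url) tuples for references and citations of each seed."""
    urls = []
    for work_id in seed_ids:
        # `cited_by:<id>` lists the works that the seed references;
        # `cites:<id>` lists the works that cite the seed.
        for direction, filter_name in (("references", "cited_by"), ("citations", "cites")):
            params = {"filter": f"{filter_name}:{work_id}", "per-page": per_page}
            if mailto:
                params["mailto"] = mailto  # optional polite-pool identifier
            urls.append((work_id, direction, f"{OPENALEX_WORKS}?{urlencode(params)}"))
    return urls
```

Each seed produces two requests, which is one reason snowballing adds runtime and widens the result pool.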

When sortBy is relevance (the default), the highest-scoring papers bubble to the top before the maxResults cap is applied. Pick citations or date instead and the ranking switches to those signals.
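The ordering matters: sorting happens first and the cap second, so the cap keeps the best papers rather than an arbitrary slice. A minimal sketch, assuming records shaped like the dataset example in this README (the function name is hypothetical):

```python
def rank_and_cap(papers, sort_by="relevance", max_results=100):
    """Sort papers by the chosen signal (highest first), then apply the cap."""
    sort_keys = {
        "relevance": lambda p: p.get("relevanceScore") or 0.0,
        "citations": lambda p: p.get("citationCount") or 0,
        # ISO dates sort correctly as strings; fall back to the year if no date.
        "date": lambda p: p.get("publicationDate") or str(p.get("year") or ""),
    }
    if sort_by not in sort_keys:
        raise ValueError(f"unknown sortBy value: {sort_by!r}")
    return sorted(papers, key=sort_keys[sort_by], reverse=True)[:max_results]
```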

Input

Field	Type	Default	Description
`query`	string (required)	pre-filled sample	Research topic or keywords (3–500 chars). Pre-filled with `artificial intelligence in higher education` so you can test-run without editing.
`yearFrom`	integer	2015	Earliest publication year.
`yearTo`	integer	current year	Latest publication year.
`maxResults`	integer	100	Total de-duplicated papers to return (10–1000).
`sources`	array	`["openalex", "crossref", "pubmed"]`	Which databases to query: `openalex`, `semanticscholar`, `crossref`, `pubmed`, `arxiv`, `europepmc`. PubMed is included by default for biomedical/life-science coverage. Semantic Scholar, arXiv, and Europe PMC are opt-in.
`minCitations`	integer	0	Filter out papers with fewer citations.
`openAccessOnly`	boolean	false	Return only open-access papers.
`enableCitationSnowballing`	boolean	false	Optional OpenAlex citation snowballing. Expands coverage by following references and forward citations of top relevance-ranked seed papers. Best for niche topics where keyword search misses seminal work; adds about 20-60 seconds to runtime.
`sortBy`	enum	`relevance`	`relevance` / `citations` / `date`.
`outputFormat`	array	`["bibtex", "markdown"]`	Extra export formats on top of JSON.
`contactEmail`	string	—	Optional — sent as a polite API identifier when a source supports it.
`semanticScholarApiKey`	string (secret)	—	Optional — bypasses Semantic Scholar's shared rate limit.
`ncbiApiKey`	string (secret)	—	Optional — allows higher PubMed E-utilities request rates.
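If you build inputs programmatically, the ranges and defaults from the table above can be checked before starting a run. The sketch below is a client-side convenience, not part of the Actor; the function name is hypothetical and trimming the query is an assumption.

```python
from datetime import date

ALLOWED_SOURCES = {"openalex", "semanticscholar", "crossref", "pubmed", "arxiv", "europepmc"}
SORT_OPTIONS = {"relevance", "citations", "date"}
# Defaults copied from the input table.
DEFAULTS = {
    "yearFrom": 2015,
    "maxResults": 100,
    "sources": ["openalex", "crossref", "pubmed"],
    "minCitations": 0,
    "openAccessOnly": False,
    "enableCitationSnowballing": False,
    "sortBy": "relevance",
    "outputFormat": ["bibtex", "markdown"],
}


def validate_input(raw):
    """Apply the documented defaults and reject values outside the documented ranges."""
    config = {**DEFAULTS, **raw}
    config.setdefault("yearTo", date.today().year)
    query = str(config.get("query", "")).strip()
    if not 3 <= len(query) <= 500:
        raise ValueError("query must be 3-500 characters")
    if not 10 <= config["maxResults"] <= 1000:
        raise ValueError("maxResults must be between 10 and 1000")
    unknown = set(config["sources"]) - ALLOWED_SOURCES
    if unknown:
        raise ValueError(f"unknown sources: {sorted(unknown)}")
    if config["yearFrom"] > config["yearTo"]:
        raise ValueError("yearFrom must not be later than yearTo")
    if config["sortBy"] not in SORT_OPTIONS:
        raise ValueError(f"sortBy must be one of {sorted(SORT_OPTIONS)}")
    config["query"] = query
    config["sources"] = list(config["sources"])
    return config
```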

Example input

{
    "query": "impact of social media on adolescent mental health",
    "yearFrom": 2018,
    "yearTo": 2026,
    "maxResults": 75,
    "sources": ["openalex", "crossref", "pubmed"],
    "sortBy": "citations",
    "outputFormat": ["markdown", "bibtex", "csv"]
}
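One way to send an input like this from code is the Apify API's synchronous run endpoint, which starts an Actor and returns its dataset items in the response. The sketch below only builds the request with the standard library; the Actor ID is a placeholder because the exact ID is not stated in this README, and the helper name is hypothetical.

```python
import json
import urllib.request

# Placeholder: replace with the real "<username>~<actor-name>" ID from the Actor page.
ACTOR_ID = "<username>~<actor-name>"


def build_run_request(actor_id, token, run_input):
    """Build a POST request that runs an Actor and returns its dataset items."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Usage (performs a real, billable run, so it is left commented out):
# request = build_run_request(ACTOR_ID, "YOUR_APIFY_TOKEN", {"query": "..."})
# with urllib.request.urlopen(request) as response:
#     papers = json.load(response)
```

Synchronous runs are subject to a time limit on the API side, so for large `maxResults` values or snowballing runs an asynchronous run followed by a dataset download may be the safer choice.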

Output

1. Dataset

One record per paper. The console shows two views:

- All fields (default): every field in a flat, spreadsheet-friendly order. Best for exporting to Excel.
- Papers: curated subset (title, authors, year, venue, citations, DOI, OA, sources). Best for quickly scanning results.

Key fields per record:

Group	Fields
Identity	`doi`, `openAlexId`, `semanticScholarId`, `pmid`, `arxivId`
Metadata	`title`, `abstract`, `authorsDisplay`, `firstAuthor`, `authorCount`, `year`, `venue`, `publisher`
Metrics	`citationCount`, `influentialCitationCount`, `referenceCount`, `fieldsOfStudyDisplay`
Access	`isOpenAccess`, `openAccessUrl`, `landingPageUrl`
Provenance	`sourcesDisplay`, `primarySource`
LLM payload	`llmSummary` (the combined `literature-review.md` file is stored in the Key-Value Store, not in each record)

2. Key-Value Store files

Depending on outputFormat:

File	Use it for
`literature-review.md`	LLM synthesis: attach to ChatGPT / Claude / Gemini.
`references.bib`	BibTeX — import into Overleaf / LaTeX / Zotero BibTeX library.
`references.ris`	RIS — import into Zotero, Mendeley, EndNote, or Citavi.
`papers.csv`	Excel / Google Sheets / Numbers.
`METADATA`	Per-source fetch status, total citations, de-dupe counts, run timestamp.

Example output

Abridged excerpts from a run with query: "artificial intelligence in higher education".

Dataset record (JSON) — one complete paper record as returned in the dataset:

{
    "title": "Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions",
    "authorsDisplay": "Valentin Kuleto; Milena P. Ilić; Mihail Dumangiu; Marko Ranković",
    "firstAuthor": "Valentin Kuleto",
    "authorCount": 4,
    "year": 2021,
    "publicationDate": "2021-09-17",
    "venue": "Sustainability",
    "venueType": "journal",
    "publisher": "MDPI AG",
    "abstract": "Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.",
    "citationCount": 412,
    "referenceCount": 58,
    "influentialCitationCount": 38,
    "fieldsOfStudyDisplay": "Computer Science, Education, Sustainability",
    "isOpenAccess": true,
    "openAccessUrl": "https://www.mdpi.com/2071-1050/13/18/10424/pdf",
    "landingPageUrl": "https://doi.org/10.3390/su131810424",
    "doi": "10.3390/su131810424",
    "openAlexId": "W3199263016",
    "semanticScholarId": null,
    "pmid": null,
    "arxivId": null,
    "sourcesDisplay": "openalex, crossref",
    "primarySource": "openalex",
    "llmSummary": "## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions\n\n**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)\n**Venue:** Sustainability (journal)\n**Citations:** 412 (influential: 38) | **Open Access:** yes\n**DOI:** 10.3390/su131810424\n**Fields:** Computer Science, Education, Sustainability\n**Sources:** openalex, crossref\n\n**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate…",
    "relevanceScore": 18.4
}

literature-review.md (LLM-ready Markdown) — first three papers of a run:

# Literature Review: artificial intelligence in higher education
*Generated 2026-04-19T22:00:00.000Z | 75 papers from 2 sources*

## Summary Stats
- Date range: 2020–2026
- Total citations across corpus: 18,423
- Open access: 47/75
- Top venues: Sustainability (6); Computers & Education (4); IEEE Access (3); International Journal of Educational Technology in Higher Education (3)

## Source Status
- **openalex**: ok (600 fetched)
- **semanticscholar**: failed (0 fetched, error: not requested)
- **crossref**: ok (200 fetched)
- **pubmed**: ok (100 fetched)
- **arxiv**: failed (0 fetched, error: not requested)
- **europepmc**: failed (0 fetched, error: not requested)

## Papers

## Exploring Opportunities and Challenges of Artificial Intelligence and Machine Learning in Higher Education Institutions

**Authors:** Valentin Kuleto, Milena P. Ilić, Mihail Dumangiu, et al. (2021)
**Venue:** Sustainability (journal)
**Citations:** 412 (influential: 38) | **Open Access:** yes
**DOI:** 10.3390/su131810424
**Fields:** Computer Science, Education, Sustainability
**Sources:** openalex, crossref

**Abstract:** Artificial Intelligence (AI) and Machine Learning (ML) are reshaping how higher education institutions (HEIs) operate, teach, and serve students. This paper explores the opportunities and challenges of integrating AI and ML into HEIs, drawing on a mixed-methods study of 108 faculty and administrators across six institutions. We identify five opportunity areas (personalized learning, administrative automation, predictive analytics, research augmentation, and accessibility) and four challenge areas (data governance, faculty readiness, equity, and cost). We conclude with a roadmap for responsible adoption.

---

## Artificial intelligence in higher education: the state of the field

**Authors:** Helen Crompton, Diane Burke (2023)
**Venue:** International Journal of Educational Technology in Higher Education (journal)
**Citations:** 287 (influential: 29) | **Open Access:** yes
**DOI:** 10.1186/s41239-023-00392-8
**Fields:** Education, Computer Science
**Sources:** openalex, crossref

**Abstract:** This systematic review examines the state of artificial intelligence (AI) research in higher education. Drawing from 138 empirical studies published between 2016 and 2022, we map the field across five dimensions: AI application types, pedagogical goals, student populations, methodological approaches, and reported outcomes. Findings indicate a heavy concentration on adaptive learning systems and intelligent tutoring, with under-representation of equity, ethics, and faculty-perspective research. We propose a research agenda for the next phase of AI-in-higher-ed scholarship.

---

## ChatGPT for good? On opportunities and challenges of large language models for education

**Authors:** Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, et al. (2023)
**Venue:** Learning and Individual Differences (journal)
**Citations:** 1,942 (influential: 184) | **Open Access:** yes
**DOI:** 10.1016/j.lindif.2023.102274
**Fields:** Education, Computer Science, Linguistics
**Sources:** openalex, crossref

**Abstract:** Large language models (LLMs) such as ChatGPT are transforming how students access information and produce academic work. This position paper surveys the opportunities LLMs create for educators (personalized feedback, lesson planning, accessibility support) alongside the challenges they raise (academic integrity, factual reliability, equity of access, and assessment redesign). We outline an actionable framework for institutions considering LLM integration, covering policy, pedagogy, and tooling.

---

## [next paper] …

LLM-ready Markdown — attach it to your AI model

When outputFormat includes markdown, the Actor generates a single literature-review.md file in the Key-Value Store. Each paper is structured into a clean, consistent format with metadata, abstract, DOI, and extracted key insights.

Instead of pasting individual papers into a long chat, you can attach this file directly to ChatGPT, Claude, Gemini, or any LLM that supports document input. This allows the model to work directly from the full set of curated academic sources.

This makes it easier to:

- Analyze papers using real academic sources
- Compare methods, findings, and arguments across studies
- Identify research gaps and underexplored areas
- Decide which papers are worth reading in full
- Generate structured summaries or literature reviews

How to use it:

1. Run the Actor.
2. Open the run → Storage → Key-Value Store.
3. Download literature-review.md.
4. Attach it to your AI model.
5. Ask it to analyze only the provided papers.

Suggested prompts

- "Summarize the main themes across these papers and group them accordingly."
- "What research gaps or missing areas appear across the literature?"
- "Which papers should I read first for a strong foundational understanding?"
- "Generate a literature synthesis in APA style."
- "Create a structured literature matrix (question, method, findings, limitations)."

Pair this with BibTeX or RIS export so every referenced paper can be directly imported into Zotero, Overleaf, or other citation tools.

Use cases

- Thesis / dissertation lit review: seed your chapter with 100+ relevant papers in one run.
- RAG pipelines over academic content: ingest the Markdown or CSV into a vector store.
- Citation-graph sanity checks: verify a paper is findable in multiple databases.
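For the RAG use case, a natural first step is to split literature-review.md into one chunk per paper before embedding. The sketch below assumes the layout shown in the example output (a `## Papers` heading, one `## <title>` block per paper, and `---` separators); the function name and the returned dictionary keys are hypothetical.

```python
import re


def split_papers(markdown):
    """Split a literature-review.md string into one chunk per paper."""
    # Everything before "## Papers" is run-level summary, not paper content.
    _, _, body = markdown.partition("\n## Papers\n")
    chunks = []
    for block in re.split(r"^---[ \t]*$", body, flags=re.MULTILINE):
        block = block.strip()
        title_match = re.match(r"## (.+)", block)
        if not title_match:
            continue  # skip empty trailing blocks
        doi_match = re.search(r"^\*\*DOI:\*\* (\S+)", block, flags=re.MULTILINE)
        chunks.append({
            "title": title_match.group(1).strip(),
            "doi": doi_match.group(1) if doi_match else None,
            "text": block,
        })
    return chunks
```

Keeping the DOI alongside each chunk lets a retrieval step cite the source paper for every passage it returns.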

Limitations (V1)

- Metadata only. No full-text PDF download. Use openAccessUrl if you want to fetch PDFs yourself.
- Max 1000 papers per run (de-duplicated). Run multiple queries for broader coverage.
- Semantic Scholar's anonymous shared pool can return HTTP 429 (rate limited). If you need reliable Semantic Scholar results at scale, provide a free API key in semanticScholarApiKey.
- Crossref abstract coverage is sparse (~20%). Primary abstract sources are Semantic Scholar, Europe PMC, OpenAlex, and arXiv.
- Source fit varies by domain. PubMed and Europe PMC are strongest for biomedical and life-science topics; arXiv is strongest for preprints in CS, ML, physics, math, and statistics.
- No Google Scholar / Scopus / Web of Science. These require commercial licenses.
- English-biased. The sources cover multiple languages but keyword matching works best in English.
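If you call a rate-limited source such as Semantic Scholar yourself, for example to fetch extra fields for a returned paper, a retry with exponential backoff is a common way to handle HTTP 429 responses. This is a generic standard-library sketch, not the Actor's retry logic; the function name, attempt count, and delays are assumptions.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, headers=None, max_attempts=5, base_delay=1.0,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """GET a URL, retrying with exponential backoff when the server answers 429."""
    request = urllib.request.Request(url, headers=headers or {})
    for attempt in range(max_attempts):
        try:
            with opener(request) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            # Re-raise anything that is not a rate limit, or the final failure.
            if error.code != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

A Semantic Scholar API key can be passed as `headers={"x-api-key": "..."}`; the `opener` and `sleep` parameters exist only so the function can be exercised without network access or real delays.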

Legal & licensing

- Output is metadata only, with no full-text reproduction. Respects copyright and each source's terms.
- You are responsible for citing sources appropriately in your own work.
- OpenAlex data is CC0. Crossref metadata is CC0. Semantic Scholar data is ODC-BY. PubMed metadata is accessed through NCBI E-utilities.
- This Actor does not scrape Google Scholar; this is explicitly avoided because of its ToS and anti-bot measures.
- No personal data is collected. contactEmail, if provided, is sent only as a polite-pool/API identifier per source conventions; it is not stored by this Actor.

Roadmap

V2 (after V1 traction):

- MCP Standby mode (expose as a tool for Claude / Cursor agents)
- Token-aware RAG-ready chunking

Contact / feedback

Bug reports, feature requests, and use-case tips welcome — email the developer at leafydevjr@gmail.com.
