🔍 Google Scholar Scraper

Pricing: from $3.99 / 1,000 results

Developer: Scrapium

Last modified: 15 days ago

A fast, production-ready Apify Actor that pulls academic papers from the open scholarly knowledge graphs OpenAlex and Semantic Scholar and delivers clean, structured JSON ready for analysis, citation review, or literature dashboards.

Bulk in. Citations out. Hand it a list of keywords or Google Scholar URLs and walk away; the Actor does the heavy lifting.

🚀 Why Choose This Actor?

🧠 Multi-source intelligence: combines OpenAlex (250 M+ works) and Semantic Scholar for broader coverage than either source alone.
🌐 Smart auto-escalating proxy: starts direct and falls back to Datacenter → Residential only when needed. You don't have to think about it.
⚡ Live streaming results: each paper hits the dataset the moment it's scraped. If a run crashes midway, the rows already pushed are kept.
🧹 Built-in deduplication, filters, and sort: citation-count sorting plus recency, open-access, and article-type filters out of the box.
🪶 Light & fast: no headless browser and no Playwright overhead, just plain HTTP calls.
💸 Pay only for what you use: you are billed per result, not for idle compute time.
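
The auto-escalating proxy behaviour listed above can be pictured roughly as follows. This is a minimal Python sketch under stated assumptions, not the Actor's actual code: the `RateLimited` exception, the `fetch` callable, and the tier labels are hypothetical names chosen for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


class RateLimited(Exception):
    """Hypothetical signal that a host pushed back (for example with HTTP 429)."""


# Illustrative tier labels, tried in order; None stands for a direct connection.
PROXY_TIERS: Sequence[Optional[str]] = (None, "DATACENTER", "RESIDENTIAL")


def fetch_with_escalation(
    fetch: Callable[[str, Optional[str]], dict],
    url: str,
    tiers: Sequence[Optional[str]] = PROXY_TIERS,
) -> Tuple[Optional[str], dict]:
    """Try each tier in order and return (tier_used, response)."""
    last_error: Optional[Exception] = None
    for tier in tiers:
        try:
            return tier, fetch(url, tier)
        except RateLimited as err:
            # Only a rate-limit signal triggers escalation to the next tier.
            last_error = err
    raise RuntimeError(f"all proxy tiers exhausted for {url}") from last_error
```

The caller supplies `fetch`, so the same loop works with any HTTP client; other exceptions propagate unchanged instead of triggering an escalation.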

✨ Key Features

🔎 Bulk search — submit dozens of queries / Scholar URLs at once.
📥 Up to 5,000 papers per query with cursor-based pagination.
🏷️ Rich metadata — title, authors, year, citations, source, PDF link, abstract snippet, etc.
🛡️ Auto-rotating proxies with sticky residential mode after escalation.
📊 Two pre-configured dataset views — Overview (essentials) + Full Details (everything).
📝 Per-query sectioning — every record carries a query field so you can split results by topic in seconds.
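
To make the pagination, deduplication, and per-query tagging features concrete, here is a small Python sketch. It assumes a hypothetical `fetch_page(query, cursor)` callable that returns `(records, next_cursor)`, with `next_cursor` set to `None` on the last page; the Actor's real internals are not documented here and may differ.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# (records, next_cursor); next_cursor is None once the last page is reached.
Page = Tuple[List[Dict], Optional[str]]


def iter_papers(
    fetch_page: Callable[[str, Optional[str]], Page],
    query: str,
    max_items: int = 100,
) -> Iterator[Dict]:
    """Yield unique records one at a time, tagged with query and resultIndex."""
    seen = set()
    emitted = 0
    cursor: Optional[str] = None  # None requests the first page
    while emitted < max_items:
        records, cursor = fetch_page(query, cursor)
        for record in records:
            # Deduplicate on the record ID, falling back to a normalised title.
            key = record.get("cidCode") or record.get("title", "").strip().lower()
            if key in seen:
                continue
            seen.add(key)
            yield {**record, "query": query, "resultIndex": emitted}
            emitted += 1
            if emitted >= max_items:
                return
        if cursor is None or not records:
            return
```

Because this is a generator, each record can be pushed to the dataset as soon as it is yielded, which is what keeps earlier rows available if a later page fails.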

⚙️ Input

Field	Type	Description
`searchQueries` ✱	array of strings	Search keywords or Scholar URLs (e.g. `https://scholar.google.com/scholar?q=...`). Required.
`maxItems`	integer (1 – 5000)	Max papers per query. Default `100`.
`sortBy`	enum	`relevance` (default) | `cited_by_count`
`filter`	enum	`all` (default) | `has_pdf` | `open_access` | `recent_5_years`
`articleType`	enum	`any` (default) | `journal` | `conference` | `book` | `preprint`
`proxyConfiguration`	object	Optional. Defaults to no proxy — the actor will auto-escalate to Datacenter/Residential on rate-limits.

Example input

{
  "searchQueries": [
    "Tomato Shelf Life Prediction using IoT and Machine Learning",
    "Federated learning healthcare"
  ],
  "maxItems": 100,
  "sortBy": "cited_by_count",
  "filter": "open_access",
  "articleType": "journal",
  "proxyConfiguration": { "useApifyProxy": false }
}
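
If you build the input programmatically, it can help to check it against the documented constraints before starting a run. The helper below is a hypothetical client-side sketch (`build_input` is not part of the Actor); the allowed values and the 1 to 5000 range are taken from the input table above.

```python
from typing import Dict, Iterable

SORT_BY = {"relevance", "cited_by_count"}
FILTERS = {"all", "has_pdf", "open_access", "recent_5_years"}
ARTICLE_TYPES = {"any", "journal", "conference", "book", "preprint"}


def build_input(
    search_queries: Iterable[str],
    max_items: int = 100,
    sort_by: str = "relevance",
    result_filter: str = "all",
    article_type: str = "any",
) -> Dict:
    """Return an input payload, raising ValueError on values outside the table."""
    queries = [q.strip() for q in search_queries if q and q.strip()]
    if not queries:
        raise ValueError("searchQueries needs at least one non-empty string")
    if not 1 <= max_items <= 5000:
        raise ValueError("maxItems must be between 1 and 5000")
    if sort_by not in SORT_BY:
        raise ValueError(f"sortBy must be one of {sorted(SORT_BY)}")
    if result_filter not in FILTERS:
        raise ValueError(f"filter must be one of {sorted(FILTERS)}")
    if article_type not in ARTICLE_TYPES:
        raise ValueError(f"articleType must be one of {sorted(ARTICLE_TYPES)}")
    return {
        "searchQueries": queries,
        "maxItems": max_items,
        "sortBy": sort_by,
        "filter": result_filter,
        "articleType": article_type,
    }
```

The defaults mirror the ones in the table, so `build_input(["Federated learning healthcare"])` produces a payload with `maxItems` 100, `relevance` sorting, and no filters.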

📦 Output

Each dataset row matches the well-known Scholar / SerpAPI-style shape:

{
  "query": "Tomato Shelf Life Prediction using IoT and Machine Learning",
  "cidCode": "W4409060190",
  "didCode": "W4409060190",
  "lidCode": "",
  "aidCode": "W4409060190",
  "resultIndex": 0,
  "type": "ARTICLE",
  "title": "Tomato Shelf Life Prediction using IoT and Machine Learning",
  "link": "https://doi.org/10.1109/iciset62123.2024.10939467",
  "documentLink": "",
  "documentType": "",
  "fullAttribution": "Nazmul Arafin Naim, Raisul Islam, Mohammed Saifuddin, ... - , 2024",
  "authors": "Nazmul Arafin Naim, Raisul Islam, Mohammed Saifuddin, ...",
  "publication": "",
  "year": 2024,
  "source": "",
  "searchMatch": "Predicting tomato shelf life is crucial for ...",
  "citations": 1,
  "citationsLink": "https://openalex.org/W4409060190",
  "relatedArticlesLink": "https://openalex.org/W4409060190",
  "versions": 1,
  "versionsLink": "https://openalex.org/W4409060190"
}

Field	Meaning
`query`	Original query that produced this row (lets you group sections).
`cidCode` / `didCode` / `aidCode`	Stable record identifiers (OpenAlex ID or hash).
`resultIndex`	Position within that query's result set.
`title`	Paper title.
`authors`	Up to five lead authors.
`publication` / `source`	Journal / venue name.
`year`	Publication year.
`citations`	Total citation count.
`documentLink` / `documentType`	Direct PDF/OA URL and its document type when available (empty strings otherwise).
`searchMatch`	Abstract snippet (first ~300 chars).
`citationsLink` / `relatedArticlesLink` / `versionsLink`	Clickable links for citations, related articles, and versions (OpenAlex URLs in the example above).
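
Since every row carries a `query` field, splitting an exported dataset into per-topic sections takes only a few lines. The following Python sketch assumes the rows have already been loaded as a list of dictionaries (for example from the JSON export); `group_by_query` is an illustrative helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_query(rows: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Group rows by their query field, most-cited paper first in each group."""
    sections: Dict[str, List[Dict]] = defaultdict(list)
    for row in rows:
        sections[row.get("query", "")].append(row)
    return {
        query: sorted(items, key=lambda r: r.get("citations") or 0, reverse=True)
        for query, items in sections.items()
    }
```

Rows with a missing or null `citations` value are treated as zero so they sort last rather than raising an error.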

🚀 How to Use (Apify Console)

1. Log in at https://console.apify.com → Actors.
2. Open Google Scholar Scraper.
3. Paste your queries (or Scholar URLs) into Search Queries.
4. Tune `maxItems`, `sortBy`, `filter`, and `articleType` to taste.
5. Leave Proxy on its default (no proxy); the Actor auto-escalates on rate limits.
6. Click ▶ Start.
7. Watch the live log; every section reports progress in real time.
8. Open the Output tab and export to JSON / CSV / XLSX.

🤖 Use via API

curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "searchQueries": ["Federated learning healthcare"],
       "maxItems": 50,
       "sortBy": "cited_by_count"
     }'
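
The same call can be made from Python with only the standard library. This is a sketch that mirrors the curl command above; `build_request` and `run_actor` are illustrative helper names, and the actor ID and token remain values you supply yourself.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

API_BASE = "https://api.apify.com/v2/acts"


def build_request(actor_id: str, token: str, actor_input: Dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    url = (
        f"{API_BASE}/{urllib.parse.quote(actor_id, safe='')}"
        f"/run-sync-get-dataset-items?{urllib.parse.urlencode({'token': token})}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(actor_id: str, token: str, actor_input: Dict, timeout: float = 300) -> List[Dict]:
    """Run the Actor synchronously and return the dataset items as a list."""
    request = build_request(actor_id, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (performs a real, billable run, so it is left commented out):
# items = run_actor("<ACTOR_ID>", os.environ["APIFY_TOKEN"],
#                   {"searchQueries": ["Federated learning healthcare"], "maxItems": 50})
```

Keeping request construction separate from the network call makes it easy to inspect the URL and body before anything is sent. The 300-second timeout is an arbitrary choice for this sketch.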

🎯 Best Use Cases

🔬 Literature reviews — pull a full corpus on a research topic in minutes.
📈 Citation tracking — monitor how a paper or author cluster grows over time.
🧪 Trend detection — slice by recent_5_years to spot emerging directions.
📚 Library / EdTech tools — feed clean, normalised records into your platform.
🤖 AI agents — give RAG/LLM pipelines high-quality academic context.

💸 Pricing

This Actor is best deployed under the Pay-per-event (PPE) model:

One event = one paper pushed to the dataset (apify-default-dataset-item).
No surprise compute charges, no rental — you pay for results, not waiting.
Free 5-second startup included by Apify on every run.

Configure the exact event prices in the Apify Console → Publication → Monetization tab.
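
For budgeting, the per-result model makes an upper bound easy to estimate. The Python sketch below assumes the listed starting price of $3.99 per 1,000 results and that every query returns its full `maxItems`; actual event prices may differ, and `estimate_max_cost` is an illustrative helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def estimate_max_cost(num_queries: int, max_items: int, price_per_1000: str = "3.99") -> Decimal:
    """Upper-bound cost in USD when every query returns max_items results."""
    events = num_queries * max_items  # one billed event per paper pushed
    cost = Decimal(price_per_1000) * events / Decimal(1000)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For instance, 2 queries at 100 items each give at most 200 events, and 200 × 3.99 / 1000 = 0.798, which rounds to $0.80. Queries that return fewer papers cost proportionally less.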

❓ Frequently Asked Questions

Q: Do I need a Google Scholar account? No. We connect to OpenAlex + Semantic Scholar — both are open scholarly knowledge graphs.

Q: How fresh is the data? OpenAlex syncs daily with Crossref, DOAJ, PubMed and others. Most papers appear within 24 – 48 h of publication.

Q: Will I get blocked? Unlikely — the actor uses official, rate-limit-friendly APIs and auto-escalates through Datacenter → Residential proxies if a host ever pushes back.

Q: Can I pass full Scholar URLs instead of keywords? Yes. URLs like https://scholar.google.com/scholar?q=... are auto-parsed for the q= term.

Q: Why two views in the output? The Overview view is great for quick scanning. The Full Details view is the complete record — same data, more columns.
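
As an illustration of the Scholar URL question above, extracting the `q=` term can be done with Python's standard URL tools. This sketch shows one plausible approach and is not the Actor's own parser; `normalise_query` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlparse


def normalise_query(entry: str) -> str:
    """Return the search term for a plain keyword string or a Scholar URL."""
    text = entry.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc.endswith("scholar.google.com"):
        # parse_qs decodes percent-escapes and turns '+' into spaces.
        terms = parse_qs(parsed.query).get("q", [])
        if terms and terms[0].strip():
            return terms[0].strip()
        raise ValueError(f"no q= term found in {text!r}")
    return text
```

Plain keyword strings pass through unchanged apart from trimmed whitespace, so a mixed list of keywords and URLs can be normalised in one pass.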

🛟 Support & Feedback

Found a bug or have a feature request? Open an issue or message us through the Apify Store page. We respond fast.

⚖️ Cautions / Legal

Data is collected only from publicly available sources (OpenAlex, Semantic Scholar).
You are responsible for ensuring that your downstream use complies with GDPR/CCPA, the target sites' terms of service, and copyright law.
Respect rate-limits and robots.txt — being a good citizen reduces blocks too.