Google Scholar Scraper: Articles, Citations & PDFs
Pricing
Pay per usage
Extract academic data from Google Scholar: titles, authors, years, citations, abstracts, PDF links. Supports queries, year filters (1900-2100), pagination (up to 5 pages). Rate-limited for safety. Ideal for research, citations, datasets, AI. Clean JSON output. Run on Apify with proxies.
Developer: PrimeParse
Google Scholar Scraper: Academic Research Data Extractor
Enterprise-grade Google Scholar scraper for academic research and data analysis. Collects structured data from Google Scholar search results including titles, authors, citations, abstracts, and PDF links. Ideal for literature reviews, citation analysis, and academic dataset building. Features intelligent parsing, rate limiting, and year filtering.
High-quality Google Scholar Data Extractor for Researchers, Academics, and Data Scientists
Automatically searches Google Scholar, extracts article metadata, filters by publication year, and collects citation data: clean, structured, and ready for analysis or academic research.
Built for:
- Academic researchers conducting literature reviews
- Data scientists building research datasets
- PhD students tracking citations and publications
- Librarians organizing academic resources
- Research teams monitoring publication trends
- AI/ML engineers collecting training data from academic sources
- Smart search with keyword queries
- Year range filtering (1900-2100)
- Rich metadata extraction (title, authors, year, citations, abstract, PDF links)
- Automatic pagination support (up to 5 pages)
- Rate limiting & respectful crawling
- AI-ready structured output
Runs on Apify • No code required
Why This Scraper
Purpose-Built for Academic Research
Extracts structured data from Google Scholar search results for literature reviews, citation analysis, and academic research.
Comprehensive Metadata Extraction
Extracts all essential academic metadata: article titles, author lists, publication years, citation counts, abstracts, PDF links, and Google Scholar page URLs.
Clean & Structured Output
Produces clean, structured JSON output ready for analysis, database import, or further processing. Perfect for academic datasets and research workflows.
Smart Year Filtering
Filter results by publication year range to focus on recent research or historical publications. Supports years from 1900 to 2100.
AI & ML Ready
Structured JSON output perfect for RAG systems, LLM fine-tuning, academic knowledge bases, or training datasets for research applications.
Fast & Efficient
Powered by Puppeteer for reliable browser automation. Handles dynamic content and JavaScript-rendered pages efficiently.
Safe & Controlled Processing
Built-in rate limiting (1-2 second delays), configurable pagination limits, and graceful error handling to respect Google Scholar's infrastructure.
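The rate limiting described above can be pictured with a short Python sketch. This is an illustration only, not the Actor's actual code (which runs on Puppeteer): the function name `fetch_sequentially` and its parameters are hypothetical, and the only details taken from this page are the 1-2 second delay, one request at a time, and continuing past errors.

```python
import random
import time
from typing import Callable, Iterable, List


def fetch_sequentially(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch URLs one at a time, pausing a random 1-2 s between requests."""
    results: List[str] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        try:
            results.append(fetch(url))
        except Exception as exc:
            # Graceful error handling: report the failure and keep going.
            print(f"Skipping {url}: {exc}")
    return results
```

Passing `sleep` as a parameter keeps the sketch easy to test: a caller can record the requested delays instead of actually waiting.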
Use Cases
- Literature reviews: Collect and analyze academic papers for systematic reviews
- Citation tracking: Monitor citation counts and track research impact
- Publication monitoring: Track new publications in specific research areas
- Dataset building: Create structured datasets for academic research or AI training
- Competitive research: Monitor competitor publications and research trends
- Academic analysis: Analyze publication patterns, author networks, and citation trends
- PDF collection: Automatically collect PDF links for offline research
Supported Data
- Article titles: Full publication titles
- Authors: Complete author lists (up to 10 authors per article)
- Publication years: Extracted from metadata
- Citation counts: Number of citations for each article
- Abstracts: Article abstracts when available
- PDF links: Direct links to PDF files when available
- Article links: Links to each article's page as listed in the Google Scholar result (often the publisher's site)
How It Works
- Enter your search query (e.g., "machine learning", "quantum computing")
- Optionally set year range filters and pagination limits
- Configure proxy settings for reliable access
- Run the Actor
- Download clean, structured academic datasets
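For readers who prefer to script these steps, here is a minimal sketch using the `apify-client` Python package. Several things are assumptions rather than facts from this page: the `ACTOR_ID` value is a placeholder you must replace with the Actor's real ID from the Apify Console, the token is read from an `APIFY_TOKEN` environment variable, and the helper names are hypothetical.

```python
import os
from typing import Any, Dict, List, Optional

# Placeholder: replace with the Actor's real ID from the Apify Console.
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(
    query: str,
    max_pages: int = 1,
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
) -> Dict[str, Any]:
    """Assemble the input object using the field names documented on this page."""
    run_input: Dict[str, Any] = {
        "query": query,
        "maxPages": max_pages,
        "proxyConfiguration": {"useApifyProxy": True},
    }
    if start_year is not None:
        run_input["startYear"] = start_year
    if end_year is not None:
        run_input["endYear"] = end_year
    return run_input


def run_scraper(run_input: Dict[str, Any], actor_id: str = ACTOR_ID) -> List[Dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return the dataset items."""
    # Imported lazily so the input helper works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call such as `run_scraper(build_run_input("machine learning", start_year=2020))` would then return the scraped records as a list of dictionaries.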
Input Configuration
Example JSON Input
{"query": "machine learning","maxPages": 1,"startYear": 2020,"endYear": 2026,"proxyConfiguration": {"useApifyProxy": true}}
Key Options
- query: Search query string (required, e.g., "machine learning", "neural networks")
- maxPages: Maximum number of result pages to scrape (default: 1, recommended: 1-5)
- startYear: Filter results by minimum publication year (optional, 1900-2100)
- endYear: Filter results by maximum publication year (optional, 1900-2100)
- proxyConfiguration: Proxy settings for anti-bot protection (default: uses Apify Proxy)
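The constraints in the options above can be checked before a run is started. The following is a sketch of a client-side check, not the Actor's own validation: `validate_input` is a hypothetical helper, and the rule that `startYear` must not exceed `endYear` is an assumption that this page does not state.

```python
from typing import Any, Dict

MIN_YEAR, MAX_YEAR = 1900, 2100  # year range documented for startYear/endYear


def validate_input(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a cleaned copy of the input or raise ValueError."""
    query = raw.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query is required and must be a non-empty string")

    max_pages = raw.get("maxPages", 1)  # documented default: 1
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("maxPages must be a positive integer")

    cleaned: Dict[str, Any] = {"query": query.strip(), "maxPages": max_pages}

    for key in ("startYear", "endYear"):
        year = raw.get(key)
        if year is None:
            continue  # both year filters are optional
        if isinstance(year, bool) or not isinstance(year, int) or not MIN_YEAR <= year <= MAX_YEAR:
            raise ValueError(f"{key} must be an integer between {MIN_YEAR} and {MAX_YEAR}")
        cleaned[key] = year

    # Assumption: an inverted range is treated as an error.
    if "startYear" in cleaned and "endYear" in cleaned and cleaned["startYear"] > cleaned["endYear"]:
        raise ValueError("startYear must not be later than endYear")

    if "proxyConfiguration" in raw:
        cleaned["proxyConfiguration"] = raw["proxyConfiguration"]
    return cleaned
```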
Search Query Tips
- Use specific terms for better results (e.g., "deep learning neural networks" instead of "AI")
- Combine keywords with quotes for exact phrases: "transfer learning"
- Use Boolean operators: machine learning AND computer vision
- Filter by author: author:"John Smith" machine learning
- Filter by publication: source:"Nature" quantum computing
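The query patterns in these tips can be assembled programmatically. The helper below is a hypothetical convenience function, not part of the Actor; it only concatenates the operators shown above into the string passed as `query`.

```python
from typing import Iterable, Optional


def build_query(
    keywords: str = "",
    exact_phrases: Iterable[str] = (),
    author: Optional[str] = None,
    source: Optional[str] = None,
) -> str:
    """Combine keywords, quoted phrases, and author/source filters into one query."""
    parts = []
    if author:
        parts.append(f'author:"{author}"')
    if source:
        parts.append(f'source:"{source}"')
    # Wrap each exact phrase in double quotes.
    parts.extend(f'"{phrase}"' for phrase in exact_phrases)
    if keywords:
        parts.append(keywords)
    return " ".join(parts)
```

For example, `build_query("machine learning", author="John Smith")` reproduces the author-filter query from the tips.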
Output Dataset
All articles are stored in the default Apify dataset with the following structure:
Example Output Record
{"title": "Machine learning","authors": ["ZH Zhou"],"year": 2021,"citations": 3301,"abstract": "… from data is called learning or training. The … machine learning is to find or approximate ground-truth. In this book, models are sometimes called learners, which are machine learning …","pdfLink": null,"scholarLink": "https://books.google.com/books?hl=en&lr=&id=ctM-EAAAQBAJ&oi=fnd&pg=PR6&dq=machine+learning&ots=o_OnT7Rv3p&sig=bH9TGnw_ZdZYH4lSLmKun7xX6Cs"}
Output Fields
- title (string, required): Article title
- authors (array, required): List of author names (up to 10 authors)
- year (integer|null): Publication year
- citations (integer|null): Number of citations
- abstract (string|null): Article abstract when available
- pdfLink (string|null): Direct link to PDF file when available
- scholarLink (string, required): Link to the article's page as listed in the Google Scholar result (often the publisher's site, as in the example records)
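When loading records into your own code, the field list above maps naturally onto a typed structure. This is a sketch under the assumption that every record follows the documented schema; `ScholarRecord` and `parse_record` are hypothetical names, not something the Actor provides.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ScholarRecord:
    title: str
    authors: List[str]
    scholar_link: str
    year: Optional[int] = None
    citations: Optional[int] = None
    abstract: Optional[str] = None
    pdf_link: Optional[str] = None


def parse_record(item: Dict[str, Any]) -> ScholarRecord:
    """Convert one dataset item into a ScholarRecord, checking required fields."""
    for key in ("title", "authors", "scholarLink"):
        if item.get(key) is None:
            raise ValueError(f"missing required field: {key}")
    return ScholarRecord(
        title=item["title"],
        authors=list(item["authors"]),
        scholar_link=item["scholarLink"],
        # Nullable fields fall back to None when absent.
        year=item.get("year"),
        citations=item.get("citations"),
        abstract=item.get("abstract"),
        pdf_link=item.get("pdfLink"),
    )
```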
Multiple Authors Example
{"title": "A guide to machine learning for biologists","authors": ["JG Greener","SM Kandathil"],"year": 2022,"citations": 2020,"abstract": "… A machine learning task is an objective specification for what we want a machine learning model to accomplish…","pdfLink": "https://discovery.ucl.ac.uk/id/eprint/10134478/1/NRMCB-review-accepted-forRPS.pdf","scholarLink": "https://www.nature.com/articles/s41580-021-00407-0"}
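As an illustration of the citation-tracking and PDF-collection use cases, the sketch below post-processes records shaped like the two examples on this page. The helper names are hypothetical, and the sample data is trimmed to the fields the helpers use.

```python
from typing import Any, Dict, List


def rank_by_citations(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort records by citation count, highest first; null counts sort as 0."""
    return sorted(records, key=lambda r: r.get("citations") or 0, reverse=True)


def pdf_links(records: List[Dict[str, Any]]) -> List[str]:
    """Collect the PDF links of records that have one."""
    return [r["pdfLink"] for r in records if r.get("pdfLink")]


# The two example records from this page, trimmed for brevity.
sample = [
    {
        "title": "A guide to machine learning for biologists",
        "citations": 2020,
        "pdfLink": "https://discovery.ucl.ac.uk/id/eprint/10134478/1/NRMCB-review-accepted-forRPS.pdf",
    },
    {"title": "Machine learning", "citations": 3301, "pdfLink": None},
]

for record in rank_by_citations(sample):
    print(record["citations"], record["title"])
```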
Input File Example
For local runs, create storage/key_value_stores/default/INPUT.json:
{"query": "quantum computing","maxPages": 2,"startYear": 2020,"endYear": 2024}
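If you generate inputs from a script, the file can be written with a few lines of Python. This is a convenience sketch: `write_local_input` is a hypothetical helper, and it assumes the `storage` directory should sit under the project root you pass in.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_local_input(run_input: Dict[str, Any], project_root: str = ".") -> Path:
    """Write run_input to storage/key_value_stores/default/INPUT.json."""
    path = Path(project_root) / "storage" / "key_value_stores" / "default" / "INPUT.json"
    # Create the intermediate directories if they do not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(run_input, indent=2), encoding="utf-8")
    return path
```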
Performance
- Processing Speed: ~3-4 seconds per page (depending on results)
- Rate Limiting: Built-in 1-2 second delays between requests
- Concurrency: Single request at a time for reliability
- Scalability: Handles 1-5 pages optimally (up to 50 articles per run)
- Success Rate: High reliability with proper proxy configuration
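These figures allow a rough estimate of run size and duration. The sketch below assumes 10 results per page (implied by 5 pages yielding up to 50 articles), that the delay applies only between pages, and that the per-page time excludes the delay; it ignores browser start-up and proxy overhead, so treat the output as a ballpark.

```python
from typing import Dict, Tuple


def estimate_run(
    max_pages: int,
    results_per_page: int = 10,
    page_seconds: Tuple[float, float] = (3.0, 4.0),
    delay_seconds: Tuple[float, float] = (1.0, 2.0),
) -> Dict[str, float]:
    """Estimate the article ceiling and the scraping time range for a run."""
    delays = max(max_pages - 1, 0)  # one delay between each pair of pages
    return {
        "max_articles": max_pages * results_per_page,
        "min_seconds": max_pages * page_seconds[0] + delays * delay_seconds[0],
        "max_seconds": max_pages * page_seconds[1] + delays * delay_seconds[1],
    }


print(estimate_run(5))
# {'max_articles': 50, 'min_seconds': 19.0, 'max_seconds': 28.0}
```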
Advanced Configuration
Year Range Filtering
Filter results by publication year:
{"query": "artificial intelligence","startYear": 2020,"endYear": 2026}
Multiple Pages
Scrape multiple pages for comprehensive results:
{"query": "deep learning","maxPages": 5}
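To make the year and page options more concrete, here is how such settings are commonly expressed in Google Scholar search URLs. This page does not document the Actor's internals, so this is an assumption-laden sketch: the parameter names `as_ylo`, `as_yhi`, and `start` are the ones visible in Scholar's own address bar, the 10-results-per-page step is assumed, and `scholar_search_urls` is a hypothetical helper.

```python
from typing import List, Optional
from urllib.parse import urlencode


def scholar_search_urls(
    query: str,
    max_pages: int = 1,
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
    results_per_page: int = 10,
) -> List[str]:
    """Build one search URL per result page, with optional year bounds."""
    urls = []
    for page in range(max_pages):
        # "start" is the zero-based offset of the first result on the page.
        params = {"q": query, "hl": "en", "start": page * results_per_page}
        if start_year is not None:
            params["as_ylo"] = start_year  # earliest publication year
        if end_year is not None:
            params["as_yhi"] = end_year  # latest publication year
        urls.append("https://scholar.google.com/scholar?" + urlencode(params))
    return urls
```

With `maxPages` set to 5, a scraper following this scheme would request offsets 0, 10, 20, 30, and 40.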
Proxy Configuration
Use Apify Proxy for reliable access:
{"query": "neural networks","proxyConfiguration": {"useApifyProxy": true}}
Support
- Issues: Use the Apify Issues tab for bug reports
- Documentation: Check the Apify documentation for platform features
- Community: Join the Apify community for discussions
Tags: Google Scholar, academic research, literature review, citation analysis, research data, paper scraping, academic scraping, research automation, citation tracking, publication monitoring, academic dataset, research tools, scholarly articles, PDF extraction
Built with ❤️ on Apify