Google News Bulk Scraper
Google News → publisher URLs → clean article text + images + metadata, with JS rendering, paywall, and consent-page fallbacks. HTTP-first, Playwright only when needed.
Scrape one query or thousands in a single run. Each article lands as its own dataset row with the full text, images, author, source, language, and a quality score — ready for NLP pipelines, media monitoring, or research datasets.
What You Get
Each article in the output includes:
- title — headline as published
- url — canonical publisher URL (not the Google News redirect)
- source — publisher name (e.g. "Reuters", "TechCrunch")
- publishedAt — ISO 8601 timestamp
- author — byline when available
- text — clean full-text content (300+ characters, validated)
- images — OG image, featured image, and in-article images with alt text
- language — detected content language
- extractionSuccess — boolean flag for downstream filtering
- contentQuality — score (0-100), level (low/medium/high), and warnings
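The `extractionSuccess` and `contentQuality` fields exist so you can gate results before they enter a pipeline. A minimal post-processing sketch, assuming the items were downloaded with `apify-client` as in the API example further down (the dataset ID placeholder and the quality threshold are illustrative):

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const { items } = await client.dataset('YOUR_DATASET_ID').listItems();

// Keep only articles that extracted successfully and scored
// medium or high on content quality (threshold is illustrative).
const usable = items.filter(
  (a) => a.extractionSuccess && a.contentQuality?.level !== 'low'
);
console.log(`${usable.length}/${items.length} articles passed the quality gate`);
```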
Set `fetchArticleDetails: false` to skip crawling and get RSS metadata only (title, source, date, link) at minimal cost.
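For example, a metadata-only run over a single query could look like this (all parameters are documented below; values are illustrative):

```json
{
  "query": "Tesla",
  "maxItemsPerUrl": 100,
  "fetchArticleDetails": false
}
```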
Quick Start
Using Apify Console
- Visit Apify Console
- Search for "Google News Scraper"
- Configure your search parameters
- Run the actor
Using Apify CLI
```bash
npm install -g apify-cli

# Single query
apify call google-news-scraper --input '{
  "query": "Tesla",
  "maxItemsPerUrl": 10
}'

# Multiple queries (string shorthand)
apify call google-news-scraper --input '{
  "queries": ["tesla", "apple"],
  "maxItemsPerUrl": 10
}'

# Multiple queries with passthrough fields
apify call google-news-scraper --input '{
  "queries": [
    { "query": "Kim Kardashian", "profileUrl": "https://news.google.com/search?q=kim+kardashian" },
    { "query": "MrBeast" }
  ],
  "maxItemsPerUrl": 10,
  "maxItems": 15
}'
```
Using Apify API
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('google-news-scraper').call({
  queries: [
    { query: 'Taylor Swift', profileUrl: 'https://news.google.com/search?q=taylor+swift' },
    { query: 'Elon Musk', profileUrl: 'https://news.google.com/search?q=elon+musk' },
  ],
  maxItemsPerUrl: 10,
  maxItems: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();

// items is a flat array of articles, each with query + passthrough fields merged in
console.log(items);
```
Input Modes
Most Common: Single Query
{"query": "artificial intelligence","maxItemsPerUrl": 10}
That's it — one query, up to 10 articles.
Bulk: Multiple Queries
Pass an array of strings to scrape several topics in one run:
{"queries": ["tesla", "apple", "nvidia"],"maxItemsPerUrl": 10}
Advanced: Queries with Passthrough Fields
Each query can be an object. Any field besides query is passed through to every output article for that query — useful for linking results back to your own IDs, profile URLs, or tags:
{"queries": [{ "query": "Kim Kardashian", "profileUrl": "https://news.google.com/search?q=kim+kardashian" },{ "query": "MrBeast", "customField": "my-tag" },"Taylor Swift"],"maxItemsPerUrl": 10,"maxItems": 25}
Precedence: `queries` > `query`. If both are provided, `queries` wins.
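Because passthrough fields are merged into every output article, downstream grouping needs no extra bookkeeping. A hypothetical sketch (the sample rows stand in for real dataset items, which in practice come from the API example above):

```javascript
// Sample rows shaped like the actor's output.
const items = [
  { query: 'Kim Kardashian', profileUrl: 'https://news.google.com/search?q=kim+kardashian', title: '...' },
  { query: 'MrBeast', customField: 'my-tag', title: '...' },
];

// Group articles by the passthrough tag, falling back to the query.
const grouped = {};
for (const article of items) {
  const key = article.customField ?? article.query;
  (grouped[key] ??= []).push(article);
}
console.log(Object.keys(grouped)); // ['Kim Kardashian', 'my-tag']
```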
Configuration
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | No* | - | Simple search query string |
| `queries` | array | No* | - | Array of strings or objects with `query` and optional passthrough fields |
| `maxItemsPerUrl` | integer | No | 50 | Max articles per individual query |
| `maxItems` | integer | No | 0 | Optional global cap on total articles (0 = unlimited) |
| `fetchArticleDetails` | boolean | No | true | If false, skip article crawling and return RSS metadata only |
| `region` | string | No | "US" | Country code (US, GB, CA, AU, DE, ES, MX, IT) |
| `language` | string | No | "en-US" | Language code (en-US, en-GB, en-CA, en-AU, de-DE, es-ES, es-MX, it-IT) |
| `dateFrom` | string | No | - | Start date (YYYY-MM-DD) |
| `dateTo` | string | No | - | End date (YYYY-MM-DD) |
| `disableBrowserFallback` | boolean | No | false | Skip Playwright fallback — cheaper but may return fewer articles |
| `proxyConfiguration` | object | No | Apify Proxy enabled | Proxy settings; defaults to Apify Proxy |

*At least one of `query` or `queries` is required.
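Putting several of these together, a region- and date-scoped input might look like this (all parameters come from the table above; values are illustrative):

```json
{
  "query": "climate policy",
  "maxItemsPerUrl": 25,
  "region": "DE",
  "language": "de-DE",
  "dateFrom": "2025-07-01",
  "dateTo": "2025-07-31"
}
```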
How Extraction Works
The pipeline resolves every Google News redirect to the real publisher URL, then fetches and extracts content in tiers, stopping at the first strategy that produces 300+ characters of text with images:
- HTTP fetch — fast, cheap, works for most publishers
- Playwright browser — automatic fallback for JS-rendered or consent-gated pages
- Six extraction strategies tried in order on the fetched HTML: Readability, Extractus, JSON-LD, custom selectors, meta tags, and heuristics
Every article is quality-scored (text length, image presence, error-page detection). Low-quality results are filtered before they reach your dataset.
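Conceptually, the strategy cascade behaves like the sketch below. This illustrates the stop-at-first-valid logic only; it is not the actor's actual source, and the helper signature is hypothetical.

```javascript
const MIN_TEXT_LENGTH = 300; // validity threshold described above

// Try each extraction strategy in order and return the first result
// that clears the text-length threshold; return null if all fail.
async function extractFirstValid(html, strategies) {
  for (const strategy of strategies) {
    try {
      const result = await strategy(html);
      if (result?.text && result.text.length >= MIN_TEXT_LENGTH) {
        return result; // first sufficient result wins
      }
    } catch {
      // a failing strategy simply falls through to the next one
    }
  }
  return null; // caller marks the article extractionSuccess: false
}
```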
Estimated Cost
All costs depend on article count, target sites, and proxy tier. The numbers below are rough guidelines based on typical runs using Apify Proxy (datacenter tier).
| Scenario | Articles | Typical Cost |
|---|---|---|
| RSS metadata only (`fetchArticleDetails: false`) | 100 | ~$0.01 – $0.02 |
| Full text, HTTP-first (most sites) | 100 | ~$0.05 – $0.10 |
| Full text, mixed HTTP + Playwright fallback | 100 | ~$0.10 – $0.25 |
| Heavy JS sites (frequent Playwright) | 100 | ~$0.20 – $0.50 |
Cost levers you control:
- `fetchArticleDetails: false` — skip article crawling entirely for near-zero cost
- `disableBrowserFallback: true` — stay HTTP-only, ~2-5x cheaper, fewer articles from JS-heavy sites
- `maxItemsPerUrl` / `maxItems` — hard caps on article count
- Proxy tier — datacenter is default and cheapest; residential auto-escalates only on repeated 429/403 errors
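For example, a cost-optimized full-text run that stays HTTP-only and hard-caps output might use (values are illustrative):

```json
{
  "queries": ["tesla", "apple"],
  "maxItemsPerUrl": 20,
  "maxItems": 30,
  "disableBrowserFallback": true
}
```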
Limitations
Be aware of these before you buy:
- Paywalled sites — articles behind hard paywalls (WSJ, FT, NYT subscriber-only) will return partial text or fail. The scraper extracts whatever is publicly visible.
- Heavy bot protection — sites with aggressive Cloudflare challenges or CAPTCHAs may need multiple retries and residential proxies, increasing cost.
- Region/language variance — Google News returns different articles depending on `region` and `language`. The same query may yield different results for US vs. DE.
- RSS feed limits — Google News RSS feeds return a limited window of articles (roughly 24-72 hours). For historical coverage, use `dateFrom`/`dateTo` date slicing, which the scraper handles automatically.
- Image availability — some publishers strip images or serve them via CDN policies that block external access. Articles without valid images receive a lower quality score.
Output Format
Output is a flat array of articles. Each article is a separate dataset entry with the query string and any passthrough fields merged at the top level:
[{"query": "Taylor Swift","profileUrl": "https://news.google.com/search?q=taylor+swift","title": "Taylor Swift Announces New Album - Billboard","url": "https://www.billboard.com/2025/08/05/taylor-swift-new-album.html","source": "Billboard","publishedAt": "2025-08-05T14:08:57.000Z","author": "Jane Smith","text": "Full article content...","description": "Brief summary of the article...","images": [{"url": "https://example.com/image.jpg","type": "featured-og","alt": "Image description"}],"tags": ["Taylor Swift"],"language": "en","extractionSuccess": true,"contentQuality": {"score": 85,"level": "high","isValid": true,"warnings": []}},{"query": "MrBeast","customField": "test-passthrough","title": "MrBeast Breaks YouTube Record","url": "https://www.example.com/mrbeast-record.html","source": "Example News","publishedAt": "2025-08-05T10:00:00.000Z","text": "Full article content...","..."}]
Development
Setup
```bash
git clone https://github.com/YevheniiM/google-news-scrapper
cd google-news-scrapper
npm install
```
Running
```bash
# Production
npm start

# Development mode (DEBUG=true, NODE_ENV=development)
npm run dev

# Development with file watching
npm run dev:watch
```
Create an INPUT.json at the project root for local input:
{"queries": [{ "query": "Taylor Swift" }, { "query": "Elon Musk" }],"maxItemsPerUrl": 5}
Testing
```bash
# Run all tests
npm test

# Watch mode
npm run test:watch

# With coverage
npm run test:coverage
```
Formatting
```bash
npm run format
npm run format:check
```
License
MIT -- see LICENSE for details.
Acknowledgments
- Apify SDK and Crawlee for the scraping framework
- @mozilla/readability and @extractus/article-extractor for content extraction
- fast-xml-parser for RSS parsing