Google News Scraper
Pricing
$3.00 / 1,000 results
Stop settling for headline-only Google News scrapers that leave you with bare titles and broken redirect URLs. This is the fastest, most data-rich Google News scraper on Apify — purpose-built for PR teams, finance desks, NLP pipelines, and brand monitoring at scale.
Developer
VortexData
📰 Google News Scraper
Production-grade Google News scraper. Browser-grade TLS fingerprinting, Apify Residential US proxies, three independent retry budgets, HD thumbnails — all without a single line of headless-browser code.
Pull clean, structured news data from Google News by keywords, predefined topics, or any topic/section URL you paste from your browser. Works for any country and language. Scales from a 5-article spot check to 50,000-article archives.
⚡ Why This Scraper
🧬 Real browser fingerprint, no headless browser
Built on curl_cffi with Chrome TLS impersonation. Google sees a real Chrome handshake (JA3, ALPN, HTTP/2 frames) without paying the cost of Selenium, Playwright, or a JS engine. Result: roughly 10× lower latency and CPU than a headless setup, with the same survival rate against bot defenses.
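A minimal sketch of what a Chrome-impersonating request looks like with curl_cffi. The function name, its parameters, and the optional proxy handling are illustrative assumptions, not the Actor's actual code; only `requests.get(..., impersonate="chrome")` is the library's own interface.

```python
from typing import Optional


def fetch_with_chrome_fingerprint(
    url: str,
    proxy_url: Optional[str] = None,
    timeout_s: float = 15.0,
) -> str:
    """Fetch `url` with a Chrome-like TLS handshake and return the body text.

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    # Imported lazily so this module loads even where curl_cffi is not installed.
    from curl_cffi import requests

    # Route both schemes through the same proxy when one is given.
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
    response = requests.get(
        url,
        impersonate="chrome",  # ask curl_cffi to mimic a recent Chrome handshake
        proxies=proxies,
        timeout=timeout_s,
    )
    response.raise_for_status()
    return response.text
```

No browser process is started here: the handshake is shaped at the HTTP client level, which is where the latency and CPU savings described above would come from.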
🌎 Residential US IPs that rotate per request
Every HTTP call gets a fresh exit IP from Apify's residential pool. No session pinning, no shared rate limits between requests. You can ramp parallelism without tripping Google's 429 wall.
🛡️ Three independent retry budgets
The scraper runs three separate I/O stages — RSS fetch · URL decode · publisher page fetch — and each has its own retry counter, exponential backoff with jitter, and 15-second hard timeout. A single flaky publisher can't drain the budget you need to keep ingesting RSS feeds.
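The idea of per-stage budgets can be sketched as follows. This is an illustration under assumptions, not the Actor's implementation: `with_retry_budget`, `base_delay_s`, and the full-jitter formula are hypothetical choices, while the budget defaults (4, 3, 2) and the 15-second timeout come from this README.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")

# Default budgets from the Input Reference (retryBudgetRss / Decode / Article).
RETRY_BUDGETS = {"rss": 4, "decode": 3, "article": 2}


async def with_retry_budget(
    operation: Callable[[], Awaitable[T]],
    budget: int,
    timeout_s: float = 15.0,
    base_delay_s: float = 0.5,
) -> T:
    """Run `operation` at most `budget` times, each attempt under a hard timeout."""
    last_error: Optional[Exception] = None
    for attempt in range(budget):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout_s)
        except Exception as error:  # timeouts and transport errors alike
            last_error = error
            if attempt + 1 < budget:
                # Exponential backoff with full jitter: sleep in [0, base * 2^attempt].
                delay = random.uniform(0, base_delay_s * 2 ** attempt)
                await asyncio.sleep(delay)
    raise RuntimeError(f"retry budget of {budget} exhausted") from last_error
```

Because every call receives its own `budget`, a publisher fetch that uses up `RETRY_BUDGETS["article"]` leaves the counters for RSS and decode calls untouched.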
🖼️ HD images without visiting publisher pages
Most scrapers that want article thumbnails have to follow every link to the publisher and parse <meta og:image>. This one fetches Google News' HTML index in parallel with the RSS feed and pulls the HD thumbnail straight from news.google.com/api/attachments/.... Free images, zero extra requests per article.
🔓 Smart URL decoder
Optional decodeUrls resolves news.google.com/rss/articles/CBMi... redirects to the actual publisher URL via Google's Fbv4je batchexecute endpoint — so you can store, deduplicate, and re-crawl real source URLs.
📊 50,000-article archives, day-by-day
Google News RSS caps each feed at ~100 results. When you ask for maxItems > 100, the scraper automatically splits the date range one day at a time and parallelizes across days — no manual chunking, no coordination on your side.
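Day-by-day splitting can be pictured with a small generator. This is a sketch only: `daily_windows` is a hypothetical name, and how the Actor turns each window into a feed request is not documented here.

```python
from datetime import date, datetime, timedelta
from typing import Iterator, Tuple


def daily_windows(start: str, end: str) -> Iterator[Tuple[date, date]]:
    """Yield (day, next_day) pairs covering start..end inclusive.

    Dates use the MM/DD/YYYY format of time_period_min / time_period_max.
    """
    first = datetime.strptime(start, "%m/%d/%Y").date()
    last = datetime.strptime(end, "%m/%d/%Y").date()
    day = first
    while day <= last:
        yield day, day + timedelta(days=1)
        day += timedelta(days=1)


windows = list(daily_windows("01/01/2025", "12/31/2025"))
print(len(windows))  # one window per calendar day in the range
```

Each window is independent of the others, which is what makes it possible to fetch the days in parallel.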
🚀 Quick Start
Drop this into the Actor input:
{"keywords": ["Elon Musk"],"maxArticles": 10,"timeframe": "1d","region_language": "US:en","extractImages": true}
You get back HD-illustrated, ISO-timestamped, source-attributed news in seconds.
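The same input can also be submitted programmatically. Below is a hedged sketch using the official `apify-client` package; `run_actor` is a hypothetical helper, and the Actor ID is deliberately left as a parameter because this page does not state it.

```python
from typing import Any, Dict, List

QUICK_START_INPUT: Dict[str, Any] = {
    "keywords": ["Elon Musk"],
    "maxArticles": 10,
    "timeframe": "1d",
    "region_language": "US:en",
    "extractImages": True,
}


def run_actor(token: str, actor_id: str, run_input: Dict[str, Any]) -> List[dict]:
    """Start the Actor, wait for it to finish, and return the dataset items.

    `actor_id` is whatever ID the Apify Console shows for this Actor.
    """
    # Imported lazily so the module loads without apify-client installed.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

A call would look like `run_actor(my_token, my_actor_id, QUICK_START_INPUT)`, where both arguments are values you supply from your own Apify account.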
🧭 Two Ways to Search (Mix and Match)
🔍 Mode 1 — Keyword search
Full Google search operator support (site:, intitle:, inurl:, OR, AND, -, "…").
{"keywords": ["bitcoin", "ethereum -dogecoin", "intitle:\"AI\" site:bbc.com"],"timeframe": "1h","maxArticles": 50}
Perfect for brand monitoring · competitor tracking · trend detection · SEO research.
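For orientation, this is roughly how a keyword with operators maps onto Google News' public RSS search endpoint. It is a sketch under assumptions: `build_search_rss_url` is a hypothetical helper, and the exact parameters the Actor sends (for example how it encodes `hl`) are not documented here.

```python
from urllib.parse import urlencode


def build_search_rss_url(keyword: str, region_language: str = "US:en") -> str:
    """Build a Google News RSS search URL for one keyword query.

    Operators such as site:, intitle:, OR, - and quotes stay inside the
    query string; urlencode takes care of escaping them.
    """
    region, _, language = region_language.partition(":")
    params = {
        "q": keyword,
        "hl": language,                  # interface language (assumed mapping)
        "gl": region,                    # region
        "ceid": f"{region}:{language}",  # combined edition id
    }
    return "https://news.google.com/rss/search?" + urlencode(params)


print(build_search_rss_url('intitle:"AI" site:bbc.com'))
```

Keeping the operators inside a single `q` value means each entry of `keywords` stays one independent query.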
📰 Mode 2 — Topics & sections
Predefined Google News topics: WORLD · NATION · BUSINESS · TECHNOLOGY · ENTERTAINMENT · SPORTS · SCIENCE · HEALTH
{"topics": ["TECHNOLOGY", "BUSINESS"],"maxArticles": 100,"region_language": "FR:fr"}
Custom topic/section URLs — browse Google News, navigate to anything (e.g. Sports → F1, Tech → Artificial Intelligence), and paste the URL straight from your address bar:
{"topicUrls": ["https://news.google.com/topics/CAAq.../sections/CAQi...?hl=en-US&gl=US&ceid=US:en"]}
You can combine all three source types (keywords, topics, and topicUrls) in a single run.
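A pasted topic URL carries a topic ID and, optionally, a section ID in its path. The sketch below shows one way to pull them out; `parse_topic_ref` is a hypothetical helper and the IDs in the example are made-up placeholders, not real Google News identifiers.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlparse

# Matches "/topics/<TOPIC>[/sections/<SECTION>]" as well as the bare
# "<TOPIC>[/sections/<SECTION>]" form.
_TOPIC_PATH = re.compile(
    r"^/?(?:topics/)?(?P<topic>[^/]+)(?:/sections/(?P<section>[^/]+))?/?$"
)


def parse_topic_ref(value: str) -> Tuple[str, Optional[str]]:
    """Return (topic_id, section_id or None) for a topic URL or bare ID."""
    path = urlparse(value).path if "://" in value else value
    match = _TOPIC_PATH.match(path)
    if match is None:
        raise ValueError(f"not a Google News topic reference: {value!r}")
    return match.group("topic"), match.group("section")


# Placeholder IDs for illustration only.
print(parse_topic_ref(
    "https://news.google.com/topics/TOPIC123/sections/SECTION456?hl=en-US&gl=US&ceid=US:en"
))
```

The query string (`hl`, `gl`, `ceid`) is ignored here because `urlparse(...).path` excludes it; locale handling is a separate concern.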
📦 Output Schema
Every article is one clean JSON record:
{
  "title": "Elon Musk tells his side of OpenAI's beginnings - PBS",
  "source": "PBS",
  "sourceUrl": "https://www.pbs.org",
  "url": "https://www.pbs.org/newshour/show/elon-musk-...",
  "rssLink": "https://news.google.com/rss/articles/CBMi...",
  "guid": "CBMi...",
  "articleId": "CBMi...",
  "publishedAt": "2026-04-30T00:46:35.000Z",
  "publishedTimestamp": 1777509995000,
  "image": "https://news.google.com/api/attachments/CC8i...-w400-h224-p-df-rw",
  "description": "Tesla and SpaceX CEO Elon Musk took the witness stand...",
  "loadedUrl": "https://www.pbs.org/newshour/show/elon-musk-...",
  "metadata": {
    "scrapeTimestamp": "2026-04-30T13:57:26.461Z",
    "sourceType": "keyword",
    "timeframe": "1d",
    "region": "US",
    "language": "en",
    "keyword": "Elon Musk"
  }
}
| Field | Always? | Notes |
|---|---|---|
| title, source, sourceUrl | ✅ | From RSS |
| url | ✅ | Resolved publisher URL when decodeUrls=true, else Google News redirect |
| rssLink, guid, articleId | ✅ | The CBMi… identifier extracted in three forms |
| publishedAt, publishedTimestamp | ✅ | ISO-8601 UTC + Unix epoch ms |
| image | when extractImages=true (default) | Google News HD thumbnail; upgraded to publisher's og:image if extractDescriptions=true |
| description | when extractDescriptions=true | og:description from publisher page |
| loadedUrl | when publisher page was fetched | Final URL after redirects |
| metadata.sourceType | ✅ | keyword / topic / topic_url |
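Two of the relationships in the table are easy to check on the consumer side: `publishedTimestamp` is `publishedAt` expressed in epoch milliseconds, and `articleId` is the last path segment of `rssLink`. The helper names below are hypothetical and the sketch assumes timestamps always carry a fractional-seconds part, as in the sample record.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def iso_to_epoch_ms(published_at: str) -> int:
    """Convert an ISO-8601 UTC string like 2026-04-30T00:46:35.000Z to epoch ms."""
    parsed = datetime.strptime(published_at, "%Y-%m-%dT%H:%M:%S.%fZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division on timedeltas avoids floating-point rounding.
    return (parsed - _EPOCH) // timedelta(milliseconds=1)


def article_id_from_rss_link(rss_link: str) -> str:
    """Return the last path segment of a Google News RSS article link."""
    return urlparse(rss_link).path.rstrip("/").rsplit("/", 1)[-1]


# Values from the sample record in the Output Schema section.
assert iso_to_epoch_ms("2026-04-30T00:46:35.000Z") == 1777509995000
```

Checks like these are useful as a sanity test before loading records into a warehouse where the two time columns might otherwise drift apart.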
Pipe this directly into BigQuery, Snowflake, Airtable, Google Sheets, Slack, your CRM, or anything that speaks JSON / CSV / XLSX.
📋 Input Reference
Sources (at least one required)
| Field | Type | Description |
|---|---|---|
| keywords | string[] | Multiple search queries |
| query | string | Single query (added to keywords if both are set) |
| topics | string[] | One or more of WORLD · NATION · BUSINESS · TECHNOLOGY · ENTERTAINMENT · SPORTS · SCIENCE · HEALTH |
| topicUrls | string[] | Full Google News topic/section URLs (or bare {TOPIC_ID}[/sections/{SECTION_ID}]) |
Volume
| Field | Type | Default | Description |
|---|---|---|---|
| maxItems | int 1–50,000 | none | Global cap. >100 triggers automatic day-by-day splitting |
| maxArticles | int | 10 | Per-source cap (used when maxItems is not set) |
Time window
| Field | Type | Description |
|---|---|---|
| timeframe | 1h / 1d / 7d / 1m / 1y / all | Window for keyword searches |
| time_period | last_hour / last_day / last_week / last_month / last_year / custom | Alias for timeframe |
| time_period_min, time_period_max | MM/DD/YYYY | Used with time_period=custom |
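The alias relationship between `time_period` and `timeframe` can be sketched as a lookup. This is an assumption-laden illustration: `resolve_time_window` is a hypothetical helper, the precedence of `time_period` over `timeframe` and the fallback to `all` are guesses, and only the value names come from the table above.

```python
from datetime import datetime

# Alias values paired with timeframe values, following the table's order.
TIME_PERIOD_TO_TIMEFRAME = {
    "last_hour": "1h",
    "last_day": "1d",
    "last_week": "7d",
    "last_month": "1m",
    "last_year": "1y",
}


def resolve_time_window(run_input: dict) -> dict:
    """Normalize the time-related input fields into one description."""
    period = run_input.get("time_period")
    if period == "custom":
        start = datetime.strptime(run_input["time_period_min"], "%m/%d/%Y").date()
        end = datetime.strptime(run_input["time_period_max"], "%m/%d/%Y").date()
        if start > end:
            raise ValueError("time_period_min must not be after time_period_max")
        return {"mode": "custom", "start": start, "end": end}
    if period is not None:
        # Raises KeyError for values outside the documented alias list.
        return {"mode": "relative", "timeframe": TIME_PERIOD_TO_TIMEFRAME[period]}
    # Assumed fallback when neither field is set.
    return {"mode": "relative", "timeframe": run_input.get("timeframe", "all")}
```

Normalizing early like this keeps the rest of a pipeline from having to know about both spellings.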
Locale & filters
| Field | Type | Description |
|---|---|---|
| region_language | "US:en", "FR:fr", … | Combined region:language |
| gl | ISO country (lowercase) | Overrides region |
| hl | ISO language (lowercase) | Overrides language |
| lr | lang_en, lang_fr, … | Restrict results to a language |
| cr | ISO country (lowercase) | Restrict results to a country |
| nfpr | 0 / 1 | 1 = disable auto-correct |
| filter | 0 / 1 | 1 = enable Similar/Omitted Results filter |
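The override rules for `gl` and `hl` can be read as follows. This is a sketch, not the Actor's code: `resolve_locale` is a hypothetical helper, and the `US:en` fallback is an assumption since this README does not state a default for `region_language`.

```python
def resolve_locale(run_input: dict) -> dict:
    """Combine region_language with the optional gl / hl overrides."""
    # Assumed fallback; the documented examples use "US:en".
    combined = run_input.get("region_language", "US:en")
    region, _, language = combined.partition(":")
    # gl overrides the region, hl overrides the language.
    region = run_input.get("gl", region).upper()
    language = run_input.get("hl", language).lower()
    return {"region": region, "language": language}


print(resolve_locale({"region_language": "US:en", "gl": "fr", "hl": "fr"}))
```

With this reading, the cross-language recipe (`gl` and `hl` both set to `fr`) resolves to region `FR` and language `fr` regardless of `region_language`.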
Extraction
| Field | Type | Default | Description |
|---|---|---|---|
| decodeUrls | bool | false | Resolve to publisher URLs |
| extractDescriptions | bool | false | Fetch og:description (implies decodeUrls) |
| extractImages | bool | true | Include HD thumbnails |
Advanced
| Field | Type | Default | Description |
|---|---|---|---|
| proxyConfiguration | object | Apify Residential US | Standard Apify proxy input |
| concurrency | int 1–128 | 32 | In-flight requests for decode + article stages |
| retryBudgetRss | int 1–10 | 4 | Independent retry attempts per RSS feed |
| retryBudgetDecode | int 1–10 | 3 | Independent retry attempts per URL decode |
| retryBudgetArticle | int 1–10 | 2 | Independent retry attempts per publisher page fetch |
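A cap on in-flight requests, as `concurrency` describes, is commonly enforced with a semaphore. The sketch below shows that pattern with asyncio; `run_bounded` is a hypothetical helper and not necessarily how the Actor limits its decode and article stages.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

A = TypeVar("A")
R = TypeVar("R")


async def run_bounded(
    items: Iterable[A],
    worker: Callable[[A], Awaitable[R]],
    concurrency: int = 32,
) -> List[R]:
    """Run `worker` over `items` with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: A) -> R:
        # Only `concurrency` coroutines can be inside this block at once.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

Raising the limit increases throughput only as long as the proxy pool and the target keep up, so it is usually tuned together with the retry budgets.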
💡 Recipes
🔔 Brand monitoring — last hour, multi-keyword
{"keywords": ["Acme Corp", "Acme CEO", "Acme product line"],"timeframe": "1h","maxArticles": 100,"extractImages": true}
📊 Daily SEO digest with descriptions and images
{"topics": ["TECHNOLOGY", "BUSINESS"],"maxArticles": 50,"decodeUrls": true,"extractDescriptions": true,"extractImages": true}
📚 Historical archive (5,000 articles, full year)
{"query": "climate change","time_period": "custom","time_period_min": "01/01/2025","time_period_max": "12/31/2025","maxItems": 5000,"lr": "lang_en"}
🌍 Cross-language coverage
{"keywords": ["Olympics 2028"],"gl": "fr","hl": "fr","lr": "lang_fr"}
🧪 Specific section (e.g. Tech → AI)
{"topicUrls": ["https://news.google.com/topics/CAAq.../sections/CAQi...?hl=en-US&gl=US&ceid=US:en"],"maxArticles": 50}
🔌 No-Code Integrations
Wires up to Make, n8n, Zapier, and Pipedream out of the box. Common flows:
- Schedule: run every 15 min / hourly / daily via Apify Schedules
- Push: send new articles to Google Sheets, Airtable, Notion, Slack, Discord, or your CRM
- Alert: trigger when a brand or keyword crosses a threshold
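If you prefer to compute the alert condition yourself rather than in a no-code tool, a threshold check over a run's records might look like this. `keywords_over_threshold` is a hypothetical helper; it relies only on the `metadata.keyword` field shown in the Output Schema.

```python
from collections import Counter
from typing import Dict, Iterable


def keywords_over_threshold(records: Iterable[dict], threshold: int) -> Dict[str, int]:
    """Return {keyword: article_count} for keywords with at least `threshold` articles."""
    counts = Counter(
        record.get("metadata", {}).get("keyword") for record in records
    )
    counts.pop(None, None)  # topic-based records carry no keyword
    return {keyword: n for keyword, n in counts.items() if n >= threshold}
```

Run against the articles from one scheduled run, a non-empty result is the signal to forward to Slack, Discord, or whichever channel the flow targets.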
🏗️ Architecture
     ┌─────────────────────────────────────────┐
     │  curl_cffi · Chrome TLS impersonation   │
     │  Apify Residential US (rotating IP)     │
     └─────────────────────────────────────────┘
                          │
       ┌──────────────────┼──────────────────┐
       ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ RSS feed     │   │ HTML index   │   │ URL decode   │
│ (lxml)       │   │ (regex)      │   │ (Fbv4je)     │
│ retry: 4×    │   │ retry: 4×    │   │ retry: 3×    │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └────────┬─────────┘                  │
                ▼                            ▼
       ┌─────────────────┐           ┌───────────────┐
       │ merged item +   │           │ publisher     │
       │ HD thumbnail    │           │ og:* fetch    │
       └────────┬────────┘           │ retry: 2×     │
                │                    └───────┬───────┘
                └─────────────┬──────────────┘
                              ▼
                       ┌─────────────┐
                       │ Apify       │
                       │ Dataset     │
                       └─────────────┘
Each retry budget is independent: a 5xx storm in one stage never burns the others.
🛠️ Local Development
git clone <this-repo>
cd google_news
pip install -r requirements.txt

# Drop a test input
mkdir -p storage/key_value_stores/default
cat > storage/key_value_stores/default/INPUT.json <<'JSON'
{"keywords": ["bitcoin"],"maxArticles": 5,"extractImages": true}
JSON

python -m src
Output appears under storage/datasets/default/*.json.
ℹ️ Note: locally without APIFY_PROXY_PASSWORD, the scraper falls back to direct connections. RSS and image extraction work, but URL decoding requires residential US proxies (Google's Fbv4je rejects non-residential IPs).
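To inspect the local output directory mentioned above, a few lines of standard-library Python are enough. `load_local_dataset` is a hypothetical helper; skipping files whose names start with `__` is a precaution in case the storage layer writes metadata files next to the records.

```python
import json
from pathlib import Path
from typing import List


def load_local_dataset(directory: str = "storage/datasets/default") -> List[dict]:
    """Load every record file from a local dataset directory, in name order."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        if path.name.startswith("__"):
            continue  # not an article record
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


if __name__ == "__main__":
    for record in load_local_dataset():
        print(record.get("publishedAt"), record.get("title"))
```

An empty or missing directory simply yields an empty list, which makes the helper safe to call before the first local run.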
Deploy to Apify
deploy.bat   # Windows
# or
apify push --force
🧰 Tech Stack
- Python 3.12 · async/await throughout
- curl_cffi 0.7+: Chrome TLS impersonation, HTTP/2
- apify SDK 3.x: platform integration, proxy, dataset
- lxml: fast XML/HTML parsing
- Custom regex parsers: for Google News HTML index and Fbv4je batchexecute responses
📄 License
MIT.