Google News Scraper
Pricing
$0.70 / 1,000 results
[$0.70 / 1K] Production-grade Google News scraper. Search by keywords, topics or any Google News topic/section URL. HD thumbnails, decoded publisher URLs, descriptions, full metadata. Up to 50,000 articles per run.
Developer
Kitcune Mia
Google News Scraper
Production-grade Google News scraper. Browser-grade TLS fingerprinting, Apify Residential US proxies, three independent retry budgets, HD thumbnails, all without a single line of headless-browser code.
Pull clean, structured news data from Google News by keywords, predefined topics, or any topic/section URL you paste from your browser. Works for any country and language. Scales from a 5-article spot check to 50,000-article archives.
Why This Scraper
Real browser fingerprint, no headless browser
Built on `curl_cffi` with Chrome TLS impersonation. Google sees a real Chrome handshake (JA3, ALPN, HTTP/2 frames) without the cost of Selenium, Playwright, or a JS engine. Result: roughly 10× lower latency and CPU use than a headless setup, with the same survival rate against bot defenses.
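
As an illustration of the idea (not the scraper's actual code), a minimal sketch of fetching a Google News RSS search feed through `curl_cffi` with a Chrome-like handshake might look like this. The feed URL layout and the helper names `search_feed_url` and `fetch_feed` are assumptions made for the example.

```python
from urllib.parse import urlencode


def search_feed_url(query: str, hl: str = "en-US", gl: str = "US", ceid: str = "US:en") -> str:
    """Build a Google News RSS search URL (the URL layout is an assumption)."""
    params = urlencode({"q": query, "hl": hl, "gl": gl, "ceid": ceid})
    return f"https://news.google.com/rss/search?{params}"


def fetch_feed(query: str, timeout: float = 15.0) -> str:
    """Fetch the feed with a Chrome-like TLS handshake and return the XML text."""
    # Imported lazily so the URL helper works without curl_cffi installed.
    from curl_cffi import requests

    response = requests.get(search_feed_url(query), impersonate="chrome", timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch_feed("Elon Musk")` performs one HTTP request with no browser process involved. Proxy settings are omitted here for brevity.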
Residential US IPs that rotate per request
Every HTTP call gets a fresh exit IP from Apify's residential pool. No session pinning, no shared rate limits between requests. You can ramp parallelism without tripping Google's 429 wall.
Three independent retry budgets
The scraper runs three separate I/O stages (RSS fetch · URL decode · publisher page fetch), and each has its own retry counter, exponential backoff with jitter, and 15-second hard timeout. A single flaky publisher can't drain the budget you need to keep ingesting RSS feeds.
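
The per-stage budget idea can be sketched roughly as follows. This is an assumed shape, not the Actor's source: `RetryBudget`, `with_retries`, the delay constants, and the choice of "full jitter" are all illustrative, and the attempt counts simply mirror the documented defaults, treated here as total attempts.

```python
import asyncio
import random
from dataclasses import dataclass


@dataclass
class RetryBudget:
    """Per-stage retry settings; names and delay values are illustrative."""
    attempts: int
    base_delay: float = 0.5
    max_delay: float = 8.0
    timeout: float = 15.0

    def delay(self, attempt: int) -> float:
        # Full jitter: a random wait between 0 and the capped exponential step.
        return random.uniform(0, min(self.max_delay, self.base_delay * 2 ** attempt))


async def with_retries(make_call, budget: RetryBudget):
    """Run make_call() until it succeeds or this stage's budget is spent."""
    if budget.attempts < 1:
        raise ValueError("a budget needs at least one attempt")
    last_error = None
    for attempt in range(budget.attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=budget.timeout)
        except Exception as error:  # a real scraper would catch narrower errors
            last_error = error
            if attempt < budget.attempts - 1:
                await asyncio.sleep(budget.delay(attempt))
    raise last_error


# One budget object per stage, so failures in one stage never consume another's attempts.
BUDGETS = {
    "rss": RetryBudget(attempts=4),
    "decode": RetryBudget(attempts=3),
    "article": RetryBudget(attempts=2),
}
```

Because each stage passes its own `RetryBudget` to `with_retries`, a run of publisher failures exhausts only the `article` budget.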
HD images without visiting publisher pages
Most scrapers that want article thumbnails have to follow every link to the publisher and parse `<meta property="og:image">`. This one fetches Google News' HTML index in parallel with the RSS feed and pulls the HD thumbnail straight from news.google.com/api/attachments/.... Free images, zero extra requests per article.
Smart URL decoder
Optional `decodeUrls` resolves news.google.com/rss/articles/CBMi... redirects to the actual publisher URL via Google's `Fbv4je` batchexecute endpoint, so you can store, deduplicate, and re-crawl real source URLs.
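
The decode step starts from the opaque article ID embedded in the redirect link. The sketch below only shows pulling that ID out of a link; it does not reproduce the batchexecute call. The helper name and the sample ID in the usage note are made up for illustration.

```python
from urllib.parse import urlparse


def article_id_from_rss_link(rss_link: str) -> str:
    """Return the opaque article ID (the last path segment) from a redirect link."""
    parsed = urlparse(rss_link)
    if parsed.netloc != "news.google.com" or "/articles/" not in parsed.path:
        raise ValueError(f"not a Google News article link: {rss_link!r}")
    # Query strings are already excluded from parsed.path.
    return parsed.path.rsplit("/", 1)[-1]
```

For example, `article_id_from_rss_link("https://news.google.com/rss/articles/CBMiEXAMPLE?oc=5")` yields the `CBMiEXAMPLE` segment, which is a convenient deduplication key even when `decodeUrls` is off.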
50,000-article archives, day-by-day
Google News RSS caps each feed at ~100 results. When you ask for `maxItems` > 100, the scraper automatically splits the date range one day at a time and parallelizes across days, with no manual chunking or coordination on your side.
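
To make the day-by-day split concrete, here is a small sketch that turns a date range into one window per day. The `after:`/`before:` query operators are an assumption about how a single day could be bounded, and `daily_windows` and `windowed_query` are hypothetical names.

```python
from datetime import date, datetime, timedelta


def daily_windows(start: str, end: str) -> list[tuple[date, date]]:
    """Split an inclusive MM/DD/YYYY range into one (day, next_day) pair per day."""
    first = datetime.strptime(start, "%m/%d/%Y").date()
    last = datetime.strptime(end, "%m/%d/%Y").date()
    if last < first:
        raise ValueError("end date precedes start date")
    days = (last - first).days + 1
    return [(first + timedelta(d), first + timedelta(d + 1)) for d in range(days)]


def windowed_query(query: str, window: tuple[date, date]) -> str:
    """Attach after:/before: operators (an assumed way to bound one day)."""
    day, next_day = window
    return f"{query} after:{day.isoformat()} before:{next_day.isoformat()}"
```

A full calendar year yields 365 windows, each of which can be fetched as its own feed of up to ~100 results and processed in parallel.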
Quick Start
Drop this into the Actor input:
{"keywords": ["Elon Musk"],"maxArticles": 10,"timeframe": "1d","region_language": "US:en","extractImages": true}
You get back HD-illustrated, ISO-timestamped, source-attributed news in seconds.
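
To run the same input from Python instead of the Apify console, a minimal sketch with the `apify-client` package could look like this. The Actor ID and API token are placeholders (this page does not state the Actor's ID), and `quick_start_input` and `run_scraper` are hypothetical helper names.

```python
def quick_start_input(keyword: str, max_articles: int = 10) -> dict:
    """Build the same input as the Quick Start JSON for a single keyword."""
    return {
        "keywords": [keyword],
        "maxArticles": max_articles,
        "timeframe": "1d",
        "region_language": "US:en",
        "extractImages": True,
    }


def run_scraper(token: str, actor_id: str, run_input: dict) -> list[dict]:
    """Start a run, wait for it to finish, and return the dataset items."""
    # Imported lazily; install with `pip install apify-client`.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Usage would be along the lines of `run_scraper("<APIFY_TOKEN>", "<username>/<actor-name>", quick_start_input("Elon Musk"))`, with both placeholders replaced by your own values.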
Two Ways to Search (Mix and Match)
Mode 1: Keyword search
Full Google search operator support (site:, intitle:, inurl:, OR, AND, -, "…").
{"keywords": ["bitcoin", "ethereum -dogecoin", "intitle:\"AI\" site:bbc.com"],"timeframe": "1h","maxArticles": 50}
Perfect for brand monitoring · competitor tracking · trend detection · SEO research.
Mode 2: Topics & sections
Predefined Google News topics: WORLD · NATION · BUSINESS · TECHNOLOGY · ENTERTAINMENT · SPORTS · SCIENCE · HEALTH
{"topics": ["TECHNOLOGY", "BUSINESS"],"maxArticles": 100,"region_language": "FR:fr"}
Custom topic/section URLs: browse Google News, navigate to any section (e.g. Sports → F1, Tech → Artificial Intelligence), and paste the URL straight from your address bar:
{"topicUrls": ["https://news.google.com/topics/CAAq.../sections/CAQi...?hl=en-US&gl=US&ceid=US:en"]}
You can combine all three source types (`keywords`, `topics`, and `topicUrls`) in a single run.
Output Schema
Every article is one clean JSON record:

    {
      "title": "Elon Musk tells his side of OpenAI's beginnings - PBS",
      "source": "PBS",
      "sourceUrl": "https://www.pbs.org",
      "url": "https://www.pbs.org/newshour/show/elon-musk-...",
      "rssLink": "https://news.google.com/rss/articles/CBMi...",
      "guid": "CBMi...",
      "articleId": "CBMi...",
      "publishedAt": "2026-04-30T00:46:35.000Z",
      "publishedTimestamp": 1777509995000,
      "image": "https://news.google.com/api/attachments/CC8i...-w400-h224-p-df-rw",
      "description": "Tesla and SpaceX CEO Elon Musk took the witness stand...",
      "loadedUrl": "https://www.pbs.org/newshour/show/elon-musk-...",
      "metadata": {
        "scrapeTimestamp": "2026-04-30T13:57:26.461Z",
        "sourceType": "keyword",
        "timeframe": "1d",
        "region": "US",
        "language": "en",
        "keyword": "Elon Musk"
      }
    }

| Field | Always? | Notes |
|---|---|---|
| `title`, `source`, `sourceUrl` | yes | From RSS |
| `url` | yes | Resolved publisher URL when `decodeUrls=true`, else Google News redirect |
| `rssLink`, `guid`, `articleId` | yes | The CBMi… identifier extracted in three forms |
| `publishedAt`, `publishedTimestamp` | yes | ISO-8601 UTC + Unix epoch ms |
| `image` | when `extractImages=true` (default) | Google News HD thumbnail; upgraded to publisher's og:image if `extractDescriptions=true` |
| `description` | when `extractDescriptions=true` | og:description from publisher page |
| `loadedUrl` | when publisher page was fetched | Final URL after redirects |
| `metadata.sourceType` | yes | `keyword` / `topic` / `topic_url` |
Pipe this directly into BigQuery, Snowflake, Airtable, Google Sheets, Slack, your CRM, or anything that speaks JSON / CSV / XLSX.
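
Before loading records into a downstream store, a consumer may want to deduplicate and flatten them. The sketch below works only with the documented output fields; the helper names and the chosen CSV columns are illustrative.

```python
import csv
import io
from datetime import datetime


def dedupe_newest_first(records: list[dict]) -> list[dict]:
    """Keep one record per articleId, ordered newest first."""
    unique = {}
    for record in records:
        unique.setdefault(record["articleId"], record)
    return sorted(unique.values(), key=lambda r: r["publishedTimestamp"], reverse=True)


def timestamp_matches(record: dict) -> bool:
    """Check that publishedAt and publishedTimestamp describe the same instant."""
    parsed = datetime.fromisoformat(record["publishedAt"].replace("Z", "+00:00"))
    return round(parsed.timestamp() * 1000) == record["publishedTimestamp"]


def to_csv(records: list[dict], fields=("publishedAt", "source", "title", "url")) -> str:
    """Flatten selected fields into CSV text for spreadsheet-style destinations."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

The sample record above passes `timestamp_matches`: `2026-04-30T00:46:35.000Z` corresponds to `1777509995000` milliseconds since the Unix epoch.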
Input Reference
Sources (at least one required)
| Field | Type | Description |
|---|---|---|
| `keywords` | string[] | Multiple search queries |
| `query` | string | Single query (added to `keywords` if both are set) |
| `topics` | string[] | One or more of WORLD · NATION · BUSINESS · TECHNOLOGY · ENTERTAINMENT · SPORTS · SCIENCE · HEALTH |
| `topicUrls` | string[] | Full Google News topic/section URLs (or bare `{TOPIC_ID}[/sections/{SECTION_ID}]`) |
Volume
| Field | Type | Default | Description |
|---|---|---|---|
| `maxItems` | int 1–50,000 | none | Global cap. Values above 100 trigger automatic day-by-day splitting |
| `maxArticles` | int | 10 | Per-source cap (used when `maxItems` is not set) |
Time window
| Field | Type | Description |
|---|---|---|
| `timeframe` | `1h` / `1d` / `7d` / `1m` / `1y` / `all` | Window for keyword searches |
| `time_period` | `last_hour` / `last_day` / `last_week` / `last_month` / `last_year` / `custom` | Alias for `timeframe` |
| `time_period_min`, `time_period_max` | MM/DD/YYYY | Used with `time_period=custom` |
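
How the two spellings relate can be sketched as a small normalizer. The one-to-one mapping between `time_period` and `timeframe` values is an assumption inferred from the table (the page only says one is an alias of the other), and `normalize_time_window` is a hypothetical helper.

```python
from datetime import datetime

# Assumed one-to-one mapping between time_period and timeframe values.
ALIASES = {
    "last_hour": "1h",
    "last_day": "1d",
    "last_week": "7d",
    "last_month": "1m",
    "last_year": "1y",
}


def normalize_time_window(options: dict) -> dict:
    """Reduce timeframe / time_period inputs to a single description."""
    period = options.get("time_period")
    if period == "custom":
        start = datetime.strptime(options["time_period_min"], "%m/%d/%Y").date()
        end = datetime.strptime(options["time_period_max"], "%m/%d/%Y").date()
        if end < start:
            raise ValueError("time_period_max is earlier than time_period_min")
        return {"start": start, "end": end}
    if period is not None:
        return {"timeframe": ALIASES[period]}
    return {"timeframe": options.get("timeframe")}
```

Validating the custom dates up front catches swapped or malformed `MM/DD/YYYY` values before a run is started.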
Locale & filters
| Field | Type | Description |
|---|---|---|
| `region_language` | `"US:en"`, `"FR:fr"`, … | Combined region:language |
| `gl` | ISO country (lowercase) | Overrides region |
| `hl` | ISO language (lowercase) | Overrides language |
| `lr` | `lang_en`, `lang_fr`, … | Restrict results to a language |
| `cr` | ISO country (lowercase) | Restrict results to a country |
| `nfpr` | 0 / 1 | 1 = disable auto-correct |
| `filter` | 0 / 1 | 1 = enable Similar/Omitted Results filter |
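
One way to picture how `region_language`, `gl`, and `hl` interact is the sketch below, which derives the `hl`, `gl`, and `ceid` query parameters seen in Google News URLs such as `?hl=en-US&gl=US&ceid=US:en`. The precedence and formatting rules are assumptions based on the table, and `resolve_locale` is a hypothetical name.

```python
from typing import Optional


def resolve_locale(region_language: str = "US:en",
                   gl: Optional[str] = None,
                   hl: Optional[str] = None) -> dict:
    """Combine region_language with optional gl/hl overrides into URL parameters."""
    region, language = region_language.split(":", 1)
    if gl:
        region = gl.upper()    # gl overrides the region part
    if hl:
        language = hl.lower()  # hl overrides the language part
    return {"hl": f"{language}-{region}", "gl": region, "ceid": f"{region}:{language}"}
```

With the default `"US:en"` this reproduces the parameters in the example topic URLs; passing `gl="fr", hl="fr"` switches all three parameters to France and French.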
Extraction
| Field | Type | Default | Description |
|---|---|---|---|
| `decodeUrls` | bool | false | Resolve to publisher URLs |
| `extractDescriptions` | bool | false | Fetch og:description (implies `decodeUrls`) |
| `extractImages` | bool | true | Include HD thumbnails |
Advanced
| Field | Type | Default | Description |
|---|---|---|---|
| `proxyConfiguration` | object | Apify Residential US | Standard Apify proxy input |
| `concurrency` | int 1–128 | 32 | In-flight requests for decode + article stages |
| `retryBudgetRss` | int 1–10 | 4 | Independent retry attempts per RSS feed |
| `retryBudgetDecode` | int 1–10 | 3 | Independent retry attempts per URL decode |
| `retryBudgetArticle` | int 1–10 | 2 | Independent retry attempts per publisher page fetch |
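
A cap on in-flight requests like `concurrency` is commonly implemented with a semaphore. The following is a generic asyncio sketch of that pattern, not the Actor's implementation; `run_bounded` is a hypothetical name, and `jobs` is assumed to be a list of zero-argument coroutine functions.

```python
import asyncio


async def run_bounded(jobs, concurrency: int = 32):
    """Await every job while keeping at most `concurrency` of them in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(job):
        # Each job waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await job()

    # gather preserves the order of `jobs` in its result list.
    return await asyncio.gather(*(guarded(job) for job in jobs))
```

Raising the limit increases throughput until the proxy pool or the target site becomes the bottleneck, so higher values are not automatically faster.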
Recipes
Brand monitoring: last hour, multi-keyword
{"keywords": ["Acme Corp", "Acme CEO", "Acme product line"],"timeframe": "1h","maxArticles": 100,"extractImages": true}
Daily SEO digest with resolved URLs and descriptions
{"topics": ["TECHNOLOGY", "BUSINESS"],"maxArticles": 50,"decodeUrls": true,"extractDescriptions": true,"extractImages": true}
Historical archive (5,000 articles, full year)
{"query": "climate change","time_period": "custom","time_period_min": "01/01/2025","time_period_max": "12/31/2025","maxItems": 5000,"lr": "lang_en"}
Cross-language coverage
{"keywords": ["Olympics 2028"],"gl": "fr","hl": "fr","lr": "lang_fr"}
Specific section (e.g. Tech → AI)
{"topicUrls": ["https://news.google.com/topics/CAAq.../sections/CAQi...?hl=en-US&gl=US&ceid=US:en"],"maxArticles": 50}
No-Code Integrations
Wires up to Make, n8n, Zapier, and Pipedream out of the box. Common flows:
- Schedule: run every 15 min / hourly / daily via Apify Schedules
- Push: send new articles to Google Sheets, Airtable, Notion, Slack, Discord, or your CRM
- Alert: trigger when a brand or keyword crosses a threshold
Architecture

    ┌─────────────────────────────────────────────┐
    │ curl_cffi · Chrome TLS impersonation        │
    │ Apify Residential US (rotating IP)          │
    └──────────────────────┬──────────────────────┘
           ┌───────────────┼───────────────┐
           ▼               ▼               ▼
    ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
    │ RSS feed    │ │ HTML index  │ │ URL decode  │
    │ (lxml)      │ │ (regex)     │ │ (Fbv4je)    │
    │ retry: 4×   │ │ retry: 4×   │ │ retry: 3×   │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           └───────┬───────┘               │
                   ▼                       ▼
           ┌───────────────┐       ┌───────────────┐
           │ merged item + │       │ publisher     │
           │ HD thumbnail  │       │ og:* fetch    │
           └───────┬───────┘       │ retry: 2×     │
                   │               └───────┬───────┘
                   └───────────┬───────────┘
                               ▼
                       ┌───────────────┐
                       │ Apify Dataset │
                       └───────────────┘

Each retry budget is independent: a 5xx storm in one stage never burns the others.
Local Development

    git clone <this-repo>
    cd google_news
    pip install -r requirements.txt

    # Drop a test input
    mkdir -p storage/key_value_stores/default
    cat > storage/key_value_stores/default/INPUT.json <<'JSON'
    {"keywords": ["bitcoin"], "maxArticles": 5, "extractImages": true}
    JSON

    python -m src

Output appears under storage/datasets/default/*.json.
Note: locally, without `APIFY_PROXY_PASSWORD`, the scraper falls back to direct connections. RSS and image extraction work, but URL decoding requires residential US proxies (Google's `Fbv4je` endpoint rejects non-residential IPs).
Deploy to Apify

    deploy.bat          # Windows
    # or
    apify push --force

Tech Stack
- Python 3.12 · async/await throughout
- `curl_cffi` 0.7+: Chrome TLS impersonation, HTTP/2
- `apify` SDK 3.x: platform integration, proxy, dataset
- `lxml`: fast XML/HTML parsing
- Custom regex parsers for the Google News HTML index and `Fbv4je` batchexecute responses
License
MIT.