Google News Scraper

Pricing

$0.70 / 1,000 results

[💰 $0.70 / 1K] Production-grade Google News scraper. Search by keywords, topics or any Google News topic/section URL. HD thumbnails, decoded publisher URLs, descriptions, full metadata. Up to 50,000 articles per run.

Developer

Kitcune Mia

Maintained by Community

📰 Google News Scraper

Production-grade Google News scraper. Browser-grade TLS fingerprinting, Apify Residential US proxies, three independent retry budgets, and HD thumbnails, all without a single line of headless-browser code.

Pull clean, structured news data from Google News by keywords, predefined topics, or any topic/section URL you paste from your browser. Works for any country and language. Scales from a 5-article spot check to 50,000-article archives.


⚑ Why This Scraper

🧬 Real browser fingerprint, no headless browser

Built on curl_cffi with Chrome TLS impersonation. Google sees a real Chrome handshake (JA3, ALPN, HTTP/2 frames) without paying the cost of Selenium, Playwright, or a JS engine. Result: ~10× lower latency and CPU than a headless setup, with the same survival rate against bot defenses.
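
A minimal sketch of what a Chrome-impersonating fetch can look like with curl_cffi. The `build_rss_url` and `fetch_rss` helpers, the `news.google.com/rss/search` URL shape, and the default locale are illustrative assumptions, not this Actor's actual code; `impersonate="chrome"` is the curl_cffi option that selects a Chrome-like TLS handshake.

```python
from urllib.parse import urlencode


def build_rss_url(query: str, gl: str = "US", hl: str = "en") -> str:
    """Build a Google News RSS search URL (assumed URL shape, for illustration)."""
    params = {"q": query, "hl": f"{hl}-{gl}", "gl": gl, "ceid": f"{gl}:{hl}"}
    return "https://news.google.com/rss/search?" + urlencode(params)


def fetch_rss(query: str, timeout: float = 15.0) -> str:
    """Fetch the feed with a Chrome-like TLS fingerprint and return the XML text."""
    # Imported lazily so the URL helper stays usable without curl_cffi installed.
    from curl_cffi import requests

    response = requests.get(build_rss_url(query), impersonate="chrome", timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch_rss("Elon Musk")` performs a single HTTP request with no browser process involved, which is where the latency and CPU savings come from.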

🌎 Residential US IPs that rotate per request

Every HTTP call gets a fresh exit IP from Apify's residential pool. No session pinning, no shared rate limits between requests. You can ramp parallelism without tripping Google's 429 wall.

🛡️ Three independent retry budgets

The scraper runs three separate I/O stages (RSS fetch · URL decode · publisher page fetch), and each has its own retry counter, exponential backoff with jitter, and 15-second hard timeout. A single flaky publisher can't drain the budget you need to keep ingesting RSS feeds.
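
The idea of a per-stage retry budget can be sketched as follows. This is an illustration under stated assumptions, not the Actor's implementation: the helper name `with_retry_budget`, the 0.5-second base delay, and the "full jitter" variant of backoff are all hypothetical choices.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def with_retry_budget(
    operation: Callable[[], Awaitable[T]],
    budget: int,
    timeout: float = 15.0,
    base_delay: float = 0.5,
) -> T:
    """Run `operation` up to `budget` times with a hard per-attempt timeout."""
    last_error: Optional[Exception] = None
    for attempt in range(budget):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except Exception as error:  # includes asyncio.TimeoutError
            last_error = error
            if attempt + 1 < budget:
                # Exponential backoff with full jitter: sleep in [0, base * 2^attempt].
                await asyncio.sleep(random.uniform(0, base_delay * 2**attempt))
    assert last_error is not None
    raise last_error
```

Because each stage calls the helper with its own `budget` (for example 4 for RSS, 3 for decode, 2 for publisher pages), attempts spent on one stage are never deducted from another.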

🖼️ HD images without visiting publisher pages

Most scrapers that want article thumbnails have to follow every link to the publisher and parse <meta property="og:image">. This one fetches Google News' HTML index in parallel with the RSS feed and pulls the HD thumbnail straight from news.google.com/api/attachments/.... Free images, zero extra requests per article.

🔓 Smart URL decoder

Optional decodeUrls resolves news.google.com/rss/articles/CBMi... redirects to the actual publisher URL via Google's Fbv4je batchexecute endpoint, so you can store, deduplicate, and re-crawl real source URLs.

📊 50,000-article archives, day-by-day

Google News RSS caps each feed at ~100 results. When you ask for maxItems > 100, the scraper automatically splits the date range one day at a time and parallelizes across days: no manual chunking, no coordination on your side.
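
A rough sketch of day-by-day splitting, to show why it lifts the ~100-result cap: each one-day window becomes its own feed request. The function names are hypothetical, and the `after:` / `before:` search operators (including their exact boundary semantics) are an assumption about how a per-day filter could be expressed, not a statement of what the Actor sends.

```python
from datetime import date, timedelta


def split_into_days(start: date, end: date) -> list[tuple[date, date]]:
    """Split the inclusive range [start, end] into one-day (after, before) windows."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    windows = []
    day = start
    while day <= end:
        windows.append((day, day + timedelta(days=1)))
        day += timedelta(days=1)
    return windows


def daily_queries(query: str, start: date, end: date) -> list[str]:
    """Attach a per-day date filter to the query (operator syntax is an assumption)."""
    return [
        f"{query} after:{after.isoformat()} before:{before.isoformat()}"
        for after, before in split_into_days(start, end)
    ]
```

A full-year range yields 365 windows, so with ~100 results per feed the theoretical ceiling for one query is roughly 36,500 articles before deduplication.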


🚀 Quick Start

Drop this into the Actor input:

{
    "keywords": ["Elon Musk"],
    "maxArticles": 10,
    "timeframe": "1d",
    "region_language": "US:en",
    "extractImages": true
}

You get back HD-illustrated, ISO-timestamped, source-attributed news in seconds.
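
The same input can also be submitted from Python. The sketch below assumes the `apify-client` package is installed; `build_run_input` and `run_scraper` are hypothetical helper names, and the Actor ID is deliberately left as a parameter because it is not stated in this README.

```python
def build_run_input(keywords: list[str], max_articles: int = 10, timeframe: str = "1d") -> dict:
    """Assemble the Quick Start input as a Python dict."""
    return {
        "keywords": keywords,
        "maxArticles": max_articles,
        "timeframe": timeframe,
        "region_language": "US:en",
        "extractImages": True,
    }


def run_scraper(token: str, actor_id: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    # Imported lazily; requires `pip install apify-client`.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Pass your API token and the Actor's ID as shown on its Store page, for example `run_scraper(token, actor_id, build_run_input(["Elon Musk"]))`.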


🧭 Two Ways to Search (Mix and Match)

Mode 1: Keywords

Full Google search operator support (site:, intitle:, inurl:, OR, AND, -, "…").

{
    "keywords": ["bitcoin", "ethereum -dogecoin", "intitle:\"AI\" site:bbc.com"],
    "timeframe": "1h",
    "maxArticles": 50
}

Perfect for brand monitoring · competitor tracking · trend detection · SEO research.

📰 Mode 2: Topics & sections

Predefined Google News topics: WORLD · NATION · BUSINESS · TECHNOLOGY · ENTERTAINMENT · SPORTS · SCIENCE · HEALTH

{
    "topics": ["TECHNOLOGY", "BUSINESS"],
    "maxArticles": 100,
    "region_language": "FR:fr"
}

Custom topic/section URLs: browse Google News, navigate to anything (e.g. Sports → F1, Tech → Artificial Intelligence), and paste the URL straight from your address bar:

{
    "topicUrls": [
        "https://news.google.com/topics/CAAq.../sections/CAQi...?hl=en-US&gl=US&ceid=US:en"
    ]
}

You can combine all three source types (keywords + topics + topicUrls) in a single run.


📦 Output Schema

Every article is one clean JSON record:

{
    "title": "Elon Musk tells his side of OpenAI's beginnings - PBS",
    "source": "PBS",
    "sourceUrl": "https://www.pbs.org",
    "url": "https://www.pbs.org/newshour/show/elon-musk-...",
    "rssLink": "https://news.google.com/rss/articles/CBMi...",
    "guid": "CBMi...",
    "articleId": "CBMi...",
    "publishedAt": "2026-04-30T00:46:35.000Z",
    "publishedTimestamp": 1777509995000,
    "image": "https://news.google.com/api/attachments/CC8i...-w400-h224-p-df-rw",
    "description": "Tesla and SpaceX CEO Elon Musk took the witness stand...",
    "loadedUrl": "https://www.pbs.org/newshour/show/elon-musk-...",
    "metadata": {
        "scrapeTimestamp": "2026-04-30T13:57:26.461Z",
        "sourceType": "keyword",
        "timeframe": "1d",
        "region": "US",
        "language": "en",
        "keyword": "Elon Musk"
    }
}

| Field | Always? | Notes |
| --- | --- | --- |
| title, source, sourceUrl | ✅ | From RSS |
| url | ✅ | Resolved publisher URL when decodeUrls=true, else Google News redirect |
| rssLink, guid, articleId | ✅ | The CBMi… identifier extracted in three forms |
| publishedAt, publishedTimestamp | ✅ | ISO-8601 UTC + Unix epoch ms |
| image | when extractImages=true (default) | Google News HD thumbnail; upgraded to publisher's og:image if extractDescriptions=true |
| description | when extractDescriptions=true | og:description from publisher page |
| loadedUrl | when publisher page was fetched | Final URL after redirects |
| metadata.sourceType | ✅ | keyword / topic / topic_url |

Pipe this directly into BigQuery, Snowflake, Airtable, Google Sheets, Slack, your CRM, or anything that speaks JSON / CSV / XLSX.
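
Before loading records into a warehouse it can help to normalize and deduplicate them. The helpers below are a hypothetical sketch that relies only on fields shown in the schema above; treating `articleId` as the last path segment of `rssLink` is an assumption based on the sample record.

```python
from datetime import datetime


def article_id_from_rss_link(rss_link: str) -> str:
    """Return the last path segment of a Google News RSS article link."""
    return rss_link.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]


def published_ms(record: dict) -> int:
    """Convert the ISO-8601 `publishedAt` string to Unix epoch milliseconds."""
    moment = datetime.fromisoformat(record["publishedAt"].replace("Z", "+00:00"))
    return round(moment.timestamp() * 1000)


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each `articleId`."""
    seen: set[str] = set()
    unique = []
    for record in records:
        if record["articleId"] not in seen:
            seen.add(record["articleId"])
            unique.append(record)
    return unique
```

For the sample record, `published_ms` returns 1777509995000, which matches its `publishedTimestamp` field.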


📋 Input Reference

Sources (at least one required)

| Field | Type | Description |
| --- | --- | --- |
| keywords | string[] | Multiple search queries |
| query | string | Single query (added to keywords if both are set) |
| topics | string[] | One or more of WORLD · NATION · BUSINESS · TECHNOLOGY · ENTERTAINMENT · SPORTS · SCIENCE · HEALTH |
| topicUrls | string[] | Full Google News topic/section URLs (or bare {TOPIC_ID}[/sections/{SECTION_ID}]) |
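
Since topicUrls accepts both full URLs and bare ID paths, a client that validates input before starting a run might normalize the two forms like this. `parse_topic_ref` is a hypothetical helper written for illustration; how the Actor itself parses these values is not documented here.

```python
from typing import Optional
from urllib.parse import urlparse


def parse_topic_ref(value: str) -> tuple[str, Optional[str]]:
    """Return (topic_id, section_id) from a full topic URL or a bare ID path."""
    path = urlparse(value).path if "://" in value else value.split("?", 1)[0]
    parts = [part for part in path.split("/") if part]
    if "topics" in parts:
        # Full URL form: keep only what follows the "topics" segment.
        parts = parts[parts.index("topics") + 1:]
    if not parts:
        raise ValueError(f"no topic ID found in {value!r}")
    section_id = parts[2] if len(parts) >= 3 and parts[1] == "sections" else None
    return parts[0], section_id
```

Both the pasted browser URL and the bare `{TOPIC_ID}/sections/{SECTION_ID}` form resolve to the same pair of IDs, and a topic without a section yields `None` for the second element.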

Volume

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| maxItems | int 1–50,000 | none | Global cap. >100 triggers automatic day-by-day splitting |
| maxArticles | int | 10 | Per-source cap (used when maxItems is not set) |

Time window

| Field | Type | Description |
| --- | --- | --- |
| timeframe | 1h / 1d / 7d / 1m / 1y / all | Window for keyword searches |
| time_period | last_hour / last_day / last_week / last_month / last_year / custom | Alias for timeframe |
| time_period_min, time_period_max | MM/DD/YYYY | Used with time_period=custom |
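
A sketch of how the two time-window styles could be reconciled on the client side. The alias mapping is inferred from the order of the two value lists above and is therefore an assumption, as are the helper names and the fallback to "all" when neither field is set.

```python
from datetime import date, datetime

# Assumed alias mapping, inferred from the order of the documented value lists.
TIME_PERIOD_TO_TIMEFRAME = {
    "last_hour": "1h",
    "last_day": "1d",
    "last_week": "7d",
    "last_month": "1m",
    "last_year": "1y",
}


def parse_us_date(value: str) -> date:
    """Parse an MM/DD/YYYY string such as "12/31/2025"."""
    return datetime.strptime(value, "%m/%d/%Y").date()


def resolve_window(run_input: dict) -> tuple:
    """Return ("timeframe", code) or ("custom", start, end) for the given input."""
    period = run_input.get("time_period")
    if period == "custom":
        start = parse_us_date(run_input["time_period_min"])
        end = parse_us_date(run_input["time_period_max"])
        if end < start:
            raise ValueError("time_period_max is earlier than time_period_min")
        return ("custom", start, end)
    if period is not None:
        return ("timeframe", TIME_PERIOD_TO_TIMEFRAME[period])
    # Falling back to "all" is a placeholder assumption, not a documented default.
    return ("timeframe", run_input.get("timeframe", "all"))
```

Validating the MM/DD/YYYY strings up front catches swapped day and month values (such as "31/12/2025") before a long archive run starts.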

Locale & filters

| Field | Type | Description |
| --- | --- | --- |
| region_language | "US:en", "FR:fr", … | Combined region:language |
| gl | ISO country (lowercase) | Overrides region |
| hl | ISO language (lowercase) | Overrides language |
| lr | lang_en, lang_fr, … | Restrict results to a language |
| cr | ISO country (lowercase) | Restrict results to a country |
| nfpr | 0 / 1 | 1 = disable auto-correct |
| filter | 0 / 1 | 1 = enable Similar/Omitted Results filter |
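
The override rule (gl and hl take precedence over region_language) can be illustrated with a small resolver. This is a hypothetical sketch: the output parameter shapes mirror the `hl=en-US&gl=US&ceid=US:en` query string seen in the topic URL example, and the "US:en" fallback is an assumption rather than a documented default.

```python
def resolve_locale(run_input: dict) -> dict:
    """Derive gl/hl/ceid query parameters from the locale-related input fields."""
    region, _, language = run_input.get("region_language", "US:en").partition(":")
    # Explicit gl / hl values take precedence over region_language.
    region = run_input.get("gl", region).upper()
    language = run_input.get("hl", language).lower()
    return {
        "gl": region,
        "hl": f"{language}-{region}",
        "ceid": f"{region}:{language}",
    }
```

With `{"region_language": "US:en", "gl": "fr", "hl": "fr"}` the overrides win and the result describes the French edition.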

Extraction

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| decodeUrls | bool | false | Resolve to publisher URLs |
| extractDescriptions | bool | false | Fetch og:description (implies decodeUrls) |
| extractImages | bool | true | Include HD thumbnails |

Advanced

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| proxyConfiguration | object | Apify Residential US | Standard Apify proxy input |
| concurrency | int 1–128 | 32 | In-flight requests for decode + article stages |
| retryBudgetRss | int 1–10 | 4 | Independent retry attempts per RSS feed |
| retryBudgetDecode | int 1–10 | 3 | Independent retry attempts per URL decode |
| retryBudgetArticle | int 1–10 | 2 | Independent retry attempts per publisher page fetch |
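
To make the concurrency setting concrete, here is one common way to cap in-flight requests in async Python. It is a generic sketch using `asyncio.Semaphore`; the helper name `map_bounded` is hypothetical and nothing here is taken from the Actor's source.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    worker: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    concurrency: int = 32,
) -> list[R]:
    """Apply `worker` to every item with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))
```

Raising the limit speeds up the decode and article stages until the proxy pool or the target site becomes the bottleneck; lowering it trades throughput for fewer simultaneous connections.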

💡 Recipes

🔔 Brand monitoring: last hour, multi-keyword

{
    "keywords": ["Acme Corp", "Acme CEO", "Acme product line"],
    "timeframe": "1h",
    "maxArticles": 100,
    "extractImages": true
}

📊 Daily SEO digest with full content

{
    "topics": ["TECHNOLOGY", "BUSINESS"],
    "maxArticles": 50,
    "decodeUrls": true,
    "extractDescriptions": true,
    "extractImages": true
}

📚 Historical archive (5,000 articles, full year)

{
    "query": "climate change",
    "time_period": "custom",
    "time_period_min": "01/01/2025",
    "time_period_max": "12/31/2025",
    "maxItems": 5000,
    "lr": "lang_en"
}

🌍 Cross-language coverage

{
    "keywords": ["Olympics 2028"],
    "gl": "fr",
    "hl": "fr",
    "lr": "lang_fr"
}

🧪 Specific section (e.g. Tech → AI)

{
    "topicUrls": [
        "https://news.google.com/topics/CAAq.../sections/CAQi...?hl=en-US&gl=US&ceid=US:en"
    ],
    "maxArticles": 50
}

🔌 No-Code Integrations

Wires up to Make, n8n, Zapier, and Pipedream out of the box. Common flows:

  • Schedule: run every 15 min / hourly / daily via Apify Schedules
  • Push: send new articles to Google Sheets, Airtable, Notion, Slack, Discord, or your CRM
  • Alert: trigger when a brand or keyword crosses a threshold

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ curl_cffi Β· Chrome TLS impersonation β”‚
β”‚ Apify Residential US (rotating IP) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β–Ό β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RSS feed β”‚ β”‚ HTML index β”‚ β”‚ URL decode β”‚
β”‚ (lxml) β”‚ β”‚ (regex) β”‚ β”‚ (Fbv4je) β”‚
β”‚ retry: 4Γ— β”‚ β”‚ retry: 4Γ— β”‚ β”‚ retry: 3Γ— β”‚
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ merged item + β”‚ β”‚ publisher β”‚
β”‚ HD thumbnail β”‚ β”‚ og:* fetch β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ retry: 2Γ— β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Apify β”‚
β”‚ Dataset β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Each retry budget is independent: a 5xx storm in one stage never burns the others.


🛠️ Local Development

git clone <this-repo>
cd google_news
pip install -r requirements.txt
# Drop a test input
mkdir -p storage/key_value_stores/default
cat > storage/key_value_stores/default/INPUT.json <<'JSON'
{
"keywords": ["bitcoin"],
"maxArticles": 5,
"extractImages": true
}
JSON
python -m src

Output appears under storage/datasets/default/*.json.
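
To inspect a local run, the per-item files can be read back with a few lines of Python. `load_local_dataset` is a hypothetical helper; skipping files whose names start with a double underscore is a precaution in case the local storage directory also holds metadata files, which is an assumption about the storage layout.

```python
import json
from pathlib import Path


def load_local_dataset(directory: str = "storage/datasets/default") -> list[dict]:
    """Read every JSON record in the local dataset directory, sorted by file name."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        if path.name.startswith("__"):
            continue  # assumed to be storage metadata, not an article record
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records
```

Run from the project root, `load_local_dataset()` returns the scraped articles as a list of dicts in the order they were written.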

ℹ️ Note: locally without APIFY_PROXY_PASSWORD, the scraper falls back to direct connections. RSS + image extraction work, but URL decoding requires residential US proxies (Google's Fbv4je rejects non-residential IPs).

Deploy to Apify

deploy.bat # Windows
# or
apify push --force

🧰 Tech Stack

  • Python 3.12 · async/await throughout
  • curl_cffi 0.7+: Chrome TLS impersonation, HTTP/2
  • apify SDK 3.x: platform integration, proxy, dataset
  • lxml: fast XML/HTML parsing
  • Custom regex parsers for the Google News HTML index and Fbv4je batchexecute responses

📄 License

MIT.