Pricing

from $2.50 / 1,000 results

Try for free

Go to Apify Store

Google News Scraper · Full Article Bodies + Entities

Try for free

Scrape Google News with full publisher article bodies, not just snippets. Decodes the CBM redirect to the publisher page, extracts Article JSON-LD (title, body, author, image, keywords), and runs entity extraction (orgs/tickers/locations). 4 input kinds. Flat per-article billing. Pure HTTP

Pricing

from $2.50 / 1,000 results

Rating

5.0

(1)

Developer

Muhamed Didovic

Actor stats

Bookmarked

Total users

Monthly active users

5 days ago

Last modified

Google News Scraper — Articles, Bodies & Entities

How it works

How Google News Scraper works

All-in-one Google News scraper. Paste any mix of search queries, topic URLs, publisher domains, or Full-coverage cluster URLs — the actor auto-classifies each input and emits one structured row per article. Optional follow-through resolves the Google News redirect to the real publisher URL and extracts the full article body, headline, author, section, image, and keywords from Article / NewsArticle JSON-LD (with a readability-style fallback when JSON-LD is missing).

Input kind	Example	Becomes
Search query	`OpenAI earnings`	RSS feed for that query
Topic URL	`https://news.google.com/topics/CAAqJ…`	RSS feed for that topic
Publisher domain	`bloomberg.com`	RSS feed for `site:bloomberg.com`
Cluster (Full coverage)	`https://news.google.com/stories/CA…`	RSS feed for that story + nested `clusterAggregate` per row

One row per article. JSON + CSV. Pure HTTP, no Puppeteer / Playwright / headless Chromium / third-party paywall-bypass service.

Why use this Google News scraper

All four URL kinds in one actor. No need to glue together a query scraper + topic scraper + publisher scraper — one input list, auto-classified.
Full publisher article body, not just snippets. Google News only shows ~150-character snippets. Most competitor actors stop there. We resolve each Google News redirect (via the same batchexecute call Google's own client uses — no JS execution required) and extract the canonical body from the publisher's Article JSON-LD.
Structured entity extraction built in. Rule-based extraction of orgs / tickers / locations from curated lexicons (~zero false positives). Lives on every body-enriched row at no extra cost.
Flat per-article billing. One article = one row = one result. Buyer cost is rows × per-result-price — no nested-array math.
Locale-aware. Pass country + language to scope the feed (e.g. country: "DE", language: "de" for German news). Time-window filters (since / until) drop articles outside the date range before they hit your billing.

Overview

Field	Detail
Source	`news.google.com` (RSS feeds) + publisher pages (when `enrichBody:true`)
Pricing	$0.0025 per article (result) · +$0.0005 url-resolved · +$0.0005 body-enriched (Cloudflare/Akamai bypass included free)
Anti-bot	None on Google News RSS · walled publishers (Bloomberg/FT/WSJ/Yahoo) bypassed automatically
Runtime	~1-3 s per RSS query (100 articles per feed) · ~2-3 s extra per body-enriched article
Output	JSON + CSV, one row per article
Scope	Global · all Google News locales (`country` + `language` params)

Supported Inputs

The actor accepts every entry in your input list and auto-classifies it as one of four kinds. Mix and match freely.

Input	Example	What you get
Search query (bare string)	`OpenAI earnings`	Up to 100 articles matching the query, ordered by Google News relevance
Search query (URL)	`https://news.google.com/search?q=OpenAI+earnings`	Same as above — URL form is canonicalised
Topic URL	`https://news.google.com/topics/CAAqJ…`	The most recent articles in that topic feed
Publisher domain	`bloomberg.com` / `www.bloomberg.com`	Up to 100 most recent Google-News-indexed articles from that publisher
Cluster URL (Full coverage)	`https://news.google.com/stories/CA…`	Every article in the cluster + nested `clusterAggregate` per row (source count, languages, first-seen timestamp)

Use Cases

Buyer	Why they want this
RAG pipeline operators	Need clean full-body articles with stable IDs and timestamps — much cleaner than scraping each publisher directly.
Brand & competitor monitoring	Track every mention of a brand across all news outlets — paste the brand as a query, get all coverage.
Financial / equity research	Per-article ticker + org extraction, time-window filtering, full body for sentiment models.
Newsroom / media analysis	Track topic clusters ("Full coverage") to measure how many distinct publishers covered a given story.
Compliance / regulatory monitoring	Filter by `since` / `until` to catch news within a reporting window.
Sentiment / topic-modelling researchers	One actor produces both the SERP-level metadata (publisher, date, language) and the body content needed for downstream NLP.

How It Works

Google News is Cloudflare-free for the RSS feeds we use. The actor breaks past the typical pitfalls of news scraping with a pure-HTTP pipeline:

Resolve inputs to RSS feeds. Every URL kind (search / topic / publisher / cluster) has a /rss/... variant. The classifier emits the canonical RSS URL and the rest of the pipeline only deals with that.
Fetch RSS via impit Firefox + Apify Residential US. First-attempt 200 OK; no warmup, no rotation.
Parse RSS into per-article stubs. Each item gives us title, googleNewsUrl (Google's redirect), pubDate, publisher, publisherDomain (via <source url>), and guid.
Decode the Google News redirect → publisher URL (when enrichBody:true). The CBM token in the Google News URL is base64-encoded protobuf with no plaintext URL inside — but we can call the same /_/DotsSplashUi/data/batchexecute endpoint Google News' own JS client uses, passing the signature + timestamp scraped from the interstitial. Zero browser, zero JS execution.
Fetch the publisher article + extract canonical fields. Article / NewsArticle / BlogPosting JSON-LD when present (most major publishers ship this); readability-style heuristic over <article>, <main>, [itemprop="articleBody"] when not.
Run rule-based entity extraction. Curated lexicons for orgs and locations + regex for stock tickers. High precision, low false-positive rate.

Cluster URLs additionally produce a denormalised clusterAggregate (source count, languages, first-seen) on every row from that cluster — useful for measuring story spread.

Input Configuration

Field	Type	Required	Notes
`startInputs`	`string[]`	yes	Any mix of search queries, Google News URLs (search / topic / cluster), or publisher domains. Auto-classified.
`maxItems`	`integer`	no	Safety cap on total dataset rows. Default `500`. Free-tier users are capped at `50`.
`maxArticlesPerInput`	`integer`	no	Per-input cap. RSS feeds return up to ~100 articles. Default `100`, `0` = no per-input cap.
`enrichBody`	`boolean`	no	Follow each Google News redirect to the publisher and extract full body + Article JSON-LD. Default `false`. Charges an extra `body-enriched` event per article.
`extractEntities`	`boolean`	no	Rule-based entity extraction (orgs / tickers / locations). Defaults to `true` when `enrichBody` is on, `false` otherwise.
`country`	`string`	no	Google News country code (`gl=`). Default `US`.
`language`	`string`	no	Google News language code (`hl=`). Default `en`.
`since`	`string`	no	ISO date (`YYYY-MM-DD`). Drops articles older than this.
`until`	`string`	no	ISO date. Drops articles newer than this.
`proxy`	object	no	Apify Residential US recommended (and is the default). Google News is open enough that datacenter proxies usually work too, but residential keeps you out of edge-case rate-limits.

Example input

{
  "startInputs": [
    "OpenAI earnings",
    "bloomberg.com",
    "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGx6TVdZU0JXVnVMVlZUR2dKVlV5Z0FQAQ"
  ],
  "maxItems": 100,
  "maxArticlesPerInput": 50,
  "enrichBody": true,
  "extractEntities": true,
  "country": "US",
  "language": "en",
  "since": "2026-06-01",
  "proxy": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"], "apifyProxyCountry": "US" }
}

That input yields one row per matched article (up to ~100 across all three inputs), each with the full publisher body, author, section, image, and rule-based entities.

Output Overview

One row per article, with the rowType discriminator set to "news-article". Same shape regardless of which input kind produced it (cluster URLs add a clusterAggregate nested object; all other kinds leave it null).

Output Samples

{
  "rowType":               "news-article",
  "sourceKind":            "query",                              // "query" | "topic" | "publisher" | "cluster"
  "sourceInput":           "OpenAI earnings",
  "sourceRssUrl":          "https://news.google.com/rss/search?q=OpenAI+earnings&hl=en-US&gl=US&ceid=US:en",

  "title":                 "OpenAI's Financials Leaked. They're Not Bad, but They're Not Great.",
  "publisher":             "Yahoo Finance",
  "publisherDomain":       "finance.yahoo.com",
  "googleNewsUrl":         "https://news.google.com/rss/articles/CBMinAFBVV95cUxPYkJmb0NVNWFtMnVMQ1BJNU16VDR…",
  "publisherUrl":          "https://finance.yahoo.com/technology/ai/articles/openai-financials-leaked-not-bad-113800414.html",
  "publishedAt":           "2026-06-17T11:38:00.000Z",
  "publishedAtRaw":        "Wed, 17 Jun 2026 11:38:00 GMT",
  "language":              "en",
  "country":               "US",
  "guid":                  "CBMinAFBVV95cUxPYkJmb0NVNWFtMnVMQ1BJNU16…",

  // Populated only when `enrichBody:true`
  "body":                  "Microsoft Just Became Irresistibly Attractive… (full article text, plain text)",
  "bodyHtml":              "<article>…</article>",
  "wordCount":             355,
  "imageUrl":              "https://media.example.com/lead-image.jpg",
  "author":                "Aseity Research",
  "section":               null,
  "keywords":              ["OpenAI", "AI economics", "earnings"],
  "bodyExtracted":         true,

  // Populated when `extractEntities:true` (default true when enrichBody is on)
  "entities": {
    "people":              [],                                   // reserved — rule-based NER too noisy; not populated in v0.1
    "orgs":                ["OpenAI", "Microsoft"],
    "tickers":             [],
    "locations":           ["US"]
  },

  // Populated only when sourceKind === "cluster"
  "clusterAggregate":      null,

  "scrapedAt":             "2026-06-18T16:35:00.412Z"
}

Cluster-URL rows include a populated clusterAggregate:

{
  "rowType":              "news-article",
  "sourceKind":           "cluster",
  "sourceInput":          "https://news.google.com/stories/CA…",
  // … all standard fields …
  "clusterAggregate": {
    "clusterId":          "CA…",
    "clusterTitle":       "Full coverage: OpenAI Q3 earnings",
    "sourceCount":        47,                                    // 47 distinct publishers covered this story
    "languages":          ["en"],
    "firstSeenAt":        "2026-06-16T08:00:00.000Z",
    "leadImage":          null
  }
}

Key Output Fields

Field	Type	Always populated?	Notes
`title`	`string`	✅	Cleaned title — the trailing `- Publisher` suffix from Google News titles is stripped.
`publisher`	`string \| null`	usually	Human-readable publisher name (`Yahoo Finance`, `Bloomberg`).
`publisherDomain`	`string \| null`	usually	Hostname only, `www.` stripped (`finance.yahoo.com`).
`googleNewsUrl`	`string`	✅	The Google News redirect URL — the stable identifier across runs.
`publisherUrl`	`string \| null`	when `enrichBody:true`	The resolved publisher article URL.
`publishedAt`	`string \| null`	usually	ISO 8601 timestamp.
`guid`	`string \| null`	usually	Google's per-article ID — useful for deduping across runs.
`body`	`string \| null`	when `enrichBody:true` and publisher isn't paywalled	Plain text, whitespace-normalised.
`wordCount`	`number \| null`	when body is present	Whitespace-split count.
`author`	`string \| null`	when JSON-LD ships it	Comes from `Article.author.name`.
`entities.orgs` / `.tickers` / `.locations`	`string[]`	when `extractEntities:true`	Lexicon-checked — very low false-positive rate.
`entities.people`	`string[]`	reserved	Always `[]` in v0.1 — rule-based Capitalised-Bigram NER is too noisy to ship. A real NER path is on the v1.1 roadmap.
`clusterAggregate`	object	when `sourceKind === "cluster"`	`null` otherwise.

FAQ

Q: Why don't all articles get a body extracted? A: Some publishers (Bloomberg, WSJ, FT, NYT premium tier) are hard-paywalled — the article URL renders an empty body or a login wall. We surface this honestly: bodyExtracted: false and body: null for paywalled articles. The Google News URL, publisher name, headline, and date are always present.

Q: How fresh is the data? A: RSS feeds are Google's live index — typically articles appear within 5-15 minutes of publication. No caching on our side.

Q: Will scraping the same query twice return duplicates? A: Filter on guid (or googleNewsUrl) to dedupe across runs — both are stable per-article identifiers.

Q: What happens if I pass a query with no results? A: The RSS feed returns 0 items, the actor logs 0 eligible items, no rows emitted, no events charged.

Q: Does the publisher follow-through use my Apify proxy credits twice? A: Yes — one HTTP hop to resolve the CBM redirect via batchexecute, one hop to fetch the publisher page. Apify Residential bytes scale linearly with enrichBody:true. Most publisher pages are 200-500 KB.

Q: Why is the people array always empty? A: Rule-based "Capitalised Bigram" detection produces too many false positives ("Microsoft Just", "Strong Buy") to be useful at scale. We intentionally skip it in v0.1. Proper NER (via LLM) is on the v1.1 roadmap.

Q: How is the date window enforced? A: since and until are applied AFTER the RSS feed is fetched but BEFORE rows are emitted, so you only pay for articles inside your window.

Q: Can I scrape Google News in another language? A: Yes — set country: "DE", language: "de" for German news, country: "JP", language: "ja" for Japanese, etc. Any valid Google News locale works.

Q: What's the difference between a topic URL and a cluster URL? A: A topic is a long-running stream (e.g. "Business", "Technology"). A cluster (Google calls it "Full coverage") is a single news story with multiple publishers covering it — cluster rows additionally carry clusterAggregate with sourceCount so you can measure how widely the story has been covered.

Support

Issues, feature requests, or custom output shapes? Open a ticket on the actor's Apify page or message the maintainer directly.

Additional Services

memo23/capterra-scraper — software product reviews from Capterra
memo23/trustradius-scraper — B2B software reviews from TrustRadius
memo23/glassdoor-scraper-ppr — jobs, reviews, salaries, interviews from Glassdoor

Explore More Scrapers

Looking for a similar pattern on a different site? See the full list of memo23 actors on the Apify Store.

⚠️ Disclaimer

This actor is intended for research, archival, and journalistic purposes. Every scraped article belongs to the publisher of record — respect their terms of service, copyright, and any robots.txt directives. The actor:

Honours robots.txt for publisher pages (Google News' own robots.txt allows the surfaces we use).
Does not bypass paywalls. If a publisher's article is paywalled, we leave body: null rather than attempt circumvention.
Does not store, redistribute, or republish article content beyond the per-run dataset delivered to the buyer.

Buyers are responsible for downstream use of scraped content under fair-use, licensing, and applicable data-protection law in their jurisdiction.

SEO Keywords

google news scraper, google news api alternative, news article extractor, news scraping, full article body scraper, publisher content extraction, news rag pipeline, brand monitoring, competitor news monitoring, financial news scraping, news sentiment data, news ner entities, news cluster aggregator, full coverage scraper, multi-publisher news, news article body, news json-ld, news article metadata, google news rss, real-time news, apify news scraper, bloomberg news, reuters news, wsj news, financial times news, yahoo finance news, seeking alpha articles, news search api, news topic feed, news cluster api, news language filter, news time window filter

Google News Article Scraper

webscrap18/google-news-article-scraper

Scrape Google News, Extract full content with Title, Article Text, Images and Structured data.

WebScrap

News Article Scraper for Feeding LLM

proscraper/newsarticlescraper

Scrape news articles metadata to feed into LLM models. Returns article body, published date, article title, author etc.

Owais Nazir

184

Google News Scraper

crawlerbros/google-news-scraper

Scrape Google News in real-time. Supports keyword search, date filters, full-text article extraction, and image extraction.

Crawler Bros

235

Google News Scraper — Real Publisher URLs, Agent-Ready

leorochasantos/google-news-scraper

Fast, reliable Google News scraper: search, topic sections, and top headlines. Resolves the obfuscated Google redirect to the real publisher article URL and returns clean flat JSON built for LLM/MCP consumption.

Leonardo Santos

Google News Scraper — Real Publisher URLs, MCP-Ready

paulovitor18/google-news-scraper-real-urls

Get the real publisher URL for every Google News article — not the opaque news.google.com redirect. Fast RSS-based (no browser); search by keyword, topic, language & country. Built for news monitoring, LLM/RAG pipelines and AI agents (MCP).

MoreLock

🔥 Google News Search Scraper

powerai/google-news-search-scraper

Search Google News and export structured metadata with optional article enrichment.

PowerAI

Google News Monitor — Keyword Tracker & Article Scraper

guiltless_vivacity/google-news-monitor

Monitor Google News for multiple keywords simultaneously. Extract headlines, sources, URLs, publication dates, and full article content. Perfect for brand monitoring, competitive intelligence, and market research.

IrisVan

Google Search to Full Article Text ⚡$4 per 1k

ohmydata/google-search-to-full-article

Turn Google search (SERP) queries into a dataset of deduplicated, clean full article text.

OhMyData

5.0

Google News Scraper

sourabhbgp/google-news-scraper

Scrape Google News and get the real article link, image, source, and date on every result, plus the author when Google provides it. Search, browse topics, or pull headlines. Get thousands of results with a date range, plus optional full article text and Full Coverage. Any country and language.

Sourabh Kumar

Google News Scraper - Articles, Sources & Monitoring

scrapesage/google-news-scraper

Scrape Google News by keyword, topic, top headlines or city. Get real publisher URLs (decoded, not redirect links), source, date, snippet, related coverage and full article text, author & image. Monitor mode returns only new articles. Export JSON, CSV, Excel.