Google News Scraper · Full Article Bodies + Entities avatar

Google News Scraper · Full Article Bodies + Entities

Pricing

from $2.50 / 1,000 results

Go to Apify Store
Google News Scraper · Full Article Bodies + Entities

Google News Scraper · Full Article Bodies + Entities

Scrape Google News with full publisher article bodies, not just snippets. Decodes the CBM redirect to the publisher page, extracts Article JSON-LD (title, body, author, image, keywords), and runs entity extraction (orgs/tickers/locations). 4 input kinds. Flat per-article billing. Pure HTTP

Pricing

from $2.50 / 1,000 results

Rating

5.0

(1)

Developer

Muhamed Didovic

Muhamed Didovic

Maintained by Community

Actor stats

0

Bookmarked

16

Total users

15

Monthly active users

5 days ago

Last modified

Categories

Share

Google News Scraper — Articles, Bodies & Entities

How it works

How Google News Scraper works


All-in-one Google News scraper. Paste any mix of search queries, topic URLs, publisher domains, or Full-coverage cluster URLs — the actor auto-classifies each input and emits one structured row per article. Optional follow-through resolves the Google News redirect to the real publisher URL and extracts the full article body, headline, author, section, image, and keywords from Article / NewsArticle JSON-LD (with a readability-style fallback when JSON-LD is missing).

Input kindExampleBecomes
Search queryOpenAI earningsRSS feed for that query
Topic URLhttps://news.google.com/topics/CAAqJ…RSS feed for that topic
Publisher domainbloomberg.comRSS feed for site:bloomberg.com
Cluster (Full coverage)https://news.google.com/stories/CA…RSS feed for that story + nested clusterAggregate per row

One row per article. JSON + CSV. Pure HTTP, no Puppeteer / Playwright / headless Chromium / third-party paywall-bypass service.


Why use this Google News scraper

  • All four URL kinds in one actor. No need to glue together a query scraper + topic scraper + publisher scraper — one input list, auto-classified.
  • Full publisher article body, not just snippets. Google News only shows ~150-character snippets. Most competitor actors stop there. We resolve each Google News redirect (via the same batchexecute call Google's own client uses — no JS execution required) and extract the canonical body from the publisher's Article JSON-LD.
  • Structured entity extraction built in. Rule-based extraction of orgs / tickers / locations from curated lexicons (~zero false positives). Lives on every body-enriched row at no extra cost.
  • Flat per-article billing. One article = one row = one result. Buyer cost is rows × per-result-price — no nested-array math.
  • Locale-aware. Pass country + language to scope the feed (e.g. country: "DE", language: "de" for German news). Time-window filters (since / until) drop articles outside the date range before they hit your billing.

Overview

FieldDetail
Sourcenews.google.com (RSS feeds) + publisher pages (when enrichBody:true)
Pricing$0.0025 per article (result) · +$0.0005 url-resolved · +$0.0005 body-enriched (Cloudflare/Akamai bypass included free)
Anti-botNone on Google News RSS · walled publishers (Bloomberg/FT/WSJ/Yahoo) bypassed automatically
Runtime~1-3 s per RSS query (100 articles per feed) · ~2-3 s extra per body-enriched article
OutputJSON + CSV, one row per article
ScopeGlobal · all Google News locales (country + language params)

Supported Inputs

The actor accepts every entry in your input list and auto-classifies it as one of four kinds. Mix and match freely.

InputExampleWhat you get
Search query (bare string)OpenAI earningsUp to 100 articles matching the query, ordered by Google News relevance
Search query (URL)https://news.google.com/search?q=OpenAI+earningsSame as above — URL form is canonicalised
Topic URLhttps://news.google.com/topics/CAAqJ…The most recent articles in that topic feed
Publisher domainbloomberg.com / www.bloomberg.comUp to 100 most recent Google-News-indexed articles from that publisher
Cluster URL (Full coverage)https://news.google.com/stories/CA…Every article in the cluster + nested clusterAggregate per row (source count, languages, first-seen timestamp)

Use Cases

BuyerWhy they want this
RAG pipeline operatorsNeed clean full-body articles with stable IDs and timestamps — much cleaner than scraping each publisher directly.
Brand & competitor monitoringTrack every mention of a brand across all news outlets — paste the brand as a query, get all coverage.
Financial / equity researchPer-article ticker + org extraction, time-window filtering, full body for sentiment models.
Newsroom / media analysisTrack topic clusters ("Full coverage") to measure how many distinct publishers covered a given story.
Compliance / regulatory monitoringFilter by since / until to catch news within a reporting window.
Sentiment / topic-modelling researchersOne actor produces both the SERP-level metadata (publisher, date, language) and the body content needed for downstream NLP.

How It Works

Google News is Cloudflare-free for the RSS feeds we use. The actor breaks past the typical pitfalls of news scraping with a pure-HTTP pipeline:

  1. Resolve inputs to RSS feeds. Every URL kind (search / topic / publisher / cluster) has a /rss/... variant. The classifier emits the canonical RSS URL and the rest of the pipeline only deals with that.
  2. Fetch RSS via impit Firefox + Apify Residential US. First-attempt 200 OK; no warmup, no rotation.
  3. Parse RSS into per-article stubs. Each item gives us title, googleNewsUrl (Google's redirect), pubDate, publisher, publisherDomain (via <source url>), and guid.
  4. Decode the Google News redirect → publisher URL (when enrichBody:true). The CBM token in the Google News URL is base64-encoded protobuf with no plaintext URL inside — but we can call the same /_/DotsSplashUi/data/batchexecute endpoint Google News' own JS client uses, passing the signature + timestamp scraped from the interstitial. Zero browser, zero JS execution.
  5. Fetch the publisher article + extract canonical fields. Article / NewsArticle / BlogPosting JSON-LD when present (most major publishers ship this); readability-style heuristic over <article>, <main>, [itemprop="articleBody"] when not.
  6. Run rule-based entity extraction. Curated lexicons for orgs and locations + regex for stock tickers. High precision, low false-positive rate.

Cluster URLs additionally produce a denormalised clusterAggregate (source count, languages, first-seen) on every row from that cluster — useful for measuring story spread.


Input Configuration

FieldTypeRequiredNotes
startInputsstring[]yesAny mix of search queries, Google News URLs (search / topic / cluster), or publisher domains. Auto-classified.
maxItemsintegernoSafety cap on total dataset rows. Default 500. Free-tier users are capped at 50.
maxArticlesPerInputintegernoPer-input cap. RSS feeds return up to ~100 articles. Default 100, 0 = no per-input cap.
enrichBodybooleannoFollow each Google News redirect to the publisher and extract full body + Article JSON-LD. Default false. Charges an extra body-enriched event per article.
extractEntitiesbooleannoRule-based entity extraction (orgs / tickers / locations). Defaults to true when enrichBody is on, false otherwise.
countrystringnoGoogle News country code (gl=). Default US.
languagestringnoGoogle News language code (hl=). Default en.
sincestringnoISO date (YYYY-MM-DD). Drops articles older than this.
untilstringnoISO date. Drops articles newer than this.
proxyobjectnoApify Residential US recommended (and is the default). Google News is open enough that datacenter proxies usually work too, but residential keeps you out of edge-case rate-limits.

Example input

{
"startInputs": [
"OpenAI earnings",
"bloomberg.com",
"https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGx6TVdZU0JXVnVMVlZUR2dKVlV5Z0FQAQ"
],
"maxItems": 100,
"maxArticlesPerInput": 50,
"enrichBody": true,
"extractEntities": true,
"country": "US",
"language": "en",
"since": "2026-06-01",
"proxy": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"], "apifyProxyCountry": "US" }
}

That input yields one row per matched article (up to ~100 across all three inputs), each with the full publisher body, author, section, image, and rule-based entities.


Output Overview

One row per article, with the rowType discriminator set to "news-article". Same shape regardless of which input kind produced it (cluster URLs add a clusterAggregate nested object; all other kinds leave it null).


Output Samples

{
"rowType": "news-article",
"sourceKind": "query", // "query" | "topic" | "publisher" | "cluster"
"sourceInput": "OpenAI earnings",
"sourceRssUrl": "https://news.google.com/rss/search?q=OpenAI+earnings&hl=en-US&gl=US&ceid=US:en",
"title": "OpenAI's Financials Leaked. They're Not Bad, but They're Not Great.",
"publisher": "Yahoo Finance",
"publisherDomain": "finance.yahoo.com",
"googleNewsUrl": "https://news.google.com/rss/articles/CBMinAFBVV95cUxPYkJmb0NVNWFtMnVMQ1BJNU16VDR…",
"publisherUrl": "https://finance.yahoo.com/technology/ai/articles/openai-financials-leaked-not-bad-113800414.html",
"publishedAt": "2026-06-17T11:38:00.000Z",
"publishedAtRaw": "Wed, 17 Jun 2026 11:38:00 GMT",
"language": "en",
"country": "US",
"guid": "CBMinAFBVV95cUxPYkJmb0NVNWFtMnVMQ1BJNU16…",
// Populated only when `enrichBody:true`
"body": "Microsoft Just Became Irresistibly Attractive… (full article text, plain text)",
"bodyHtml": "<article>…</article>",
"wordCount": 355,
"imageUrl": "https://media.example.com/lead-image.jpg",
"author": "Aseity Research",
"section": null,
"keywords": ["OpenAI", "AI economics", "earnings"],
"bodyExtracted": true,
// Populated when `extractEntities:true` (default true when enrichBody is on)
"entities": {
"people": [], // reserved — rule-based NER too noisy; not populated in v0.1
"orgs": ["OpenAI", "Microsoft"],
"tickers": [],
"locations": ["US"]
},
// Populated only when sourceKind === "cluster"
"clusterAggregate": null,
"scrapedAt": "2026-06-18T16:35:00.412Z"
}

Cluster-URL rows include a populated clusterAggregate:

{
"rowType": "news-article",
"sourceKind": "cluster",
"sourceInput": "https://news.google.com/stories/CA…",
// … all standard fields …
"clusterAggregate": {
"clusterId": "CA…",
"clusterTitle": "Full coverage: OpenAI Q3 earnings",
"sourceCount": 47, // 47 distinct publishers covered this story
"languages": ["en"],
"firstSeenAt": "2026-06-16T08:00:00.000Z",
"leadImage": null
}
}

Key Output Fields

FieldTypeAlways populated?Notes
titlestringCleaned title — the trailing - Publisher suffix from Google News titles is stripped.
publisherstring | nullusuallyHuman-readable publisher name (Yahoo Finance, Bloomberg).
publisherDomainstring | nullusuallyHostname only, www. stripped (finance.yahoo.com).
googleNewsUrlstringThe Google News redirect URL — the stable identifier across runs.
publisherUrlstring | nullwhen enrichBody:trueThe resolved publisher article URL.
publishedAtstring | nullusuallyISO 8601 timestamp.
guidstring | nullusuallyGoogle's per-article ID — useful for deduping across runs.
bodystring | nullwhen enrichBody:true and publisher isn't paywalledPlain text, whitespace-normalised.
wordCountnumber | nullwhen body is presentWhitespace-split count.
authorstring | nullwhen JSON-LD ships itComes from Article.author.name.
entities.orgs / .tickers / .locationsstring[]when extractEntities:trueLexicon-checked — very low false-positive rate.
entities.peoplestring[]reservedAlways [] in v0.1 — rule-based Capitalised-Bigram NER is too noisy to ship. A real NER path is on the v1.1 roadmap.
clusterAggregateobjectwhen sourceKind === "cluster"null otherwise.

FAQ

Q: Why don't all articles get a body extracted? A: Some publishers (Bloomberg, WSJ, FT, NYT premium tier) are hard-paywalled — the article URL renders an empty body or a login wall. We surface this honestly: bodyExtracted: false and body: null for paywalled articles. The Google News URL, publisher name, headline, and date are always present.

Q: How fresh is the data? A: RSS feeds are Google's live index — typically articles appear within 5-15 minutes of publication. No caching on our side.

Q: Will scraping the same query twice return duplicates? A: Filter on guid (or googleNewsUrl) to dedupe across runs — both are stable per-article identifiers.

Q: What happens if I pass a query with no results? A: The RSS feed returns 0 items, the actor logs 0 eligible items, no rows emitted, no events charged.

Q: Does the publisher follow-through use my Apify proxy credits twice? A: Yes — one HTTP hop to resolve the CBM redirect via batchexecute, one hop to fetch the publisher page. Apify Residential bytes scale linearly with enrichBody:true. Most publisher pages are 200-500 KB.

Q: Why is the people array always empty? A: Rule-based "Capitalised Bigram" detection produces too many false positives ("Microsoft Just", "Strong Buy") to be useful at scale. We intentionally skip it in v0.1. Proper NER (via LLM) is on the v1.1 roadmap.

Q: How is the date window enforced? A: since and until are applied AFTER the RSS feed is fetched but BEFORE rows are emitted, so you only pay for articles inside your window.

Q: Can I scrape Google News in another language? A: Yes — set country: "DE", language: "de" for German news, country: "JP", language: "ja" for Japanese, etc. Any valid Google News locale works.

Q: What's the difference between a topic URL and a cluster URL? A: A topic is a long-running stream (e.g. "Business", "Technology"). A cluster (Google calls it "Full coverage") is a single news story with multiple publishers covering it — cluster rows additionally carry clusterAggregate with sourceCount so you can measure how widely the story has been covered.


Support

Issues, feature requests, or custom output shapes? Open a ticket on the actor's Apify page or message the maintainer directly.


Additional Services

  • memo23/capterra-scraper — software product reviews from Capterra
  • memo23/trustradius-scraper — B2B software reviews from TrustRadius
  • memo23/glassdoor-scraper-ppr — jobs, reviews, salaries, interviews from Glassdoor

Explore More Scrapers

Looking for a similar pattern on a different site? See the full list of memo23 actors on the Apify Store.


⚠️ Disclaimer

This actor is intended for research, archival, and journalistic purposes. Every scraped article belongs to the publisher of record — respect their terms of service, copyright, and any robots.txt directives. The actor:

  • Honours robots.txt for publisher pages (Google News' own robots.txt allows the surfaces we use).
  • Does not bypass paywalls. If a publisher's article is paywalled, we leave body: null rather than attempt circumvention.
  • Does not store, redistribute, or republish article content beyond the per-run dataset delivered to the buyer.

Buyers are responsible for downstream use of scraped content under fair-use, licensing, and applicable data-protection law in their jurisdiction.


SEO Keywords

google news scraper, google news api alternative, news article extractor, news scraping, full article body scraper, publisher content extraction, news rag pipeline, brand monitoring, competitor news monitoring, financial news scraping, news sentiment data, news ner entities, news cluster aggregator, full coverage scraper, multi-publisher news, news article body, news json-ld, news article metadata, google news rss, real-time news, apify news scraper, bloomberg news, reuters news, wsj news, financial times news, yahoo finance news, seeking alpha articles, news search api, news topic feed, news cluster api, news language filter, news time window filter