# Google News Scraper · Full Article Bodies + Entities (`memo23/google-news-scraper`) Actor

Scrape Google News with full publisher article bodies, not just snippets. Decodes the CBM redirect to the publisher page, extracts Article JSON-LD (title, body, author, image, keywords), and runs entity extraction (orgs/tickers/locations). 4 input kinds. Flat per-article billing. Pure HTTP.

- **URL**: https://apify.com/memo23/google-news-scraper.md
- **Developed by:** [Muhamed Didovic](https://apify.com/memo23) (community)
- **Categories:** News, AI, Agents
- **Stats:** 16 total users, 15 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $2.50 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Google News Scraper — Articles, Bodies & Entities

#### How it works

![How Google News Scraper works](https://raw.githubusercontent.com/muhamed-didovic/muhamed-didovic.github.io/main/assets/how-it-works-google-news.png)

---

All-in-one Google News scraper. Paste any mix of search queries, topic URLs, publisher domains, or Full-coverage cluster URLs — the actor auto-classifies each input and emits one structured row per article. Optional follow-through resolves the Google News redirect to the real publisher URL and extracts the full article body, headline, author, section, image, and keywords from `Article` / `NewsArticle` JSON-LD (with a readability-style fallback when JSON-LD is missing).

| Input kind | Example | Becomes |
|---|---|---|
| **Search query** | `OpenAI earnings` | RSS feed for that query |
| **Topic URL** | `https://news.google.com/topics/CAAqJ…` | RSS feed for that topic |
| **Publisher domain** | `bloomberg.com` | RSS feed for `site:bloomberg.com` |
| **Cluster (Full coverage)** | `https://news.google.com/stories/CA…` | RSS feed for that story + nested `clusterAggregate` per row |

One row per article. JSON + CSV. Pure HTTP, no Puppeteer / Playwright / headless Chromium / third-party paywall-bypass service.

---

### Why use this Google News scraper

- **All four URL kinds in one actor.** No need to glue together a query scraper + topic scraper + publisher scraper — one input list, auto-classified.
- **Full publisher article body, not just snippets.** Google News only shows ~150-character snippets. Most competitor actors stop there. We resolve each Google News redirect (via the same `batchexecute` call Google's own client uses — no JS execution required) and extract the canonical body from the publisher's `Article` JSON-LD.
- **Structured entity extraction built in.** Rule-based extraction of orgs / tickers / locations from curated lexicons (~zero false positives). Lives on every body-enriched row at no extra cost.
- **Flat per-article billing.** One article = one row = one result. Buyer cost is `rows × per-result-price` — no nested-array math.
- **Locale-aware.** Pass `country` + `language` to scope the feed (e.g. `country: "DE", language: "de"` for German news). Time-window filters (`since` / `until`) drop articles outside the date range before they hit your billing.

---

### Overview

Field | Detail
--- | ---
Source | `news.google.com` (RSS feeds) + publisher pages (when `enrichBody:true`)
Pricing | $0.0025 per article (result) · +$0.0005 url-resolved · +$0.0005 body-enriched (Cloudflare/Akamai bypass included free)
Anti-bot | None on Google News RSS · anti-bot walls on publishers such as Bloomberg/FT/WSJ/Yahoo are bypassed automatically (paywalled content is not retrieved; see the FAQ)
Runtime | ~1-3 s per RSS query (100 articles per feed) · ~2-3 s extra per body-enriched article
Output | JSON + CSV, one row per article
Scope | Global · all Google News locales (`country` + `language` params)
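
The per-event prices in the Pricing row can be combined into a rough budget estimate. The sketch below is only an upper bound under two assumptions: the events are additive per article, and every article triggers every enabled event (in practice a failed resolution or extraction may not be charged). The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Per-event prices copied from the Pricing row above (USD).
PRICE_RESULT = Decimal("0.0025")
PRICE_URL_RESOLVED = Decimal("0.0005")
PRICE_BODY_ENRICHED = Decimal("0.0005")


def estimate_cost(articles: int, resolve_urls: bool = False, enrich_body: bool = False) -> Decimal:
    """Upper-bound cost: assumes each article triggers every enabled event."""
    per_article = PRICE_RESULT
    # enrichBody implies URL resolution, so the url-resolved event is counted too.
    if resolve_urls or enrich_body:
        per_article += PRICE_URL_RESOLVED
    if enrich_body:
        per_article += PRICE_BODY_ENRICHED
    return per_article * articles


print(estimate_cost(1000))                    # metadata only
print(estimate_cost(1000, enrich_body=True))  # with URL resolution and body enrichment
```

`Decimal` is used instead of floats so that fractions of a cent add up exactly.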

---

### Supported Inputs

The actor accepts every entry in your input list and auto-classifies it as one of four kinds. Mix and match freely.

| Input | Example | What you get |
|---|---|---|
| **Search query** (bare string) | `OpenAI earnings` | Up to 100 articles matching the query, ordered by Google News relevance |
| **Search query** (URL) | `https://news.google.com/search?q=OpenAI+earnings` | Same as above — URL form is canonicalised |
| **Topic URL** | `https://news.google.com/topics/CAAqJ…` | The most recent articles in that topic feed |
| **Publisher domain** | `bloomberg.com` / `www.bloomberg.com` | Up to 100 most recent Google-News-indexed articles from that publisher |
| **Cluster URL** (Full coverage) | `https://news.google.com/stories/CA…` | Every article in the cluster + nested `clusterAggregate` per row (source count, languages, first-seen timestamp) |
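
To make the auto-classification concrete, here is a minimal sketch of how one entry could be mapped to a kind. It is not the Actor's actual classifier: the domain regex, the handling of unsupported URLs, and the function name `classify_input` are all assumptions. The predefined topic names come from the input schema.

```python
import re
from urllib.parse import urlparse

# Predefined topic names listed in the input schema (matched case-insensitively).
PREDEFINED_TOPICS = {
    "BUSINESS", "TECHNOLOGY", "WORLD", "NATION",
    "ENTERTAINMENT", "SPORTS", "SCIENCE", "HEALTH",
}
# Assumed pattern for a bare hostname such as "bloomberg.com".
DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def classify_input(entry: str) -> str:
    """Return "query", "topic", "publisher", or "cluster" for one input entry."""
    text = entry.strip()
    if text.lower().startswith(("http://", "https://")):
        parsed = urlparse(text)
        if parsed.netloc.lower() == "news.google.com":
            if parsed.path.startswith("/topics/"):
                return "topic"
            if parsed.path.startswith("/stories/"):
                return "cluster"
            if parsed.path.startswith("/search"):
                return "query"
        # Assumption: anything else is rejected rather than guessed at.
        raise ValueError(f"unsupported URL: {text}")
    if text.upper() in PREDEFINED_TOPICS:
        return "topic"
    if DOMAIN_RE.match(text):
        return "publisher"
    return "query"  # any other bare string is treated as a search query
```

Checking URLs first keeps a query that happens to contain a dot-separated word from being mistaken for a Google News URL.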

---

### Use Cases

| Buyer | Why they want this |
|---|---|
| **RAG pipeline operators** | Need clean full-body articles with stable IDs and timestamps — much cleaner than scraping each publisher directly. |
| **Brand & competitor monitoring** | Track every mention of a brand across all news outlets — paste the brand as a query, get all coverage. |
| **Financial / equity research** | Per-article ticker + org extraction, time-window filtering, full body for sentiment models. |
| **Newsroom / media analysis** | Track topic clusters ("Full coverage") to measure how many distinct publishers covered a given story. |
| **Compliance / regulatory monitoring** | Filter by `since` / `until` to catch news within a reporting window. |
| **Sentiment / topic-modelling researchers** | One actor produces both the SERP-level metadata (publisher, date, language) and the body content needed for downstream NLP. |

---

### How It Works

Google News is **Cloudflare-free** for the RSS feeds we use. The actor breaks past the typical pitfalls of news scraping with a pure-HTTP pipeline:

1. **Resolve inputs to RSS feeds.** Every URL kind (search / topic / publisher / cluster) has a `/rss/...` variant. The classifier emits the canonical RSS URL and the rest of the pipeline only deals with that.
2. **Fetch RSS via impit Firefox + Apify Residential US.** First-attempt 200 OK; no warmup, no rotation.
3. **Parse RSS into per-article stubs.** Each item gives us `title`, `googleNewsUrl` (Google's redirect), `pubDate`, `publisher`, `publisherDomain` (via `<source url>`), and `guid`.
4. **Decode the Google News redirect → publisher URL** (when `enrichBody:true`). The CBM token in the Google News URL is base64-encoded protobuf with no plaintext URL inside — but we can call the same `/_/DotsSplashUi/data/batchexecute` endpoint Google News' own JS client uses, passing the signature + timestamp scraped from the interstitial. Zero browser, zero JS execution.
5. **Fetch the publisher article + extract canonical fields.** `Article` / `NewsArticle` / `BlogPosting` JSON-LD when present (most major publishers ship this); readability-style heuristic over `<article>`, `<main>`, `[itemprop="articleBody"]` when not.
6. **Run rule-based entity extraction.** Curated lexicons for orgs and locations + regex for stock tickers. High precision, low false-positive rate.

Cluster URLs additionally produce a denormalised `clusterAggregate` (source count, languages, first-seen) on every row from that cluster — useful for measuring story spread.
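
Step 3 can be illustrated with the standard library alone. The sketch below assumes the feed follows the usual RSS 2.0 item layout (`<title>`, `<link>`, `<guid>`, `<pubDate>`, `<source url>`); the sample feed, its `CBMexample` token, and the function name `parse_stubs` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from urllib.parse import urlparse

# Hand-written sample item; not real Google News output.
SAMPLE_RSS = """<rss version="2.0"><channel>
<item>
  <title>Example headline - Example Publisher</title>
  <link>https://news.google.com/rss/articles/CBMexample</link>
  <guid isPermaLink="false">CBMexample</guid>
  <pubDate>Wed, 17 Jun 2026 11:38:00 GMT</pubDate>
  <source url="https://www.example.com">Example Publisher</source>
</item>
</channel></rss>"""


def parse_stubs(rss_xml: str) -> list:
    """Turn each RSS <item> into a per-article stub dictionary."""
    stubs = []
    for item in ET.fromstring(rss_xml).iter("item"):
        source = item.find("source")
        host = urlparse(source.get("url", "")).hostname if source is not None else None
        if host and host.startswith("www."):
            host = host[4:]  # keep the hostname only, without "www."
        pub_raw = item.findtext("pubDate")
        stubs.append({
            "title": item.findtext("title"),
            "googleNewsUrl": item.findtext("link"),
            "guid": item.findtext("guid"),
            "publishedAtRaw": pub_raw,
            # RFC 2822 date converted to an ISO 8601 string.
            "publishedAt": parsedate_to_datetime(pub_raw).isoformat() if pub_raw else None,
            "publisher": source.text if source is not None else None,
            "publisherDomain": host,
        })
    return stubs


print(parse_stubs(SAMPLE_RSS)[0]["publisherDomain"])
```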

---

### Input Configuration

Field | Type | Required | Notes
--- | --- | --- | ---
`startInputs` | `string[]` | yes | Any mix of search queries, Google News URLs (search / topic / cluster), or publisher domains. Auto-classified.
`maxItems` | `integer` | no | Safety cap on total dataset rows. Default `500`. Free-tier users are capped at `50`.
`maxArticlesPerInput` | `integer` | no | Per-input cap. RSS feeds return up to ~100 articles. Default `100`, `0` = no per-input cap.
`enrichBody` | `boolean` | no | Follow each Google News redirect to the publisher and extract full body + Article JSON-LD. Default `false`. Charges an extra `body-enriched` event per article.
`extractEntities` | `boolean` | no | Rule-based entity extraction (orgs / tickers / locations). Defaults to `true` when `enrichBody` is on, `false` otherwise.
`country` | `string` | no | Google News country code (`gl=`). Default `US`.
`language` | `string` | no | Google News language code (`hl=`). Default `en`.
`since` | `string` | no | ISO date (`YYYY-MM-DD`). Drops articles older than this.
`until` | `string` | no | ISO date. Drops articles newer than this.
`proxy` | object | no | Apify Residential US recommended (and is the default). Google News is open enough that datacenter proxies usually work too, but residential keeps you out of edge-case rate-limits.

#### Example input

```json
{
  "startInputs": [
    "OpenAI earnings",
    "bloomberg.com",
    "https://news.google.com/topics/CAAqKggKIiRDQkFTRlFvSUwyMHZNRGx6TVdZU0JXVnVMVlZUR2dKVlV5Z0FQAQ"
  ],
  "maxItems": 100,
  "maxArticlesPerInput": 50,
  "enrichBody": true,
  "extractEntities": true,
  "country": "US",
  "language": "en",
  "since": "2026-06-01",
  "proxy": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"], "apifyProxyCountry": "US" }
}
````

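That input yields one row per matched article: each input contributes at most 50 rows (`maxArticlesPerInput`) and the run stops at 100 rows in total (`maxItems`). With `enrichBody` on, rows carry the full publisher body, author, section, image, and rule-based entities wherever the publisher page could be extracted.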

***

### Output Overview

One row per article, with the `rowType` discriminator set to `"news-article"`. Same shape regardless of which input kind produced it (cluster URLs add a `clusterAggregate` nested object; all other kinds leave it `null`).

***

### Output Samples

```jsonc
{
  "rowType":               "news-article",
  "sourceKind":            "query",                              // "query" | "topic" | "publisher" | "cluster"
  "sourceInput":           "OpenAI earnings",
  "sourceRssUrl":          "https://news.google.com/rss/search?q=OpenAI+earnings&hl=en-US&gl=US&ceid=US:en",

  "title":                 "OpenAI's Financials Leaked. They're Not Bad, but They're Not Great.",
  "publisher":             "Yahoo Finance",
  "publisherDomain":       "finance.yahoo.com",
  "googleNewsUrl":         "https://news.google.com/rss/articles/CBMinAFBVV95cUxPYkJmb0NVNWFtMnVMQ1BJNU16VDR…",
  "publisherUrl":          "https://finance.yahoo.com/technology/ai/articles/openai-financials-leaked-not-bad-113800414.html",
  "publishedAt":           "2026-06-17T11:38:00.000Z",
  "publishedAtRaw":        "Wed, 17 Jun 2026 11:38:00 GMT",
  "language":              "en",
  "country":               "US",
  "guid":                  "CBMinAFBVV95cUxPYkJmb0NVNWFtMnVMQ1BJNU16…",

  // Populated only when `enrichBody:true`
  "body":                  "Microsoft Just Became Irresistibly Attractive… (full article text, plain text)",
  "bodyHtml":              "<article>…</article>",
  "wordCount":             355,
  "imageUrl":              "https://media.example.com/lead-image.jpg",
  "author":                "Aseity Research",
  "section":               null,
  "keywords":              ["OpenAI", "AI economics", "earnings"],
  "bodyExtracted":         true,

  // Populated when `extractEntities:true` (default true when enrichBody is on)
  "entities": {
    "people":              [],                                   // reserved — rule-based NER too noisy; not populated in v0.1
    "orgs":                ["OpenAI", "Microsoft"],
    "tickers":             [],
    "locations":           ["US"]
  },

  // Populated only when sourceKind === "cluster"
  "clusterAggregate":      null,

  "scrapedAt":             "2026-06-18T16:35:00.412Z"
}
```

Cluster-URL rows include a populated `clusterAggregate`:

```jsonc
{
  "rowType":              "news-article",
  "sourceKind":           "cluster",
  "sourceInput":          "https://news.google.com/stories/CA…",
  // … all standard fields …
  "clusterAggregate": {
    "clusterId":          "CA…",
    "clusterTitle":       "Full coverage: OpenAI Q3 earnings",
    "sourceCount":        47,                                    // 47 distinct publishers covered this story
    "languages":          ["en"],
    "firstSeenAt":        "2026-06-16T08:00:00.000Z",
    "leadImage":          null
  }
}
```
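
If you want to recompute similar spread metrics yourself, for example over a filtered subset of rows, a sketch like the following may help. It assumes `sourceCount` counts distinct `publisherDomain` values and that `firstSeenAt` is the earliest `publishedAt`; the Actor's own definitions may differ, and `build_cluster_aggregate` is a hypothetical helper.

```python
def build_cluster_aggregate(rows: list) -> dict:
    """Summarise a list of article rows that belong to one cluster."""
    domains = {r["publisherDomain"] for r in rows if r.get("publisherDomain")}
    languages = sorted({r["language"] for r in rows if r.get("language")})
    timestamps = [r["publishedAt"] for r in rows if r.get("publishedAt")]
    return {
        "sourceCount": len(domains),
        "languages": languages,
        # ISO 8601 UTC strings in one fixed format sort chronologically as text.
        "firstSeenAt": min(timestamps) if timestamps else None,
    }
```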

***

### Key Output Fields

Field | Type | Always populated? | Notes
--- | --- | --- | ---
`title` | `string` | ✅ | Cleaned title — the trailing ` - Publisher` suffix from Google News titles is stripped.
`publisher` | `string \| null` | usually | Human-readable publisher name (`Yahoo Finance`, `Bloomberg`).
`publisherDomain` | `string \| null` | usually | Hostname only, `www.` stripped (`finance.yahoo.com`).
`googleNewsUrl` | `string` | ✅ | The Google News redirect URL — the stable identifier across runs.
`publisherUrl` | `string \| null` | when `enrichBody:true` | The resolved publisher article URL.
`publishedAt` | `string \| null` | usually | ISO 8601 timestamp.
`guid` | `string \| null` | usually | Google's per-article ID — useful for deduping across runs.
`body` | `string \| null` | when `enrichBody:true` and publisher isn't paywalled | Plain text, whitespace-normalised.
`wordCount` | `number \| null` | when body is present | Whitespace-split count.
`author` | `string \| null` | when JSON-LD ships it | Comes from `Article.author.name`.
`entities.orgs` / `.tickers` / `.locations` | `string[]` | when `extractEntities:true` | Lexicon-checked — very low false-positive rate.
`entities.people` | `string[]` | reserved | Always `[]` in v0.1 — rule-based Capitalised-Bigram NER is too noisy to ship. A real NER path is on the v1.1 roadmap.
`clusterAggregate` | object | when `sourceKind === "cluster"` | `null` otherwise.
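
The `title` and `wordCount` rules in the table are simple enough to sketch. This is an illustration rather than the Actor's code: it assumes the suffix is stripped only when it matches the known publisher name exactly, so that titles containing their own ` - ` are left alone. Both function names are hypothetical.

```python
from typing import Optional


def clean_title(raw_title: str, publisher: Optional[str]) -> str:
    """Strip the trailing " - Publisher" suffix that Google News appends."""
    suffix = f" - {publisher}" if publisher else None
    if suffix and raw_title.endswith(suffix):
        return raw_title[: -len(suffix)].rstrip()
    return raw_title.strip()


def word_count(body: Optional[str]) -> Optional[int]:
    """Whitespace-split word count; None when there is no body."""
    return len(body.split()) if body else None
```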

***

### FAQ

**Q: Why don't all articles get a body extracted?**
A: Some publishers (Bloomberg, WSJ, FT, NYT premium tier) are hard-paywalled — the article URL renders an empty body or a login wall. We surface this honestly: `bodyExtracted: false` and `body: null` for paywalled articles. The Google News URL, publisher name, headline, and date are always present.

**Q: How fresh is the data?**
A: RSS feeds are Google's live index, so articles typically appear within 5-15 minutes of publication. No caching on our side.

**Q: Will scraping the same query twice return duplicates?**
A: Filter on `guid` (or `googleNewsUrl`) to dedupe across runs — both are stable per-article identifiers.
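
A minimal client-side dedupe along those lines, assuming you keep the set of already-seen keys between runs yourself (the helper name `dedupe_rows` is hypothetical):

```python
def dedupe_rows(rows, seen=None):
    """Return only rows whose guid (or googleNewsUrl) has not been seen before."""
    seen = set() if seen is None else seen
    fresh = []
    for row in rows:
        # Prefer guid; fall back to googleNewsUrl, which is always populated.
        key = row.get("guid") or row.get("googleNewsUrl")
        if key in seen:
            continue
        seen.add(key)
        fresh.append(row)
    return fresh
```

Passing the same `seen` set to successive calls carries the dedupe state from one run to the next.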

**Q: What happens if I pass a query with no results?**
A: The RSS feed returns 0 items and the actor logs `0 eligible items`; no rows are emitted and no events are charged.

**Q: Does the publisher follow-through use my Apify proxy credits twice?**
A: Yes — one HTTP hop to resolve the CBM redirect via `batchexecute`, one hop to fetch the publisher page. Apify Residential bytes scale linearly with `enrichBody:true`. Most publisher pages are 200-500 KB.

**Q: Why is the `people` array always empty?**
A: Rule-based "Capitalised Bigram" detection produces too many false positives ("Microsoft Just", "Strong Buy") to be useful at scale. We intentionally skip it in v0.1. Proper NER (via LLM) is on the v1.1 roadmap.

**Q: How is the date window enforced?**
A: `since` and `until` are applied AFTER the RSS feed is fetched but BEFORE rows are emitted, so you only pay for articles inside your window.

**Q: Can I scrape Google News in another language?**
A: Yes — set `country: "DE", language: "de"` for German news, `country: "JP", language: "ja"` for Japanese, etc. Any valid Google News locale works.

**Q: What's the difference between a topic URL and a cluster URL?**
A: A *topic* is a long-running stream (e.g. "Business", "Technology"). A *cluster* (Google calls it "Full coverage") is a single news story with multiple publishers covering it — cluster rows additionally carry `clusterAggregate` with `sourceCount` so you can measure how widely the story has been covered.

***

### Support

Issues, feature requests, or custom output shapes? Open a ticket on the actor's Apify page or message the maintainer directly.

***

### Additional Services

- **memo23/capterra-scraper** — software product reviews from Capterra
- **memo23/trustradius-scraper** — B2B software reviews from TrustRadius
- **memo23/glassdoor-scraper-ppr** — jobs, reviews, salaries, interviews from Glassdoor

***

### Explore More Scrapers

Looking for a similar pattern on a different site? See the full list of memo23 actors on the Apify Store.

***

### ⚠️ Disclaimer

This actor is intended for **research, archival, and journalistic purposes**. Every scraped article belongs to the publisher of record — respect their terms of service, copyright, and any robots.txt directives. The actor:

- Honours `robots.txt` for publisher pages (Google News' own `robots.txt` allows the surfaces we use).
- Does not bypass paywalls. If a publisher's article is paywalled, we leave `body: null` rather than attempt circumvention.
- Does not store, redistribute, or republish article content beyond the per-run dataset delivered to the buyer.

Buyers are responsible for downstream use of scraped content under fair-use, licensing, and applicable data-protection law in their jurisdiction.

***

### SEO Keywords

google news scraper, google news api alternative, news article extractor, news scraping, full article body scraper, publisher content extraction, news rag pipeline, brand monitoring, competitor news monitoring, financial news scraping, news sentiment data, news ner entities, news cluster aggregator, full coverage scraper, multi-publisher news, news article body, news json-ld, news article metadata, google news rss, real-time news, apify news scraper, bloomberg news, reuters news, wsj news, financial times news, yahoo finance news, seeking alpha articles, news search api, news topic feed, news cluster api, news language filter, news time window filter

# Actor input Schema

## `startInputs` (type: `array`):

Each entry is auto-classified:

• **Search query** — bare string, e.g. `OpenAI earnings`
• **Predefined topic name** — `BUSINESS`, `TECHNOLOGY`, `WORLD`, `NATION`, `ENTERTAINMENT`, `SPORTS`, `SCIENCE`, `HEALTH` (case-insensitive)
• **Topic URL** — `https://news.google.com/topics/CAAqJ…`
• **Publisher domain** — bare hostname, e.g. `bloomberg.com`
• **Full-coverage cluster URL** — `https://news.google.com/stories/CA…`

Google search operators work too: `-trump`, `"exact phrase"`, `site:wsj.com`, `OpenAI OR Anthropic`.

## `maxItems` (type: `integer`):

Safety cap on the entire run. Each article = one row. Default 500.

## `maxArticlesPerInput` (type: `integer`):

Per-input cap. Google News RSS feeds return up to ~100 articles per query. Default 100; set 0 for no per-input cap (overall `maxItems` still applies).
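
The interaction between the two caps can be sketched as follows. It assumes inputs are processed in order and that the global cap is applied first come, first served; the Actor's real scheduling may differ, and `apply_caps` is a hypothetical name.

```python
def apply_caps(articles_by_input: dict, max_items: int = 500, max_articles_per_input: int = 100) -> list:
    """Apply the per-input cap, then stop once the run-wide cap is reached."""
    rows = []
    for articles in articles_by_input.values():
        # 0 means "no per-input cap"; the run-wide cap below still applies.
        capped = articles if max_articles_per_input == 0 else articles[:max_articles_per_input]
        for article in capped:
            if len(rows) >= max_items:
                return rows
            rows.append(article)
    return rows
```

With three inputs of 80 articles each, `max_articles_per_input=50` and `max_items=100` keep 50 rows from the first input and 50 from the second, and none from the third.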

## `resolveUrls` (type: `boolean`):

Decode each Google News CBM redirect to its real publisher URL using the same batchexecute endpoint the Google News client uses. One HTTP hop per article. Cheaper than `enrichBody`. Implied by `enrichBody:true`. Charges a `url-resolved` event.

## `enrichBody` (type: `boolean`):

ALSO fetch the publisher page and extract the full body, headline, image, author, section, keywords, wordCount from Article/NewsArticle JSON-LD (with Mozilla Readability as the fallback algorithm — the same one Firefox Reader View uses). Adds a second HTTP hop per article on top of url-resolved. Charges a `body-enriched` event. Recommended for LLM/RAG buyers.
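
For readers curious what the JSON-LD path involves, here is a standard-library sketch. It covers only a subset of the fields, omits the Readability fallback entirely, and is not the Actor's implementation; the helper names and the choice of schema.org properties (`headline`, `articleBody`, `articleSection`) are assumptions.

```python
import json
from html.parser import HTMLParser

ARTICLE_TYPES = {"Article", "NewsArticle", "BlogPosting"}


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list while inside a JSON-LD script, else None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and (dict(attrs).get("type") or "").lower() == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def extract_article(html: str):
    """Return fields from the first Article-like JSON-LD node, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: try the next one
        if isinstance(data, dict):
            nodes = data.get("@graph", [data])
        elif isinstance(data, list):
            nodes = data
        else:
            continue
        for node in nodes:
            if not isinstance(node, dict):
                continue
            types = node.get("@type")
            types = types if isinstance(types, list) else [types]
            if not ARTICLE_TYPES.intersection(t for t in types if isinstance(t, str)):
                continue
            author = node.get("author")
            if isinstance(author, list):
                author = author[0] if author else None
            return {
                "title": node.get("headline"),
                "body": node.get("articleBody"),
                "author": author.get("name") if isinstance(author, dict) else author,
                "section": node.get("articleSection"),
                "keywords": node.get("keywords"),
            }
    return None
```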

## `enableCfBypass` (type: `boolean`):

When impit Firefox + Chrome both fail on a publisher (Bloomberg, NYT, WSJ, FT, sometimes Yahoo Finance), fall back to a CF-bypass service for the final attempt. No setup on your side — the bypass token is built into the actor, and there is **no extra charge** (the cost is absorbed into the base per-article price).

**Defaults to ON whenever `enrichBody` is enabled** (the publishers worth reading sit behind Cloudflare, so bodies should just arrive). Set to `false` to skip walled publishers.

## `extractEntities` (type: `boolean`):

Rule-based entity extraction on the article body or title. Cheap, no extra HTTP cost. Defaults to ON when `enrichBody` is enabled (body gives much better recall than title alone).
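
A toy version of lexicon-plus-regex extraction is sketched below. The two lexicons are tiny placeholders (the Actor's curated lists are not shown in this document), and the ticker conventions matched by `TICKER_RE` (`$MSFT` and `(NASDAQ: MSFT)`) are assumptions chosen for illustration.

```python
import re

# Placeholder lexicons for illustration only.
ORG_LEXICON = {"OpenAI", "Microsoft"}
LOCATION_LEXICON = {"US", "Germany"}
# Assumed ticker notations: "$MSFT" or "(NASDAQ: MSFT)" / "(NYSE: IBM)".
TICKER_RE = re.compile(r"\$([A-Z]{1,5})\b|\((?:NASDAQ|NYSE):\s*([A-Z]{1,5})\)")


def find_terms(text: str, lexicon: set) -> list:
    """Case-sensitive whole-word lookup of every lexicon term."""
    return [t for t in sorted(lexicon) if re.search(rf"\b{re.escape(t)}\b", text)]


def extract_entities(text: str) -> dict:
    tickers = []
    for dollar_form, exchange_form in TICKER_RE.findall(text):
        symbol = dollar_form or exchange_form
        if symbol not in tickers:  # keep first-seen order, no duplicates
            tickers.append(symbol)
    return {
        "people": [],  # left empty, matching the reserved field in the output
        "orgs": find_terms(text, ORG_LEXICON),
        "tickers": tickers,
        "locations": find_terms(text, LOCATION_LEXICON),
    }
```

Matching only against a closed lexicon is what keeps false positives low: a capitalised phrase that is not in the list is never reported.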

## `regionLanguage` (type: `string`):

Convenience input — combined Google News locale token. Examples: `US:en`, `GB:en`, `DE:de`, `JP:ja`, `FR:fr`, `ES:es`, `IN:en`, `AU:en`, `BR:pt`. When set, overrides separate `country` + `language` below.
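
The precedence rule, and how a locale ends up in a search feed URL, can be sketched like this. The `hl=<language>-<country>` form is generalised from the single `sourceRssUrl` sample in this document (`hl=en-US&gl=US&ceid=US:en`), so treat it as an assumption for other locales; both function names are hypothetical.

```python
from urllib.parse import urlencode


def resolve_locale(region_language=None, country="US", language="en"):
    """Return (country, language); regionLanguage wins when it is set."""
    if region_language:
        if ":" not in region_language:
            raise ValueError("regionLanguage must look like 'US:en'")
        country, _, language = region_language.partition(":")
    return country.upper(), language.lower()


def search_rss_url(query, region_language=None, country="US", language="en"):
    """Build a search RSS URL in the same shape as the sourceRssUrl sample."""
    gl, lang = resolve_locale(region_language, country, language)
    params = {"q": query, "hl": f"{lang}-{gl}", "gl": gl, "ceid": f"{gl}:{lang}"}
    # safe=":" keeps the colon in ceid unescaped, as in the sample URL.
    return "https://news.google.com/rss/search?" + urlencode(params, safe=":")


print(search_rss_url("OpenAI earnings"))
```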

## `timeframe` (type: `string`):

Restrict to articles from this window via Google News `when:` operator. Applies to search queries + publisher inputs; topic / cluster feeds ignore this (their windows are fixed by Google). Use `since` / `until` for exact date ranges.

## `country` (type: `string`):

ISO country code. Used only when `regionLanguage` isn't set. Examples: `US`, `GB`, `DE`, `IN`, `AU`. Default `US`.

## `language` (type: `string`):

ISO language code. Used only when `regionLanguage` isn't set. Examples: `en`, `de`, `ja`, `es`, `fr`. Default `en`.

## `since` (type: `string`):

Optional. Skip articles with `publishedAt < since`. Format: `YYYY-MM-DD` or full ISO 8601.

## `until` (type: `string`):

Optional. Skip articles with `publishedAt > until`. Format: `YYYY-MM-DD` or full ISO 8601.
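
The two comparisons can be reproduced client-side as below. This sketch assumes date-only values are read as midnight UTC and that rows without `publishedAt` are kept; the Actor may treat both cases differently. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def parse_iso(value: str) -> datetime:
    """Parse YYYY-MM-DD or full ISO 8601 (with a trailing Z) as an aware datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assumption: values without an offset are interpreted as UTC.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def in_window(published_at, since=None, until=None) -> bool:
    """True when the article should be kept for the given since/until window."""
    if published_at is None:
        return True  # assumption: undated rows are not dropped
    ts = parse_iso(published_at)
    if since and ts < parse_iso(since):
        return False  # publishedAt < since
    if until and ts > parse_iso(until):
        return False  # publishedAt > until
    return True
```

Under the midnight assumption, `until="2026-06-17"` would exclude an article published later that same day, so pass a full timestamp if you need the whole day included.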

## `proxy` (type: `object`):

Apify Residential US is the default and is sufficient — Google News + most publisher sites are open to standard residential traffic.

## `maxRequestRetries` (type: `integer`):

Reserved. Maximum HTTP retry attempts per URL.

## Actor input object example

```json
{
  "startInputs": [
    "OpenAI earnings",
    "BUSINESS",
    "bloomberg.com"
  ],
  "maxItems": 500,
  "maxArticlesPerInput": 100,
  "resolveUrls": false,
  "enrichBody": false,
  "enableCfBypass": true,
  "extractEntities": false,
  "regionLanguage": "US:en",
  "timeframe": "all",
  "country": "US",
  "language": "en",
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ],
    "apifyProxyCountry": "US"
  },
  "maxRequestRetries": 4
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startInputs": [
        "OpenAI earnings",
        "BUSINESS",
        "bloomberg.com"
    ],
    "proxy": {
        "useApifyProxy": true,
        "apifyProxyGroups": [
            "RESIDENTIAL"
        ],
        "apifyProxyCountry": "US"
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("memo23/google-news-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startInputs": [
        "OpenAI earnings",
        "BUSINESS",
        "bloomberg.com",
    ],
    "proxy": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
        "apifyProxyCountry": "US",
    },
}

# Run the Actor and wait for it to finish
run = client.actor("memo23/google-news-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startInputs": [
    "OpenAI earnings",
    "BUSINESS",
    "bloomberg.com"
  ],
  "proxy": {
    "useApifyProxy": true,
    "apifyProxyGroups": [
      "RESIDENTIAL"
    ],
    "apifyProxyCountry": "US"
  }
}' |
apify call memo23/google-news-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=memo23/google-news-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Google News Scraper · Full Article Bodies + Entities",
        "description": "Scrape Google News with full publisher article bodies, not just snippets. Decodes the CBM redirect to the publisher page, extracts Article JSON-LD (title, body, author, image, keywords), and runs entity extraction (orgs/tickers/locations). 4 input kinds. Flat per-article billing. Pure HTTP",
        "version": "0.1",
        "x-build-id": "7CVlrDTKPcQUiJEas"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/memo23~google-news-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-memo23-google-news-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/memo23~google-news-scraper/runs": {
            "post": {
                "operationId": "runs-sync-memo23-google-news-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/memo23~google-news-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-memo23-google-news-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startInputs"
                ],
                "properties": {
                    "startInputs": {
                        "title": "Inputs (any mix)",
                        "type": "array",
                        "description": "Each entry is auto-classified:\n\n• **Search query** — bare string, e.g. `OpenAI earnings`\n• **Predefined topic name** — `BUSINESS`, `TECHNOLOGY`, `WORLD`, `NATION`, `ENTERTAINMENT`, `SPORTS`, `SCIENCE`, `HEALTH` (case-insensitive)\n• **Topic URL** — `https://news.google.com/topics/CAAqJ…`\n• **Publisher domain** — bare hostname, e.g. `bloomberg.com`\n• **Full-coverage cluster URL** — `https://news.google.com/stories/CA…`\n\nGoogle search operators work too: `-trump`, `\"exact phrase\"`, `site:wsj.com`, `OpenAI OR Anthropic`.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxItems": {
                        "title": "Max total rows",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Safety cap on the entire run. Each article = one row. Default 500.",
                        "default": 500
                    },
                    "maxArticlesPerInput": {
                        "title": "Max articles per input",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Per-input cap. Google News RSS feeds return up to ~100 articles per query. Default 100; set 0 for no per-input cap (overall `maxItems` still applies).",
                        "default": 100
                    },
                    "resolveUrls": {
                        "title": "Resolve Google News redirect → publisher URL",
                        "type": "boolean",
                        "description": "Decode each Google News CBM redirect to its real publisher URL using the same batchexecute endpoint the Google News client uses. One HTTP hop per article. Cheaper than `enrichBody`. Implied by `enrichBody:true`. Charges a `url-resolved` event.",
                        "default": false
                    },
                    "enrichBody": {
                        "title": "Enrich with publisher article body",
                        "type": "boolean",
                        "description": "ALSO fetch the publisher page and extract the full body, headline, image, author, section, keywords, wordCount from Article/NewsArticle JSON-LD (with Mozilla Readability as the fallback algorithm — the same one Firefox Reader View uses). Adds a second HTTP hop per article on top of url-resolved. Charges a `body-enriched` event. Recommended for LLM/RAG buyers.",
                        "default": false
                    },
                    "enableCfBypass": {
                        "title": "Cloudflare-bypass for hard-anti-bot publishers",
                        "type": "boolean",
                        "description": "When impit Firefox + Chrome both fail on a publisher (Bloomberg, NYT, WSJ, FT, sometimes Yahoo Finance), fall back to a CF-bypass service for the final attempt. No setup on your side — the bypass token is built into the actor, and there is **no extra charge** (the cost is absorbed into the base per-article price).\n\n**Defaults to ON whenever `enrichBody` is enabled** (the publishers worth reading sit behind Cloudflare, so bodies should just arrive). Set to `false` to skip walled publishers.",
                        "default": true
                    },
                    "extractEntities": {
                        "title": "Extract named entities (people / orgs / tickers / locations)",
                        "type": "boolean",
                        "description": "Rule-based entity extraction on the article body or title. Cheap, no extra HTTP cost. Defaults to ON when `enrichBody` is enabled (body gives much better recall than title alone).",
                        "default": false
                    },
                    "regionLanguage": {
                        "title": "Region + language (combined)",
                        "type": "string",
                        "description": "Convenience input — combined Google News locale token. Examples: `US:en`, `GB:en`, `DE:de`, `JP:ja`, `FR:fr`, `ES:es`, `IN:en`, `AU:en`, `BR:pt`. When set, overrides separate `country` + `language` below.",
                        "default": "US:en"
                    },
                    "timeframe": {
                        "title": "Quick-select timeframe",
                        "enum": [
                            "all",
                            "last_hour",
                            "last_day",
                            "last_week",
                            "last_month",
                            "last_year"
                        ],
                        "type": "string",
                        "description": "Restrict to articles from this window via Google News `when:` operator. Applies to search queries + publisher inputs; topic / cluster feeds ignore this (their windows are fixed by Google). Use `since` / `until` for exact date ranges.",
                        "default": "all"
                    },
                    "country": {
                        "title": "Country (Google News gl=) — advanced override",
                        "type": "string",
                        "description": "ISO country code. Used only when `regionLanguage` isn't set. Examples: `US`, `GB`, `DE`, `IN`, `AU`. Default `US`.",
                        "default": "US"
                    },
                    "language": {
                        "title": "Language (Google News hl=) — advanced override",
                        "type": "string",
                        "description": "ISO language code. Used only when `regionLanguage` isn't set. Examples: `en`, `de`, `ja`, `es`, `fr`. Default `en`.",
                        "default": "en"
                    },
                    "since": {
                        "title": "Drop articles older than (ISO date)",
                        "type": "string",
                        "description": "Optional. Skip articles with `publishedAt < since`. Format: `YYYY-MM-DD` or full ISO 8601."
                    },
                    "until": {
                        "title": "Drop articles newer than (ISO date)",
                        "type": "string",
                        "description": "Optional. Skip articles with `publishedAt > until`. Format: `YYYY-MM-DD` or full ISO 8601."
                    },
                    "proxy": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Apify Residential US is the default and is sufficient — Google News + most publisher sites are open to standard residential traffic.",
                        "default": {
                            "useApifyProxy": true,
                            "apifyProxyGroups": [
                                "RESIDENTIAL"
                            ]
                        }
                    },
                    "maxRequestRetries": {
                        "title": "Max request retries",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Reserved. Maximum HTTP retry attempts per URL.",
                        "default": 4
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
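
The `inputSchema` above constrains a handful of fields (`startInputs` is required, `maxItems` has a minimum of 1, `maxArticlesPerInput` a minimum of 0, and `timeframe` is an enum). As a minimal sketch, an input object could be assembled and checked locally before a run is started. The helper name `build_run_input` and its keyword arguments are hypothetical and not part of the Actor; only the JSON keys and limits come from the schema.

```python
from __future__ import annotations

from typing import Any

# Allowed values copied from the `timeframe` enum in the input schema.
TIMEFRAMES = {"all", "last_hour", "last_day", "last_week", "last_month", "last_year"}


def build_run_input(
    start_inputs: list[str],
    *,
    max_items: int = 500,
    max_articles_per_input: int = 100,
    enrich_body: bool = False,
    timeframe: str = "all",
    region_language: str = "US:en",
) -> dict[str, Any]:
    """Hypothetical helper: assemble an input object and check schema limits."""
    # `startInputs` is an array of strings in the schema.
    if not all(isinstance(entry, str) for entry in start_inputs):
        raise TypeError("every startInputs entry must be a string")
    # `maxItems` has "minimum": 1.
    if max_items < 1:
        raise ValueError("maxItems must be at least 1")
    # `maxArticlesPerInput` has "minimum": 0 (0 means no per-input cap).
    if max_articles_per_input < 0:
        raise ValueError("maxArticlesPerInput must be 0 or greater")
    if timeframe not in TIMEFRAMES:
        raise ValueError(f"timeframe must be one of {sorted(TIMEFRAMES)}")

    return {
        "startInputs": list(start_inputs),
        "maxItems": max_items,
        "maxArticlesPerInput": max_articles_per_input,
        "enrichBody": enrich_body,
        "timeframe": timeframe,
        "regionLanguage": region_language,
    }


if __name__ == "__main__":
    # Mix of a search query and a publisher domain, as the schema allows.
    print(build_run_input(["OpenAI earnings", "bloomberg.com"], enrich_body=True))
```

The resulting dictionary is plain JSON-serializable data, so it can be sent as the request body of the endpoint described above or, assuming the official Python client is installed, passed as `run_input` to `ApifyClient(token).actor("memo23/google-news-scraper").call(...)`.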

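The `runsResponseSchema` above nests everything under a top-level `data` object. A small sketch of pulling out the fields most integrations need (run ID, status, and the default storage IDs) might look like the following. The function name `summarize_run` and all sample values are made up for illustration; only the key names come from the schema.

```python
from __future__ import annotations

from typing import Any


def summarize_run(response: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: extract commonly used fields from a run response."""
    # Every field in the schema is optional, so missing keys become None.
    data = response.get("data", {})
    return {
        "runId": data.get("id"),
        "status": data.get("status"),
        "datasetId": data.get("defaultDatasetId"),
        "keyValueStoreId": data.get("defaultKeyValueStoreId"),
        "usageTotalUsd": data.get("usageTotalUsd"),
    }


if __name__ == "__main__":
    # Illustrative payload shaped like runsResponseSchema; the IDs are invented.
    sample = {
        "data": {
            "id": "run-123",
            "status": "READY",
            "defaultDatasetId": "dataset-456",
            "defaultKeyValueStoreId": "store-789",
            "usageTotalUsd": 0.00005,
        }
    }
    print(summarize_run(sample))
```

The `datasetId` is the value to use when fetching the scraped article rows, since each article is stored as one dataset row.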