AI News Content Crawler

Pricing: from $1.00 / 1,000 results

Developer: Fabio Borsotti (Maintained by Community)

Google News RSS Article Text Extractor

This Apify actor searches Google News RSS for one or more country/language editions, collects article URLs, opens each news page, extracts clean text, and stores one dataset item per processed URL. It uses Google News RSS discovery plus Playwright-based extraction with automatic HTTP fallback when browser rendering fails.

What it does

  • Accepts a Google News search query, including advanced operators such as intitle, inurl, site, -, quoted exact matches, AND, and OR.
  • Accepts one or more languageAndRegion pairs in the format CC:ll, for example IT:it, US:en, LT:lt, and uses them to build the Google News RSS parameters gl, hl, and ceid (see the sketch after this list).
  • Searches Google News RSS for each selected edition and decodes Google News article links into publisher URLs using googlenewsdecoder.
  • Limits discovery to a maximum number of unique news sites per language/region pair through maxSites.
  • Uses daysBack as the RSS search window and sets chunkDays = daysBack, so each edition is queried in a single time chunk.
  • Opens discovered article URLs with Playwright, extracts the main page text, and falls back to HTTP + BeautifulSoup text extraction if Playwright fails.
  • Builds an Accept-Language header from the selected edition language and includes English as a fallback for non-English editions.
  • Processes multiple pages in parallel using maxConcurrency, capped at 50 open Playwright pages at a time.
  • Writes one dataset item per processed URL and emits detailed performance logs for RSS discovery, URL decoding, browser extraction, fallback extraction, and overall actor runtime.
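
For illustration, a search query and a CC:ll pair map to a Google News RSS search URL roughly like this. This is a minimal sketch of the documented behavior, not the actor's actual source; the parameter layout follows the commonly used news.google.com/rss/search format.

from urllib.parse import quote_plus


def build_rss_url(search_query: str, pair: str) -> str:
    """Build the Google News RSS search URL for one country/language edition."""
    country, lang = pair.split(":")  # "IT:it" -> gl=IT, hl=it, ceid=IT:it
    return (
        "https://news.google.com/rss/search"
        f"?q={quote_plus(search_query)}&hl={lang}&gl={country}&ceid={country}:{lang}"
    )

# build_rss_url('"samsung galaxy s25" -ultra', "IT:it")
# -> https://news.google.com/rss/search?q=%22samsung+galaxy+s25%22+-ultra&hl=it&gl=IT&ceid=IT:it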

Input

Input fields

  • searchQuery (string, required): Search query written exactly as you would type it into the Google News search bar. Advanced operators such as intitle, inurl, site, the exclude operator -, exact match with double quotes, AND, and OR are supported.
  • maxConcurrency (integer, optional): Maximum number of pages opened simultaneously in the Playwright browser, capped at 50. This value is the number of browser pages or tabs processed in parallel, not the number of system threads.
  • languageAndRegion (array of strings, required): One or more country/language pairs in the format CC:ll, for example IT:it, US:en, LT:lt. If you specify more than one pair, up to maxSites sites are returned for each pair. The input schema offers a practical whitelist of commonly used Google News country/language pairs; it is not an official or complete list of all combinations supported by Google News.
  • maxSites (integer, optional): Maximum number of unique news sites returned for each language/region pair. The RSS feed may find more results, but the domain deduplication filter stops once it reaches maxSites unique domains.
  • daysBack (integer, optional): Number of days back to search in Google News RSS. chunkDays is always set equal to daysBack.
  • decodeInterval (integer, optional): Interval passed to googlenewsdecoder; the default is 1. It controls how often the decoder attempts to resolve a Google News link before considering it failed or moving on to the next attempt. A value of 1 keeps a short pause between attempts, which helps avoid overloading the service and reduces the risk of errors or rate limits; increasing the value slows the process but can be more stable when decoding many links in sequence.

Example input

{
  "searchQuery": "\"samsung galaxy s25\" -ultra",
  "maxConcurrency": 10,
  "languageAndRegion": ["IT:it", "US:en", "LT:lt"],
  "maxSites": 50,
  "daysBack": 7,
  "decodeInterval": 1
}

How it works

The actor runs in two stages. First, it queries Google News RSS for each selected language/region pair, applies the after: and before: filters derived from daysBack, decodes Google News links into publisher URLs, and deduplicates the results.
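
A minimal sketch of this first stage, reusing build_rss_url from the earlier sketch and feedparser for RSS parsing. The decode step is a runnable placeholder for the googlenewsdecoder call, and the query operators follow the description above rather than the actor's actual source.

from datetime import date, timedelta
from urllib.parse import urlparse

import feedparser  # third-party RSS parser, assumed available


def decode_google_news_url(link: str) -> str:
    # Placeholder: the actor resolves news.google.com links to publisher URLs
    # with googlenewsdecoder; returning the link unchanged keeps the sketch runnable.
    return link


def discover(search_query: str, pair: str, days_back: int, max_sites: int) -> list[str]:
    """Return up to max_sites publisher URLs, one per unique news domain."""
    after = (date.today() - timedelta(days=days_back)).isoformat()
    before = date.today().isoformat()
    # Single time chunk covering the whole window (chunkDays == daysBack).
    query = f"{search_query} after:{after} before:{before}"
    feed = feedparser.parse(build_rss_url(query, pair))

    seen_domains: set[str] = set()
    urls: list[str] = []
    for entry in feed.entries:
        url = decode_google_news_url(entry.link)
        domain = urlparse(url).netloc
        if domain in seen_domains:
            continue  # deduplicate by news site
        seen_domains.add(domain)
        urls.append(url)
        if len(seen_domains) >= max_sites:
            break  # stop once maxSites unique domains are reached
    return urls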

Second, it opens each discovered news URL with Playwright, attempts to extract the main text from article-like selectors such as article, main, and common content containers, and falls back to a plain HTTP request plus HTML text cleanup when browser extraction fails.
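
The second stage can be sketched roughly as follows, assuming Playwright's async API, httpx, and BeautifulSoup. Selector order, timeouts, and error handling are simplified assumptions rather than the actor's actual source, and unlike the actor this sketch launches a browser per URL instead of sharing one.

import httpx
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright


async def extract_text(url: str, accept_language: str) -> dict:
    """Try Playwright first; fall back to a plain HTTP request if the browser fails."""
    try:
        async with async_playwright() as pw:
            browser = await pw.chromium.launch()
            page = await browser.new_page(extra_http_headers={"Accept-Language": accept_language})
            response = await page.goto(url, wait_until="domcontentloaded")
            text = ""
            # Prefer article-like containers, then fall back to the whole body.
            for selector in ("article", "main", "body"):
                if await page.locator(selector).count():
                    text = await page.locator(selector).first.inner_text()
                    break
            await browser.close()
            return {"url": url, "success": True,
                    "statusCode": response.status if response else None,
                    "text": text, "extractor": "playwright", "error": None}
    except Exception as exc:
        # Browser rendering failed: plain HTTP request plus HTML text cleanup.
        async with httpx.AsyncClient(headers={"Accept-Language": accept_language},
                                     follow_redirects=True) as client:
            resp = await client.get(url)
            soup = BeautifulSoup(resp.text, "html.parser")
            return {"url": url, "success": True, "statusCode": resp.status_code,
                    "text": soup.get_text(separator=" ", strip=True),
                    "extractor": "httpx-fallback", "error": f"Playwright failed: {exc}"}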

Concurrency

The actor uses a shared Playwright browser and limits concurrent page work with asyncio.Semaphore(maxConcurrency), so maxConcurrency represents the maximum number of browser pages or tabs processed in parallel, not the number of OS threads.

This makes concurrency easy to reason about operationally: if maxConcurrency is set to 10, the actor will process up to ten pages at the same time in Chromium.
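
For illustration only, the same bound can be expressed with a semaphore around the per-URL work, reusing the extract_text sketch above (which, unlike the actor, launches its own browser). The structure, not the exact code, is the point here.

import asyncio


async def run_all(urls: list[str], accept_language: str, max_concurrency: int) -> list[dict]:
    """Process many URLs with a bounded number of pages in flight."""
    semaphore = asyncio.Semaphore(min(max_concurrency, 50))  # hard cap of 50, as documented

    async def bounded(url: str) -> dict:
        async with semaphore:  # at most max_concurrency URLs are processed at once
            return await extract_text(url, accept_language)

    return await asyncio.gather(*(bounded(u) for u in urls))

# Example: asyncio.run(run_all(urls, "it,en;q=0.9", max_concurrency=10))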

Language handling

For each languageAndRegion pair, the actor extracts the language part and builds the Accept-Language header used for both Playwright and HTTP fallback requests. For non-English editions it uses a language list such as ["it", "en"], which becomes a header like it,en;q=0.9; for English editions it uses ["en"].

This improves localization hints to publisher sites, but the final language of the returned page still depends on how each site handles geographic and language negotiation.
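
A minimal sketch of that header construction, following the examples above; the actor's real logic may differ in detail.

def build_accept_language(pair: str) -> tuple[list[str], str]:
    """Turn a CC:ll pair into a language list and an Accept-Language header value."""
    lang = pair.split(":")[1].lower()
    langs = [lang] if lang == "en" else [lang, "en"]  # English fallback for non-English editions
    # Primary language at full weight, English fallback at q=0.9, e.g. "it,en;q=0.9".
    header = langs[0] if len(langs) == 1 else f"{langs[0]},{langs[1]};q=0.9"
    return langs, header

# build_accept_language("IT:it") -> (["it", "en"], "it,en;q=0.9")
# build_accept_language("US:en") -> (["en"], "en")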

Output

Each dataset item contains the processing result for one discovered publisher URL.

Successful Playwright result

{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": null,
  "extractor": "playwright"
}

Successful fallback result

{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": "Playwright failed: ...",
  "extractor": "httpx-fallback"
}

Final failure result

{
  "url": "https://example.com",
  "success": false,
  "statusCode": null,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": null,
  "text": null,
  "error": "Playwright: ... | HTTP fallback: ...",
  "extractor": null
}

Logs and performance metrics

The final actor includes structured logs that make it easier to debug and profile runs. Key events include:

  • actor lifecycle logs such as actor.start, actor.input, and actor.done;
  • RSS discovery logs such as rss.collect.start, rss.chunk.start, and rss.collect.stop;
  • extraction logs such as playwright.start, extract.playwright_failed, extract.batch.start, and extract.batch.progress;
  • timing logs such as timing.parse_feed, timing.decode_google_news_url, timing.playwright_extract, timing.httpx_fallback, timing.discovery_total, timing.extraction_total, and timing.actor_total.

These logs are designed to help verify each step of the actor and identify bottlenecks in RSS fetching, URL decoding, browser extraction, fallback extraction, and total runtime.
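
As an illustration only, timing events of this kind can be emitted with a small helper around the Apify SDK logger. The helper below is an assumption for demonstration, not the actor's actual logging code; Actor.log is the logger exposed by the Apify Python SDK.

import time
from contextlib import contextmanager

from apify import Actor


@contextmanager
def timed(event: str, **fields):
    """Log a timing.* event with the elapsed time in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        Actor.log.info(f"{event} elapsed_ms={elapsed_ms:.1f} {fields}")

# Usage inside the actor, for example:
# with timed("timing.playwright_extract", url=url):
#     result = await extract_text(url, accept_language)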

Notes and limitations

  • Google News RSS is used only for discovery; final extraction runs against publisher pages, not Google-hosted article pages.
  • maxSites applies to unique news sites per language/region pair, not to the total number of articles globally across all pairs.
  • daysBack controls the search lookback window, and the final actor sets chunkDays = daysBack, so each edition is queried with a single time chunk covering the full requested range.
  • The extracted text depends on publisher HTML structure and on whether content is accessible to browser automation or plain HTTP requests.
  • Some publishers may block automation, enforce paywalls, or deliver reduced content to automated clients.
  • The languageAndRegion choices in the input schema are a practical whitelist of commonly used pairs; Google does not publish a stable, official, exhaustive list of all supported ceid combinations.

Client code examples

Node.js

import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt"
    ],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("nQxTzKe5yrbyzysYh").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();

Python

from apify_client import ApifyClient

# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt",
    ],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1,
}

# Run the Actor and wait for it to finish
run = client.actor("nQxTzKe5yrbyzysYh").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

cURL

# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare Actor input
cat > input.json <<'EOF'
{
  "searchQuery": "\"samsung galaxy s25\" -ultra",
  "maxConcurrency": 10,
  "languageAndRegion": [
    "IT:it",
    "US:en",
    "LT:lt"
  ],
  "maxSites": 50,
  "daysBack": 7,
  "decodeInterval": 1
}
EOF

# Run the Actor
curl "https://api.apify.com/v2/acts/nQxTzKe5yrbyzysYh/runs?token=$API_TOKEN" \
  -X POST \
  -d @input.json \
  -H 'Content-Type: application/json'