AI News Content Crawler

Pricing

from $1.00 / 1,000 results

Developer: Fabio Borsotti (maintained by Community)

Google News RSS Article Text Extractor

This Apify actor searches Google News RSS for one or more country/language editions, collects article URLs, opens each news page, extracts clean text, and stores one dataset item per processed URL. It uses Google News RSS discovery plus Playwright-based extraction with automatic HTTP fallback when browser rendering fails.

What it does

  • Accepts a Google News search query, including advanced operators such as intitle, inurl, site, -, quoted exact matches, AND, and OR.
  • Accepts one or more languageAndRegion pairs in the format CC:ll, for example IT:it, US:en, LT:lt, and uses them to build Google News RSS parameters gl, hl, and ceid.
  • Searches Google News RSS for each selected edition and decodes Google News article links into publisher URLs using googlenewsdecoder.
  • Limits discovery to a maximum number of unique news sites per language/region pair through maxSites.
  • Uses daysBack as the RSS search window and sets chunkDays = daysBack in the final actor logic.
  • Opens discovered article URLs with Playwright, extracts main page text, and falls back to HTTP + BeautifulSoup text extraction if Playwright fails.
  • Builds an Accept-Language header from the selected edition language and includes English as fallback for non-English editions.
  • Processes multiple pages in parallel using maxConcurrency, capped to 50 open Playwright pages at a time.
  • Writes one dataset item per processed URL and emits detailed performance logs for RSS discovery, URL decoding, browser extraction, fallback extraction, and overall actor runtime.
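
The languageAndRegion handling listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the actor's code: the helper name build_rss_url is hypothetical, the endpoint is the public Google News RSS search URL, and hl is assumed to be the bare language code, as in the sample output items (hl "en", gl "US", ceid "US:en").

```python
from urllib.parse import urlencode

RSS_SEARCH_URL = "https://news.google.com/rss/search"


def build_rss_url(query: str, pair: str) -> str:
    """Build a Google News RSS search URL from a 'CC:ll' pair (hypothetical helper)."""
    country, sep, language = pair.partition(":")
    if not sep or not country or not language:
        raise ValueError(f"expected CC:ll, got {pair!r}")
    params = {
        "q": query,
        "hl": language,                   # interface language, e.g. "it"
        "gl": country,                    # country edition, e.g. "IT"
        "ceid": f"{country}:{language}",  # edition id, e.g. "IT:it"
    }
    return f"{RSS_SEARCH_URL}?{urlencode(params)}"


print(build_rss_url("jannik sinner", "US:en"))
```

Keeping the pair as a single CC:ll string means one input value determines all three RSS parameters, so they cannot drift out of sync.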

Input

Advanced search filters

You can also use advanced search operators in your queries, such as intitle, inurl, site, exclude operator -, exact match with double-quotes "", AND, OR and more.

Example queries with advanced operators

| Query | Explained |
| --- | --- |
| intitle:"AI" AND site:bbc.com | Finds articles with "AI" in the title from BBC. |
| site:reuters.com "stock market" -crypto | Finds stock market articles on Reuters, excluding crypto-related ones. |
| "Samsung Galaxy S25" AND (review OR comparison) | Searches for reviews or comparisons of Samsung Galaxy S25. |
| site:nytimes.com intitle:"election" after:2025-01-01 | Retrieves recent NY Times articles with "election" keyword in the title. |
| inurl:blog OR inurl:news "climate change" | Searches for climate change mentions in blog or news URLs. |

For more information, see the Google Guide on Search Operators.

Input fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| searchQuery | string | Yes | Query written as you would enter it in the Google News search bar. Advanced operators are supported, such as intitle, inurl, site, the exclude operator -, exact match with double quotes, AND, OR and more. |
| maxConcurrency | integer | No | Maximum number of browser pages or tabs processed in parallel in the Playwright browser, not the number of system threads. Capped to 50. |
| languageAndRegion | array of strings | Yes | One or more country/language pairs in the format CC:ll, for example IT:it, US:en, LT:lt. If you specify more than one pair, up to maxSites sites are returned for each pair. The selectable values are a practical whitelist of commonly used Google News country/language pairs, not an official or complete list of all combinations supported by Google News. |
| maxSites | integer | No | Maximum number of unique news sites returned for each language/region pair. The RSS feed may return more results, but the domain deduplication filter stops once it reaches maxSites unique domains. |
| daysBack | integer | No | Number of days back to search in Google News RSS. chunkDays is always set equal to daysBack. |
| decodeInterval | integer | No | Interval passed to googlenewsdecoder. Default is 1. It sets the time between the decoder's attempts to resolve Google News links. A value of 1 keeps the interval short while still helping to avoid overloading the service and to reduce the risk of errors or rate limits. Higher values slow the run but can be more stable when decoding many links in sequence. |
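
The maxSites behavior described in the table (stop once enough unique domains have been collected) might look like the sketch below. The function name limit_to_unique_sites is hypothetical, and treating "www.example.com" and "example.com" as the same site is an assumption; the actor's own domain normalization may differ.

```python
from urllib.parse import urlparse


def limit_to_unique_sites(urls: list[str], max_sites: int) -> list[str]:
    """Keep the first URL seen per domain and stop at max_sites domains."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        if len(kept) >= max_sites:
            break  # enough unique domains; ignore the rest of the feed
        # Assumption: a leading "www." does not make a different site.
        domain = urlparse(url).netloc.lower().removeprefix("www.")
        if not domain or domain in seen:
            continue
        seen.add(domain)
        kept.append(url)
    return kept


feed = [
    "https://www.pbs.org/newshour/a",
    "https://pbs.org/newshour/b",
    "https://www.theguardian.com/us-news/c",
]
print(limit_to_unique_sites(feed, 10))
```

Because the filter runs per language/region pair, a run with three pairs and maxSites set to 10 can return up to 30 items in total.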

Example input

{
    "daysBack": 100,
    "decodeInterval": 1,
    "languageAndRegion": [
        "US:en"
    ],
    "maxConcurrency": 50,
    "maxSites": 10,
    "searchQuery": "jannik sinner"
}

How it works

The actor runs in two stages. First, it queries Google News RSS for each selected language/region pair, applies the after: and before: filters derived from daysBack, decodes Google News links into publisher URLs, and deduplicates the results.
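
As a rough illustration of how daysBack could turn into after: and before: filters, the sketch below appends a date window to the query. The function name add_date_window is hypothetical, and how the actor picks the end date is an assumption; the example values were chosen only to reproduce the query string shown in the sample output ("donald trump after:2026-05-09 before:2026-05-14").

```python
from datetime import date, timedelta


def add_date_window(query: str, days_back: int, end: date) -> str:
    """Append Google-style after:/before: filters covering days_back days up to end."""
    start = end - timedelta(days=days_back)
    return f"{query} after:{start.isoformat()} before:{end.isoformat()}"


print(add_date_window("donald trump", 5, date(2026, 5, 14)))
# donald trump after:2026-05-09 before:2026-05-14
```

Since chunkDays equals daysBack, one such window per edition covers the whole requested range.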

Second, it opens each discovered news URL with Playwright, attempts to extract the main text from article-like selectors such as article, main, and common content containers, and falls back to a plain HTTP request plus HTML text cleanup when browser extraction fails.
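
The try-browser-then-HTTP pattern can be shown without any browser at all. In this sketch the two extractors are passed in as plain callables so the control flow is visible; extract_with_fallback and the "http_fallback" label are hypothetical names, and leaving error as null after a successful fallback is an assumption. Only the "playwright" label and the combined error format come from the sample output.

```python
from typing import Callable

Extractor = Callable[[str], str]  # takes a URL, returns text, raises on failure


def extract_with_fallback(url: str, primary: Extractor, fallback: Extractor) -> dict:
    """Try the primary extractor, then the fallback; report both errors if both fail."""
    attempts = (
        ("playwright", "Playwright", primary),
        ("http_fallback", "HTTP fallback", fallback),  # label is an assumption
    )
    errors: list[str] = []
    for extractor_name, error_prefix, extract in attempts:
        try:
            text = extract(url)
        except Exception as exc:  # sketch only; real code would catch narrower errors
            errors.append(f"{error_prefix}: {exc}")
            continue
        return {"url": url, "success": True, "text": text,
                "error": None, "extractor": extractor_name}
    return {"url": url, "success": False, "text": None,
            "error": " | ".join(errors), "extractor": None}


def broken_browser(url: str) -> str:
    raise RuntimeError("navigation timeout")


def plain_http(url: str) -> str:
    return "article text"


print(extract_with_fallback("https://example.com", broken_browser, plain_http))
```

Collecting both error messages before giving up is what makes a final failure item useful for debugging: it shows why each stage failed, not just that the URL failed.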

Concurrency

The actor uses a shared Playwright browser and limits concurrent page work with asyncio.Semaphore(maxConcurrency), so maxConcurrency represents the maximum number of browser pages or tabs processed in parallel, not the number of OS threads.

This makes concurrency easy to reason about operationally: if maxConcurrency is set to 10, the actor will process up to ten pages at the same time in Chromium.
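
A minimal sketch of that semaphore pattern follows, with a fake worker standing in for the Playwright page work. The names process_all and fake_page are hypothetical, and clamping the value to at least 1 is an assumption; only asyncio.Semaphore and the cap of 50 come from the description above.

```python
import asyncio

MAX_PAGES = 50  # documented upper bound on maxConcurrency


async def process_all(urls, worker, max_concurrency: int):
    """Run worker(url) for every URL with at most max_concurrency in flight."""
    limit = max(1, min(max_concurrency, MAX_PAGES))
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:  # waits here while `limit` workers are busy
            return await worker(url)

    return await asyncio.gather(*(guarded(u) for u in urls))


async def demo():
    active = peak = 0

    async def fake_page(url):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for page load and extraction
        active -= 1
        return url

    urls = [f"https://example.com/{i}" for i in range(25)]
    results = await process_all(urls, fake_page, 10)
    return len(results), peak


print(asyncio.run(demo()))  # (25, 10)
```

All 25 tasks are created up front, but the semaphore lets only ten of them hold a page at once, which is why the observed peak matches maxConcurrency.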

Language handling

For each languageAndRegion pair, the actor extracts the language part and builds the Accept-Language header used for both Playwright and HTTP fallback requests. For non-English editions it uses a language list such as ["it", "en"], which becomes a header like it,en;q=0.9; for English editions it uses ["en"].

This improves localization hints to publisher sites, but the final language of the returned page still depends on how each site handles geographic and language negotiation.
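
The header construction can be reproduced with two small helpers. Both function names are hypothetical, and the rule for quality values beyond the second language (stepping down by 0.1) is an assumption, since the text above only shows one- and two-language cases.

```python
def language_list(pair: str) -> list[str]:
    """Return the language preferences for a 'CC:ll' pair, with English as fallback."""
    language = pair.split(":", 1)[1]
    return ["en"] if language == "en" else [language, "en"]


def accept_language(langs: list[str]) -> str:
    """Format an Accept-Language value; the first language carries no q-value."""
    if not langs:
        raise ValueError("at least one language is required")
    parts = [langs[0]]
    q = 0.9
    for lang in langs[1:]:
        parts.append(f"{lang};q={q:.1f}")
        q = max(0.1, round(q - 0.1, 1))  # assumption: step down by 0.1 per entry
    return ",".join(parts)


print(accept_language(language_list("IT:it")))  # it,en;q=0.9
print(accept_language(language_list("US:en")))  # en
```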

Output

Each dataset item contains the processing result for one discovered publisher URL.

Successful result

Both example items below were extracted with Playwright, as their extractor field shows.

{
"position": 1,
"title": "The importance of pomp and protocol as Trump goes to China - PBS",
"domain": "http://www.pbs.org",
"thumbnail": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBAUEBAYFBQUGBgYHCQ4JCQgICRINDQoOFRIWFhUSFBQXGiEcFxgfGRQUHSc....",
"snippet": "From the moment President Donald Trump lands in Beijing on Wednesday, all eyes will be on how much of a spectacle the Chinese government rolls out, such as who lines up to greet him, what music is played and whether Chinese and American children wave flowers and flags.",
"url": "https://www.pbs.org/newshour/world/the-importance-of-pomp-and-protocol-as-trump-goes-to-china",
"success": true,
"statusCode": 200,
"langs": [
"en"
],
"acceptLanguage": "en",
"htmlLang": "en-us",
"text": "Full Episode\nWednesday, May 13",
"error": null,
"extractor": "playwright",
"published": "Wed, 13 May 2026 18:51:05 GMT",
"googleNewsUrl": "https://news.google.com/rss/articles/CBMimAFBVV95cUxQRVh4b0xFbFdDOXdtTkJ6eHdCRVBWM0dKNWZ5elFLY00td3hBdzE0eHhPYmdFaHlpb25qMmZqOENkeEt4SndtMWl6SVpubl9EWkp0cEV3cmZGSzMyVmdoT1NMWGhwUS01dE8xTC1hX2tGYUVvOXpRamZVbERPZ2xZLXhGbjk2MkJ2UWhhU3dPVm1pSmVMNktIctIBngFBVV95cUxONk1UX0d2MUtYNFFENjAxMXhKTGI0WE5sOVVzM1FsNnZsY0lVQlNJN0ZJRksycGFtZHJtNnVZY1p0N0RVaUw3cXlnaTdKQWc2RTkybkl6WGxkczdBNzFuSGpXUF9SZ3IyVnpDMDNTaUdsNTY3UFZkYlZWb2J1dzBJaVBWOUlJbEk5NnBMS1pYZEF5dTgza21vdVZPVHY1UQ?oc=5",
"query": "donald trump after:2026-05-09 before:2026-05-14",
"hl": "en",
"gl": "US",
"ceid": "US:en"
},
{
"position": 2,
"title": "It’s 10pm, do you know where Donald Trump is? - The Guardian",
"domain": "http://www.theguardian.com",
"thumbnail": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBAUEBAYFBQ...",
"snippet": "The US president’s late-night Truth Social vitriol riddled with erratic capitalization and spelling? That’s leadership",
"url": "https://www.theguardian.com/us-news/2026/may/13/its-1015-do-you-know-where-your-president-is",
"success": true,
"statusCode": 200,
"langs": [
"en"
],
"acceptLanguage": "en",
"htmlLang": "en",
"text": "View image in fullscreen\nDonald Trump in the Rose Garden of the White House on 11 May. Photograph: Kent Nishimura/AFP/Getty Images\nThis Week in Trumpland\nUS news\nAnalysis\nIt’s 10pm, do you know where Donald Trump is?\nRachel Leingang\n\nThe US president’s late-night Truth Social vitriol riddled with erratic capitalization and spelling? That’s leadership\n\n \n\nThis was originally published in This Week in Trumpland. Sign up to receive it in your inbox every Wednesday\n\nWed 13 May 2026 13.00 EDT\nShare\n\nGas prices are soaring because of blockages in the strait of Hormuz as part of the unauthorized war in Iran. There’s a highly consequential meeting with the president of China on the books for this week. The FDA director just stepped down over a disagreement over fruit-flavored vapes. Southern states are redrawing maps at breakneck pace to gerrymander Black voters out of their electoral voices.\n\nYou know what that means: it’s time for some conspiracy-laden, high-speed Truth Social posting.\n\nDonald Trump again this week went on a spree on his own social media site, posting more than 50 times in less than three hours, all after 10pm ET. He continued to post through it on Monday morning.\n\nIt was a greatest hits of the president’s enemies. He went after Barack Obama multiple times with false or unfounded accusations, claiming the former president plotted a coup against Trump and calling Obama the “most DEMONIC FORCE” in American politics. He shared altered images of Obama, Joe Biden and Nancy Pelosi in the Lincoln Memorial’s reflecting pool with the caption “Dumacrats love sewage”.\n\nWhat Trump’s Bible stunt says about his complicated history with Christianity\nRead more\n\n....",
"error": null,
"extractor": "playwright",
"published": "Wed, 13 May 2026 22:00:00 GMT",
"googleNewsUrl": "https://news.google.com/rss/articles/CBMilwFBVV95cUxPdVk0STB1WFVlZXhRV18wSGtHUGVpakJmSWJZcWt0UFNlb19xcHdqc1FhZzlsWHQtXzg1dUlNeDh2aWRvWEFBNzF1UXhubFBOSGliUkJmSmRncV9wVTFiLVNKd1UySHUtX3FtakV0S2FlaHdzcVY1UzBodDVuWGVWN084VUR6TWVmSVJRWmVjSGw3TzlqOUFj?oc=5",
"query": "donald trump after:2026-05-09 before:2026-05-14",
"hl": "en",
"gl": "US",
"ceid": "US:en"
}

Final failure result

{
    "url": "https://example.com",
    "success": false,
    "statusCode": null,
    "langs": ["it", "en"],
    "acceptLanguage": "it,en;q=0.9",
    "htmlLang": null,
    "text": null,
    "error": "Playwright: ... | HTTP fallback: ...",
    "extractor": null
}
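
When consuming the dataset, it can help to separate successes from failures before using the text. The sketch below is one possible way to do that; the summarize helper is hypothetical and relies only on field names visible in the sample items (success, text, error, url, published). It assumes published is an RFC 2822 style date string, as in the samples.

```python
from email.utils import parsedate_to_datetime


def summarize(items: list[dict]) -> dict:
    """Split dataset items by outcome and collect a few headline numbers."""
    ok = [item for item in items if item.get("success")]
    failed = [item for item in items if not item.get("success")]
    dates = [parsedate_to_datetime(item["published"]) for item in ok if item.get("published")]
    return {
        "succeeded": len(ok),
        "failed": len(failed),
        "total_chars": sum(len(item.get("text") or "") for item in ok),
        "latest_published": max(dates).isoformat() if dates else None,
        "errors": {item["url"]: item.get("error") for item in failed},
    }


sample = [
    {"url": "https://www.pbs.org/a", "success": True, "text": "Full Episode",
     "published": "Wed, 13 May 2026 18:51:05 GMT"},
    {"url": "https://example.com", "success": False, "text": None,
     "error": "Playwright: ... | HTTP fallback: ..."},
]
print(summarize(sample))
```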

Logs and performance metrics

The final actor includes structured logs that make it easier to debug and profile runs. Key events include:

  • Actor lifecycle logs such as actor.start, actor.input, and actor.done.
  • RSS discovery logs such as rss.collect.start, rss.chunk.start, and rss.collect.stop.
  • Extraction logs such as playwright.start, extract.playwright_failed, extract.batch.start, and extract.batch.progress.
  • Timing logs such as timing.parse_feed, timing.decode_google_news_url, timing.playwright_extract, timing.httpx_fallback, timing.discovery_total, timing.extraction_total, and timing.actor_total.

These logs are designed to help verify each step of the actor and identify bottlenecks in RSS fetching, URL decoding, browser extraction, fallback extraction, and total runtime.
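
Timing events of this kind are often produced with a small context manager. The sketch below shows one way to do it; the timed helper, the logger name, and the message layout (event name followed by elapsed_s and key=value pairs) are all assumptions, and the actor's real log format may differ. Only the event names are taken from the list above.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("actor")  # logger name is an assumption


@contextmanager
def timed(event: str, **fields):
    """Log the elapsed time for a block when it exits, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        parts = [event, f"elapsed_s={elapsed:.3f}"]
        parts += [f"{key}={value}" for key, value in fields.items()]
        logger.info(" ".join(parts))


logging.basicConfig(level=logging.INFO)
with timed("timing.parse_feed", ceid="US:en"):
    time.sleep(0.01)  # stands in for fetching and parsing the RSS feed
```

Logging in the finally clause means a failed step still reports how long it ran, which helps when a bottleneck and an error coincide.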

Notes and limitations

  • Google News RSS is used only for discovery; final extraction runs against publisher pages, not Google-hosted article pages.
  • maxSites applies to unique news sites per language/region pair, not to the total number of articles globally across all pairs.
  • daysBack controls the search lookback window, and the final actor sets chunkDays = daysBack, so each edition is queried with a single time chunk covering the full requested range.
  • The extracted text depends on publisher HTML structure and on whether content is accessible to browser automation or plain HTTP requests.
  • Some publishers may block automation, enforce paywalls, or deliver reduced content to automated clients.
  • The languageAndRegion choices in the input schema are a practical whitelist; Google does not publish a stable, official, exhaustive list of all supported ceid combinations.

Client code examples

Node.js

import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt"
    ],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("nQxTzKe5yrbyzysYh").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();

Python

from apify_client import ApifyClient

# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt",
    ],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1,
}

# Run the Actor and wait for it to finish
run = client.actor("nQxTzKe5yrbyzysYh").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

cURL

# Set API token (quoted so the shell does not treat < and > as redirections)
API_TOKEN='<YOUR_API_TOKEN>'

# Prepare Actor input
cat > input.json <<'EOF'
{
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt"
    ],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1
}
EOF

# Run the Actor
curl "https://api.apify.com/v2/acts/nQxTzKe5yrbyzysYh/runs?token=$API_TOKEN" \
    -X POST \
    -d @input.json \
    -H 'Content-Type: application/json'