AI News Content Crawler

Pricing: from $1.00 / 1,000 results

Developer: Fabio Borsotti (Maintained by Community)

Google News RSS Article Text Extractor

This Apify actor searches Google News RSS for one or more country/language editions, collects article URLs, opens each news page, extracts clean text, and stores one dataset item per processed URL. It uses Google News RSS discovery plus Playwright-based extraction with automatic HTTP fallback when browser rendering fails.

What it does

  • Accepts a Google News search query, including advanced operators such as intitle, inurl, site, -, quoted exact matches, AND, and OR.
  • Accepts one or more languageAndRegion pairs in the format CC:ll, for example IT:it, US:en, LT:lt, and uses them to build the Google News RSS parameters gl, hl, and ceid (see the sketch after this list).
  • Searches Google News RSS for each selected edition and decodes Google News article links into publisher URLs using googlenewsdecoder.
  • Limits discovery to a maximum number of unique news sites per language/region pair through maxSites.
  • Uses daysBack as the RSS search window and sets chunkDays = daysBack, so each edition is queried in a single time chunk.
  • Opens discovered article URLs with Playwright, extracts the main page text, and falls back to HTTP + BeautifulSoup text extraction if Playwright fails.
  • Builds an Accept-Language header from the selected edition language and includes English as a fallback for non-English editions.
  • Processes multiple pages in parallel using maxConcurrency, capped at 50 open Playwright pages at a time.
  • Writes one dataset item per processed URL and emits detailed performance logs for RSS discovery, URL decoding, browser extraction, fallback extraction, and overall actor runtime.
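
For illustration, a search query and a CC:ll pair map to a Google News RSS search URL roughly like this. This is a minimal sketch of the documented behavior, not the actor's actual source; the parameter layout follows the commonly used news.google.com/rss/search format.

from urllib.parse import quote_plus


def build_rss_url(search_query: str, pair: str) -> str:
    """Build the Google News RSS search URL for one country/language edition."""
    country, lang = pair.split(":")  # "IT:it" -> gl=IT, hl=it, ceid=IT:it
    return (
        "https://news.google.com/rss/search"
        f"?q={quote_plus(search_query)}&hl={lang}&gl={country}&ceid={country}:{lang}"
    )

# build_rss_url('"samsung galaxy s25" -ultra', "IT:it")
# -> https://news.google.com/rss/search?q=%22samsung+galaxy+s25%22+-ultra&hl=it&gl=IT&ceid=IT:it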

Input

Input fields

  • searchQuery (string, required): Search query written exactly as you would type it into the Google News search bar. Advanced operators such as intitle, inurl, site, the exclude operator -, exact match with double quotes, AND, and OR are supported.
  • maxConcurrency (integer, optional): Maximum number of pages opened simultaneously in the Playwright browser, capped at 50. This value is the number of browser pages or tabs processed in parallel, not the number of system threads.
  • languageAndRegion (array of strings, required): One or more country/language pairs in the format CC:ll, for example IT:it, US:en, LT:lt. If you specify more than one pair, up to maxSites sites are returned for each pair. The input schema offers a practical whitelist of commonly used Google News country/language pairs; it is not an official or complete list of all combinations supported by Google News.
  • maxSites (integer, optional): Maximum number of unique news sites returned for each language/region pair. The RSS feed may find more results, but the domain deduplication filter stops once it reaches maxSites unique domains.
  • daysBack (integer, optional): Number of days back to search in Google News RSS. chunkDays is always set equal to daysBack.
  • decodeInterval (integer, optional): Interval passed to googlenewsdecoder; the default is 1. It controls how often the decoder attempts to resolve a Google News link before considering it failed or moving on to the next attempt. A value of 1 keeps a short pause between attempts, which helps avoid overloading the service and reduces the risk of errors or rate limits; increasing the value slows the process but can be more stable when decoding many links in sequence.

Example input

{
  "searchQuery": "\"samsung galaxy s25\" -ultra",
  "maxConcurrency": 10,
  "languageAndRegion": ["IT:it", "US:en", "LT:lt"],
  "maxSites": 50,
  "daysBack": 7,
  "decodeInterval": 1
}

How it works

The actor runs in two stages. First, it queries Google News RSS for each selected language/region pair, applies the after: and before: filters derived from daysBack, decodes Google News links into publisher URLs, and deduplicates the results.
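
A minimal sketch of this first stage, reusing build_rss_url from the earlier sketch and feedparser for RSS parsing. The decode step is a runnable placeholder for the googlenewsdecoder call, and the query operators follow the description above rather than the actor's actual source.

from datetime import date, timedelta
from urllib.parse import urlparse

import feedparser  # third-party RSS parser, assumed available


def decode_google_news_url(link: str) -> str:
    # Placeholder: the actor resolves news.google.com links to publisher URLs
    # with googlenewsdecoder; returning the link unchanged keeps the sketch runnable.
    return link


def discover(search_query: str, pair: str, days_back: int, max_sites: int) -> list[str]:
    """Return up to max_sites publisher URLs, one per unique news domain."""
    after = (date.today() - timedelta(days=days_back)).isoformat()
    before = date.today().isoformat()
    # Single time chunk covering the whole window (chunkDays == daysBack).
    query = f"{search_query} after:{after} before:{before}"
    feed = feedparser.parse(build_rss_url(query, pair))

    seen_domains: set[str] = set()
    urls: list[str] = []
    for entry in feed.entries:
        url = decode_google_news_url(entry.link)
        domain = urlparse(url).netloc
        if domain in seen_domains:
            continue  # deduplicate by news site
        seen_domains.add(domain)
        urls.append(url)
        if len(seen_domains) >= max_sites:
            break  # stop once maxSites unique domains are reached
    return urls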

Second, it opens each discovered news URL with Playwright, attempts to extract the main text from article-like selectors such as article, main, and common content containers, and falls back to a plain HTTP request plus HTML text cleanup when browser extraction fails.
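
The second stage can be sketched roughly as follows, assuming Playwright's async API, httpx, and BeautifulSoup. Selector order, timeouts, and error handling are simplified assumptions rather than the actor's actual source, and unlike the actor this sketch launches a browser per URL instead of sharing one.

import httpx
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright


async def extract_text(url: str, accept_language: str) -> dict:
    """Try Playwright first; fall back to a plain HTTP request if the browser fails."""
    try:
        async with async_playwright() as pw:
            browser = await pw.chromium.launch()
            page = await browser.new_page(extra_http_headers={"Accept-Language": accept_language})
            response = await page.goto(url, wait_until="domcontentloaded")
            text = ""
            # Prefer article-like containers, then fall back to the whole body.
            for selector in ("article", "main", "body"):
                if await page.locator(selector).count():
                    text = await page.locator(selector).first.inner_text()
                    break
            await browser.close()
            return {"url": url, "success": True,
                    "statusCode": response.status if response else None,
                    "text": text, "extractor": "playwright", "error": None}
    except Exception as exc:
        # Browser rendering failed: plain HTTP request plus HTML text cleanup.
        async with httpx.AsyncClient(headers={"Accept-Language": accept_language},
                                     follow_redirects=True) as client:
            resp = await client.get(url)
            soup = BeautifulSoup(resp.text, "html.parser")
            return {"url": url, "success": True, "statusCode": resp.status_code,
                    "text": soup.get_text(separator=" ", strip=True),
                    "extractor": "httpx-fallback", "error": f"Playwright failed: {exc}"}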

Concurrency

The actor uses a shared Playwright browser and limits concurrent page work with asyncio.Semaphore(maxConcurrency), so maxConcurrency represents the maximum number of browser pages or tabs processed in parallel, not the number of OS threads.

This makes concurrency easy to reason about operationally: if maxConcurrency is set to 10, the actor will process up to ten pages at the same time in Chromium.
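
For illustration only, the same bound can be expressed with a semaphore around the per-URL work, reusing the extract_text sketch above (which, unlike the actor, launches its own browser). The structure, not the exact code, is the point here.

import asyncio


async def run_all(urls: list[str], accept_language: str, max_concurrency: int) -> list[dict]:
    """Process many URLs with a bounded number of pages in flight."""
    semaphore = asyncio.Semaphore(min(max_concurrency, 50))  # hard cap of 50, as documented

    async def bounded(url: str) -> dict:
        async with semaphore:  # at most max_concurrency URLs are processed at once
            return await extract_text(url, accept_language)

    return await asyncio.gather(*(bounded(u) for u in urls))

# Example: asyncio.run(run_all(urls, "it,en;q=0.9", max_concurrency=10))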

Language handling

For each languageAndRegion pair, the actor extracts the language part and builds the Accept-Language header used for both Playwright and HTTP fallback requests. For non-English editions it uses a language list such as ["it", "en"], which becomes a header like it,en;q=0.9; for English editions it uses ["en"].

This improves localization hints to publisher sites, but the final language of the returned page still depends on how each site handles geographic and language negotiation.
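
A minimal sketch of that header construction, following the examples above; the actor's real logic may differ in detail.

def build_accept_language(pair: str) -> tuple[list[str], str]:
    """Turn a CC:ll pair into a language list and an Accept-Language header value."""
    lang = pair.split(":")[1].lower()
    langs = [lang] if lang == "en" else [lang, "en"]  # English fallback for non-English editions
    # Primary language at full weight, English fallback at q=0.9, e.g. "it,en;q=0.9".
    header = langs[0] if len(langs) == 1 else f"{langs[0]},{langs[1]};q=0.9"
    return langs, header

# build_accept_language("IT:it") -> (["it", "en"], "it,en;q=0.9")
# build_accept_language("US:en") -> (["en"], "en")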

Output

Each dataset item contains the processing result for one discovered publisher URL.

Successful Playwright result

{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": null,
  "extractor": "playwright"
}

Successful fallback result

{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": "Playwright failed: ...",
  "extractor": "httpx-fallback"
}

Final failure result

{
  "url": "https://example.com",
  "success": false,
  "statusCode": null,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": null,
  "text": null,
  "error": "Playwright: ... | HTTP fallback: ...",
  "extractor": null
}

Logs and performance metrics

The final actor includes structured logs that make it easier to debug and profile runs. Key events include:

  • actor lifecycle logs such as actor.start, actor.input, and actor.done;
  • RSS discovery logs such as rss.collect.start, rss.chunk.start, and rss.collect.stop;
  • extraction logs such as playwright.start, extract.playwright_failed, extract.batch.start, and extract.batch.progress;
  • timing logs such as timing.parse_feed, timing.decode_google_news_url, timing.playwright_extract, timing.httpx_fallback, timing.discovery_total, timing.extraction_total, and timing.actor_total.

These logs are designed to help verify each step of the actor and identify bottlenecks in RSS fetching, URL decoding, browser extraction, fallback extraction, and total runtime.
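
As an illustration only, timing events of this kind can be emitted with a small helper around the Apify SDK logger. The helper below is an assumption for demonstration, not the actor's actual logging code; Actor.log is the logger exposed by the Apify Python SDK.

import time
from contextlib import contextmanager

from apify import Actor


@contextmanager
def timed(event: str, **fields):
    """Log a timing.* event with the elapsed time in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        Actor.log.info(f"{event} elapsed_ms={elapsed_ms:.1f} {fields}")

# Usage inside the actor, for example:
# with timed("timing.playwright_extract", url=url):
#     result = await extract_text(url, accept_language)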

Notes and limitations

  • Google News RSS is used only for discovery; final extraction runs against publisher pages, not Google-hosted article pages.
  • maxSites applies to unique news sites per language/region pair, not to the total number of articles globally across all pairs.
  • daysBack controls the search lookback window, and the final actor sets chunkDays = daysBack, so each edition is queried with a single time chunk covering the full requested range.
  • The extracted text depends on publisher HTML structure and on whether content is accessible to browser automation or plain HTTP requests.
  • Some publishers may block automation, enforce paywalls, or deliver reduced content to automated clients.
  • The languageAndRegion choices in the input schema are a practical whitelist of commonly used pairs; Google does not publish a stable, official, exhaustive list of all supported ceid combinations.

Client code examples

Node.js

import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt"
    ],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("nQxTzKe5yrbyzysYh").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();

Python

from apify_client import ApifyClient

# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt",
    ],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1,
}

# Run the Actor and wait for it to finish
run = client.actor("nQxTzKe5yrbyzysYh").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

cURL

# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare Actor input
cat > input.json <<'EOF'
{
  "searchQuery": "\"samsung galaxy s25\" -ultra",
  "maxConcurrency": 10,
  "languageAndRegion": [
    "IT:it",
    "US:en",
    "LT:lt"
  ],
  "maxSites": 50,
  "daysBack": 7,
  "decodeInterval": 1
}
EOF

# Run the Actor
curl "https://api.apify.com/v2/acts/nQxTzKe5yrbyzysYh/runs?token=$API_TOKEN" \
  -X POST \
  -d @input.json \
  -H 'Content-Type: application/json'