AI News Content Crawler
Pricing
from $1.00 / 1,000 results
Developer
Fabio Borsotti
Last modified: 3 days ago
Google News RSS Article Text Extractor
This Apify actor searches Google News RSS for one or more country/language editions, collects article URLs, opens each news page, extracts clean text, and stores one dataset item per processed URL. It uses Google News RSS discovery plus Playwright-based extraction with automatic HTTP fallback when browser rendering fails.
What it does
- Accepts a Google News search query, including advanced operators such as `intitle:`, `inurl:`, `site:`, the exclusion operator `-`, quoted exact matches, `AND`, and `OR`.
- Accepts one or more `languageAndRegion` pairs in the format `CC:ll`, for example `IT:it`, `US:en`, `LT:lt`, and uses them to build the Google News RSS parameters `gl`, `hl`, and `ceid`.
- Searches Google News RSS for each selected edition and decodes Google News article links into publisher URLs using `googlenewsdecoder`.
- Limits discovery to a maximum number of unique news sites per language/region pair through `maxSites`.
- Uses `daysBack` as the RSS search window and sets `chunkDays = daysBack` in the final actor logic.
- Opens discovered article URLs with Playwright, extracts main page text, and falls back to HTTP + BeautifulSoup text extraction if Playwright fails.
- Builds an `Accept-Language` header from the selected edition language and includes English as a fallback for non-English editions.
- Processes multiple pages in parallel using `maxConcurrency`, capped at 50 open Playwright pages at a time.
- Writes one dataset item per processed URL and emits detailed performance logs for RSS discovery, URL decoding, browser extraction, fallback extraction, and overall actor runtime.
Input
Input fields
| Field | Type | Required | Description |
|---|---|---|---|
| searchQuery | string | Yes | Query written as for the Google News search bar. Advanced operators such as `intitle:`, `inurl:`, `site:`, the exclusion operator `-`, exact match with double quotes, `AND`, and `OR` are supported. |
| maxConcurrency | integer | No | Maximum number of pages opened simultaneously in the Playwright browser, capped at 50. This is the number of browser pages (tabs) processed in parallel, not the number of system threads. |
| languageAndRegion | array of strings | Yes | One or more country/language pairs in the format `CC:ll`, for example `IT:it`, `US:en`, `LT:lt`. If you specify more than one pair, up to `maxSites` sites are returned for each pair. The input schema offers a practical whitelist of commonly used pairs; it is not an official or complete list of all combinations supported by Google News. |
| maxSites | integer | No | Maximum number of unique news sites returned for each language/region pair. The RSS feed may find more results, but the domain deduplication filter stops once it reaches `maxSites` unique domains. |
| daysBack | integer | No | Number of days back to search in Google News RSS. `chunkDays` is always set equal to `daysBack`. |
| decodeInterval | integer | No | Interval passed to `googlenewsdecoder`; default is 1. It controls how often the decoder attempts to resolve a Google News link before considering it failed or moving on. A value of 1 keeps a very short interval between attempts, which avoids overloading the service and reduces the risk of rate limits; increasing the value slows the process but can be more stable when decoding many links in sequence. |
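To make the `languageAndRegion` mapping concrete, here is a minimal sketch of how a `CC:ll` pair could be split into the `gl`, `hl`, and `ceid` parameters the actor builds; the helper name `parse_edition` is hypothetical, not part of the actor's code:

```python
def parse_edition(pair: str) -> dict:
    """Split a 'CC:ll' pair into the gl/hl/ceid parameters used by Google News RSS."""
    country, language = pair.split(":")
    return {
        "gl": country,                    # geographic edition, e.g. "IT"
        "hl": language,                   # interface language, e.g. "it"
        "ceid": f"{country}:{language}",  # combined edition id, e.g. "IT:it"
    }
```

For example, `parse_edition("IT:it")` yields `{"gl": "IT", "hl": "it", "ceid": "IT:it"}`.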
Example input
```json
{
  "searchQuery": "\"samsung galaxy s25\" -ultra",
  "maxConcurrency": 10,
  "languageAndRegion": ["IT:it", "US:en", "LT:lt"],
  "maxSites": 50,
  "daysBack": 7,
  "decodeInterval": 1
}
```
How it works
The actor runs in two stages. First, it queries Google News RSS for each selected language/region pair, applies the after: and before: filters derived from daysBack, decodes Google News links into publisher URLs, and deduplicates the results.
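The discovery stage described above can be sketched as follows; `build_rss_url` is a hypothetical helper, and the actor's exact query construction may differ:

```python
from datetime import date, timedelta
from urllib.parse import quote_plus

def build_rss_url(query: str, gl: str, hl: str, ceid: str, days_back: int) -> str:
    """Append after:/before: date filters derived from days_back to the query,
    then build the Google News RSS search URL for one edition."""
    today = date.today()
    start = today - timedelta(days=days_back)
    dated_query = f"{query} after:{start:%Y-%m-%d} before:{today:%Y-%m-%d}"
    return (
        "https://news.google.com/rss/search?q="
        f"{quote_plus(dated_query)}&gl={gl}&hl={hl}&ceid={quote_plus(ceid)}"
    )
```

With `daysBack = 7` and the edition `IT:it`, this produces a single RSS query covering the last seven days for that edition.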
Second, it opens each discovered news URL with Playwright, attempts to extract the main text from article-like selectors such as article, main, and common content containers, and falls back to a plain HTTP request plus HTML text cleanup when browser extraction fails.
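The HTTP fallback's text cleanup can be illustrated with a stdlib-only stand-in (the actor itself uses httpx and BeautifulSoup; `TextExtractor` and `html_to_text` here are illustrative names, not the actor's code):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text while skipping script/style content,
    mimicking a minimal BeautifulSoup-style cleanup."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Strip tags and scripts from an HTML string, returning plain text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

In the real fallback path, the HTML would come from a plain HTTP response rather than a literal string.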
Concurrency
The actor uses a shared Playwright browser and limits concurrent page work with asyncio.Semaphore(maxConcurrency), so maxConcurrency represents the maximum number of browser pages or tabs processed in parallel, not the number of OS threads.
This makes concurrency easy to reason about operationally: if maxConcurrency is set to 10, the actor will process up to ten pages at the same time in Chromium.
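A minimal sketch of this semaphore pattern, assuming the per-URL work is an async function (the real actor opens Playwright pages inside the `async with` block; `process_urls` is a hypothetical name):

```python
import asyncio

async def process_urls(urls, max_concurrency: int):
    """Process URLs with at most max_concurrency tasks active at once,
    mirroring the actor's semaphore-based page limit."""
    semaphore = asyncio.Semaphore(min(max_concurrency, 50))  # hard cap at 50

    async def process_one(url: str) -> str:
        async with semaphore:
            # In the actor this would open a Playwright page and extract text;
            # here we just yield control to illustrate the gating.
            await asyncio.sleep(0)
            return url

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(process_one(u) for u in urls))
```

Because the semaphore gates page work rather than thread creation, raising `maxConcurrency` increases browser memory pressure but not OS thread count.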
Language handling
For each languageAndRegion pair, the actor extracts the language part and builds the Accept-Language header used for both Playwright and HTTP fallback requests. For non-English editions it uses a language list such as ["it", "en"], which becomes a header like it,en;q=0.9; for English editions it uses ["en"].
This improves localization hints to publisher sites, but the final language of the returned page still depends on how each site handles geographic and language negotiation.
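A sketch of the header construction implied above; the helper name is hypothetical, but the output format matches the `acceptLanguage` values shown in the output examples:

```python
def build_accept_language(primary: str) -> str:
    """Build the Accept-Language header: the edition language first, with
    English as a lower-priority fallback for non-English editions."""
    if primary == "en":
        return "en"
    # Primary language at full weight, English fallback at q=0.9
    return f"{primary},en;q=0.9"
```

So an `IT:it` edition yields `it,en;q=0.9`, while `US:en` yields plain `en`.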
Output
Each dataset item contains the processing result for one discovered publisher URL.
Successful Playwright result
```json
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": null,
  "extractor": "playwright"
}
```
Successful fallback result
```json
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain ...",
  "error": "Playwright failed: ...",
  "extractor": "httpx-fallback"
}
```
Final failure result
```json
{
  "url": "https://example.com",
  "success": false,
  "statusCode": null,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": null,
  "text": null,
  "error": "Playwright: ... | HTTP fallback: ...",
  "extractor": null
}
```
Logs and performance metrics
The final actor includes structured logs that make it easier to debug and profile runs. Key events include:

- Actor lifecycle: `actor.start`, `actor.input`, `actor.done`
- RSS discovery: `rss.collect.start`, `rss.chunk.start`, `rss.collect.stop`
- Extraction: `playwright.start`, `extract.playwright_failed`, `extract.batch.start`, `extract.batch.progress`
- Timing: `timing.parse_feed`, `timing.decode_google_news_url`, `timing.playwright_extract`, `timing.httpx_fallback`, `timing.discovery_total`, `timing.extraction_total`, `timing.actor_total`
These logs are designed to help verify each step of the actor and identify bottlenecks in RSS fetching, URL decoding, browser extraction, fallback extraction, and total runtime.
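One way such timing events could be emitted, shown as a stdlib sketch rather than the actor's actual logging code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(event: str, log=print):
    """Emit a 'timing.<event>' line with the elapsed wall-clock duration,
    similar in spirit to the actor's timing.* log events."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log(f"timing.{event} duration={elapsed:.3f}s")
```

Wrapping a step such as feed parsing in `with timed("parse_feed"):` then yields one timing line per step, which is enough to spot whether discovery, decoding, or extraction dominates the run.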
Notes and limitations
- Google News RSS is used only for discovery; final extraction runs against publisher pages, not Google-hosted article pages.
- `maxSites` applies to unique news sites per language/region pair, not to the total number of articles globally across all pairs.
- `daysBack` controls the search lookback window, and the final actor sets `chunkDays = daysBack`, so each edition is queried with a single time chunk covering the full requested range.
- The extracted text depends on publisher HTML structure and on whether content is accessible to browser automation or plain HTTP requests.
- Some publishers may block automation, enforce paywalls, or deliver reduced content to automated clients.
- The `languageAndRegion` choices in the input schema are a practical whitelist, but Google does not publish a stable, official, exhaustive list of all supported `ceid` combinations.
Client code examples
Node.js
```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with API token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": ["IT:it", "US:en", "LT:lt"],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1
};

(async () => {
    // Run the Actor and wait for it to finish
    const run = await client.actor("nQxTzKe5yrbyzysYh").call(input);

    // Fetch and print Actor results from the run's dataset (if any)
    console.log('Results from dataset');
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    items.forEach((item) => {
        console.dir(item);
    });
})();
```
Python
```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your API token
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "searchQuery": "\"samsung galaxy s25\" -ultra",
    "maxConcurrency": 10,
    "languageAndRegion": ["IT:it", "US:en", "LT:lt"],
    "maxSites": 50,
    "daysBack": 7,
    "decodeInterval": 1,
}

# Run the Actor and wait for it to finish
run = client.actor("nQxTzKe5yrbyzysYh").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```
cURL
```shell
# Set API token
API_TOKEN=<YOUR_API_TOKEN>

# Prepare Actor input
cat > input.json <<'EOF'
{
  "searchQuery": "\"samsung galaxy s25\" -ultra",
  "maxConcurrency": 10,
  "languageAndRegion": ["IT:it", "US:en", "LT:lt"],
  "maxSites": 50,
  "daysBack": 7,
  "decodeInterval": 1
}
EOF

# Run the Actor
curl "https://api.apify.com/v2/acts/nQxTzKe5yrbyzysYh/runs?token=$API_TOKEN" \
  -X POST \
  -d @input.json \
  -H 'Content-Type: application/json'
```