AI News Content Crawler
Pricing
from $1.00 / 1,000 results
This Apify actor searches Google News RSS for one or more country/language editions, collects article URLs, opens each news page, extracts clean text, and stores one dataset item per processed URL. It uses Google News RSS discovery with automatic HTTP fallback when browser rendering fails.
Developer
Fabio Borsotti
Maintained by Community
Google News RSS Article Text Extractor
This Apify actor searches Google News RSS for one or more country/language editions, collects article URLs, opens each news page, extracts clean text, and stores one dataset item per processed URL. It uses Google News RSS discovery plus Playwright-based extraction with automatic HTTP fallback when browser rendering fails.
What it does
- Accepts a Google News search query, including advanced operators such as `intitle`, `inurl`, `site`, `-`, quoted exact matches, `AND`, and `OR`.
- Accepts one or more `languageAndRegion` pairs in the format `CC:ll`, for example `IT:it`, `US:en`, `LT:lt`, and uses them to build Google News RSS parameters `gl`, `hl`, and `ceid`.
- Searches Google News RSS for each selected edition and decodes Google News article links into publisher URLs using `googlenewsdecoder`.
- Limits discovery to a maximum number of unique news sites per language/region pair through `maxSites`.
- Uses `daysBack` as the RSS search window and sets `chunkDays = daysBack` in the final actor logic.
- Opens discovered article URLs with Playwright, extracts main page text, and falls back to HTTP + BeautifulSoup text extraction if Playwright fails.
- Builds an `Accept-Language` header from the selected edition language and includes English as fallback for non-English editions.
- Processes multiple pages in parallel using `maxConcurrency`, capped to 50 open Playwright pages at a time.
- Writes one dataset item per processed URL and emits detailed performance logs for RSS discovery, URL decoding, browser extraction, fallback extraction, and overall actor runtime.
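As a rough illustration of how a `CC:ll` pair could map to the `gl`, `hl`, and `ceid` parameters, here is a minimal sketch. The `Edition` type and `parse_edition` helper are hypothetical names, not part of the actor's code; the mapping itself (country to `gl`, language to `hl`, the whole pair to `ceid`) follows the sample output shown later in this README.

```python
from typing import NamedTuple


class Edition(NamedTuple):
    gl: str    # country code, e.g. "US"
    hl: str    # language code, e.g. "en"
    ceid: str  # combined edition id, e.g. "US:en"


def parse_edition(pair: str) -> Edition:
    """Split a CC:ll pair into gl, hl and ceid (hypothetical helper)."""
    country, sep, language = pair.strip().partition(":")
    if not sep or not country or not language:
        raise ValueError(f"expected CC:ll, got {pair!r}")
    return Edition(gl=country, hl=language, ceid=f"{country}:{language}")


if __name__ == "__main__":
    for pair in ["IT:it", "US:en", "LT:lt"]:
        print(parse_edition(pair))
```

Rejecting malformed pairs early keeps a typo such as `USen` from silently producing an empty RSS feed.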
Input
Advanced search filters
You can also use advanced search operators in your queries, such as intitle, inurl, site, exclude operator -, exact match with double-quotes "", AND, OR and more.
Example queries with advanced operators
| Query | Explained |
|---|---|
| `intitle:"AI" AND site:bbc.com` | Finds articles with "AI" in the title from BBC. |
| `site:reuters.com "stock market" -crypto` | Finds stock market articles on Reuters, excluding crypto-related ones. |
| `"Samsung Galaxy S25" AND (review OR comparison)` | Searches for reviews or comparisons of Samsung Galaxy S25. |
| `site:nytimes.com intitle:"election" after:2025-01-01` | Retrieves NY Times articles published after 2025-01-01 with "election" in the title. |
| `inurl:blog OR inurl:news "climate change"` | Searches for climate change mentions in blog or news URLs. |
For more information, see the Google Guide on Search Operators.
Input fields
| Field | Type | Required | Description |
|---|---|---|---|
| `searchQuery` | string | Yes | Search query written as you would enter it in the Google News search bar. Advanced operators are supported, such as `intitle`, `inurl`, `site`, the exclude operator `-`, exact match with double quotes, `AND`, `OR`, and more. |
| `maxConcurrency` | integer | No | Maximum number of browser pages (tabs) processed in parallel in the Playwright browser, not the number of system threads. Capped to 50. |
| `languageAndRegion` | array of strings | Yes | One or more country/language pairs in the format `CC:ll`, for example `IT:it`, `US:en`, `LT:lt`. If you specify more than one pair, up to `maxSites` sites are returned for each pair. The available choices are a practical whitelist of commonly used Google News pairs, not an official or complete list of all combinations supported by Google News. |
| `maxSites` | integer | No | Maximum number of unique news sites returned for each language/region pair. The RSS feed may find more results, but the domain deduplication filter stops when it reaches `maxSites` unique domains. |
| `daysBack` | integer | No | Number of days back to search in Google News RSS. `chunkDays` is always set equal to `daysBack`. |
| `decodeInterval` | integer | No | Interval passed to `googlenewsdecoder`. Default is 1. It paces the decoding of Google News links by controlling how often the decoder attempts to resolve a URL before considering it failed or moving on to the next attempt. A value of 1 means a very short interval between attempts, which helps avoid overloading the service and reduces the risk of errors or rate limits. Increasing the value slows the process but can be more stable when decoding many links in sequence. |
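The `maxSites` behaviour described in the table can be pictured with a short sketch. It assumes that a "site" is identified by the URL's host name and that only the first URL per host is kept; both are assumptions, and `limit_to_unique_sites` is a hypothetical helper rather than the actor's own function.

```python
from urllib.parse import urlsplit


def limit_to_unique_sites(urls, max_sites):
    """Keep the first URL per domain and stop at max_sites unique domains."""
    seen = set()
    kept = []
    for url in urls:
        if len(kept) >= max_sites:
            break  # enough unique domains collected, ignore the rest of the feed
        domain = urlsplit(url).netloc.lower()
        if not domain or domain in seen:
            continue  # skip malformed URLs and domains already represented
        seen.add(domain)
        kept.append(url)
    return kept


if __name__ == "__main__":
    feed = [
        "https://www.pbs.org/newshour/a",
        "https://www.pbs.org/newshour/b",
        "https://www.theguardian.com/us-news/c",
        "https://example.com/d",
    ]
    print(limit_to_unique_sites(feed, max_sites=2))
```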
Example input

    {
        "daysBack": 100,
        "decodeInterval": 1,
        "languageAndRegion": ["US:en"],
        "maxConcurrency": 50,
        "maxSites": 10,
        "searchQuery": "jannik sinner"
    }
How it works
The actor runs in two stages. First, it queries Google News RSS for each selected language/region pair, applies the after: and before: filters derived from daysBack, decodes Google News links into publisher URLs, and deduplicates the results.
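The first stage can be sketched as follows. The `after:`/`before:` suffix and the `hl`, `gl`, and `ceid` parameters match the sample output in this README; the exact window boundaries (here `after` is today minus `daysBack` and `before` is tomorrow) are an assumption, and `build_rss_url` is a hypothetical helper name.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

RSS_SEARCH_ENDPOINT = "https://news.google.com/rss/search"


def build_rss_url(search_query, gl, hl, ceid, days_back, today=None):
    """Return (query, url) for one Google News RSS edition."""
    today = today or date.today()
    after = today - timedelta(days=days_back)   # assumed lower bound
    before = today + timedelta(days=1)          # assumed upper bound
    query = f"{search_query} after:{after.isoformat()} before:{before.isoformat()}"
    params = {"q": query, "hl": hl, "gl": gl, "ceid": ceid}
    return query, f"{RSS_SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    query, url = build_rss_url("jannik sinner", "US", "en", "US:en", days_back=7)
    print(query)
    print(url)
```

Because `chunkDays` equals `daysBack`, a single URL per edition covers the whole requested window in this sketch.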
Second, it opens each discovered news URL with Playwright, attempts to extract the main text from article-like selectors such as article, main, and common content containers, and falls back to a plain HTTP request plus HTML text cleanup when browser extraction fails.
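The try-then-fall-back control flow of the second stage might look like the sketch below. The two extractors are passed in as hypothetical async callables so the example stays independent of Playwright and any HTTP client; the `"http_fallback"` label is an assumption, while the combined error string follows the failure example in the Output section.

```python
import asyncio


async def extract_with_fallback(url, browser_extract, http_extract):
    """Try the browser extractor first, then the plain HTTP extractor.

    Both extractors are hypothetical async callables that take a URL and
    return text, raising an exception when extraction fails.
    """
    base = {"url": url, "success": False, "text": None, "error": None, "extractor": None}
    try:
        text = await browser_extract(url)
        return {**base, "success": True, "text": text, "extractor": "playwright"}
    except Exception as browser_error:
        try:
            text = await http_extract(url)
            # "http_fallback" is an assumed label, not confirmed by the source.
            return {**base, "success": True, "text": text, "extractor": "http_fallback"}
        except Exception as http_error:
            message = f"Playwright: {browser_error} | HTTP fallback: {http_error}"
            return {**base, "error": message}


if __name__ == "__main__":
    async def failing(url):
        raise RuntimeError("navigation timeout")

    async def working(url):
        return "article body"

    print(asyncio.run(extract_with_fallback("https://example.com", failing, working)))
```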
Concurrency
The actor uses a shared Playwright browser and limits concurrent page work with asyncio.Semaphore(maxConcurrency), so maxConcurrency represents the maximum number of browser pages or tabs processed in parallel, not the number of OS threads.
This makes concurrency easy to reason about operationally: if maxConcurrency is set to 10, the actor will process up to ten pages at the same time in Chromium.
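A minimal sketch of this pattern, assuming a hypothetical async `worker` that processes one URL, is shown below. The cap of 50 comes from the input description; `process_all` is an illustrative name only.

```python
import asyncio

MAX_OPEN_PAGES = 50  # cap stated in the input description


async def process_all(urls, worker, max_concurrency):
    """Run worker(url) for every URL with at most max_concurrency in flight."""
    limit = min(max(1, max_concurrency), MAX_OPEN_PAGES)
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(url)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(url) for url in urls))


if __name__ == "__main__":
    async def fake_page(url):
        await asyncio.sleep(0.01)  # stands in for opening and reading a page
        return f"done {url}"

    print(asyncio.run(process_all(["a", "b", "c"], fake_page, max_concurrency=2)))
```

All tasks are created up front, but only `limit` of them hold the semaphore at once, which is why the setting bounds open pages rather than threads.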
Language handling
For each languageAndRegion pair, the actor extracts the language part and builds the Accept-Language header used for both Playwright and HTTP fallback requests. For non-English editions it uses a language list such as ["it", "en"], which becomes a header like it,en;q=0.9; for English editions it uses ["en"].
This improves localization hints to publisher sites, but the final language of the returned page still depends on how each site handles geographic and language negotiation.
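The header construction described above can be reproduced with a few lines. The two function names are hypothetical, and the rule of lowering the quality value by 0.1 for each further language is an assumption that only matters for lists longer than the two-entry case documented here.

```python
def build_langs(language):
    """Return the language list for an edition, with English as fallback."""
    if language.split("-")[0].lower() == "en":
        return ["en"]
    return [language, "en"]


def build_accept_language(langs):
    """Join languages into an Accept-Language value with descending q-values."""
    parts = []
    for index, lang in enumerate(langs):
        if index == 0:
            parts.append(lang)  # the first language carries the implicit q=1.0
        else:
            quality = max(0.1, 1.0 - 0.1 * index)  # assumed step of 0.1
            parts.append(f"{lang};q={quality:.1f}")
    return ",".join(parts)


if __name__ == "__main__":
    for code in ["it", "en", "lt"]:
        print(code, "->", build_accept_language(build_langs(code)))
```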
Output
Each dataset item contains the processing result for one discovered publisher URL.
Successful result
The two items below were extracted with Playwright, as their extractor field shows.

    [
      {
        "position": 1,
        "title": "The importance of pomp and protocol as Trump goes to China - PBS",
        "domain": "http://www.pbs.org",
        "thumbnail": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBAUEBAYFBQUGBgYHCQ4JCQgICRINDQoOFRIWFhUSFBQXGiEcFxgfGRQUHSc....",
        "snippet": "From the moment President Donald Trump lands in Beijing on Wednesday, all eyes will be on how much of a spectacle the Chinese government rolls out, such as who lines up to greet him, what music is played and whether Chinese and American children wave flowers and flags.",
        "url": "https://www.pbs.org/newshour/world/the-importance-of-pomp-and-protocol-as-trump-goes-to-china",
        "success": true,
        "statusCode": 200,
        "langs": ["en"],
        "acceptLanguage": "en",
        "htmlLang": "en-us",
        "text": "Full Episode\nWednesday, May 13",
        "error": null,
        "extractor": "playwright",
        "published": "Wed, 13 May 2026 18:51:05 GMT",
        "googleNewsUrl": "https://news.google.com/rss/articles/CBMimAFBVV95cUxQRVh4b0xFbFdDOXdtTkJ6eHdCRVBWM0dKNWZ5elFLY00td3hBdzE0eHhPYmdFaHlpb25qMmZqOENkeEt4SndtMWl6SVpubl9EWkp0cEV3cmZGSzMyVmdoT1NMWGhwUS01dE8xTC1hX2tGYUVvOXpRamZVbERPZ2xZLXhGbjk2MkJ2UWhhU3dPVm1pSmVMNktIctIBngFBVV95cUxONk1UX0d2MUtYNFFENjAxMXhKTGI0WE5sOVVzM1FsNnZsY0lVQlNJN0ZJRksycGFtZHJtNnVZY1p0N0RVaUw3cXlnaTdKQWc2RTkybkl6WGxkczdBNzFuSGpXUF9SZ3IyVnpDMDNTaUdsNTY3UFZkYlZWb2J1dzBJaVBWOUlJbEk5NnBMS1pYZEF5dTgza21vdVZPVHY1UQ?oc=5",
        "query": "donald trump after:2026-05-09 before:2026-05-14",
        "hl": "en",
        "gl": "US",
        "ceid": "US:en"
      },
      {
        "position": 2,
        "title": "It’s 10pm, do you know where Donald Trump is? - The Guardian",
        "domain": "http://www.theguardian.com",
        "thumbnail": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAYEBAUEBAYFBQ...",
        "snippet": "The US president’s late-night Truth Social vitriol riddled with erratic capitalization and spelling? That’s leadership",
        "url": "https://www.theguardian.com/us-news/2026/may/13/its-1015-do-you-know-where-your-president-is",
        "success": true,
        "statusCode": 200,
        "langs": ["en"],
        "acceptLanguage": "en",
        "htmlLang": "en",
        "text": "View image in fullscreen\nDonald Trump in the Rose Garden of the White House on 11 May. Photograph: Kent Nishimura/AFP/Getty Images\nThis Week in Trumpland\nUS news\nAnalysis\nIt’s 10pm, do you know where Donald Trump is?\nRachel Leingang\n\nThe US president’s late-night Truth Social vitriol riddled with erratic capitalization and spelling? That’s leadership\n\n \n\nThis was originally published in This Week in Trumpland. Sign up to receive it in your inbox every Wednesday\n\nWed 13 May 2026 13.00 EDT\nShare\n\nGas prices are soaring because of blockages in the strait of Hormuz as part of the unauthorized war in Iran. There’s a highly consequential meeting with the president of China on the books for this week. The FDA director just stepped down over a disagreement over fruit-flavored vapes. Southern states are redrawing maps at breakneck pace to gerrymander Black voters out of their electoral voices.\n\nYou know what that means: it’s time for some conspiracy-laden, high-speed Truth Social posting.\n\nDonald Trump again this week went on a spree on his own social media site, posting more than 50 times in less than three hours, all after 10pm ET. He continued to post through it on Monday morning.\n\nIt was a greatest hits of the president’s enemies. He went after Barack Obama multiple times with false or unfounded accusations, claiming the former president plotted a coup against Trump and calling Obama the “most DEMONIC FORCE” in American politics. He shared altered images of Obama, Joe Biden and Nancy Pelosi in the Lincoln Memorial’s reflecting pool with the caption “Dumacrats love sewage”.\n\nWhat Trump’s Bible stunt says about his complicated history with Christianity\nRead more\n\n....",
        "error": null,
        "extractor": "playwright",
        "published": "Wed, 13 May 2026 22:00:00 GMT",
        "googleNewsUrl": "https://news.google.com/rss/articles/CBMilwFBVV95cUxPdVk0STB1WFVlZXhRV18wSGtHUGVpakJmSWJZcWt0UFNlb19xcHdqc1FhZzlsWHQtXzg1dUlNeDh2aWRvWEFBNzF1UXhubFBOSGliUkJmSmRncV9wVTFiLVNKd1UySHUtX3FtakV0S2FlaHdzcVY1UzBodDVuWGVWN084VUR6TWVmSVJRWmVjSGw3TzlqOUFj?oc=5",
        "query": "donald trump after:2026-05-09 before:2026-05-14",
        "hl": "en",
        "gl": "US",
        "ceid": "US:en"
      }
    ]
Final failure result

    {
        "url": "https://example.com",
        "success": false,
        "statusCode": null,
        "langs": ["it", "en"],
        "acceptLanguage": "it,en;q=0.9",
        "htmlLang": null,
        "text": null,
        "error": "Playwright: ... | HTTP fallback: ...",
        "extractor": null
    }
Logs and performance metrics
The actor emits structured logs that make it easier to debug and profile runs. Key events include:
- Actor lifecycle logs: `actor.start`, `actor.input`, and `actor.done`.
- RSS discovery logs: `rss.collect.start`, `rss.chunk.start`, and `rss.collect.stop`.
- Extraction logs: `playwright.start`, `extract.playwright_failed`, `extract.batch.start`, and `extract.batch.progress`.
- Timing logs: `timing.parse_feed`, `timing.decode_google_news_url`, `timing.playwright_extract`, `timing.httpx_fallback`, `timing.discovery_total`, `timing.extraction_total`, and `timing.actor_total`.
These logs are designed to help verify each step of the actor and identify bottlenecks in RSS fetching, URL decoding, browser extraction, fallback extraction, and total runtime.
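One way such timing events could be produced is a small context manager around each step. This is only a sketch: the event names come from the list above, but the `timed` helper, the logger name, and the `elapsed_ms` message format are assumptions rather than the actor's actual log layout.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("news_actor")  # hypothetical logger name


@contextmanager
def timed(event, **fields):
    """Log how long the wrapped block took under the given event name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed steps are still timed.
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s elapsed_ms=%.1f fields=%s", event, elapsed_ms, fields)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with timed("timing.parse_feed", ceid="US:en"):
        time.sleep(0.05)  # stands in for fetching and parsing one RSS feed
```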
Notes and limitations
- Google News RSS is used only for discovery; final extraction runs against publisher pages, not Google-hosted article pages.
- `maxSites` applies to unique news sites per language/region pair, not to the total number of articles globally across all pairs.
- `daysBack` controls the search lookback window, and the final actor sets `chunkDays = daysBack`, so each edition is queried with a single time chunk covering the full requested range.
- The extracted text depends on publisher HTML structure and on whether content is accessible to browser automation or plain HTTP requests.
- Some publishers may block automation, enforce paywalls, or deliver reduced content to automated clients.
- The `languageAndRegion` choices in the input schema can be represented as a practical whitelist, but Google does not publish a stable, official, exhaustive list of all supported `ceid` combinations.
Client code examples
Node.js

    import { ApifyClient } from 'apify-client';

    // Initialize the ApifyClient with API token
    const client = new ApifyClient({
        token: '<YOUR_API_TOKEN>',
    });

    // Prepare Actor input
    const input = {
        "searchQuery": "\"samsung galaxy s25\" -ultra",
        "maxConcurrency": 10,
        "languageAndRegion": [
            "IT:it",
            "US:en",
            "LT:lt"
        ],
        "maxSites": 50,
        "daysBack": 7,
        "decodeInterval": 1
    };

    (async () => {
        // Run the Actor and wait for it to finish
        const run = await client.actor("nQxTzKe5yrbyzysYh").call(input);

        // Fetch and print Actor results from the run's dataset (if any)
        console.log('Results from dataset');
        const { items } = await client.dataset(run.defaultDatasetId).listItems();
        items.forEach((item) => {
            console.dir(item);
        });
    })();
Python

    from apify_client import ApifyClient

    # Initialize the ApifyClient with your API token
    client = ApifyClient("<YOUR_API_TOKEN>")

    # Prepare the Actor input
    run_input = {
        "searchQuery": "\"samsung galaxy s25\" -ultra",
        "maxConcurrency": 10,
        "languageAndRegion": [
            "IT:it",
            "US:en",
            "LT:lt",
        ],
        "maxSites": 50,
        "daysBack": 7,
        "decodeInterval": 1,
    }

    # Run the Actor and wait for it to finish
    run = client.actor("nQxTzKe5yrbyzysYh").call(run_input=run_input)

    # Fetch and print Actor results from the run's dataset (if there are any)
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(item)
cURL

    # Set API token
    API_TOKEN=<YOUR_API_TOKEN>

    # Prepare Actor input
    cat > input.json <<'EOF'
    {
      "searchQuery": "\"samsung galaxy s25\" -ultra",
      "maxConcurrency": 10,
      "languageAndRegion": [
        "IT:it",
        "US:en",
        "LT:lt"
      ],
      "maxSites": 50,
      "daysBack": 7,
      "decodeInterval": 1
    }
    EOF

    # Run the Actor
    curl "https://api.apify.com/v2/acts/nQxTzKe5yrbyzysYh/runs?token=$API_TOKEN" \
      -X POST \
      -d @input.json \
      -H 'Content-Type: application/json'