Google News Scraper — Headlines, Sources, URLs

Pricing: from $20.00 / 1,000 results

Turn any Google News query into a deduplicated dataset of up to 2,000 articles: titles, sources, dates, RSS links, resolved publisher URLs, clean snippets. Multiple RSS time-window passes for depth beyond single-feed limits. Excel-ready CSV. No API key. Not affiliated with Google.


Google News RSS Scraper — Structured Headlines, Sources & Article URLs (Up to 2,000)

Turn any Google News search query into a deduplicated, structured dataset of headlines, publisher names, publication timestamps, RSS links, and resolved article URLs — without a Google API key or a headless browser. The Scrapeify Google News Scraper issues multiple RSS passes across time-window phases to overcome single-feed size limits, merges and deduplicates results across passes, and exports to a Dataset, RESULTS_CSV (Excel-friendly UTF-8 BOM), RESULTS_JSON, and a run OUTPUT summary.

Built for media monitoring teams, competitive intelligence analysts, AI content pipelines, and search visibility researchers who need repeatable, structured coverage of any news topic at scale.


Features

| Capability | Detail |
| --- | --- |
| RSS-first architecture | HTTP fetches to news.google.com/rss/search — lightweight, no browser required |
| Multi-phase coverage | Multiple when passes (1h, 7d, 30d, 1y) to approximate depth beyond single-feed limits |
| Deduplication | Merges results across phases using stable RSS identifiers and normalized URLs |
| Clean text fields | HTML stripped from descriptions for downstream NLP and embedding workflows |
| Canonical URL resolution | Parses Google redirect parameters to surface publisher articleUrl where available |
| 429 / 5xx retry logic | Bounded retry attempts with backoff for transient Google RSS errors |
| Up to 2,000 articles | Per-run cap with input validation; dedup stats in OUTPUT |
| Structured columns | position, keyword, title, link, articleUrl, pubDate, sourceName, description |
| Excel-ready CSV | RESULTS_CSV with UTF-8 BOM and quoted fields for Windows compatibility |
| Input flexibility | Aliases: query, searchQuery, q for keyword; maxResults for numberOfResults |

Use Cases

Media Monitoring & Press Tracking

Track news coverage for brand names, executives, products, or regulatory topics. Schedule hourly or daily runs and diff new link values since the previous run to surface breaking coverage before competitors do.
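
A minimal sketch of that diffing step with the Python client, assuming two completed runs of this actor; the dataset IDs are placeholders for IDs taken from your own scheduled runs.

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

def link_set(dataset_id: str) -> set[str]:
    # Collect the stable Google News RSS link of every row in a run's Dataset
    return {item["link"] for item in client.dataset(dataset_id).iterate_items()}

previous = link_set("PREVIOUS_RUN_DATASET_ID")  # placeholder: Dataset ID of the last run
current = link_set("LATEST_RUN_DATASET_ID")     # placeholder: Dataset ID of the newest run

new_links = current - previous                  # net-new coverage since the previous run
print(f"{len(new_links)} new articles to review")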

Competitive Intelligence

Monitor rival company and product news. Identify PR campaigns, product launches, partnership announcements, and negative press. Build a structured archive of competitor mentions for strategic planning.

SEO & Search Visibility Research

Map which publishers and articles rank in Google News for your target keywords. Identify content gaps, measure your brand's News presence, and track competitors' earned media performance over time.

AI Content Pipeline (Stage 1 Retrieval)

Use as Stage 1 of a retrieval stack: headlines + snippets cheaply triage topic relevance → LLMs decide which URLs warrant full article fetching and chunking → agents post summaries to ticketing or Slack.

RAG Knowledge Base Construction

Feed title + description + articleUrl into embedding pipelines. Store with keyword and sourceName metadata for semantic retrieval. Enable AI-generated answers with cited, timestamped news sources.

Industry Trend Analysis

Aggregate sourceName distributions and publication cadence for any keyword over time. Identify which outlets cover a topic most frequently, which publishers are emerging voices, and how news volume correlates with market events.
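
A short sketch of the sourceName aggregation using the Python client and the standard library; the keyword is illustrative.

import os
from collections import Counter
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("scrapeify/google-news-scraper").call(
    run_input={"keyword": "solid-state batteries", "numberOfResults": 500}
)

# Rank publishers by how often they appear for this keyword
by_source = Counter(
    item["sourceName"] for item in client.dataset(run["defaultDatasetId"]).iterate_items()
)
for source, count in by_source.most_common(10):
    print(f"{source}: {count}")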

E-Commerce & Brand Intelligence

Track product recalls, supply chain disruptions, competitor product launches, and category news that affects purchasing decisions. Combine with Amazon Scraper data for comprehensive market intelligence.

Automation & Alert Pipelines

Trigger Apify runs on a cron schedule. Diff against previous dataset by link or articleUrl. Push new articles to Slack, email, or a ticketing system automatically.

Data Aggregation & Multi-Source Research

Combine Google News results with Google Maps, Amazon, and Meta Ad Library actor outputs for comprehensive multi-source dossiers on brands, markets, or topics.

Academic & Policy Research

Track news coverage of policy topics, scientific developments, or public health issues at scale. Export to CSV for corpus analysis, NLP research, or data journalism workflows.


Why Choose This Actor

  • Lightweight and cost-efficient — HTTP-only; no browser fleet; suitable for high-frequency scheduling
  • Deduplication built in — fewer duplicate rows than naive single-RSS pulls
  • Production outputs — Dataset + CSV + JSON keys fit ETL, BI, and client-reporting workflows
  • Cloud-native — Apify standard Dataset and Key-value store semantics with scheduling and webhooks
  • Automation-ready — identical input contract across Console, REST API, and SDK clients

Quick Start

  1. Open the Scrapeify Google News Scraper on Apify Console.
  2. Enter a keyword (e.g. renewable energy policy) and set numberOfResults (e.g. 500).
  3. Click Start and wait for completion (typically seconds to low minutes).
  4. Export the Dataset as JSON or CSV, or download RESULTS_CSV from Storage → Key-value store.

Tip: Start with numberOfResults: 50 to validate keyword coverage before scaling to the 2,000-article limit.


Input Schema

{
  "keyword": "semiconductor supply chain",
  "numberOfResults": 500
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| keyword | string | Yes | News search phrase. Aliases: query, searchQuery, q. Supports operators (quotes, site:, etc.) |
| numberOfResults | integer | Yes | Unique articles to collect (1–2,000). Alias: maxResults |
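
Operators are passed through verbatim in keyword. A hedged example with the Python client (the query itself is illustrative), combining an exact-phrase match with a single-publisher filter:

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("scrapeify/google-news-scraper").call(
    run_input={
        "keyword": '"supply chain" site:reuters.com',  # exact phrase + site: filter
        "numberOfResults": 200,
    }
)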

Output Schema

Dataset Row (one row per article)

{
  "position": 1,
  "keyword": "semiconductor supply chain",
  "title": "Fab expansion slows as equipment backlog extends into 2027",
  "link": "https://news.google.com/rss/articles/CBMiXGh0dHBzOi8vd3d3LmV4YW1wbGUuY29tL3RlY2gvZmFiLWRlbGF5cw...",
  "articleUrl": "https://www.example.com/tech/fab-delays",
  "pubDate": "Wed, 07 May 2026 08:15:00 GMT",
  "sourceName": "TechCrunch",
  "description": "Equipment vendors report extended lead times for EUV modules as chipmakers compete for capacity at advanced nodes."
}
| Field | Type | Description |
| --- | --- | --- |
| position | integer | Deduped result position (1-based) |
| keyword | string | Input keyword echoed on every row for joins and audits |
| title | string | Article headline |
| link | string | Google News RSS link (use as stable identifier) |
| articleUrl | string | Resolved publisher URL when available; null if redirect omitted |
| pubDate | string | Publication date in RSS format |
| sourceName | string | Publisher name |
| description | string | Article snippet with HTML stripped |

Note: articleUrl resolves the Google redirect to the original publisher URL when redirect parameters are present. Use link as the stable dedup key; articleUrl as the citation URL for downstream crawling.

Run Summary (OUTPUT key in default KV store)

{
  "ok": true,
  "keyword": "semiconductor supply chain",
  "numberOfResults": 500,
  "returnedCount": 487,
  "meta": {
    "stoppedReason": "target_reached",
    "passesCompleted": 4,
    "totalFetched": 512,
    "uniqueAfterDedupe": 487
  },
  "scrapedAt": "2026-05-07T04:00:00.000Z",
  "download": {
    "dataset": "Export as CSV/JSON from Dataset tab",
    "keyValueStore": "RESULTS_CSV = Excel-friendly CSV (UTF-8 BOM, quoted fields)"
  },
  "csv": null,
  "note": "CSV too large to embed inline; use RESULTS_CSV key."
}
| Field | Type | Description |
| --- | --- | --- |
| ok | boolean | true if articles were returned; false on error or empty |
| returnedCount | integer | Unique articles after deduplication |
| meta.stoppedReason | string | target_reached, exhausted, or error descriptor |
| meta.passesCompleted | integer | Number of RSS phase passes completed |
| meta.uniqueAfterDedupe | integer | Articles remaining after cross-phase dedup |
| csv | string/null | Embedded CSV string when small enough; else null |

Additional KV keys: RESULTS_CSV (full CSV, UTF-8 BOM), RESULTS_JSON (full JSON array).
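
A short sketch of retrieving those keys with the Python client, assuming RESULTS_CSV is returned as text and OUTPUT as a parsed object:

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("scrapeify/google-news-scraper").call(
    run_input={"keyword": "climate policy", "numberOfResults": 250}
)

store = client.key_value_store(run["defaultKeyValueStoreId"])
summary = store.get_record("OUTPUT")["value"]      # run summary
csv_record = store.get_record("RESULTS_CSV")       # Excel-friendly CSV

with open("google_news.csv", "w", encoding="utf-8") as f:
    f.write(csv_record["value"])

print(summary["returnedCount"], "articles written to google_news.csv")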


API Examples

cURL

curl "https://api.apify.com/v2/acts/scrapeify~google-news-scraper/runs?token=$APIFY_TOKEN" \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "keyword": "climate policy",
    "numberOfResults": 250
  }'

Python

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

run = client.actor("scrapeify/google-news-scraper").call(
    run_input={"keyword": "climate policy", "numberOfResults": 250}
)

for article in client.dataset(run["defaultDatasetId"]).iterate_items():
    url = article.get("articleUrl") or article["link"]
    print(article["title"], article["sourceName"], url)

JavaScript / Node.js

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor("scrapeify/google-news-scraper").call({
  keyword: "climate policy",
  numberOfResults: 250,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(
  `Collected ${items.length} unique articles from ${new Set(items.map((a) => a.sourceName)).size} publishers`
);

Integration Examples

ChatGPT / Custom GPT Actions

Register the Apify run endpoint as a Custom GPT action. Return title, sourceName, pubDate, and articleUrl as a JSON array. The model can summarize recent coverage, identify trends, or answer questions grounded in actual news articles.
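
A hedged sketch of the payload-shaping step behind such an action: trim each Dataset row to the fields the GPT needs before returning the array (the field selection mirrors the list above).

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("scrapeify/google-news-scraper").call(
    run_input={"keyword": "climate policy", "numberOfResults": 100}
)

# Compact array for the model: smaller payloads keep the action response fast
articles = [
    {
        "title": a["title"],
        "sourceName": a["sourceName"],
        "pubDate": a["pubDate"],
        "articleUrl": a.get("articleUrl") or a["link"],
    }
    for a in client.dataset(run["defaultDatasetId"]).iterate_items()
]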

Claude Tool Use

import os
from apify_client import ApifyClient
from langchain.tools import tool

client = ApifyClient(os.environ["APIFY_TOKEN"])

@tool
def get_recent_news(keyword: str, n: int = 100) -> list:
    """Fetch recent Google News articles for a keyword. Returns structured article data."""
    run = client.actor("scrapeify/google-news-scraper").call(
        run_input={"keyword": keyword, "numberOfResults": n}
    )
    return client.dataset(run["defaultDatasetId"]).list_items().items

Pass the structured list to Claude for summarization, entity extraction, or sentiment analysis with articleUrl citations.

Gemini

Fetch 500+ article headlines and snippets → pass to Gemini's long-context window → generate a comprehensive topic briefing with source attribution and emerging narrative threads.

LangChain

import os
from apify_client import ApifyClient
from langchain.tools import tool
from langchain.text_splitter import RecursiveCharacterTextSplitter  # for chunking full article text downstream

client = ApifyClient(os.environ["APIFY_TOKEN"])

@tool
def fetch_news_corpus(keyword: str, n: int) -> list:
    """Search Google News and return article data for RAG ingestion."""
    run = client.actor("scrapeify/google-news-scraper").call(
        run_input={"keyword": keyword, "numberOfResults": n}
    )
    return client.dataset(run["defaultDatasetId"]).list_items().items

# Use as a retriever tool in a ConversationalRetrievalChain

CrewAI

NewsResearchAgent fetches articles with this tool. AnalysisAgent identifies key themes and entities. WritingAgent drafts a briefing document with source citations and publication dates.

AutoGen

# UserProxyAgent: "Summarize the last 100 news articles about EV battery technology"
# ResearchAgent: calls google_news_scraper tool → returns structured JSON
# SynthesisAgent: extracts key claims, publisher perspectives, and publication timeline

n8n / Make.com / Zapier

Cron trigger → Apify run → iterate Dataset items → filter for new link values since last run → push to Slack digest, Notion page, or HubSpot deal activity feed.

RAG Systems

# 1. Fetch articles
articles = get_recent_news("renewable energy", n=500)

# 2. Create documents for the vector store
from langchain.schema import Document

docs = [
    Document(
        page_content=f"{a['title']}. {a['description']}",
        metadata={
            "url": a.get("articleUrl") or a["link"],
            "source": a["sourceName"],
            "date": a["pubDate"],
        },
    )
    for a in articles
]

# 3. Embed and index (vectorstore: any initialized vector store in your pipeline)
vectorstore.add_documents(docs)

Frequently Asked Questions

1. Do I need a Google API key or Google Cloud account? No. The actor fetches public RSS endpoints from news.google.com — no API credentials required.

2. Why do I sometimes get fewer articles than requested? There may not be enough distinct articles across RSS phases for the keyword. Inspect meta.uniqueAfterDedupe and meta.stoppedReason in OUTPUT.

3. When is articleUrl null? Some Google News RSS entries don't include redirect parameters that allow URL resolution. Fall back to link for stable identification.

4. How does deduplication work across phases? The actor tracks stable RSS identifiers and normalized URLs across all passes. Articles seen in multiple time-window phases are merged into a single row.

5. Can I search by country or language? The current implementation uses default hl and gl parameters. Fork the actor for specific ceid locale pairs (e.g. ceid=GB%3Aen for UK English).
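
For reference, the public RSS search endpoint accepts hl, gl, and ceid query parameters directly; a forked actor would build request URLs roughly like this sketch (independent of this actor's internals):

import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "renewable energy policy",
    "hl": "en-GB",    # interface language
    "gl": "GB",       # country
    "ceid": "GB:en",  # edition pair (country:language)
})
url = f"https://news.google.com/rss/search?{params}"

with urllib.request.urlopen(url) as response:
    feed_xml = response.read().decode("utf-8")
print(feed_xml[:200])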

6. Is full article text included? No — only RSS fields: title, snippet, source, date, and URL. Crawl articleUrl with a separate article fetcher to retrieve full text.

7. How fast are runs typically? Seconds to low minutes depending on numberOfResults and Google RSS response times.

8. How does the actor handle 429 rate limiting? Bounded retry attempts with backoff. Avoid launching excessive parallel runs from a single IP for the same keyword.

9. Does RESULTS_CSV open correctly in Excel? Yes — RESULTS_CSV uses UTF-8 BOM encoding and quoted fields for Windows Excel compatibility.

10. Can I schedule hourly monitoring runs? Yes — use Apify Schedules combined with webhooks to your notification stack.

11. Are publication dates reliable? pubDate reflects what the RSS feed reports. Some publishers use the crawl date rather than original publication date.

12. Can I combine results with other Scrapeify actors? Yes — join Google News results with Maps, Amazon, or Ad Library actor outputs in your data warehouse by keyword or entity.

13. What input aliases are supported? query, searchQuery, q for the keyword; maxResults for numberOfResults.

14. What causes an empty dataset with error rows? Check message in pushed error items and OUTPUT.ok for details. Common causes: empty keyword, Google temporarily blocking the IP, or zero-result queries.

15. Can I use this for real-time news alerts? Hourly runs are practical. For sub-minute latency, a dedicated news API is more appropriate.

16. How do I ingest into a vector database? Use title + description as the text content. Store articleUrl, sourceName, keyword, and pubDate as metadata for filtering and citation.

17. What is the difference between link and articleUrl? link is the Google News RSS URL — use as the stable dedup key. articleUrl is the resolved publisher URL — use as the citation link for downstream crawling and user-facing references.

18. Can I track which publishers cover a topic most? Yes — aggregate sourceName values across Dataset rows. Sort by frequency to rank publishers by topic coverage volume.

19. Does the actor support Google Alerts-style monitoring? This actor provides structured rows for programmatic pipelines. For email digests, Google Alerts is a simpler option. For database-integrated monitoring and downstream automation, this actor is the better choice.

20. Is there an upper limit per keyword per run? Yes — 2,000 unique articles per run (input validation). For broader coverage, run multiple passes across overlapping time windows with different when parameters.

21. How should I handle GDPR for article data? Headlines and snippets may mention individuals. Apply your organization's data retention and classification policies to stored news corpora.

22. Can I retrieve articles from specific publishers? Add site:publisher.com to the keyword query to target a specific domain in Google News search.

23. What is meta.passesCompleted? The number of RSS phase passes the actor completed (e.g. 1h, 7d, 30d, 1y windows). More passes generally yield broader coverage.

24. Does this include paywalled articles? Only metadata (title, snippet, source, URL) is collected from RSS — no paywall bypass. Full text requires a separate article fetcher.

25. How do I build an idempotent monitoring pipeline? Key on link or normalized articleUrl before inserting into your database. Compare new link sets against the previous run to identify net-new coverage.


Best Practices

  • Stagger schedules — don't hammer RSS from many simultaneous tasks on one egress IP
  • Key on link for idempotent pipelines before inserting into Postgres or vector stores
  • Rate-limit downstream crawling — respect robots.txt and publisher terms when fetching full article text from articleUrl
  • Start small — validate with numberOfResults: 50 before scaling to 2,000
  • Monitor returnedCount trends — alert on significant drops week-over-week for fixed keywords
  • Archive RESULTS_JSON alongside OUTPUT for each scheduled run to enable historical diff analysis
  • Use keyword column for joins — it's echoed on every row, making multi-keyword batch pipelines easy to merge

Performance & Scalability

| Factor | Guidance |
| --- | --- |
| Throughput | HTTP-only; highly efficient for high-frequency scheduling |
| Upper bound | 2,000 deduplicated articles per run |
| Run time | Seconds to low minutes depending on RSS response latency and numberOfResults |
| Horizontal scale | Run parallel actors per keyword list — each is independent |
| Storage | Dataset is authoritative; RESULTS_CSV and RESULTS_JSON may be limited by KV size for large runs |

AI & Automation Workflows

3-stage retrieval pipeline (a minimal sketch follows the list):

  1. Stage 1 (this actor): headlines + snippets cheaply triage topic relevance
  2. Stage 2 (article fetcher): crawl articleUrl for full text on relevant articles
  3. Stage 3 (LLM): chunk, embed, and index full text; generate answers with articleUrl citations
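
A minimal end-to-end sketch of the three stages; fetch_full_text and summarize are hypothetical placeholders for your own article fetcher and LLM call, and the relevance filter is deliberately naive.

import os
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])

def fetch_full_text(url: str) -> str:
    """Hypothetical Stage 2 fetcher: swap in your own article crawler."""
    return ""

def summarize(texts: list[str]) -> str:
    """Hypothetical Stage 3 LLM call: swap in your own model invocation."""
    return f"Briefing over {len(texts)} articles"

# Stage 1: cheap triage on headlines + snippets from this actor
run = client.actor("scrapeify/google-news-scraper").call(
    run_input={"keyword": "EV battery technology", "numberOfResults": 200}
)
items = client.dataset(run["defaultDatasetId"]).list_items().items
relevant = [a for a in items if "solid-state" in (a["title"] + " " + (a["description"] or "")).lower()]

# Stage 2: fetch full text only for URLs that survived triage
texts = [fetch_full_text(a.get("articleUrl") or a["link"]) for a in relevant]

# Stage 3: summarize, with articleUrl values in `relevant` available as citations
print(summarize(texts))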

Competitive briefing automation: Schedule weekly Google News runs for competitor brand names → extract key themes from titles and snippets using an LLM → generate competitive intelligence brief → post to Confluence or Notion.

Trend detection pipeline: Daily runs for industry keywords → aggregate pubDate distribution → detect volume spikes indicating major news events → alert stakeholders before the news cycle peaks.
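
A small sketch of the spike check, assuming pubDate follows the RFC 822 format shown in the output example; the 2x-baseline threshold is an illustrative choice.

import os
from collections import Counter
from email.utils import parsedate_to_datetime
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("scrapeify/google-news-scraper").call(
    run_input={"keyword": "semiconductor supply chain", "numberOfResults": 1000}
)

# Count articles per publication day
daily = Counter(
    parsedate_to_datetime(item["pubDate"]).date()
    for item in client.dataset(run["defaultDatasetId"]).iterate_items()
)

baseline = sum(daily.values()) / max(len(daily), 1)
spikes = {day: n for day, n in sorted(daily.items()) if n > 2 * baseline}
print("Spike days:", spikes)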


Error Handling

| Scenario | Behavior |
| --- | --- |
| Missing or empty keyword | Error row in Dataset + OUTPUT.ok: false |
| Empty results | Completes with returnedCount = 0; meta.stoppedReason = exhausted |
| 429 rate limiting | Bounded retries with backoff; persistent failures surface in run logs |
| KV size limits | csv field in OUTPUT set to null; use RESULTS_CSV KV key or Dataset export |
| Transient HTTP errors | Retried per module constants; logged if persistent |

Trust & Reliability

Scrapeify maintains this actor for repeatable news monitoring with structured outputs, explicit dedup statistics, and clear storage keys — suitable for production automation when combined with appropriate compliance review and downstream content policies.


Explore the full Scrapeify suite — chain these actors together for end-to-end automation pipelines:

| Actor | What it does |
| --- | --- |
| Amazon Scraper | ASINs, prices, sponsored flags across 23 marketplaces |
| Instagram Ad Library Scraper | Instagram-only ads from Meta Ad Library |
| Meta Ad Library Scraper | Facebook & Instagram ads with sort options |
| WhatsApp Ad Scraper | Click-to-WhatsApp ad creatives |
| YouTube Video Downloader | Videos & audio to Apify Key-Value Store |
| Meta Brand & Page ID Finder | Resolve brand names to numeric Page IDs |
| Google Maps Scraper | Local business leads, reviews, emails, contacts |

Google News is a trademark of Google LLC. This actor is not affiliated with or endorsed by Google.