
Google News Bulk Scraper

Google News → publisher URLs → clean article text + images + metadata, with JS rendering, paywall, and consent-page fallbacks. HTTP-first, Playwright only when needed.

Scrape one query or thousands in a single run. Each article lands as its own dataset row with the full text, images, author, source, language, and a quality score — ready for NLP pipelines, media monitoring, or research datasets.

What You Get

Each article in the output includes:

  • title — headline as published
  • url — canonical publisher URL (not the Google News redirect)
  • source — publisher name (e.g. "Reuters", "TechCrunch")
  • publishedAt — ISO 8601 timestamp
  • author — byline when available
  • text — clean full-text content (300+ characters, validated)
  • images — OG image, featured image, and in-article images with alt text
  • language — detected content language
  • extractionSuccess — boolean flag for downstream filtering
  • contentQuality — score (0-100), level (low/medium/high), and warnings

Set fetchArticleDetails: false to skip crawling and get RSS metadata only (title, source, date, link) at minimal cost.
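
For example, downstream code can use extractionSuccess and contentQuality to keep only trustworthy articles. A minimal sketch (the threshold of 70 is an arbitrary choice; items is the dataset array returned in the API example below):

// Keep only articles that extracted successfully and scored well.
const usable = items.filter(
  (article) => article.extractionSuccess && article.contentQuality.score >= 70
);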

Quick Start

Using Apify Console

  1. Visit Apify Console
  2. Search for "Google News Scraper"
  3. Configure your search parameters
  4. Run the actor

Using Apify CLI

npm install -g apify-cli

# Single query
apify call google-news-scraper --input '{
  "query": "Tesla",
  "maxItemsPerUrl": 10
}'

# Multiple queries (string shorthand)
apify call google-news-scraper --input '{
  "queries": ["tesla", "apple"],
  "maxItemsPerUrl": 10
}'

# Multiple queries with passthrough fields
apify call google-news-scraper --input '{
  "queries": [
    { "query": "Kim Kardashian", "profileUrl": "https://news.google.com/search?q=kim+kardashian" },
    { "query": "MrBeast" }
  ],
  "maxItemsPerUrl": 10,
  "maxItems": 15
}'

Using Apify API

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('google-news-scraper').call({
  queries: [
    { query: 'Taylor Swift', profileUrl: 'https://news.google.com/search?q=taylor+swift' },
    { query: 'Elon Musk', profileUrl: 'https://news.google.com/search?q=elon+musk' },
  ],
  maxItemsPerUrl: 10,
  maxItems: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();

// items is a flat array of articles, each with query + passthrough fields merged in
console.log(items);

Input Modes

Most Common: Single Query

{
  "query": "artificial intelligence",
  "maxItemsPerUrl": 10
}

That's it — one query, up to 10 articles.

Bulk: Multiple Queries

Pass an array of strings to scrape several topics in one run:

{
  "queries": ["tesla", "apple", "nvidia"],
  "maxItemsPerUrl": 10
}

Advanced: Queries with Passthrough Fields

Each query can be an object. Any field besides query is passed through to every output article for that query — useful for linking results back to your own IDs, profile URLs, or tags:

{
  "queries": [
    { "query": "Kim Kardashian", "profileUrl": "https://news.google.com/search?q=kim+kardashian" },
    { "query": "MrBeast", "customField": "my-tag" },
    "Taylor Swift"
  ],
  "maxItemsPerUrl": 10,
  "maxItems": 25
}

Precedence: queries > query. If both are provided, queries wins.
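
Because passthrough fields are merged into every output article, you can reattach results to your own records after a run. A minimal sketch, reusing client and run from the API example above (profileUrl is just the passthrough field from this example; any field works):

// Fetch the flat article array, then group it back by the passthrough field.
const { items } = await client.dataset(run.defaultDatasetId).listItems();

const byProfile = new Map();
for (const article of items) {
  const key = article.profileUrl ?? article.query; // fall back to the query string
  if (!byProfile.has(key)) byProfile.set(key, []);
  byProfile.get(key).push(article);
}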

Configuration

Input Parameters

Parameter | Type | Required | Default | Description
query | string | No* | - | Simple search query string
queries | array | No* | - | Array of strings or objects with query and optional passthrough fields
maxItemsPerUrl | integer | No | 50 | Max articles per individual query
maxItems | integer | No | 0 | Optional global cap on total articles (0 = unlimited)
fetchArticleDetails | boolean | No | true | If false, skip article crawling and return RSS metadata only
region | string | No | "US" | Country code (US, GB, CA, AU, DE, ES, MX, IT)
language | string | No | "en-US" | Language code (en-US, en-GB, en-CA, en-AU, de-DE, es-ES, es-MX, it-IT)
dateFrom | string | No | - | Start date (YYYY-MM-DD)
dateTo | string | No | - | End date (YYYY-MM-DD)
disableBrowserFallback | boolean | No | false | Skip Playwright fallback — cheaper but may return fewer articles
proxyConfiguration | object | No | Apify Proxy enabled | Proxy settings; defaults to Apify Proxy

*At least one of query or queries is required.
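
Putting several parameters together, a region-scoped, date-bounded, HTTP-only run might look like this (the queries are illustrative; every key comes from the table above):

{
  "queries": ["energiewende", "solarpaket"],
  "region": "DE",
  "language": "de-DE",
  "dateFrom": "2025-07-01",
  "dateTo": "2025-07-31",
  "disableBrowserFallback": true,
  "maxItemsPerUrl": 20
}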

How Extraction Works

The pipeline resolves every Google News redirect to the real publisher URL, then fetches each page in two stages:

  1. HTTP fetch — fast, cheap, works for most publishers
  2. Playwright browser — automatic fallback for JS-rendered or consent-gated pages

On each fetched page, six extraction strategies are tried in order (Readability, Extractus, JSON-LD, custom selectors, meta tags, heuristics), stopping at the first one that produces 300+ characters of validated text plus images.

Every article is quality-scored (text length, image presence, error-page detection). Low-quality results are filtered before they reach your dataset.
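
To make the flow concrete, here is a self-contained sketch of the same idea (not the actor's actual internals; the two stand-in strategies below are crude placeholders for the six named above):

import { chromium } from 'playwright'; // assumed dependency for the fallback stage

const MIN_TEXT_LENGTH = 300;

// Stage 1: plain HTTP fetch. Stage 2: full browser, used only if stage 1 fails.
const fetchHttp = async (url) => (await fetch(url)).text();

const fetchWithPlaywright = async (url) => {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    await browser.close();
  }
};

// Stand-ins for the ordered strategies; real ones parse HTML far more carefully.
const strategies = [
  (html) => html.match(/<article[^>]*>([\s\S]*?)<\/article>/i)?.[1] ?? null,
  (html) => html.replace(/<[^>]+>/g, ' ').trim(), // last-resort heuristic
];

async function extractArticle(url) {
  for (const fetchPage of [fetchHttp, fetchWithPlaywright]) {
    const html = await fetchPage(url).catch(() => null);
    if (!html) continue;
    for (const strategy of strategies) {
      const text = strategy(html);
      if (text && text.length >= MIN_TEXT_LENGTH) return { url, text };
    }
  }
  return null; // low-quality pages are dropped by the quality gate
}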

Estimated Cost

All costs depend on article count, target sites, and proxy tier. The numbers below are rough guidelines based on typical runs using Apify Proxy (datacenter tier).

Scenario | Articles | Typical Cost
RSS metadata only (fetchArticleDetails: false) | 100 | ~$0.01 – $0.02
Full text, HTTP-first (most sites) | 100 | ~$0.05 – $0.10
Full text, mixed HTTP + Playwright fallback | 100 | ~$0.10 – $0.25
Heavy JS sites (frequent Playwright) | 100 | ~$0.20 – $0.50

Cost levers you control:

  • fetchArticleDetails: false — skip article crawling entirely for near-zero cost
  • disableBrowserFallback: true — stay HTTP-only, ~2-5x cheaper, fewer articles from JS-heavy sites
  • maxItemsPerUrl / maxItems — hard caps on article count
  • Proxy tier — datacenter is default and cheapest; residential auto-escalates only on repeated 429/403 errors
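
Combining the first and third levers gives the cheapest possible run: metadata only, hard-capped. An example input (all keys from the Configuration table):

{
  "queries": ["tesla", "apple", "nvidia"],
  "fetchArticleDetails": false,
  "maxItemsPerUrl": 25,
  "maxItems": 50
}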

Limitations

Be aware of these before you buy:

  • Paywalled sites — articles behind hard paywalls (WSJ, FT, NYT subscriber-only) will return partial text or fail. The scraper extracts whatever is publicly visible.
  • Heavy bot protection — sites with aggressive Cloudflare challenges or CAPTCHAs may need multiple retries and residential proxies, increasing cost.
  • Region/language variance — Google News returns different articles depending on region and language. The same query may yield different results from US vs DE.
  • RSS feed limits — Google News RSS feeds return a limited window of articles (roughly 24-72 hours). For historical coverage, use dateFrom/dateTo date slicing, which the scraper handles automatically.
  • Image availability — some publishers strip images or serve them via CDN policies that block external access. Articles without valid images receive a lower quality score.

Output Format

Output is a flat array of articles. Each article is a separate dataset entry with the query string and any passthrough fields merged at the top level:

[
  {
    "query": "Taylor Swift",
    "profileUrl": "https://news.google.com/search?q=taylor+swift",
    "title": "Taylor Swift Announces New Album - Billboard",
    "url": "https://www.billboard.com/2025/08/05/taylor-swift-new-album.html",
    "source": "Billboard",
    "publishedAt": "2025-08-05T14:08:57.000Z",
    "author": "Jane Smith",
    "text": "Full article content...",
    "description": "Brief summary of the article...",
    "images": [
      {
        "url": "https://example.com/image.jpg",
        "type": "featured-og",
        "alt": "Image description"
      }
    ],
    "tags": ["Taylor Swift"],
    "language": "en",
    "extractionSuccess": true,
    "contentQuality": {
      "score": 85,
      "level": "high",
      "isValid": true,
      "warnings": []
    }
  },
  {
    "query": "MrBeast",
    "customField": "test-passthrough",
    "title": "MrBeast Breaks YouTube Record",
    "url": "https://www.example.com/mrbeast-record.html",
    "source": "Example News",
    "publishedAt": "2025-08-05T10:00:00.000Z",
    "text": "Full article content...",
    "..."
  }
]

Development

Setup

git clone https://github.com/YevheniiM/google-news-scrapper
cd google-news-scrapper
npm install

Running

# Production
npm start

# Development mode (DEBUG=true, NODE_ENV=development)
npm run dev

# Development with file watching
npm run dev:watch

Create an INPUT.json at the project root for local input:

{
  "queries": [{ "query": "Taylor Swift" }, { "query": "Elon Musk" }],
  "maxItemsPerUrl": 5
}

Testing

# Run all tests
npm test

# Watch mode
npm run test:watch

# With coverage
npm run test:coverage

Formatting

npm run format
npm run format:check

License

MIT. See LICENSE for details.

Acknowledgments