AI News Scraper: Multi-Source Article Aggregator with Sentiment Analysis

Extract and summarize news articles from TechCrunch, Reuters, Google News, and Hacker News with automatic sentiment analysis, entity extraction, and AI-powered summarization.

What is AI News Scraper?

AI News Scraper is a comprehensive news aggregation and analysis tool that searches multiple authoritative sources for topics you specify, extracts full article content, and generates structured data with automatic sentiment classification and entity recognition. The scraper crawls Google News RSS feeds, TechCrunch, Reuters, and Hacker News simultaneously, filtering results by relevance and quality to deliver actionable insights.

This scraper is designed for businesses conducting competitive intelligence, market research firms monitoring industry trends, content creators curating newsletters, investment analysts tracking sentiment around stocks or sectors, and researchers gathering structured data for academic or commercial analysis. The actor handles multi-language queries (English, German, French, Spanish) and supports high-volume scraping with built-in concurrency controls and duplicate prevention.

Whether you need to track brand mentions across tech media, analyze market sentiment on emerging technologies, aggregate competitor news, or build a custom news feed for your dashboard, AI News Scraper provides clean, structured JSON output with full article text, extractive summaries, sentiment scores, and extracted entities (people, companies, locations). The scraper automatically filters thin content, validates topic relevance, and respects rate limits to ensure reliable, high-quality results.

Data Fields

The scraper extracts the following fields for each article:

| Field | Type | Description |
|---|---|---|
| title | String | Article headline or title |
| source | String | Publication name (e.g., "TechCrunch", "Reuters", "Google News", "Hacker News") |
| url | String | Direct URL to the original article |
| publishedDate | String | Publication date in ISO 8601 format or RSS pubDate format (may be null for some sources) |
| author | String | Article author name (may be null if not available) |
| fullText | String | Complete article body text with whitespace normalized |
| category | String | Search topic/keyword that matched this article |
| summary | String | Extractive summary combining lead sentences (first 3) and up to 2 key quotes from the article (see the sketch below) |
| sentiment | String | Sentiment classification: "positive", "negative", or "neutral", based on keyword analysis |
| entities | Object | Extracted entities with three sub-fields: people (array of person names), companies (array of company names), locations (array of geographic locations) |
| rssOnly | Boolean | True if the article was extracted from an RSS feed only (Google News headlines) without full-text scraping |
| hnPoints | Integer | Hacker News score/points (only included for Hacker News articles) |
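
To illustrate how the summary field described above could be assembled (lead sentences plus key quotes), here is a minimal sketch. It is an assumption about the approach, not the actor's actual code; the sentence splitter and quote pattern are deliberately simplified.

```typescript
// Hypothetical sketch of the extractive summary described above:
// first 3 sentences plus up to 2 quoted passages. Not the actor's
// actual implementation; sentence splitting is naively regex-based.
function buildSummary(fullText: string): string {
  // Split on sentence-ending punctuation followed by whitespace.
  const sentences = fullText.split(/(?<=[.!?])\s+/);
  const lead = sentences.slice(0, 3);

  // Collect up to 2 double-quoted passages as "key quotes".
  const quotes = (fullText.match(/"[^"]{20,300}"/g) ?? []).slice(0, 2);

  return [...lead, ...quotes].join(' ');
}
```

Note that a quote already contained in the lead sentences can appear twice, which matches the behavior visible in the example output further below.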

How to Scrape AI News and Tech Articles

Follow this step-by-step guide to extract news articles with sentiment analysis:

Step 1: Define Your Topics

Specify one or more topics, keywords, or search queries you want to track. Topics can be broad ("artificial intelligence") or specific ("GPT-5", "NVIDIA earnings"). Each topic will be searched across all selected sources.

Step 2: Select News Sources

Choose which sources to scrape: Google News (RSS feeds with headlines), TechCrunch (full articles from search results), Hacker News (stories from the Algolia API with HN scores), or Reuters (full articles from search results). You can select all sources or focus on specific ones.

Step 3: Set Maximum Article Limit

Specify how many articles to extract (default: 30). The scraper distributes this limit across all topics and sources, stopping when the limit is reached. For large-scale monitoring, increase to 100-500 articles.

Step 4: Choose Language (Google News Only)

Select the language for Google News queries: English (en), German (de), French (fr), or Spanish (es). This affects Google News RSS feeds only; other sources are crawled in their default language.

Step 5: Run the Scraper

Click "Start" to begin crawling. The scraper will search all selected sources for your topics, extract article links, and scrape full content from detail pages. Progress is logged in real-time.

Step 6: Monitor Progress

Check the log tab to see which sources are being crawled, how many articles have been found, and which articles are being saved. The scraper automatically skips thin content (under 200 characters) and irrelevant articles.

Step 7: Download Results

Once the run completes, download results in JSON, CSV, Excel, or HTML format. Each article includes full text, summary, sentiment, entities, and metadata. Use the data for analysis, dashboards, alerts, or research.

Input Parameters

Configure the scraper with these parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| topics | Array of strings | Yes | - | List of topics or keywords to search for. Example: ["OpenAI", "machine learning", "ChatGPT"]. Each topic is searched across all selected sources. |
| sources | Array of strings | No | All sources | Which sources to crawl. Options: "google-news", "techcrunch", "hackernews", "reuters". If omitted, all sources are used. Example: ["google-news", "techcrunch"] |
| maxArticles | Integer | No | 30 | Maximum number of articles to return. Minimum: 1. The scraper stops when this limit is reached across all topics and sources. |
| language | String | No | "en" | Language for Google News queries. Options: "en" (English), "de" (German), "fr" (French), "es" (Spanish). Affects Google News RSS feeds only. |

Example Input

```json
{
  "topics": ["OpenAI", "generative AI", "GPT-5", "NVIDIA AI chips"],
  "sources": ["google-news", "techcrunch", "reuters", "hackernews"],
  "maxArticles": 100,
  "language": "en"
}
```

This configuration will search Google News, TechCrunch, Reuters, and Hacker News for articles about OpenAI, generative AI, GPT-5, and NVIDIA AI chips, extracting up to 100 articles total in English.
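
If you prefer to run the actor programmatically, a minimal sketch using the apify-client package is shown below. The actor ID placeholder is hypothetical; substitute the real ID from the Apify Console.

```typescript
import { ApifyClient } from 'apify-client';

// The token comes from your Apify account; the actor ID is a placeholder.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the run with the example input above and wait for it to finish.
const run = await client.actor('<ACTOR_ID>').call({
  topics: ['OpenAI', 'generative AI', 'GPT-5', 'NVIDIA AI chips'],
  sources: ['google-news', 'techcrunch', 'reuters', 'hackernews'],
  maxArticles: 100,
  language: 'en',
});

// Fetch the scraped articles from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Fetched ${items.length} articles`);
```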

Example Output

Full Article from TechCrunch

```json
{
  "title": "OpenAI Launches GPT-5 with Breakthrough Multimodal Capabilities",
  "source": "TechCrunch",
  "url": "https://techcrunch.com/2025/01/15/openai-launches-gpt-5-breakthrough-multimodal/",
  "publishedDate": "2025-01-15T09:00:00Z",
  "author": "Sarah Mitchell",
  "fullText": "OpenAI today announced the release of GPT-5, its most advanced language model to date. The new model features breakthrough multimodal capabilities, allowing it to process text, images, audio, and video simultaneously. CEO Sam Altman said in a statement, \"GPT-5 represents a quantum leap in AI reasoning and understanding.\" Early tests show the model achieves 95% accuracy on complex reasoning tasks, a significant improvement over GPT-4. The model was trained on a dataset 10 times larger than its predecessor, incorporating diverse data sources from scientific papers, code repositories, and creative works. Industry analysts predict GPT-5 will drive substantial growth in enterprise AI adoption. Microsoft, OpenAI's primary investor, announced it will integrate GPT-5 into Azure cloud services within the next quarter. The release comes amid growing competition from Anthropic, Google, and Meta in the large language model space.",
  "category": "OpenAI",
  "summary": "OpenAI today announced the release of GPT-5, its most advanced language model to date. The new model features breakthrough multimodal capabilities, allowing it to process text, images, audio, and video simultaneously. CEO Sam Altman said in a statement, \"GPT-5 represents a quantum leap in AI reasoning and understanding.\" \"GPT-5 represents a quantum leap in AI reasoning and understanding.\"",
  "sentiment": "positive",
  "entities": {
    "people": ["Sam Altman", "Sarah Mitchell"],
    "companies": ["OpenAI", "Microsoft", "Anthropic", "Google", "Meta"],
    "locations": ["United States"]
  }
}
```

Google News RSS Headline

```json
{
  "title": "NVIDIA CEO Jensen Huang Predicts AI Will Transform Every Industry",
  "source": "Bloomberg",
  "url": "https://news.google.com/rss/articles/CBMidEF...",
  "publishedDate": "Wed, 15 Jan 2025 14:22:00 GMT",
  "author": null,
  "fullText": "NVIDIA CEO Jensen Huang Predicts AI Will Transform Every Industry",
  "category": "NVIDIA AI chips",
  "summary": "NVIDIA CEO Jensen Huang Predicts AI Will Transform Every Industry",
  "sentiment": "positive",
  "entities": {
    "people": ["Jensen Huang"],
    "companies": ["NVIDIA"],
    "locations": []
  },
  "rssOnly": true
}
```

Hacker News Article with Score

```json
{
  "title": "The Economics of Training Large Language Models",
  "source": "Hacker News",
  "url": "https://example.com/economics-of-llm-training",
  "publishedDate": "2025-01-15T11:45:23.000Z",
  "author": "techwriter_pro",
  "fullText": "Training large language models has become one of the most expensive endeavors in modern computing. A single training run for a GPT-4-scale model can cost upwards of $100 million in compute resources alone. This article breaks down the economics: GPU costs, electricity consumption, data acquisition, and engineering talent. The analysis reveals that only a handful of companies worldwide have the capital and infrastructure to train frontier models. As a result, the AI industry is seeing rapid consolidation, with smaller startups focusing on fine-tuning and deployment rather than training from scratch. However, innovations in model compression, quantization, and efficient training techniques are beginning to lower barriers to entry. Open-source models like LLaMA and Mistral demonstrate that high-quality models can be released without the massive budgets of proprietary labs. The article concludes by examining the implications for competition, regulation, and equitable access to AI technology. Industry observers warn that without intervention, AI capabilities may become concentrated in the hands of a few large corporations.",
  "category": "generative AI",
  "summary": "Training large language models has become one of the most expensive endeavors in modern computing. A single training run for a GPT-4-scale model can cost upwards of $100 million in compute resources alone. This article breaks down the economics: GPU costs, electricity consumption, data acquisition, and engineering talent.",
  "sentiment": "negative",
  "entities": {
    "people": [],
    "companies": ["LLaMA", "Mistral"],
    "locations": []
  },
  "hnPoints": 487
}
```

Reuters Article with Negative Sentiment

```json
{
  "title": "OpenAI Faces Lawsuit Over Copyright Infringement in Training Data",
  "source": "Reuters",
  "url": "https://www.reuters.com/technology/openai-lawsuit-copyright-2025-01-15/",
  "publishedDate": "2025-01-15T16:30:00Z",
  "author": "David Thompson",
  "fullText": "OpenAI is facing a class-action lawsuit filed by a coalition of authors, artists, and publishers alleging the company used copyrighted material without permission to train its GPT models. The lawsuit claims OpenAI scraped millions of copyrighted works from the internet and incorporated them into training datasets without obtaining licenses or providing compensation to rights holders. Legal experts say the case could set a precedent for how copyright law applies to AI training data. OpenAI has declined to comment on the lawsuit but previously stated that its use of publicly available data falls under fair use doctrine. The case joins a growing list of legal challenges facing AI companies over data sourcing practices. Similar lawsuits have been filed against Stability AI, Midjourney, and Microsoft. If the plaintiffs prevail, OpenAI could face substantial damages and be forced to retrain models using only licensed data, a process that could cost hundreds of millions of dollars. The lawsuit highlights the tension between AI innovation and intellectual property rights, a conflict that lawmakers and courts are only beginning to address.",
  "category": "OpenAI",
  "summary": "OpenAI is facing a class-action lawsuit filed by a coalition of authors, artists, and publishers alleging the company used copyrighted material without permission to train its GPT models. The lawsuit claims OpenAI scraped millions of copyrighted works from the internet and incorporated them into training datasets without obtaining licenses or providing compensation to rights holders. Legal experts say the case could set a precedent for how copyright law applies to AI training data.",
  "sentiment": "negative",
  "entities": {
    "people": ["David Thompson"],
    "companies": ["OpenAI", "Stability AI", "Midjourney", "Microsoft"],
    "locations": []
  }
}
```

Legal Considerations

Web scraping publicly accessible news articles is generally legal under U.S. law, as suggested by cases like hiQ Labs v. LinkedIn, in which the Ninth Circuit held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act. However, you must comply with the terms of service of the websites you scrape, respect robots.txt directives, and avoid excessive request rates that could disrupt website operations. Some publishers prohibit scraping in their terms of service, and violating those terms could expose you to civil liability.

This scraper is designed for research, analysis, and personal use. If you plan to republish scraped content, redistribute it commercially, or use it in a product, you must ensure you have the necessary rights or licenses. News articles are typically protected by copyright, and while scraping for personal analysis may fall under fair use in some jurisdictions, commercial redistribution does not. Always consult with a legal professional if you have questions about the legality of your specific use case. The developers of this scraper assume no liability for how users employ the tool or the data it collects.

Pricing and Performance

AI News Scraper uses Apify's pay-per-usage pricing model. Typical costs depend on the number of articles extracted and the sources crawled. For reference:

  • Scraping 30 articles from Google News (RSS only): approximately $0.01-0.02 (very fast, minimal compute)
  • Scraping 100 full articles from TechCrunch and Reuters: approximately $0.10-0.25 (requires full page loads and content extraction)
  • Scraping 500 articles from all sources: approximately $0.50-1.00 (high-volume crawl with concurrency controls)

Performance is optimized through concurrency controls (5 concurrent requests), intelligent request queueing, and early stopping when the article limit is reached. Google News RSS feeds are the fastest (XML parsing only), while full article scraping from TechCrunch and Reuters requires HTML parsing and DOM traversal. Hacker News articles are discovered via the Algolia API, which is fast, but scraping the linked article content requires additional HTTP requests. Average run time for 100 articles is 3-8 minutes, depending on source distribution and network latency.

To minimize costs, use the maxArticles parameter to limit the number of results, focus on specific sources (e.g., Google News RSS only for headlines), and avoid overly broad topics that generate thousands of irrelevant results. The scraper automatically filters thin content and irrelevant articles to maximize data quality per dollar spent.
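
As an example, a headline-only configuration that keeps compute minimal might look like the following sketch (a suggestion, not a prescribed setup):

```typescript
// Cost-minimal input: Google News RSS headlines only, capped at 30 results.
const cheapInput = {
  topics: ['OpenAI'],        // specific topics reduce irrelevant matches
  sources: ['google-news'],  // RSS-only source, no full page loads
  maxArticles: 30,
  language: 'en',
};
```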

Frequently Asked Questions

What is the difference between RSS-only articles and full article scraping?

Google News provides RSS feeds with article headlines, publication dates, and source names, but does not include full article text. When you see "rssOnly": true in the output, the scraper extracted metadata from the RSS feed but did not scrape the full article content. This is much faster and cheaper but provides limited text for analysis. For TechCrunch, Reuters, and Hacker News articles, the scraper follows links to the original articles and extracts full text, author names, and detailed metadata, enabling comprehensive summarization and sentiment analysis.
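
If your pipeline needs full text only, one option is to drop RSS-only records after download, for example:

```typescript
// Drop RSS-only headline records so downstream analysis sees full text only.
// The Article shape mirrors the output fields documented above.
interface Article {
  title: string;
  fullText: string;
  rssOnly?: boolean;
}

const onlyFullArticles = (articles: Article[]): Article[] =>
  articles.filter((a) => !a.rssOnly);
```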

How accurate is the sentiment analysis?

The sentiment classifier uses keyword matching against predefined lists of positive words (growth, surge, profit, innovation, etc.) and negative words (decline, lawsuit, layoff, collapse, etc.). It is approximately 70-80% accurate for financial and tech news, where sentiment is often signaled by explicit keywords. However, it may misclassify articles with nuanced or mixed sentiment, sarcasm, or context-dependent language. For production use cases requiring high-precision sentiment analysis, consider feeding the extracted text into a dedicated NLP model like BERT or GPT-based classifiers.
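
To make the approach concrete, here is a simplified sketch of keyword-based sentiment classification. The word lists are illustrative stand-ins, not the actor's actual lists.

```typescript
// Simplified keyword-based sentiment, mirroring the approach described
// above. The word lists here are illustrative, not the actor's own.
const POSITIVE = ['growth', 'surge', 'profit', 'innovation', 'breakthrough'];
const NEGATIVE = ['decline', 'lawsuit', 'layoff', 'collapse', 'fraud'];

function classifySentiment(text: string): 'positive' | 'negative' | 'neutral' {
  const lower = text.toLowerCase();
  // Count total occurrences of each word list in the text.
  const count = (words: string[]) =>
    words.reduce((n, w) => n + (lower.split(w).length - 1), 0);

  const pos = count(POSITIVE);
  const neg = count(NEGATIVE);
  if (pos > neg) return 'positive';
  if (neg > pos) return 'negative';
  return 'neutral';
}
```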

Can I scrape news in languages other than English?

The language parameter affects Google News RSS feeds only, supporting English, German, French, and Spanish. TechCrunch, Reuters, and Hacker News are English-language sources and will return English content regardless of the language setting. Entity extraction and sentiment analysis are optimized for English text and may perform poorly on non-English content. If you need multi-language support, consider using the Google News source exclusively with the desired language parameter, or post-process results with language-specific NLP tools.
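
For reference, Google News search RSS feeds follow a URL pattern along these lines; the exact parameters this actor sends are not documented, so treat the hl/gl/ceid values as an assumption based on the public feed format:

```typescript
// Likely shape of a Google News search RSS URL for a localized query.
// The hl/gl/ceid parameters are an assumption, not a documented guarantee.
function googleNewsRssUrl(topic: string, lang: 'en' | 'de' | 'fr' | 'es'): string {
  const region = { en: 'US', de: 'DE', fr: 'FR', es: 'ES' }[lang];
  const q = encodeURIComponent(topic);
  return `https://news.google.com/rss/search?q=${q}&hl=${lang}&gl=${region}&ceid=${region}:${lang}`;
}
```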

How does the scraper prevent duplicate articles?

The scraper uses several deduplication mechanisms: URLs are normalized and tracked by the request queue to prevent re-crawling the same page, thin content (under 200 characters) is automatically filtered out, and articles that do not match the search topic in the title or body text are discarded. Additionally, the scraper enforces a synchronous article reservation system to prevent concurrent requests from exceeding the maxArticles limit. However, if the same article appears in multiple sources (e.g., a Reuters article also indexed in Google News), it may be saved twice with different source attributions. You can deduplicate results in post-processing by comparing URLs or titles, as in the sketch below.
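
A minimal post-processing dedupe by normalized URL might look like this; the normalization rules are illustrative assumptions:

```typescript
// Post-processing dedupe: keep the first record per normalized URL.
interface Article {
  url: string;
  title: string;
}

function dedupeByUrl(articles: Article[]): Article[] {
  const seen = new Set<string>();
  return articles.filter((a) => {
    // Normalize: lowercase host, drop query string and trailing slash.
    const u = new URL(a.url);
    const key = `${u.hostname.toLowerCase()}${u.pathname.replace(/\/$/, '')}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Note that Google News RSS URLs differ from the original publisher URLs, so cross-source duplicates may still require comparing titles.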

What happens if a source is down or blocks the scraper?

The scraper includes a failedRequestHandler that logs failed requests but does not retry them (to avoid wasting compute on blocked sources). If a source is temporarily unavailable or returns HTTP errors, the scraper will skip those requests and continue with other sources. You will see warnings in the log indicating which URLs failed. If you consistently encounter blocks from TechCrunch or Reuters, consider reducing the concurrency level, adding delays between requests, or rotating user agents. Google News RSS feeds are highly reliable and rarely block scrapers, making them a good fallback if full-article sources are inaccessible.
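
The actor's exact handler is not published, but in Crawlee (Apify's open-source crawling library) a log-and-skip failure policy along the lines described above might look like this sketch:

```typescript
import { CheerioCrawler } from 'crawlee';

// Sketch of a non-retrying, log-and-skip failure policy similar to the
// one described above; the actor's actual handler may differ.
const crawler = new CheerioCrawler({
  maxConcurrency: 5,    // matches the concurrency level mentioned earlier
  maxRequestRetries: 0, // do not retry blocked or failing sources
  failedRequestHandler: async ({ request, log }) => {
    log.warning(`Request failed, skipping: ${request.url}`);
  },
  requestHandler: async ({ request, $, log }) => {
    log.info(`Scraped ${request.url}: ${$('title').text()}`);
  },
});

// await crawler.run(['https://example.com']);
```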

Related Scrapers

Expand your data collection capabilities with these complementary scrapers:

  • Reddit Thread Scraper: Extract posts, comments, and metadata from Reddit threads for sentiment analysis, community insights, and trend monitoring.

  • Google Maps Contact Info Scraper: Scrape business names, addresses, phone numbers, emails, and websites from Google Maps search results for lead generation and market research.

  • Amazon Product Scraper: Extract product titles, prices, ratings, reviews, and availability from Amazon search results and product pages for competitive pricing and market analysis.

  • Twitter Profile Scraper: Collect tweets, follower counts, profile bios, and engagement metrics from Twitter profiles for social media monitoring and influencer research.

  • LinkedIn Company Scraper: Extract company profiles, employee counts, industry classifications, and job postings from LinkedIn for sales prospecting and competitive intelligence.

All scrapers are built with robust error handling and anti-detection measures, and are optimized for cost-efficiency on the Apify platform.