AI News Scraper: Multi-Source Article Aggregator with Sentiment Analysis

Extract and summarize news articles from TechCrunch, Reuters, Google News, and Hacker News with automatic sentiment analysis, entity extraction, and AI-powered summarization.

What is AI News Scraper?

AI News Scraper is a comprehensive news aggregation and analysis tool that searches multiple authoritative sources for topics you specify, extracts full article content, and generates structured data with automatic sentiment classification and entity recognition. The scraper crawls Google News RSS feeds, TechCrunch, Reuters, and Hacker News simultaneously, filtering results by relevance and quality to deliver actionable insights.

This scraper is designed for businesses conducting competitive intelligence, market research firms monitoring industry trends, content creators curating newsletters, investment analysts tracking sentiment around stocks or sectors, and researchers gathering structured data for academic or commercial analysis. The actor handles multi-language queries (English, German, French, Spanish) and supports high-volume scraping with built-in concurrency controls and duplicate prevention.

Whether you need to track brand mentions across tech media, analyze market sentiment on emerging technologies, aggregate competitor news, or build a custom news feed for your dashboard, AI News Scraper provides clean, structured JSON output with full article text, extractive summaries, sentiment scores, and extracted entities (people, companies, locations). The scraper automatically filters thin content, validates topic relevance, and respects rate limits to ensure reliable, high-quality results.

Data Fields

The scraper extracts the following fields for each article:

| Field | Type | Description |
|---|---|---|
| title | String | Article headline or title |
| source | String | Publication name (e.g., "TechCrunch", "Reuters", "Google News", "Hacker News") |
| url | String | Direct URL to the original article |
| publishedDate | String | Publication date in ISO 8601 format or RSS pubDate format (may be null for some sources) |
| author | String | Article author name (may be null if not available) |
| fullText | String | Complete article body text with whitespace normalized |
| category | String | Search topic/keyword that matched this article |
| summary | String | Extractive summary combining lead sentences (first 3) and up to 2 key quotes from the article (see the sketch below) |
| sentiment | String | Sentiment classification: "positive", "negative", or "neutral", based on keyword analysis |
| entities | Object | Extracted entities with three sub-fields: people (array of person names), companies (array of company names), locations (array of geographic locations) |
| rssOnly | Boolean | True if the article was extracted from an RSS feed only (Google News headlines) without full-text scraping |
| hnPoints | Integer | Hacker News score/points (only included for Hacker News articles) |
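
To illustrate how the summary field described above could be assembled (lead sentences plus key quotes), here is a minimal sketch. It is an assumption about the approach, not the actor's actual code; the sentence splitter and quote pattern are deliberately simplified.

```typescript
// Hypothetical sketch of the extractive summary described above:
// first 3 sentences plus up to 2 quoted passages. Not the actor's
// actual implementation; sentence splitting is naively regex-based.
function buildSummary(fullText: string): string {
  // Split on sentence-ending punctuation followed by whitespace.
  const sentences = fullText.split(/(?<=[.!?])\s+/);
  const lead = sentences.slice(0, 3);

  // Collect up to 2 double-quoted passages as "key quotes".
  const quotes = (fullText.match(/"[^"]{20,300}"/g) ?? []).slice(0, 2);

  return [...lead, ...quotes].join(' ');
}
```

Note that a quote already contained in the lead sentences can appear twice, which matches the behavior visible in the example output further below.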

How to Scrape AI News and Tech Articles

Follow this step-by-step guide to extract news articles with sentiment analysis:

Step 1: Define Your Topics

Specify one or more topics, keywords, or search queries you want to track. Topics can be broad ("artificial intelligence") or specific ("GPT-5", "NVIDIA earnings"). Each topic will be searched across all selected sources.

Step 2: Select News Sources

Choose which sources to scrape: Google News (RSS feeds with headlines), TechCrunch (full articles from search results), Hacker News (stories from the Algolia API with HN scores), or Reuters (full articles from search results). You can select all sources or focus on specific ones.

Step 3: Set Maximum Article Limit

Specify how many articles to extract (default: 30). The scraper distributes this limit across all topics and sources, stopping when the limit is reached. For large-scale monitoring, increase to 100-500 articles.

Step 4: Choose Language (Google News Only)

Select the language for Google News queries: English (en), German (de), French (fr), or Spanish (es). This affects Google News RSS feeds only; other sources are crawled in their default language.

Step 5: Run the Scraper

Click "Start" to begin crawling. The scraper will search all selected sources for your topics, extract article links, and scrape full content from detail pages. Progress is logged in real-time.

Step 6: Monitor Progress

Check the log tab to see which sources are being crawled, how many articles have been found, and which articles are being saved. The scraper automatically skips thin content (under 200 characters) and irrelevant articles.

Step 7: Download Results

Once the run completes, download results in JSON, CSV, Excel, or HTML format. Each article includes full text, summary, sentiment, entities, and metadata. Use the data for analysis, dashboards, alerts, or research.

Input Parameters

Configure the scraper with these parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| topics | Array of strings | Yes | - | List of topics or keywords to search for. Example: ["OpenAI", "machine learning", "ChatGPT"]. Each topic is searched across all selected sources. |
| sources | Array of strings | No | All sources | Which sources to crawl. Options: "google-news", "techcrunch", "hackernews", "reuters". If omitted, all sources are used. Example: ["google-news", "techcrunch"] |
| maxArticles | Integer | No | 30 | Maximum number of articles to return. Minimum: 1. The scraper stops when this limit is reached across all topics and sources. |
| language | String | No | "en" | Language for Google News queries. Options: "en" (English), "de" (German), "fr" (French), "es" (Spanish). Affects Google News RSS feeds only. |

Example Input

```json
{
  "topics": ["OpenAI", "generative AI", "GPT-5", "NVIDIA AI chips"],
  "sources": ["google-news", "techcrunch", "reuters", "hackernews"],
  "maxArticles": 100,
  "language": "en"
}
```

This configuration will search Google News, TechCrunch, Reuters, and Hacker News for articles about OpenAI, generative AI, GPT-5, and NVIDIA AI chips, extracting up to 100 articles total in English.
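
If you prefer to run the actor programmatically, a minimal sketch using the apify-client package is shown below. The actor ID placeholder is hypothetical; substitute the real ID from the Apify Console.

```typescript
import { ApifyClient } from 'apify-client';

// The token comes from your Apify account; the actor ID is a placeholder.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the run with the example input above and wait for it to finish.
const run = await client.actor('<ACTOR_ID>').call({
  topics: ['OpenAI', 'generative AI', 'GPT-5', 'NVIDIA AI chips'],
  sources: ['google-news', 'techcrunch', 'reuters', 'hackernews'],
  maxArticles: 100,
  language: 'en',
});

// Fetch the scraped articles from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Fetched ${items.length} articles`);
```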

Example Output

Full Article from TechCrunch

```json
{
  "title": "OpenAI Launches GPT-5 with Breakthrough Multimodal Capabilities",
  "source": "TechCrunch",
  "url": "https://techcrunch.com/2025/01/15/openai-launches-gpt-5-breakthrough-multimodal/",
  "publishedDate": "2025-01-15T09:00:00Z",
  "author": "Sarah Mitchell",
  "fullText": "OpenAI today announced the release of GPT-5, its most advanced language model to date. The new model features breakthrough multimodal capabilities, allowing it to process text, images, audio, and video simultaneously. CEO Sam Altman said in a statement, \"GPT-5 represents a quantum leap in AI reasoning and understanding.\" Early tests show the model achieves 95% accuracy on complex reasoning tasks, a significant improvement over GPT-4. The model was trained on a dataset 10 times larger than its predecessor, incorporating diverse data sources from scientific papers, code repositories, and creative works. Industry analysts predict GPT-5 will drive substantial growth in enterprise AI adoption. Microsoft, OpenAI's primary investor, announced it will integrate GPT-5 into Azure cloud services within the next quarter. The release comes amid growing competition from Anthropic, Google, and Meta in the large language model space.",
  "category": "OpenAI",
  "summary": "OpenAI today announced the release of GPT-5, its most advanced language model to date. The new model features breakthrough multimodal capabilities, allowing it to process text, images, audio, and video simultaneously. CEO Sam Altman said in a statement, \"GPT-5 represents a quantum leap in AI reasoning and understanding.\" \"GPT-5 represents a quantum leap in AI reasoning and understanding.\"",
  "sentiment": "positive",
  "entities": {
    "people": ["Sam Altman", "Sarah Mitchell"],
    "companies": ["OpenAI", "Microsoft", "Anthropic", "Google", "Meta"],
    "locations": ["United States"]
  }
}
```

Google News RSS Headline

```json
{
  "title": "NVIDIA CEO Jensen Huang Predicts AI Will Transform Every Industry",
  "source": "Bloomberg",
  "url": "https://news.google.com/rss/articles/CBMidEF...",
  "publishedDate": "Wed, 15 Jan 2025 14:22:00 GMT",
  "author": null,
  "fullText": "NVIDIA CEO Jensen Huang Predicts AI Will Transform Every Industry",
  "category": "NVIDIA AI chips",
  "summary": "NVIDIA CEO Jensen Huang Predicts AI Will Transform Every Industry",
  "sentiment": "positive",
  "entities": {
    "people": ["Jensen Huang"],
    "companies": ["NVIDIA"],
    "locations": []
  },
  "rssOnly": true
}
```

Hacker News Article with Score

```json
{
  "title": "The Economics of Training Large Language Models",
  "source": "Hacker News",
  "url": "https://example.com/economics-of-llm-training",
  "publishedDate": "2025-01-15T11:45:23.000Z",
  "author": "techwriter_pro",
  "fullText": "Training large language models has become one of the most expensive endeavors in modern computing. A single training run for a GPT-4-scale model can cost upwards of $100 million in compute resources alone. This article breaks down the economics: GPU costs, electricity consumption, data acquisition, and engineering talent. The analysis reveals that only a handful of companies worldwide have the capital and infrastructure to train frontier models. As a result, the AI industry is seeing rapid consolidation, with smaller startups focusing on fine-tuning and deployment rather than training from scratch. However, innovations in model compression, quantization, and efficient training techniques are beginning to lower barriers to entry. Open-source models like LLaMA and Mistral demonstrate that high-quality models can be released without the massive budgets of proprietary labs. The article concludes by examining the implications for competition, regulation, and equitable access to AI technology. Industry observers warn that without intervention, AI capabilities may become concentrated in the hands of a few large corporations.",
  "category": "generative AI",
  "summary": "Training large language models has become one of the most expensive endeavors in modern computing. A single training run for a GPT-4-scale model can cost upwards of $100 million in compute resources alone. This article breaks down the economics: GPU costs, electricity consumption, data acquisition, and engineering talent.",
  "sentiment": "negative",
  "entities": {
    "people": [],
    "companies": ["LLaMA", "Mistral"],
    "locations": []
  },
  "hnPoints": 487
}
```

Reuters Article with Negative Sentiment

```json
{
  "title": "OpenAI Faces Lawsuit Over Copyright Infringement in Training Data",
  "source": "Reuters",
  "url": "https://www.reuters.com/technology/openai-lawsuit-copyright-2025-01-15/",
  "publishedDate": "2025-01-15T16:30:00Z",
  "author": "David Thompson",
  "fullText": "OpenAI is facing a class-action lawsuit filed by a coalition of authors, artists, and publishers alleging the company used copyrighted material without permission to train its GPT models. The lawsuit claims OpenAI scraped millions of copyrighted works from the internet and incorporated them into training datasets without obtaining licenses or providing compensation to rights holders. Legal experts say the case could set a precedent for how copyright law applies to AI training data. OpenAI has declined to comment on the lawsuit but previously stated that its use of publicly available data falls under fair use doctrine. The case joins a growing list of legal challenges facing AI companies over data sourcing practices. Similar lawsuits have been filed against Stability AI, Midjourney, and Microsoft. If the plaintiffs prevail, OpenAI could face substantial damages and be forced to retrain models using only licensed data, a process that could cost hundreds of millions of dollars. The lawsuit highlights the tension between AI innovation and intellectual property rights, a conflict that lawmakers and courts are only beginning to address.",
  "category": "OpenAI",
  "summary": "OpenAI is facing a class-action lawsuit filed by a coalition of authors, artists, and publishers alleging the company used copyrighted material without permission to train its GPT models. The lawsuit claims OpenAI scraped millions of copyrighted works from the internet and incorporated them into training datasets without obtaining licenses or providing compensation to rights holders. Legal experts say the case could set a precedent for how copyright law applies to AI training data.",
  "sentiment": "negative",
  "entities": {
    "people": ["David Thompson"],
    "companies": ["OpenAI", "Stability AI", "Midjourney", "Microsoft"],
    "locations": []
  }
}
```

Legal Considerations

Web scraping publicly accessible news articles is generally legal under U.S. law, as suggested by cases like hiQ Labs v. LinkedIn, in which the Ninth Circuit held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act. However, you must comply with the terms of service of the websites you scrape, respect robots.txt directives, and avoid excessive request rates that could disrupt website operations. Some publishers prohibit scraping in their terms of service, and violating those terms could expose you to civil liability.

This scraper is designed for research, analysis, and personal use. If you plan to republish scraped content, redistribute it commercially, or use it in a product, you must ensure you have the necessary rights or licenses. News articles are typically protected by copyright, and while scraping for personal analysis may fall under fair use in some jurisdictions, commercial redistribution does not. Always consult with a legal professional if you have questions about the legality of your specific use case. The developers of this scraper assume no liability for how users employ the tool or the data it collects.

Pricing and Performance

AI News Scraper uses Apify's pay-per-usage pricing model. Typical costs depend on the number of articles extracted and the sources crawled. For reference:

  • Scraping 30 articles from Google News (RSS only): approximately $0.01-0.02 (very fast, minimal compute)
  • Scraping 100 full articles from TechCrunch and Reuters: approximately $0.10-0.25 (requires full page loads and content extraction)
  • Scraping 500 articles from all sources: approximately $0.50-1.00 (high-volume crawl with concurrency controls)

Performance is optimized through concurrency controls (5 concurrent requests), intelligent request queueing, and early stopping when the article limit is reached. Google News RSS feeds are the fastest (XML parsing only), while full article scraping from TechCrunch and Reuters requires HTML parsing and DOM traversal. Hacker News articles are discovered via the Algolia API, which is fast, but scraping the linked article content requires additional HTTP requests. Average run time for 100 articles is 3-8 minutes, depending on source distribution and network latency.

To minimize costs, use the maxArticles parameter to limit the number of results, focus on specific sources (e.g., Google News RSS only for headlines), and avoid overly broad topics that generate thousands of irrelevant results. The scraper automatically filters thin content and irrelevant articles to maximize data quality per dollar spent.
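
As an example, a headline-only configuration that keeps compute minimal might look like the following sketch (a suggestion, not a prescribed setup):

```typescript
// Cost-minimal input: Google News RSS headlines only, capped at 30 results.
const cheapInput = {
  topics: ['OpenAI'],        // specific topics reduce irrelevant matches
  sources: ['google-news'],  // RSS-only source, no full page loads
  maxArticles: 30,
  language: 'en',
};
```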

Frequently Asked Questions

What is the difference between RSS-only articles and full article scraping?

Google News provides RSS feeds with article headlines, publication dates, and source names, but does not include full article text. When you see "rssOnly": true in the output, the scraper extracted metadata from the RSS feed but did not scrape the full article content. This is much faster and cheaper but provides limited text for analysis. For TechCrunch, Reuters, and Hacker News articles, the scraper follows links to the original articles and extracts full text, author names, and detailed metadata, enabling comprehensive summarization and sentiment analysis.
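
If your pipeline needs full text only, one option is to drop RSS-only records after download, for example:

```typescript
// Drop RSS-only headline records so downstream analysis sees full text only.
// The Article shape mirrors the output fields documented above.
interface Article {
  title: string;
  fullText: string;
  rssOnly?: boolean;
}

const onlyFullArticles = (articles: Article[]): Article[] =>
  articles.filter((a) => !a.rssOnly);
```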

How accurate is the sentiment analysis?

The sentiment classifier uses keyword matching against predefined lists of positive words (growth, surge, profit, innovation, etc.) and negative words (decline, lawsuit, layoff, collapse, etc.). It is approximately 70-80% accurate for financial and tech news, where sentiment is often signaled by explicit keywords. However, it may misclassify articles with nuanced or mixed sentiment, sarcasm, or context-dependent language. For production use cases requiring high-precision sentiment analysis, consider feeding the extracted text into a dedicated NLP model like BERT or GPT-based classifiers.
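
To make the approach concrete, here is a simplified sketch of keyword-based sentiment classification. The word lists are illustrative stand-ins, not the actor's actual lists.

```typescript
// Simplified keyword-based sentiment, mirroring the approach described
// above. The word lists here are illustrative, not the actor's own.
const POSITIVE = ['growth', 'surge', 'profit', 'innovation', 'breakthrough'];
const NEGATIVE = ['decline', 'lawsuit', 'layoff', 'collapse', 'fraud'];

function classifySentiment(text: string): 'positive' | 'negative' | 'neutral' {
  const lower = text.toLowerCase();
  // Count total occurrences of each word list in the text.
  const count = (words: string[]) =>
    words.reduce((n, w) => n + (lower.split(w).length - 1), 0);

  const pos = count(POSITIVE);
  const neg = count(NEGATIVE);
  if (pos > neg) return 'positive';
  if (neg > pos) return 'negative';
  return 'neutral';
}
```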

Can I scrape news in languages other than English?

The language parameter affects Google News RSS feeds only, supporting English, German, French, and Spanish. TechCrunch, Reuters, and Hacker News are English-language sources and will return English content regardless of the language setting. Entity extraction and sentiment analysis are optimized for English text and may perform poorly on non-English content. If you need multi-language support, consider using the Google News source exclusively with the desired language parameter, or post-process results with language-specific NLP tools.
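
For reference, Google News search RSS feeds follow a URL pattern along these lines; the exact parameters this actor sends are not documented, so treat the hl/gl/ceid values as an assumption based on the public feed format:

```typescript
// Likely shape of a Google News search RSS URL for a localized query.
// The hl/gl/ceid parameters are an assumption, not a documented guarantee.
function googleNewsRssUrl(topic: string, lang: 'en' | 'de' | 'fr' | 'es'): string {
  const region = { en: 'US', de: 'DE', fr: 'FR', es: 'ES' }[lang];
  const q = encodeURIComponent(topic);
  return `https://news.google.com/rss/search?q=${q}&hl=${lang}&gl=${region}&ceid=${region}:${lang}`;
}
```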

How does the scraper prevent duplicate articles?

The scraper uses several deduplication mechanisms: URLs are normalized and tracked by the request queue to prevent re-crawling the same page, thin content (under 200 characters) is automatically filtered out, and articles that do not match the search topic in the title or body text are discarded. Additionally, the scraper enforces a synchronous article reservation system to prevent concurrent requests from exceeding the maxArticles limit. However, if the same article appears in multiple sources (e.g., a Reuters article also indexed in Google News), it may be saved twice with different source attributions. You can deduplicate results in post-processing by comparing URLs or titles, as in the sketch below.
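
A minimal post-processing dedupe by normalized URL might look like this; the normalization rules are illustrative assumptions:

```typescript
// Post-processing dedupe: keep the first record per normalized URL.
interface Article {
  url: string;
  title: string;
}

function dedupeByUrl(articles: Article[]): Article[] {
  const seen = new Set<string>();
  return articles.filter((a) => {
    // Normalize: lowercase host, drop query string and trailing slash.
    const u = new URL(a.url);
    const key = `${u.hostname.toLowerCase()}${u.pathname.replace(/\/$/, '')}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}
```

Note that Google News RSS URLs differ from the original publisher URLs, so cross-source duplicates may still require comparing titles.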

What happens if a source is down or blocks the scraper?

The scraper includes a failedRequestHandler that logs failed requests but does not retry them (to avoid wasting compute on blocked sources). If a source is temporarily unavailable or returns HTTP errors, the scraper will skip those requests and continue with other sources. You will see warnings in the log indicating which URLs failed. If you consistently encounter blocks from TechCrunch or Reuters, consider reducing the concurrency level, adding delays between requests, or rotating user agents. Google News RSS feeds are highly reliable and rarely block scrapers, making them a good fallback if full-article sources are inaccessible.
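
The actor's exact handler is not published, but in Crawlee (Apify's open-source crawling library) a log-and-skip failure policy along the lines described above might look like this sketch:

```typescript
import { CheerioCrawler } from 'crawlee';

// Sketch of a non-retrying, log-and-skip failure policy similar to the
// one described above; the actor's actual handler may differ.
const crawler = new CheerioCrawler({
  maxConcurrency: 5,    // matches the concurrency level mentioned earlier
  maxRequestRetries: 0, // do not retry blocked or failing sources
  failedRequestHandler: async ({ request, log }) => {
    log.warning(`Request failed, skipping: ${request.url}`);
  },
  requestHandler: async ({ request, $, log }) => {
    log.info(`Scraped ${request.url}: ${$('title').text()}`);
  },
});

// await crawler.run(['https://example.com']);
```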

Related Scrapers

Expand your data collection capabilities with these complementary scrapers:

  • Reddit Thread Scraper: Extract posts, comments, and metadata from Reddit threads for sentiment analysis, community insights, and trend monitoring.

  • Google Maps Contact Info Scraper: Scrape business names, addresses, phone numbers, emails, and websites from Google Maps search results for lead generation and market research.

  • Amazon Product Scraper: Extract product titles, prices, ratings, reviews, and availability from Amazon search results and product pages for competitive pricing and market analysis.

  • Twitter Profile Scraper: Collect tweets, follower counts, profile bios, and engagement metrics from Twitter profiles for social media monitoring and influencer research.

  • LinkedIn Company Scraper: Extract company profiles, employee counts, industry classifications, and job postings from LinkedIn for sales prospecting and competitive intelligence.

All scrapers are built with robust error handling and anti-detection measures, and are optimized for cost-efficiency on the Apify platform.