News Scraper — Monitor News Articles & Headlines
Pricing
$12.00 / 1,000 results
📰 Scrape news articles from top publishers worldwide: extract headlines, full text, authors, publish dates, and featured images. Monitor breaking news, track industry coverage, and feed content aggregation platforms. Supports keyword search, domain filtering, and multi-language extraction.
News & Monitoring Scraper — Apify Actor
Scrapes news articles from specified sources using Playwright (headless browser) and Crawlee, with content extraction via Mozilla Readability. Supports keyword filtering, deduplication, date range filtering, and optional full-text extraction.
Features
- Playwright-powered — handles JavaScript-rendered pages
- Smart extraction — article title, author, date, category, content summary, and optional full text
- Keyword filtering — only keep articles matching one or more keywords
- Date range filtering — restrict by dateFrom/dateTo
- Deduplication — URL-based dedup within a single run
- Keyword auto-extraction — extracts relevant keywords from article content
- Proxy support — integrates with Apify proxy for reliable scraping
- Multiple output formats — JSON, CSV, or JSON array
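The keyword, date-range, and deduplication features listed above can be pictured with a minimal sketch like the one below. It is not the actor's actual code: the helper names (`matchesKeywords`, `inDateRange`, `filterArticles`) are hypothetical, and keeping articles that have no detected date is an assumption.

```javascript
// Hypothetical helpers; the actor's real implementation is not shown in this README.

// Case-insensitive keyword match; an empty keyword list keeps everything.
function matchesKeywords(text, keywords) {
  if (keywords.length === 0) return true;
  const haystack = text.toLowerCase();
  return keywords.some((kw) => haystack.includes(kw.toLowerCase()));
}

// Inclusive range check on ISO date strings; empty bounds are ignored.
// Assumption: articles without a detected date are kept.
function inDateRange(publishedDate, dateFrom, dateTo) {
  if (!publishedDate) return true;
  const day = publishedDate.slice(0, 10); // compare on the YYYY-MM-DD prefix
  if (dateFrom && day < dateFrom) return false;
  if (dateTo && day > dateTo) return false;
  return true;
}

// URL-based dedup within a single run, followed by the two filters.
function filterArticles(articles, { keywords = [], dateFrom = '', dateTo = '' } = {}) {
  const seen = new Set();
  return articles.filter((a) => {
    if (seen.has(a.url)) return false;
    seen.add(a.url);
    const text = `${a.title} ${a.contentSummary ?? ''}`;
    return matchesKeywords(text, keywords) && inDateRange(a.publishedDate, dateFrom, dateTo);
  });
}

const sample = [
  { title: 'Chip exports rise', url: 'https://example.com/a', publishedDate: '2025-03-01T08:00:00Z' },
  { title: 'Chip exports rise', url: 'https://example.com/a', publishedDate: '2025-03-01T08:00:00Z' },
  { title: 'Local sports roundup', url: 'https://example.com/b', publishedDate: '2025-03-02T08:00:00Z' },
  { title: 'CHIP shortage eases', url: 'https://example.com/c', publishedDate: '2024-12-30T08:00:00Z' },
];
const kept = filterArticles(sample, { keywords: ['chip'], dateFrom: '2025-01-01', dateTo: '2025-12-31' });
console.log(kept.map((a) => a.url)); // [ 'https://example.com/a' ]
```

Comparing the `YYYY-MM-DD` prefix as a string works because ISO dates sort lexicographically; a real implementation may need to handle time zones more carefully.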
Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrls | array | ✅ | — | One or more news article / index URLs to scrape |
| maxArticles | integer | ❌ | 100 | Max articles to scrape (0 = unlimited) |
| proxyConfiguration | object | ❌ | Apify proxy on | Proxy settings |
| keywords | array | ❌ | [] | Case-insensitive keyword whitelist (empty = keep all) |
| sources | array | ❌ | [] | Descriptive source names (defaults to domain) |
| dateFrom | string | ❌ | "" | ISO date filter (e.g. "2025-01-01") |
| dateTo | string | ❌ | "" | ISO date filter (e.g. "2025-12-31") |
| includeFullText | boolean | ❌ | false | Extract full article text via Readability |
| outputFormat | string | ❌ | "json" | json, csv, or jsonArray |
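As an illustration of the fields above, the following sketch merges a partial input with the documented defaults. `INPUT_DEFAULTS` and `withDefaults` are hypothetical names, not part of the actor. The shape of `startUrls` entries (plain strings here) is also an assumption, since Apify inputs often use `{ url }` objects instead. `proxyConfiguration` is left out because its exact default object is not documented here.

```javascript
// Defaults taken from the input table; the helper itself is a hypothetical sketch.
const INPUT_DEFAULTS = {
  maxArticles: 100,
  keywords: [],
  sources: [],
  dateFrom: '',
  dateTo: '',
  includeFullText: false,
  outputFormat: 'json',
};

function withDefaults(input) {
  // startUrls is the only required field.
  if (!Array.isArray(input.startUrls) || input.startUrls.length === 0) {
    throw new Error('startUrls is required and must be a non-empty array');
  }
  const merged = { ...INPUT_DEFAULTS, ...input };
  if (!['json', 'csv', 'jsonArray'].includes(merged.outputFormat)) {
    throw new Error(`Unsupported outputFormat: ${merged.outputFormat}`);
  }
  return merged;
}

const input = withDefaults({
  startUrls: ['https://example.com/news'], // assumed element shape
  keywords: ['semiconductor', 'chip'],
  dateFrom: '2025-01-01',
  includeFullText: true,
});
console.log(input.maxArticles, input.outputFormat); // 100 json
```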
Output
Each dataset item contains:
| Field | Type | Description |
|---|---|---|
| title | string | Article title |
| url | string | Source URL |
| publishedDate | string (ISO) or null | Detected publish date |
| author | string or null | Detected author |
| contentSummary | string | First 500 chars or meta description |
| fullText | string or null | Full article text (if includeFullText enabled) |
| source | string | Source name or domain |
| category | string or null | Detected category/section |
| keywords | array | Auto-extracted keywords |
| scrapedAt | string (ISO) | Timestamp of scrape |
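To make the item shape concrete, here is a small sketch that assembles one record with the fields from the table. `buildSummary` and `buildItem` are hypothetical helpers. Preferring the meta description over the first 500 characters of body text is an assumption, since the table does not state which takes priority.

```javascript
// Hypothetical helper: meta description if present, else the first 500 chars of body text.
// The priority order is an assumption.
function buildSummary(metaDescription, bodyText) {
  const meta = (metaDescription ?? '').trim();
  if (meta) return meta;
  return (bodyText ?? '').trim().slice(0, 500);
}

// Hypothetical helper that produces one dataset item with the documented fields.
function buildItem({
  title, url, publishedDate = null, author = null, metaDescription, bodyText,
  category = null, keywords = [], source, includeFullText = false,
}) {
  return {
    title,
    url,
    publishedDate,
    author,
    contentSummary: buildSummary(metaDescription, bodyText),
    fullText: includeFullText ? (bodyText ?? null) : null,
    source: source ?? new URL(url).hostname, // fall back to the domain
    category,
    keywords,
    scrapedAt: new Date().toISOString(),
  };
}

const item = buildItem({
  title: 'Chip exports rise',
  url: 'https://example.com/a',
  bodyText: 'Exports of semiconductors rose in the first quarter.',
});
console.log(item.source, item.fullText); // example.com null
```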
Usage
Local development
    npm install
    npm start
Environment variables
- APIFY_TOKEN — Apify API token (for proxy & dataset access)
Technical stack
- Node.js (ESM)
- Crawlee — crawling & request management
- Playwright — headless browser
- @mozilla/readability — article extraction
- jsdom — DOM parsing for metadata
- Apify SDK — proxy, dataset, platform integration
License
Apache-2.0