# Ultimate News Scraper - Rise of the Phoenix

Powerful Apify news scraper for real-time and historical article extraction across 800+ global publishers. Built with smart fallback crawling (Scrapling, PyDoll, Selenium), category targeting, proxy support, and clean JSON output with error analytics for reliable, scalable intelligence pipelines.

Pricing: from $4.50 / 1,000 results
Developer: Inus Grobler

Actor stats: 0 bookmarked · 3 total users · 2 monthly active users · last modified 3 days ago
## Global News Scraper for Apify - Current + Historical Article Extraction
Extract structured news articles at scale from a large global publisher catalog using a resilient multi-backend pipeline (scrapling -> pydoll -> selenium).
This Apify Actor is built for teams that need reliable news scraping, historical news backfills, and structured article datasets for analytics, monitoring, AI pipelines, OSINT workflows, and research.
## Why this Apify Actor
- Scrape current headlines or run deep historical backfills from tracked news websites.
- Target all catalog sites, specific sites, or specific site-category URLs.
- Automatically falls back across multiple fetch/extraction backends for better resilience.
- Produces normalized article data in the default dataset.
- Writes scrape failures and diagnostics to a dedicated `error-log` dataset.
- Supports Apify Proxy and custom proxy URLs for difficult domains.
- Uses URL-hash-based item caching to reduce repeated processing.
## Best use cases
- Media monitoring and competitive intelligence
- News aggregation and content intelligence pipelines
- Historical event datasets for LLM/RAG ingestion
- Topic tracking by website category
- Regional and multilingual news collection
## How it works
- Normalize Actor input into a validated runtime config.
- Resolve proxy settings (Apify Proxy or custom proxy URLs).
- Build a stable cache key from target scope (sites + categories + mode).
- Run scraper pipeline with fallback fetchers.
- Push successful items to the default dataset.
- Push error telemetry to the `error-log` dataset when available.
- Store the run summary in the key-value store record `OUTPUT`.
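The cache-key step above can be sketched as follows. This is a minimal illustration only: it assumes the key is a hash over the sorted target scope (sites + categories + mode); the Actor's actual hashing scheme is internal and may differ.

```python
import hashlib
import json

def build_cache_key(sites: list[str], categories: list[str], mode: str) -> str:
    """Build a stable cache key from the target scope.

    Sorting the inputs makes the key independent of the order in which
    sites and categories are supplied, so identical scopes hit the
    same cache entry across runs.
    """
    scope = {
        "sites": sorted(sites),
        "categories": sorted(categories),
        "mode": mode,
    }
    # Canonical JSON (sorted keys, no whitespace) gives a deterministic payload.
    payload = json.dumps(scope, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_a = build_cache_key(["Reuters", "AP News"], [], "current")
key_b = build_cache_key(["AP News", "Reuters"], [], "current")
assert key_a == key_b  # order of sites does not change the key
```

The point of a scope-stable key is that re-running the same configuration can skip items already processed, while any change to sites, categories, or mode produces a fresh key.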
## Input reference
Use these input keys in your Apify run:
| Input field | Type | Required | Description |
|---|---|---|---|
| `sites_to_scrape` | array[string] | No | Select one or more active sites. If omitted, defaults to `["AP News"]`. If present as an empty array, the Actor scrapes all active catalog sites. |
| `categories_to_scrape` | array[string] | No | Optional category overrides in the format `Site Name\|\|\|Category URL` (see the category-targeted example below). |
| `execution_mode` | string | No | `current` or `historic`. Defaults to `current`. |
| `historic_cutoff_date` | string | Required in `historic` mode | ISO-8601 cutoff (example: `2025-01-01T00:00:00Z`). |
| `historic_max_pages_per_category` | integer | No | Optional max pagination depth per category in `historic` mode. |
| `max_items_per_site` | integer | No | Per-site cap when `no_items_limit` is `false`. Default `1`. |
| `no_items_limit` | boolean | No | If `true`, ignores `max_items_per_site`. |
| `proxy_config` | object | No | Apify Proxy or custom proxy URLs for better reliability. |
| `site_category_filters` | array[object] | No | Advanced legacy override. Prefer `categories_to_scrape`. |
## Quick start

### 1) Current news scrape (selected sites)
```json
{
  "sites_to_scrape": ["Reuters", "Gulf News", "AP News"],
  "execution_mode": "current",
  "max_items_per_site": 50,
  "no_items_limit": false,
  "proxy_config": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```
### 2) Historical news scraping (backfill)
```json
{
  "sites_to_scrape": ["The Punch", "The Guardian UK"],
  "execution_mode": "historic",
  "historic_cutoff_date": "2025-01-01T00:00:00Z",
  "historic_max_pages_per_category": 100,
  "no_items_limit": true,
  "proxy_config": {
    "useApifyProxy": true
  }
}
```
### 3) Category-targeted scraping
```json
{
  "sites_to_scrape": ["Reuters", "Gulf News"],
  "categories_to_scrape": [
    "Reuters|||https://www.reuters.com/world/",
    "Reuters|||https://www.reuters.com/business/",
    "Gulf News|||https://gulfnews.com/business"
  ],
  "execution_mode": "current",
  "max_items_per_site": 100
}
```
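Any of the inputs above can also be supplied programmatically via the `apify-client` Python package. A minimal sketch, assuming you run it with your own API token in the `APIFY_TOKEN` environment variable; the Actor ID placeholder below is hypothetical and should be replaced with this Actor's real ID from its store page:

```python
import os

def run_news_scraper(actor_id: str, run_input: dict) -> list[dict]:
    """Call the Actor with the given input and return its default-dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    # .call() starts the run and waits for it to finish.
    run = client.actor(actor_id).call(run_input=run_input)
    # Successful articles land in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

run_input = {
    "sites_to_scrape": ["AP News"],
    "execution_mode": "current",
    "max_items_per_site": 10,
}
# items = run_news_scraper("<your-username>/<actor-id>", run_input)  # placeholder ID
```

The same input shape works from the Apify Console, scheduled runs, and the REST API.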
## Output

### Default dataset
Successful article records are pushed to the default dataset.
Typical fields include:
`site_name`, `country`, `region`, `language`, `article_title`, `author`, `article_body`, `tags`, `date_published`, `article_url`, `url_hash`, `main_image_url`, `seo_description`, `scraped_at`, `scraping_tool`, `execution_mode`, `category_url`, `source_html_lang`, `cutoff_filtered`
Example item:
```json
{
  "site_name": "Reuters",
  "country": "United Kingdom",
  "region": "Europe",
  "language": "en",
  "article_title": "Sample headline",
  "author": "Editorial Team",
  "article_body": "Full normalized article text...",
  "tags": ["markets", "energy"],
  "date_published": "2026-03-20T10:15:00Z",
  "article_url": "https://www.reuters.com/world/example-story/",
  "url_hash": "d41d8cd98f00b204e9800998ecf8427e",
  "main_image_url": "https://example.com/image.jpg",
  "seo_description": "Summary description",
  "scraped_at": "2026-03-20T10:20:00Z",
  "scraping_tool": "scrapling",
  "execution_mode": "historic",
  "category_url": "https://www.reuters.com/world/",
  "source_html_lang": "en",
  "cutoff_filtered": false
}
```
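The `url_hash` field makes URL-level deduplication cheap on the consumer side. A minimal sketch of that pattern, assuming the hash is an MD5 of the article URL (the 32-character hex value above is consistent with MD5, but the Actor's exact hashing function is an assumption here):

```python
import hashlib

def url_hash(article_url: str) -> str:
    """Hash an article URL for dedup lookups (MD5 assumed for illustration)."""
    return hashlib.md5(article_url.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_new_article(article_url: str) -> bool:
    """Return True the first time a URL is seen, False on repeats."""
    h = url_hash(article_url)
    if h in seen:
        return False
    seen.add(h)
    return True

assert is_new_article("https://www.reuters.com/world/example-story/")
assert not is_new_article("https://www.reuters.com/world/example-story/")
```

Keying on the hash rather than the raw URL keeps the dedup set compact when processing large backfills.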
### Additional run artifacts
- Named dataset `error-log`: extraction/fetch failures and fallback diagnostics
- Key-value store record `OUTPUT`: run summary (`successItemCount`, `errorItemCount`, mode, selected scope)
- Output tab links (configured in `.actor/output_schema.json`):
  - default dataset items
  - overview dataset view
  - `OUTPUT` record
  - run API details
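The shape of the `OUTPUT` summary can be sketched as below. The `successItemCount` and `errorItemCount` field names come from this page; the remaining field names are illustrative assumptions, not the Actor's confirmed schema:

```python
def build_run_summary(items: list[dict], errors: list[dict],
                      mode: str, sites: list[str]) -> dict:
    """Assemble an OUTPUT-style run summary from pushed items and errors."""
    return {
        "successItemCount": len(items),
        "errorItemCount": len(errors),
        "mode": mode,
        "selectedSites": sites,  # assumed field name for the selected scope
    }

summary = build_run_summary(
    items=[{"article_url": "https://example.com/a"}],
    errors=[],
    mode="current",
    sites=["AP News"],
)
```

Reading `successItemCount` against `errorItemCount` from `OUTPUT` is a quick health check before consuming the dataset.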
## Data quality and reliability notes
- Website markup changes can affect extraction quality for specific sources.
- Protected sites may require proxy routing for stable results.
- Historical runs can be large; use `historic_max_pages_per_category` and/or `max_items_per_site` for faster, more controlled runs.
- Backend fallback improves resilience but can increase runtime on difficult pages.
## Performance tips
- Use `execution_mode: "current"` for recurring monitoring.
- Use `execution_mode: "historic"` + `historic_cutoff_date` for backfills.
- Keep the site scope narrow during testing before large runs.
- Enable `proxy_config.useApifyProxy` for better success rates on anti-bot-protected domains.
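For domains where Apify Proxy is not an option, `proxy_config` also accepts custom proxy URLs. The sketch below assumes the field follows Apify's standard proxy-input shape (`useApifyProxy` / `proxyUrls`); the proxy URL itself is a placeholder:

```json
{
  "proxy_config": {
    "useApifyProxy": false,
    "proxyUrls": ["http://user:pass@my-proxy.example.com:8000"]
  }
}
```

Custom proxies and Apify Proxy are mutually exclusive in this shape: set `useApifyProxy` to `false` when supplying your own `proxyUrls`.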
## SEO keywords
Apify news scraper, global news scraping, historical news scraper, real-time news scraping, article extraction API, structured news dataset, media monitoring scraper, web scraping for journalism, multilingual news scraping, category-based news scraping.
## Compliance reminder
Use this Actor responsibly and in line with each target website's terms, robots directives, and applicable laws and regulations in your jurisdiction.