Ultimate News Scraper - Rise of the Phoenix

Pricing

from $4.50 / 1,000 results

Powerful Apify news scraper for real-time and historical article extraction across 800+ global publishers. Built with smart fallback crawling (Scrapling, PyDoll, Selenium), category targeting, proxy support, and clean JSON output with error analytics for reliable, scalable intelligence pipelines.

Developer

Inus Grobler

Maintained by Community

Global News Scraper for Apify - Current + Historical Article Extraction

Extract structured news articles at scale from a large global publisher catalog using a resilient multi-backend pipeline (scrapling -> pydoll -> selenium).

This Apify Actor is built for teams that need reliable news scraping, historical news backfills, and structured article datasets for analytics, monitoring, AI pipelines, OSINT workflows, and research.

Why this Apify Actor

  • Scrapes current headlines or runs deep historical backfills from tracked news websites.
  • Targets all catalog sites, specific sites, or specific site-category URLs.
  • Automatically falls back across multiple fetch/extraction backends for better resilience.
  • Produces normalized article records in the default dataset.
  • Writes scrape failures and diagnostics to a dedicated error-log dataset.
  • Supports Apify Proxy and custom proxy URLs for difficult domains.
  • Uses URL-hash-based item caching to reduce repeated processing.
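
The URL-hash caching mentioned above can be sketched as follows. Using MD5 over the raw article URL is an assumption based on the `url_hash` field in the example output, not a confirmed implementation detail:

```python
import hashlib

def url_hash(article_url: str) -> str:
    """Stable cache key for an article URL (assumed: MD5 of the raw URL)."""
    return hashlib.md5(article_url.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_new(article_url: str) -> bool:
    """Return True the first time a URL is seen, False on repeats."""
    h = url_hash(article_url)
    if h in seen:
        return False
    seen.add(h)
    return True
```

A cache like this lets repeated runs skip articles that were already pushed to the dataset.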

Best use cases

  • Media monitoring and competitive intelligence
  • News aggregation and content intelligence pipelines
  • Historical event datasets for LLM/RAG ingestion
  • Topic tracking by website category
  • Regional and multilingual news collection

How it works

  1. Normalize Actor input into a validated runtime config.
  2. Resolve proxy settings (Apify Proxy or custom proxy URLs).
  3. Build a stable cache key from target scope (sites + categories + mode).
  4. Run scraper pipeline with fallback fetchers.
  5. Push successful items to the default dataset.
  6. Push error telemetry to the error-log dataset when available.
  7. Store run summary in key-value store record OUTPUT.
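
Steps 4-6 above can be pictured as a minimal fallback loop. The backend names mirror the documented pipeline order; the function signatures are hypothetical:

```python
from typing import Callable, Optional

def fetch_with_fallback(url: str,
                        backends: list[tuple[str, Callable[[str], Optional[dict]]]]):
    """Try each backend in order; return the first successful item plus
    the error telemetry collected from the backends that failed."""
    errors = []
    for name, fetch in backends:
        try:
            item = fetch(url)
            if item:
                item["scraping_tool"] = name  # matches the output field of that name
                return item, errors
        except Exception as exc:
            errors.append({"backend": name, "url": url, "error": str(exc)})
    return None, errors
```

On success the item goes to the default dataset; the accumulated `errors` list corresponds to what lands in the error-log dataset.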

Input reference

Use these input keys in your Apify run:

  • sites_to_scrape (array[string], optional): Select one or more active sites. If omitted, defaults to ["AP News"]. If passed as an empty array, the Actor scrapes all active catalog sites.
  • categories_to_scrape (array[string], optional): Category overrides in the format `Site Name|||Category URL`.
  • execution_mode (string, required): `current` or `historic`. Defaults to `current`.
  • historic_cutoff_date (string, required in historic mode): ISO-8601 cutoff (example: 2025-01-01T00:00:00Z).
  • historic_max_pages_per_category (integer, optional): Max pagination depth per category in historic mode.
  • max_items_per_site (integer, optional): Per-site cap when no_items_limit is false. Default 1.
  • no_items_limit (boolean, optional): If true, ignores max_items_per_site.
  • proxy_config (object, optional): Apify Proxy or custom proxy URLs for better reliability.
  • site_category_filters (array[object], optional): Advanced legacy override. Prefer categories_to_scrape.

Quick start

1) Current news scrape (selected sites)

{
  "sites_to_scrape": ["Reuters", "Gulf News", "AP News"],
  "execution_mode": "current",
  "max_items_per_site": 50,
  "no_items_limit": false,
  "proxy_config": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
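
The same run input can be assembled programmatically, which is handy when the site list comes from configuration. The helper name below is just for illustration; field names follow the input reference above:

```python
import json

def build_current_input(sites, max_items=50, residential_proxy=True):
    """Build a 'current' mode run input for this Actor."""
    proxy = {"useApifyProxy": True}
    if residential_proxy:
        proxy["apifyProxyGroups"] = ["RESIDENTIAL"]
    return {
        "sites_to_scrape": list(sites),
        "execution_mode": "current",
        "max_items_per_site": max_items,
        "no_items_limit": False,
        "proxy_config": proxy,
    }

run_input = build_current_input(["Reuters", "Gulf News", "AP News"])
print(json.dumps(run_input, indent=2))
```

The resulting dict can then be submitted with the apify-client package via `client.actor("<actor-id>").call(run_input=run_input)`, substituting the Actor ID shown on this Store page.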

2) Historical news scraping (backfill)

{
  "sites_to_scrape": ["The Punch", "The Guardian UK"],
  "execution_mode": "historic",
  "historic_cutoff_date": "2025-01-01T00:00:00Z",
  "historic_max_pages_per_category": 100,
  "no_items_limit": true,
  "proxy_config": {
    "useApifyProxy": true
  }
}
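
How the cutoff is applied is internal to the Actor, but the date gate can be pictured like this. It assumes articles dated before historic_cutoff_date are excluded (or flagged via the `cutoff_filtered` output field):

```python
from datetime import datetime

def parse_iso8601(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; the 'Z' suffix is normalized for
    compatibility with Python versions older than 3.11."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def is_before_cutoff(date_published: str, cutoff: str) -> bool:
    """True when an article predates the historic cutoff."""
    return parse_iso8601(date_published) < parse_iso8601(cutoff)
```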

3) Category-targeted scraping

{
  "sites_to_scrape": ["Reuters", "Gulf News"],
  "categories_to_scrape": [
    "Reuters|||https://www.reuters.com/world/",
    "Reuters|||https://www.reuters.com/business/",
    "Gulf News|||https://gulfnews.com/business"
  ],
  "execution_mode": "current",
  "max_items_per_site": 100
}
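
Each categories_to_scrape entry packs a site name and a category URL with a "|||" separator. A tiny helper (the name is hypothetical) can validate entries before launching a run:

```python
def parse_category_override(entry: str) -> tuple[str, str]:
    """Split a 'Site Name|||Category URL' override into its two parts,
    rejecting malformed entries early."""
    site, sep, url = entry.partition("|||")
    if not sep or not url.startswith("http"):
        raise ValueError(f"bad category override: {entry!r}")
    return site, url
```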

Output

Default dataset

Successful article records are pushed to the default dataset.

Typical fields include:

  • site_name, country, region, language
  • article_title, author, article_body, tags
  • date_published, article_url, url_hash
  • main_image_url, seo_description
  • scraped_at, scraping_tool, execution_mode
  • category_url, source_html_lang, cutoff_filtered

Example item:

{
  "site_name": "Reuters",
  "country": "United Kingdom",
  "region": "Europe",
  "language": "en",
  "article_title": "Sample headline",
  "author": "Editorial Team",
  "article_body": "Full normalized article text...",
  "tags": ["markets", "energy"],
  "date_published": "2026-03-20T10:15:00Z",
  "article_url": "https://www.reuters.com/world/example-story/",
  "url_hash": "d41d8cd98f00b204e9800998ecf8427e",
  "main_image_url": "https://example.com/image.jpg",
  "seo_description": "Summary description",
  "scraped_at": "2026-03-20T10:20:00Z",
  "scraping_tool": "scrapling",
  "execution_mode": "historic",
  "category_url": "https://www.reuters.com/world/",
  "source_html_lang": "en",
  "cutoff_filtered": false
}
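
Dataset items shaped like the example above are easy to post-process downstream. This sketch groups article URLs by site_name, e.g. after fetching items with apify-client's `client.dataset(dataset_id).iterate_items()`:

```python
from collections import defaultdict

def group_by_site(items):
    """Map each site_name to the list of article URLs scraped from it."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["site_name"]].append(item["article_url"])
    return dict(grouped)
```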

Additional run artifacts

  • Named dataset error-log: extraction/fetch failures and fallback diagnostics
  • Key-value store record OUTPUT: run summary (successItemCount, errorItemCount, mode, selected scope)
  • Output tab links (configured in .actor/output_schema.json):
    • default dataset items
    • overview dataset view
    • OUTPUT record
    • run API details
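
The OUTPUT record can be retrieved with apify-client's `client.key_value_store(store_id).get_record("OUTPUT")`. Its summary fields can also be recomputed locally, as in this sketch; the selectedSites key name is an assumption, while successItemCount, errorItemCount, and mode follow the fields listed above:

```python
def build_run_summary(success_items, error_items, mode, sites):
    """Recompute an OUTPUT-style run summary from item lists."""
    return {
        "successItemCount": len(success_items),
        "errorItemCount": len(error_items),
        "mode": mode,
        "selectedSites": list(sites),  # hypothetical key name
    }
```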

Data quality and reliability notes

  • Website markup changes can affect extraction quality for specific sources.
  • Protected sites may require proxy routing for stable results.
  • Historical runs can be large; use historic_max_pages_per_category and/or max_items_per_site for faster, more controlled runs.
  • Backend fallback improves resilience but can increase runtime on difficult pages.

Performance tips

  • Use execution_mode: "current" for recurring monitoring.
  • Use execution_mode: "historic" + historic_cutoff_date for backfills.
  • Keep site scope narrow during testing before large runs.
  • Enable proxy_config.useApifyProxy for better success rates on anti-bot-protected domains.

SEO keywords

Apify news scraper, global news scraping, historical news scraper, real-time news scraping, article extraction API, structured news dataset, media monitoring scraper, web scraping for journalism, multilingual news scraping, category-based news scraping.

Compliance reminder

Use this Actor responsibly and in line with each target website's terms, robots directives, and applicable laws and regulations in your jurisdiction.