
Rise of the Phoenix: Website Scraper

Pricing

from $4.50 / 1,000 results

Powerful Apify news scraper for real-time and historical article extraction across 800+ global publishers. Built with smart fallback crawling (Scrapling, PyDoll, Selenium), category targeting, proxy support, and clean JSON output with error analytics for reliable, scalable intelligence pipelines.


Rating: 0.0 (0)

Developer: Inus Grobler (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago


The Rise of the Phoenix - Apify News Scraper

A high-scale Apify news scraper for real-time and historical news extraction across a large global publisher catalog.

Built for production data pipelines, monitoring, intelligence, media analysis, and research workflows.

Why This Actor

  • Scrapes current and historic articles from hundreds of tracked news sites
  • Uses resilient fallback fetching: scrapling -> pydoll -> selenium
  • Supports targeted site/category runs or broad catalog runs
  • Returns structured article output plus structured scrape error telemetry
  • Works with Apify Proxy for difficult sites
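
The fallback order above (scrapling -> pydoll -> selenium) follows a simple try-next-tool pattern. The sketch below illustrates that pattern only; the fetcher functions are stand-ins for the real Scrapling, PyDoll, and Selenium integrations, not the Actor's actual code.

```python
from typing import Callable, List, Optional, Tuple

# Stand-in fetchers: in the real Actor these would wrap Scrapling,
# PyDoll, and Selenium respectively (names here are illustrative).
def fetch_scrapling(url: str) -> Optional[str]:
    raise RuntimeError("blocked")        # simulate a failed fetch

def fetch_pydoll(url: str) -> Optional[str]:
    return "<html>article body</html>"   # simulate a successful fetch

def fetch_selenium(url: str) -> Optional[str]:
    return "<html>article body</html>"

def fetch_with_fallback(url: str, fetchers: List[Callable[[str], Optional[str]]]):
    """Try each fetcher in order; return (tool_name, html) on the first success,
    or (None, errors) with per-tool error telemetry if every tool fails."""
    errors = {}
    for fetcher in fetchers:
        try:
            html = fetcher(url)
            if html:
                return fetcher.__name__, html
        except Exception as exc:         # record the failure, fall through to next tool
            errors[fetcher.__name__] = str(exc)
    return None, errors

tool, html = fetch_with_fallback(
    "https://example.com/article",
    [fetch_scrapling, fetch_pydoll, fetch_selenium],
)
print(tool)  # -> fetch_pydoll (the scrapling stand-in failed, pydoll succeeded)
```

Collecting the per-tool errors instead of raising immediately is what lets the Actor emit fallback diagnostics to the error-log dataset rather than silently dropping a URL.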

Apify Input Reference

Use these exact input keys in your Apify run:

  • sites_to_scrape (array[string], optional): Select one or more active catalog sites. The default run target is ["AP News"].
  • categories_to_scrape (array[string], optional): Manual category override values in the format Site Name|||Category URL (see input example 3).
  • execution_mode (string, required): current or historic.
  • historic_cutoff_date (string, required in historic mode): ISO timestamp cutoff, e.g. 2025-01-01T00:00:00Z.
  • max_items_per_site (integer, optional): Per-site cap applied when no_items_limit is false (default 10).
  • no_items_limit (boolean, optional): If true, ignores max_items_per_site.
  • proxy_config (object, optional): Apify Proxy or custom proxy URLs.
  • site_category_filters (array[object], optional): Advanced legacy override for explicit site-to-category mapping.
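
A small pre-flight check of the rules above (execution_mode is required, historic_cutoff_date only in historic mode, category values carry the ||| separator) can catch input mistakes before a run is started. This helper is an illustration, not part of the Actor:

```python
def validate_run_input(run_input: dict) -> list:
    """Return a list of problems with a run input dict, per the field reference above."""
    problems = []
    mode = run_input.get("execution_mode")
    if mode not in ("current", "historic"):
        problems.append("execution_mode must be 'current' or 'historic'")
    if mode == "historic" and not run_input.get("historic_cutoff_date"):
        problems.append("historic_cutoff_date is required in historic mode")
    for cat in run_input.get("categories_to_scrape", []):
        if "|||" not in cat:
            problems.append(f"bad category value (expected 'Site Name|||URL'): {cat}")
    return problems

print(validate_run_input({"execution_mode": "historic"}))
# -> ['historic_cutoff_date is required in historic mode']
```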

Input Examples

1) Current scraping (selected sites)

{
  "sites_to_scrape": ["Reuters", "Gulf News"],
  "execution_mode": "current",
  "max_items_per_site": 50,
  "no_items_limit": false,
  "proxy_config": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

2) Historic scraping (with cutoff)

{
  "sites_to_scrape": ["The Punch"],
  "execution_mode": "historic",
  "historic_cutoff_date": "2025-01-01T00:00:00Z",
  "no_items_limit": true,
  "proxy_config": {
    "useApifyProxy": true
  }
}
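
The cutoff in historic mode is an ISO-8601 timestamp. A client-side sanity filter on collected records might look like the sketch below; the published_at field name is an assumption for illustration, not the Actor's documented schema:

```python
from datetime import datetime

def parse_iso(ts: str) -> datetime:
    # fromisoformat() on Pythons before 3.11 rejects a trailing "Z", so normalize it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def after_cutoff(articles: list, cutoff: str) -> list:
    """Keep only records published on or after the cutoff timestamp."""
    limit = parse_iso(cutoff)
    return [a for a in articles if parse_iso(a["published_at"]) >= limit]

articles = [
    {"url": "a", "published_at": "2025-03-10T08:00:00Z"},
    {"url": "b", "published_at": "2024-11-02T17:30:00Z"},
]
print(after_cutoff(articles, "2025-01-01T00:00:00Z"))  # keeps only "a"
```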

3) Category-targeted scraping

{
  "sites_to_scrape": ["Gulf News", "Reuters"],
  "categories_to_scrape": [
    "Gulf News|||https://gulfnews.com/business",
    "Reuters|||https://www.reuters.com/world/"
  ],
  "execution_mode": "current",
  "max_items_per_site": 100
}
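
With the apify-client Python package, starting a run with example 1's input might look like the following sketch. The "username/actor-id" placeholder is an assumption: substitute the Actor's real ID from its store page, and supply your own API token.

```python
import os

# Run input mirroring input example 1 above.
run_input = {
    "sites_to_scrape": ["Reuters", "Gulf News"],
    "execution_mode": "current",
    "max_items_per_site": 50,
    "no_items_limit": False,
    "proxy_config": {"useApifyProxy": True},
}

if __name__ == "__main__" and os.environ.get("APIFY_TOKEN"):
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    # Placeholder Actor ID: replace with the real one from the store page.
    run = client.actor("username/actor-id").call(run_input=run_input)
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    print(f"scraped {len(items)} articles")
```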

Output

This Actor writes:

  • Default dataset: successful article records
  • Named dataset error-log: failed URLs, tool fallback diagnostics, and extraction errors
  • Key-value store OUTPUT: run summary (successItemCount, errorItemCount, mode, and site scope)
  • Apify Output tab links: configured via .actor/output_schema.json for quick access to dataset items and run summary
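
Because the error-log dataset records which site and fallback tool a failure came from, it can feed simple run diagnostics. The record shape below is illustrative, not the Actor's exact schema:

```python
from collections import Counter

def summarize_errors(error_records: list) -> dict:
    """Count error-log records by site and by the tool that failed."""
    by_site = Counter(r.get("site", "unknown") for r in error_records)
    by_tool = Counter(r.get("tool", "unknown") for r in error_records)
    return {"by_site": dict(by_site), "by_tool": dict(by_tool)}

# Sample records; the field names are assumptions for illustration.
sample = [
    {"site": "Reuters", "tool": "selenium", "error": "timeout"},
    {"site": "Reuters", "tool": "scrapling", "error": "403"},
    {"site": "Gulf News", "tool": "scrapling", "error": "403"},
]
print(summarize_errors(sample)["by_site"])  # -> {'Reuters': 2, 'Gulf News': 1}
```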

Best Practices

  • Use execution_mode: "current" for daily monitoring and near-real-time ingestion.
  • Use execution_mode: "historic" with historic_cutoff_date for backfills.
  • Use categories_to_scrape for precise topical runs without editing catalog files.
  • Keep proxy_config.useApifyProxy enabled for better stability on protected domains.

Keywords

Apify news scraper, historical news scraping, web scraping API, article extraction, media monitoring, dataset automation, scalable scraping pipeline.