Ultimate News Scraper - Rise of the Phoenix

Pricing

from $1.50 / 1,000 results

Search a news archive by country, website, and publication date. Estimate result counts, fetch paginated historical articles, and export clean news datasets without running a live scrape.

Rating

5.0

(1)

Developer

Inus Grobler

Maintained by Community

Global News Archive Search

Search historical news articles from a Supabase-powered news archive by country, website, and published date. This Apify Actor is built for fast article retrieval, result estimation, and cursor-based pagination without running a live scrape during the Actor run.

What this Actor does

  • Searches archived news articles already stored in Supabase
  • Filters results by countries or websites
  • Filters results by published_from and published_to
  • Supports estimate_only so you can check result size before fetching rows
  • Returns article data to the default Apify dataset
  • Supports continuation tokens for paging through large result sets

What this Actor does not do

  • It does not scrape websites live during the run
  • It does not open browsers or crawl pages on demand
  • It only returns data that already exists in the underlying archive

Best use cases

  • News monitoring
  • Media intelligence
  • Historical article lookup
  • Research workflows
  • Data enrichment pipelines
  • Country-level or source-level article exports

Quick start

  1. Choose countries or websites
  2. Set your date range
  3. Turn on estimate_only if you want a count first
  4. Run the Actor
  5. Read article rows from the default dataset and paging information from the OUTPUT record

Simple input examples

Search by country

{
  "countries": ["Africa"],
  "published_from": "1000 days",
  "published_to": "0 days",
  "max_results": 100
}

Search by website

{
  "websites": ["Reuters", "AP News"],
  "published_from": "2025-01-01",
  "published_to": "2025-12-31",
  "max_results": 100
}

Estimate results first

{
  "countries": ["United States"],
  "published_from": "30 days",
  "published_to": "0 days",
  "estimate_only": true,
  "max_results": 500
}

Continue to the next page

{
  "countries": ["Africa"],
  "published_from": "1000 days",
  "published_to": "0 days",
  "max_results": 100,
  "continuation_token": "{\"date_published\":\"2026-05-06T15:44:00+00:00\",\"url_hash\":\"81a489c65af24950956dd717c2f7b4be\"}"
}
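
The token in the example above is itself a JSON string, which is why its quotes are escaped inside the input. The following is a minimal sketch for inspecting such a token, assuming it keeps the shape shown above (a date_published timestamp and a url_hash). The helper name describe_token is hypothetical, and the token format is an observation from this one example rather than a documented contract, so pass tokens back unchanged and never build them by hand.

```python
import json
from datetime import datetime


def describe_token(token: str) -> dict:
    """Decode a continuation token for inspection only.

    Assumption: the token is a JSON object string shaped like the example above.
    """
    cursor = json.loads(token)
    return {
        "date_published": datetime.fromisoformat(cursor["date_published"]),
        "url_hash": cursor["url_hash"],
    }


# Token copied from the example input above (unescaped form).
token = '{"date_published":"2026-05-06T15:44:00+00:00","url_hash":"81a489c65af24950956dd717c2f7b4be"}'
info = describe_token(token)

# Placing the token in a run input and serializing it escapes the inner quotes.
run_input = {"countries": ["Africa"], "max_results": 100, "continuation_token": token}
serialized = json.dumps(run_input, indent=2)
print(serialized)
```

Serializing the input with json.dumps produces the escaped form automatically, so there is no need to add backslashes manually.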

Input guide

| Field | Type | Required | How it works |
| --- | --- | --- | --- |
| countries | string[] | No | Search by one or more countries. Use this or websites, not both. |
| websites | string[] | No | Search by one or more source names. Use this or countries, not both. |
| published_from | string | No | Start of the date range. Supports ISO dates like 2025-01-01 and relative dates like 30 days. |
| published_to | string | No | End of the date range. Supports ISO dates like 2025-12-31 and relative dates like 0 days. |
| estimate_only | boolean | No | If true, the Actor returns a count estimate and no article rows. |
| max_results | integer | No | Maximum number of rows to return. Default is 10. Maximum is 5000. |
| continuation_token | string | No | Use the token from the previous run to fetch the next page. |
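
The rules in the input guide can be checked on the client side before starting a run. This is a minimal sketch, not the Actor's own validation: validate_input is a hypothetical helper, and the lower bound of 1 for max_results is an assumption (the guide only states the maximum of 5000).

```python
MAX_RESULTS_LIMIT = 5000  # maximum stated in the input guide


def validate_input(run_input: dict) -> list:
    """Return a list of problems found in a run input (empty if none)."""
    problems = []

    # The guide says to use countries or websites, not both.
    if run_input.get("countries") and run_input.get("websites"):
        problems.append("use countries or websites, not both")

    max_results = run_input.get("max_results")
    if max_results is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_results, bool) or not isinstance(max_results, int):
            problems.append("max_results must be an integer")
        elif not 1 <= max_results <= MAX_RESULTS_LIMIT:  # lower bound is an assumption
            problems.append(f"max_results must be between 1 and {MAX_RESULTS_LIMIT}")

    estimate_only = run_input.get("estimate_only")
    if estimate_only is not None and not isinstance(estimate_only, bool):
        problems.append("estimate_only must be a boolean")

    return problems


print(validate_input({"countries": ["Africa"], "websites": ["Reuters"], "max_results": 6000}))
```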

Helpful defaults

  • If you provide neither countries nor websites, the Actor defaults to AP News
  • If you provide no dates, the Actor defaults to the last 10 days
  • If you provide only published_from, published_to defaults to 0 days
  • If you provide only published_to, published_from is derived automatically
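
To preview what a sparse input resolves to, the defaults listed above can be mirrored locally. This is a sketch under stated assumptions: apply_defaults is a hypothetical helper, "the last 10 days" is represented here as published_from "10 days" with published_to "0 days", and the case where only published_to is given is left unresolved because the derivation rule is not documented.

```python
def apply_defaults(run_input: dict) -> dict:
    """Return a copy of run_input with the documented defaults filled in."""
    resolved = dict(run_input)

    # Neither countries nor websites: the Actor defaults to AP News.
    if not resolved.get("countries") and not resolved.get("websites"):
        resolved["websites"] = ["AP News"]

    has_from = "published_from" in resolved
    has_to = "published_to" in resolved
    if not has_from and not has_to:
        # Assumption: "last 10 days" expressed with the relative date syntax.
        resolved["published_from"] = "10 days"
        resolved["published_to"] = "0 days"
    elif has_from and not has_to:
        resolved["published_to"] = "0 days"
    # Only published_to given: the Actor derives published_from itself,
    # and the rule is not documented, so nothing is filled in here.

    resolved.setdefault("max_results", 10)  # default from the input guide
    return resolved


print(apply_defaults({}))
```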

Date format examples

  • 2025-01-01
  • 2025-01-01T00:00:00Z
  • 7 days
  • 30 days
  • 12 months
  • 2 years
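
For client-side checks, the formats above can be resolved to concrete timestamps. This is a rough sketch, not the Actor's parser: resolve_date is a hypothetical helper, relative values are assumed to mean "that long before now", months and years are approximated as 30 and 365 days, and dates without a timezone are assumed to be UTC.

```python
import re
from datetime import datetime, timedelta, timezone

# Approximate unit lengths; the Actor's own month/year arithmetic is not documented.
_UNIT_DAYS = {"day": 1, "month": 30, "year": 365}
_RELATIVE = re.compile(r"^\s*(\d+)\s+(day|month|year)s?\s*$", re.IGNORECASE)


def resolve_date(value: str, now: datetime) -> datetime:
    """Resolve an ISO or relative date string to an aware datetime."""
    match = _RELATIVE.match(value)
    if match:
        amount = int(match.group(1))
        unit = match.group(2).lower()
        return now - timedelta(days=amount * _UNIT_DAYS[unit])

    text = value.strip()
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed


now = datetime.now(timezone.utc)
for example in ["2025-01-01", "2025-01-01T00:00:00Z", "7 days", "12 months", "2 years"]:
    print(example, "->", resolve_date(example, now).isoformat())
```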

Output

Article rows are pushed to the default Apify dataset.

Common dataset fields:

  • site_name
  • country
  • region
  • language
  • article_title
  • author
  • article_body
  • tags
  • date_published
  • article_url
  • main_image_url
  • seo_description

Example dataset item:

{
  "site_name": "Africa News",
  "country": "Africa",
  "region": "Western Africa | Eastern Africa | Southern Africa | Middle Africa | Northern Africa",
  "language": "en|fr",
  "article_title": "Trump hosts Dutch royals at the White House for dinner and overnight stay | Africanews",
  "author": null,
  "article_body": "Normalized article text...",
  "tags": [],
  "date_published": "2026-05-06T16:13:00+00:00",
  "article_url": "https://www.euronews.com/2026/04/14/trump-hosts-dutch-royals-at-the-white-house-for-dinner-and-overnight-stay",
  "main_image_url": null,
  "seo_description": null
}
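
In the example item, region and language hold several values joined with a pipe character. A small sketch of splitting them into lists follows. It assumes the pipe convention seen in this one example applies to other rows as well; split_multi, normalize_article, and the added regions and languages keys are hypothetical names, not fields the Actor returns.

```python
def split_multi(value) -> list:
    """Split a pipe-delimited string into trimmed parts; None or empty gives []."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def normalize_article(item: dict) -> dict:
    """Return a copy of a dataset item with list versions of region and language."""
    return {
        **item,
        "regions": split_multi(item.get("region")),
        "languages": split_multi(item.get("language")),
    }


item = {
    "site_name": "Africa News",
    "region": "Western Africa | Eastern Africa | Southern Africa | Middle Africa | Northern Africa",
    "language": "en|fr",
    "author": None,
}
article = normalize_article(item)
print(article["regions"], article["languages"])
```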

OUTPUT record

Each run also writes a lightweight OUTPUT record with summary metadata.

Typical fields:

  • resultCount
  • hasMore
  • nextContinuationToken
  • filters
  • estimatedMatchCount in estimate mode
  • estimatedReturnedThisRun in estimate mode

Estimate mode

Use estimate_only: true when you want to see how many articles match before pulling rows.

In estimate mode:

  • no dataset rows are returned
  • the OUTPUT record includes the estimated match count
  • you can rerun the same input with estimate_only: false to fetch rows
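
The estimate-then-fetch flow described above can be sketched as follows. The run_actor argument is a hypothetical callable that starts a run with the given input and returns its OUTPUT record as a dict; how it talks to Apify is left out on purpose. fake_run_actor is a stand-in with canned values so that the sketch runs without network access.

```python
from typing import Callable


def fetch_if_small(run_actor: Callable[[dict], dict], filters: dict, limit: int) -> dict:
    """Estimate first, then fetch rows only if the estimate is within limit."""
    estimate = run_actor({**filters, "estimate_only": True})
    count = estimate.get("estimatedMatchCount", 0)
    if count > limit:
        return {"fetched": False, "estimatedMatchCount": count}

    # Same filters, with estimate_only switched off to fetch rows.
    output = run_actor({**filters, "estimate_only": False})
    return {
        "fetched": True,
        "estimatedMatchCount": count,
        "resultCount": output.get("resultCount"),
    }


def fake_run_actor(run_input: dict) -> dict:
    """Stand-in for a real Actor call; returns a canned OUTPUT record."""
    if run_input.get("estimate_only"):
        return {"estimatedMatchCount": 250}
    return {"resultCount": min(250, run_input.get("max_results", 10)), "hasMore": False}


filters = {"countries": ["United States"], "published_from": "30 days", "max_results": 500}
print(fetch_if_small(fake_run_actor, filters, limit=1000))
```

Checking the estimate against a limit before fetching helps keep result volume predictable, which matters when pricing is per result.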

Pagination

When hasMore is true, the OUTPUT record includes nextContinuationToken.

To fetch the next page:

  1. Copy nextContinuationToken from OUTPUT
  2. Use the same filters again
  3. Paste the token into continuation_token
  4. Run the Actor again
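
The steps above can be automated with a loop. In this sketch, run_actor is a hypothetical callable that starts a run and returns a pair of (dataset rows, OUTPUT record). The fake runner below pages through an in-memory list and uses a plain index as its token, which is not the real token format; it exists only so that the loop can run without network access.

```python
from typing import Callable, Iterator


def iterate_pages(run_actor: Callable[[dict], tuple], filters: dict, max_pages: int = 10) -> Iterator[list]:
    """Yield one list of rows per run until hasMore is false or max_pages is reached."""
    token = None
    for _ in range(max_pages):
        run_input = dict(filters)  # same filters on every page
        if token:
            run_input["continuation_token"] = token
        rows, output = run_actor(run_input)
        yield rows
        if not output.get("hasMore"):
            break
        token = output.get("nextContinuationToken")
        if not token:
            break  # defensive: hasMore without a token would loop forever


def make_fake_runner(articles: list, page_size: int):
    """Build a stand-in runner; its token is a list index, not the real format."""
    def run(run_input: dict):
        start = int(run_input.get("continuation_token", 0))
        rows = articles[start:start + page_size]
        end = start + len(rows)
        output = {"resultCount": len(rows), "hasMore": end < len(articles)}
        if output["hasMore"]:
            output["nextContinuationToken"] = str(end)
        return rows, output
    return run


runner = make_fake_runner(["a", "b", "c", "d", "e"], page_size=2)
pages = list(iterate_pages(runner, {"countries": ["Africa"]}))
print(pages)
```

The max_pages cap is a design choice in this sketch: it bounds the number of runs, and therefore the cost, if a result set turns out larger than expected.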

Why you might get zero results

  • The archive does not currently contain matching rows
  • The country or website filter is too narrow
  • The date range is too narrow
  • You requested a page beyond the available results

Production notes

  • The Actor is optimized for archive retrieval, not live crawling
  • Default run memory is 128 MB
  • Results are returned newest first
  • The Actor is suitable for production use when your Supabase archive is populated and DATABASE_URL is configured correctly