News Source Crawler
Pricing
from $1.00 / 1,000 results
Given a news website URL, discover and extract articles with full metadata: title, authors, publish date, body text, top image, keywords, and summary. Works with any news site via sitemap or HTML discovery.
Developer: Crawler Bros
News Source Crawler — Extract Articles From Any News Website
Point this actor at any news website and it returns a clean, structured record for every article it can find: title, authors, publish date, full body text, top image, language, keywords, and a short summary.
What this actor does
Give the actor a single root URL — a publication's homepage, a category page, or an internal newsroom — and it discovers the site's article URLs and extracts rich metadata for each one. Article discovery uses the site's sitemaps first (/sitemap.xml, /sitemap-news.xml, /post-sitemap.xml, plus any sitemaps declared in robots.txt) and falls back to parsing the homepage for article-shaped links when sitemaps are missing. You can also pass a hub URL (e.g. a /technology or /markets section page) and the crawler will scope discovery to just that section.
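The sitemap side of this discovery step can be sketched in a few lines. This is an illustrative stand-alone example, not the actor's actual code; `sitemap_article_urls` and the sample document are made up for the sketch:

```python
import xml.etree.ElementTree as ET

def sitemap_article_urls(sitemap_xml, hub_prefix=None):
    """Pull <loc> URLs out of a sitemap document, optionally scoped to a hub prefix."""
    root = ET.fromstring(sitemap_xml)
    # Sitemaps live in the sitemaps.org namespace; strip it before matching <loc>.
    urls = [el.text.strip() for el in root.iter()
            if el.tag.split("}")[-1] == "loc" and el.text]
    if hub_prefix:
        urls = [u for u in urls if u.startswith(hub_prefix)]
    return urls

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/technology/ai-story</loc></url>
  <url><loc>https://example.com/sports/match-report</loc></url>
</urlset>"""

scoped = sitemap_article_urls(sample, "https://example.com/technology")
# scoped → ["https://example.com/technology/ai-story"]
```

The same prefix check is what makes the hub-URL scoping cheap: candidate URLs that fall outside the section are dropped before any article is fetched.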
For every article it fetches, the actor extracts metadata using a cascade of strategies — JSON-LD structured data first (the gold standard, used by most modern publications), then Open Graph / article:* meta tags, then a paragraph-concatenation fallback over the <article> or <main> element. Optional extras include a top-10 keyword list (stopword-filtered), a three-sentence auto-summary, and a boolean keyword filter with full AND / OR / NOT / parentheses support so you only keep articles that match your brief.
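The cascade idea can be illustrated with a simplified title extractor. This is regex-based for brevity and only a sketch of the strategy order (JSON-LD → Open Graph → plain fallback); a real implementation would use a proper HTML parser, and `extract_title` is a hypothetical helper name:

```python
import json
import re

def extract_title(html):
    """Cascade: JSON-LD headline -> og:title -> <title> tag (simplified sketch)."""
    # 1. JSON-LD structured data.
    for m in re.finditer(r'<script[^>]*application/ld\+json[^>]*>(.*?)</script>', html, re.S):
        try:
            data = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("headline"):
            return data["headline"]
    # 2. Open Graph meta tag.
    m = re.search(r'<meta[^>]*property="og:title"[^>]*content="([^"]*)"', html)
    if m:
        return m.group(1)
    # 3. Plain <title> fallback.
    m = re.search(r'<title>(.*?)</title>', html, re.S)
    return m.group(1).strip() if m else None
```

Each strategy only runs when the richer one above it yields nothing, which is why well-marked-up publications get clean metadata while bare-bones sites still return something usable.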
The actor defaults to a fast HTTP-only path with 40 supported languages for stopword-aware keyword extraction. When a publication blocks datacenter IPs, the built-in auto proxy fallback silently retries discovery and per-article fetches through a residential proxy — so a single run can transparently switch from fast direct fetches to proxy-resilient fetches on the hard sites without any manual tuning.
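The retry-on-block pattern behind the proxy fallback looks roughly like this. The status codes and function names here are illustrative assumptions, not the actor's internals:

```python
# Responses treated as "blocked" for the purpose of this sketch.
BLOCK_STATUSES = {403, 407, 429, 503}

def fetch_with_fallback(url, direct_fetch, proxy_fetch):
    """Try the fast direct path first; retry through a residential proxy
    only when the response looks like a block or challenge."""
    status, body = direct_fetch(url)
    if status in BLOCK_STATUSES:
        status, body = proxy_fetch(url)
    return status, body
```

Because the proxy path is only taken on a block response, easy sites stay on cheap direct fetches for the whole run.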
Key features
- Sitemap-first article discovery — `/sitemap.xml`, `/sitemap-news.xml`, `/post-sitemap.xml`, and `Sitemap:` entries in `robots.txt`.
- HTML fallback when sitemaps are missing — scans the homepage for article-shaped links.
- Hub URL filter — point at a section page (e.g. `/tech`) to scope discovery to that section only.
- Cascade metadata extraction — JSON-LD → Open Graph / `article:*` → `<article>` body fallback.
- 40-language support for keyword extraction and stopword filtering.
- Boolean keyword filter with `AND`, `OR`, `NOT`, and parentheses.
- Minimum word count filter — drop thin content automatically.
- Top-image extraction from `og:image` / JSON-LD / first `<img>`.
- Auto-summary — first three sentences or meta description fallback.
- Auto proxy fallback — retries discovery and per-article fetches through a residential proxy when the site blocks datacenter IPs.
- Parallel fetching with configurable concurrency.
Input
| Field | Type | Description |
|---|---|---|
| `websiteUrl` | string | Root URL of the news site (required). |
| `hubUrl` | string | Optional section URL (e.g. `/technology`) to narrow discovery to that part of the site. |
| `maxArticles` | integer | Maximum articles to extract (1–500, default 20). |
| `keywordFilter` | string | Optional boolean filter. Examples below. |
| `minWordCount` | integer | Drop articles shorter than this (default 100). |
| `concurrency` | integer | Parallel article fetches (1–20, default 5). |
| `extractKeywords` | boolean | Compute the top 10 keywords per article (default `true`). |
| `extractSummary` | boolean | Emit a three-sentence summary per article (default `true`). |
| `language` | enum | `auto` (default) or one of 40 language codes — `en`, `de`, `fr`, `es`, `it`, `pt`, `ja`, `ko`, `zh`, `ar`, and more. |
| `autoProxyFallback` | boolean | When `true` (default), retry via residential proxy if discovery or article fetches are blocked. |
Example input
{
  "websiteUrl": "https://techcrunch.com",
  "hubUrl": "https://techcrunch.com/category/startups",
  "maxArticles": 50,
  "keywordFilter": "(AI OR machine learning) AND NOT crypto",
  "minWordCount": 200,
  "concurrency": 5,
  "language": "auto",
  "autoProxyFallback": true
}
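Before launching a run it can be handy to sanity-check an input like the one above against the documented ranges. A minimal sketch (client-side only; the actor enforces its own input schema, and `validate_input` is a made-up helper):

```python
def validate_input(run_input):
    """Check a run input dict against the documented constraints (sketch)."""
    errors = []
    if not run_input.get("websiteUrl"):
        errors.append("websiteUrl is required")
    if not 1 <= run_input.get("maxArticles", 20) <= 500:
        errors.append("maxArticles must be between 1 and 500")
    if not 1 <= run_input.get("concurrency", 5) <= 20:
        errors.append("concurrency must be between 1 and 20")
    return errors
```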
Output
Each dataset item is one article:
{
  "articleUrl": "https://techcrunch.com/2026/04/22/example-ai-startup-raises-50m/",
  "articleTitle": "Example AI Startup Raises $50M Series B",
  "articleAuthors": ["Jane Doe"],
  "articlePublishDate": "2026-04-22T14:03:00+00:00",
  "articleText": "Example AI, the San Francisco-based startup…",
  "articleWordCount": 612,
  "articleTopImage": "https://techcrunch.com/.../hero.jpg",
  "articleLanguage": "en-US",
  "articleKeywords": ["ai", "startup", "funding", "series", "b"],
  "articleSummary": "Example AI raised a $50M Series B led by…",
  "sourceDomain": "techcrunch.com",
  "scrapedAt": "2026-04-24T10:15:00+00:00"
}
Field descriptions
- `articleUrl` — canonical URL of the article.
- `articleTitle` — headline.
- `articleAuthors` — list of author names.
- `articlePublishDate` — ISO 8601 publish timestamp.
- `articleText` — full body text.
- `articleWordCount` — word count of the body.
- `articleTopImage` — hero image URL (absolute).
- `articleLanguage` — ISO language code (`en`, `en-US`, `de`, `ja`, …).
- `articleKeywords` — top-10 content words sorted by frequency.
- `articleSummary` — three-sentence summary or meta description fallback.
- `sourceDomain` — hostname of the origin site.
- `scrapedAt` — ISO 8601 timestamp of extraction.
Fields the site doesn't publish are omitted rather than emitted as null, keeping records compact.
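The `articleKeywords` and `articleSummary` fields can be approximated with standard-library tools. This is a toy English-only sketch; the actor's stopword lists cover 40 languages and its sentence splitting may differ:

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the real lists are far larger.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on", "that"}

def top_keywords(text, n=10):
    """Top-n content words by frequency after stopword filtering."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]

def three_sentence_summary(text):
    """First three sentences, split naively on sentence-ending punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:3])
```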
Use cases
- Media monitoring — track every new article from a publication, filter by keyword, export to your own search index.
- Content research — pull the last N articles from a niche industry outlet to feed a summarizer or LLM.
- Dataset construction — build a training corpus of clean article text + metadata from any news source.
- Competitive intelligence — watch a competitor's press-release feed and get notified when specific terms appear.
- PR analytics — monitor coverage of a brand, product, or executive across dozens of outlets.
FAQ
Does it work on any news site?
It works on most publications that expose a standard sitemap or have a reasonably conventional homepage. Paywalled sites may return login stubs, and heavily JavaScript-rendered single-page apps may require a browser-based actor.
How does the keyword filter work?
The expression is parsed as a boolean tree. AND, OR, NOT (case-insensitive) plus parentheses are supported. Adjacent terms without an explicit operator are treated as AND. Each leaf term matches as a case-insensitive substring against title + body. Examples:
- `AI AND startup` — both words must appear.
- `crypto OR bitcoin` — either word.
- `AI AND NOT crypto` — AI but not crypto.
- `(startup OR funding) AND SaaS` — SaaS articles that also mention startup or funding.
Why are some articles missing fields?
If the site does not publish a given piece of metadata (no JSON-LD, no og:image, no author meta), that field is omitted from the output. articleTitle and articleText are always present for successfully-extracted articles.
How many articles can one run return?
Up to 500 via maxArticles. The crawler looks at up to 50 candidate URLs from the sitemap/homepage before extraction, so very small sites may yield fewer than requested.
Does it follow pagination / deep archives?
The crawler walks sitemap index files (to two levels deep) and accepts every URL listed. For HTML fallback it scans the homepage only. To reach deeper archives, pass a hubUrl pointing at a category or tag page with dense article links.
What does autoProxyFallback do exactly?
It is enabled by default but stays out of the fast path until needed. When a direct HTTP fetch of the homepage, a sitemap, or an individual article returns a block or challenge response, the actor silently retries the request through a residential proxy session. No manual configuration is needed.
Do I need a proxy?
Usually no. Most news sites respond to direct HTTP fetches without issue. autoProxyFallback only kicks in when a site explicitly blocks the direct path, so you don't pay proxy costs on easy sites.
Known limitations
- Paywalled content — fully paywalled articles return login stubs rather than full body text. The actor cannot bypass paywalls.
- Heavy single-page apps — publications that render article bodies entirely in client-side JavaScript (rare for news, common for product sites) may return empty bodies. Use a browser-based scraper for those.
- Homepage-only fallback — when no sitemap is available, discovery is limited to links visible on the homepage. Deep archives won't surface without a `hubUrl`.
- Article detection is heuristic — a small number of link-heavy pages (index pages, tag pages) may occasionally be extracted as "articles" with short bodies. Use `minWordCount` to filter them out.
- Language auto-detection is based on the HTML `lang` attribute and meta tags. If the site does not declare a language, auto-detection defaults to `en`.