News Source Crawler
Pricing
from $1.00 / 1,000 results
Given a news website URL, discover and extract articles with full metadata: title, authors, publish date, body text, top image, keywords, and summary. Works with any news site via sitemap or HTML discovery.
Developer: Crawler Bros
News Source Crawler — Extract Articles From Any News Website
Point this actor at any news website and it returns a clean, structured record for every article it can find: title, authors, publish date, full body text, top image, language, keywords, and a short summary.
What this actor does
Give the actor a single root URL — a publication's homepage, a category page, or an internal newsroom — and it discovers the site's article URLs and extracts rich metadata for each one. Article discovery uses the site's sitemaps first (/sitemap.xml, /sitemap-news.xml, /post-sitemap.xml, plus any sitemaps declared in robots.txt) and falls back to parsing the homepage for article-shaped links when sitemaps are missing. You can also pass a hub URL (e.g. a /technology or /markets section page) and the crawler will scope discovery to just that section.
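The sitemap side of this discovery step can be sketched in a few lines. This is an illustrative stand-alone example, not the actor's actual code; `sitemap_article_urls` and the sample document are made up for the sketch:

```python
import xml.etree.ElementTree as ET

def sitemap_article_urls(sitemap_xml, hub_prefix=None):
    """Pull <loc> URLs out of a sitemap document, optionally scoped to a hub prefix."""
    root = ET.fromstring(sitemap_xml)
    # Sitemaps live in the sitemaps.org namespace; strip it before matching <loc>.
    urls = [el.text.strip() for el in root.iter()
            if el.tag.split("}")[-1] == "loc" and el.text]
    if hub_prefix:
        urls = [u for u in urls if u.startswith(hub_prefix)]
    return urls

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/technology/ai-story</loc></url>
  <url><loc>https://example.com/sports/match-report</loc></url>
</urlset>"""

scoped = sitemap_article_urls(sample, "https://example.com/technology")
# scoped → ["https://example.com/technology/ai-story"]
```

The same prefix check is what makes the hub-URL scoping cheap: candidate URLs that fall outside the section are dropped before any article is fetched.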
For every article it fetches, the actor extracts metadata using a cascade of strategies — JSON-LD structured data first (the gold standard, used by most modern publications), then Open Graph / article:* meta tags, then a paragraph-concatenation fallback over the <article> or <main> element. Optional extras include a top-10 keyword list (stopword-filtered), a three-sentence auto-summary, and a boolean keyword filter with full AND / OR / NOT / parentheses support so you only keep articles that match your brief.
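The cascade idea can be illustrated with a simplified title extractor. This is regex-based for brevity and only a sketch of the strategy order (JSON-LD → Open Graph → plain fallback); a real implementation would use a proper HTML parser, and `extract_title` is a hypothetical helper name:

```python
import json
import re

def extract_title(html):
    """Cascade: JSON-LD headline -> og:title -> <title> tag (simplified sketch)."""
    # 1. JSON-LD structured data.
    for m in re.finditer(r'<script[^>]*application/ld\+json[^>]*>(.*?)</script>', html, re.S):
        try:
            data = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("headline"):
            return data["headline"]
    # 2. Open Graph meta tag.
    m = re.search(r'<meta[^>]*property="og:title"[^>]*content="([^"]*)"', html)
    if m:
        return m.group(1)
    # 3. Plain <title> fallback.
    m = re.search(r'<title>(.*?)</title>', html, re.S)
    return m.group(1).strip() if m else None
```

Each strategy only runs when the richer one above it yields nothing, which is why well-marked-up publications get clean metadata while bare-bones sites still return something usable.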
The actor defaults to a fast HTTP-only path with 40 supported languages for stopword-aware keyword extraction. When a publication blocks datacenter IPs, the built-in auto proxy fallback silently retries discovery and per-article fetches through a residential proxy — so a single run can transparently switch from fast direct fetches to proxy-resilient fetches on the hard sites without any manual tuning.
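The retry-on-block pattern behind the proxy fallback looks roughly like this. The status codes and function names here are illustrative assumptions, not the actor's internals:

```python
# Responses treated as "blocked" for the purpose of this sketch.
BLOCK_STATUSES = {403, 407, 429, 503}

def fetch_with_fallback(url, direct_fetch, proxy_fetch):
    """Try the fast direct path first; retry through a residential proxy
    only when the response looks like a block or challenge."""
    status, body = direct_fetch(url)
    if status in BLOCK_STATUSES:
        status, body = proxy_fetch(url)
    return status, body
```

Because the proxy path is only taken on a block response, easy sites stay on cheap direct fetches for the whole run.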
Key features
- Sitemap-first article discovery — `/sitemap.xml`, `/sitemap-news.xml`, `/post-sitemap.xml`, and `Sitemap:` entries in `robots.txt`.
- HTML fallback when sitemaps are missing — scans the homepage for article-shaped links.
- Hub URL filter — point at a section page (e.g. `/tech`) to scope discovery to that section only.
- Cascade metadata extraction — JSON-LD → Open Graph / `article:*` → `<article>` body fallback.
- 40-language support for keyword extraction and stopword filtering.
- Boolean keyword filter with `AND`, `OR`, `NOT`, and parentheses.
- Minimum word count filter — drop thin content automatically.
- Top-image extraction from `og:image` / JSON-LD / first `<img>`.
- Auto-summary — first three sentences or meta description fallback.
- Auto proxy fallback — retries discovery and per-article fetches through a residential proxy when the site blocks datacenter IPs.
- Parallel fetching with configurable concurrency.
Input
| Field | Type | Description |
|---|---|---|
| `websiteUrl` | string | Root URL of the news site (required). |
| `hubUrl` | string | Optional section URL (e.g. `/technology`) to narrow discovery to that part of the site. |
| `maxArticles` | integer | Maximum articles to extract (1–500, default 20). |
| `keywordFilter` | string | Optional boolean filter. Examples below. |
| `minWordCount` | integer | Drop articles shorter than this (default 100). |
| `concurrency` | integer | Parallel article fetches (1–20, default 5). |
| `extractKeywords` | boolean | Compute the top 10 keywords per article (default `true`). |
| `extractSummary` | boolean | Emit a three-sentence summary per article (default `true`). |
| `language` | enum | `auto` (default) or one of 40 language codes — `en`, `de`, `fr`, `es`, `it`, `pt`, `ja`, `ko`, `zh`, `ar`, and more. |
| `autoProxyFallback` | boolean | When `true` (default), retry via residential proxy if discovery or article fetches are blocked. |
Example input
{
  "websiteUrl": "https://techcrunch.com",
  "hubUrl": "https://techcrunch.com/category/startups",
  "maxArticles": 50,
  "keywordFilter": "(AI OR machine learning) AND NOT crypto",
  "minWordCount": 200,
  "concurrency": 5,
  "language": "auto",
  "autoProxyFallback": true
}
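Before launching a run it can be handy to sanity-check an input like the one above against the documented ranges. A minimal sketch (client-side only; the actor enforces its own input schema, and `validate_input` is a made-up helper):

```python
def validate_input(run_input):
    """Check a run input dict against the documented constraints (sketch)."""
    errors = []
    if not run_input.get("websiteUrl"):
        errors.append("websiteUrl is required")
    if not 1 <= run_input.get("maxArticles", 20) <= 500:
        errors.append("maxArticles must be between 1 and 500")
    if not 1 <= run_input.get("concurrency", 5) <= 20:
        errors.append("concurrency must be between 1 and 20")
    return errors
```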
Output
Each dataset item is one article:
{
  "articleUrl": "https://techcrunch.com/2026/04/22/example-ai-startup-raises-50m/",
  "articleTitle": "Example AI Startup Raises $50M Series B",
  "articleAuthors": ["Jane Doe"],
  "articlePublishDate": "2026-04-22T14:03:00+00:00",
  "articleText": "Example AI, the San Francisco-based startup…",
  "articleWordCount": 612,
  "articleTopImage": "https://techcrunch.com/.../hero.jpg",
  "articleLanguage": "en-US",
  "articleKeywords": ["ai", "startup", "funding", "series", "b"],
  "articleSummary": "Example AI raised a $50M Series B led by…",
  "sourceDomain": "techcrunch.com",
  "scrapedAt": "2026-04-24T10:15:00+00:00"
}
Field descriptions
- `articleUrl` — canonical URL of the article.
- `articleTitle` — headline.
- `articleAuthors` — list of author names.
- `articlePublishDate` — ISO 8601 publish timestamp.
- `articleText` — full body text.
- `articleWordCount` — word count of the body.
- `articleTopImage` — hero image URL (absolute).
- `articleLanguage` — ISO language code (`en`, `en-US`, `de`, `ja`, …).
- `articleKeywords` — top-10 content words sorted by frequency.
- `articleSummary` — three-sentence summary or meta description fallback.
- `sourceDomain` — hostname of the origin site.
- `scrapedAt` — ISO 8601 timestamp of extraction.
Fields the site doesn't publish are omitted rather than emitted as null, keeping records compact.
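The `articleKeywords` and `articleSummary` fields can be approximated with standard-library tools. This is a toy English-only sketch; the actor's stopword lists cover 40 languages and its sentence splitting may differ:

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the real lists are far larger.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for", "on", "that"}

def top_keywords(text, n=10):
    """Top-n content words by frequency after stopword filtering."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]

def three_sentence_summary(text):
    """First three sentences, split naively on sentence-ending punctuation."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:3])
```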
Use cases
- Media monitoring — track every new article from a publication, filter by keyword, export to your own search index.
- Content research — pull the last N articles from a niche industry outlet to feed a summarizer or LLM.
- Dataset construction — build a training corpus of clean article text + metadata from any news source.
- Competitive intelligence — watch a competitor's press-release feed and get notified when specific terms appear.
- PR analytics — monitor coverage of a brand, product, or executive across dozens of outlets.
FAQ
Does it work on any news site?
It works on most publications that expose a standard sitemap or have a reasonably conventional homepage. Paywalled sites may return login stubs, and heavily JavaScript-rendered single-page apps may require a browser-based actor.
How does the keyword filter work?
The expression is parsed as a boolean tree. AND, OR, NOT (case-insensitive) plus parentheses are supported. Adjacent terms without an explicit operator are treated as AND. Each leaf term matches as a case-insensitive substring against title + body. Examples:
- `AI AND startup` — both words must appear.
- `crypto OR bitcoin` — either word.
- `AI AND NOT crypto` — AI but not crypto.
- `(startup OR funding) AND SaaS` — SaaS articles that also mention startup or funding.
Why are some articles missing fields?
If the site does not publish a given piece of metadata (no JSON-LD, no og:image, no author meta), that field is omitted from the output. articleTitle and articleText are always present for successfully-extracted articles.
How many articles can one run return?
Up to 500 via maxArticles. The crawler looks at up to 50 candidate URLs from the sitemap/homepage before extraction, so very small sites may yield fewer than requested.
Does it follow pagination / deep archives?
The crawler walks sitemap index files (to two levels deep) and accepts every URL listed. For HTML fallback it scans the homepage only. To reach deeper archives, pass a hubUrl pointing at a category or tag page with dense article links.
What does autoProxyFallback do exactly?
It is enabled by default but stays out of the fast path until needed. When a direct HTTP fetch of the homepage, a sitemap, or an individual article returns a block or challenge response, the actor silently retries the request through a residential proxy session. No manual configuration is needed.
Do I need a proxy?
Usually no. Most news sites respond to direct HTTP fetches without issue. autoProxyFallback only kicks in when a site explicitly blocks the direct path, so you don't pay proxy costs on easy sites.
Known limitations
- Paywalled content — fully paywalled articles return login stubs rather than full body text. The actor cannot bypass paywalls.
- Heavy single-page apps — publications that render article bodies entirely in client-side JavaScript (rare for news, common for product sites) may return empty bodies. Use a browser-based scraper for those.
- Homepage-only fallback — when no sitemap is available, discovery is limited to links visible on the homepage. Deep archives won't surface without a `hubUrl`.
- Article detection is heuristic — a small number of link-heavy pages (index pages, tag pages) may occasionally be extracted as "articles" with short bodies. Use `minWordCount` to filter them out.
- Language auto-detection is based on the HTML `lang` attribute and meta tags. If the site does not declare a language, auto-detection defaults to `en`.