News & RSS Feed Scraper
Pricing
$6.99/month + usage
News & RSS Feed Scraper extracts structured article data from any RSS/Atom feed. It is suited to news aggregators, content analysis, and AI training pipelines.
Developer

Scrape Pilot
A powerful, production‑ready news and RSS scraper that extracts structured article data from popular RSS feeds (e.g., TechCrunch, The Verge, CNN) or any custom RSS/Atom feed. Designed to be used as an Apify Actor or as a standalone Node.js module, it delivers clean, consistent output – perfect for news aggregators, content analysis, or AI training pipelines.
👉 Focus keyword: news and rss scraper – built for speed, reliability, and ease of use.
✨ Features
- 📡 Preset feeds – Instantly scrape top tech news sites (TechCrunch, The Verge, Wired, etc.) with zero configuration.
- 🔧 Custom feeds – Provide your own RSS/Atom feed URL and scrape any news source.
- 📰 Full article fetching – Optionally fetch the complete article content (HTML or plain text) from the original URL.
- 🖼️ Image extraction – Automatically extracts the main image from each article.
- 🧹 Clean output – Consistent schema with fields: title, url, description, author, category, published, image.
- 🌐 Proxy support – Built‑in Apify proxy with residential, datacenter, or custom proxy configurations to avoid blocking.
- 📊 Result limits – Control the number of articles returned per run.
- ⚡ Fast & scalable – Built on top of Axios and rss-parser with concurrency control.
- 🔍 SEO friendly – Output is ready to be indexed or fed into downstream tools.
🚀 How It Works
- Provide input – Choose a preset feed or supply a custom RSS URL.
- Scraping – The actor fetches the RSS feed, parses entries, and extracts metadata.
- Optional full content – If enabled, it navigates to each article's URL and fetches the full HTML (or text) content.
- Proxy rotation – For large runs or geo‑restricted feeds, residential proxies help keep success rates high.
- Output – Returns a clean JSON array of articles ready for your application.
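The parsing and metadata steps above can be sketched as a single mapping function. This is a minimal sketch, not the actor's actual code: `normalizeItem` is a hypothetical helper, and the input field names (`link`, `contentSnippet`, `isoDate`, `creator`, `categories`, `enclosure`) assume the item shape that rss-parser usually produces.

```javascript
// Hypothetical helper: maps one parsed feed entry to the documented output schema.
// The input shape is an assumption modeled on typical rss-parser items.
function normalizeItem(item, feedUrl) {
  // Derive the source domain from the feed URL, dropping a leading "www.".
  const source = new URL(feedUrl).hostname.replace(/^www\./, '');
  return {
    type: 'rss',
    source,
    title: (item.title || '').trim(),
    url: item.link || '',
    description: (item.contentSnippet || '').trim(),
    published: item.isoDate || null, // null when the feed has no usable date
    author: item.creator || null,
    category: (item.categories || []).join(', '), // comma-separated string
    image: (item.enclosure && item.enclosure.url) || null,
  };
}

// Usage with a hand-written entry (no network access needed).
const example = normalizeItem(
  {
    title: ' Example headline ',
    link: 'https://example.com/post',
    contentSnippet: 'Short summary.',
    categories: ['AI', 'Startups'],
  },
  'https://www.example.com/feed/'
);
console.log(example);
```

Keeping the mapping in one pure function makes it easy to check against sample feed entries without running the whole actor.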
📥 Input Schema
The actor accepts the following input fields. All fields are optional and fall back to the defaults listed below.
| Field | Type | Default | Description |
|---|---|---|---|
| fetch_full_articles | Boolean | false | If true, the actor will fetch the full HTML content of each article from its original URL. |
| preset_feed | String | "techcrunch" | Choose from a list of pre‑configured feeds: "techcrunch", "theverge", "wired", "cnn", "bbc". If empty, you must provide a custom feed URL. |
| custom_feed_url | String | "" | Your own RSS/Atom feed URL. Overrides preset_feed if provided. |
| proxyConfiguration | Object | { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] } | Proxy settings. See Proxy Configuration for details. |
| max_results | Integer | 20 | Maximum number of articles to return (1–100). |
| full_content_type | String | "html" | When fetch_full_articles is true, choose "html" (raw HTML) or "text" (plain text). |
| timeout_secs | Integer | 30 | Timeout for each article fetch (in seconds). |
Example Input (JSON)

    {
      "fetch_full_articles": false,
      "preset_feed": "techcrunch",
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
      },
      "max_results": 20
    }
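How the defaults and the override rule in the table combine can be illustrated with a small resolver. This is a sketch under assumptions: `resolveInput` and the `feed` field it returns are hypothetical, and whether the actor clamps or rejects an out-of-range `max_results` is not documented, so clamping is a choice made here.

```javascript
// Defaults copied from the input schema table.
const DEFAULTS = {
  fetch_full_articles: false,
  preset_feed: 'techcrunch',
  custom_feed_url: '',
  proxyConfiguration: { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] },
  max_results: 20,
  full_content_type: 'html',
  timeout_secs: 30,
};

// Hypothetical helper: merges user input with the defaults and validates it.
function resolveInput(input = {}) {
  const merged = { ...DEFAULTS, ...input };

  if (!Number.isInteger(merged.max_results)) {
    throw new TypeError('max_results must be an integer');
  }
  // Assumption: out-of-range values are clamped to the documented 1-100 range.
  merged.max_results = Math.min(100, Math.max(1, merged.max_results));

  if (!['html', 'text'].includes(merged.full_content_type)) {
    throw new RangeError('full_content_type must be "html" or "text"');
  }

  // custom_feed_url overrides preset_feed when both are present.
  if (merged.custom_feed_url) {
    merged.feed = { kind: 'custom', value: merged.custom_feed_url };
  } else if (merged.preset_feed) {
    merged.feed = { kind: 'preset', value: merged.preset_feed };
  } else {
    throw new Error('Provide preset_feed or custom_feed_url');
  }
  return merged;
}

console.log(resolveInput({ max_results: 500 }).max_results); // 100
console.log(resolveInput({ custom_feed_url: 'https://example.com/rss' }).feed);
```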
📤 Output Format
The actor returns an array of objects, each representing a news article. Below is the schema:
| Field | Type | Description |
|---|---|---|
| type | String | Always "rss" for this actor. |
| source | String | Domain name of the feed source (e.g., "techcrunch.com"). |
| title | String | Article headline. |
| url | String | Direct link to the full article. |
| description | String | Excerpt or summary from the RSS feed. |
| published | String (ISO 8601) or null | Publication date and time, if available. |
| author | String or null | Author name(s). |
| category | String | Comma‑separated categories/tags. |
| image | String or null | URL of the main article image. |
| full_content | String or null | Only present if fetch_full_articles=true. Contains the full article HTML or text. |
Example Output (JSON)

    [
      {
        "type": "rss",
        "source": "techcrunch.com",
        "title": "‘Not built right the first time’ — Musk’s xAI is starting over again, again",
        "url": "https://techcrunch.com/2025/03/15/not-built-right-the-first-time-musks-xai-is-starting-over-again-again/",
        "description": "The AI lab is revamping its effort to build an AI coding tool, with two new executives joining from Cursor.",
        "published": null,
        "author": "Tim Fernholz",
        "category": "AI, cursor, Elon Musk",
        "image": null
      },
      {
        "type": "rss",
        "source": "techcrunch.com",
        "title": "Lawyer behind AI psychosis cases warns of mass casualty risks",
        "url": "https://techcrunch.com/2025/03/15/lawyer-behind-ai-psychosis-cases-warns-of-mass-casualty-risks/",
        "description": "AI chatbots have been linked to suicides for years. Now one lawyer says they are showing up in mass casualty cases too, and the technology is moving faster than the safeguards.",
        "published": null,
        "author": "Rebecca Bellan",
        "category": "AI, ai delusions, ai psychosis",
        "image": null
      }
      // ... more articles up to max_results
    ]
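Because `published` can be null and `category` is a comma-separated string, downstream code usually needs a little post-processing. The helpers below are a hypothetical sketch of that step; they are not part of the actor, and the sample data is made up.

```javascript
// Hypothetical helper: turns the comma-separated category string into an array.
function splitCategories(article) {
  if (!article.category) return [];
  return article.category.split(',').map((c) => c.trim()).filter(Boolean);
}

// Hypothetical helper: newest first, with undated (null) articles at the end.
function sortByPublished(articles) {
  const time = (a) => (a.published ? Date.parse(a.published) : -Infinity);
  return [...articles].sort((a, b) => {
    const ta = time(a);
    const tb = time(b);
    if (ta === tb) return 0;
    return tb > ta ? 1 : -1;
  });
}

const sample = [
  { title: 'undated', published: null, category: 'AI, cursor, Elon Musk' },
  { title: 'older', published: '2025-03-15T10:00:00Z', category: '' },
  { title: 'newer', published: '2025-03-16T10:00:00Z', category: 'AI' },
];

console.log(sortByPublished(sample).map((a) => a.title));
console.log(splitCategories(sample[0]));
```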
🛠️ Usage
▶️ Run on Apify Console
- Go to Apify Console and open the Actor page for News & RSS Feed Scraper.
- Click "Run".
- Fill in the input fields (or use the default).
- Click "Start" and wait for results.
🔌 Run via Apify API (cURL)

    curl -X POST "https://api.apify.com/v2/acts/your-username~news-rss-scraper/runs?token=<YOUR_API_TOKEN>" \
      -H "Content-Type: application/json" \
      -d '{"fetch_full_articles": false, "preset_feed": "techcrunch", "max_results": 10}'
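The same request can be issued from Node.js. The sketch below assumes Node 18 or later (for the global `fetch`) and reuses the placeholder actor ID from the cURL example; `buildRunRequest` and `startRun` are hypothetical names, and the response is returned as parsed JSON without assuming its shape.

```javascript
// Hypothetical helper: builds the URL and fetch options for the run endpoint.
function buildRunRequest(actorId, token, input) {
  const url = new URL(`https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/runs`);
  url.searchParams.set('token', token);
  return {
    url: url.toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    },
  };
}

// Hypothetical helper: sends the request and returns the parsed JSON response.
async function startRun(actorId, token, input) {
  const { url, options } = buildRunRequest(actorId, token, input);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Apify API responded with HTTP ${res.status}`);
  return res.json();
}

// Building the request has no side effects, so it is safe to inspect first.
const request = buildRunRequest('your-username~news-rss-scraper', '<YOUR_API_TOKEN>', {
  fetch_full_articles: false,
  preset_feed: 'techcrunch',
  max_results: 10,
});
console.log(request.options.method, request.options.body);

// To actually start a run (needs a real token):
// startRun('your-username~news-rss-scraper', process.env.APIFY_TOKEN, { max_results: 10 })
//   .then(console.log, console.error);
```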
📦 Use as a Node.js Module
Install the package:
    npm install news-rss-scraper
Then use it in your code:

    const { scrapeNews } = require('news-rss-scraper');

    (async () => {
      const results = await scrapeNews({
        fetch_full_articles: false,
        preset_feed: 'techcrunch',
        max_results: 5
      });
      console.log(results);
    })();
🌐 Proxy Configuration
To avoid IP‑based blocking, especially for high‑volume scraping, you can configure proxies. The actor integrates seamlessly with Apify Proxy.
| Property | Type | Description |
|---|---|---|
| useApifyProxy | Boolean | If true, enables Apify Proxy. Default: true. |
| apifyProxyGroups | Array | Proxy groups: ["RESIDENTIAL"], ["DATACENTER"], or ["SHADER"]. Residential is recommended for news sites. |
| proxyUrls | Array | Custom proxy URLs (e.g., ["http://user:pass@proxy.example.com:8080"]). Ignored if useApifyProxy is true. |
Example with custom proxies:

    {
      "proxyConfiguration": {
        "useApifyProxy": false,
        "proxyUrls": ["http://user:pass@123.45.67.89:8080"]
      }
    }
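The precedence between the three properties can be expressed in a few lines. This is a sketch, assuming the rules stated in the table: `resolveProxy` and the `mode` values it returns are hypothetical, and falling back to a direct connection when neither option is set is an assumption.

```javascript
// Hypothetical helper: decides which proxy settings take effect.
function resolveProxy(proxyConfiguration = {}) {
  const {
    useApifyProxy = true, // documented default
    apifyProxyGroups = [],
    proxyUrls = [],
  } = proxyConfiguration;

  // Apify Proxy wins: proxyUrls is ignored when useApifyProxy is true.
  if (useApifyProxy) {
    return { mode: 'apify', groups: apifyProxyGroups };
  }

  if (proxyUrls.length > 0) {
    // new URL() throws a TypeError on malformed proxy URLs.
    proxyUrls.forEach((u) => new URL(u));
    return { mode: 'custom', urls: proxyUrls };
  }

  // Assumption: with neither option set, requests go out directly.
  return { mode: 'none' };
}

console.log(resolveProxy({ useApifyProxy: false, proxyUrls: ['http://user:pass@123.45.67.89:8080'] }));
console.log(resolveProxy({ apifyProxyGroups: ['RESIDENTIAL'], proxyUrls: ['http://ignored.example.com:8080'] }));
```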
🧪 Advanced Options
| Option | Description |
|---|---|
| custom_feed_url | If you need a feed not in the preset list, provide its full URL here. |
| full_content_type | When fetch_full_articles=true, choose "html" (raw HTML) or "text" (plain text stripped of tags). |
| timeout_secs | Timeout for each individual article fetch. Increase if sites are slow. |
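To give a rough idea of what "plain text stripped of tags" means for `full_content_type: "text"`, here is a naive regex-based sketch. `htmlToText` is a hypothetical helper, not the actor's implementation; a real HTML parser handles far more edge cases than this does.

```javascript
// Hypothetical helper: a naive HTML-to-text conversion for illustration only.
function htmlToText(html) {
  return html
    // Drop script and style blocks together with their contents.
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Replace every remaining tag with a space.
    .replace(/<[^>]+>/g, ' ')
    // Decode a handful of common entities; &amp; goes last to avoid double decoding.
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&')
    // Collapse whitespace runs into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(htmlToText('<p>Hello <b>world</b></p><script>var x = 1;</script>'));
```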
❓ FAQ / Troubleshooting
Q: Why are some articles missing images?
A: Not all RSS feeds include image metadata. The actor tries to extract images from the <enclosure> tag or the <media:content> tag. If none exist, the image field will be null.
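The fallback order in this answer might look like the sketch below. `extractImage` is a hypothetical helper, the item shape is assumed (an `enclosure` object plus a `media:content` entry whose attributes sit under `$`, as xml2js-style parsers tend to produce), and skipping non-image enclosures is an added assumption.

```javascript
// Hypothetical helper: picks an image URL from <enclosure>, then <media:content>.
function extractImage(item) {
  const enclosure = item.enclosure;
  // Assumption: enclosures with a non-image MIME type (e.g. audio) are skipped.
  const isImage = enclosure && (!enclosure.type || enclosure.type.startsWith('image/'));
  if (isImage && enclosure.url) return enclosure.url;

  const media = item['media:content'];
  // Attributes may sit under "$" or directly on the object, depending on the parser.
  const mediaUrl = media && ((media.$ && media.$.url) || media.url);
  return mediaUrl || null;
}

console.log(extractImage({ enclosure: { url: 'https://example.com/a.jpg', type: 'image/jpeg' } }));
console.log(extractImage({ title: 'No image metadata' }));
```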
Q: How can I scrape a non‑English news site?
A: Simply provide its RSS feed URL in custom_feed_url. The actor works with any valid RSS/Atom feed regardless of language.
Q: I'm getting blocked / timeouts.
A: Enable residential proxies (apifyProxyGroups: ["RESIDENTIAL"]) and reduce max_results to stay under rate limits. You can also increase timeout_secs.
Q: Can I run this Actor for free?
A: On Apify, each run consumes platform credits. Check Apify pricing for details. A small number of runs may be covered by the free tier.