News & RSS Feed Scraper

Pricing

$6.99/month + usage

Developer

Scrape Pilot

Maintained by Community

A powerful, production‑ready news and RSS scraper that extracts structured article data from popular RSS feeds (e.g., TechCrunch, The Verge, CNN) or any custom RSS/Atom feed. Designed to be used as an Apify Actor or as a standalone Node.js module, it delivers clean, consistent output – perfect for news aggregators, content analysis, or AI training pipelines.

👉 Built for speed, reliability, and ease of use.


✨ Features

  • 📡 Preset feeds – Instantly scrape top tech news sites (TechCrunch, The Verge, Wired, etc.) with zero configuration.
  • 🔧 Custom feeds – Provide your own RSS/Atom feed URL and scrape any news source.
  • 📰 Full article fetching – Optionally fetch the complete article content (HTML or plain text) from the original URL.
  • 🖼️ Image extraction – Automatically extracts the main image from each article.
  • 🧹 Clean output – Consistent schema with fields: type, source, title, url, description, published, author, category, image (plus full_content when full article fetching is enabled).
  • 🌐 Proxy support – Built‑in Apify proxy with residential, datacenter, or custom proxy configurations to avoid blocking.
  • 📊 Result limits – Control the number of articles returned per run.
  • ⚡ Fast & scalable – Built on top of Axios and rss-parser with concurrency control.
  • 🔍 SEO friendly – Output is ready to be indexed or fed into downstream tools.

🚀 How It Works

  1. Provide input – Choose a preset feed or supply a custom RSS URL.
  2. Scraping – The actor fetches the RSS feed, parses entries, and extracts metadata.
  3. Optional full content – If enabled, it navigates to each article's URL and fetches the full HTML (or text) content.
  4. Proxy rotation – For large runs or geo‑restricted feeds, residential proxies help keep success rates high.
  5. Output – Returns a clean JSON array of articles ready for your application.

📥 Input Schema

The actor accepts the following input fields. All fields are optional and fall back to the defaults listed in the table.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| fetch_full_articles | Boolean | false | If true, the actor will fetch the full HTML content of each article from its original URL. |
| preset_feed | String | "techcrunch" | Choose from a list of pre‑configured feeds: "techcrunch", "theverge", "wired", "cnn", "bbc". If empty, you must provide a custom feed URL. |
| custom_feed_url | String | "" | Your own RSS/Atom feed URL. Overrides preset_feed if provided. |
| proxyConfiguration | Object | { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] } | Proxy settings. See Proxy Configuration for details. |
| max_results | Integer | 20 | Maximum number of articles to return (1–100). |
| full_content_type | String | "html" | When fetch_full_articles is true, choose "html" (raw HTML) or "text" (plain text). |
| timeout_secs | Integer | 30 | Timeout for each article fetch (in seconds). |

Example Input (JSON)

{
  "fetch_full_articles": false,
  "preset_feed": "techcrunch",
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  },
  "max_results": 20
}
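
The defaults and constraints in the input table can be sketched as a small resolver. This is only an illustration: resolveInput, DEFAULTS, PRESETS, and the returned feed field are hypothetical names, and the Actor's actual validation code is not shown in this README.

```javascript
// Hypothetical helper: applies the documented defaults and checks the documented limits.
const DEFAULTS = {
  fetch_full_articles: false,
  preset_feed: 'techcrunch',
  custom_feed_url: '',
  proxyConfiguration: { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] },
  max_results: 20,
  full_content_type: 'html',
  timeout_secs: 30,
};

// Preset names as listed in the input table.
const PRESETS = ['techcrunch', 'theverge', 'wired', 'cnn', 'bbc'];

function resolveInput(input = {}) {
  const merged = { ...DEFAULTS, ...input };

  if (!Number.isInteger(merged.max_results) || merged.max_results < 1 || merged.max_results > 100) {
    throw new RangeError('max_results must be an integer between 1 and 100');
  }
  if (!['html', 'text'].includes(merged.full_content_type)) {
    throw new TypeError('full_content_type must be "html" or "text"');
  }

  // custom_feed_url overrides preset_feed when it is provided.
  if (merged.custom_feed_url) {
    return { ...merged, feed: { kind: 'custom', value: merged.custom_feed_url } };
  }
  // With no custom URL, a known preset is required.
  if (!PRESETS.includes(merged.preset_feed)) {
    throw new TypeError('Provide custom_feed_url or one of: ' + PRESETS.join(', '));
  }
  return { ...merged, feed: { kind: 'preset', value: merged.preset_feed } };
}
```

Calling resolveInput({}) yields the defaults with the "techcrunch" preset selected, while passing both preset_feed and custom_feed_url selects the custom URL.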

📤 Output Format

The actor returns an array of objects, each representing a news article. Below is the schema:

| Field | Type | Description |
| --- | --- | --- |
| type | String | Always "rss" for this actor. |
| source | String | Domain name of the feed source (e.g., "techcrunch.com"). |
| title | String | Article headline. |
| url | String | Direct link to the full article. |
| description | String | Excerpt or summary from the RSS feed. |
| published | String (ISO 8601) or null | Publication date and time, if available. |
| author | String or null | Author name(s). |
| category | String | Comma‑separated categories/tags. |
| image | String or null | URL of the main article image. |
| full_content | String or null | Only present if fetch_full_articles=true. Contains the full article HTML or text. |

Example Output (JSON)

[
  {
    "type": "rss",
    "source": "techcrunch.com",
    "title": "‘Not built right the first time’ — Musk’s xAI is starting over again, again",
    "url": "https://techcrunch.com/2025/03/15/not-built-right-the-first-time-musks-xai-is-starting-over-again-again/",
    "description": "The AI lab is revamping its effort to build an AI coding tool, with two new executives joining from Cursor.",
    "published": null,
    "author": "Tim Fernholz",
    "category": "AI, cursor, Elon Musk",
    "image": null
  },
  {
    "type": "rss",
    "source": "techcrunch.com",
    "title": "Lawyer behind AI psychosis cases warns of mass casualty risks",
    "url": "https://techcrunch.com/2025/03/15/lawyer-behind-ai-psychosis-cases-warns-of-mass-casualty-risks/",
    "description": "AI chatbots have been linked to suicides for years. Now one lawyer says they are showing up in mass casualty cases too, and the technology is moving faster than the safeguards.",
    "published": null,
    "author": "Rebecca Bellan",
    "category": "AI, ai delusions, ai psychosis",
    "image": null
  }
]

The array continues in the same shape, up to max_results articles.
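
To make the schema concrete, here is a minimal sketch of mapping one parsed feed entry onto these fields. It assumes the entry is shaped like the items rss-parser returns (title, link, contentSnippet, isoDate, creator, categories, enclosure). The function name normalizeItem and the "www." stripping are assumptions for illustration, not the Actor's confirmed logic.

```javascript
// Hypothetical mapper from a parsed feed item to the documented output schema.
function normalizeItem(item, feedUrl) {
  // Keep only plain-string categories; some feeds yield objects here.
  const categories = Array.isArray(item.categories)
    ? item.categories.filter((c) => typeof c === 'string')
    : [];

  return {
    type: 'rss',
    // Assumption: the source is the feed's hostname without a leading "www.".
    source: new URL(feedUrl).hostname.replace(/^www\./, ''),
    title: item.title ?? '',
    url: item.link ?? '',
    description: item.contentSnippet ?? '',
    published: item.isoDate ?? null,
    author: item.creator ?? null,
    category: categories.join(', '),
    image: item.enclosure?.url ?? null,
  };
}
```

Fields that the feed does not supply fall back to null (published, author, image) or an empty string, which matches the nullable types in the table.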

🛠️ Usage

▶️ Run on Apify Console

  1. Go to Apify Console and open the Actor page for News & RSS Feed Scraper.
  2. Click "Run".
  3. Fill in the input fields (or use the default).
  4. Click "Start" and wait for results.

🔌 Run via Apify API (cURL)

curl -X POST "https://api.apify.com/v2/acts/your-username~news-rss-scraper/runs?token=<YOUR_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "fetch_full_articles": false,
    "preset_feed": "techcrunch",
    "max_results": 10
  }'
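
The same request can be assembled in Node.js. The sketch below mirrors the cURL call and assumes Node 18 or later for the global fetch. buildRunRequest and startRun are hypothetical helper names, and the actor ID is the placeholder from the command above.

```javascript
// Hypothetical helper: builds the URL and fetch options for starting a run.
function buildRunRequest(actorId, token, input) {
  const url = new URL(`https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/runs`);
  url.searchParams.set('token', token);
  return {
    url: url.toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    },
  };
}

// Sends the request; not invoked here because it needs a real token.
async function startRun(actorId, token, input) {
  const { url, options } = buildRunRequest(actorId, token, input);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Run request failed with status ${res.status}`);
  return res.json();
}
```

Keeping the request construction separate from the network call makes it easy to inspect or log the payload before sending it.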

📦 Use as a Node.js Module

Install the package:

npm install news-rss-scraper

Then use it in your code:

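// Load the scraper from the package installed above.
const { scrapeNews } = require('news-rss-scraper');

(async () => {
  // Request up to five TechCrunch items without fetching full article bodies.
  const results = await scrapeNews({
    fetch_full_articles: false,
    preset_feed: 'techcrunch',
    max_results: 5,
  });
  console.log(results);
})().catch((err) => {
  // Report failures instead of leaving an unhandled rejection.
  console.error(err);
  process.exitCode = 1;
});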

🌐 Proxy Configuration

To avoid IP‑based blocking, especially for high‑volume scraping, you can configure proxies. The actor integrates seamlessly with Apify Proxy.

| Property | Type | Description |
| --- | --- | --- |
| useApifyProxy | Boolean | If true, enables Apify Proxy. Default: true. |
| apifyProxyGroups | Array | Proxy groups: ["RESIDENTIAL"], ["DATACENTER"], or ["SHADER"]. Residential is recommended for news sites. |
| proxyUrls | Array | Custom proxy URLs (e.g., ["http://user:pass@proxy.example.com:8080"]). Ignored if useApifyProxy is true. |

Example with custom proxies:

{
  "proxyConfiguration": {
    "useApifyProxy": false,
    "proxyUrls": ["http://user:pass@123.45.67.89:8080"]
  }
}
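
The precedence between these properties can be sketched as follows. describeProxy is a hypothetical helper, and the "none" mode (a direct connection when neither option is set) is an assumption, since the README does not say what happens in that case.

```javascript
// Hypothetical helper: decides which proxy settings take effect.
function describeProxy(config = {}) {
  const { useApifyProxy = true, apifyProxyGroups = [], proxyUrls = [] } = config;

  if (useApifyProxy) {
    // proxyUrls is ignored whenever Apify Proxy is enabled.
    return { mode: 'apify', groups: apifyProxyGroups };
  }
  if (proxyUrls.length > 0) {
    // new URL() throws a TypeError for malformed proxy URLs.
    proxyUrls.forEach((u) => new URL(u));
    return { mode: 'custom', urls: proxyUrls };
  }
  // Assumption: with no proxy configured, requests go out directly.
  return { mode: 'none' };
}
```

For the custom-proxy example above, describeProxy returns mode "custom" with the single URL, and adding "useApifyProxy": true to the same object would switch it back to mode "apify".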

🧪 Advanced Options

| Option | Description |
| --- | --- |
| custom_feed_url | If you need a feed not in the preset list, provide its full URL here. |
| full_content_type | When fetch_full_articles=true, choose "html" (raw HTML) or "text" (plain text stripped of tags). |
| timeout_secs | Timeout for each individual article fetch. Increase if sites are slow. |
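
As a rough picture of what "plain text stripped of tags" means, the sketch below removes script and style blocks, drops the remaining tags, and decodes a few common entities. htmlToText is a hypothetical name and this regex approach is an assumption; the Actor may use a proper HTML parser instead.

```javascript
// Hypothetical tag stripper; a simplification, not a full HTML parser.
function htmlToText(html) {
  return html
    // Remove script and style blocks together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ')
    // Replace every remaining tag with a space so adjacent words stay separated.
    .replace(/<[^>]+>/g, ' ')
    // Decode a handful of common entities; &amp; goes last to avoid double decoding.
    .replace(/&nbsp;/g, ' ')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&')
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, ' ')
    .trim();
}
```

Because tags are replaced with spaces, punctuation that directly follows an inline element ends up preceded by a space, which is one reason a real parser is usually preferable for production use.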

❓ FAQ / Troubleshooting

Q: Why are some articles missing images?

A: Not all RSS feeds include image metadata. The actor tries to extract images from the <enclosure> tag or the <media:content> tag. If none exist, the image field will be null.
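
A minimal sketch of that lookup, working on the raw XML of a single feed item, might look like this. extractImage and attr are hypothetical helpers, the regex matching is a simplification of real XML parsing, and skipping enclosures with a non-image type is an assumption rather than documented behavior.

```javascript
// Hypothetical helper: reads one attribute value from a tag string.
function attr(tag, name) {
  const m = tag.match(new RegExp(`\\b${name}\\s*=\\s*(["'])(.*?)\\1`, 'i'));
  // Decode &amp; so query strings in URLs come out intact.
  return m ? m[2].replace(/&amp;/g, '&') : null;
}

// Hypothetical helper: returns the image URL for one <item>, or null.
function extractImage(itemXml) {
  const enclosure = itemXml.match(/<enclosure\b[^>]*>/i);
  if (enclosure) {
    const type = attr(enclosure[0], 'type');
    const url = attr(enclosure[0], 'url');
    // Assumption: ignore enclosures that declare a non-image type (e.g. podcasts).
    if (url && (!type || type.startsWith('image/'))) return url;
  }

  const media = itemXml.match(/<media:content\b[^>]*>/i);
  if (media) {
    const url = attr(media[0], 'url');
    if (url) return url;
  }

  // Neither tag supplied a usable URL, so the image field stays null.
  return null;
}
```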

Q: How can I scrape a non‑English news site?

A: Simply provide its RSS feed URL in custom_feed_url. The actor works with any valid RSS/Atom feed regardless of language.

Q: I'm getting blocked / timeouts.

A: Enable residential proxies (apifyProxyGroups: ["RESIDENTIAL"]) and reduce max_results to stay under rate limits. You can also increase timeout_secs.

Q: Can I run this Actor for free?

A: On Apify, each run consumes platform credits. Check Apify pricing for details. A small number of runs may be covered by the free tier.