RSS Feed Scraper — Atom, Podcast & Multi-Feed
Pricing
Pay per event
Parse and convert any RSS or Atom feed to a clean dataset — title, link, author, published date, summary, full HTML content, tags, GUID — export to JSON or CSV. A drop-in RSS feed parser for RSS 2.0, Atom 1.0, and the content:encoded / dc:creator extensions.
Developer
DevilScrapes
Maintained by Community
🎯 What this scrapes
RSS and Atom are still the most reliable way to subscribe to a publication. This Actor parses any feed URL — news site, blog, podcast, GitHub release feed, Reddit, Substack, Medium per-user — and writes one row per item. Output is normalised across RSS and Atom dialects so downstream code never needs to care which format it received.
Feed sources that work out of the box:
- News publishers (New York Times, BBC, Reuters — any site that vends an RSS endpoint)
- Podcast directories and individual show feeds (`<enclosure>` tags parsed automatically)
- GitHub release and commit feeds
- Reddit `.rss` and `.json` community feeds
- Substack and Medium per-author feeds
- Google Alerts export feeds
- Any custom-built Atom or RSS 2.0 / RSS 1.0 feed
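The dialect normalisation described above can be illustrated with a minimal standard-library sketch. It is not the Actor's implementation (the Actor relies on `feedparser`); the function name `normalise`, the helper names, and the sample feeds are assumptions made for illustration, and only a subset of the output fields is mapped.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DC = "{http://purl.org/dc/elements/1.1/}"

RSS_SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><title>Example</title>
<item><guid>a-1</guid><title>Hello</title>
<link>https://example.com/a-1</link><dc:creator>Ann</dc:creator></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title>
<entry><id>urn:b-1</id><title>Hi</title>
<link href="https://example.com/b-1"/><author><name>Bob</name></author></entry>
</feed>"""


def _text(node, path):
    """Return the stripped text of the first matching child, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip() or None


def _atom_link(entry):
    """Pick the entry's permalink: a link with rel="alternate" or no rel."""
    for link in entry.findall(f"{ATOM}link"):
        if link.get("rel", "alternate") == "alternate":
            return link.get("href")
    return None


def normalise(xml_text):
    """Map RSS 2.0 <item> and Atom <entry> elements onto one row shape."""
    root = ET.fromstring(xml_text)
    rows = []
    if root.tag == f"{ATOM}feed":
        for entry in root.findall(f"{ATOM}entry"):
            rows.append({
                "feed_format": "atom",
                "item_id": _text(entry, f"{ATOM}id"),
                "title": _text(entry, f"{ATOM}title"),
                "link": _atom_link(entry),
                "author": _text(entry, f"{ATOM}author/{ATOM}name"),
            })
    else:
        for item in root.iter("item"):
            rows.append({
                "feed_format": "rss",
                "item_id": _text(item, "guid"),
                "title": _text(item, "title"),
                "link": _text(item, "link"),
                "author": _text(item, f"{DC}creator"),
            })
    return rows


for row in normalise(RSS_SAMPLE) + normalise(ATOM_SAMPLE):
    print(row)
```

Both samples produce rows with the same keys, which is the property that lets downstream code ignore the source format.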
🔥 Features
- 🛡️ Browser fingerprint rotation: `curl-cffi` impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not a Python script.
- 🌐 Residential proxy rotation via Apify Proxy: fresh session and exit IP on every block or 429 response.
- 🔁 Retries with exponential backoff on `408 / 429 / 5xx`: up to 5 attempts per feed, `Retry-After` header honoured.
- 🧱 Rate-limit-aware pacing: when a feed host pushes back, we slow down and surface exactly what was collected before the limit hit.
- 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable IDs, exportable as JSON / CSV / Excel straight from Apify Console.
- 💰 Pay-Per-Event pricing: you pay only for results that land in your dataset. No data, no charge (beyond the small actor-start fee).
- 📡 Multi-feed batching: pass a list of URLs; the Actor fetches and normalises them all in one run, deduplicating by GUID.
- 📝 Full HTML content: when a feed publishes `content:encoded` or Atom `content`, we capture the full body alongside the summary, not just a truncated snippet.
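The retry behaviour listed above (exponential backoff on 408 / 429 / 5xx, up to 5 attempts, `Retry-After` honoured) could look roughly like the sketch below. The function names, the base delay, the 60-second cap, and the 10% jitter are all assumptions; the Actor's actual schedule is not documented here. The `fetch` callable is injected so the logic can be exercised without a network.

```python
import random
import time

RETRYABLE = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5


def parse_retry_after(value):
    """Interpret Retry-After as delta-seconds; HTTP-date values are ignored here."""
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return None


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    server_hint = parse_retry_after(retry_after)
    if server_hint is not None:
        # A usable server hint takes priority over our own schedule.
        return min(server_hint, cap)
    # base, 2*base, 4*base, ... capped, plus up to 10% random jitter.
    delay = min(base * 2 ** (attempt - 1), cap)
    return delay + random.uniform(0, delay * 0.1)


def fetch_with_retries(fetch, url, sleep=time.sleep):
    """Call fetch(url) -> (status, headers, body) until success or attempts run out."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status, headers, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    # All attempts were retryable failures; return the last one to the caller.
    return status, body
```

Passing `sleep` as a parameter keeps the function testable: a test can record the requested delays instead of actually waiting.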
💡 Use cases
- News aggregation dashboard: pull 20 publications into one stream and pipe to Slack, Discord, or a webhook.
- Brand monitoring: track every Google Alerts RSS feed for your company name, product, or competitors.
- Content automation: feed company-blog RSS into a translation pipeline, summary LLM, or newsletter tool.
- Podcast RSS parser: podcast RSS is standard RSS with `<enclosure>` tags; this Actor surfaces the episode link, title, author, and published date for every episode in the feed.
- LLM-ready news digest: pass structured rows straight to an LLM pipeline; ISO-8601 timestamps and clean HTML make chunking predictable.
- RSS-to-Google-Sheets / Notion / Airtable: export via Apify's native integration or the API; no glue code required.
- Feed archival: schedule the Actor daily to build a rolling archive of feeds that don't publish full history.
⚙️ How to use it
- Click Try for free at the top of the page.
- Paste one or more RSS / Atom feed URLs into the `feedUrls` field, one per line.
- Optionally set `maxItemsPerFeed` and toggle `includeContent` for full HTML.
- Click Start. Output streams into the run's dataset in real time.
- Export from Storage → Dataset as JSON, CSV, or Excel — or call the Apify API from your own code.
For scheduled runs, use Apify Schedules (cron syntax) so the Actor refreshes your dataset on your preferred cadence.
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `feedUrls` | array | yes | `["https://news.ycombinator.com/rss"]` | List of RSS / Atom feed URLs. One URL per item. Each URL produces one or more dataset rows. |
| `maxItemsPerFeed` | integer | no | `50` | Cap on items pulled from a single feed. Set to `0` for no limit. |
| `includeContent` | boolean | no | `true` | When true, includes the full HTML body (`content:encoded` / Atom `content`). When false, summary only. |
| `userAgent` | string | no | `"DevilScrapesBot/1.0 (+https://apify.com/DevilScrapes)"` | Custom User-Agent string. Default identifies as Devil Scrapes RSS reader. |
| `proxyConfiguration` | object | no | `{"useApifyProxy": false}` | Some publishers rate-limit scrapers. Apify Proxy provides sticky sessions and IP rotation when needed. |
Example input
{"feedUrls": ["https://news.ycombinator.com/rss","https://feeds.arstechnica.com/arstechnica/index"],"maxItemsPerFeed": 25,"includeContent": true,"proxyConfiguration": {"useApifyProxy": false}}
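To submit the example input from your own code, one option is Apify's synchronous run endpoint, which starts a run and returns the dataset items in the response. The sketch below only builds the request with the standard library; the Actor ID `USERNAME~ACTOR-NAME` and the token value are placeholders, since the real ID is shown on the Actor's API tab and is not stated in this document.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


run_input = {
    "feedUrls": [
        "https://news.ycombinator.com/rss",
        "https://feeds.arstechnica.com/arstechnica/index",
    ],
    "maxItemsPerFeed": 25,
    "includeContent": True,
    "proxyConfiguration": {"useApifyProxy": False},
}

# Placeholder Actor ID and token: replace both before sending.
request = build_run_request("USERNAME~ACTOR-NAME", "YOUR_APIFY_TOKEN", run_input)
print(request.get_method(), request.full_url)

# To actually send it (requires network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```

For long runs an asynchronous start plus a later dataset fetch may be a better fit than the synchronous endpoint, which holds the HTTP connection open until the run finishes.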
📤 Output
Every row is one feed item. All fields follow Pydantic validation — no nulls where a value existed, no phantom fields.
| Field | Type | Notes |
|---|---|---|
| `feed_url` | `string` | Source feed URL passed in `feedUrls`. |
| `feed_title` | `string \| null` | Feed channel title. |
| `feed_format` | `string` | `"rss"` or `"atom"`. |
| `item_id` | `string \| null` | Item GUID (RSS) or `id` (Atom). Used for deduplication. |
| `title` | `string` | Item headline. |
| `link` | `string` | Item permalink URL. |
| `author` | `string \| null` | Author from `dc:creator` or Atom `author/name`. |
| `summary` | `string \| null` | Short description / `atom:summary`. |
| `content_html` | `string \| null` | Full HTML body when the feed includes it. |
| `categories` | `array` | Item tags / categories (empty array if none). |
| `published` | `string \| null` | Publish timestamp in ISO-8601 format. |
| `updated` | `string \| null` | Updated timestamp in ISO-8601 format. |
| `scraped_at` | `string` | ISO-8601 timestamp for when this row was recorded. |
Example output
{"feed_url": "https://news.ycombinator.com/rss","feed_title": "Hacker News","feed_format": "rss","item_id": "https://news.ycombinator.com/item?id=48000000","title": "Show HN: Building a hosted RSS parser for the post-LLM web","link": "https://news.ycombinator.com/item?id=48000000","author": null,"summary": "A discussion about ...","content_html": null,"categories": [],"published": "2026-05-15T20:00:00+00:00","updated": null,"scraped_at": "2026-06-01T09:00:00+00:00"}
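Because `published` and `scraped_at` are ISO-8601 strings with an explicit offset, downstream code can parse them with the standard library and compare them directly. The sketch below filters rows by publish time; the function name `rows_published_since` is an assumption, and rows with a null `published` are skipped because the schema allows that field to be missing.

```python
from datetime import datetime, timedelta


def rows_published_since(rows, cutoff):
    """Keep rows whose `published` timestamp is at or after `cutoff` (timezone-aware)."""
    recent = []
    for row in rows:
        published = row.get("published")
        if published is None:
            continue  # the schema allows null here, so skip rather than guess
        if datetime.fromisoformat(published) >= cutoff:
            recent.append(row)
    return recent


# A trimmed copy of the example output row above.
example_row = {
    "title": "Show HN: Building a hosted RSS parser for the post-LLM web",
    "published": "2026-05-15T20:00:00+00:00",
    "scraped_at": "2026-06-01T09:00:00+00:00",
}

scraped_at = datetime.fromisoformat(example_row["scraped_at"])
last_30_days = rows_published_since([example_row], scraped_at - timedelta(days=30))
last_7_days = rows_published_since([example_row], scraped_at - timedelta(days=7))
print(len(last_30_days), len(last_7_days))
```

The cutoff must carry a timezone, since Python refuses to compare aware and naive datetimes.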
💰 Pricing
Pay-Per-Event — you pay only when these events fire:
| Event | USD | What it is |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.001 | Per dataset item written |
Example: 1 000 items at the rates above ≈ $1.00.
No subscription, no monthly minimum, no card to start. Apify gives every new account $5 of free credit, which covers roughly 5 000 rows once the per-run actor-start fee is counted.
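The arithmetic behind these figures can be checked with a short sketch that uses the two event prices from the table. The function names are illustrative, and the calculation assumes no other platform charges apply.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.001")   # charged per dataset item written


def estimate_cost(items, runs=1):
    """Total USD for `items` rows spread over `runs` runs."""
    return ACTOR_START * runs + PER_RESULT * items


def rows_within_budget(budget, runs=1):
    """How many rows a USD budget covers after paying the start fee for each run."""
    remaining = Decimal(str(budget)) - ACTOR_START * runs
    if remaining <= 0:
        return 0
    return int(remaining // PER_RESULT)


print(estimate_cost(1000))    # one run, 1 000 rows
print(rows_within_budget(5))  # rows covered by $5 in a single run
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly. With a single run, $5 lands slightly under 5 000 rows because the start fee is paid first.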
🚧 Limitations
- Paginated feeds: we don't follow `<link rel="next">` paged feeds automatically. Pass each page URL explicitly if you need full history.
- JavaScript-rendered feeds: feeds that require JavaScript to load are not supported. You would need a browser-based Actor for those.
- Malformed XML: `feedparser` is lenient and handles most broken XML, but severely corrupted feeds may yield partial or empty results. The run surfaces a warning, not a silent empty dataset.
- Rate-limiting by feed hosts: heavily scraped feeds (e.g. Reddit) may enforce per-IP rate limits. Enable Apify Proxy in `proxyConfiguration` to rotate IPs.
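For the paginated-feed limitation, one workaround is to discover the page URLs yourself and pass them all in `feedUrls`. The sketch below reads the `rel="next"` link from a feed document you have already downloaded; the function name and sample feed are assumptions, and it only looks at feed-level Atom links (including `atom:link` inside an RSS `<channel>`).

```python
import xml.etree.ElementTree as ET

ATOM_LINK = "{http://www.w3.org/2005/Atom}link"

PAGED_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
<title>Example</title>
<link rel="self" href="https://example.com/feed?page=1"/>
<link rel="next" href="https://example.com/feed?page=2"/>
</feed>"""


def next_page_url(xml_text):
    """Return the href of the feed-level rel="next" link, or None if absent."""
    root = ET.fromstring(xml_text)
    # Atom keeps links directly under <feed>; RSS keeps atom:link under <channel>.
    containers = [root] + root.findall("channel")
    for container in containers:
        for link in container.findall(ATOM_LINK):
            if link.get("rel") == "next":
                return link.get("href")
    return None


print(next_page_url(PAGED_SAMPLE))
```

Calling this in a loop on each downloaded page, until it returns `None`, yields the list of page URLs to hand to the Actor.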
❓ FAQ
Is this the same as an RSS parser API?
Functionally, yes — you call it via the Apify API (or the Console UI), pass feed URLs, and get back structured JSON. The difference is that we handle the messy parts a bare HTTP client doesn't: malformed XML, charset detection, multi-dialect normalisation, and the network-level blocks that make your home-rolled parser fail on 1 in 20 feeds.
Does this handle podcasts?
Yes — podcast RSS is standard RSS with <enclosure> tags. This Actor is a capable podcast RSS parser: the enclosure URL (the audio file) appears in the link field for each episode row, alongside the episode title, author, and published date.
What about Atom feed parser support?
Full Atom 1.0 support is built in. The feed_format field tells you which dialect was parsed. Both RSS and Atom rows share the same output schema, so your downstream code needs no format-specific logic.
Why is content_html empty for some feeds?
Some publishers deliberately publish summary-only feeds to drive clicks to their site. The full body lives on the publisher's page, not in the feed XML. We surface what the feed provides — no fabrication.
What if a feed URL returns an error?
The Actor logs the failure with the HTTP status code, marks that feed as errored in the status message, and continues processing the remaining URLs. You never get a silent empty dataset — partial success is surfaced explicitly.
Can I run this on a schedule?
Yes. Use Apify Schedules to trigger a run on any cron cadence. Pair it with a named dataset to accumulate a rolling archive without overwriting previous results.
Does it deduplicate items across runs?
Within a single run, items are deduplicated by GUID / Atom id. Across runs, deduplication is your responsibility — filter by item_id in your downstream pipeline or use a named dataset with upsert logic.
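A minimal way to handle cross-run deduplication downstream is to persist the `item_id` values you have already processed and filter each new batch against them. In the sketch below the function name, the JSON file used as the store, and the fallback to `link` when `item_id` is null are all assumptions; a database or key-value store would serve the same purpose.

```python
import json
from pathlib import Path


def filter_new_rows(rows, seen_path):
    """Return rows not seen in earlier runs and record their keys in `seen_path`."""
    path = Path(seen_path)
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    fresh = []
    for row in rows:
        # item_id may be null in the schema, so fall back to the permalink.
        key = row.get("item_id") or row.get("link")
        if key is None:
            fresh.append(row)  # nothing to deduplicate on, so keep the row
            continue
        if key in seen:
            continue
        seen.add(key)
        fresh.append(row)
    path.write_text(json.dumps(sorted(seen)))
    return fresh
```

Calling `filter_new_rows(rows, "seen_ids.json")` after each scheduled run keeps only items that earlier runs did not return.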
💬 Your feedback
Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.