RSS Feed Scraper — Atom, Podcast & Multi-Feed

Parse and convert any RSS or Atom feed to a clean dataset — title, link, author, published date, summary, full HTML content, tags, GUID — export to JSON or CSV. A drop-in RSS feed parser for RSS 2.0, Atom 1.0, and the content:encoded / dc:creator extensions.

Developer

DevilScrapes

Maintained by Community

🎯 What this scrapes

RSS and Atom are still the most reliable way to subscribe to a publication. This Actor parses any feed URL (news site, blog, podcast, GitHub release feed, Reddit, Substack, Medium per-author feed) and writes one row per item. Output is normalised across RSS and Atom dialects so downstream code never needs to care which format it received.

Feed sources that work out of the box:

  • News publishers (New York Times, BBC, Reuters, or any site that exposes an RSS endpoint)
  • Podcast directories and individual show feeds (<enclosure> tags parsed automatically)
  • GitHub release and commit feeds
  • Reddit .rss community feeds
  • Substack and Medium per-author feeds
  • Google Alerts export feeds
  • Any custom-built Atom or RSS 2.0 / RSS 1.0 feed
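
To make "normalised across RSS and Atom dialects" concrete, here is a minimal sketch of mapping both formats onto one row shape with Python's standard library. It is an illustration only, not the Actor's implementation: the function name `normalise`, the helper `_text`, and the two sample feeds are made up for this example, and only a few of the output fields are covered.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DC = "{http://purl.org/dc/elements/1.1/}"


def _text(node, path):
    """Return the stripped text of the first match, or None if absent/empty."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip() or None


def normalise(xml_text):
    """Map RSS 2.0 <item> and Atom 1.0 <entry> elements onto one row shape."""
    root = ET.fromstring(xml_text)
    rows = []
    if root.tag == f"{ATOM}feed":
        for entry in root.findall(f"{ATOM}entry"):
            link = entry.find(f"{ATOM}link")
            rows.append({
                "feed_format": "atom",
                "item_id": _text(entry, f"{ATOM}id"),
                "title": _text(entry, f"{ATOM}title"),
                "link": link.get("href") if link is not None else None,
                "author": _text(entry, f"{ATOM}author/{ATOM}name"),
            })
    else:
        for item in root.iter("item"):
            rows.append({
                "feed_format": "rss",
                "item_id": _text(item, "guid"),
                "title": _text(item, "title"),
                "link": _text(item, "link"),
                "author": _text(item, f"{DC}creator"),
            })
    return rows


# Hypothetical sample feeds, one per dialect.
RSS_SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel><title>Example</title>
<item><title>Hello</title><link>https://example.com/1</link>
<guid>urn:1</guid><dc:creator>Ann</dc:creator></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
<title>Example</title>
<entry><id>urn:2</id><title>World</title>
<link href="https://example.com/2"/>
<author><name>Bob</name></author></entry>
</feed>"""

rss_rows = normalise(RSS_SAMPLE)
atom_rows = normalise(ATOM_SAMPLE)
```

Both calls return dictionaries with the same keys, so code that consumes the rows only needs `feed_format` if it wants to know where a row came from.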

🔥 Features

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the target sees a browser, not a Python script.
  • 🌐 Residential proxy rotation via Apify Proxy — fresh session and exit IP on every block or 429 response.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per feed, Retry-After header honoured.
  • 🧱 Rate-limit-aware pacing — when a feed host pushes back, we slow down and surface exactly what was collected before the limit hit.
  • 🧊 Clean, typed dataset rows — Pydantic-validated, ISO-8601 timestamps, stable IDs, exportable as JSON / CSV / Excel straight from Apify Console.
  • 💰 Pay-Per-Event pricing — you pay only for results that land in your dataset. No data, no charge (beyond the small actor-start fee).
  • 📡 Multi-feed batching — pass a list of URLs; the Actor fetches and normalises them all in one run, deduplicating by GUID.
  • 📝 Full HTML content — when a feed publishes content:encoded or Atom content, we capture the full body alongside the summary, not just a truncated snippet.
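
The retry behaviour listed above (408 / 429 / 5xx, up to 5 attempts, Retry-After honoured) can be sketched roughly as follows. This is an assumption-laden illustration rather than the Actor's code: the function names and the `base`, `cap`, and `jitter` values are invented here, and only the numeric form of Retry-After is handled.

```python
import random

RETRYABLE = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5  # matches "up to 5 attempts per feed"


def should_retry(status, attempt):
    """attempt is 1-based and refers to the attempt that just failed."""
    return status in RETRYABLE and attempt < MAX_ATTEMPTS


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0, jitter=0.0):
    """Seconds to wait before the next attempt.

    A numeric Retry-After header wins over the computed delay. The
    HTTP-date form of Retry-After is not handled in this sketch and
    falls through to exponential backoff.
    """
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass
    delay = min(cap, base * 2 ** (attempt - 1))  # 1s, 2s, 4s, 8s, ...
    return delay + random.uniform(0, jitter)
```

With the assumed defaults, the waits after attempts 1 to 4 are 1, 2, 4, and 8 seconds unless the server supplies its own Retry-After value.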

💡 Use cases

  • News aggregation dashboard — pull 20 publications into one stream and pipe to Slack, Discord, or a webhook.
  • Brand monitoring — track every Google Alerts RSS feed for your company name, product, or competitors.
  • Content automation — feed company-blog RSS into a translation pipeline, summary LLM, or newsletter tool.
  • Podcast RSS parser — podcast RSS is standard RSS with <enclosure> tags; this Actor surfaces the episode link, title, author, and published date for every episode in the feed.
  • LLM-ready news digest — pass structured rows straight to an LLM pipeline; ISO-8601 timestamps and clean HTML make chunking predictable.
  • RSS-to-Google-Sheets / Notion / Airtable — export via Apify's native integration or the API; no glue code required.
  • Feed archival — schedule the Actor daily to build a rolling archive of feeds that don't publish full history.

⚙️ How to use it

  1. Click Try for free at the top of the page.
  2. Paste one or more RSS / Atom feed URLs into the feedUrls field — one per line.
  3. Optionally set maxItemsPerFeed and toggle includeContent for full HTML.
  4. Click Start. Output streams into the run's dataset in real time.
  5. Export from Storage → Dataset as JSON, CSV, or Excel — or call the Apify API from your own code.

For scheduled runs, use Apify Schedules (cron syntax) so the Actor refreshes your dataset on your preferred cadence.
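
For step 5, calling the Actor from your own code might look like the sketch below. It assumes the official `apify-client` Python package; the helper name `fetch_rows` is made up for this example, and the Actor ID and token are placeholders you need to fill in from the Apify Console.

```python
def fetch_rows(client, actor_id, run_input):
    """Start the Actor, wait for the run to finish, and return its dataset rows.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires `pip install apify-client` and an API token):
#
#   from apify_client import ApifyClient
#
#   client = ApifyClient("<YOUR_APIFY_TOKEN>")
#   rows = fetch_rows(
#       client,
#       "<ACTOR_ID>",  # placeholder: copy the real ID from the Actor's page
#       {"feedUrls": ["https://news.ycombinator.com/rss"], "maxItemsPerFeed": 25},
#   )
#   for row in rows:
#       print(row["published"], row["title"])
```

Passing the client in as an argument keeps the helper easy to test with a stub, and keeps the token out of the function itself.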

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| feedUrls | array | yes | ["https://news.ycombinator.com/rss"] | List of RSS / Atom feed URLs. One URL per item. Each URL produces one or more dataset rows. |
| maxItemsPerFeed | integer | no | 50 | Cap on items pulled from a single feed. Set to 0 for no limit. |
| includeContent | boolean | no | true | When true, includes the full HTML body (content:encoded / Atom content). When false, summary only. |
| userAgent | string | no | "DevilScrapesBot/1.0 (+https://apify.com/DevilScrapes)" | Custom User-Agent string. Default identifies as Devil Scrapes RSS reader. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Some publishers rate-limit scrapers. Apify Proxy provides sticky sessions and IP rotation when needed. |

Example input

{
  "feedUrls": [
    "https://news.ycombinator.com/rss",
    "https://feeds.arstechnica.com/arstechnica/index"
  ],
  "maxItemsPerFeed": 25,
  "includeContent": true,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

📤 Output

Every row is one feed item. Rows are validated with Pydantic: a field is null only when the feed did not supply a value, and no fields appear outside the schema below.

| Field | Type | Notes |
| --- | --- | --- |
| feed_url | string | Source feed URL passed in feedUrls. |
| feed_title | string or null | Feed channel title. |
| feed_format | string | "rss" or "atom". |
| item_id | string or null | Item GUID (RSS) or id (Atom). Used for deduplication. |
| title | string | Item headline. |
| link | string | Item permalink URL. |
| author | string or null | Author from dc:creator or Atom author/name. |
| summary | string or null | Short description / atom:summary. |
| content_html | string or null | Full HTML body when the feed includes it. |
| categories | array | Item tags / categories (empty array if none). |
| published | string or null | Publish timestamp in ISO-8601 format. |
| updated | string or null | Updated timestamp in ISO-8601 format. |
| scraped_at | string | ISO-8601 timestamp for when this row was recorded. |

Example output

{
  "feed_url": "https://news.ycombinator.com/rss",
  "feed_title": "Hacker News",
  "feed_format": "rss",
  "item_id": "https://news.ycombinator.com/item?id=48000000",
  "title": "Show HN: Building a hosted RSS parser for the post-LLM web",
  "link": "https://news.ycombinator.com/item?id=48000000",
  "author": null,
  "summary": "A discussion about ...",
  "content_html": null,
  "categories": [],
  "published": "2026-05-15T20:00:00+00:00",
  "updated": null,
  "scraped_at": "2026-06-01T09:00:00+00:00"
}
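
If you consume rows in Python, a small typed wrapper can turn the ISO-8601 strings into timezone-aware `datetime` objects. The sketch below mirrors the output table with a standard-library dataclass; the names `FeedItem` and `parse_row` are made up for this example and are not part of the Actor.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class FeedItem:
    # Fields documented as plain "string" are required here.
    feed_url: str
    feed_format: str
    title: str
    link: str
    scraped_at: datetime
    # Nullable fields default to None; categories defaults to an empty list.
    feed_title: Optional[str] = None
    item_id: Optional[str] = None
    author: Optional[str] = None
    summary: Optional[str] = None
    content_html: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    published: Optional[datetime] = None
    updated: Optional[datetime] = None


def _ts(value):
    """Parse an ISO-8601 string, passing None through unchanged."""
    return datetime.fromisoformat(value) if value else None


def parse_row(row):
    """Build a FeedItem from one dataset row (a dict decoded from JSON)."""
    data = dict(row)
    for key in ("published", "updated", "scraped_at"):
        data[key] = _ts(data.get(key))
    return FeedItem(**data)


example = parse_row({
    "feed_url": "https://news.ycombinator.com/rss",
    "feed_title": "Hacker News",
    "feed_format": "rss",
    "item_id": "https://news.ycombinator.com/item?id=48000000",
    "title": "Show HN: Building a hosted RSS parser for the post-LLM web",
    "link": "https://news.ycombinator.com/item?id=48000000",
    "author": None,
    "summary": "A discussion about ...",
    "content_html": None,
    "categories": [],
    "published": "2026-05-15T20:00:00+00:00",
    "updated": None,
    "scraped_at": "2026-06-01T09:00:00+00:00",
})
```

Note that `FeedItem(**data)` raises a `TypeError` on unknown keys, which is a deliberate choice in this sketch: it fails loudly if the schema ever gains a field the wrapper does not know about.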

💰 Pricing

Pay-Per-Event — you pay only when these events fire:

| Event | USD | What it is |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.001 | Per dataset item written |

Example: 1 000 items in a single run cost 1 000 × $0.001 + $0.005 = $1.005.

No subscription, no monthly minimum, no card to start. Apify gives every new account $5 of free credit, which covers roughly your first 5 000 rows (slightly fewer once actor-start fees are counted).
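
The arithmetic behind the rates above can be checked with a short sketch. The two constants come from the pricing table; the function names are made up for this example, and `Decimal` is used so fractions of a cent stay exact.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.001")   # charged per dataset item written


def estimate_cost(items, runs=1):
    """Total USD for `items` rows spread over `runs` Actor runs."""
    return ACTOR_START * runs + PER_RESULT * items


def rows_within_budget(budget_usd, runs=1):
    """How many rows a budget covers after paying the start fee(s)."""
    remaining = Decimal(str(budget_usd)) - ACTOR_START * runs
    return max(0, int(remaining / PER_RESULT))


single_run = estimate_cost(1000)        # one run, 1 000 rows
daily_month = estimate_cost(1500, 30)   # 30 daily runs, 50 rows each
free_rows = rows_within_budget(5)       # rows covered by $5 in one run
```

Under these rates the start fee only matters for many small runs: 30 daily runs of 50 rows each add $0.15 in start fees on top of $1.50 for the rows.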

🚧 Limitations

  • Paginated feeds — we don't follow <link rel="next"> paged feeds automatically. Pass each page URL explicitly if you need full history.
  • JavaScript-rendered feeds — feeds that require JavaScript to load are not supported. You would need a browser-based Actor for those.
  • Malformed XML: feedparser is lenient and handles most broken XML, but severely corrupted feeds may yield partial or empty results. The run surfaces a warning, not a silent empty dataset.
  • Rate-limiting by feed hosts — heavily scraped feeds (e.g. Reddit) may enforce per-IP rate limits. Enable Apify Proxy in proxyConfiguration to rotate IPs.
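
Since paged feeds are not followed automatically, you may want to discover the page URLs yourself and pass them in feedUrls. The sketch below reads the feed-level `<link rel="next">` (the RFC 5005 paging convention) from a feed document you have already downloaded. The function name `next_page_url` and the sample feed are made up for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM_LINK = "{http://www.w3.org/2005/Atom}link"


def next_page_url(xml_text, base_url):
    """Return the absolute URL of the next page, or None on the last page.

    Only feed-level links are inspected: direct children of an Atom
    <feed>, or atom:link elements inside an RSS <channel>.
    """
    root = ET.fromstring(xml_text)
    candidates = root.findall(ATOM_LINK) + root.findall(f"channel/{ATOM_LINK}")
    for link in candidates:
        if link.get("rel") == "next" and link.get("href"):
            return urljoin(base_url, link.get("href"))
    return None


# Hypothetical first page of a paged Atom feed.
PAGE_ONE = """<feed xmlns="http://www.w3.org/2005/Atom">
<title>Archive</title>
<link rel="self" href="https://example.com/feed"/>
<link rel="next" href="https://example.com/feed?page=2"/>
</feed>"""

following = next_page_url(PAGE_ONE, "https://example.com/feed")
```

Calling this in a loop (fetch a page, read its next link, repeat until it returns None) yields the list of page URLs to pass to the Actor in one run.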

❓ FAQ

Is this the same as an RSS parser API?

Functionally, yes: you call it via the Apify API (or the Console UI), pass feed URLs, and get back structured JSON. The difference is that we handle the messy parts a bare HTTP client doesn't: malformed XML, charset detection, multi-dialect normalisation, and the network-level blocks that make a hand-rolled parser fail on roughly 1 in 20 feeds.

Does this handle podcasts?

Yes — podcast RSS is standard RSS with <enclosure> tags. This Actor is a capable podcast RSS parser: the enclosure URL (the audio file) appears in the link field for each episode row, alongside the episode title, author, and published date.
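
For readers unfamiliar with the tag, this is what an `<enclosure>` looks like in a raw podcast feed and how its attributes can be read with the standard library. It illustrates the feed format only and is not the Actor's parsing code; the function name `enclosures` and the sample feed are made up.

```python
import xml.etree.ElementTree as ET


def enclosures(xml_text):
    """List the audio enclosure of every RSS <item> that has one."""
    root = ET.fromstring(xml_text)
    found = []
    for item in root.iter("item"):
        enc = item.find("enclosure")
        if enc is None:
            continue  # ordinary item without attached media
        length = enc.get("length")
        found.append({
            "title": (item.findtext("title") or "").strip(),
            "audio_url": enc.get("url"),
            "mime_type": enc.get("type"),
            # length is the file size in bytes; some feeds omit or garble it
            "bytes": int(length) if length and length.isdigit() else None,
        })
    return found


# Hypothetical two-item feed: one episode, one text-only post.
PODCAST_SAMPLE = """<rss version="2.0"><channel><title>Show</title>
<item><title>Episode 1</title>
<enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="12345"/>
</item>
<item><title>Announcement</title></item>
</channel></rss>"""

episodes = enclosures(PODCAST_SAMPLE)
```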

What about Atom feed parser support?

Full Atom 1.0 support is built in. The feed_format field tells you which dialect was parsed. Both RSS and Atom rows share the same output schema, so your downstream code needs no format-specific logic.

Why is content_html empty for some feeds?

Some publishers deliberately publish summary-only feeds to drive clicks to their site. The full body lives on the publisher's page, not in the feed XML. We surface what the feed provides — no fabrication.

What if a feed URL returns an error?

The Actor logs the failure with the HTTP status code, marks that feed as errored in the status message, and continues processing the remaining URLs. You never get a silent empty dataset — partial success is surfaced explicitly.

Can I run this on a schedule?

Yes. Use Apify Schedules to trigger a run on any cron cadence. Pair it with a named dataset to accumulate a rolling archive without overwriting previous results.

Does it deduplicate items across runs?

Within a single run, items are deduplicated by GUID / Atom id. Across runs, deduplication is your responsibility — filter by item_id in your downstream pipeline or use a named dataset with upsert logic.
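
One way to handle cross-run deduplication downstream is to keep a set of item_id values you have already processed and filter each new run against it. The sketch below is a minimal version under stated assumptions: the helper names are made up, the seen-ID store is a local JSON file, and rows with a null item_id fall back to link as the key.

```python
import json
from pathlib import Path


def load_seen(path):
    """Read previously seen IDs from a JSON file (empty set if missing)."""
    file = Path(path)
    return set(json.loads(file.read_text())) if file.exists() else set()


def save_seen(path, seen_ids):
    """Persist the seen IDs as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(seen_ids)))


def filter_new(rows, seen_ids):
    """Return rows not seen before, adding their keys to `seen_ids`."""
    fresh = []
    for row in rows:
        key = row.get("item_id") or row.get("link")
        if key is None:
            fresh.append(row)  # nothing to key on, so keep rather than drop
            continue
        if key in seen_ids:
            continue
        seen_ids.add(key)
        fresh.append(row)
    return fresh
```

A typical scheduled pipeline would call `load_seen`, pass the latest run's rows through `filter_new`, process what comes back, and then call `save_seen`.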

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new field? Open an issue on the Actor's Issues tab on Apify Console — we ship fixes weekly and we read every report.