Pricing

from $1.00 / 1,000 results

RSS & Atom Feed to JSON

Convert any RSS or Atom feed into clean, structured JSON. Handles both formats and normalizes them to one schema with parsed dates, authors, content, and categories.

Rating

0.0

(0)

Developer

Nicolas van Arkens

Last modified

2 months ago

.dockerignore

.git
.venv
__pycache__
*.pyc
storage
test_*.py

requirements.txt

apify < 3.0
pydantic >= 2.8, < 2.12
browserforge == 1.2.3
httpx

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "rss-to-json",
    "title": "RSS & Atom Feed to JSON",
    "description": "Convert any RSS or Atom feed into clean, structured JSON. Handles both formats and normalizes them to one schema with parsed dates, authors, content, and categories.",
    "version": "0.1",
    "buildTag": "latest",
    "input": "./input_schema.json",
    "dockerfile": "./Dockerfile",
    "storages": { "dataset": "./dataset_schema.json" }
}

.actor/dataset_schema.json

{
    "actorSpecification": 1,
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": { "fields": ["title","author","published","categories","link","feedTitle"] },
            "display": {
                "component": "table",
                "properties": {
                    "title": { "label": "Title", "format": "text" },
                    "author": { "label": "Author", "format": "text" },
                    "published": { "label": "Published", "format": "text" },
                    "categories": { "label": "Categories", "format": "array" },
                    "link": { "label": "Link", "format": "link" },
                    "feedTitle": { "label": "Feed", "format": "text" }
                }
            }
        }
    }
}

.actor/Dockerfile

FROM apify/actor-python:3.12
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . ./
CMD ["python", "-m", "src"]

.actor/input_schema.json

{
    "title": "RSS & Atom Feed to JSON",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "feedUrls": {
            "title": "Feed URLs",
            "type": "array",
            "description": "One or more RSS or Atom feed URLs to convert. Works with news, blogs, podcasts, YouTube channel feeds, Reddit (.rss), and more.",
            "editor": "stringList",
            "prefill": ["https://hnrss.org/frontpage"]
        },
        "maxItemsPerFeed": {
            "title": "Max items per feed",
            "type": "integer",
            "description": "Maximum number of items to return from each feed.",
            "default": 100,
            "minimum": 1,
            "maximum": 1000
        },
        "includeContent": {
            "title": "Include full content",
            "type": "boolean",
            "description": "Include the full article body (content:encoded / Atom content) when present. Turn off for lighter output with just summaries.",
            "default": true
        }
    },
    "required": []
}

.actor/README.md

# RSS & Atom Feed to JSON 📡

Turn **any RSS or Atom feed** into clean, structured JSON. Point it at one or many feed URLs (news sites, blogs, podcasts, YouTube channels, Reddit, job boards, price-drop feeds) and get back tidy, normalized records you can drop straight into a database, spreadsheet, automation, or LLM pipeline.

## Why use it

- 🔄 **One schema for everything**: RSS 2.0 and Atom feeds come out in the *same* clean format, so your downstream code never has to care which one it was
- 🗓️ **Normalized dates**: RSS (RFC 822) and Atom (ISO 8601) dates are both converted to clean ISO 8601 UTC
- 📝 **Full content**: pulls `content:encoded` / Atom content, not just the summary (toggleable)
- 👤 **Authors, categories, media**: extracted even from namespaced fields (`dc:creator`, enclosures)
- 📚 **Many feeds at once**: batch any number of feed URLs in a single run
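
The date normalization mentioned above can be sketched with the standard library alone. This is a minimal illustration, not necessarily the Actor's exact code: the helper name `to_utc_iso` is an assumption made for this example, and timestamps without a zone are assumed to be UTC.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_utc_iso(raw: str) -> str:
    """Convert an RFC 822 or ISO 8601 timestamp to ISO 8601 in UTC."""
    raw = raw.strip()
    try:
        # RSS style: 'Tue, 10 Jun 2025 09:00:00 GMT'
        dt = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Atom style: '2025-06-10T09:00:00Z'
        dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # Assumption: a timestamp without a zone is treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()


print(to_utc_iso("Tue, 10 Jun 2025 11:00:00 +0200"))
print(to_utc_iso("2025-06-10T09:00:00Z"))
```

Both calls describe the same instant, so they print the same UTC string.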

## Use cases

- **Automation**: feed RSS into Zapier, Make, Slack, or webhooks as structured data
- **Content aggregation**: combine many sources into one normalized stream
- **Monitoring**: watch news, releases, or job/price feeds and act on changes
- **AI/LLM pipelines**: clean article text and metadata ready for summarization or RAG
- **Datasets & dashboards**: collect feed data over time for analysis

## Input

| Field | Description |
|-------|-------------|
| **Feed URLs** | One or more RSS/Atom feed URLs. |
| **Max items per feed** | Cap items returned per feed. |
| **Include full content** | Include the full article body, or just summaries. |
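
In the input schema these three fields use the keys `feedUrls`, `maxItemsPerFeed`, and `includeContent`. As a rough sketch, an input object could be assembled and checked in Python like this. `build_input` is a hypothetical helper for illustration and is not part of the Actor.

```python
import json


def build_input(feed_urls, max_items_per_feed=100, include_content=True):
    """Build an input dict that matches the Actor's input schema keys."""
    urls = [u.strip() for u in feed_urls if u and u.strip()]
    if not urls:
        # The Actor itself only logs a warning here; this sketch fails early.
        raise ValueError("at least one feed URL is needed")
    # The input schema allows 1 to 1000 items per feed.
    if not 1 <= max_items_per_feed <= 1000:
        raise ValueError("maxItemsPerFeed must be between 1 and 1000")
    return {
        "feedUrls": urls,
        "maxItemsPerFeed": max_items_per_feed,
        "includeContent": include_content,
    }


run_input = build_input(["https://hnrss.org/frontpage"], max_items_per_feed=25)
print(json.dumps(run_input, indent=2))
```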

## Output

```json
{
  "title": "First Post",
  "link": "https://example.com/1",
  "summary": "A short summary here.",
  "content": "<p>Full body content</p>",
  "author": "Jane Doe",
  "published": "2025-06-10T09:00:00+00:00",
  "updated": null,
  "categories": ["tech", "news"],
  "guid": "https://example.com/1",
  "mediaUrl": "https://example.com/img.jpg",
  "feedTitle": "Example News",
  "feedUrl": "https://example.com/feed"
}
```

Export to JSON, CSV, or Excel, or pull via the Apify API. Connect to Sheets, Slack, Notion, Zapier, or Make.
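
For illustration, records shaped like the sample output can also be flattened to CSV locally with the standard library. The `records_to_csv` helper and the chosen columns are assumptions for this sketch. List values such as `categories` are joined with `;`.

```python
import csv
import io

COLUMNS = ["title", "link", "author", "published", "categories", "feedTitle"]


def records_to_csv(records):
    """Serialize feed records to CSV text, keeping only COLUMNS."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        # Join the category list into a single cell.
        row["categories"] = ";".join(rec.get("categories") or [])
        writer.writerow(row)
    return buf.getvalue()


sample = [
    {
        "title": "First Post",
        "link": "https://example.com/1",
        "summary": "A short summary here.",
        "author": "Jane Doe",
        "published": "2025-06-10T09:00:00+00:00",
        "categories": ["tech", "news"],
        "feedTitle": "Example News",
    }
]
print(records_to_csv(sample), end="")
```

Fields outside `COLUMNS` (here `summary`) are dropped, and missing or null fields become empty cells.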

## Notes

- Supports RSS 2.0, Atom, and RSS 1.0/RDF feeds.
- Independent tool; feeds remain the property of their publishers. Respect each source's terms of use.
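
The three supported formats can be told apart by the root element of the document. Here is a minimal sketch, assuming the root's local name is `rss` for RSS 2.0, `feed` for Atom, and `RDF` for RSS 1.0. `detect_feed_type` is a hypothetical helper used only for illustration.

```python
from xml.etree import ElementTree as ET


def detect_feed_type(xml_text: str) -> str:
    """Return 'rss', 'atom', 'rdf', or 'unknown' based on the root element."""
    root = ET.fromstring(xml_text)
    # Drop the '{namespace}' prefix ElementTree puts on qualified tags.
    local = root.tag.rsplit("}", 1)[-1]
    return {"rss": "rss", "feed": "atom", "RDF": "rdf"}.get(local, "unknown")


print(detect_feed_type('<rss version="2.0"><channel/></rss>'))
print(detect_feed_type('<feed xmlns="http://www.w3.org/2005/Atom"/>'))
```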

src/main.py

"""RSS & Atom Feed to JSON - Apify Actor.

Fetches one or more RSS 2.0 or Atom feeds and normalizes every item into a
single clean JSON schema, regardless of source format. Handles the messy
reality of feeds: RSS vs Atom element names, multiple date formats, namespaced
elements (content:encoded, dc:creator, media:*), and CDATA.

The value here is the normalization: the same tidy record shape whether the
source is RSS 2.0 or Atom, so downstream code never has to branch on format.
"""

from __future__ import annotations

import asyncio
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from xml.etree import ElementTree as ET

import httpx
from apify import Actor

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "media": "http://search.yahoo.com/mrss/",
}


def _strip(s: str | None) -> str | None:
    if s is None:
        return None
    s = s.strip()
    return s or None


def _normalize_date(raw: str | None) -> str | None:
    """Convert RSS (RFC 822) or Atom (ISO 8601) dates to ISO 8601 UTC."""
    if not raw:
        return None
    raw = raw.strip()
    # Try RFC 822 (RSS): 'Tue, 10 Jun 2025 09:00:00 GMT'
    try:
        dt = parsedate_to_datetime(raw)
        if dt is not None:
            if dt.tzinfo is None:
                dt = dt.replace(tzinfo=timezone.utc)
            return dt.astimezone(timezone.utc).isoformat()
    except (TypeError, ValueError):
        pass
    # Try ISO 8601 (Atom): '2025-06-10T09:00:00Z'
    try:
        dt = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc).isoformat()
    except ValueError:
        return raw  # return as-is if unparseable, don't lose data


def _localname(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def _find_text(el, names: list[str]) -> str | None:
    """Find first child whose local name matches any of `names`."""
    for child in el:
        if _localname(child.tag) in names and child.text:
            return _strip(child.text)
    return None


def parse_rss_item(item) -> dict:
    """Parse an RSS 2.0 <item>."""
    # Link is plain text in RSS.
    link = _find_text(item, ["link"])
    # content:encoded for full content, else description.
    content = item.find("content:encoded", NS)
    content_text = _strip(content.text) if content is not None and content.text else None
    description = _find_text(item, ["description"])
    creator = item.find("dc:creator", NS)
    author = (_strip(creator.text) if creator is not None and creator.text else None) or _find_text(item, ["author"])
    # categories
    cats = [_strip(c.text) for c in item if _localname(c.tag) == "category" and c.text]
    # guid
    guid = _find_text(item, ["guid"])
    # enclosure (media)
    media_url = None
    for c in item:
        if _localname(c.tag) == "enclosure" and c.get("url"):
            media_url = c.get("url")
            break
    return {
        "title": _find_text(item, ["title"]),
        "link": link,
        "summary": description,
        "content": content_text,
        "author": author,
        "published": _normalize_date(_find_text(item, ["pubDate", "date"])),
        "updated": None,
        "categories": [c for c in cats if c],
        "guid": guid,
        "mediaUrl": media_url,
    }


def parse_atom_entry(entry) -> dict:
    """Parse an Atom <entry>."""
    # Link: prefer rel=alternate, else first link's href.
    link = None
    for ln in entry.findall("atom:link", NS):
        if ln.get("rel") in (None, "alternate") and ln.get("href"):
            link = ln.get("href")
            break
    if link is None:
        any_link = entry.find("atom:link", NS)
        link = any_link.get("href") if any_link is not None else None

    content_el = entry.find("atom:content", NS)
    content_text = _strip(content_el.text) if content_el is not None and content_el.text else None
    summary = _find_text(entry, ["summary"])

    authors = [
        _strip(n.text)
        for n in entry.findall("atom:author/atom:name", NS)
        if n is not None and n.text
    ]
    cats = [c.get("term") for c in entry.findall("atom:category", NS) if c.get("term")]

    return {
        "title": _find_text(entry, ["title"]),
        "link": link,
        "summary": summary,
        "content": content_text,
        "author": authors[0] if authors else None,
        "published": _normalize_date(_find_text(entry, ["published", "issued"])),
        "updated": _normalize_date(_find_text(entry, ["updated", "modified"])),
        "categories": cats,
        "guid": _find_text(entry, ["id"]),
        "mediaUrl": None,
    }


def parse_feed(xml_text: str) -> tuple[dict, list[dict], str]:
    """Parse a feed -> (feed_meta, items, feed_type)."""
    root = ET.fromstring(xml_text)
    tag = _localname(root.tag)

    if tag == "rss":
        channel = root.find("channel")
        if channel is None:
            return {}, [], "rss"
        meta = {
            "feedTitle": _find_text(channel, ["title"]),
            "feedLink": _find_text(channel, ["link"]),
            "feedDescription": _find_text(channel, ["description"]),
        }
        items = [parse_rss_item(it) for it in channel.findall("item")]
        return meta, items, "rss"

    if tag == "feed":  # Atom
        meta = {
            "feedTitle": _find_text(root, ["title"]),
            "feedLink": None,
            "feedDescription": _find_text(root, ["subtitle"]),
        }
        for ln in root.findall("atom:link", NS):
            if ln.get("rel") in (None, "alternate") and ln.get("href"):
                meta["feedLink"] = ln.get("href")
                break
        items = [parse_atom_entry(e) for e in root.findall("atom:entry", NS)]
        return meta, items, "atom"

    # RDF / RSS 1.0 fallback: <channel> and <item> are direct children of the root.
    # Read the feed title from <channel> so RDF records are not left without one.
    channel = next((c for c in root if _localname(c.tag) == "channel"), None)
    feed_title = _find_text(channel, ["title"]) if channel is not None else None
    items = [parse_rss_item(it) for it in root if _localname(it.tag) == "item"]
    return {"feedTitle": feed_title}, items, "rdf"


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        urls = [u.strip() for u in (actor_input.get("feedUrls", []) or []) if u and u.strip()]
        # Fall back to the schema default when the value is missing or null.
        max_items = int(actor_input.get("maxItemsPerFeed") or 100)
        include_content = bool(actor_input.get("includeContent", True))

        if not urls:
            Actor.log.warning("No feed URLs provided. Add one or more RSS/Atom feed URLs.")
            await Actor.push_data([])
            return

        Actor.log.info(f"Fetching {len(urls)} feed(s).")

        async with httpx.AsyncClient(
            timeout=40.0,
            follow_redirects=True,
            headers={
                "User-Agent": "scrapeworks-rss-to-json/0.1",
                "Accept": "application/rss+xml, application/atom+xml, application/xml, text/xml",
            },
        ) as client:
            for url in urls:
                # Up to three fetch attempts, sleeping 2s and then 4s between them.
                xml_text = None
                for attempt in range(1, 4):
                    try:
                        resp = await client.get(url)
                        resp.raise_for_status()
                        xml_text = resp.text
                        break
                    except httpx.HTTPError as exc:
                        Actor.log.warning(f"Fetch attempt {attempt} failed for {url}: {exc}")
                        if attempt < 3:
                            await asyncio.sleep(attempt * 2)

                if xml_text is None:
                    await Actor.push_data([{"feedUrl": url, "error": "Failed to fetch feed"}])
                    continue

                try:
                    meta, items, ftype = parse_feed(xml_text)
                except ET.ParseError as exc:
                    Actor.log.warning(f"Could not parse {url}: {exc}")
                    await Actor.push_data([{"feedUrl": url, "error": f"Invalid feed XML: {exc}"}])
                    continue

                Actor.log.info(f"  {url}: {ftype} feed, {len(items)} items")
                batch = []
                for it in items[:max_items]:
                    if not include_content:
                        it.pop("content", None)
                    it["feedUrl"] = url
                    it["feedTitle"] = meta.get("feedTitle")
                    batch.append(it)
                if batch:
                    await Actor.push_data(batch)

        Actor.log.info("Done.")

src/__main__.py

import asyncio
from .main import main

asyncio.run(main())

Universal RSS/Atom Feed Reader

blazing_stake/rss-feed-reader

Parse any RSS or Atom feed into clean JSON: title, link, author, date, categories, content, media. Handles both formats. For content monitoring and aggregation.

Mehmet Kut

RSS Feed Scraper

ef12/rss-scraper

Fetch and parse any RSS or Atom feed into structured JSON. Get titles, links, descriptions, authors, dates, and categories.

Daniel Wilson

RSS & News Feed Extractor - Articles to JSON/CSV

pear_fight/rss-news-feed-extractor-articles-to-json-csv

Parse any RSS or Atom feed into clean, structured article data: title, link, author, publish date, categories, summary and full content. Handles both RSS and Atom formats. Perfect for news monitoring, content aggregation and feeding data pipelines. Export to JSON, CSV, Excel.

Harald

RSS & Atom Feed to JSON Scraper

andok/rss-parser

Monitor blogs, news sites, and podcasts. Convert any RSS or Atom feed into structured JSON data for instant content syndication.

Andok

RSS to JSON — Structured Feed Data for AI

wsgcjj/rss-to-json

Convert any RSS or Atom feed to clean structured JSON. Perfect for AI agents, content aggregation, news monitoring, and data pipelines.

陈俊杰

RSS Feed Scraper — Atom, Podcast & Multi-Feed

devilscrapes/rss-feed-scraper

Parse and convert any RSS or Atom feed to a clean dataset — title, link, author, published date, summary, full HTML content, tags, GUID — export to JSON or CSV. A drop-in RSS feed parser for RSS 2.0, Atom 1.0, and the content:encoded / dc:creator extensions.

DevilScrapes

RSS Feed Scraper - RSS & Atom Data

benthepythondev/rss-feed-scraper

Scrape RSS and Atom feeds into structured records with title, URL, author, publish date, categories, image and summary.

Ben

RSS Feed Scraper & Monitor — Any Feed to JSON, CSV, Excel

q_services/rss-feed-monitor

Turn any RSS or Atom feed into clean structured data. Keyword filtering, deduplication across feeds. Perfect for content monitoring.

Q Services

RSS & Atom Feed Aggregator

mahogany_songbird/rss-feed-aggregator

Parse RSS/Atom feeds into structured items.

Britton Furness

RSS Feed Parser - Convert Any RSS or Atom Feed to Clean JSON

eliai/rss-feed-parser

RSS feed parser for developers and AI agents: pass any RSS or Atom feed URL as input and get back clean, structured JSON items (title, link, date, and other feed fields). Pay per result: cost scales with items parsed, nothing hidden.