RSS & Atom Feeds to RAG Markdown Chunks
Turn RSS/Atom feeds into full-article clean Markdown + token-bounded RAG chunks for embeddings & vector DBs. sha256 cross-item dedup means you pay only for net-new articles, not syndicated copies. Robots honored; ownership-gated.
Developer
Adam
Maintained by Community
RSS & Atom Feeds to RAG Markdown
Turn RSS and Atom feeds into RAG-ready Markdown chunks, with content-hash deduplication so you only pay for net-new articles.
What it does
For each feed URL you supply, this Actor:
- Fetches the feed and parses it into normalized items (`title`, `link`, `guid`, `isoDate`). Both RSS `<item>` (with `<link>text</link>`) and Atom `<entry>` (with `<link href="...">`) layouts are supported. Fields that are not present in the feed are left `null`; nothing is invented.
- Computes a deterministic `sha256` content hash for each item and compares it against a cross-run seen-hash snapshot kept in the Actor's key-value store, so already-processed items are skipped as duplicates.
- For each net-new item, fetches the linked article, cleans the HTML down to plain Markdown (scripts/styles/tags stripped), and splits it into token-bounded chunks via the shared chunker.
- Emits one `chunk` record per chunk, carrying the source URL, feed item GUID, chunk text, token count, content hash, and the extracted feed fields.
- Writes one `run_summary` record with the run totals.
Feeds or articles that fail to fetch are skipped (and not billed), so one bad URL never fails the whole run.
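The normalization step in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's actual parser: `parse_feed` and its helpers are hypothetical names, RSS `pubDate` is assumed to be the source of `isoDate`, and for Atom the first `<link>` element is taken.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def _text(node, tag):
    """Return the stripped text of a child element, or None if absent or empty."""
    child = node.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip() or None


def _rss_date(value):
    """Convert an RFC 822 date to ISO 8601; None if missing or unparseable."""
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value).isoformat()
    except (TypeError, ValueError):
        return None


def parse_feed(xml_text):
    """Normalize RSS <item> and Atom <entry> elements into the same four keys."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):  # RSS: the link is element text
        items.append({
            "title": _text(item, "title"),
            "link": _text(item, "link"),
            "guid": _text(item, "guid"),
            "isoDate": _rss_date(_text(item, "pubDate")),
        })
    for entry in root.iter(ATOM_NS + "entry"):  # Atom: the link is an href attribute
        link = entry.find(ATOM_NS + "link")  # assumption: first <link> wins
        items.append({
            "title": _text(entry, ATOM_NS + "title"),
            "link": link.get("href") if link is not None else None,
            "guid": _text(entry, ATOM_NS + "id"),
            "isoDate": _text(entry, ATOM_NS + "updated"),
        })
    return items
```

Missing elements come back as `None` (serialized as `null`), which mirrors the "nothing is invented" rule described above.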
Input
| Field | Type | Required | Description |
|---|---|---|---|
| `feedUrls` | array | Yes | Public RSS or Atom feed URLs you are authorized to read. |
| `chunking` | object | No | `{ maxTokens, overlapTokens }` for token-bounded chunking. Defaults: `maxTokens` 512, `overlapTokens` 64. |
| `ownership_attestation` | boolean | Yes | You must confirm you are authorized to fetch the supplied feeds and their linked articles. The run is rejected before any work or billing if this is not `true`. |
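As an illustration of the input contract in the table above, here is a minimal validation sketch. `validate_input` is a hypothetical helper, not the Actor's code, and the check that `overlapTokens` must be smaller than `maxTokens` is an assumption that the table does not state.

```python
DEFAULT_CHUNKING = {"maxTokens": 512, "overlapTokens": 64}


def validate_input(raw):
    """Apply the documented defaults and reject the run before any work is done."""
    # The attestation gate comes first: it must be exactly True.
    if raw.get("ownership_attestation") is not True:
        raise ValueError("ownership_attestation must be true")
    feed_urls = raw.get("feedUrls")
    if not isinstance(feed_urls, list) or not feed_urls:
        raise ValueError("feedUrls must be a non-empty array")
    # Caller-supplied chunking values override the defaults key by key.
    chunking = {**DEFAULT_CHUNKING, **(raw.get("chunking") or {})}
    # Assumption: an overlap at or above the window size is treated as invalid.
    if not 0 <= chunking["overlapTokens"] < chunking["maxTokens"]:
        raise ValueError("overlapTokens must be >= 0 and smaller than maxTokens")
    return {
        "feedUrls": feed_urls,
        "chunking": chunking,
        "ownership_attestation": True,
    }


example_input = {
    "feedUrls": ["https://example.com/feed.xml"],  # placeholder URL
    "chunking": {"maxTokens": 256},
    "ownership_attestation": True,
}
config = validate_input(example_input)
```

Here `config["chunking"]` keeps the supplied `maxTokens` of 256 and falls back to the default `overlapTokens` of 64.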
Output
Every record has a record_type field.
chunk — one per emitted RAG chunk:
| Field | Type | Description |
|---|---|---|
| `record_type` | string | Always `"chunk"`. |
| `source_url` | string | The article URL the chunk came from. |
| `feed_item_guid` | string or null | The feed item GUID/id (or `null` if the feed omitted it). |
| `chunk_index` | integer | Zero-based index of this chunk within its article. |
| `chunk_text` | string | The chunk's Markdown text. |
| `token_count` | integer | Estimated token count of the chunk (never exceeds `maxTokens`). |
| `content_hash` | string | `sha256:<64 hex>` hash of the chunk text. |
| `extracted_fields` | object | The feed fields extracted for the item (`title`, `link`, `guid`, `isoDate`); absent fields are `null`. |
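To make the `chunk` record shape concrete, the sketch below builds records with a sliding window. It is not the Actor's shared chunker: `chunk_records` is a hypothetical name, and whitespace-separated words stand in for tokens, so the counts will differ from a real tokenizer.

```python
import hashlib


def chunk_records(markdown, source_url, item, max_tokens=512, overlap_tokens=64):
    """Split text into overlapping windows and emit one chunk record per window."""
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be >= 0 and smaller than max_tokens")
    tokens = markdown.split()  # assumption: words approximate tokens
    step = max_tokens - overlap_tokens
    records = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + max_tokens]
        text = " ".join(window)  # note: this collapses the original whitespace
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        records.append({
            "record_type": "chunk",
            "source_url": source_url,
            "feed_item_guid": item.get("guid"),
            "chunk_index": len(records),
            "chunk_text": text,
            "token_count": len(window),
            "content_hash": "sha256:" + digest,
            "extracted_fields": {
                key: item.get(key) for key in ("title", "link", "guid", "isoDate")
            },
        })
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
        start += step
    return records
```

Because each window is sliced to at most `max_tokens` words, `token_count` cannot exceed the limit, and hashing the chunk text alone keeps `content_hash` stable across runs for unchanged text.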
run_summary — exactly one per run:
| Field | Type | Description |
|---|---|---|
| `record_type` | string | Always `"run_summary"`. |
| `items_in_feed` | integer | Total feed items seen across all feeds. |
| `articles_fetched` | integer | Net-new articles fetched and chunked. |
| `duplicates_skipped` | integer | Items skipped because their content hash was already seen. |
Pricing
Pay-Per-Event:
| Event | When it fires | Price |
|---|---|---|
| `actor_run_start` | Once per run, after the gates pass | $0.02 |
| `article_processed` | Per net-new article fetched, cleaned, and chunked | $0.008 |
| `field_extracted` | Per non-null structured field extracted from a feed item | $0.004 |
Duplicates skipped by content hash are not billed — you only pay for net-new work.
Example: a feed with 5 net-new articles, each with 4 fields = $0.02 + 5 x $0.008 + 20 x $0.004 = $0.14.
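The same arithmetic can be reproduced with a small estimator. This is only a sketch for budgeting: `estimate_cost` is a hypothetical helper, the prices are copied from the table above, and `Decimal` is used to avoid floating-point rounding in the sum.

```python
from decimal import Decimal

# Prices per event, taken from the pricing table.
PRICES = {
    "actor_run_start": Decimal("0.02"),
    "article_processed": Decimal("0.008"),
    "field_extracted": Decimal("0.004"),
}


def estimate_cost(net_new_articles, fields_extracted):
    """Estimate the charge for one run that passed the gates."""
    return (
        PRICES["actor_run_start"]
        + net_new_articles * PRICES["article_processed"]
        + fields_extracted * PRICES["field_extracted"]
    )


# 5 net-new articles with 4 non-null fields each, as in the example above.
cost = estimate_cost(net_new_articles=5, fields_extracted=5 * 4)
```

For those inputs `cost` equals the $0.14 from the worked example; a re-run in which every item is a duplicate would be estimated at the run-start price only.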
Why this Actor
- Deterministic and idempotent. Feed parsing, hashing, dedup, and chunking are pure functions; the same feed yields the same content hashes every run, so you can cache and detect changes safely.
- Content-hash dedup across runs. A key-value seen-hash snapshot means re-running a feed only processes (and bills) genuinely new items.
- No hallucination. Item fields come straight from the feed XML; missing fields are `null`, never fabricated. There are no LLM calls and no API keys.
- Pre-chunked for RAG. Output is already split into bounded chunks with stable hashes, ready to embed.
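The cross-run dedup described in the list above can be sketched as follows. The page does not say exactly which bytes go into the item hash, so hashing a canonical JSON of the four feed fields is an assumption here; `item_hash` and `split_net_new` are hypothetical names, and a plain Python set stands in for the key-value store snapshot.

```python
import hashlib
import json

FIELDS = ("title", "link", "guid", "isoDate")


def item_hash(item):
    """Deterministic hash of an item (assumption: canonical JSON of its feed fields)."""
    payload = json.dumps(
        {key: item.get(key) for key in FIELDS},
        sort_keys=True,
        ensure_ascii=False,
    )
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_net_new(items, seen):
    """Return (net_new_items, duplicates_skipped) and record new hashes in `seen`."""
    net_new = []
    skipped = 0
    for item in items:
        digest = item_hash(item)
        if digest in seen:
            skipped += 1  # already processed, so neither fetched nor billed
            continue
        seen.add(digest)
        net_new.append(item)
    return net_new, skipped


seen_hashes = set()  # stand-in for the persisted seen-hash snapshot
feed_items = [
    {"title": "A", "link": "https://example.com/a", "guid": "a", "isoDate": None},
    {"title": "B", "link": "https://example.com/b", "guid": "b", "isoDate": None},
]
first_run = split_net_new(feed_items, seen_hashes)
second_run = split_net_new(feed_items, seen_hashes)
```

On the first pass both items are net-new; on the second pass, with the same snapshot, both are counted as duplicates and nothing is left to process.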
About
This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is used only to bill the customer for the Pay-Per-Event units described above — the Actor has no payout or money-out capability of any kind.