RSS & Atom Feeds to RAG Markdown Chunks

Under maintenance

Pricing

Pay per event

Turn RSS and Atom feeds into clean, full-article Markdown and token-bounded RAG chunks for embeddings and vector databases. Cross-item sha256 deduplication means you pay only for net-new articles, not syndicated copies. robots.txt is honored, and runs are gated on an ownership attestation.

Developer

Adam

Maintained by Community

2 days ago

Last modified

RSS & Atom Feeds to RAG Markdown

Turn RSS and Atom feeds into RAG-ready Markdown chunks, with content-hash deduplication so you only pay for net-new articles.

What it does

For each feed URL you supply, this Actor:

  1. Fetches the feed and parses it into normalized items (title, link, guid, isoDate). Both RSS <item> (with <link>text</link>) and Atom <entry> (with <link href="...">) layouts are supported. Fields that are not present in the feed are left null — nothing is invented.
  2. Computes a deterministic sha256 content hash for each item and compares it against a cross-run seen-hash snapshot kept in the Actor's key-value store, so already-processed items are skipped as duplicates.
  3. For each net-new item, fetches the linked article, cleans the HTML down to plain Markdown (scripts/styles/tags stripped), and splits it into token-bounded chunks via the shared chunker.
  4. Emits one chunk record per chunk, carrying the source URL, feed item GUID, chunk text, token count, content hash, and the extracted feed fields.
  5. Writes one run_summary record with the run totals.
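
As a rough illustration of step 1, the sketch below normalizes RSS `<item>` and Atom `<entry>` elements into the four fields named above, using only the Python standard library. It is not the Actor's code: the function names are hypothetical, taking the first Atom `<link>` is a simplification, and converting RSS `pubDate` to an ISO string is an assumption about how `isoDate` is produced.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def _text(elem, tag):
    """Return the stripped text of a child element, or None if absent or empty."""
    child = elem.find(tag)
    if child is None or child.text is None:
        return None
    return child.text.strip() or None


def _rss_date_to_iso(raw):
    """Convert an RFC 822 date to ISO 8601; unparseable or missing dates stay None."""
    if raw is None:
        return None
    try:
        return parsedate_to_datetime(raw).isoformat()
    except (TypeError, ValueError):
        return None


def parse_feed(xml_text):
    """Hypothetical helper: normalize RSS items and Atom entries to one shape."""
    root = ET.fromstring(xml_text)
    items = []
    # RSS layout: <item> with <link>text</link>
    for item in root.iter("item"):
        items.append({
            "title": _text(item, "title"),
            "link": _text(item, "link"),
            "guid": _text(item, "guid"),
            "isoDate": _rss_date_to_iso(_text(item, "pubDate")),
        })
    # Atom layout: <entry> with <link href="...">
    for entry in root.iter(ATOM_NS + "entry"):
        link_el = entry.find(ATOM_NS + "link")  # simplification: first link only
        items.append({
            "title": _text(entry, ATOM_NS + "title"),
            "link": link_el.get("href") if link_el is not None else None,
            "guid": _text(entry, ATOM_NS + "id"),
            "isoDate": _text(entry, ATOM_NS + "updated"),
        })
    return items
```

Fields missing from the XML come back as `None`, mirroring the "left null" behavior described in step 1.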

Feeds or articles that fail to fetch are skipped (and not billed), so one bad URL never fails the whole run.
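
The skip-on-failure behavior can be pictured as a loop that records a failed URL and moves on. This is a minimal sketch, not the Actor's implementation: `fetch_all` and the injected `fetch` callable are hypothetical names, and catching every `Exception` is a deliberate simplification.

```python
def fetch_all(urls, fetch):
    """Fetch each URL independently so that one failure does not abort the rest.

    `fetch` is a hypothetical callable (url -> str) that raises on failure.
    Returns (fetched, skipped): a dict of url -> body and a list of (url, error).
    """
    fetched = {}
    skipped = []
    for url in urls:
        try:
            fetched[url] = fetch(url)
        except Exception as exc:  # broad on purpose: any failure means "skip"
            skipped.append((url, repr(exc)))
    return fetched, skipped
```

Only the URLs in `fetched` would go on to cleaning, chunking, and billing in this picture.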

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| feedUrls | array | Yes | Public RSS or Atom feed URLs you are authorized to read. |
| chunking | object | No | { maxTokens, overlapTokens } for token-bounded chunking. Defaults: maxTokens 512, overlapTokens 64. |
| ownership_attestation | boolean | Yes | You must confirm you are authorized to fetch the supplied feeds and their linked articles. The run is rejected before any work or billing if this is not true. |
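
To make the input fields concrete, here is a hedged sketch of the attestation gate and of overlapping, token-bounded chunking. Two things are assumptions: tokens are approximated by whitespace-separated words (the Actor's tokenizer is not documented here), and the names `resolve_input` and `chunk_tokens` are hypothetical.

```python
DEFAULT_CHUNKING = {"maxTokens": 512, "overlapTokens": 64}


def resolve_input(actor_input):
    """Reject the run up front unless the attestation is true, then apply defaults."""
    if actor_input.get("ownership_attestation") is not True:
        raise ValueError("ownership_attestation must be true")
    feed_urls = actor_input.get("feedUrls")
    if not feed_urls:
        raise ValueError("feedUrls must be a non-empty array")
    chunking = {**DEFAULT_CHUNKING, **(actor_input.get("chunking") or {})}
    return feed_urls, chunking


def chunk_tokens(text, max_tokens=512, overlap_tokens=64):
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share overlap_tokens tokens. Tokens are whitespace-
    separated words here, which is an assumption made for illustration.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")
    tokens = text.split()
    step = max_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

With `max_tokens=4` and `overlap_tokens=1`, each chunk after the first begins with the final token of the previous chunk.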

Output

Every record has a record_type field.

chunk — one per emitted RAG chunk:

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | Always "chunk". |
| source_url | string | The article URL the chunk came from. |
| feed_item_guid | string or null | The feed item GUID/id (or null if the feed omitted it). |
| chunk_index | integer | Zero-based index of this chunk within its article. |
| chunk_text | string | The chunk's Markdown text. |
| token_count | integer | Estimated token count of the chunk (never exceeds maxTokens). |
| content_hash | string | sha256:<64 hex> hash of the chunk text. |
| extracted_fields | object | The feed fields extracted for the item (title, link, guid, isoDate); absent fields are null. |
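
A chunk record with the shape above could be assembled roughly as follows. This is a sketch under assumptions: the hash is taken over the UTF-8 bytes of the chunk text with no prior normalization, the token count is a whitespace word count, and `content_hash` and `make_chunk_record` are hypothetical helper names.

```python
import hashlib


def content_hash(text):
    """Return 'sha256:' followed by 64 hex characters for the given text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_chunk_record(source_url, item, chunk_index, chunk_text):
    """Build one chunk record; `item` holds the normalized feed fields."""
    return {
        "record_type": "chunk",
        "source_url": source_url,
        "feed_item_guid": item.get("guid"),  # None when the feed omitted it
        "chunk_index": chunk_index,
        "chunk_text": chunk_text,
        "token_count": len(chunk_text.split()),  # assumed whitespace estimate
        "content_hash": content_hash(chunk_text),
        "extracted_fields": {
            key: item.get(key) for key in ("title", "link", "guid", "isoDate")
        },
    }
```

Because the hash depends only on the chunk text, identical text always yields the same `content_hash`.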

run_summary — exactly one per run:

| Field | Type | Description |
| --- | --- | --- |
| record_type | string | Always "run_summary". |
| items_in_feed | integer | Total feed items seen across all feeds. |
| articles_fetched | integer | Net-new articles fetched and chunked. |
| duplicates_skipped | integer | Items skipped because their content hash was already seen. |

Pricing

Pay-Per-Event:

| Event | When it fires | Price |
| --- | --- | --- |
| actor_run_start | Once per run, after the gates pass | $0.02 |
| article_processed | Per net-new article fetched, cleaned, and chunked | $0.008 |
| field_extracted | Per non-null structured field extracted from a feed item | $0.004 |

Duplicates skipped by content hash are not billed — you only pay for net-new work.

Example: a run over a feed with 5 net-new articles, each with 4 non-null fields, costs $0.02 + 5 x $0.008 + 20 x $0.004 = $0.14.
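
The same arithmetic can be written as a small estimator for planning a run. The prices are taken from the table above; the function name `estimate_cost` is hypothetical, and `Decimal` is used only to avoid floating-point rounding in the sum.

```python
from decimal import Decimal

# Per-event prices in USD, copied from the pricing table.
PRICES = {
    "actor_run_start": Decimal("0.02"),
    "article_processed": Decimal("0.008"),
    "field_extracted": Decimal("0.004"),
}


def estimate_cost(net_new_articles, non_null_fields):
    """Estimate the charge for one run; skipped duplicates contribute nothing."""
    return (
        PRICES["actor_run_start"]
        + net_new_articles * PRICES["article_processed"]
        + non_null_fields * PRICES["field_extracted"]
    )


# Reproduces the worked example: 5 articles with 4 non-null fields each.
print(f"${estimate_cost(5, 5 * 4):.2f}")
```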

Why this Actor

  • Deterministic and idempotent. Feed parsing, hashing, dedup, and chunking are pure functions; the same feed yields the same content hashes every run, so you can cache and detect changes safely.
  • Content-hash dedup across runs. A key-value seen-hash snapshot means re-running a feed only processes (and bills) genuinely new items.
  • No hallucination. Item fields come straight from the feed XML; missing fields are null, never fabricated. There are no LLM calls and no API keys.
  • Pre-chunked for RAG. Output is already split into bounded chunks with stable hashes, ready to embed.
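
The cross-run dedup idea can be sketched as a pure function over a set of previously seen hashes. What exactly the Actor hashes per item is not stated here, so hashing a canonical JSON form of the normalized fields is an assumption, as are the names `item_hash` and `split_new_items`. Loading and saving the snapshot in the key-value store is left out.

```python
import hashlib
import json


def item_hash(item):
    """Deterministic hash of an item: canonical JSON (sorted keys), then sha256."""
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_new_items(items, seen_hashes):
    """Return (new_items, duplicates_skipped, updated_seen) without mutating inputs."""
    seen = set(seen_hashes)
    new_items = []
    duplicates = 0
    for item in items:
        digest = item_hash(item)
        if digest in seen:
            duplicates += 1  # already processed in this run or an earlier one
            continue
        seen.add(digest)
        new_items.append(item)
    return new_items, duplicates, seen
```

Feeding the returned set back in on the next run means an unchanged feed produces no new items, which is what keeps repeat runs from being billed for old articles.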

About

This Actor is AI-authored and operated under the publisher's LLC. Actor.charge() is used only to bill the customer for the Pay-Per-Event units described above — the Actor has no payout or money-out capability of any kind.