URL List to RAG & Vector JSONL avatar

URL List to RAG & Vector JSONL

Pricing

$1.00 / 1,000 url converted to vector jsonls

Go to Apify Store
URL List to RAG & Vector JSONL

URL List to RAG & Vector JSONL

Paste a curated URL list and get clean Markdown, document JSONL, vector chunks, ingest manifest, and failed URL report.

Pricing

$1.00 / 1,000 url converted to vector jsonls

Rating

0.0

(0)

Developer

Orbiscribe Labs

Orbiscribe Labs

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

2 days ago

Last modified

Share

Use this Actor when you already know the URLs you want to ingest and need a controlled conversion step for a vector database or RAG pipeline.

It fetches each public URL, extracts readable content, creates stable chunks, and writes JSONL artifacts plus a manifest so failed URLs are easy to inspect. It does not crawl around the site unless you explicitly provide more URLs.

What you get

  • Dataset rows for document records and chunks.
  • Clean Markdown, main text, headings, links, canonical URL, content hash, and output preset metadata.
  • Key-value outputs: RAG_CHUNKS_JSONL, VECTOR_CHUNKS_JSONL, DOCUMENTS_JSONL, INGEST_MANIFEST, FAILED_URLS, MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.

Common workflows

  • Convert a curated URL export into vector-store JSONL.
  • Reprocess a known page list without a crawler wandering through the site.
  • Send failed URLs to a cleanup queue.
  • Keep document and chunk records side by side for debugging retrieval.

Input

Provide urls and choose an outputPreset such as openai_vector_store, langchain, llamaindex, pinecone, qdrant, or generic_jsonl. The preset is included in chunk metadata so downstream jobs can route or transform the JSONL.

Use includeUrlPatterns, excludeUrlPatterns, maxUrls, chunkSizeChars, and chunkOverlapChars to control scope, cost, and chunk shape. The default run processes three public Apify docs URLs so a first Store run produces real Markdown and JSONL without extra setup.

{
"urls": [
{ "url": "https://docs.apify.com/academy/getting-started" },
{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" },
{ "url": "https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description" }
],
"outputPreset": "openai_vector_store",
"includeUrlPatterns": ["/academy/"],
"excludeUrlPatterns": [],
"maxUrls": 3,
"chunkSizeChars": 2500,
"chunkOverlapChars": 250,
"dryRun": false
}

Pricing

Recommended monetization: Pay per Event at $0.001 per vector-jsonl-url.

When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large batches.

Limits and compliance

Public URLs only. This Actor does not bypass logins, paywalls, robots policies, or access controls. It is intentionally a URL-list converter, not a broad crawler.