Pricing

from $3.00 / 1,000 results

Ecommerce RAG Ingestion Engine

Stop feeding your LLMs noisy HTML and irrelevant UI clutter. The Ecommerce RAG Ingestion Engine is a production-ready Apify Actor that transforms entire ecommerce domains, with a specialized focus on Shopify, into clean, AI-ready knowledge bases.

Developer

Blukaze Automations

Last modified

3 months ago

Ecommerce Store Knowledgebase Scraper

An advanced, production-ready Apify Actor designed to scrape ecommerce websites (with a strong focus on Shopify) to extract both structured product catalogs and AI-ready RAG chunks.

This actor is purpose-built for AI teams, agencies, and ecommerce brands that need to ingest a domain’s content into LLM support bots, semantic search pipelines, merchandising copilots, or internal vector databases.

Why use this over a generic crawler?

Ecommerce-Aware: It automatically identifies and classifies URLs into Product, Collection, Policy, FAQ, and Blog page types.
Shopify Optimized: Reads native Shopify payload structures (e.g. product.json payloads, variant variables) directly, which makes extraction more reliable and resilient than relying on CSS selectors.
LLM & RAG Native: It doesn't just dump HTML. It cleans each page by stripping headers, footers, and other noise, then splits the main text into chunks at Markdown boundaries and attaches a token estimate to each chunk.
Configurable Outputs: Choose to output catalog data, knowledge chunks, or both depending on your pipeline requirements.
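
As a rough illustration of the page-type classification described above, here is a minimal sketch that assumes Shopify-style path conventions (/products/, /collections/, /policies/, /blogs/). The `classify_url` helper and its rules are hypothetical and are not taken from the Actor's source.

```python
from urllib.parse import urlparse

# Hypothetical prefix rules based on common Shopify URL conventions.
# The Actor's real classifier is not documented here.
PREFIX_RULES = [
    ("/collections/", "collection"),
    ("/policies/", "policy"),
    ("/blogs/", "blog"),
]


def classify_url(url: str) -> str:
    """Return a page type for a URL based on its path, or 'other'."""
    path = urlparse(url).path.lower()
    # Products reached through a collection (/collections/x/products/y)
    # should still count as products, so check for them first.
    if "/products/" in path:
        return "product"
    for prefix, page_type in PREFIX_RULES:
        if path.startswith(prefix):
            return page_type
    # FAQ pages often live under /pages/, so match on the slug instead.
    if "faq" in path:
        return "faq"
    return "other"


page_type = classify_url("https://example.com/policies/shipping-policy")
```

Path-based rules like these are cheap to evaluate before a page is fetched, which is one reason a crawler might classify on the URL first and fall back to page content only when the path is ambiguous.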

Ideal Use Cases

AI Support Bots: Ground OpenAI, Anthropic, or other dialogue systems in the latest return policies, shipping times, and FAQs, taken directly from the source.
Semantic Product Search: Ingest the products dataset into Pinecone, Weaviate, or Qdrant to enable rich natural language querying over catalogs.
Knowledge Ingestion Pipelines: Integrate directly with Make, n8n, or LangChain to automatically synchronize website knowledge to your enterprise RAG stack.

Output Modes

Use outputMode to choose which output formats the actor produces, and includeSections to control which page types it processes:

1. Catalog Only

Extracts clean, normalized product data and pushes it to the products dataset.

{
  "url": "https://example.com/products/cool-shirt",
  "title": "Cool Graphic T-Shirt",
  "price": 29.99,
  "currency": "USD",
  "availability": "in_stock",
  "brand": "Example Brand",
  "variants": [
    {"title": "Small", "price": 29.99, "sku": "SHIRT-S"},
    {"title": "Large", "price": 29.99, "sku": "SHIRT-L"}
  ]
}
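
To show how a record of this shape might be derived, the sketch below maps a Shopify-style product payload onto the catalog fields. It assumes the payload wraps the product under a `product` key, carries the brand as `vendor`, and encodes variant prices as strings; the sample payload is hand-written and trimmed, and `normalize_product` is a hypothetical helper rather than the Actor's own code. Currency is passed in because the sample payload does not carry it, and availability is left out for the same reason.

```python
from typing import Any


def normalize_product(payload: dict[str, Any], url: str, currency: str) -> dict[str, Any]:
    """Map a Shopify-style product payload onto the catalog record shape."""
    product = payload["product"]
    variants = [
        {
            "title": v["title"],
            # Prices are assumed to arrive as strings such as "29.99".
            "price": float(v["price"]),
            "sku": v.get("sku"),
        }
        for v in product.get("variants", [])
    ]
    return {
        "url": url,
        "title": product["title"],
        # Using the lowest variant price as the headline price is an assumption.
        "price": min((v["price"] for v in variants), default=None),
        "currency": currency,
        "brand": product.get("vendor"),
        "variants": variants,
    }


# Hand-written sample payload for illustration only.
sample = {
    "product": {
        "title": "Cool Graphic T-Shirt",
        "vendor": "Example Brand",
        "variants": [
            {"title": "Small", "price": "29.99", "sku": "SHIRT-S"},
            {"title": "Large", "price": "29.99", "sku": "SHIRT-L"},
        ],
    }
}

record = normalize_product(sample, "https://example.com/products/cool-shirt", "USD")
```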

2. Knowledge Only (RAG Chunks)

Pushes clean text chunks, split at sensible boundaries and each carrying a token estimate, to the knowledge_chunks dataset.

{
  "url": "https://example.com/policies/shipping-policy",
  "title": "Shipping Policy - International Shipping",
  "section_type": "policy_section",
  "source_kind": "policy",
  "text": "We offer international shipping to over 100 countries. Standard international shipping takes 10-15 business days...",
  "token_estimate": 45
}
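
The chunking behind records like this can be sketched as follows. This is a simplified illustration, not the Actor's implementation: it splits at Markdown headings, then packs paragraphs up to a token budget, and it estimates tokens with an assumed heuristic of roughly four characters per token. The names `estimate_tokens` and `chunk_markdown` are hypothetical.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (a heuristic)."""
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int) -> list[dict]:
    """Split Markdown at headings, then pack paragraphs up to max_tokens."""
    # A zero-width split before each heading keeps the heading with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        current = ""
        # Paragraphs are separated by blank lines.
        for para in re.split(r"\n\s*\n", section):
            candidate = f"{current}\n\n{para}" if current else para
            if current and estimate_tokens(candidate) > max_tokens:
                # Adding this paragraph would exceed the budget: close the chunk.
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return [{"text": c, "token_estimate": estimate_tokens(c)} for c in chunks]


page = (
    "# Shipping\n\nWe ship worldwide.\n\n"
    "## Returns\n\nReturns are accepted within 30 days."
)
chunks = chunk_markdown(page, max_tokens=50)
```

Note that in this sketch a single paragraph longer than the budget is emitted as one oversized chunk instead of being cut mid-sentence; a production chunker would need a policy for that case.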

Input Configuration

| Field | Type | Description |
| --- | --- | --- |
| `startUrls` | Array | URLs or sitemaps to begin crawling. |
| `includeSections` | Array | Types of pages to process (`products`, `policies`, `faq`, etc.). |
| `outputMode` | String | `catalog_only`, `knowledge_only`, or `both`. |
| `maxPages` | Integer | Hard limit on the number of pages crawled (cost control). |
| `chunkSizeTokens` | Integer | Target maximum token size for RAG chunks. |
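
A small helper for assembling an input object from these fields might look like the sketch below. The field names and the three `outputMode` values come from the table above; the default values, the `{"url": ...}` shape for `startUrls` entries, and the `build_run_input` helper itself are assumptions for illustration.

```python
import json

ALLOWED_OUTPUT_MODES = {"catalog_only", "knowledge_only", "both"}


def build_run_input(
    start_urls: list[str],
    include_sections: list[str],
    output_mode: str = "both",
    max_pages: int = 100,          # assumed default, not documented
    chunk_size_tokens: int = 500,  # assumed default, not documented
) -> dict:
    """Validate the settings and return an input object for the actor."""
    if output_mode not in ALLOWED_OUTPUT_MODES:
        raise ValueError(f"outputMode must be one of {sorted(ALLOWED_OUTPUT_MODES)}")
    if max_pages < 1 or chunk_size_tokens < 1:
        raise ValueError("maxPages and chunkSizeTokens must be positive integers")
    return {
        # Assumption: each start URL is wrapped in an object with a "url" key.
        "startUrls": [{"url": u} for u in start_urls],
        "includeSections": include_sections,
        "outputMode": output_mode,
        "maxPages": max_pages,
        "chunkSizeTokens": chunk_size_tokens,
    }


run_input = build_run_input(
    ["https://example.com"],
    ["products", "policies", "faq"],
    output_mode="both",
    max_pages=200,
)
print(json.dumps(run_input, indent=2))
```

Keeping `maxPages` low on a first run is a simple way to check the output shape before committing to a full crawl.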

Limitations

Render-Heavy Sites: The actor currently runs on a high-speed CheerioCrawler, which parses the HTML returned by the server and does not execute JavaScript. Complex SPAs without Server-Side Rendering (SSR) may therefore have some data hidden. Shopify and most modern ecommerce platforms provide SSR or JSON-LD, which this actor consumes natively.
Paywalls/Logins: Does not bypass logins or CAPTCHAs; the only anti-blocking support is Apify's standard proxy rotation settings.
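
To illustrate why JSON-LD helps on pages whose visible content is rendered client-side, the following sketch pulls schema.org Product objects out of raw HTML without running any JavaScript. It uses only the Python standard library; the `JsonLdParser` class, the `extract_products` helper, and the sample HTML are illustrative assumptions and do not reflect the Actor's internals.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.items: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


def extract_products(html: str) -> list[dict]:
    """Return JSON-LD objects whose @type is Product."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    return [
        item for item in parser.items
        if isinstance(item, dict) and item.get("@type") == "Product"
    ]


# Sample page: the body is an empty SPA shell, but the head carries JSON-LD.
sample_html = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Product", "name": "Cool Graphic T-Shirt", '
    '"offers": {"price": "29.99", "priceCurrency": "USD"}}'
    "</script></head><body><div id=\"app\"></div></body></html>"
)
products = extract_products(sample_html)
```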
