Pricing

from $2.00 / 1,000 html extractions

Refinery — HTML to LLM Text (Cut RAG Token Cost)

HTML-to-LLM text cleaner for RAG pipelines. Strips scripts, nav, and layout junk from HTML fetched by Firecrawl or your own crawler. A faster alternative to running BeautifulSoup yourself. $0.002/page · 3 demo inputs in the README, the first prefilled in Console.

Developer

Lare Labs

Last modified

a month ago

Refinery — HTML to LLM text cleaner for RAG pipelines

Apify Actor that cleans bloated HTML before chunking and embedding — strip scripts, nav, ads, and layout junk from pages you already fetched.
Pay $0.002/page · roughly 2–8 ms of cleaning time per page (Rust core; this excludes the time your crawler spends fetching).

[Image: Refinery pipeline: raw HTML to clean JSON for RAG]

Reduce LLM token cost — HTML cleaner for RAG

[Image: Token reduction: bloated HTML vs clean text after Refinery]

News-style homepages and heavy DOM pages often waste tokens on chrome. Refinery returns main-body text plus word_count so you can budget embeddings — up to ~97% fewer estimated tokens on bloated HTML (your mileage varies).
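To check a reduction figure on your own pages, you can compare a rough token estimate before and after cleaning. The sketch below is illustrative only: it assumes about 4 characters per token, which is a common rule of thumb rather than a real tokenizer, and the function names are hypothetical, not part of Refinery.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. Assumes ~4 characters per token (heuristic, not a tokenizer)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def token_reduction(raw_html: str, clean_text: str) -> float:
    """Fraction of estimated tokens removed by cleaning, between 0.0 and 1.0."""
    before = estimate_tokens(raw_html)
    after = estimate_tokens(clean_text)
    if before == 0:
        return 0.0
    return max(0.0, 1.0 - after / before)


# Synthetic page: a large inline script wrapped around a short article body.
raw_html = (
    "<html><head><script>" + "x" * 3600 + "</script></head>"
    "<body><p>" + "a" * 400 + "</p></body></html>"
)
clean_text = "a" * 400  # stands in for the `text` field Refinery returns

print(f"{token_reduction(raw_html, clean_text):.0%} fewer estimated tokens")
```

For billing-grade numbers, swap the heuristic for the tokenizer of the model you actually embed with.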

Scraped timelines and comment threads ship messy DOM — scripts, sidebars, widgets. Refinery keeps post body text and normalizes @mentions / #hashtags for RAG chunking without paying for layout noise.

[Image: Social and feed HTML: mentions and hashtags preserved as clean text]

Paste raw HTML from your scraper into raw_payload, or pass urls and Refinery fetches each page during the run.

Apify Console output — clean text and word count

Click Try actor and run it with the prefilled example.com URL. Each dataset row includes text, word_count, and processing_time_ms.

[Image: Apify dataset output: clean text, word count, and timing]

Bulk HTML cleaning for crawl batches

Send many URLs in one run — each row gets text, word_count, and processing_time_ms. Ideal after a sitemap pass, Firecrawl export, or Apify crawler dataset.

[Image: Bulk URL mode: many pages in, dataset rows out]

{
  "urls": [
    "https://example.com",
    "https://www.bbc.com/news",
    "https://httpbin.org/html"
  ],
  "removeScripts": true,
  "removeStyles": true,
  "includeMetadata": true
}
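When the URL list comes from a sitemap or a crawler export, it usually needs de-duplicating and splitting before it is sent. The sketch below builds inputs in the shape shown above. The `batch_size` parameter and the function name are assumptions for illustration; this listing does not state a per-run URL limit.

```python
from typing import Iterator


def build_run_inputs(urls: list[str], batch_size: int = 100) -> Iterator[dict]:
    """Yield one Actor input per batch of de-duplicated URLs (order preserved)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # dict.fromkeys drops duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    for start in range(0, len(unique), batch_size):
        yield {
            "urls": unique[start:start + batch_size],
            "removeScripts": True,
            "removeStyles": True,
            "includeMetadata": True,
        }


sitemap_urls = [
    "https://example.com",
    "https://example.com",  # duplicate, dropped
    "https://www.bbc.com/news",
    "https://httpbin.org/html",
]
for run_input in build_run_inputs(sitemap_urls, batch_size=2):
    print(run_input["urls"])
```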

Who uses this HTML text extractor

RAG and agent builders cutting OpenAI / Anthropic token spend on page HTML
Scrape pipelines that already fetch HTML (Firecrawl, Crawl4AI, Playwright, Apify Web Scraper)
Teams replacing per-worker BeautifulSoup with a fast HTML parser API on Apify

Refinery is not a web crawler. It is an HTML-to-text preprocessing step after fetch.

Your crawler → raw HTML → Refinery → clean text → chunk → embed → vector DB → LLM
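The chunk step that follows Refinery in this pipeline is left to you. As one possible approach, the sketch below splits the cleaned `text` into overlapping word windows. The window sizes and the function name are illustrative assumptions, not something Refinery provides.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `chunk_size`, each sharing `overlap` words with the previous one."""
    if chunk_size < 1 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap < chunk_size")
    words = text.split()  # note: collapses newlines and repeated spaces
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Word windows are the simplest option. Sentence-aware or heading-aware splitting often retrieves better, at the cost of more code.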

Try the HTML to LLM cleaner (3 demos)

Demo 1 is prefilled in Console. Paste Demo 2 or Demo 3 to see different modes.

Demo 1 — Quick URL

{
  "urls": ["https://example.com"],
  "removeScripts": true,
  "removeStyles": true,
  "includeMetadata": true
}

Demo 2 — Bloated news homepage

{
  "urls": ["https://www.bbc.com/news"],
  "removeScripts": true,
  "removeStyles": true,
  "includeMetadata": true
}

Demo 3 — Paste HTML (middleware)

{
  "raw_payload": "<html><head><script>gtag('event')</script></head><body><nav>Home · Pricing</nav><article><h1>Update</h1><p>Clean before embedding.</p></article></body></html>",
  "removeScripts": true,
  "removeStyles": true,
  "includeMetadata": true
}
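To make the cleaning step concrete, here is a small stand-in built on Python's standard `html.parser`. It is not Refinery's implementation (the Rust core is not shown in this listing). The skipped tag set and the class and function names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as chrome or noise (an assumed list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class TextExtractor(HTMLParser):
    """Collects text nodes that are not nested inside any SKIP_TAGS element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def clean_html(raw_payload: str) -> dict:
    """Return main text and a word count, mirroring two of the output fields."""
    parser = TextExtractor()
    parser.feed(raw_payload)
    parser.close()
    text = parser.text()
    return {"text": text, "word_count": len(text.split())}


demo = (
    "<html><head><script>gtag('event')</script></head><body>"
    "<nav>Home · Pricing</nav><article><h1>Update</h1>"
    "<p>Clean before embedding.</p></article></body></html>"
)
print(clean_html(demo))
```

On the Demo 3 payload this keeps the heading and paragraph and drops the script and nav text. Real pages need far more care (malformed markup, boilerplate detection, encoding), which is the work a dedicated cleaner takes on.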

Output — text, word_count, language, timing

{
  "text": "Example Product Page\nEnterprise AI Infrastructure...",
  "language": "en",
  "word_count": 12,
  "content_type": "web",
  "processing_time_ms": 19.29,
  "success": true
}

Field	Use it for
`text`	Chunking, embeddings, LLM context
`word_count`	Cost estimates
`processing_time_ms`	Latency monitoring
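A short helper can roll these fields up across a run. This sketch assumes each row is a dict shaped like the output above, and that failed rows carry a falsy `success` value. The listing only shows a successful row, so treat the failure handling as a guess.

```python
def summarize_rows(rows: list[dict]) -> dict:
    """Aggregate word counts and timing over successful dataset rows."""
    ok = [r for r in rows if r.get("success")]
    timings = [r["processing_time_ms"] for r in ok if "processing_time_ms" in r]
    return {
        "pages_ok": len(ok),
        "pages_failed": len(rows) - len(ok),
        "total_words": sum(r.get("word_count", 0) for r in ok),
        "mean_processing_ms": sum(timings) / len(timings) if timings else None,
    }


rows = [
    {"text": "Example Product Page", "word_count": 12, "processing_time_ms": 19.29, "success": True},
    {"text": "Another page", "word_count": 8, "processing_time_ms": 4.71, "success": True},
    {"success": False},  # hypothetical failed row
]
print(summarize_rows(rows))
```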

Firecrawl, Crawl4AI, and BeautifulSoup alternative

[Image: Refinery in your stack: crawler, clean, vector DB, LLM]

You already use…	Refinery's job
Firecrawl, Crawl4AI	Clean their HTML before chunking — fetch with them, clean with Refinery
Apify Web Scraper, Website Content Crawler	Clean the `html` field in your dataset
BeautifulSoup (self-hosted)	Same job, ~281× faster hot path in our benchmarks — pay per page on Apify instead of worker CPU

Pricing — HTML extraction on Apify

Event	Cost
HTML extraction	$0.002 / page
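~1,000 pages	~$2.05 (1,000 × $0.002 = $2.00 in extraction events; the remaining ~$0.05 is not itemized in this listing)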
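Extraction events are billed per page, so that part of the bill scales linearly with page count. The sketch below covers only the $0.002 extraction event from the table. Any other platform charges are not modeled, and the function name is illustrative.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.002")  # HTML extraction event, from the pricing table


def extraction_cost(pages: int) -> Decimal:
    """Cost in USD of the extraction events alone for `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE * pages


for pages in (250, 1_000, 50_000):
    print(f"{pages:>6} pages: ${extraction_cost(pages):.2f}")
```

`Decimal` is used instead of `float` so that small per-page prices do not accumulate rounding error.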

Integrate via Apify API (JavaScript and Python)

JavaScript

import { ApifyClient } from 'apify-client';
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const run = await client.actor('larelabs/refinery-html-to-llm-cleaner').call({
  urls: ['https://example.com'],
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].text, items[0].word_count);

Python

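import os

from apify_client import ApifyClient

# Read the API token from the environment rather than hard-coding it.
client = ApifyClient(os.environ["APIFY_TOKEN"])

# Start the Actor and wait for the run to finish.
run = client.actor("larelabs/refinery-html-to-llm-cleaner").call(
    run_input={"urls": ["https://example.com"]}
)

# Print the first cleaned row from the run's default dataset.
print(next(client.dataset(run["defaultDatasetId"]).iterate_items()))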

FAQ — HTML cleaning for LLM and RAG

Is Refinery a replacement for Firecrawl or Crawl4AI?

No. Fetch with Firecrawl or Crawl4AI, then clean with Refinery. Refinery does not crawl URLs on its own schedule — it strips noise from HTML you already have (or fetches URLs you pass in this run).

How do I reduce RAG token cost from bloated HTML?

Run Refinery on raw HTML before chunking and embedding. Use word_count in the output to estimate savings. Remove scripts, styles, nav, and footer chrome so embeddings only see article body text.

Is this a BeautifulSoup alternative for HTML text extraction?

Yes — same preprocessing job (HTML → clean text), implemented in Rust for low latency. Use it when you want a managed Apify step instead of BeautifulSoup on every worker.

Can I clean HTML after Apify Web Scraper or Website Content Crawler?

Yes. Pass each page's HTML via raw_payload, or pipe URLs from your crawl. Refinery returns plain text ready for chunking.
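As a sketch of the raw_payload route, the helper below turns crawler dataset rows into Refinery inputs. It assumes each row stores the page markup under an `html` key, as in the comparison table above. The function name and the skip-empty behavior are illustrative choices.

```python
from typing import Iterable, Iterator


def inputs_from_crawl_rows(rows: Iterable[dict], html_field: str = "html") -> Iterator[dict]:
    """Yield one Refinery input per crawled row that actually contains HTML."""
    for row in rows:
        html = row.get(html_field)
        if not isinstance(html, str) or not html.strip():
            continue  # skip rows with no usable markup
        yield {
            "raw_payload": html,
            "removeScripts": True,
            "removeStyles": True,
            "includeMetadata": True,
        }


crawl_rows = [
    {"url": "https://example.com", "html": "<html><body><p>Hello</p></body></html>"},
    {"url": "https://example.com/empty", "html": ""},
    {"url": "https://example.com/missing"},
]
for run_input in inputs_from_crawl_rows(crawl_rows):
    print(len(run_input["raw_payload"]), "characters of HTML queued")
```

Each yielded dict can be passed as `run_input` to the same `client.actor(...).call(...)` used in the Python integration example. Since that means one run per page, passing URLs in bulk may be the better fit for large crawls.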

Does Refinery handle JavaScript SPAs?

Only if you pass rendered HTML from a browser crawler (Playwright, Puppeteer, Firecrawl). Refinery cleans DOM; it does not execute JavaScript.

Refinery does not log in to sites or scrape feeds. To clean social content, save the timeline HTML yourself and pass it via raw_payload; Refinery then extracts the post text and normalizes @mentions / #hashtags.

Support — LareLabs

LareLabs · Apify Store listing · Console

Rust core · Apify Actor
