HTML to Markdown — clean conversion, boilerplate stripping
Pricing
from $2.00 / 1,000 results
Convert scraped HTML into clean Markdown and plain text: headings, nested lists, links, images, code blocks, blockquotes, and tables. Drops scripts, styles, and structural boilerplate (nav/footer/aside) so only content remains. Pure parsing, no LLM cost.
Developer
Shinobu Otani
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 6 days ago
HTML to Markdown
Convert scraped HTML into clean Markdown and plain text — pure parsing, no LLM cost. Pairs well with crawlers upstream and with Doc Structure Extractor or RAG Text Chunker downstream.
What it does
- Headings, paragraphs, nested lists, links, images, emphasis, inline code, fenced code blocks, blockquotes, simple tables, horizontal rules.
- Always drops <script>, <style>, and other non-content tags; drops structural boilerplate (nav, footer, aside, form) by default so only the article content remains.
- Extracts the page title (<title>, falling back to the first <h1>).
- Also returns a plain-text rendering and basic stats.
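The tag-dropping and title-fallback behaviour described above can be illustrated with a minimal sketch built on Python's standard html.parser. This is not the actor's implementation: the class name ContentText, the function extract, and the exact tag sets are assumptions made for illustration, and the text it returns is a flat space-joined string rather than the block-aware rendering the actor produces.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; the actor's real lists may differ.
NON_CONTENT = {"script", "style", "noscript", "template"}
BOILERPLATE = {"nav", "footer", "aside", "form"}


class ContentText(HTMLParser):
    """Collects visible text, skips dropped subtrees, and tracks title candidates."""

    def __init__(self, drop_boilerplate=True):
        super().__init__(convert_charrefs=True)
        self.skip_tags = NON_CONTENT | (BOILERPLATE if drop_boilerplate else set())
        self.skip_depth = 0    # > 0 while inside a dropped subtree
        self.current = None    # "title" or "h1" while inside one of them
        self.title_tag = ""    # text of <title>
        self.h1_parts = []     # text pieces of the first <h1> only
        self.h1_done = False
        self.chunks = []       # visible body text

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ("title", "h1"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.current:
            if tag == "h1":
                self.h1_done = True
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if self.skip_depth > 0 or not text:
            return
        if self.current == "title":
            self.title_tag += text
            return  # <title> is metadata, not body text
        if self.current == "h1" and not self.h1_done:
            self.h1_parts.append(text)
        self.chunks.append(text)


def extract(html, drop_boilerplate=True):
    """Return the title (<title>, else first <h1>) and a flat text rendering."""
    parser = ContentText(drop_boilerplate)
    parser.feed(html)
    parser.close()
    title = parser.title_tag.strip() or " ".join(parser.h1_parts)
    return {"title": title, "text": " ".join(parser.chunks)}
```

Calling extract on a page with no <title> falls back to the first <h1>, and passing drop_boilerplate=False keeps nav, footer, aside, and form text while still discarding scripts and styles. The sketch assumes well-formed HTML; an unclosed dropped tag would suppress everything after it.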
Input
{
  "documents": ["<html><body><h1>Guide</h1><p>Hello <strong>world</strong></p></body></html>"],
  "drop_boilerplate": true,
  "include_links": true,
  "include_images": true
}
Output (one dataset item per document)
{
  "title": "Guide",
  "markdown": "# Guide\n\nHello **world**",
  "text": "Guide\n\nHello world",
  "stats": {"blocks": 2, "characters": 26, "words": 3},
  "document_index": 0
}
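Because each output item carries a document_index, results can be matched back to the HTML that produced them. The sketch below shows one way to build the input and pair the results, using only the field names from the examples above. The helper names build_input and pair_results are hypothetical, and the assumption that document_index is the position in the documents array is inferred from the example, not confirmed.

```python
def build_input(html_pages, drop_boilerplate=True, include_links=True, include_images=True):
    """Build a run input in the shape of the Input example from raw HTML strings."""
    return {
        "documents": list(html_pages),
        "drop_boilerplate": drop_boilerplate,
        "include_links": include_links,
        "include_images": include_images,
    }


def pair_results(html_pages, items):
    """Map document_index -> (source HTML, markdown).

    Assumes document_index is the position of the page in "documents".
    """
    return {
        item["document_index"]: (html_pages[item["document_index"]], item["markdown"])
        for item in items
    }
```

Keeping the pairing keyed by index rather than by list order means the mapping still holds if dataset items are read back in a different order.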
Usage
Feed it raw HTML from any crawler run, then chunk the resulting Markdown for RAG, index the plain text for search, or store the Markdown directly.
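As one illustration of the chunking step, the markdown field can be split at headings before it is embedded or indexed. This is a minimal sketch, not the RAG Text Chunker's behaviour: the function name chunk_by_heading and the returned dict shape are assumptions, and it only recognises ATX headings (lines starting with #) outside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # fenced code block marker, built this way to keep the example readable


def chunk_by_heading(markdown):
    """Split Markdown into chunks, one per heading, ignoring '#' lines inside code fences."""
    chunks = []
    heading, lines, in_fence = None, [], False

    def flush():
        # Emit a chunk if there is a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "body": "\n".join(lines).strip()})

    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Text that appears before the first heading is kept as a chunk whose heading is None, so no content is lost when a document opens with a plain paragraph.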