AI Web-to-Markdown Extract API — URL to Clean JSON for LLMs
Pricing
from $10.00 / 1,000 successful web extractions
Scrapes any webpage, automatically cleans HTML clutter (nav, footers, scripts, ads, cookie consent banners), and transforms the main content into clean, structured Markdown for LLMs and RAG.
Rating: 5.0 (2)
Developer: Sergio Calvo
Maintained by: Community
Actor stats: 0 bookmarked, 2 total users, 0 monthly active users, last modified 2 days ago
An optimized Apify Actor that scrapes any webpage, prunes boilerplate and layout elements, and uses Google Gemini 2.5 Flash Lite to extract clean, semantic Markdown. Fully structured for RAG pipelines and LLM ingestion.
⚡ What is this Actor?
This Actor performs real-time extraction and cleaning of web content. Instead of returning full, noisy HTML pages that distract LLMs and waste API tokens, it filters layout noise at the edge and converts the central content into high-quality, structured Markdown.
📈 Key Benefits & Performance Metrics
- 98% Noise Pruning: Strips `<nav>`, `<footer>`, `<script>`, `<style>`, forms, and cookie banners (GDPR consent forms) in real time.
- 80% Token Savings: Removes layout and styling attributes (`class`, `id`, `style`, `data-*`), compressing input token consumption by up to 5x.
- Deterministic Formatting: Restructures complex web tables into strict Markdown tables using a focused system prompt and a low sampling temperature (`temperature: 0.1`).
- Sub-450ms Latency: Uses the lightweight `gemini-2.5-flash-lite` model for fast, cost-effective inference.
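As a rough illustration of the pruning described above, here is a minimal sketch in plain Node.js. It is not the Actor's actual cleaner: `pruneHtml` and `savedFraction` are hypothetical names, the regex approach is an assumption chosen for brevity (an HTML parser would be more robust), and the saving is measured in characters, which only approximates tokens.

```javascript
// Hypothetical helper: a regex-based approximation of the pruning step.
// Assumption: the real Actor may use a proper HTML parser instead.
const BOILERPLATE_TAGS = ["nav", "footer", "script", "style", "form"];

function pruneHtml(html) {
  let out = html;
  // Drop each boilerplate element together with its contents.
  for (const tag of BOILERPLATE_TAGS) {
    const element = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, "gi");
    out = out.replace(element, "");
  }
  // Drop class, id, style and data-* attributes (quoted values only).
  out = out.replace(
    /\s+(?:class|id|style|data-[\w-]+)\s*=\s*(?:"[^"]*"|'[^']*')/gi,
    ""
  );
  // Collapse the whitespace left behind.
  return out.replace(/\s+/g, " ").trim();
}

// Fraction of characters removed (a proxy for token savings, not a token count).
function savedFraction(before, after) {
  return 1 - after.length / before.length;
}
```

Because the attribute pattern is applied to the whole string, it can also match attribute-like text inside page content; treat the sketch as a starting point rather than a production cleaner.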
"Converting raw, unstructured HTML into clean Markdown is the single most critical preprocessing step for robust Retrieval-Augmented Generation (RAG) pipelines." — AI Integration Report (2024)
🛠️ How it Works
- Input: The Actor accepts a JSON payload containing the webpage `url` and an optional `geminiApiKey`.
- Fetch: Node.js native fetch retrieves the target page's raw HTML.
- Clean: The cleaning algorithm strips boilerplate elements and tracking scripts.
- Transform: The clean content is sent to Google Gemini with instructions to output semantic Markdown.
- Output: The structured JSON is saved to the dataset and returned.
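The steps above can be sketched end to end as follows. This is a simplified outline under stated assumptions, not the Actor's source: every function name is hypothetical, the system prompt text is invented for illustration, `cleanHtml` is only a placeholder, and the request shape follows the public Gemini REST `generateContent` endpoint as commonly documented. Writing the result to the Apify dataset is left out.

```javascript
// Sketch of the pipeline with injectable steps (all names are hypothetical).
const GEMINI_URL =
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-lite:generateContent";

// Fetch: Node.js 18+ ships a global fetch.
async function fetchHtml(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`Fetch failed with HTTP ${res.status}`);
  return res.text();
}

// Clean: placeholder that only drops <script> and <style> elements.
function cleanHtml(html) {
  return html.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "").trim();
}

// Transform: build the Gemini request. The prompt wording is an assumption.
function buildGeminiRequest(cleanedHtml, apiKey) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-goog-api-key": apiKey },
    body: JSON.stringify({
      systemInstruction: { parts: [{ text: "Convert the HTML to semantic Markdown." }] },
      contents: [{ role: "user", parts: [{ text: cleanedHtml }] }],
      generationConfig: { temperature: 0.1 },
    }),
  };
}

async function toMarkdown(cleanedHtml, apiKey) {
  const res = await fetch(GEMINI_URL, buildGeminiRequest(cleanedHtml, apiKey));
  if (!res.ok) throw new Error(`Gemini failed with HTTP ${res.status}`);
  const data = await res.json();
  return {
    markdown: data.candidates?.[0]?.content?.parts?.[0]?.text ?? "",
    tokens: data.usageMetadata?.totalTokenCount ?? 0,
  };
}

// Output: assemble a record with the field names shown in the example output.
async function extract(input, deps = { fetchHtml, cleanHtml, toMarkdown }) {
  const html = await deps.fetchHtml(input.url);
  const cleaned = deps.cleanHtml(html);
  const { markdown, tokens } = await deps.toMarkdown(cleaned, input.geminiApiKey);
  return {
    estado: "success",
    url_procesada: input.url,
    tokens_utilizados: tokens,
    markdown_limpio: markdown,
    timestamp: new Date().toISOString(),
  };
}
```

Passing the steps in through `deps` keeps the orchestration testable offline: stub `fetchHtml` and `toMarkdown` and the function runs without network access or an API key.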
Example Input
{
  "url": "https://news.ycombinator.com/",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}
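One way to send this input from your own code is the Apify REST API's synchronous run endpoint. The sketch below is an illustration under assumptions: `ACTOR_ID` is a placeholder rather than this Actor's real identifier, and `buildRunRequest` and `runActor` are hypothetical helper names.

```javascript
// Placeholder only: replace with the Actor's real ID from its Apify page.
const ACTOR_ID = "<username>~<actor-name>";

// Build the HTTP request for a synchronous run that returns dataset items.
function buildRunRequest(actorId, apifyToken, input) {
  return {
    url: `https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/run-sync-get-dataset-items`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apifyToken}`,
      },
      body: JSON.stringify(input),
    },
  };
}

// Run the Actor and return the dataset items it produced.
async function runActor(actorId, apifyToken, input) {
  const { url, options } = buildRunRequest(actorId, apifyToken, input);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Apify run failed with HTTP ${res.status}`);
  return res.json(); // array of dataset items
}
```

A call would then look like `runActor(ACTOR_ID, process.env.APIFY_TOKEN, { url: "https://news.ycombinator.com/" })`, where the environment variable name is also an assumption.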
Example Output
{
  "estado": "success",
  "url_procesada": "https://news.ycombinator.com/",
  "tokens_utilizados": 240,
  "markdown_limpio": "# Hacker News\n\n* [Article 1](url) | 120 points\n* [Article 2](url) | 85 points",
  "timestamp": "2026-06-09T08:18:00.000Z"
}
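For RAG ingestion, a record like this usually needs to be split into smaller units before embedding. The following is a minimal sketch of one possible approach, splitting `markdown_limpio` at Markdown headings; `splitIntoSections` is a hypothetical helper and the section shape is an assumption, not part of the Actor's output.

```javascript
// Hypothetical consumer: validate one dataset item and split it into sections.
function splitIntoSections(record) {
  if (record.estado !== "success" || typeof record.markdown_limpio !== "string") {
    throw new Error(`Unusable record for ${record.url_procesada}`);
  }
  const sections = [];
  let current = null;
  for (const line of record.markdown_limpio.split("\n")) {
    const heading = /^(#{1,6})\s+(.*)$/.exec(line);
    if (heading) {
      // A heading starts a new section.
      current = { title: heading[2].trim(), source: record.url_procesada, lines: [] };
      sections.push(current);
    } else if (line.trim() !== "") {
      // Content before any heading goes into an untitled section.
      if (!current) {
        current = { title: "", source: record.url_procesada, lines: [] };
        sections.push(current);
      }
      current.lines.push(line);
    }
  }
  return sections.map(({ title, source, lines }) => ({
    title,
    source,
    text: lines.join("\n"),
  }));
}
```

Applied to the example output above, this yields a single section titled "Hacker News" whose text is the two list items. Lines starting with `#` inside fenced code would be misread as headings, so a real splitter would need to track fences.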
💰 Monetization & PPE Billing
This Actor uses Pay-Per-Event (PPE) billing:
- You are charged 1 event credit only for successful extractions that return valid Markdown.
- If scraping fails, the target URL returns a 5xx error, or the LLM fails to process the content, you are not charged.
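To estimate spend for a batch, the rule above can be expressed as a small calculation. This sketch makes several assumptions: it uses the listed starting price of $10.00 per 1,000 successful extractions, it treats "valid Markdown" as a non-empty `markdown_limpio` string, and the `"error"` value used for failed records in the usage note is invented, since the source does not show a failure record.

```javascript
// Assumption: the listed "from $10.00 / 1,000" price applies to every event.
const PRICE_PER_1000_USD = 10;

// A result is billable only if it succeeded and returned non-empty Markdown.
function isBillable(result) {
  return (
    result.estado === "success" &&
    typeof result.markdown_limpio === "string" &&
    result.markdown_limpio.trim().length > 0
  );
}

// Count billable events and convert them to an estimated cost in USD.
function estimateCostUsd(results) {
  const events = results.filter(isBillable).length;
  return { events, usd: (events * PRICE_PER_1000_USD) / 1000 };
}
```

For a batch with two successful records and one failed record (for example `{ estado: "error" }`), `estimateCostUsd` counts 2 events, which at this price works out to $0.02.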
🔍 SEO & GEO Structured Metadata (JSON-LD)
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "SoftwareApplication",
      "name": "AI Web-to-Markdown Extract API",
      "description": "Apify Actor to extract structured Markdown and clean JSON from web URLs using Google Gemini 2.5 Flash Lite.",
      "applicationCategory": "DeveloperApplication",
      "operatingSystem": "Cross-platform"
    },
    {
      "@type": "FAQPage",
      "mainEntity": [
        {
          "@type": "Question",
          "name": "How do I convert a URL into clean Markdown for LLMs on Apify?",
          "acceptedAnswer": {
            "@type": "Answer",
            "text": "By executing the AI Web-to-Markdown Extract Actor, which scrapes the target URL, removes styling and layout clutter, and converts the content into structured Markdown using Gemini."
          }
        }
      ]
    }
  ]
}