RAG Markdown Cleaner

Transform web pages into RAG-ready Markdown with smart chunking, metadata, code detection & quality scoring. Production-tested deduplication. Fully open-source (Apache 2.0)—review code, contribute, or self-host. Turn messy HTML into embedding-ready knowledge instantly.

Pricing: from $2.00 / 1,000 markdown results

Developer: Mohamed khalil Zouitni (Maintained by Community)

Web to Markdown Converter for RAG

Convert web pages into clean, token-efficient Markdown optimized for Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems, with features built for production use.

🚀 Features

Core Extraction

  • Clean Content Extraction: Uses Mozilla's Readability algorithm to extract only the main article content, removing navigation, footers, ads, and other clutter
  • Markdown Conversion: Converts HTML to clean, readable Markdown format
  • Smart Link Resolution: Automatically converts relative URLs to absolute URLs for functional references
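
A minimal sketch of how these steps map onto the actor's listed dependencies (readability-lxml, markdownify, lxml); `html_to_markdown` is an illustrative name, not the actor's actual source:

```python
import lxml.html
from markdownify import markdownify as md
from readability import Document  # provided by readability-lxml

def html_to_markdown(html: str, base_url: str) -> tuple[str, str]:
    doc = Document(html)                        # Readability: keep only the main article
    tree = lxml.html.fromstring(doc.summary())  # cleaned HTML fragment
    tree.make_links_absolute(base_url)          # relative -> absolute URLs
    clean_html = lxml.html.tostring(tree, encoding="unicode")
    return doc.title(), md(clean_html, heading_style="ATX")
```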

RAG Optimization

  • Smart Semantic Chunking: Automatically splits content into embedding-friendly chunks (configurable size, default 1000 chars; see the sketch after this list)
    • Preserves heading hierarchy as context breadcrumbs
    • Configurable overlap between chunks for better continuity
    • Token estimates for each chunk
  • Metadata Extraction: Captures author, publish date, keywords, description, language, and content type
  • Code Block Detection: Identifies and extracts code blocks with language detection
    • Separate tracking of fenced code blocks and inline code
    • Line counts and language identification
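
A minimal sketch of the chunking described above; `chunk_text` and its (heading_context, text) input format are assumptions, not the actor's exact algorithm:

```python
def chunk_text(paragraphs, chunk_size=1000, overlap=100):
    """paragraphs: (heading_context, text) pairs in reading order."""
    chunks, buf, ctx = [], "", ""

    def flush():
        chunks.append({
            "chunk_id": len(chunks) + 1,
            "content": buf,
            "heading_context": ctx,
            "char_count": len(buf),
            "estimated_tokens": len(buf) // 4,  # 1 token ≈ 4 chars
        })

    for heading_context, text in paragraphs:
        if buf and len(buf) + len(text) + 2 > chunk_size:
            flush()
            buf = buf[-overlap:]  # carry the tail into the next chunk
        ctx = heading_context
        buf = f"{buf}\n\n{text}" if buf else text
    if buf:
        flush()
    return chunks
```

Each emitted dict mirrors the chunk objects shown in the Output section below.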

Quality & Deduplication

  • Content Quality Metrics:
    • Text density, word count, reading time
    • Paragraph and sentence analysis
    • Structure scoring (presence of lists, headings, links)
  • Deduplication Hashing: SHA256 content hashing and similarity hashing for duplicate detection (sketched after this list)
  • Production Ready: Docker-based, robust error handling, deployable to Apify platform
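
A rough sketch of the two hashes; the exact normalization behind the similarity hash is not documented, so treat it as an assumption:

```python
import hashlib
import re

def content_hashes(markdown: str) -> dict:
    # Exact-duplicate detection: SHA256 over the raw Markdown.
    exact = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    # Near-duplicate detection: hash a normalized form so copies that differ
    # only in whitespace, case, or punctuation collide.
    normalized = re.sub(r"\W+", " ", markdown.lower()).strip()
    similar = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return {"content_hash": exact, "similarity_hash": similar}
```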

📋 Use Cases

  • Building RAG knowledge bases from web content with optimal chunk sizes
  • Preparing training data for LLMs with quality metrics
  • Creating clean documentation from web pages with preserved structure
  • Batch processing articles with deduplication
  • Web content archival in Markdown format with full metadata

🎯 Input

The actor accepts the following input parameters:

start_urls (required)

Array of URL objects to scrape and convert.

Example:

```json
[
  { "url": "https://example.com/article1" },
  { "url": "https://example.com/article2" }
]
```

📤 Output

The actor pushes results to the default Apify Dataset. Each scraped URL produces a comprehensive object:

```json
{
  "url": "https://example.com/article",
  "title": "Article Title",
  "markdown_content": "# Article Title\n\nClean markdown content...",
  "chunks": [
    {
      "chunk_id": 1,
      "content": "First semantic chunk...",
      "heading_context": "Introduction > Background",
      "char_count": 850,
      "estimated_tokens": 212
    }
  ],
  "metadata": {
    "url": "https://example.com/article",
    "domain": "example.com",
    "scraped_at": "2025-12-11T01:23:45.123Z",
    "author": "John Doe",
    "publish_date": "2024-01-15",
    "last_modified": "2024-03-20",
    "language": "en",
    "keywords": ["web scraping", "python", "rag"],
    "description": "Article description from meta tags",
    "content_type": "blog"
  },
  "code_blocks": {
    "fenced_blocks": [
      {
        "language": "python",
        "code": "def example():\n    pass",
        "lines": 2
      }
    ],
    "inline_code_count": 15,
    "has_code": true
  },
  "quality_metrics": {
    "text_density": 0.75,
    "paragraph_count": 25,
    "word_count": 1500,
    "sentence_count": 85,
    "avg_sentence_length": 18.5,
    "reading_time_minutes": 8,
    "has_lists": true,
    "has_headings": true,
    "has_links": true,
    "structure_score": 1.0
  },
  "hashes": {
    "content_hash": "a1b2c3d4e5f6...",
    "similarity_hash": "1234567890abcdef"
  },
  "total_chunks": 3,
  "total_chars": 2500,
  "estimated_tokens": 625
}
```

Output Fields

Basic Info:

  • url: The original URL that was scraped
  • title: The extracted page title
  • markdown_content: The main content converted to Markdown format (with absolute URLs)

Chunks (for RAG):

  • chunks: Array of semantic chunks optimized for embeddings
    • chunk_id: Sequential identifier
    • content: The chunk text
    • heading_context: Breadcrumb trail of headings (e.g., "Chapter 1 > Section 2")
    • char_count: Character count
    • estimated_tokens: Approximate token count (1 token ≈ 4 chars)

Metadata:

  • domain: Extracted domain name
  • scraped_at: ISO timestamp of scraping
  • author: Page author (from meta tags or JSON-LD)
  • publish_date: Publication date
  • last_modified: Last modification date
  • language: Two-letter language code
  • keywords: Array of keywords from meta tags
  • description: Page description
  • content_type: Detected type (blog, documentation, wiki, product, general)
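
A simplified illustration of meta-tag extraction with lxml (the actor also reads JSON-LD and classifies content_type, which this sketch omits):

```python
import lxml.html

def extract_meta(html: str) -> dict:
    tree = lxml.html.fromstring(html)

    def meta(name: str):
        values = tree.xpath(f"//meta[@name='{name}']/@content")
        return values[0] if values else None

    keywords = meta("keywords") or ""
    return {
        "author": meta("author"),
        "description": meta("description"),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        "language": tree.get("lang"),  # two-letter code from <html lang="...">
    }
```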

Code Blocks:

  • fenced_blocks: Array of code blocks with language and line count
  • inline_code_count: Number of inline code snippets
  • has_code: Boolean indicating presence of code
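
An illustrative regex-based detector for these fields (assumed, not the actor's actual implementation):

```python
import re

FENCE = re.compile(r"```(\w+)?\n(.*?)```", re.DOTALL)
INLINE = re.compile(r"(?<!`)`[^`\n]+`(?!`)")

def detect_code(markdown: str) -> dict:
    fenced = [
        {
            "language": lang or "unknown",
            "code": code.rstrip("\n"),
            "lines": code.rstrip("\n").count("\n") + 1,
        }
        for lang, code in FENCE.findall(markdown)
    ]
    inline_count = len(INLINE.findall(markdown))
    return {
        "fenced_blocks": fenced,
        "inline_code_count": inline_count,
        "has_code": bool(fenced) or inline_count > 0,
    }
```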

Quality Metrics:

  • text_density: Ratio of text to HTML (higher = cleaner content)
  • word_count: Total words
  • reading_time_minutes: Estimated reading time
  • structure_score: 0-1 score based on presence of lists, headings, and links
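
The structure_score formula is not documented; one plausible reading, consistent with the 1.0 in the example output where all three flags are true, is an equal-weight average (an assumption):

```python
def structure_score(has_lists: bool, has_headings: bool, has_links: bool) -> float:
    # Hypothetical equal-weight average of the three structure signals.
    return (has_lists + has_headings + has_links) / 3
```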

Deduplication:

  • content_hash: SHA256 hash for exact duplicate detection
  • similarity_hash: Normalized hash for near-duplicate detection

💡 Tips

  • Chunking: The default chunk size is 1,000 characters with a 100-character overlap, which suits most embedding models
  • Rate Limiting: For large batches, consider adding delays between requests
  • Custom Headers: The actor includes a User-Agent header to avoid blocks
  • Timeout: Default timeout is 30 seconds per URL
  • Error Handling: Failed URLs are logged but don't stop the actor

📝 Example Use in RAG Pipeline

```python
# After running the actor, use the output in your RAG system
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_TOKEN')
run = client.actor('YOUR_ACTOR_ID').call(run_input={
    'start_urls': [
        {'url': 'https://docs.example.com/page1'},
        {'url': 'https://docs.example.com/page2'}
    ],
    'include_links': False
})

seen_hashes = set()  # initialize the duplicate tracker before the loop

# Get dataset items with all enhancements
for item in client.dataset(run['defaultDatasetId']).iterate_items():
    # Filter by quality
    if item['quality_metrics']['structure_score'] < 0.5:
        continue  # Skip low-quality content

    # Check for duplicates using similarity hash
    if item['hashes']['similarity_hash'] in seen_hashes:
        continue  # Skip duplicate
    seen_hashes.add(item['hashes']['similarity_hash'])

    # Process each chunk for embeddings
    for chunk in item['chunks']:
        # Use heading context for better retrieval
        context = f"{item['metadata']['domain']} - {chunk['heading_context']}"
        # Embed with your preferred model
        embedding = embed_model.encode(chunk['content'])
        # Store in vector database with metadata
        vector_db.upsert(
            id=f"{item['url']}#{chunk['chunk_id']}",
            vector=embedding,
            metadata={
                'url': item['url'],
                'title': item['title'],
                'context': context,
                'content_type': item['metadata']['content_type'],
                'has_code': item['code_blocks']['has_code'],
                'chunk_tokens': chunk['estimated_tokens']
            }
        )
```
Run locally

```bash
apify run
```

Testing Locally

Create a .actor/INPUT.json file:

```json
{
  "start_urls": [
    { "url": "https://example.com" }
  ],
  "include_links": true
}
```

Then run:

```bash
python src/main.py
```

🚢 Deployment

Deploy to Apify

  1. Install the Apify CLI:

```bash
npm install -g apify-cli
```

  2. Log in to Apify:

```bash
apify login
```

  3. Push the actor to Apify:

```bash
apify push
```

Docker Build

To build the Docker image locally:

```bash
docker build -t web-to-markdown .
```

📦 Dependencies

  • apify: Apify SDK for Python
  • requests: HTTP library for fetching web pages
  • readability-lxml: Extract main content from HTML
  • markdownify: Convert HTML to Markdown
  • lxml: XML/HTML processing library

🔧 Configuration

The actor is configured via:

  • .actor/input_schema.json: Defines the input UI in Apify Console
  • Dockerfile: Container configuration
  • requirements.txt: Python dependencies

🐛 Troubleshooting

No output produced:

  • Check that URLs are accessible and return HTML content
  • Verify input format matches the schema

Markdown quality issues:

  • Some pages may have poor HTML structure
  • Readability works best with article-style content

Timeout errors:

  • Increase timeout in requests.get() if needed
  • Check network connectivity
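
If you self-host the actor, the fetch timeout can be raised; this snippet is illustrative, and the exact call site in src/main.py is an assumption:

```python
import requests

url = "https://example.com/slow-page"
# Raise the per-URL timeout above the 30-second default mentioned in Tips.
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=60)
```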

📄 License

This project is open source under the Apache 2.0 license and available for use in RAG and LLM applications.

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📧 Support

For issues or questions, please open a GitHub issue or contact via Apify platform.