1"""Apify Actor for converting web pages to clean Markdown for LLMs and RAG.
2
3This Actor scrapes web pages and converts them into clean, token-efficient Markdown
4optimized for Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) systems.
5
6To build Apify Actors, utilize the Apify SDK toolkit, read more at the official documentation:
7https://docs.apify.com/sdk/python
8"""

from __future__ import annotations

import hashlib
import json
import re
from datetime import datetime
from typing import Any, Dict, List
from urllib.parse import urljoin, urlparse

from apify import Actor
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from markdownify import markdownify as md
from readability import Document


def estimate_tokens(text: str) -> int:
    """Estimate token count (rough approximation: 1 token ≈ 4 chars)."""
    return len(text) // 4
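# Illustrative only: the 4-characters-per-token heuristic is a rough average for
# English prose; a real tokenizer (e.g. tiktoken) will give different counts.
#     estimate_tokens("word " * 80)   # 400 chars -> 100 estimated tokens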


def create_chunks(markdown: str, max_chunk_size: int = 1000, overlap: int = 100) -> List[Dict[str, Any]]:
    """
    Split markdown content into semantic chunks while preserving context.

    Args:
        markdown: The markdown content to chunk
        max_chunk_size: Maximum characters per chunk
        overlap: Number of characters to overlap between chunks

    Returns:
        List of chunk dictionaries with metadata
    """
    chunks = []

    # Split on ATX headings (#, ##, ...) while keeping the heading lines as their own sections.
    sections = re.split(r'(\n#{1,6}\s+.+\n)', markdown)

    current_chunk = ""
    current_heading_context = []
    chunk_id = 1

    for section in sections:
        # Is this section one of the headings captured by the split above?
        heading_match = re.match(r'\n(#{1,6})\s+(.+)\n', section)

        if heading_match:
            level = len(heading_match.group(1))
            heading_text = heading_match.group(2).strip()

            # Trim the heading breadcrumb to the parent levels, then append the new heading.
            current_heading_context = current_heading_context[:level - 1]
            current_heading_context.append(heading_text)

            # Keep the heading in the current chunk if it still fits.
            if len(current_chunk) + len(section) < max_chunk_size:
                current_chunk += section
            else:
                # Flush the current chunk before starting a new one at this heading.
                if current_chunk.strip():
                    chunks.append({
                        'chunk_id': chunk_id,
                        'content': current_chunk.strip(),
                        'heading_context': ' > '.join(current_heading_context[:-1]) if len(current_heading_context) > 1 else '',
                        'char_count': len(current_chunk),
                        'estimated_tokens': estimate_tokens(current_chunk)
                    })
                    chunk_id += 1

                # Carry over the tail of the previous chunk so context survives the boundary.
                if overlap > 0 and current_chunk:
                    overlap_text = current_chunk[-overlap:]
                    current_chunk = overlap_text + section
                else:
                    current_chunk = section
        else:
            # Plain content between headings; skip empty fragments.
            if not section.strip():
                continue

            # Flush when adding this section would exceed the chunk size.
            if len(current_chunk) + len(section) > max_chunk_size:
                if current_chunk.strip():
                    chunks.append({
                        'chunk_id': chunk_id,
                        'content': current_chunk.strip(),
                        'heading_context': ' > '.join(current_heading_context),
                        'char_count': len(current_chunk),
                        'estimated_tokens': estimate_tokens(current_chunk)
                    })
                    chunk_id += 1

                if overlap > 0 and current_chunk:
                    overlap_text = current_chunk[-overlap:]
                    current_chunk = overlap_text + section
                else:
                    current_chunk = section
            else:
                current_chunk += section

    # Flush whatever is left after the last section.
    if current_chunk.strip():
        chunks.append({
            'chunk_id': chunk_id,
            'content': current_chunk.strip(),
            'heading_context': ' > '.join(current_heading_context),
            'char_count': len(current_chunk),
            'estimated_tokens': estimate_tokens(current_chunk)
        })

    return chunks
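# Sketch of the returned shape (values are illustrative, not from a real run):
#     [{'chunk_id': 1,
#       'content': '## Install\nRun the installer ...',
#       'heading_context': '',
#       'char_count': 48,
#       'estimated_tokens': 12}, ...]
# heading_context carries the parent-heading breadcrumb (e.g. 'Guide > Install'),
# so each chunk can be embedded for RAG without losing its place in the document.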


def extract_metadata(soup, url: str) -> Dict[str, Any]:
    """Extract metadata from the page."""
    metadata = {
        'url': url,
        'domain': urlparse(url).netloc,
        'scraped_at': datetime.utcnow().isoformat() + 'Z',
    }

    # Author: prefer a <meta name="author"> tag, then fall back to JSON-LD.
    author = None
    author_meta = soup.find('meta', {'name': re.compile(r'author', re.I)})
    if author_meta:
        author = author_meta.get('content')
    if not author:
        json_ld = soup.find('script', {'type': 'application/ld+json'})
        if json_ld and json_ld.string:
            try:
                data = json.loads(json_ld.string)
                if isinstance(data, dict):
                    author = data.get('author', {}).get('name')
            except (json.JSONDecodeError, AttributeError, TypeError):
                author = None
    metadata['author'] = author

    # Publish date from Open Graph / article meta tags.
    publish_date = None
    date_meta = soup.find('meta', {'property': 'article:published_time'}) or \
        soup.find('meta', {'name': re.compile(r'publish|date', re.I)})
    if date_meta:
        publish_date = date_meta.get('content')
    metadata['publish_date'] = publish_date

    # Last-modified date, if the page exposes one.
    modified_date = None
    modified_meta = soup.find('meta', {'property': 'article:modified_time'}) or \
        soup.find('meta', {'name': 'last-modified'})
    if modified_meta:
        modified_date = modified_meta.get('content')
    metadata['last_modified'] = modified_date

    # Language from the <html lang="..."> attribute, defaulting to English.
    lang = soup.find('html').get('lang', 'en') if soup.find('html') else 'en'
    metadata['language'] = lang[:2]

    # Keywords, capped at ten entries.
    keywords = []
    keywords_meta = soup.find('meta', {'name': re.compile(r'keywords', re.I)})
    if keywords_meta:
        keywords_content = keywords_meta.get('content', '')
        keywords = [k.strip() for k in keywords_content.split(',') if k.strip()]
    metadata['keywords'] = keywords[:10]

    # Description from the standard or Open Graph meta tag.
    desc_meta = soup.find('meta', {'name': 'description'}) or \
        soup.find('meta', {'property': 'og:description'})
    metadata['description'] = desc_meta.get('content') if desc_meta else None

    # Rough content-type classification based on common URL path segments.
    content_type = 'general'
    if '/blog/' in url or '/article/' in url or '/post/' in url:
        content_type = 'blog'
    elif '/docs/' in url or '/documentation/' in url:
        content_type = 'documentation'
    elif '/product/' in url or '/shop/' in url:
        content_type = 'product'
    elif '/wiki/' in url:
        content_type = 'wiki'
    metadata['content_type'] = content_type

    return metadata
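# Example (illustrative URL): for 'https://example.com/blog/why-rag', content_type
# is classified as 'blog' purely from the URL path; the page body is not inspected.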


def resolve_relative_links(markdown: str, base_url: str) -> str:
    """Convert relative URLs to absolute URLs in markdown links."""

    def replace_link(match):
        text = match.group(1)
        url = match.group(2)

        # Leave absolute URLs, fragments, mailto: and tel: links untouched.
        if url.startswith(('http://', 'https://', '#', 'mailto:', 'tel:')):
            return match.group(0)

        # Resolve everything else against the page URL.
        absolute_url = urljoin(base_url, url)
        return f'[{text}]({absolute_url})'

    # Rewrite every [text](url) style markdown link.
    markdown = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', replace_link, markdown)

    return markdown
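# Example (illustrative values):
#     resolve_relative_links('[docs](/docs/intro)', 'https://example.com/guide')
#     # -> '[docs](https://example.com/docs/intro)'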


def extract_code_blocks(markdown: str) -> Dict[str, Any]:
    """Extract fenced and inline code information from markdown."""
    code_blocks = []

    # Fenced blocks delimited by triple backticks, with an optional language tag.
    pattern = r'```(\w+)?\n(.*?)```'
    matches = re.findall(pattern, markdown, re.DOTALL)

    for lang, code in matches:
        code_blocks.append({
            'language': lang if lang else 'text',
            'code': code.strip(),
            'lines': len(code.strip().split('\n'))
        })

    # Inline code spans; fenced blocks are removed first so their backticks
    # are not double-counted as inline code.
    inline_pattern = r'`([^`]+)`'
    text_without_fences = re.sub(pattern, '', markdown, flags=re.DOTALL)
    inline_matches = re.findall(inline_pattern, text_without_fences)

    return {
        'fenced_blocks': code_blocks,
        'inline_code_count': len(inline_matches),
        'has_code': len(code_blocks) > 0 or len(inline_matches) > 0
    }
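# Sketch of the returned shape for a page with one fenced python block and two
# inline code spans (values illustrative):
#     {'fenced_blocks': [{'language': 'python', 'code': "print('hi')", 'lines': 1}],
#      'inline_code_count': 2,
#      'has_code': True}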


def calculate_quality_metrics(markdown: str, html_length: int) -> Dict[str, Any]:
    """Calculate content quality metrics."""
    # How much of the original HTML survived as text.
    text_density = len(markdown) / max(html_length, 1)

    # Count substantial paragraphs (plain lines longer than 50 characters).
    lines = [line.strip() for line in markdown.split('\n') if line.strip()]
    paragraphs = [line for line in lines if not line.startswith(('#', '-', '*', '>'))]
    paragraph_count = len([p for p in paragraphs if len(p) > 50])

    # Average sentence length in words.
    sentences = re.split(r'[.!?]+', markdown)
    sentences = [s.strip() for s in sentences if len(s.strip()) > 10]
    avg_sentence_length = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)

    # Reading time at roughly 200 words per minute.
    word_count = len(markdown.split())
    reading_time_minutes = max(1, round(word_count / 200))

    # Structural signals.
    has_lists = bool(re.search(r'^\s*[-*+]\s', markdown, re.MULTILINE))
    has_headings = bool(re.search(r'^#{1,6}\s', markdown, re.MULTILINE))
    has_links = bool(re.search(r'\[.+\]\(.+\)', markdown))

    return {
        'text_density': round(text_density, 2),
        'paragraph_count': paragraph_count,
        'word_count': word_count,
        'sentence_count': len(sentences),
        'avg_sentence_length': round(avg_sentence_length, 1),
        'reading_time_minutes': reading_time_minutes,
        'has_lists': has_lists,
        'has_headings': has_headings,
        'has_links': has_links,
        'structure_score': sum([has_lists, has_headings, has_links]) / 3.0
    }
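# Note: structure_score is the fraction of the three structural signals present,
# so a page with headings and links but no lists scores 2/3 ≈ 0.67.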


def generate_content_hashes(markdown: str) -> Dict[str, str]:
    """Generate hashes for deduplication."""
    # Exact-match hash over the raw markdown.
    content_hash = hashlib.sha256(markdown.encode('utf-8')).hexdigest()

    # Near-duplicate hash: lowercase and collapse whitespace before hashing.
    normalized = re.sub(r'\s+', ' ', markdown.lower().strip())
    similarity_hash = hashlib.sha256(normalized.encode('utf-8')).hexdigest()[:16]

    return {
        'content_hash': content_hash,
        'similarity_hash': similarity_hash
    }
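# Assumed downstream usage (not part of this Actor): similarity_hash can serve as a
# cheap near-duplicate key, since pages that differ only in whitespace or casing share it.
#     seen: set[str] = set()
#     if hashes['similarity_hash'] in seen:
#         ...  # skip near-duplicate page
#     seen.add(hashes['similarity_hash'])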


async def main() -> None:
    """Define the main entry point for the Apify Actor.

    This coroutine is executed with `asyncio.run()`, so it must remain an asynchronous
    function; asynchronous execution is required for communication with the Apify platform.
    """
    async with Actor:
        # Read the Actor input (or fall back to an empty dict).
        actor_input = await Actor.get_input() or {}

        start_urls = [
            url.get('url') for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])
        ]
        include_links = actor_input.get('include_links', True)
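        # Illustrative Actor input matching the fields read above (values assumed):
        #     {
        #         "start_urls": [{"url": "https://example.com/blog/post"}],
        #         "include_links": true
        #     }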

        # Exit early when no URLs were provided.
        if not start_urls:
            Actor.log.info('No URLs provided in start_urls, exiting...')
            await Actor.exit()
            return

        Actor.log.info(f'Processing {len(start_urls)} URLs')
        Actor.log.info(f'Include links: {include_links}')

        # Cap the crawl at the start URLs; the handler below never enqueues further links.
        crawler = BeautifulSoupCrawler(
            max_requests_per_crawl=len(start_urls),
        )

        # The default handler converts each page to Markdown and pushes one dataset item.
        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            url = context.request.url
            Actor.log.info(f'Scraping {url}...')
            Actor.log.info(f'Starting content extraction for {url}')

            try:
                soup = context.soup
                Actor.log.info(f'BeautifulSoup parsed, HTML length: {len(str(soup))} chars')

                # Remember the raw HTML size for the text-density metric.
                original_html_length = len(str(soup))

                title = soup.title.string.strip() if soup.title and soup.title.string else 'No Title'

                # Extract metadata before noisy elements are removed.
                metadata = extract_metadata(soup, url)

                # Drop structural elements that never carry main content.
                for element in soup(['script', 'style', 'nav', 'header', 'footer',
                                     'iframe', 'noscript', 'svg', 'button', 'form']):
                    element.decompose()

                # Remove common cookie banners, pop-ups and ads by class/id/aria-label.
                noise_selectors = [
                    '[class*="cookie"]', '[class*="banner"]', '[class*="popup"]',
                    '[class*="modal"]', '[id*="cookie"]', '[class*="ad-"]',
                    '[class*="advertisement"]', '[aria-label*="cookie"]'
                ]
                for selector in noise_selectors:
                    for element in soup.select(selector):
                        element.decompose()

                # Prefer readability's article extraction; fall back to the full body below.
                use_readability = True
                try:
                    html_content = str(soup)
                    doc = Document(html_content)
                    clean_html = doc.summary()

                    markdown_content = md(
                        clean_html,
                        heading_style='ATX',
                        bullets='-',
                        # markdownify accepts either `strip` or `convert`, not both,
                        # so anchors are stripped when links are not wanted.
                        strip=['script', 'style', 'img'] + ([] if include_links else ['a']),
                        escape_asterisks=False,
                        escape_underscores=False
                    )

                    # Readability sometimes returns only a fragment; treat that as a failure.
                    if len(markdown_content.strip()) < 300:
                        raise ValueError("Content too short, using full body")

                except Exception:
                    # Fall back to converting the whole <main>/<article>/<body>.
                    use_readability = False
                    Actor.log.info(f'Using full body content for {url}')

                    main_content = soup.find('main') or soup.find('article') or soup.body or soup

                    markdown_content = md(
                        str(main_content),
                        heading_style='ATX',
                        bullets='-',
                        strip=['script', 'style', 'img'] + ([] if include_links else ['a']),
                        escape_asterisks=False,
                        escape_underscores=False
                    )

                # Normalize the markdown line by line.
                lines = []
                prev_empty = False

                for line in markdown_content.split('\n'):
                    line = line.strip()

                    # Drop very short non-heading fragments, but keep blank lines
                    # so paragraph breaks survive the cleanup.
                    if line and len(line) < 3 and not line.startswith('#'):
                        continue

                    # Drop lines that are only punctuation/decoration.
                    if line and all(c in '.-_*[](){}|\\/' for c in line):
                        continue

                    # Collapse runs of blank lines into a single one.
                    if not line:
                        if not prev_empty:
                            lines.append('')
                        prev_empty = True
                    else:
                        lines.append(line)
                        prev_empty = False

                markdown_content = '\n'.join(lines).strip()

                # Make relative links absolute so they stay usable out of context.
                markdown_content = resolve_relative_links(markdown_content, url)

                # Collect code, quality and deduplication signals.
                code_info = extract_code_blocks(markdown_content)

                quality_metrics = calculate_quality_metrics(markdown_content, original_html_length)

                hashes = generate_content_hashes(markdown_content)

                # Split into RAG-friendly chunks.
                chunks = create_chunks(markdown_content, max_chunk_size=1000, overlap=100)

                # Quality gate: skip pages with too little extracted content.
                MIN_CONTENT_LENGTH = 100

                Actor.log.info(f'Extracted markdown length: {len(markdown_content)} chars')

                if len(markdown_content.strip()) < MIN_CONTENT_LENGTH:
                    Actor.log.warning(
                        f'⚠️ Skipping {url}: Content too short ({len(markdown_content)} chars). '
                        f'Minimum required: {MIN_CONTENT_LENGTH} chars. User not charged.'
                    )
                    return

                # Quality gate: skip short pages that look like error or JavaScript-wall pages.
                error_indicators = [
                    'enable javascript',
                    'javascript is disabled',
                    'please enable cookies',
                    'access denied',
                    '403 forbidden',
                    '404 not found',
                    'page not found'
                ]
                content_lower = markdown_content.lower()
                if any(indicator in content_lower for indicator in error_indicators):
                    if len(markdown_content) < 500:
                        Actor.log.warning(
                            f'⚠️ Skipping {url}: Appears to be an error page or requires JavaScript. User not charged.'
                        )
                        return

                Actor.log.info(f'✅ Content passed quality checks for {url}')

                # Prepend the source URL so the markdown is self-describing.
                final_content = f"**Source:** {url}\n\n---\n\n{markdown_content}"

                # Assemble the dataset item.
                data = {
                    'url': url,
                    'title': title,
                    'markdown_content': final_content,
                    'chunks': chunks,
                    'metadata': metadata,
                    'code_blocks': code_info,
                    'quality_metrics': quality_metrics,
                    'hashes': hashes,
                    'total_chunks': len(chunks),
                    'total_chars': len(markdown_content),
                    'estimated_tokens': estimate_tokens(markdown_content)
                }

                method = "Readability" if use_readability else "Full Body"
                Actor.log.info(f'Successfully converted [{method}]: {title} ({len(final_content)} chars, {len(chunks)} chunks)')

                # Store the result in the default dataset.
                await context.push_data(data)

            except Exception:
                Actor.log.exception(f'Error processing {url}')

        await crawler.run(start_urls)


if __name__ == '__main__':
    import asyncio

    asyncio.run(main())