Markdown Header Text Splitter

Developed by CodePoetry
Maintained by Community

Split Markdown into structured chunks based on header hierarchy. Built with LangChain's MarkdownHeaderTextSplitter, it preserves header metadata on every chunk for RAG, documentation, and analysis use cases. Configure which header levels to split on, optionally strip headers from chunk content, and feed the results into vector databases. Ideal for AI workflows.

Rating: 0.0 (0 reviews)
Pricing: $20.00/month + usage
Total users: 2
Monthly users: 2
Runs succeeded: >99%
Last modified: a month ago
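
For a sense of how the Actor is used in practice, here is a minimal sketch of calling it from Python with the apify-client package and an input that matches .actor/INPUT_SCHEMA.json below. The actor ID ("codepoetry/markdown-splitter") and the APIFY_TOKEN environment variable name are illustrative assumptions, not taken from this listing.

import os

from apify_client import ApifyClient

# Authenticate with an Apify API token (the env var name is an assumption).
client = ApifyClient(os.environ["APIFY_TOKEN"])

# Start a run with an input matching .actor/INPUT_SCHEMA.json and wait for it to finish.
run = client.actor("codepoetry/markdown-splitter").call(
    run_input={
        "markdown_text": "# Header 1\nHello, World!\n## Header 2\nGoodbye, World!",
        "headers_to_split_on": ["#", "##"],
        "strip_headers": True,
    }
)

# The Actor pushes one dataset item whose "chunks" field holds the split documents.
items = client.dataset(run["defaultDatasetId"]).list_items().items
for chunk in items[0]["chunks"]:
    print(chunk["metadata"], chunk["content"])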

.dockerignore

.git
.mise.toml
.nvim.lua
storage
# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
.python-version
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/

.gitignore

.mise.toml
.nvim.lua
storage
# The rest is copied from https://github.com/github/gitignore/blob/main/Python.gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
.python-version
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
# Added by Apify CLI
node_modules

requirements.txt

annotated-types==0.7.0
anyio==4.8.0
apify==2.4.0
apify_client==1.9.2
apify_shared==1.3.1
Brotli==1.1.0
browserforge==1.2.3
cachetools==5.5.2
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
colorama==0.4.6
crawlee==0.6.3
cryptography==44.0.2
docutils==0.21.2
eval_type_backport==0.2.2
filelock==3.17.0
greenlet==3.1.1
h11==0.14.0
h2==4.2.0
hpack==4.1.0
httpcore==1.0.7
httpx==0.28.1
hyperframe==6.1.0
idna==3.10
jsonpatch==1.33
jsonpointer==3.0.0
langchain==0.3.20
langchain-core==0.3.43
langchain-text-splitters==0.3.6
langsmith==0.3.13
lazy-object-proxy==1.10.0
markdown-it-py==3.0.0
mdurl==0.1.2
more-itertools==10.6.0
multidict==6.1.0
orjson==3.10.15
packaging==24.2
propcache==0.3.0
psutil==7.0.0
pycparser==2.22
pydantic==2.10.6
pydantic-settings==2.6.1
pydantic_core==2.27.2
pyee==12.1.1
Pygments==2.19.1
python-dotenv==1.0.1
PyYAML==6.0.2
requests==2.32.3
requests-file==2.1.0
requests-toolbelt==1.0.0
rich==13.9.4
setuptools==76.0.0
sniffio==1.3.1
sortedcollections==2.1.0
sortedcontainers==2.4.0
SQLAlchemy==2.0.38
tenacity==9.0.0
tldextract==5.1.3
typing_extensions==4.12.2
urllib3==2.3.0
websockets==15.0.1
wheel==0.45.1
yarl==1.18.3
zstandard==0.23.0

.actor/Dockerfile

# First, specify the base Docker image.
# You can see the Docker images from Apify at https://hub.docker.com/r/apify/.
# You can also use any other image from Docker Hub.
FROM apify/actor-python:3.13
# Second, copy just requirements.txt into the Actor image,
# since it should be the only file that affects the dependency install in the next step,
# in order to speed up the build
COPY requirements.txt ./
# Install the packages specified in requirements.txt,
# Print the installed Python version, pip version
# and all installed packages with their versions for debugging
RUN echo "Python version:" \
 && python --version \
 && echo "Pip version:" \
 && pip --version \
 && echo "Installing dependencies:" \
 && pip install -r requirements.txt \
 && echo "All installed Python packages:" \
 && pip freeze
# Next, copy the remaining files and directories with the source code.
# Since we do this after installing the dependencies, quick build will be really fast
# for most source file changes.
COPY . ./
# Use compileall to ensure the runnability of the Actor Python code.
RUN python3 -m compileall -q .
# Create and run as a non-root user.
RUN useradd --create-home apify && \
    chown -R apify:apify ./ && \
    chown -R apify:apify /usr/local/lib/python*
USER apify
# Specify how to launch the source code of your Actor.
# By default, the "python3 -m src" command is run
CMD ["python3", "-m", "src"]

.actor/INPUT_SCHEMA.json

{
    "title": "Markdown Splitter",
    "description": "Splits Markdown text into chunks based on headers",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "markdown_text": {
            "title": "Markdown Text",
            "type": "string",
            "description": "The Markdown content to process",
            "editor": "textarea",
            "default": "# Header 1\nHello, World!\n## Header 2\nGoodbye, World!"
        },
        "headers_to_split_on": {
            "title": "Headers to Split On",
            "type": "array",
            "description": "Header levels (e.g., ['#', '##'])",
            "editor": "stringList",
            "default": ["#", "##", "###", "####", "#####", "######"]
        },
        "strip_headers": {
            "title": "Strip Headers",
            "type": "boolean",
            "description": "Remove headers from chunk content",
            "default": true
        }
    },
    "required": ["markdown_text"]
}

.actor/actor.json

{
    "actorSpecification": 1,
    "name": "markdown-splitter",
    "title": "Markdown Header Text Splitter",
    "description": "Splits Markdown documents into chunks based on header hierarchy",
    "version": "1.0",
    "buildTag": "latest",
    "dockerfile": "./Dockerfile",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "fields": {
                "type": "object",
                "properties": {
                    "chunks": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "content": { "type": "string" },
                                "metadata": { "type": "object" }
                            }
                        }
                    },
                    "error": { "type": "string" }
                },
                "required": ["chunks"]
            }
        }
    }
}
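
Given that dataset schema, a successful run on the default example input would push an item roughly like the one below. The values are illustrative (a sketch of the expected MarkdownHeaderTextSplitter output with strip_headers enabled), not taken from an actual run.

# Illustrative dataset item for the default example input (not an actual run result).
example_item = {
    "chunks": [
        {
            "content": "Hello, World!",
            "metadata": {"Header 1": "Header 1"},
        },
        {
            "content": "Goodbye, World!",
            "metadata": {"Header 1": "Header 1", "Header 2": "Header 2"},
        },
    ]
}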

src/__init__.py


src/__main__.py

import asyncio

from .main import main

# Execute the Actor entry point.
asyncio.run(main())

src/main.py

from apify import Actor
from langchain.text_splitter import MarkdownHeaderTextSplitter
from .models import MarkdownSplitterInput, MarkdownSplitterOutput


async def main():
    async with Actor:
        try:
            input_data = await Actor.get_input() or {}
            params = MarkdownSplitterInput(**input_data)

            # Convert headers to LangChain format
            headers = []
            for h in params.headers_to_split_on:
                level = h.count("#")
                headers.append((h, f"Header {level}"))

            # Split Markdown
            splitter = MarkdownHeaderTextSplitter(
                headers_to_split_on=headers, strip_headers=params.strip_headers
            )
            documents = splitter.split_text(params.markdown_text)

            # Format output
            chunks = []
            for doc in documents:
                chunks.append({"content": doc.page_content, "metadata": doc.metadata})

            await Actor.push_data(
                MarkdownSplitterOutput(chunks=chunks).model_dump(exclude_none=True)
            )

        except Exception as e:
            error_msg = f"Actor failed: {str(e)}"
            Actor.log.error(error_msg)
            await Actor.fail(exception=e, status_message=error_msg)
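
The core splitting logic above can be tried locally without the Apify runtime. This is a small sketch assuming the langchain package pinned in requirements.txt; the sample text and header list are made up, and the header-name convention ("Header 1", "Header 2") mirrors the conversion loop in main().

from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = "# Intro\nSome intro text.\n## Details\nMore detail here."
header_levels = ["#", "##"]

# Same conversion as main(): "#" -> ("#", "Header 1"), "##" -> ("##", "Header 2"), ...
headers = [(h, f"Header {h.count('#')}") for h in header_levels]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers, strip_headers=True)
for doc in splitter.split_text(markdown_text):
    # Each chunk carries the headers it sits under as metadata,
    # e.g. {"Header 1": "Intro", "Header 2": "Details"}.
    print(doc.metadata, "->", doc.page_content)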

src/models.py

from pydantic import BaseModel, Field, field_validator
from typing import List, Optional


class MarkdownSplitterInput(BaseModel):
    markdown_text: str = Field(..., description="Markdown content to split")
    headers_to_split_on: List[str] = Field(
        default=["#", "##", "###", "####", "#####", "######"],
        description="Header levels to split on (e.g., ['#', '##'])",
    )
    strip_headers: bool = Field(True, description="Remove headers from chunks")

    @field_validator("headers_to_split_on")
    @classmethod
    def validate_headers(cls, v: List[str]) -> List[str]:
        for h in v:
            stripped = h.strip()
            if (
                not stripped.startswith("#")
                or not stripped.replace("#", "").strip() == ""
            ):
                raise ValueError(f"Invalid header format: {h}. Use '#', '##', etc.")
        return v


class MarkdownSplitterOutput(BaseModel):
    chunks: List[dict] = Field(..., description="Processed chunks with metadata")
    error: Optional[str] = Field(None, description="Error details")
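
The field_validator only accepts strings that are plain runs of '#' characters. A quick sketch of that behaviour, assuming pydantic v2 as pinned in requirements.txt and that the snippet runs from the project root so that src is importable:

from pydantic import ValidationError

from src.models import MarkdownSplitterInput

# Accepted: every entry is a plain run of '#'.
ok = MarkdownSplitterInput(markdown_text="# Title\nBody", headers_to_split_on=["#", "##"])
print(ok.headers_to_split_on)  # ['#', '##']

# Rejected: "h2" does not start with '#', so validate_headers raises.
try:
    MarkdownSplitterInput(markdown_text="# Title\nBody", headers_to_split_on=["#", "h2"])
except ValidationError as exc:
    print(exc.errors()[0]["msg"])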

src/py.typed