Japanese Text Normalizer — NFKC, kana, whitespace, sentences
Pricing: Pay per usage
Normalize Japanese text for data pipelines: Unicode NFKC (full/half-width unification), wave-dash unification, whitespace cleanup, hiragana/katakana conversion, Japanese-aware sentence splitting, and per-script character stats.
Developer: Shinobu Otani (maintained by Community)
Last modified: 3 days ago
Japanese Text Normalizer
Clean and normalize Japanese text for search indexes, datasets, and LLM pipelines — deterministic, instant, no LLM cost.
What it does
- Unicode NFKC: full-width alphanumerics → ASCII (Ｃｌａｕｄｅ → Claude), half-width katakana → full-width (ｶﾞｲﾄﾞ → ガイド)
- Wave-dash unification: ～ (U+FF5E) → 〜 (U+301C), without touching real ASCII tildes in paths/URLs
- Whitespace cleanup: collapses space runs (including ideographic spaces), trims line ends, collapses 3+ blank lines, normalizes CRLF
- Kana conversion: hiragana ↔ katakana (optional)
- Sentence segmentation: Japanese-aware (。！？ with closing-quote handling) plus Latin punctuation
- Character statistics: per-script counts (hiragana / katakana / kanji / ASCII / digits) before and after
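The first three steps in the list can be approximated with the Python standard library. This is a minimal sketch, not the Actor's actual code: the function name `normalize` is hypothetical, and reducing 3+ blank lines to a single blank line is an assumption, since the listing does not say what they collapse to. The full-width tilde is swapped before NFKC runs because NFKC would otherwise fold U+FF5E into an ASCII `~`.

```python
import re
import unicodedata

FULLWIDTH_TILDE = "\uFF5E"  # ～
WAVE_DASH = "\u301C"        # 〜


def normalize(text: str) -> str:
    """Hypothetical helper: NFKC + wave-dash unification + whitespace cleanup."""
    # Replace the full-width tilde first; NFKC would turn it into ASCII "~".
    # Real ASCII tildes (paths, URLs) are never touched.
    text = text.replace(FULLWIDTH_TILDE, WAVE_DASH)
    # NFKC folds full-width alphanumerics to ASCII, half-width katakana to
    # full-width, and the ideographic space (U+3000) to a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Normalize CRLF / lone CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces (ideographic spaces are already plain spaces here).
    text = re.sub(r"[ \u3000]+", " ", text)
    # Trim trailing whitespace on every line.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Assumption: 3+ blank lines (4+ newlines) collapse to one blank line.
    text = re.sub(r"\n{4,}", "\n\n", text)
    return text
```

For example, `normalize("10～20 ~/tmp")` keeps the ASCII tilde in the path while the range separator becomes U+301C.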
Input
{"texts": ["Ｃｌａｕｄｅ Ｃｏｄｅで開発する。「すごい」と思った。"],"kana": "none","split_sentences": true}
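The `kana` option corresponds to the optional hiragana ↔ katakana conversion. The listing only shows the value `"none"`; the mode names `"hiragana"` and `"katakana"` and the function `convert_kana` below are hypothetical, chosen for illustration. The sketch relies on the fact that the hiragana block (U+3041 to U+3096) and the matching katakana letters (U+30A1 to U+30F6) sit exactly 0x60 code points apart.

```python
# Translation tables built from the fixed 0x60 offset between the two scripts.
HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}
KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def convert_kana(text: str, mode: str = "none") -> str:
    """Hypothetical helper; mode names other than "none" are assumptions."""
    if mode == "katakana":
        return text.translate(HIRA_TO_KATA)
    if mode == "hiragana":
        return text.translate(KATA_TO_HIRA)
    # "none" (or any unknown mode) leaves the text unchanged.
    return text
```

Characters outside the mapped ranges, such as the long-vowel mark ー, kanji, and ASCII, pass through unchanged.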
Output (one dataset item per text)
{"text": "Claude Codeで開発する。「すごい」と思った。","changed": true,"sentences": ["Claude Codeで開発する。", "「すごい」と思った。"],"sentence_count": 2,"stats_before": {"hiragana": 8, "katakana": 0, "kanji": 4, "...": "..."},"stats_after": {"...": "..."}}
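The `sentences` and `stats_*` fields can be approximated as follows. This is a sketch under stated assumptions, not the Actor's implementation: the function names, the exact set of closing quotes, the rule that a Latin `.` ends a sentence only before whitespace or end of text, the Unicode ranges used per script, and the `ascii` / `digits` key names (ASCII counted as letters only) are all illustrative choices.

```python
TERMINATORS = set("。！？!?")
CLOSERS = set("」』）)")  # closing quotes/brackets that stay with the sentence


def _ends_sentence(text: str, i: int) -> bool:
    ch = text[i]
    if ch in TERMINATORS:
        return True
    # Assumption: "." ends a sentence only before whitespace or end of text,
    # so decimals such as "3.14" are not split.
    return ch == "." and (i + 1 == len(text) or text[i + 1].isspace())


def split_sentences(text: str) -> list[str]:
    sentences, start, i = [], 0, 0
    while i < len(text):
        if _ends_sentence(text, i):
            i += 1
            # Absorb repeated terminators and trailing closing quotes.
            while i < len(text) and (text[i] in TERMINATORS or text[i] in CLOSERS):
                i += 1
            sentence = text[start:i].strip()
            if sentence:
                sentences.append(sentence)
            start = i
        else:
            i += 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def char_stats(text: str) -> dict[str, int]:
    stats = {"hiragana": 0, "katakana": 0, "kanji": 0, "ascii": 0, "digits": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x309F:      # Hiragana block
            stats["hiragana"] += 1
        elif 0x30A0 <= cp <= 0x30FF:    # Katakana block
            stats["katakana"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs
            stats["kanji"] += 1
        elif ch.isascii() and ch.isdigit():
            stats["digits"] += 1
        elif ch.isascii() and ch.isalpha():
            stats["ascii"] += 1
    return stats
```

Run on the normalized example text, `split_sentences` yields the same two sentences as the `sentences` array in the output item; the counts from `char_stats` depend on the ranges assumed here and need not match the Actor's figures.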
Typical uses
- Preprocessing scraped Japanese text before indexing or embedding
- Unifying mixed full-width/half-width product data
- Sentence-level dataset construction from raw Japanese prose