Pricing

from $5.00 / 1,000 results

Japanese Text Normalizer — NFKC, kana, whitespace, sentences

Normalize Japanese text for data pipelines: Unicode NFKC (full/half-width unification), wave-dash unification, whitespace cleanup, hiragana/katakana conversion, Japanese-aware sentence splitting, and per-script character stats.

Developer

Shinobu Otani

Last modified

2 months ago

Japanese Text Normalizer

Clean and normalize Japanese text for search indexes, datasets, and LLM pipelines — deterministic, instant, no LLM cost.

What it does

Unicode NFKC: full-width alphanumerics → ASCII (Ｃｌａｕｄｅ → Claude), half-width katakana → full-width (ｶﾞｲﾄﾞ → ガイド)
Wave-dash unification: ～ (U+FF5E) → 〜 (U+301C), without touching real ASCII tildes in paths/URLs
Whitespace cleanup: collapses space runs (including ideographic spaces), trims line ends, collapses 3+ blank lines, normalizes CRLF
Kana conversion: hiragana ↔ katakana (optional)
Sentence segmentation: Japanese-aware (。！？ with closing-quote handling) plus Latin punctuation
Character statistics: per-script counts (hiragana / katakana / kanji / ASCII / digits) before and after normalization
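
The normalization steps listed above can be approximated with Python's standard library. The following is a minimal sketch, not the actor's actual implementation: the function name `normalize_ja`, the `kana` values "hiragana" and "katakana", the order of the steps, and the choice to reduce runs of 3+ blank lines to two blank lines are all assumptions made for illustration.

```python
import re
import unicodedata

FULLWIDTH_TILDE = "\uFF5E"  # ～
WAVE_DASH = "\u301C"        # 〜

# Hiragana (U+3041..U+3096) and katakana (U+30A1..U+30F6) sit 0x60 apart.
HIRA_TO_KATA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}
KATA_TO_HIRA = {cp: cp - 0x60 for cp in range(0x30A1, 0x30F7)}


def normalize_ja(text: str, kana: str = "none") -> str:
    """Hypothetical helper approximating the listed normalization steps."""
    # Swap the full-width tilde for a wave dash before NFKC runs;
    # NFKC on its own would turn U+FF5E into an ASCII "~".
    text = text.replace(FULLWIDTH_TILDE, WAVE_DASH)

    # NFKC folds full-width alphanumerics to ASCII, half-width katakana
    # to full-width, and the ideographic space (U+3000) to a plain space.
    text = unicodedata.normalize("NFKC", text)

    # Whitespace cleanup: unify line endings, collapse space runs,
    # trim trailing spaces, and cap consecutive blank lines (assumed: two).
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)
    text = re.sub(r"\n{4,}", "\n\n\n", text)

    # Optional kana conversion; these option names are assumed.
    if kana == "katakana":
        text = text.translate(HIRA_TO_KATA)
    elif kana == "hiragana":
        text = text.translate(KATA_TO_HIRA)
    return text


if __name__ == "__main__":
    print(normalize_ja("Ｃｌａｕｄｅ　Ｃｏｄｅで開発する。"))
    print(normalize_ja("10～20 ~/bin"))
```

Doing the wave-dash replacement before NFKC is what keeps ASCII tildes in paths and URLs distinguishable from Japanese range marks: after NFKC both would be the same `~` character.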

Input

{
    "texts": ["Ｃｌａｕｄｅ　Ｃｏｄｅで開発する。「すごい」と思った。"],
    "kana": "none",
    "split_sentences": true
}

Output (one dataset item per text)

{
    "text": "Claude Codeで開発する。「すごい」と思った。",
    "changed": true,
    "sentences": ["Claude Codeで開発する。", "「すごい」と思った。"],
    "sentence_count": 2,
    "stats_before": {"hiragana": 8, "katakana": 0, "kanji": 4, "...": "..."},
    "stats_after": {"...": "..."}
}
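
The `sentences` and `stats_*` fields can be pictured with a small sketch like the one below. It is an illustration under assumptions, not the actor's code: the function names, the regular expression, the Unicode ranges used per script, and the `ascii` and `digits` key names are hypothetical (the sample output elides those keys).

```python
import re

# A sentence ends at Japanese or Latin terminators, plus any closing
# quotes or brackets that directly follow. A "." only counts when it is
# followed by whitespace or end of text, so "3.14" stays intact.
_SENTENCE_END = re.compile(
    r"""[。！？!?]+[」』）)"']*|\.+[)"']*(?=\s|$)"""
)


def split_sentences(text: str) -> list[str]:
    """Hypothetical splitter: cut after each terminator match."""
    sentences, start = [], 0
    for match in _SENTENCE_END.finditer(text):
        chunk = text[start:match.end()].strip()
        if chunk:
            sentences.append(chunk)
        start = match.end()
    tail = text[start:].strip()  # text after the last terminator, if any
    if tail:
        sentences.append(tail)
    return sentences


def script_stats(text: str) -> dict[str, int]:
    """Hypothetical per-script counter; ranges and key names are assumed."""
    stats = {"hiragana": 0, "katakana": 0, "kanji": 0, "ascii": 0, "digits": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3041 <= cp <= 0x309F:
            stats["hiragana"] += 1
        elif 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
            stats["katakana"] += 1  # includes half-width katakana
        elif 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            stats["kanji"] += 1
        elif ch.isdecimal():
            stats["digits"] += 1  # counted separately from other ASCII
        elif cp < 0x80 and not ch.isspace():
            stats["ascii"] += 1
    return stats


if __name__ == "__main__":
    print(split_sentences("Claude Codeで開発する。「すごい」と思った。"))
    print(script_stats("カタカナとひらがなと漢字ABC123"))
```

In this sketch a closing quote stays attached to the sentence it ends, so text such as 「行く。」 is kept as one unit rather than leaving a stray 」 at the start of the next sentence.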

Typical uses

Preprocessing scraped Japanese text before indexing or embedding
Unifying mixed full-width/half-width product data
Sentence-level dataset construction from raw Japanese prose
