Text Dedupe — exact & near-duplicate detection, Japanese ready
Pricing
from $5.00 / 1,000 results
Detect exact and near-duplicate texts in scraped or collected datasets: normalized hashing for exact matches plus character-shingle Jaccard similarity for near matches. Works on Japanese (no word tokenization needed). Returns per-text verdicts and duplicate clusters. No LLM cost.
Developer
Shinobu Otani
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago
Text Dedupe
Detect exact and near-duplicate texts in scraped or collected datasets — deterministic, Japanese-ready, no LLM cost.
What it does
- Exact duplicates: grouped by hash of the normalized text (Unicode NFKC, lowercased, whitespace collapsed, so "Hello" matches "hello").
- Near-duplicates: Jaccard similarity over character n-gram shingles. Character shingles work for Japanese, where word tokenization is unreliable.
- Each text is compared against cluster representatives only (O(n × clusters), not O(n²) pairs).
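The three steps above can be sketched in Python as follows. This is an illustrative approximation, not the actor's source: the function names, the choice of SHA-256, the "unique" status label, the best-match tie-breaking, and the mapping of similarity_percent to a 0 to 1 threshold are all assumptions.

```python
from __future__ import annotations

import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-fold, lowercase, and collapse whitespace runs to single spaces."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", folded).strip()


def text_hash(text: str) -> str:
    # SHA-256 is an assumption; any stable hash would serve.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 5) -> set[str]:
    """Overlapping character n-grams; a text shorter than n is one shingle."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(texts: list[str], similarity_percent: int = 85, ngram_size: int = 5) -> list[dict]:
    threshold = similarity_percent / 100  # assumed mapping of the input field
    rep_by_hash: dict[str, int] = {}      # normalized hash -> cluster representative
    reps: list[tuple[int, set[str]]] = []  # (index, shingle set) per cluster
    verdicts = []
    for i, raw in enumerate(texts):
        norm = normalize(raw)
        h = text_hash(norm)
        if h in rep_by_hash:  # identical to an earlier text after normalization
            verdicts.append({"index": i, "status": "duplicate",
                             "duplicate_of": rep_by_hash[h],
                             "similarity": 1.0, "exact": True})
            continue
        sh = shingles(norm, ngram_size)
        # Compare against representatives only, never against every earlier text.
        best_rep, best_sim = None, 0.0
        for rep_index, rep_sh in reps:
            sim = jaccard(sh, rep_sh)
            if sim > best_sim:
                best_rep, best_sim = rep_index, sim
        if best_rep is not None and best_sim >= threshold:
            rep_by_hash[h] = best_rep
            verdicts.append({"index": i, "status": "duplicate",
                             "duplicate_of": best_rep,
                             "similarity": round(best_sim, 2), "exact": False})
        else:
            rep_by_hash[h] = i
            reps.append((i, sh))
            verdicts.append({"index": i, "status": "unique"})  # label is assumed
    return verdicts
```

Because the inner loop runs over reps rather than over all earlier texts, the cost grows with the number of clusters, which matches the O(n × clusters) bound stated above.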
Input
{
  "texts": ["first article ...", "first article ...!", "something else"],
  "similarity_percent": 85,
  "ngram_size": 5,
  "normalize": true
}
Output (one dataset item per text)
{"index": 1, "status": "duplicate", "duplicate_of": 0, "similarity": 0.93, "exact": false}
duplicate_of points at the cluster representative (the first text of the
cluster). exact: true means the text is identical to an earlier input
after normalization. A SUMMARY record in the key-value store lists all
duplicate clusters and totals.
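If you only have the per-text items, the clusters can be rebuilt by grouping on duplicate_of. The sketch below shows one way to do that; the keys of the returned dictionary (total, duplicates, clusters, representative, members) are hypothetical and may not match the actual SUMMARY record.

```python
from collections import defaultdict


def summarize(verdicts: list[dict]) -> dict:
    """Group duplicate verdicts under their cluster representative."""
    members_by_rep = defaultdict(list)
    for v in verdicts:
        if v.get("status") == "duplicate":
            members_by_rep[v["duplicate_of"]].append(v["index"])
    clusters = [
        {"representative": rep, "members": sorted(members)}
        for rep, members in sorted(members_by_rep.items())
    ]
    return {
        "total": len(verdicts),
        "duplicates": sum(len(c["members"]) for c in clusters),
        "clusters": clusters,
    }
```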
Usage
Run it on scraped articles, product descriptions, or RAG corpus chunks
before indexing; drop every item whose status is duplicate.
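A minimal sketch of that filtering step, assuming the dataset items have already been downloaded into a list and that index is the zero-based position of each text in the input (the sample output above is consistent with this, but it is not stated explicitly). The function name is illustrative.

```python
def drop_duplicates(texts: list[str], verdicts: list[dict]) -> list[str]:
    """Keep only the texts whose verdict is not 'duplicate'."""
    duplicate_indexes = {
        v["index"] for v in verdicts if v.get("status") == "duplicate"
    }
    return [t for i, t in enumerate(texts) if i not in duplicate_indexes]
```

Each cluster representative is kept, so one copy of every duplicated text survives.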