Text Dedupe — exact & near-duplicate detection, Japanese ready

Pricing: from $5.00 / 1,000 results

Detect exact and near-duplicate texts in scraped or collected datasets: normalized hashing for exact matches plus character-shingle Jaccard similarity for near matches. Works on Japanese (no word tokenization needed). Returns per-text verdicts and duplicate clusters. No LLM cost.

Rating: 0.0 (0)

Developer: Shinobu Otani

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

Text Dedupe

Detect exact and near-duplicate texts in scraped or collected datasets — deterministic, Japanese-ready, no LLM cost.

What it does

  • Exact duplicates: grouped by hash of the normalized text (Unicode NFKC, lowercased, whitespace collapsed — Hello matches hello).
  • Near-duplicates: Jaccard similarity over character n-gram shingles. Character shingles work for Japanese, where word tokenization is unreliable.
  • Each text is compared against cluster representatives only (O(n × clusters), not O(n²) pairs).
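
The three steps above can be sketched in Python as follows. This is an illustrative approximation, not the Actor's actual source: the SHA-256 hash, the `unique` status label, the stripping of leading and trailing whitespace, the handling of texts shorter than the n-gram size, and the helper names (`normalize`, `shingles`, `jaccard`, `dedupe`) are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse whitespace runs."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams; a text no longer than n yields one shingle (assumed)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B| over two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(texts, similarity_percent=85, ngram_size=5):
    threshold = similarity_percent / 100
    hash_to_rep = {}  # hash of normalized text -> representative index
    reps = []         # (index, shingle set) per cluster representative
    results = []
    for i, raw in enumerate(texts):
        norm = normalize(raw)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()  # hash choice assumed
        if digest in hash_to_rep:
            # Identical to an earlier input after normalization.
            results.append({"index": i, "status": "duplicate",
                            "duplicate_of": hash_to_rep[digest],
                            "similarity": 1.0, "exact": True})
            continue
        sh = shingles(norm, ngram_size)
        best_idx, best_sim = None, 0.0
        # Compare against representatives only, not against every earlier text.
        for rep_idx, rep_sh in reps:
            sim = jaccard(sh, rep_sh)
            if sim > best_sim:
                best_idx, best_sim = rep_idx, sim
        if best_idx is not None and best_sim >= threshold:
            hash_to_rep[digest] = best_idx
            results.append({"index": i, "status": "duplicate",
                            "duplicate_of": best_idx,
                            "similarity": round(best_sim, 2), "exact": False})
        else:
            # No close representative: this text starts a new cluster.
            hash_to_rep[digest] = i
            reps.append((i, sh))
            results.append({"index": i, "status": "unique",  # label assumed
                            "duplicate_of": None,
                            "similarity": None, "exact": False})
    return results
```

Because the shingles are raw character windows, the same code handles Japanese text without a tokenizer; NFKC also folds full-width Latin letters such as `ｈｅｌｌｏ` into `hello` before hashing.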

Input

{
  "texts": ["first article ...", "first article ...!", "something else"],
  "similarity_percent": 85,
  "ngram_size": 5,
  "normalize": true
}

Output (one dataset item per text)

{"index": 1, "status": "duplicate", "duplicate_of": 0, "similarity": 0.93, "exact": false}

duplicate_of points at the cluster representative (the first text of the cluster). exact: true means the text is identical to an earlier input after normalization. Note that similarity in the example is a fraction (0.93), while the similarity_percent input is a percentage (85). A SUMMARY record in the key-value store lists all duplicate clusters and totals.

Usage

Run it on scraped articles, product descriptions, or RAG corpus chunks before indexing; drop every item whose status is duplicate.
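
A minimal sketch of that filtering step, assuming the dataset items have already been downloaded into a Python list and that `index` is the zero-based position of the text in the input `texts` array (consistent with the output example, but not stated explicitly). The helper names `drop_duplicates` and `group_clusters` are hypothetical.

```python
from collections import defaultdict


def drop_duplicates(texts, items):
    """Keep only the texts whose output item is not marked as a duplicate."""
    duplicate_indexes = {
        item["index"] for item in items if item.get("status") == "duplicate"
    }
    return [t for i, t in enumerate(texts) if i not in duplicate_indexes]


def group_clusters(items):
    """Map each cluster representative index to the indexes of its duplicates."""
    clusters = defaultdict(list)
    for item in items:
        if item.get("status") == "duplicate":
            clusters[item["duplicate_of"]].append(item["index"])
    return dict(clusters)


texts = ["first article ...", "first article ...!", "something else"]
items = [
    {"index": 0},  # fields of non-duplicate items are not shown in the source
    {"index": 1, "status": "duplicate", "duplicate_of": 0,
     "similarity": 0.93, "exact": False},
    {"index": 2},
]
kept = drop_duplicates(texts, items)
clusters = group_clusters(items)
```

Keeping the representative and dropping only the items flagged `duplicate` leaves exactly one text per cluster.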