Pricing

from $5.00 / 1,000 results

Text Dedupe — exact & near-duplicate detection, Japanese ready

Detect exact and near-duplicate texts in scraped or collected datasets: normalized hashing for exact matches plus character-shingle Jaccard similarity for near matches. Works on Japanese (no word tokenization needed). Returns per-text verdicts and duplicate clusters. No LLM cost.

Developer

Shinobu Otani

Last modified

a month ago

Text Dedupe

Detect exact and near-duplicate texts in scraped or collected datasets — deterministic, Japanese-ready, no LLM cost.

What it does

Exact duplicates: grouped by hash of the normalized text (Unicode NFKC, lowercased, whitespace collapsed — Ｈｅｌｌｏ matches hello).
Near-duplicates: Jaccard similarity over character n-gram shingles. Character shingles work for Japanese, where word tokenization is unreliable.
Scaling: each text is compared only against cluster representatives, so the cost is O(n × clusters) comparisons rather than O(n²) pairwise comparisons.
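
The three steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the Actor's source: the function names, the SHA-256 hash, the "unique" status label, the handling of texts shorter than the n-gram size, and the choice of the best-matching representative are all assumptions.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse whitespace runs to one space."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return " ".join(folded.split())


def shingles(text: str, n: int = 5) -> set:
    """Set of character n-grams; a text no longer than n yields itself (assumption)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, taken as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(texts, threshold: float = 0.85, n: int = 5):
    """Return one verdict per text, comparing only against cluster representatives."""
    rep_by_hash = {}   # hash of normalized text -> index of its cluster representative
    reps = []          # (representative index, representative shingle set)
    results = []
    for i, raw in enumerate(texts):
        norm = normalize(raw)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()

        # Exact match: same normalized text as an earlier input.
        if digest in rep_by_hash:
            results.append({"index": i, "status": "duplicate",
                            "duplicate_of": rep_by_hash[digest],
                            "similarity": 1.0, "exact": True})
            continue

        # Near match: best Jaccard score against existing representatives.
        sh = shingles(norm, n)
        best_rep, best_sim = None, 0.0
        for rep_index, rep_sh in reps:
            sim = jaccard(sh, rep_sh)
            if sim > best_sim:
                best_rep, best_sim = rep_index, sim

        if best_rep is not None and best_sim >= threshold:
            rep_by_hash[digest] = best_rep
            results.append({"index": i, "status": "duplicate",
                            "duplicate_of": best_rep,
                            "similarity": round(best_sim, 2), "exact": False})
        else:
            # No representative is close enough: this text starts a new cluster.
            rep_by_hash[digest] = i
            reps.append((i, sh))
            results.append({"index": i, "status": "unique",  # label is an assumption
                            "duplicate_of": None, "similarity": None, "exact": False})
    return results
```

Because the shingles are plain character windows, a Japanese sentence and the same sentence with a trailing 。 share all but one shingle and land in the same cluster without any word segmentation.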

Input

{
    "texts": ["first article ...", "first article ...!", "something else"],
    "similarity_percent": 85,
    "ngram_size": 5,
    "normalize": true
}
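
One way to send this input from Python is through the Apify API client. The sketch below assumes the apify-client package, whose `actor(...).call(run_input=...)` and `dataset(...).iterate_items()` methods are used here, and assumes the returned run object exposes `defaultDatasetId` as a dictionary key. The helper name `run_text_dedupe` and the `ACTOR_ID` placeholder are hypothetical; the real Actor ID is not given on this page.

```python
from typing import Any, List

ACTOR_ID = "<username>/<actor-name>"  # placeholder: the real ID is not stated here


def run_text_dedupe(client: Any, actor_id: str, texts: List[str],
                    similarity_percent: int = 85, ngram_size: int = 5,
                    normalize: bool = True) -> List[dict]:
    """Start the Actor with the documented input fields and return its dataset items."""
    run_input = {
        "texts": texts,
        "similarity_percent": similarity_percent,
        "ngram_size": ngram_size,
        "normalize": normalize,
    }
    # Assumed client behaviour: call() blocks until the run finishes and
    # returns run metadata that includes the default dataset ID.
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires network access and an Apify token):
#   from apify_client import ApifyClient
#   client = ApifyClient("<APIFY_TOKEN>")
#   records = run_text_dedupe(client, ACTOR_ID, ["first article ...", "something else"])
```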

Output (one dataset item per text)

{"index": 1, "status": "duplicate", "duplicate_of": 0, "similarity": 0.93, "exact": false}

index and duplicate_of are zero-based positions in texts. duplicate_of points at the cluster representative (the first text of the cluster). exact: true means the text is identical to an earlier input after normalization. A SUMMARY record in the key-value store lists all duplicate clusters and totals.
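
If only the per-text records are at hand, the clusters can be rebuilt from duplicate_of. The sketch below is a minimal illustration; the function name is hypothetical, and it assumes that records for non-duplicate texts either lack duplicate_of or set it to null.

```python
from collections import defaultdict


def clusters_from_records(records):
    """Map each representative index to the sorted indices in its cluster."""
    members = defaultdict(list)
    for rec in records:
        rep = rec.get("duplicate_of")
        if rec.get("status") == "duplicate" and rep is not None:
            members[rep].append(rec["index"])
    # Put the representative first, then its duplicates in input order.
    return {rep: [rep] + sorted(dups) for rep, dups in members.items()}
```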

Usage

Run it on scraped articles, product descriptions, or RAG corpus chunks before indexing; drop every item whose status is duplicate.
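
A minimal sketch of that filtering step, assuming the records use the index, status, and duplicate fields shown in the output example; the helper name `drop_duplicates` is hypothetical.

```python
def drop_duplicates(texts, records):
    """Keep only the texts whose record is not marked as a duplicate."""
    duplicate_indices = {
        rec["index"] for rec in records if rec.get("status") == "duplicate"
    }
    return [text for i, text in enumerate(texts) if i not in duplicate_indices]
```

Each cluster representative is kept, so one copy of every distinct text survives.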
