Content Similarity Finder avatar
Content Similarity Finder

Pricing

Pay per event

Go to Apify Store
Content Similarity Finder

Content Similarity Finder

Find duplicate and similar content with advanced fuzzy matching algorithms. Perfect for data cleaning and deduplication.

Pricing

Pay per event

Rating

0.0

(0)

Developer

Cody Churchwell

Cody Churchwell

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

7 days ago

Last modified

Share

Content Similarity & Duplicate Finder

Find duplicate and similar content with advanced fuzzy matching algorithms. Perfect for data cleaning and deduplication.

🎯 What It Does

Content Similarity Finder detects duplicate and near-duplicate content using multiple similarity algorithms: cosine similarity, Levenshtein distance, fuzzy matching, and Jaccard similarity.

✨ Key Features

  • Multiple Algorithms: Cosine, Levenshtein, Fuzzy, Jaccard
  • Configurable Threshold: Set minimum similarity (0-100%)
  • Smart Normalization: Case-insensitive, whitespace handling
  • Duplicate Grouping: Cluster similar items together
  • Fast Processing: Optimized for large datasets

🚀 Quick Start

{
"content": [
{"id": "1", "text": "The quick brown fox jumps"},
{"id": "2", "text": "A quick brown fox jumps"},
{"id": "3", "text": "Completely different text"}
],
"similarityThreshold": 0.8,
"algorithms": {
"cosine": true,
"levenshtein": true,
"fuzzy": true,
"jaccard": true
}
}

📥 Input

  • content: Array of items with id and text fields
  • similarityThreshold: 0-1 (0.8 = 80% similar minimum)
  • algorithms: Enable/disable cosine, levenshtein, fuzzy, jaccard
  • caseSensitive: Treat case as significant (default: false)
  • ignoreWhitespace: Normalize whitespace (default: true)
  • minLength: Skip texts shorter than this
  • groupByDuplicate: Cluster similar items (default: true)

📤 Output

Similarity Matches

{
"item1": "1",
"item2": "2",
"text1": "The quick brown fox",
"text2": "A quick brown fox",
"similarity": 0.89,
"algorithm": "cosine"
}

Duplicate Groups (if groupByDuplicate: true)

{
"totalGroups": 1,
"groups": [
{
"groupId": "group_1",
"members": ["1", "2"],
"size": 2
}
]
}

🛠 Use Cases

  • Data Deduplication: Remove duplicate entries from databases
  • Plagiarism Detection: Find copied content
  • Content Moderation: Detect spam or repeated messages
  • SEO Analysis: Find duplicate website content
  • Data Cleaning: Merge similar records

📊 Algorithms

  • Cosine Similarity: Best for semantic similarity (TF-IDF based)
  • Levenshtein Distance: Best for typos, minor edits
  • Fuzzy Matching: Best for approximate string matching
  • Jaccard Similarity: Best for word overlap comparison

📄 License

MIT License


Clean data, better insights 🔍