Content Similarity Finder
Pricing: from $0.01 / 1,000 results
Developer: Cody Churchwell (Maintained by Community)
Content Similarity & Duplicate Finder
Find duplicate and similar content with advanced fuzzy matching algorithms. Perfect for data cleaning and deduplication.
🎯 What It Does
Content Similarity Finder detects duplicate and near-duplicate content using multiple similarity algorithms: cosine similarity, Levenshtein distance, fuzzy matching, and Jaccard similarity.
✨ Key Features
- Multiple Algorithms: Cosine, Levenshtein, Fuzzy, Jaccard
- Configurable Threshold: Set the minimum similarity from 0 to 1 (for example, 0.8 = 80%)
- Smart Normalization: Case-insensitive, whitespace handling
- Duplicate Grouping: Cluster similar items together
- Fast Processing: Optimized for large datasets
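The duplicate grouping feature listed above can be pictured as finding connected components: if item 1 matches item 2 and item 2 matches item 4, all three land in one group. The sketch below uses a small union-find to illustrate that idea. It is an assumption about how such clustering could work, not the Actor's actual implementation; `group_duplicates` and the two low scores in the sample data are hypothetical.

```python
from collections import defaultdict


def group_duplicates(item_ids, matches, threshold=0.8):
    """Cluster item ids into groups of transitively similar items.

    `matches` is an iterable of (id_a, id_b, score) tuples.
    Illustrative only: the Actor's own grouping logic is not documented here.
    """
    parent = {item_id: item_id for item_id in item_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in matches:
        if score >= threshold:
            # Merge the two clusters that contain a and b.
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for item_id in item_ids:
        clusters[find(item_id)].append(item_id)

    # Only clusters with at least two members count as duplicate groups.
    return [members for members in clusters.values() if len(members) > 1]


# 0.89 comes from the sample output in this README; 0.12 and 0.10 are made up.
groups = group_duplicates(
    ["1", "2", "3"],
    [("1", "2", 0.89), ("1", "3", 0.12), ("2", "3", 0.10)],
)
print(groups)  # [['1', '2']]
```

One consequence of transitive grouping is that two items in the same group may not be directly similar to each other, so a stricter threshold may be preferable when groups are used for automatic deletion.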
🚀 Quick Start
{
  "content": [
    {"id": "1", "text": "The quick brown fox jumps"},
    {"id": "2", "text": "A quick brown fox jumps"},
    {"id": "3", "text": "Completely different text"}
  ],
  "similarityThreshold": 0.8,
  "algorithms": {
    "cosine": true,
    "levenshtein": true,
    "fuzzy": true,
    "jaccard": true
  }
}
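To send this input from a script, one option is the `apify-client` Python package. The following is a minimal sketch under stated assumptions: `build_run_input` and `run_actor` are hypothetical helpers, the Actor ID and API token are placeholders you must supply yourself, and it assumes the Actor writes its results to the run's default dataset.

```python
ALL_ALGORITHMS = ("cosine", "levenshtein", "fuzzy", "jaccard")


def build_run_input(items, threshold=0.8, algorithms=ALL_ALGORITHMS):
    """Build an input object shaped like the Quick Start example.

    `items` is an iterable of (id, text) pairs.
    """
    unknown = set(algorithms) - set(ALL_ALGORITHMS)
    if unknown:
        raise ValueError(f"Unknown algorithms: {sorted(unknown)}")
    if not 0 <= threshold <= 1:
        raise ValueError("similarityThreshold must be between 0 and 1")
    return {
        "content": [{"id": str(item_id), "text": text} for item_id, text in items],
        "similarityThreshold": threshold,
        "algorithms": {name: name in algorithms for name in ALL_ALGORITHMS},
    }


def run_actor(run_input, token, actor_id):
    """Start the Actor and return the items from its default dataset.

    Requires `pip install apify-client`. `actor_id` is a placeholder such as
    "<username>/<actor-name>"; look up the real value in Apify Store.
    Assumption: results are stored in the run's default dataset.
    """
    from apify_client import ApifyClient  # imported lazily; optional dependency

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(
    [
        ("1", "The quick brown fox jumps"),
        ("2", "A quick brown fox jumps"),
        ("3", "Completely different text"),
    ]
)
# results = run_actor(run_input, token="<YOUR_APIFY_TOKEN>", actor_id="<username>/<actor-name>")
```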
📥 Input
- content: Array of items with id and text fields
- similarityThreshold: 0-1 (0.8 = a minimum of 80% similarity)
- algorithms: Enable/disable cosine, levenshtein, fuzzy, jaccard
- caseSensitive: Treat case as significant (default: false)
- ignoreWhitespace: Normalize whitespace (default: true)
- minLength: Skip texts shorter than this
- groupByDuplicate: Cluster similar items (default: true)
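To make the effect of caseSensitive, ignoreWhitespace, and minLength concrete, here is a small sketch of what such preprocessing might look like. It is an illustration, not the Actor's code: it assumes that "normalize whitespace" means collapsing runs of whitespace into single spaces and that minLength is measured in characters after normalization. The function names are hypothetical.

```python
import re


def normalize_text(text, case_sensitive=False, ignore_whitespace=True):
    """Apply normalization that mirrors the documented defaults."""
    if ignore_whitespace:
        # Assumption: collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
    if not case_sensitive:
        text = text.lower()
    return text


def prepare_items(content, min_length=0, **options):
    """Normalize every item and skip texts shorter than `min_length` characters."""
    prepared = []
    for item in content:
        text = normalize_text(item["text"], **options)
        if len(text) >= min_length:
            prepared.append({"id": item["id"], "text": text})
    return prepared


items = prepare_items(
    [
        {"id": "1", "text": "The  Quick\tBrown Fox "},
        {"id": "2", "text": "ok"},
    ],
    min_length=5,
)
print(items)  # [{'id': '1', 'text': 'the quick brown fox'}]
```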
📤 Output
Similarity Matches
{
  "item1": "1",
  "item2": "2",
  "text1": "The quick brown fox",
  "text2": "A quick brown fox",
  "similarity": 0.89,
  "algorithm": "cosine"
}
Duplicate Groups (if groupByDuplicate: true)
{
  "totalGroups": 1,
  "groups": [
    {
      "groupId": "group_1",
      "members": ["1", "2"],
      "size": 2
    }
  ]
}
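For deduplication, the groups output can be applied back to the original items. The sketch below keeps the first listed member of each group and drops the rest. That keep-first policy and the `drop_duplicates` helper are assumptions made for illustration; you may prefer to keep the longest or most recent record instead.

```python
def drop_duplicates(content, groups_output):
    """Return `content` without the redundant members of each duplicate group."""
    to_drop = set()
    for group in groups_output["groups"]:
        # Keep the first member as the representative, drop the others.
        to_drop.update(group["members"][1:])
    return [item for item in content if item["id"] not in to_drop]


content = [
    {"id": "1", "text": "The quick brown fox jumps"},
    {"id": "2", "text": "A quick brown fox jumps"},
    {"id": "3", "text": "Completely different text"},
]
groups_output = {
    "totalGroups": 1,
    "groups": [{"groupId": "group_1", "members": ["1", "2"], "size": 2}],
}

cleaned = drop_duplicates(content, groups_output)
print([item["id"] for item in cleaned])  # ['1', '3']
```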
🛠 Use Cases
- Data Deduplication: Remove duplicate entries from databases
- Plagiarism Detection: Find copied content
- Content Moderation: Detect spam or repeated messages
- SEO Analysis: Find duplicate website content
- Data Cleaning: Merge similar records
📊 Algorithms
- Cosine Similarity: Best for overall topical similarity between texts (TF-IDF based, so it compares weighted term usage rather than meaning)
- Levenshtein Distance: Best for typos, minor edits
- Fuzzy Matching: Best for approximate string matching
- Jaccard Similarity: Best for word overlap comparison
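To show how these four measures differ, here are simplified standard-library versions. They are illustrative sketches, not the Actor's implementation, so their scores will not necessarily match its output. In particular, the cosine version uses raw term counts instead of TF-IDF weights, and "fuzzy" is approximated with `difflib.SequenceMatcher`, which is only one of several possible fuzzy-matching techniques.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def jaccard(a, b):
    """Word overlap: shared words divided by all distinct words."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cosine(a, b):
    """Cosine of the angle between raw term-count vectors (no IDF weighting)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein_similarity(a, b):
    """1 minus the character edit distance divided by the longer length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                        # deletion
                curr[j - 1] + 1,                    # insertion
                prev[j - 1] + (char_a != char_b),   # substitution
            ))
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


def fuzzy(a, b):
    """Stand-in fuzzy score based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


a = "the quick brown fox jumps"
b = "a quick brown fox jumps"
print(round(jaccard(a, b), 2))                 # 0.67
print(round(cosine(a, b), 2))                  # 0.8
print(round(levenshtein_similarity(a, b), 2))  # 0.88
print(round(fuzzy(a, b), 2))
```

The same pair scores differently under each measure, which is why the threshold that works well for one algorithm may be too strict or too loose for another.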
📄 License
MIT License
Clean data, better insights 🔍