Duplicate Content Checker
Developer: Stas Persiianenko
Check for duplicate or near-duplicate content between web pages using shingling and Jaccard similarity.
What does Duplicate Content Checker do?
This actor compares the text content of two or more web pages to detect duplicate or near-duplicate content. It uses w-shingling (5-word n-grams) with Jaccard similarity to calculate the percentage of shared content between every pair of URLs. Pages with 90%+ similarity are flagged as duplicates, and pages with 60-89% similarity as near-duplicates.
Duplicate content is one of the most common SEO problems. When multiple pages have substantially similar text, search engines struggle to decide which version to rank, often resulting in neither page performing well. This actor gives you a precise similarity percentage for every URL pair so you can take action.
Each result includes word counts, shingle counts, and the exact similarity percentage for the pair, giving you all the data needed to decide whether to consolidate, differentiate, or canonicalize duplicate pages. The actor also provides page titles in the output so you can quickly identify which pages are being compared.
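The shingling and Jaccard computation described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the actor's actual source; the tokenization details (lowercasing, punctuation handling) are assumptions:

```python
import re

def shingles(text, k=5):
    """Split text into the set of overlapping k-word shingles (w-shingling)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_percent(text_a, text_b, k=5):
    """Jaccard similarity of the two shingle sets, as a percentage."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 0.0  # both pages too short to produce any shingles
    return 100 * len(a & b) / len(a | b)
```

With thresholds applied the same way the actor describes, `jaccard_percent(a, b) >= 90` would flag a duplicate and `60 <= jaccard_percent(a, b) < 90` a near-duplicate.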
Use cases
- SEO specialist -- find pages on your site that compete against each other for the same keywords due to duplicate content
- Content manager -- identify redundant blog posts or landing pages that should be consolidated or differentiated
- Agency strategist -- run plagiarism checks to see if competitor sites have copied your client's unique content
- Migration engineer -- verify that content matches between old and new URLs after a site migration
- E-commerce manager -- detect product pages with nearly identical descriptions that need unique, differentiated copy
Why use Duplicate Content Checker?
- Pairwise comparison -- every URL is compared against every other URL, catching duplicates you might miss manually
- Shingling algorithm -- uses 5-word n-grams for accurate text similarity, not just word counts
- Clear thresholds -- automatically flags 90%+ as duplicate and 60-89% as near-duplicate
- Batch processing -- submit multiple URLs and get all pair comparisons in one run
- Structured JSON output -- results include similarity percentages, shingle counts, and word counts for detailed analysis
- Pay-per-event pricing -- a flat start fee plus a small fee for each URL pair compared
- Word count metrics -- includes total word counts for each page to help contextualize similarity percentages
- Page titles in output -- each result includes both page titles for quick identification of compared pages
- No configuration needed -- just provide URLs and the actor handles all text extraction, shingling, and similarity comparison automatically
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | array | Yes | -- | List of web page URLs to compare for duplicate content. Minimum 2 URLs required. Every pair of URLs will be compared. |
Input example
```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://en.wikipedia.org/wiki/Data_extraction",
    "https://example.com"
  ]
}
```
Output example
```json
{
  "urlA": "https://en.wikipedia.org/wiki/Web_scraping",
  "urlB": "https://en.wikipedia.org/wiki/Data_extraction",
  "similarityPercent": 12.5,
  "isDuplicate": false,
  "isNearDuplicate": false,
  "sharedShingles": 245,
  "totalShinglesA": 1200,
  "totalShinglesB": 980,
  "wordCountA": 3450,
  "wordCountB": 2800,
  "titleA": "Web scraping - Wikipedia",
  "titleB": "Data extraction - Wikipedia",
  "error": null,
  "checkedAt": "2026-03-01T12:00:00.000Z"
}
```
How much does it cost?
| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| Pair compared | $0.002 | Per URL pair compared |
Example costs:
- 3 URLs = 3 pairs: $0.035 + (3 x $0.002) = $0.041
- 5 URLs = 10 pairs: $0.035 + (10 x $0.002) = $0.055
- 10 URLs = 45 pairs: $0.035 + (45 x $0.002) = $0.125
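These totals follow a simple formula: N URLs yield N*(N-1)/2 pairs, so the total cost is the start fee plus the per-pair fee times the pair count. A quick sketch using the prices above:

```python
def run_cost(n_urls, start_fee=0.035, pair_fee=0.002):
    """Estimated cost in USD of one run: start fee plus a fee per URL pair."""
    pairs = n_urls * (n_urls - 1) // 2
    return start_fee + pairs * pair_fee

for n in (3, 5, 10):
    print(n, "URLs ->", round(run_cost(n), 3), "USD")
```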
Using the Apify API
You can call Duplicate Content Checker programmatically from your own applications using the Apify API. The following examples show how to start a run, wait for it to finish, and retrieve the results.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/duplicate-content-checker').call({
    urls: ['https://example.com/page-1', 'https://example.com/page-2'],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/duplicate-content-checker').call(run_input={
    'urls': ['https://example.com/page-1', 'https://example.com/page-2'],
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
Integrations
Duplicate Content Checker works with all major automation and integration platforms available on Apify. Connect it to Make (formerly Integromat) or Zapier to trigger content consolidation workflows when duplicates are found. Export results to Google Sheets for easy review with your content team. Send Slack alerts when new duplicate pages are detected during scheduled runs. Use webhooks to get instant notifications when a run finishes, or build custom pipelines with n8n. Schedule recurring runs on Apify to monitor your site for emerging duplicate content issues. All results are stored in Apify datasets and can be downloaded in JSON, CSV, or Excel format for further analysis.
Tips and best practices
- Start with pages targeting the same keyword -- these are the most likely candidates for harmful duplicate content
- Remember the pair count grows quickly -- 10 URLs produce 45 pairs, 20 URLs produce 190 pairs; keep URL lists focused
- Investigate near-duplicates (60-90%) -- these are often more harmful than exact duplicates because they are harder to spot manually
- Use results to plan canonical tags -- if two pages are duplicates, set one as the canonical and redirect or noindex the other
- Combine with keyword density analysis -- if two pages have similar content and target the same keywords, consolidate them
- Group URLs by topic cluster -- compare pages within the same category or topic to find internal keyword cannibalization issues
FAQ
How is similarity calculated? The actor uses w-shingling with 5-word n-grams and Jaccard similarity. It breaks each page's text into overlapping 5-word sequences, then calculates the ratio of shared sequences to total unique sequences across both pages.
What counts as a duplicate vs. near-duplicate? Pages with 90% or higher similarity are flagged as duplicates. Pages with 60-89% similarity are flagged as near-duplicates. Below 60% is considered unique content.
Can I compare pages from different websites? Yes. You can include URLs from any website. This is useful for plagiarism detection or checking if content has been syndicated without modification.
How many URLs can I compare in one run? There is no strict limit, but keep in mind that the number of pairs grows quadratically. For N URLs, the actor compares N*(N-1)/2 pairs. For example, 20 URLs produce 190 pairs. For large sets, consider breaking them into smaller groups of related pages.
Does the actor compare HTML or visible text? The actor extracts and compares the visible text content of each page, ignoring HTML tags, navigation menus, and boilerplate elements. This gives a more accurate measure of content similarity than raw HTML comparison.
Why use shingling instead of simple word matching? Shingling preserves word order, which is important for detecting actual content duplication. Two pages could have the same individual words in completely different arrangements, which simple word frequency comparison would miss. Shingling with 5-word n-grams catches passages that were genuinely copied or reused, providing a more accurate similarity measure.
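A small Python illustration of that point: two texts built from the same words in a different order look identical to a bag-of-words comparison but share almost no 5-word shingles. The helper names here are hypothetical, not part of the actor's API:

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def shingle_set(text, k=5):
    w = words(text)
    return {" ".join(w[i:i + k]) for i in range(len(w) - k + 1)}

a = "the quick brown fox jumps over the lazy sleeping dog"
b = "the lazy dog jumps over the sleeping quick brown fox"

# Same multiset of words, so a word-frequency comparison sees them as identical
print(Counter(words(a)) == Counter(words(b)))  # True

# But the 5-word shingle sets barely overlap, so shingling scores them as unique
sa, sb = shingle_set(a), shingle_set(b)
print(len(sa & sb), "shared shingles out of", len(sa | sb))
```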