Duplicate Content Checker
Developer: Stas Persiianenko
Check for duplicate or near-duplicate content between web pages using shingling and Jaccard similarity.
What does Duplicate Content Checker do?
This actor compares the text content of two or more web pages to detect duplicate or near-duplicate content. It uses w-shingling (5-word n-grams) with Jaccard similarity to calculate the percentage of shared content between every pair of URLs. Pages with 90%+ similarity are flagged as duplicates, and pages with 60-89% similarity as near-duplicates.
Duplicate content is one of the most common SEO problems. When multiple pages have substantially similar text, search engines struggle to decide which version to rank, often resulting in neither page performing well. This actor gives you a precise similarity percentage for every URL pair so you can take action.
Each result includes word counts, shingle counts, and the exact similarity percentage for the pair, giving you all the data needed to decide whether to consolidate, differentiate, or canonicalize duplicate pages. The actor also provides page titles in the output so you can quickly identify which pages are being compared.
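The shingling and Jaccard computation described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the actor's actual source; the tokenization details (lowercasing, punctuation handling) are assumptions:

```python
import re

def shingles(text, k=5):
    """Split text into the set of overlapping k-word shingles (w-shingling)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_percent(text_a, text_b, k=5):
    """Jaccard similarity of the two shingle sets, as a percentage."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a and not b:
        return 0.0  # both pages too short to produce any shingles
    return 100 * len(a & b) / len(a | b)
```

With thresholds applied the same way the actor describes, `jaccard_percent(a, b) >= 90` would flag a duplicate and `60 <= jaccard_percent(a, b) < 90` a near-duplicate.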
Use cases
- SEO specialist -- find pages on your site that compete against each other for the same keywords due to duplicate content
- Content manager -- identify redundant blog posts or landing pages that should be consolidated or differentiated
- Agency strategist -- run plagiarism checks to see if competitor sites have copied your client's unique content
- Migration engineer -- verify that content matches between old and new URLs after a site migration
- E-commerce manager -- detect product pages with nearly identical descriptions that need unique, differentiated copy
Why use Duplicate Content Checker?
- Pairwise comparison -- every URL is compared against every other URL, catching duplicates you might miss manually
- Shingling algorithm -- uses 5-word n-grams for accurate text similarity, not just word counts
- Clear thresholds -- automatically flags 90%+ as duplicate and 60-89% as near-duplicate
- Batch processing -- submit multiple URLs and get all pair comparisons in one run
- Structured JSON output -- results include similarity percentages, shingle counts, and word counts for detailed analysis
- Pay-per-event pricing -- a flat start fee plus a small fee for each URL pair compared
- Word count metrics -- includes total word counts for each page to help contextualize similarity percentages
- Page titles in output -- each result includes both page titles for quick identification of compared pages
- No configuration needed -- just provide URLs and the actor handles all text extraction, shingling, and similarity comparison automatically
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | array | Yes | -- | List of web page URLs to compare for duplicate content. Minimum 2 URLs required. Every pair of URLs will be compared. |
Input example
```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://en.wikipedia.org/wiki/Data_extraction",
    "https://example.com"
  ]
}
```
Output example
```json
{
  "urlA": "https://en.wikipedia.org/wiki/Web_scraping",
  "urlB": "https://en.wikipedia.org/wiki/Data_extraction",
  "similarityPercent": 12.5,
  "isDuplicate": false,
  "isNearDuplicate": false,
  "sharedShingles": 245,
  "totalShinglesA": 1200,
  "totalShinglesB": 980,
  "wordCountA": 3450,
  "wordCountB": 2800,
  "titleA": "Web scraping - Wikipedia",
  "titleB": "Data extraction - Wikipedia",
  "error": null,
  "checkedAt": "2026-03-01T12:00:00.000Z"
}
```
How much does it cost?
| Event | Price | Description |
|---|---|---|
| Start | $0.035 | One-time per run |
| Pair compared | $0.002 | Per URL pair compared |
Example costs:
- 3 URLs = 3 pairs: $0.035 + (3 x $0.002) = $0.041
- 5 URLs = 10 pairs: $0.035 + (10 x $0.002) = $0.055
- 10 URLs = 45 pairs: $0.035 + (45 x $0.002) = $0.125
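These totals follow a simple formula: N URLs yield N*(N-1)/2 pairs, so the total cost is the start fee plus the per-pair fee times the pair count. A quick sketch using the prices above:

```python
def run_cost(n_urls, start_fee=0.035, pair_fee=0.002):
    """Estimated cost in USD of one run: start fee plus a fee per URL pair."""
    pairs = n_urls * (n_urls - 1) // 2
    return start_fee + pairs * pair_fee

for n in (3, 5, 10):
    print(n, "URLs ->", round(run_cost(n), 3), "USD")
```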
Using the Apify API
You can call Duplicate Content Checker programmatically from your own applications using the Apify API. The following examples show how to start a run, wait for it to finish, and retrieve the results.
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/duplicate-content-checker').call({
    urls: ['https://example.com/page-1', 'https://example.com/page-2'],
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('automation-lab/duplicate-content-checker').call(run_input={
    'urls': ['https://example.com/page-1', 'https://example.com/page-2'],
})

items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```
Integrations
Duplicate Content Checker works with all major automation and integration platforms available on Apify. Connect it to Make (formerly Integromat) or Zapier to trigger content consolidation workflows when duplicates are found. Export results to Google Sheets for easy review with your content team. Send Slack alerts when new duplicate pages are detected during scheduled runs. Use webhooks to get instant notifications when a run finishes, or build custom pipelines with n8n. Schedule recurring runs on Apify to monitor your site for emerging duplicate content issues. All results are stored in Apify datasets and can be downloaded in JSON, CSV, or Excel format for further analysis.
Tips and best practices
- Start with pages targeting the same keyword -- these are the most likely candidates for harmful duplicate content
- Remember the pair count grows quickly -- 10 URLs produce 45 pairs, 20 URLs produce 190 pairs; keep URL lists focused
- Investigate near-duplicates (60-90%) -- these are often more harmful than exact duplicates because they are harder to spot manually
- Use results to plan canonical tags -- if two pages are duplicates, set one as the canonical and redirect or noindex the other
- Combine with keyword density analysis -- if two pages have similar content and target the same keywords, consolidate them
- Group URLs by topic cluster -- compare pages within the same category or topic to find internal keyword cannibalization issues
FAQ
How is similarity calculated? The actor uses w-shingling with 5-word n-grams and Jaccard similarity. It breaks each page's text into overlapping 5-word sequences, then calculates the ratio of shared sequences to total unique sequences across both pages.
What counts as a duplicate vs. near-duplicate? Pages with 90% or higher similarity are flagged as duplicates. Pages with 60-89% similarity are flagged as near-duplicates. Below 60% is considered unique content.
Can I compare pages from different websites? Yes. You can include URLs from any website. This is useful for plagiarism detection or checking if content has been syndicated without modification.
How many URLs can I compare in one run? There is no strict limit, but keep in mind that the number of pairs grows quadratically. For N URLs, the actor compares N*(N-1)/2 pairs. For example, 20 URLs produce 190 pairs. For large sets, consider breaking them into smaller groups of related pages.
Does the actor compare HTML or visible text? The actor extracts and compares the visible text content of each page, ignoring HTML tags, navigation menus, and boilerplate elements. This gives a more accurate measure of content similarity than raw HTML comparison.
Why use shingling instead of simple word matching? Shingling preserves word order, which is important for detecting actual content duplication. Two pages could have the same individual words in completely different arrangements, which simple word frequency comparison would miss. Shingling with 5-word n-grams catches passages that were genuinely copied or reused, providing a more accurate similarity measure.
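A small Python illustration of that point: two texts built from the same words in a different order look identical to a bag-of-words comparison but share almost no 5-word shingles. The helper names here are hypothetical, not part of the actor's API:

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def shingle_set(text, k=5):
    w = words(text)
    return {" ".join(w[i:i + k]) for i in range(len(w) - k + 1)}

a = "the quick brown fox jumps over the lazy sleeping dog"
b = "the lazy dog jumps over the sleeping quick brown fox"

# Same multiset of words, so a word-frequency comparison sees them as identical
print(Counter(words(a)) == Counter(words(b)))  # True

# But the 5-word shingle sets barely overlap, so shingling scores them as unique
sa, sb = shingle_set(a), shingle_set(b)
print(len(sa & sb), "shared shingles out of", len(sa | sb))
```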