SEO Duplicate Content Detector

Pricing

Pay per usage

Detects duplicate or identical content across multiple webpages by analyzing visible page text. Helps identify SEO duplicate content issues, content reuse, and potential ranking risks using simple content comparison and scoring.

Developer

Gautam Rana

Maintained by Community

An API that detects duplicate or identical content across multiple webpages by analyzing visible page text and generating a duplication score.


Description

SEO Duplicate Content Detector analyzes the visible textual content of multiple webpages and compares them to identify duplicate or highly similar content. It helps uncover SEO issues related to content reuse, near-duplicate pages, and potential ranking risks caused by content redundancy.

The API performs basic text normalization, hashing, and similarity comparison to determine whether pages share identical or substantially similar content, and reports duplication metrics in a structured dataset.
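
The normalization, hashing, and comparison steps described above could look roughly like the following. This is a minimal sketch, not the Actor's actual code: the function names are hypothetical, SHA-1 is assumed only because the sample contentHash in the Dataset Schema section is 40 hexadecimal characters long, and difflib.SequenceMatcher stands in for whatever comparison the Actor really uses.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize_text(text: str) -> str:
    """Lowercase the text and collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_hash(text: str) -> str:
    """Hash the normalized text. SHA-1 is an assumption, not a confirmed detail."""
    return hashlib.sha1(normalize_text(text).encode("utf-8")).hexdigest()


def duplication_score(text_a: str, text_b: str) -> int:
    """Return a 0-100 similarity score for two texts (hypothetical scoring)."""
    a, b = normalize_text(text_a), normalize_text(text_b)
    if a == b:
        return 100  # identical after normalization, so the hashes match too
    return round(SequenceMatcher(None, a, b).ratio() * 100)


page_1 = "Welcome to  our shop.\nFree shipping on all orders."
page_2 = "welcome to our shop. free shipping on all orders."
page_3 = "Contact us by phone or email."

print(content_hash(page_1) == content_hash(page_2))  # True
print(duplication_score(page_1, page_2))             # 100
print(duplication_score(page_1, page_3))             # some value below 100
```

Matching hashes catch exact duplicates cheaply, while the pairwise ratio gives a graded score for pages that differ slightly.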

This API is useful for:

  • SEO audits and website quality checks
  • Identifying content reuse across pages
  • Detecting thin or duplicated pages
  • Developers building SEO monitoring tools
  • Automation workflows for large-scale content analysis

Features

  • Extracts visible text content from webpages
  • Generates a content hash for comparison
  • Detects exact and near-duplicate pages
  • Identifies which URLs share duplicated content
  • Calculates content length
  • Computes a duplication score (percentage)
  • Returns structured JSON output
  • Supports multiple URLs per request
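
As an illustration of the first feature, visible text can be pulled out of HTML by skipping elements that browsers do not render. The sketch below uses Python's standard html.parser as a stand-in, since the Tech Stack section names parsers only as examples; the class and function names are hypothetical, and elements hidden through CSS are not detected.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping elements that are never displayed."""

    HIDDEN_TAGS = {"script", "style", "noscript", "template", "head"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0  # > 0 while inside any hidden element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        words = data.split()  # also collapses inner whitespace
        if self._hidden_depth == 0 and words:
            self._chunks.append(" ".join(words))

    def text(self):
        return " ".join(self._chunks)


def extract_visible_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample_html = """<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Hello</h1><script>var x = 1;</script><p>Visible   text</p></body></html>"""

print(extract_visible_text(sample_html))  # Hello Visible text
```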

Tech Stack

  • Platform: Apify Actor
  • Language: (your implementation language)
  • HTTP Client / Crawler: (e.g., Crawlee, Axios, Requests)
  • HTML Parsing: (e.g., Cheerio, BeautifulSoup)
  • Similarity Logic: Hashing / basic text comparison

Input Format

The API accepts a JSON input with a list of webpage URLs to analyze.

Example input.json

{
  "startUrls": [
    { "url": "https://github.com/Gautamrana14" },
    { "url": "https://github.com/Gautamrana14?tab=repositories" },
    { "url": "https://github.com/Gautamrana14?tab=overview" }
  ]
}
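
When the URL list comes from another source, the input can be generated rather than written by hand. The helper below is a hypothetical convenience, not part of the Actor: it builds the startUrls structure shown above and drops entries that are repeated or are not absolute http(s) URLs.

```python
import json
from urllib.parse import urlsplit


def build_actor_input(urls):
    """Return {"startUrls": [...]}, skipping invalid or repeated URLs."""
    seen = set()
    start_urls = []
    for raw in urls:
        url = raw.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # not an absolute web URL
        if url in seen:
            continue  # already queued
        seen.add(url)
        start_urls.append({"url": url})
    return {"startUrls": start_urls}


actor_input = build_actor_input([
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page1",  # repeated, dropped
    "not-a-url",                  # invalid, dropped
])
print(json.dumps(actor_input, indent=2))
```

Saving the result with json.dump to a file named input.json produces a file in the format this section describes.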

Usage

Base Endpoint (Apify Actor)

https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items

Example Request (cURL)

curl -X POST \
  -H "Content-Type: application/json" \
  -d @input.json \
  "https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items?token=YOUR_API_TOKEN"

Example using JavaScript

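// Run as an ES module (.mjs or "type": "module") so top-level await works.
// Node.js 18+ ships a global fetch; the import is only needed on older versions.
import fetch from "node-fetch";

const response = await fetch(
  "https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items?token=YOUR_API_TOKEN",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      startUrls: [
        { url: "https://example.com/page1" },
        { url: "https://example.com/page2" }
      ]
    })
  }
);

// Fail loudly on HTTP errors instead of treating an error body as results.
if (!response.ok) {
  throw new Error(`Request failed with status ${response.status}`);
}

const data = await response.json();
console.log(data);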

Example using Python

import requests

url = "https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items"
params = {"token": "YOUR_API_TOKEN"}
payload = {
    "startUrls": [
        {"url": "https://example.com/page1"},
        {"url": "https://example.com/page2"},
    ]
}

# The call blocks until the Actor run finishes, so allow a generous timeout.
response = requests.post(url, params=params, json=payload, timeout=300)
response.raise_for_status()  # raise an exception on HTTP errors
print(response.json())

Output Format

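The Actor's output is a link to the dataset that contains the analysis results. When the Actor is called through the run-sync-get-dataset-items endpoint shown in the Usage section, the response body is the list of dataset items itself.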

Output Schema

{
  "dataset": "https://api.apify.com/v2/datasets/xxxx/items"
}

Dataset Schema

Each dataset item has the following structure:

{
  "url": "https://example.com/page1",
  "contentHash": "a94a8fe5ccb19ba61c4c0873d391e987982fbbd3",
  "isDuplicate": true,
  "duplicateWith": [
    "https://example.com/page2"
  ],
  "contentLength": 3450,
  "duplicationScore": 92
}

Field Description

| Field | Description |
| --- | --- |
| url | URL of the analyzed webpage |
| contentHash | Hash generated from the visible text content |
| isDuplicate | Indicates whether duplicate content was detected |
| duplicateWith | List of URLs with matching or similar content |
| contentLength | Length of the extracted text content |
| duplicationScore | Similarity score as a percentage (0–100) |
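
Once the items are downloaded, the fields above are enough to list which pages duplicate each other. The following sketch is a hypothetical post-processing step, not something the Actor provides: the function name and the min_score threshold are assumptions, and the sample items are illustrative values that only mirror the documented fields.

```python
def duplicate_groups(items, min_score=80):
    """Group each flagged page with the URLs in its duplicateWith list.

    Pages already placed in a group are skipped, so mutual duplicates
    are reported once. min_score is a hypothetical cut-off.
    """
    groups = []
    assigned = set()
    for item in items:
        if not item.get("isDuplicate"):
            continue
        if item.get("duplicationScore", 0) < min_score:
            continue
        if item["url"] in assigned:
            continue
        group = sorted({item["url"], *item.get("duplicateWith", [])})
        assigned.update(group)
        groups.append(group)
    return groups


# Illustrative items containing only the fields this sketch reads.
items = [
    {"url": "https://example.com/page1", "isDuplicate": True,
     "duplicateWith": ["https://example.com/page2"], "duplicationScore": 92},
    {"url": "https://example.com/page2", "isDuplicate": True,
     "duplicateWith": ["https://example.com/page1"], "duplicationScore": 92},
    {"url": "https://example.com/page3", "isDuplicate": False,
     "duplicateWith": [], "duplicationScore": 0},
]

for group in duplicate_groups(items):
    print(" <-> ".join(group))
# https://example.com/page1 <-> https://example.com/page2
```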

Limitations

  • Only analyzes visible text content
  • Does not detect plagiarism or semantic similarity
  • Content rendered dynamically by JavaScript may not be captured, which can reduce accuracy
  • Not a replacement for full SEO audit tools

Rate Limits

Rate limits depend on your Apify plan and Actor configuration.


Roadmap

  • Near-duplicate detection using similarity algorithms
  • Semantic similarity scoring
  • Content clustering
  • Domain-wide crawling support
  • Export reports in CSV and JSON formats
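
The roadmap does not name a similarity algorithm. One common technique for near-duplicate detection is Jaccard similarity over word shingles, sketched below purely as an illustration of that technique; it is not a statement about what the Actor will implement, and the function names and shingle size are assumptions.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = text.lower().split()
    if len(words) < size:
        # Short texts become a single shingle; empty text has none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Jaccard index of the two shingle sets, between 0.0 and 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the quick brown fox leaps over the lazy dog"

# 4 shared shingles out of 10 distinct ones
print(jaccard_similarity(text_a, text_b))  # 0.4
```

A single changed word affects every shingle that contains it, so larger shingle sizes make the score stricter.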

Contributing

Contributions are welcome.

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Open a pull request

License

MIT License


Author

Gautam Rana
GitHub: https://github.com/Gautamrana14