SEO Duplicate Content Detector
Pricing
Pay per usage
Detects duplicate or identical content across multiple webpages by analyzing visible page text. Helps identify SEO duplicate content issues, content reuse, and potential ranking risks using simple content comparison and scoring.
Developer
Gautam Rana
An API that detects duplicate or identical content across multiple webpages by analyzing visible page text and generating a duplication score.
Description
SEO Duplicate Content Detector analyzes the visible textual content of multiple webpages and compares them to identify duplicate or highly similar content. It helps uncover SEO issues related to content reuse, near-duplicate pages, and potential ranking risks caused by content redundancy.
The API performs basic text normalization, hashing, and similarity comparison to determine whether pages share identical or substantially similar content, and reports duplication metrics in a structured dataset.
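The normalize, hash, and compare steps described above might look roughly like the following. This is a minimal sketch, not the Actor's actual implementation: the function names are hypothetical, SHA-1 is assumed only because the sample `contentHash` further down is 40 hex characters long, and `difflib.SequenceMatcher` is swapped in as one simple way to get a similarity percentage.

```python
import hashlib
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and collapse every run of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_hash(text: str) -> str:
    """SHA-1 hex digest of the normalized text (the hash choice is an assumption)."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def duplication_score(a: str, b: str) -> int:
    """Similarity of two texts as a whole-number percentage from 0 to 100."""
    if content_hash(a) == content_hash(b):
        return 100  # identical after normalization
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return round(ratio * 100)
```

Comparing hashes first gives a cheap exact-match check; the character-level comparison only runs when the hashes differ.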
This API is useful for:
- SEO audits and website quality checks
- Identifying content reuse across pages
- Detecting thin or duplicated pages
- Developers building SEO monitoring tools
- Automation workflows for large-scale content analysis
Features
- Extracts visible text content from webpages
- Generates a content hash for comparison
- Detects exact and near-duplicate pages
- Identifies which URLs share duplicated content
- Calculates content length
- Computes a duplication score (percentage)
- Returns structured JSON output
- Supports multiple URLs per request
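As an illustration of the first feature, visible text can be pulled out of HTML with Python's standard `html.parser` module. This is a sketch under assumptions: the class and function names are hypothetical, the set of skipped tags is a guess at what "visible" means, and the Actor itself may use a different parser.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping elements that are not rendered as page text."""

    # Assumed set of non-visible elements; adjust to taste.
    HIDDEN_TAGS = {"script", "style", "noscript", "template", "title"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text outside hidden elements, with whitespace collapsed.
        words = data.split()
        if self._hidden_depth == 0 and words:
            self._chunks.append(" ".join(words))

    def text(self):
        return " ".join(self._chunks)


def visible_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The sketch only sees the HTML it is given, so text that a page adds later with JavaScript would not be included.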
Tech Stack
- Platform: Apify Actor
- Language: (your implementation language)
- HTTP Client / Crawler: (e.g., Crawlee, Axios, Requests)
- HTML Parsing: (e.g., Cheerio, BeautifulSoup)
- Similarity Logic: Hashing / basic text comparison
Input Format
The API accepts a JSON input with a list of webpage URLs to analyze.
Example input.json
    {
      "startUrls": [
        { "url": "https://github.com/Gautamrana14" },
        { "url": "https://github.com/Gautamrana14?tab=repositories" },
        { "url": "https://github.com/Gautamrana14?tab=overview" }
      ]
    }
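When the URL list comes from another script, the input can be generated rather than written by hand. The helper below is a hypothetical sketch: `build_input` is not part of the Actor, and dropping blank and repeated URLs is a design choice made here, not documented behavior.

```python
import json


def build_input(urls):
    """Return the Actor input dict: {"startUrls": [{"url": ...}, ...]}."""
    seen = set()
    start_urls = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:  # skip blanks and repeated URLs
            seen.add(url)
            start_urls.append({"url": url})
    return {"startUrls": start_urls}


if __name__ == "__main__":
    payload = build_input([
        "https://example.com/page1",
        "https://example.com/page2",
    ])
    # Print the JSON; redirect it to input.json to use it with cURL.
    print(json.dumps(payload, indent=2))
```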
Usage
Base Endpoint (Apify Actor)
https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items
Example Request (cURL)
    curl -X POST \
      -H "Content-Type: application/json" \
      -d @input.json \
      "https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items?token=YOUR_API_TOKEN"
Example using JavaScript
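    // Requires Node.js with ES modules enabled (for top-level await).
    import fetch from "node-fetch";

    const response = await fetch(
      "https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items?token=YOUR_API_TOKEN",
      {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          startUrls: [
            { url: "https://example.com/page1" },
            { url: "https://example.com/page2" }
          ]
        })
      }
    );

    // Fail loudly on HTTP errors instead of parsing an error body.
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }

    const data = await response.json();
    console.log(data);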
Example using Python
    import requests

    url = "https://api.apify.com/v2/acts/<your-actor-id>/run-sync-get-dataset-items"
    params = {"token": "YOUR_API_TOKEN"}
    payload = {
        "startUrls": [
            {"url": "https://example.com/page1"},
            {"url": "https://example.com/page2"},
        ]
    }

    # Start the Actor run and wait for the results.
    response = requests.post(url, params=params, json=payload)
    response.raise_for_status()  # raise on HTTP errors
    print(response.json())
Output Format
The API returns a dataset URL containing the analysis results.
Output Schema
{"dataset": "https://api.apify.com/v2/datasets/xxxx/items"}
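Given an output record of this shape, the items could be loaded by following the `dataset` URL. The sketch below assumes that the URL returns a JSON array and is readable without extra credentials; a private dataset would presumably also need an API token. `load_items` and `http_get_json` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def http_get_json(url: str):
    """Default fetcher: GET the URL and decode the JSON body."""
    with urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def load_items(output: dict, fetch=http_get_json) -> list:
    """Follow the "dataset" URL in the output record and return its items."""
    dataset_url = output.get("dataset")
    if not dataset_url:
        raise ValueError('output record has no "dataset" URL')
    items = fetch(dataset_url)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of dataset items")
    return items
```

Passing the fetcher as a parameter keeps the function easy to test without network access.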
Dataset Schema
Each dataset item has the following structure:
    {
      "url": "https://example.com/page1",
      "contentHash": "a94a8fe5ccb19ba61c4c0873d391e987982fbbd3",
      "isDuplicate": true,
      "duplicateWith": ["https://example.com/page2"],
      "contentLength": 3450,
      "duplicationScore": 92
    }
Field Description
| Field | Description |
|---|---|
| url | URL of the analyzed page |
| contentHash | Hash generated from visible text content |
| isDuplicate | Indicates whether duplicate content was detected |
| duplicateWith | List of URLs with matching or similar content |
| contentLength | Length of extracted text content |
| duplicationScore | Similarity score in percentage (0–100) |
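One way to consume these fields is to collapse the per-URL items into groups of pages that duplicate one another. The following is a sketch only: `duplicate_groups` is a hypothetical helper, and the default threshold of 80 is an arbitrary choice, not a value recommended by the Actor.

```python
def duplicate_groups(items, min_score=80):
    """Group URLs that the dataset reports as duplicates of one another."""
    groups = []  # list of disjoint sets of URLs
    for item in items:
        if not item.get("isDuplicate"):
            continue
        if item.get("duplicationScore", 0) < min_score:
            continue
        members = {item["url"], *item.get("duplicateWith", [])}
        # Merge with every existing group that shares at least one URL.
        overlapping = [g for g in groups if g & members]
        for g in overlapping:
            members |= g
            groups.remove(g)
        groups.append(members)
    return [sorted(g) for g in groups]
```

For example, two items that list each other in `duplicateWith` end up in a single group, so each cluster of duplicated pages is reported once.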
Limitations
- Only analyzes visible text content
- Does not detect plagiarism or semantic similarity
- Pages that render content dynamically with JavaScript may be analyzed less accurately
- Not a replacement for full SEO audit tools
Rate Limits
Rate limits depend on your Apify plan and Actor configuration.
Roadmap
- Near-duplicate detection using similarity algorithms
- Semantic similarity scoring
- Content clustering
- Domain-wide crawling support
- Export reports in CSV and JSON formats
Contributing
Contributions are welcome.
- Fork the repository
- Create a feature branch
- Commit your changes
- Open a pull request
License
MIT License
Author
Gautam Rana
GitHub: https://github.com/Gautamrana14