Vach: Detect content theft

Detect content theft, AI rewrites, and LLM scraping exposure for your web pages. Input URLs or sitemaps, get semantic similarity scores, duplicate detection, and SEO risk reports per URL.

Pricing: from $0.10 / URL scan result
Developer: REXREUS D.O
Vach — AI Content Theft & Shadow Index Monitor
Apify Actor that detects content duplication, AI rewrites, and LLM scraping exposure for your web pages. Given a list of URLs, sitemaps, or domains, Vach crawls each page, generates a semantic fingerprint, searches for copies across the web, and produces a structured risk report per URL.
Scope: Vach detects text duplication on publicly crawlable web pages. It does not detect usage inside AI model weights. The LLM Exposure Risk Score is a probabilistic estimate based on indirect signals, not definitive proof.
How It Works
- Input resolution — URLs, sitemap XML, or bare domains are normalized into a flat URL list.
- Crawl & extract — Each page is rendered with Playwright and text is extracted via Mozilla Readability.
- Fingerprinting — A SimHash (64-bit) and a mean-pooled semantic embedding vector are computed per page.
- Shadow index scan — Representative phrases are searched via Bing/SerpAPI, and a curated list of known scraper domains is checked.
- Similarity analysis — Each candidate is crawled and compared using a composite score: (0.7 × cosine) + (0.3 × simhash_similarity).
- Risk scoring — LLM exposure risk, SEO cannibalization risk, and traffic/ranking displacement risk are calculated.
- Output — One JSON item per input URL is pushed to the Apify Dataset, plus a summary item.
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | (required) | Page URLs, sitemap XML URLs, or domains (e.g. `example.com`) |
| `similarity_threshold` | number | 0.80 | Minimum score (0.50–1.00) to flag a candidate as a confirmed duplicate |
| `max_urls` | integer | 50 | Maximum URLs to process per run (max: 500) |
| `embedding_provider` | `"local"` \| `"openai"` | `"local"` | `local` uses the bundled ONNX MiniLM model; `openai` uses `text-embedding-3-small` |
| `openai_api_key` | string | — | Required when `embedding_provider` is `"openai"` |
| `concurrency` | integer | 5 | Parallel crawl workers (max: 20) |
| `search_api_provider` | `"none"` \| `"bing"` \| `"serpapi"` | `"none"` | Enable search-engine scanning for broader duplicate discovery |
| `bing_api_key` | string | — | Required when `search_api_provider` is `"bing"` |
| `serpapi_key` | string | — | Required when `search_api_provider` is `"serpapi"` |
| `webhook_url` | string | — | HTTP POST endpoint notified each time a confirmed duplicate is found |
| `curated_domains` | string[] | — | Override the default curated scraper domain list |
| `llm_adjacent_domains` | string[] | — | Override the default LLM-adjacent domain list |
Example Input
```json
{
  "urls": [
    "https://myblog.com/sitemap.xml",
    "https://myblog.com/blog/my-best-article"
  ],
  "similarity_threshold": 0.80,
  "max_urls": 100,
  "embedding_provider": "local",
  "search_api_provider": "bing",
  "bing_api_key": "YOUR_BING_KEY",
  "concurrency": 5,
  "webhook_url": "https://hooks.myapp.com/content-alert"
}
```
Output
One JSON item is pushed to the Apify Dataset per input URL. A final summary item (with _type: "summary") is appended after all URLs are processed.
Example Output Item
```json
{
  "input_url": "https://myblog.com/blog/my-best-article",
  "scan_timestamp": "2024-06-01T14:32:00Z",
  "content_title": "My Best Article",
  "fingerprint_id": "a3f8c2d1e4b7...",
  "embedding_status": "success",
  "crawl_status": "success",
  "http_status_code": 200,
  "duplicate_candidates_found": 12,
  "confirmed_duplicates_count": 2,
  "duplicates": [
    {
      "url": "https://scraper-farm.com/copied-article",
      "similarity_score": 0.9412,
      "cosine_similarity": 0.9601,
      "simhash_distance": 4,
      "duplicate_type": "ai_rewrite",
      "is_confirmed_duplicate": true,
      "original_publisher": "source",
      "timestamp_confidence": "high",
      "source_published_date": "2024-03-15T10:00:00Z",
      "duplicate_published_date": "2024-04-20T09:00:00Z",
      "crawl_status": "success"
    }
  ],
  "llm_exposure_risk_score": 35,
  "llm_risk_level": "low",
  "llm_risk_factors": ["2 confirmed duplicates contribute base score"],
  "cannibalization_score": 47,
  "cannibalization_risk_level": "medium",
  "traffic_loss_risk": "medium",
  "ranking_displacement_risk": "medium",
  "threat_status": "threats_detected"
}
```
Example Summary Item
```json
{
  "_type": "summary",
  "total_urls_scanned": 10,
  "total_confirmed_duplicates": 5,
  "high_risk_urls_count": 1,
  "scan_duration_seconds": 142,
  "actor_version": "0.1.0"
}
```
Webhook Payload (per confirmed duplicate)
```json
{
  "event": "duplicate_found",
  "input_url": "https://myblog.com/blog/my-best-article",
  "duplicate_url": "https://scraper-farm.com/copied-article",
  "similarity_score": 0.9412,
  "duplicate_type": "ai_rewrite",
  "scan_timestamp": "2024-06-01T14:32:00Z"
}
```
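A receiving endpoint might triage these notifications along these lines; the 0.90 alert threshold and the function name are arbitrary examples, not part of the Actor:

```python
import json

def handle_alert(raw_body: str, notify_threshold: float = 0.90) -> bool:
    """Return True when a webhook payload warrants an urgent alert."""
    payload = json.loads(raw_body)
    if payload.get("event") != "duplicate_found":
        return False
    return payload.get("similarity_score", 0.0) >= notify_threshold
```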
Similarity Score & Duplicate Types
How the Score Is Calculated
```
similarity_score   = (0.7 × cosine_similarity) + (0.3 × simhash_similarity)
simhash_similarity = 1 - (hamming_distance / 64)
```
Both components range from 0.0 to 1.0. The final score is rounded to 4 decimal places.
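A direct transcription of these two formulas:

```python
def similarity_score(cosine_similarity: float, hamming_distance: int) -> float:
    """Composite duplicate score: 70% embedding cosine, 30% SimHash agreement.

    hamming_distance is the number of differing bits between two 64-bit SimHashes.
    """
    simhash_similarity = 1 - (hamming_distance / 64)
    return round(0.7 * cosine_similarity + 0.3 * simhash_similarity, 4)
```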
Duplicate Type Classification
| Type | Condition | Meaning |
|---|---|---|
| `exact_copy` | cosine ≥ 0.97 | Near-verbatim copy |
| `ai_rewrite` | cosine 0.80–0.96 AND simhash_distance ≥ 10 | Paraphrased or AI-rewritten copy |
| `summarization_reuse` | cosine 0.65–0.79 | Significant content reuse, possibly summarized |
| `partial_reuse` | cosine 0.50–0.64 | Partial overlap, may share key sections |
| `below_threshold` | cosine < 0.50 | Not considered a duplicate |
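As a sketch, the table maps to a simple cascade of threshold checks. One edge the table leaves unspecified is cosine 0.80–0.96 with simhash_distance < 10; since a low SimHash distance implies little literal editing, this sketch assumes that edge counts as a near-verbatim copy, which may not match the Actor's actual behavior:

```python
def classify_duplicate(cosine: float, simhash_distance: int) -> str:
    """Map a candidate's scores to a duplicate type per the table above."""
    if cosine >= 0.97:
        return "exact_copy"
    if cosine >= 0.80:
        # Assumption: low SimHash distance in this band means near-verbatim text.
        return "ai_rewrite" if simhash_distance >= 10 else "exact_copy"
    if cosine >= 0.65:
        return "summarization_reuse"
    if cosine >= 0.50:
        return "partial_reuse"
    return "below_threshold"
```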
A candidate is flagged as is_confirmed_duplicate: true only when similarity_score ≥ similarity_threshold.
Risk Scores Explained
LLM Exposure Risk Score (0–100)
Estimates the probability that your content has been ingested into an LLM training pipeline.
```
score = min(100, round(
    (llm_domain_hits × 20) +
    (avg_similarity_on_llm_domains × 30) +
    (confirmed_duplicate_count × 5)
))
```
- low: score < 40
- medium: score 40–69
- high: score ≥ 70
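Combining the formula and the level thresholds above:

```python
def llm_exposure_risk(llm_domain_hits: int,
                      avg_similarity_on_llm_domains: float,
                      confirmed_duplicate_count: int) -> tuple[int, str]:
    """Score (0-100) and level, per the LLM Exposure Risk formula above."""
    score = min(100, round(
        llm_domain_hits * 20
        + avg_similarity_on_llm_domains * 30
        + confirmed_duplicate_count * 5
    ))
    level = "high" if score >= 70 else "medium" if score >= 40 else "low"
    return score, level
```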
Cannibalization Score (0–100)
Estimates SEO impact from duplicate content competing for the same rankings.
score = min(100, round(confirmed_duplicate_count × 10 + avg_similarity_score × 50))
low: 0–33 | medium: 34–66 | high: 67–100
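The same formula and bands as executable logic:

```python
def cannibalization(confirmed_duplicate_count: int,
                    avg_similarity_score: float) -> tuple[int, str]:
    """Score (0-100) and level, per the Cannibalization Score formula above."""
    score = min(100, round(confirmed_duplicate_count * 10
                           + avg_similarity_score * 50))
    level = "high" if score >= 67 else "medium" if score >= 34 else "low"
    return score, level
```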
Traffic Loss Risk
Based on the number of confirmed duplicates: 0 → low, 1–3 → medium, ≥4 → high.
Ranking Displacement Risk
Based on the count of exact_copy or ai_rewrite duplicates: 0 → low, 1–2 → medium, ≥3 → high.
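Both of these last two risks are plain count bucketings, per the thresholds above:

```python
def traffic_loss_risk(confirmed_duplicates: int) -> str:
    """0 -> low, 1-3 -> medium, >=4 -> high."""
    return ("high" if confirmed_duplicates >= 4
            else "medium" if confirmed_duplicates >= 1
            else "low")

def ranking_displacement_risk(exact_or_rewrite_count: int) -> str:
    """Count of exact_copy or ai_rewrite duplicates: 0 -> low, 1-2 -> medium, >=3 -> high."""
    return ("high" if exact_or_rewrite_count >= 3
            else "medium" if exact_or_rewrite_count >= 1
            else "low")
```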
Limitations
- Authentication & paywalls: Pages behind login, paywalls, or heavy client-side rendering may not be fully crawlable.
- Timestamp inference is probabilistic: Published date metadata can be manipulated or absent. Results marked `timestamp_confidence: "unknown"` or `"low"` should be treated as estimates only.
- Similarity score is an indicator, not legal proof: A high score suggests duplication but does not constitute evidence for legal action without manual verification.
- LLM exposure is indirect: The LLM Exposure Risk Score is based on whether duplicates appear on domains associated with training data collection — it does not confirm your content is inside any specific model.
- Search API quota: Bing free tier allows ~1,000 requests/month (~333 URLs at 3 phrases each). For larger volumes, use a paid tier or SerpAPI.
- Scale: A single actor run supports up to 500 input URLs. For larger crawls, run multiple actors in parallel via an orchestrator.
- ONNX model not included in repo: The `models/all-MiniLM-L6-v2.onnx` file must be downloaded separately and placed in the `models/` directory before building the Docker image.