Vach : Detect content theft

Detect content theft, AI rewrites, and LLM scraping exposure for your web pages. Input URLs or sitemaps, get semantic similarity scores, duplicate detection, and SEO risk reports per URL.

Pricing: from $0.10 / URL scan result
Rating: 5.0 (1)
Developer: REXREUS D.O (Maintained by Community)
Actor stats: 1 bookmarked · 2 total users · 1 monthly active user · last modified 5 days ago

Vach — AI Content Theft & Shadow Index Monitor

Apify Actor that detects content duplication, AI rewrites, and LLM scraping exposure for your web pages. Given a list of URLs, sitemaps, or domains, Vach crawls each page, generates a semantic fingerprint, searches for copies across the web, and produces a structured risk report per URL.

Scope: Vach detects text duplication on publicly crawlable web pages. It does not detect usage inside AI model weights. The LLM Exposure Risk Score is a probabilistic estimate based on indirect signals, not definitive proof.


How It Works

  1. Input resolution — URLs, sitemap XML, or bare domains are normalized into a flat URL list.
  2. Crawl & extract — Each page is rendered with Playwright and text is extracted via Mozilla Readability.
  3. Fingerprinting — A SimHash (64-bit) and a mean-pooled semantic embedding vector are computed per page.
  4. Shadow index scan — Representative phrases are searched via Bing/SerpAPI, and a curated list of known scraper domains is checked.
  5. Similarity analysis — Each candidate is crawled and compared using a composite score: (0.7 × cosine) + (0.3 × simhash_similarity).
  6. Risk scoring — LLM exposure risk, SEO cannibalization risk, and traffic/ranking displacement risk are calculated.
  7. Output — One JSON item per input URL is pushed to the Apify Dataset, plus a summary item.
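Step 3's 64-bit SimHash can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and an MD5-derived per-token hash; the actor's actual tokenizer and hash function may differ:

```python
import hashlib

def simhash64(text: str) -> int:
    """Compute a 64-bit SimHash over whitespace-separated tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        # Derive a stable 64-bit hash per token from the first 8 MD5 bytes.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated weight.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near-duplicate pages yield fingerprints with a small Hamming distance, which feeds the simhash_similarity term used in step 5.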

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | `string[]` | (required) | Page URLs, sitemap XML URLs, or domains (e.g. `example.com`) |
| `similarity_threshold` | `number` | `0.80` | Minimum score (0.50–1.00) to flag a candidate as a confirmed duplicate |
| `max_urls` | `integer` | `50` | Maximum URLs to process per run (max: 500) |
| `embedding_provider` | `"local"` \| `"openai"` | `"local"` | `local` uses the bundled ONNX MiniLM model; `openai` uses `text-embedding-3-small` |
| `openai_api_key` | `string` | (none) | Required when `embedding_provider` is `"openai"` |
| `concurrency` | `integer` | `5` | Parallel crawl workers (max: 20) |
| `search_api_provider` | `"none"` \| `"bing"` \| `"serpapi"` | `"none"` | Enable search engine scanning for broader duplicate discovery |
| `bing_api_key` | `string` | (none) | Required when `search_api_provider` is `"bing"` |
| `serpapi_key` | `string` | (none) | Required when `search_api_provider` is `"serpapi"` |
| `webhook_url` | `string` | (none) | HTTP POST endpoint notified each time a confirmed duplicate is found |
| `curated_domains` | `string[]` | (none) | Override the default curated scraper domain list |
| `llm_adjacent_domains` | `string[]` | (none) | Override the default LLM-adjacent domain list |

Example Input

{
  "urls": [
    "https://myblog.com/sitemap.xml",
    "https://myblog.com/blog/my-best-article"
  ],
  "similarity_threshold": 0.80,
  "max_urls": 100,
  "embedding_provider": "local",
  "search_api_provider": "bing",
  "bing_api_key": "YOUR_BING_KEY",
  "concurrency": 5,
  "webhook_url": "https://hooks.myapp.com/content-alert"
}

Output

One JSON item is pushed to the Apify Dataset per input URL. A final summary item (with _type: "summary") is appended after all URLs are processed.

Example Output Item

{
  "input_url": "https://myblog.com/blog/my-best-article",
  "scan_timestamp": "2024-06-01T14:32:00Z",
  "content_title": "My Best Article",
  "fingerprint_id": "a3f8c2d1e4b7...",
  "embedding_status": "success",
  "crawl_status": "success",
  "http_status_code": 200,
  "duplicate_candidates_found": 12,
  "confirmed_duplicates_count": 2,
  "duplicates": [
    {
      "url": "https://scraper-farm.com/copied-article",
      "similarity_score": 0.9412,
      "cosine_similarity": 0.9601,
      "simhash_distance": 4,
      "duplicate_type": "ai_rewrite",
      "is_confirmed_duplicate": true,
      "original_publisher": "source",
      "timestamp_confidence": "high",
      "source_published_date": "2024-03-15T10:00:00Z",
      "duplicate_published_date": "2024-04-20T09:00:00Z",
      "crawl_status": "success"
    }
  ],
  "llm_exposure_risk_score": 35,
  "llm_risk_level": "low",
  "llm_risk_factors": ["2 confirmed duplicates contribute base score"],
  "cannibalization_score": 47,
  "cannibalization_risk_level": "medium",
  "traffic_loss_risk": "medium",
  "ranking_displacement_risk": "medium",
  "threat_status": "threats_detected"
}

Example Summary Item

{
  "_type": "summary",
  "total_urls_scanned": 10,
  "total_confirmed_duplicates": 5,
  "high_risk_urls_count": 1,
  "scan_duration_seconds": 142,
  "actor_version": "0.1.0"
}

Webhook Payload (per confirmed duplicate)

{
  "event": "duplicate_found",
  "input_url": "https://myblog.com/blog/my-best-article",
  "duplicate_url": "https://scraper-farm.com/copied-article",
  "similarity_score": 0.9412,
  "duplicate_type": "ai_rewrite",
  "scan_timestamp": "2024-06-01T14:32:00Z"
}

Similarity Score & Duplicate Types

How the Score Is Calculated

similarity_score = (0.7 × cosine_similarity) + (0.3 × simhash_similarity)
simhash_similarity = 1 - (hamming_distance / 64)

Both components range from 0.0 to 1.0. The final score is rounded to 4 decimal places.
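The composite formula translates directly to code (a sketch; `cosine_similarity` is taken as given here, whereas the actor derives it from the page embeddings):

```python
def composite_score(cosine_similarity: float, hamming_distance: int) -> float:
    """Combine embedding cosine similarity with 64-bit SimHash similarity."""
    simhash_similarity = 1 - (hamming_distance / 64)
    return round(0.7 * cosine_similarity + 0.3 * simhash_similarity, 4)
```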

Duplicate Type Classification

| Type | Condition | Meaning |
|---|---|---|
| `exact_copy` | cosine ≥ 0.97 | Near-verbatim copy |
| `ai_rewrite` | cosine 0.80–0.96 AND simhash_distance ≥ 10 | Paraphrased or AI-rewritten copy |
| `summarization_reuse` | cosine 0.65–0.79 | Significant content reuse, possibly summarized |
| `partial_reuse` | cosine 0.50–0.64 | Partial overlap, may share key sections |
| `below_threshold` | cosine < 0.50 | Not considered a duplicate |

A candidate is flagged as is_confirmed_duplicate: true only when similarity_score ≥ similarity_threshold.
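Under the table's boundaries, classification can be sketched as below. One assumption: the table leaves the case of cosine ≥ 0.80 with a small SimHash distance unspecified, so this sketch lets it fall through to the next band:

```python
def classify_duplicate(cosine: float, simhash_distance: int) -> str:
    """Map cosine similarity and SimHash distance to a duplicate_type."""
    if cosine >= 0.97:
        return "exact_copy"
    if cosine >= 0.80 and simhash_distance >= 10:
        return "ai_rewrite"
    # High cosine but low SimHash distance (unspecified in the table)
    # falls through to the reuse bands in this sketch.
    if cosine >= 0.65:
        return "summarization_reuse"
    if cosine >= 0.50:
        return "partial_reuse"
    return "below_threshold"
```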


Risk Scores Explained

LLM Exposure Risk Score (0–100)

Estimates the probability that your content has been ingested into an LLM training pipeline.

score = min(100, round(
(llm_domain_hits × 20) +
(avg_similarity_on_llm_domains × 30) +
(confirmed_duplicate_count × 5)
))
  • low: score < 40
  • medium: score 40–69
  • high: score ≥ 70
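The scoring rule and level thresholds above amount to a short function (a sketch; the three inputs come from the shadow index scan):

```python
def llm_exposure_risk(llm_domain_hits: int,
                      avg_similarity_on_llm_domains: float,
                      confirmed_duplicate_count: int) -> tuple[int, str]:
    """Return (score 0-100, risk level) per the weighted formula above."""
    score = min(100, round(llm_domain_hits * 20
                           + avg_similarity_on_llm_domains * 30
                           + confirmed_duplicate_count * 5))
    level = "high" if score >= 70 else "medium" if score >= 40 else "low"
    return score, level
```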

Cannibalization Score (0–100)

Estimates SEO impact from duplicate content competing for the same rankings.

score = min(100, round(confirmed_duplicate_count × 10 + avg_similarity_score × 50))
  • low: 0–33 | medium: 34–66 | high: 67–100
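The same pattern applies to the cannibalization score (a sketch of the formula and bands above):

```python
def cannibalization(confirmed_duplicate_count: int,
                    avg_similarity_score: float) -> tuple[int, str]:
    """Return (score 0-100, risk level) for SEO cannibalization."""
    score = min(100, round(confirmed_duplicate_count * 10
                           + avg_similarity_score * 50))
    level = "high" if score >= 67 else "medium" if score >= 34 else "low"
    return score, level
```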

Traffic Loss Risk

Based on the number of confirmed duplicates: 0 → low, 1–3 → medium, ≥4 → high.

Ranking Displacement Risk

Based on the count of exact_copy or ai_rewrite duplicates: 0 → low, 1–2 → medium, ≥3 → high.
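Both remaining risks are simple count-to-level mappings, which a sketch makes explicit:

```python
def traffic_loss_risk(confirmed_duplicates: int) -> str:
    """0 duplicates -> low, 1-3 -> medium, 4 or more -> high."""
    if confirmed_duplicates >= 4:
        return "high"
    return "medium" if confirmed_duplicates >= 1 else "low"

def ranking_displacement_risk(exact_or_rewrite_count: int) -> str:
    """Counts only exact_copy and ai_rewrite duplicates."""
    if exact_or_rewrite_count >= 3:
        return "high"
    return "medium" if exact_or_rewrite_count >= 1 else "low"
```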


Limitations

  • Authentication & paywalls: Pages behind login, paywalls, or heavy client-side rendering may not be fully crawlable.
  • Timestamp inference is probabilistic: Published date metadata can be manipulated or absent. Results marked timestamp_confidence: "unknown" or "low" should be treated as estimates only.
  • Similarity score is an indicator, not legal proof: A high score suggests duplication but does not constitute evidence for legal action without manual verification.
  • LLM exposure is indirect: The LLM Exposure Risk Score is based on whether duplicates appear on domains associated with training data collection — it does not confirm your content is inside any specific model.
  • Search API quota: Bing free tier allows ~1,000 requests/month (~333 URLs at 3 phrases each). For larger volumes, use a paid tier or SerpAPI.
  • Scale: A single actor run supports up to 500 input URLs. For larger crawls, run multiple actors in parallel via an orchestrator.
  • ONNX model not included in repo: The models/all-MiniLM-L6-v2.onnx file must be downloaded separately and placed in the models/ directory before building the Docker image.