Vach : Detect content theft

Detect content theft, AI rewrites, and LLM scraping exposure for your web pages. Input URLs or sitemaps, get semantic similarity scores, duplicate detection, and SEO risk reports per URL.

Pricing: from $0.10 / URL scan result
Rating: 5.0 (1)
Developer: REXREUS D.O (Maintained by Community)
Actor stats: 1 bookmarked · 2 total users · 1 monthly active user · last modified 5 days ago

Vach — AI Content Theft & Shadow Index Monitor

Apify Actor that detects content duplication, AI rewrites, and LLM scraping exposure for your web pages. Given a list of URLs, sitemaps, or domains, Vach crawls each page, generates a semantic fingerprint, searches for copies across the web, and produces a structured risk report per URL.

Scope: Vach detects text duplication on publicly crawlable web pages. It does not detect usage inside AI model weights. The LLM Exposure Risk Score is a probabilistic estimate based on indirect signals, not definitive proof.


How It Works

  1. Input resolution — URLs, sitemap XML, or bare domains are normalized into a flat URL list.
  2. Crawl & extract — Each page is rendered with Playwright and text is extracted via Mozilla Readability.
  3. Fingerprinting — A SimHash (64-bit) and a mean-pooled semantic embedding vector are computed per page.
  4. Shadow index scan — Representative phrases are searched via Bing/SerpAPI, and a curated list of known scraper domains is checked.
  5. Similarity analysis — Each candidate is crawled and compared using a composite score: (0.7 × cosine) + (0.3 × simhash_similarity).
  6. Risk scoring — LLM exposure risk, SEO cannibalization risk, and traffic/ranking displacement risk are calculated.
  7. Output — One JSON item per input URL is pushed to the Apify Dataset, plus a summary item.
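Step 3's 64-bit SimHash can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and an MD5-derived per-token hash; the actor's actual tokenizer and hash function may differ:

```python
import hashlib

def simhash64(text: str) -> int:
    """Compute a 64-bit SimHash over whitespace-separated tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        # Derive a stable 64-bit hash per token from the first 8 MD5 bytes.
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # Each fingerprint bit is the sign of the accumulated weight.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near-duplicate pages yield fingerprints with a small Hamming distance, which feeds the simhash_similarity term used in step 5.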

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `urls` | `string[]` | (required) | Page URLs, sitemap XML URLs, or domains (e.g. `example.com`) |
| `similarity_threshold` | `number` | `0.80` | Minimum score (0.50–1.00) to flag a candidate as a confirmed duplicate |
| `max_urls` | `integer` | `50` | Maximum URLs to process per run (max: 500) |
| `embedding_provider` | `"local"` \| `"openai"` | `"local"` | `local` uses the bundled ONNX MiniLM model; `openai` uses `text-embedding-3-small` |
| `openai_api_key` | `string` | (none) | Required when `embedding_provider` is `"openai"` |
| `concurrency` | `integer` | `5` | Parallel crawl workers (max: 20) |
| `search_api_provider` | `"none"` \| `"bing"` \| `"serpapi"` | `"none"` | Enable search engine scanning for broader duplicate discovery |
| `bing_api_key` | `string` | (none) | Required when `search_api_provider` is `"bing"` |
| `serpapi_key` | `string` | (none) | Required when `search_api_provider` is `"serpapi"` |
| `webhook_url` | `string` | (none) | HTTP POST endpoint notified each time a confirmed duplicate is found |
| `curated_domains` | `string[]` | (none) | Override the default curated scraper domain list |
| `llm_adjacent_domains` | `string[]` | (none) | Override the default LLM-adjacent domain list |

Example Input

{
  "urls": [
    "https://myblog.com/sitemap.xml",
    "https://myblog.com/blog/my-best-article"
  ],
  "similarity_threshold": 0.80,
  "max_urls": 100,
  "embedding_provider": "local",
  "search_api_provider": "bing",
  "bing_api_key": "YOUR_BING_KEY",
  "concurrency": 5,
  "webhook_url": "https://hooks.myapp.com/content-alert"
}

Output

One JSON item is pushed to the Apify Dataset per input URL. A final summary item (with _type: "summary") is appended after all URLs are processed.

Example Output Item

{
  "input_url": "https://myblog.com/blog/my-best-article",
  "scan_timestamp": "2024-06-01T14:32:00Z",
  "content_title": "My Best Article",
  "fingerprint_id": "a3f8c2d1e4b7...",
  "embedding_status": "success",
  "crawl_status": "success",
  "http_status_code": 200,
  "duplicate_candidates_found": 12,
  "confirmed_duplicates_count": 2,
  "duplicates": [
    {
      "url": "https://scraper-farm.com/copied-article",
      "similarity_score": 0.9412,
      "cosine_similarity": 0.9601,
      "simhash_distance": 4,
      "duplicate_type": "ai_rewrite",
      "is_confirmed_duplicate": true,
      "original_publisher": "source",
      "timestamp_confidence": "high",
      "source_published_date": "2024-03-15T10:00:00Z",
      "duplicate_published_date": "2024-04-20T09:00:00Z",
      "crawl_status": "success"
    }
  ],
  "llm_exposure_risk_score": 35,
  "llm_risk_level": "low",
  "llm_risk_factors": ["2 confirmed duplicates contribute base score"],
  "cannibalization_score": 47,
  "cannibalization_risk_level": "medium",
  "traffic_loss_risk": "medium",
  "ranking_displacement_risk": "medium",
  "threat_status": "threats_detected"
}

Example Summary Item

{
  "_type": "summary",
  "total_urls_scanned": 10,
  "total_confirmed_duplicates": 5,
  "high_risk_urls_count": 1,
  "scan_duration_seconds": 142,
  "actor_version": "0.1.0"
}

Webhook Payload (per confirmed duplicate)

{
  "event": "duplicate_found",
  "input_url": "https://myblog.com/blog/my-best-article",
  "duplicate_url": "https://scraper-farm.com/copied-article",
  "similarity_score": 0.9412,
  "duplicate_type": "ai_rewrite",
  "scan_timestamp": "2024-06-01T14:32:00Z"
}

Similarity Score & Duplicate Types

How the Score Is Calculated

similarity_score = (0.7 × cosine_similarity) + (0.3 × simhash_similarity)
simhash_similarity = 1 - (hamming_distance / 64)

Both components range from 0.0 to 1.0. The final score is rounded to 4 decimal places.
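The composite formula translates directly to code (a sketch; `cosine_similarity` is taken as given here, whereas the actor derives it from the page embeddings):

```python
def composite_score(cosine_similarity: float, hamming_distance: int) -> float:
    """Combine embedding cosine similarity with 64-bit SimHash similarity."""
    simhash_similarity = 1 - (hamming_distance / 64)
    return round(0.7 * cosine_similarity + 0.3 * simhash_similarity, 4)
```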

Duplicate Type Classification

| Type | Condition | Meaning |
|---|---|---|
| `exact_copy` | cosine ≥ 0.97 | Near-verbatim copy |
| `ai_rewrite` | cosine 0.80–0.96 AND simhash_distance ≥ 10 | Paraphrased or AI-rewritten copy |
| `summarization_reuse` | cosine 0.65–0.79 | Significant content reuse, possibly summarized |
| `partial_reuse` | cosine 0.50–0.64 | Partial overlap, may share key sections |
| `below_threshold` | cosine < 0.50 | Not considered a duplicate |

A candidate is flagged as is_confirmed_duplicate: true only when similarity_score ≥ similarity_threshold.
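Under the table's boundaries, classification can be sketched as below. One assumption: the table leaves the case of cosine ≥ 0.80 with a small SimHash distance unspecified, so this sketch lets it fall through to the next band:

```python
def classify_duplicate(cosine: float, simhash_distance: int) -> str:
    """Map cosine similarity and SimHash distance to a duplicate_type."""
    if cosine >= 0.97:
        return "exact_copy"
    if cosine >= 0.80 and simhash_distance >= 10:
        return "ai_rewrite"
    # High cosine but low SimHash distance (unspecified in the table)
    # falls through to the reuse bands in this sketch.
    if cosine >= 0.65:
        return "summarization_reuse"
    if cosine >= 0.50:
        return "partial_reuse"
    return "below_threshold"
```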


Risk Scores Explained

LLM Exposure Risk Score (0–100)

Estimates the probability that your content has been ingested into an LLM training pipeline.

score = min(100, round(
(llm_domain_hits × 20) +
(avg_similarity_on_llm_domains × 30) +
(confirmed_duplicate_count × 5)
))
  • low: score < 40
  • medium: score 40–69
  • high: score ≥ 70
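The scoring rule and level thresholds above amount to a short function (a sketch; the three inputs come from the shadow index scan):

```python
def llm_exposure_risk(llm_domain_hits: int,
                      avg_similarity_on_llm_domains: float,
                      confirmed_duplicate_count: int) -> tuple[int, str]:
    """Return (score 0-100, risk level) per the weighted formula above."""
    score = min(100, round(llm_domain_hits * 20
                           + avg_similarity_on_llm_domains * 30
                           + confirmed_duplicate_count * 5))
    level = "high" if score >= 70 else "medium" if score >= 40 else "low"
    return score, level
```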

Cannibalization Score (0–100)

Estimates SEO impact from duplicate content competing for the same rankings.

score = min(100, round(confirmed_duplicate_count × 10 + avg_similarity_score × 50))
  • low: 0–33 | medium: 34–66 | high: 67–100
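The same pattern applies to the cannibalization score (a sketch of the formula and bands above):

```python
def cannibalization(confirmed_duplicate_count: int,
                    avg_similarity_score: float) -> tuple[int, str]:
    """Return (score 0-100, risk level) for SEO cannibalization."""
    score = min(100, round(confirmed_duplicate_count * 10
                           + avg_similarity_score * 50))
    level = "high" if score >= 67 else "medium" if score >= 34 else "low"
    return score, level
```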

Traffic Loss Risk

Based on the number of confirmed duplicates: 0 → low, 1–3 → medium, ≥4 → high.

Ranking Displacement Risk

Based on the count of exact_copy or ai_rewrite duplicates: 0 → low, 1–2 → medium, ≥3 → high.
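Both remaining risks are simple count-to-level mappings, which a sketch makes explicit:

```python
def traffic_loss_risk(confirmed_duplicates: int) -> str:
    """0 duplicates -> low, 1-3 -> medium, 4 or more -> high."""
    if confirmed_duplicates >= 4:
        return "high"
    return "medium" if confirmed_duplicates >= 1 else "low"

def ranking_displacement_risk(exact_or_rewrite_count: int) -> str:
    """Counts only exact_copy and ai_rewrite duplicates."""
    if exact_or_rewrite_count >= 3:
        return "high"
    return "medium" if exact_or_rewrite_count >= 1 else "low"
```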


Limitations

  • Authentication & paywalls: Pages behind login, paywalls, or heavy client-side rendering may not be fully crawlable.
  • Timestamp inference is probabilistic: Published date metadata can be manipulated or absent. Results marked timestamp_confidence: "unknown" or "low" should be treated as estimates only.
  • Similarity score is an indicator, not legal proof: A high score suggests duplication but does not constitute evidence for legal action without manual verification.
  • LLM exposure is indirect: The LLM Exposure Risk Score is based on whether duplicates appear on domains associated with training data collection — it does not confirm your content is inside any specific model.
  • Search API quota: Bing free tier allows ~1,000 requests/month (~333 URLs at 3 phrases each). For larger volumes, use a paid tier or SerpAPI.
  • Scale: A single actor run supports up to 500 input URLs. For larger crawls, run multiple actors in parallel via an orchestrator.
  • ONNX model not included in repo: The models/all-MiniLM-L6-v2.onnx file must be downloaded separately and placed in the models/ directory before building the Docker image.