I18n Audit

Pricing

from $3.00 / 1,000 results

Detects translation gaps and meaning/structural differences between multilingual pages:

  • Finds missing content and meaning drift in translated web pages
  • Compares multilingual pages to detect translation and structure gaps
  • Identifies incomplete or inconsistent page translations across languages


Rating: 5.0 (1)

Developer: Lisa Akinfiieva (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 4
  • Monthly active users: 1
  • Last modified: 2 days ago

i18n Language Audit Crawler

An Apify Actor that crawls websites to identify available languages and validates whether the content matches the expected language. Useful for auditing multilingual websites and ensuring proper internationalization.

Features

  • 🌐 Language Detection: Automatically detects expected language from URL patterns and HTML attributes
  • 📝 Smart Content Analysis: Extracts and analyzes text from headings, paragraphs, articles, and other content elements
  • Advanced Validation: Uses ELD (Efficient Language Detection) library to verify content matches expected language
  • 📊 Language Mismatch Detection: Identifies discrepancies between URL language indicators and HTML lang attributes
  • 📈 Detailed Scoring: Returns top 3 language scores for each text element to understand confidence levels
  • 🎯 Consistency Metrics: Calculates if page meets 80% language consistency threshold
  • Scalable: Built on Crawlee and Playwright for efficient crawling
  • 🔄 Deduplication: Skips processing duplicate text to reduce redundant analysis
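
The 80% consistency threshold can be pictured as a simple ratio check over the per-element detection results. The sketch below is illustrative only; the field names are assumptions, not the Actor's actual internals:

```javascript
// Illustrative sketch of the 80% consistency check (assumed field names,
// not the Actor's source). `elements` holds one detection result per
// analyzed text element.
function meetsConsistencyThreshold(elements, expectedLang, threshold = 0.8) {
  if (elements.length === 0) return true; // no content to contradict the expectation
  const matching = elements.filter((el) => el.detectedLang === expectedLang).length;
  return matching / elements.length >= threshold;
}
```

With five elements of which four are detected as the expected language, the page sits exactly at the 0.8 ratio and passes.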

Input Parameters

Configure the crawler with the following input parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `startUrl` | string | ✅ Yes | - | The URL where the crawler will start (e.g., `https://example.com`) |
| `maxCrawlPages` | integer | No | 10 | Maximum number of pages to crawl (min: 1, max: 1000). When reached, no new links are enqueued. |
| `maxCrawlDepth` | integer | No | 3 | Maximum depth of crawling from the start URL (min: 1, max: 10). Depth 0 = start URL only, Depth 1 = start URL + links found on it, etc. |
| `proxyConfiguration` | object | No | `{"useApifyProxy": true}` | Proxy settings for the crawler |

Example Input

```json
{
  "startUrl": "https://example.com",
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

Output

The Actor outputs a dataset with the following structure for each audited page:

| Field | Type | Description |
|---|---|---|
| `url` | string | URL of the audited page |
| `title` | string | Title of the page |
| `available_languages` | string | JSON array of all detected languages on the page |
| `expected_language` | string | ISO 639-1 language code (e.g., `en`, `es`, `fr`) |
| `language_source` | string | Source of expected language: `url` (from URL path like `/en/`) or `html` (from `<html lang>` attribute) |
| `url_language` | string \| null | Language code extracted from the URL path (e.g., `en` from `/en/page`) |
| `html_language` | string \| null | Language code from the HTML `lang` attribute (e.g., `en` from `<html lang='en'>`) |
| `html_url_mismatch` | string | `'true'` if the URL language and HTML `lang` attribute indicate different languages |
| `discrepancies` | string | JSON array of language mismatches with element location and top 3 confidence scores |

Example Output

```json
{
  "url": "https://example.com/en/about",
  "title": "About Us",
  "available_languages": "[\"en\"]",
  "expected_language": "en",
  "language_source": "url",
  "url_language": "en",
  "html_language": "en",
  "html_url_mismatch": "false",
  "discrepancies": "[]"
}
```

Example with Discrepancies

```json
{
  "url": "https://example.com/fr/contact",
  "title": "Contact",
  "available_languages": "[\"en\",\"fr\"]",
  "expected_language": "fr",
  "language_source": "url",
  "url_language": "fr",
  "html_language": "en",
  "html_url_mismatch": "true",
  "discrepancies": "[{\"element\":\"p\",\"expectedLang\":\"fr\",\"detectedLang\":\"en\",\"score\":{\"en\":0.89,\"de\":0.72,\"nl\":0.68},\"text\":\"Contact us today\"}]"
}
```
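
Note that `available_languages` and `discrepancies` are serialized as JSON strings, and `html_url_mismatch` is the string `'true'`/`'false'`. A small consumer-side helper (illustrative, not part of the Actor) can turn a dataset item like the one above into native types:

```javascript
// Parse the string-encoded fields of an audit dataset item into
// native JS values. Helper is illustrative; field names come from
// the output table above.
function parseAuditItem(item) {
  return {
    ...item,
    available_languages: JSON.parse(item.available_languages),
    discrepancies: JSON.parse(item.discrepancies),
    html_url_mismatch: item.html_url_mismatch === 'true',
  };
}
```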

Use Cases

  • i18n Audits: Verify multilingual websites have correct language content in each language version
  • Quality Assurance: Detect language inconsistencies before going live with new content
  • SEO Monitoring: Ensure language-specific pages have appropriate content for search engines
  • Content Migration: Validate language content after CMS migrations or website redesigns
  • Compliance: Ensure multilingual websites meet accessibility and localization standards

How It Works

  1. Initialization: Starts from the provided URL with depth level 0
  2. Language Extraction:
    • Extracts expected language from URL patterns (e.g., /en/, /es/, en.example.com)
    • Extracts language from HTML lang attribute (e.g., <html lang="en">)
    • Detects discrepancies between URL and HTML language indicators
  3. Content Analysis: Scans text from various HTML elements (headings, paragraphs, lists, etc.)
  4. Deduplication: Skips processing identical text to reduce redundant analysis
  5. Language Detection: Uses ELD library to detect actual language of text content
  6. Score Optimization: Keeps only top 3 language scores for each element to reduce data size
  7. Smart Comparison: Biases toward the target language when scores are within a 1% margin
  8. Validation: Compares detected language vs expected language and tracks discrepancies
  9. Consistency Check: Calculates if 80%+ of page content matches expected language
  10. Reporting: Outputs comprehensive validation results to the dataset
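
The language-extraction step (step 2) can be sketched roughly as below. This is illustrative code, not the Actor's actual implementation; the `/xx/` path and `xx.` subdomain patterns are assumptions based on the examples above:

```javascript
// Sketch of step 2: derive url_language from a /xx/ path segment or an
// xx. subdomain, compare with the <html lang> value, and prefer the URL
// when both are present (language_source: 'url'). Illustrative only.
function languageFromUrl(url) {
  const { hostname, pathname } = new URL(url);
  const pathMatch = pathname.match(/^\/([a-z]{2})(\/|$)/i);
  if (pathMatch) return pathMatch[1].toLowerCase();
  const subMatch = hostname.match(/^([a-z]{2})\./i);
  return subMatch ? subMatch[1].toLowerCase() : null;
}

function expectedLanguage(url, htmlLang) {
  const urlLang = languageFromUrl(url);
  // Normalize values like 'en-US' down to the ISO 639-1 code.
  const html = htmlLang ? htmlLang.slice(0, 2).toLowerCase() : null;
  return {
    url_language: urlLang,
    html_language: html,
    expected_language: urlLang ?? html,
    language_source: urlLang ? 'url' : 'html',
    html_url_mismatch: String(Boolean(urlLang && html && urlLang !== html)),
  };
}
```

For `https://example.com/fr/contact` with `<html lang="en">`, this yields `expected_language: 'fr'` from the URL and flags `html_url_mismatch: 'true'`, matching the discrepancy example above.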

Supported Languages

The Actor supports 30+ languages including: English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese, Arabic, Hindi, Dutch, Swedish, Polish, Turkish, Czech, Danish, Finnish, Norwegian, Ukrainian, Romanian, Hungarian, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Slovak, Bulgarian, Croatian, Serbian, Catalan, and more.

Crawl Control

maxCrawlPages

Limits the total number of pages processed. Strictly enforced: when this limit is reached, no new links are enqueued.

  • maxCrawlPages: 1 = Crawl only the start URL
  • maxCrawlPages: 10 = Crawl maximum 10 pages total
  • maxCrawlPages: 50 = Crawl maximum 50 pages total

maxCrawlDepth

Limits how deep links are followed from the start URL.

  • Depth 0: Only the start URL
  • Depth 1: Start URL + links found on it (1 level deep)
  • Depth 2: Start URL + links + links from those pages (2 levels deep)
  • Depth 3: Three levels of link following

Example:

  • maxCrawlPages: 1, maxCrawlDepth: 1 = Only crawl the start URL (no link following)
  • maxCrawlPages: 50, maxCrawlDepth: 2 = Crawl up to 50 pages, following links up to 2 levels deep
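
How the two limits combine when deciding whether to enqueue a link can be sketched as a single guard (illustrative, not the Actor's actual code):

```javascript
// A link found at depth `linkDepth` (start URL = depth 0) is only
// enqueued while the page budget has room and the link stays within
// maxCrawlDepth. Illustrative sketch of the crawl-control rules above.
function shouldEnqueue(pagesEnqueued, linkDepth, maxCrawlPages, maxCrawlDepth) {
  return pagesEnqueued < maxCrawlPages && linkDepth <= maxCrawlDepth;
}
```

This also explains the first example above: with `maxCrawlPages: 1`, the start URL exhausts the page budget, so its links are never enqueued regardless of `maxCrawlDepth`.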