I18n Audit

Pricing

from $3.00 / 1,000 results

Detects translation gaps and meaning/structural differences between multilingual pages:

  • Finds missing content and meaning drift in translated web pages
  • Compares multilingual pages to detect translation and structure gaps
  • Identifies incomplete or inconsistent page translations across languages


Rating: 5.0 (1)

Developer: Lisa Akinfiieva (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 4
  • Monthly active users: 1
  • Last modified: 2 days ago

i18n Language Audit Crawler

An Apify Actor that crawls websites to identify available languages and validates whether the content matches the expected language. Useful for auditing multilingual websites and ensuring proper internationalization.

Features

  • 🌐 Language Detection: Automatically detects expected language from URL patterns and HTML attributes
  • 📝 Smart Content Analysis: Extracts and analyzes text from headings, paragraphs, articles, and other content elements
  • Advanced Validation: Uses ELD (Efficient Language Detection) library to verify content matches expected language
  • 📊 Language Mismatch Detection: Identifies discrepancies between URL language indicators and HTML lang attributes
  • 📈 Detailed Scoring: Returns top 3 language scores for each text element to understand confidence levels
  • 🎯 Consistency Metrics: Calculates if page meets 80% language consistency threshold
  • Scalable: Built on Crawlee and Playwright for efficient crawling
  • 🔄 Deduplication: Skips processing duplicate text to reduce redundant analysis
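
The 80% consistency threshold can be pictured as a simple ratio check over the per-element detection results. The sketch below is illustrative only; the field names are assumptions, not the Actor's actual internals:

```javascript
// Illustrative sketch of the 80% consistency check (assumed field names,
// not the Actor's source). `elements` holds one detection result per
// analyzed text element.
function meetsConsistencyThreshold(elements, expectedLang, threshold = 0.8) {
  if (elements.length === 0) return true; // no content to contradict the expectation
  const matching = elements.filter((el) => el.detectedLang === expectedLang).length;
  return matching / elements.length >= threshold;
}
```

With five elements of which four are detected as the expected language, the page sits exactly at the 0.8 ratio and passes.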

Input Parameters

Configure the crawler with the following input parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `startUrl` | string | ✅ Yes | - | The URL where the crawler will start (e.g., `https://example.com`) |
| `maxCrawlPages` | integer | No | 10 | Maximum number of pages to crawl (min: 1, max: 1000). When reached, no new links are enqueued. |
| `maxCrawlDepth` | integer | No | 3 | Maximum depth of crawling from the start URL (min: 1, max: 10). Depth 0 = start URL only, Depth 1 = start URL + links found on it, etc. |
| `proxyConfiguration` | object | No | `{"useApifyProxy": true}` | Proxy settings for the crawler |

Example Input

```json
{
  "startUrl": "https://example.com",
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

Output

The Actor outputs a dataset with the following structure for each audited page:

| Field | Type | Description |
|---|---|---|
| `url` | string | URL of the audited page |
| `title` | string | Title of the page |
| `available_languages` | string | JSON array of all detected languages on the page |
| `expected_language` | string | ISO 639-1 language code (e.g., `en`, `es`, `fr`) |
| `language_source` | string | Source of expected language: `url` (from URL path like `/en/`) or `html` (from `<html lang>` attribute) |
| `url_language` | string \| null | Language code extracted from the URL path (e.g., `en` from `/en/page`) |
| `html_language` | string \| null | Language code from the HTML `lang` attribute (e.g., `en` from `<html lang='en'>`) |
| `html_url_mismatch` | string | `'true'` if the URL language and HTML `lang` attribute indicate different languages |
| `discrepancies` | string | JSON array of language mismatches with element location and top 3 confidence scores |

Example Output

```json
{
  "url": "https://example.com/en/about",
  "title": "About Us",
  "available_languages": "[\"en\"]",
  "expected_language": "en",
  "language_source": "url",
  "url_language": "en",
  "html_language": "en",
  "html_url_mismatch": "false",
  "discrepancies": "[]"
}
```

Example with Discrepancies

```json
{
  "url": "https://example.com/fr/contact",
  "title": "Contact",
  "available_languages": "[\"en\",\"fr\"]",
  "expected_language": "fr",
  "language_source": "url",
  "url_language": "fr",
  "html_language": "en",
  "html_url_mismatch": "true",
  "discrepancies": "[{\"element\":\"p\",\"expectedLang\":\"fr\",\"detectedLang\":\"en\",\"score\":{\"en\":0.89,\"de\":0.72,\"nl\":0.68},\"text\":\"Contact us today\"}]"
}
```
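
Note that `available_languages` and `discrepancies` are serialized as JSON strings, and `html_url_mismatch` is the string `'true'`/`'false'`. A small consumer-side helper (illustrative, not part of the Actor) can turn a dataset item like the one above into native types:

```javascript
// Parse the string-encoded fields of an audit dataset item into
// native JS values. Helper is illustrative; field names come from
// the output table above.
function parseAuditItem(item) {
  return {
    ...item,
    available_languages: JSON.parse(item.available_languages),
    discrepancies: JSON.parse(item.discrepancies),
    html_url_mismatch: item.html_url_mismatch === 'true',
  };
}
```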

Use Cases

  • i18n Audits: Verify multilingual websites have correct language content in each language version
  • Quality Assurance: Detect language inconsistencies before going live with new content
  • SEO Monitoring: Ensure language-specific pages have appropriate content for search engines
  • Content Migration: Validate language content after CMS migrations or website redesigns
  • Compliance: Ensure multilingual websites meet accessibility and localization standards

How It Works

  1. Initialization: Starts from the provided URL with depth level 0
  2. Language Extraction:
    • Extracts expected language from URL patterns (e.g., /en/, /es/, en.example.com)
    • Extracts language from HTML lang attribute (e.g., <html lang="en">)
    • Detects discrepancies between URL and HTML language indicators
  3. Content Analysis: Scans text from various HTML elements (headings, paragraphs, lists, etc.)
  4. Deduplication: Skips processing identical text to reduce redundant analysis
  5. Language Detection: Uses ELD library to detect actual language of text content
  6. Score Optimization: Keeps only top 3 language scores for each element to reduce data size
  7. Smart Comparison: Biases toward the target language when scores are within a 1% margin
  8. Validation: Compares detected language vs expected language and tracks discrepancies
  9. Consistency Check: Calculates if 80%+ of page content matches expected language
  10. Reporting: Outputs comprehensive validation results to the dataset
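
The language-extraction step (step 2) can be sketched roughly as below. This is illustrative code, not the Actor's actual implementation; the `/xx/` path and `xx.` subdomain patterns are assumptions based on the examples above:

```javascript
// Sketch of step 2: derive url_language from a /xx/ path segment or an
// xx. subdomain, compare with the <html lang> value, and prefer the URL
// when both are present (language_source: 'url'). Illustrative only.
function languageFromUrl(url) {
  const { hostname, pathname } = new URL(url);
  const pathMatch = pathname.match(/^\/([a-z]{2})(\/|$)/i);
  if (pathMatch) return pathMatch[1].toLowerCase();
  const subMatch = hostname.match(/^([a-z]{2})\./i);
  return subMatch ? subMatch[1].toLowerCase() : null;
}

function expectedLanguage(url, htmlLang) {
  const urlLang = languageFromUrl(url);
  // Normalize values like 'en-US' down to the ISO 639-1 code.
  const html = htmlLang ? htmlLang.slice(0, 2).toLowerCase() : null;
  return {
    url_language: urlLang,
    html_language: html,
    expected_language: urlLang ?? html,
    language_source: urlLang ? 'url' : 'html',
    html_url_mismatch: String(Boolean(urlLang && html && urlLang !== html)),
  };
}
```

For `https://example.com/fr/contact` with `<html lang="en">`, this yields `expected_language: 'fr'` from the URL and flags `html_url_mismatch: 'true'`, matching the discrepancy example above.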

Supported Languages

The Actor supports 30+ languages including: English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese, Arabic, Hindi, Dutch, Swedish, Polish, Turkish, Czech, Danish, Finnish, Norwegian, Ukrainian, Romanian, Hungarian, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Slovak, Bulgarian, Croatian, Serbian, Catalan, and more.

Crawl Control

maxCrawlPages

Limits the total number of pages processed. Strictly enforced: when this limit is reached, no new links are enqueued.

  • maxCrawlPages: 1 = Crawl only the start URL
  • maxCrawlPages: 10 = Crawl maximum 10 pages total
  • maxCrawlPages: 50 = Crawl maximum 50 pages total

maxCrawlDepth

Limits how deep links are followed from the start URL.

  • Depth 0: Only the start URL
  • Depth 1: Start URL + links found on it (1 level deep)
  • Depth 2: Start URL + links + links from those pages (2 levels deep)
  • Depth 3: Three levels of link following

Example:

  • maxCrawlPages: 1, maxCrawlDepth: 1 = Only crawl the start URL (no link following)
  • maxCrawlPages: 50, maxCrawlDepth: 2 = Crawl up to 50 pages, following links up to 2 levels deep
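
How the two limits combine when deciding whether to enqueue a link can be sketched as a single guard (illustrative, not the Actor's actual code):

```javascript
// A link found at depth `linkDepth` (start URL = depth 0) is only
// enqueued while the page budget has room and the link stays within
// maxCrawlDepth. Illustrative sketch of the crawl-control rules above.
function shouldEnqueue(pagesEnqueued, linkDepth, maxCrawlPages, maxCrawlDepth) {
  return pagesEnqueued < maxCrawlPages && linkDepth <= maxCrawlDepth;
}
```

This also explains the first example above: with `maxCrawlPages: 1`, the start URL exhausts the page budget, so its links are never enqueued regardless of `maxCrawlDepth`.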