I18n Audit
Pricing: from $3.00 / 1,000 results
Detects translation gaps and meaning/structural differences between multilingual pages:
- Finds missing content and meaning drift in translated web pages
- Compares multilingual pages to detect translation and structure gaps
- Identifies incomplete or inconsistent page translations across languages
Rating: 5.0 (1)
Developer: Lisa Akinfiieva
Actor stats: 0 bookmarked, 4 total users, 1 monthly active user
Last modified: 2 days ago
i18n Language Audit Crawler
An Apify Actor that crawls websites to identify available languages and validates whether page content matches the expected language. Ideal for auditing multilingual websites and ensuring proper internationalization.
Features
- 🌐 Language Detection: Automatically detects expected language from URL patterns and HTML attributes
- 📝 Smart Content Analysis: Extracts and analyzes text from headings, paragraphs, articles, and other content elements
- ✅ Advanced Validation: Uses ELD (Efficient Language Detection) library to verify content matches expected language
- 📊 Language Mismatch Detection: Identifies discrepancies between URL language indicators and HTML lang attributes
- 📈 Detailed Scoring: Returns top 3 language scores for each text element to understand confidence levels
- 🎯 Consistency Metrics: Calculates if page meets 80% language consistency threshold
- ⚡ Scalable: Built on Crawlee and Playwright for efficient crawling
- 🔄 Deduplication: Skips processing duplicate text to reduce redundant analysis
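The deduplication feature above can be sketched with a simple hash set. This is an illustrative sketch, not the Actor's actual internals; `seenTexts` and `shouldAnalyze` are assumed names.

```javascript
// Sketch of text deduplication: analyze each distinct text only once.
// Names are illustrative, not the Actor's real API.
const seenTexts = new Set();

function shouldAnalyze(text) {
  const normalized = text.trim().toLowerCase();
  if (normalized.length === 0 || seenTexts.has(normalized)) return false;
  seenTexts.add(normalized);
  return true;
}

console.log(shouldAnalyze('Contact us today')); // true (first occurrence)
console.log(shouldAnalyze('Contact us today')); // false (duplicate)
console.log(shouldAnalyze('CONTACT US TODAY')); // false (same text, different case)
```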
Input Parameters
Configure the crawler with the following input parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrl | string | ✅ Yes | - | The URL where the crawler will start (e.g., https://example.com) |
| maxCrawlPages | integer | No | 10 | Maximum number of pages to crawl (min: 1, max: 1000). When reached, no new links are enqueued. |
| maxCrawlDepth | integer | No | 3 | Maximum depth of crawling from the start URL (min: 1, max: 10). Depth 0 = start URL only, depth 1 = start URL + links found on it, etc. |
| proxyConfiguration | object | No | {"useApifyProxy": true} | Proxy settings for the crawler |
Example Input
```json
{
  "startUrl": "https://example.com",
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "proxyConfiguration": { "useApifyProxy": true }
}
```
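The constraints documented in the parameter table can be checked with a small helper. This is an illustrative sketch, not part of the Actor; `validateInput` is an assumed name, and the defaults and min/max values follow the table above.

```javascript
// Sketch: validate an input object against the documented parameters.
// Constraint values come from the input table; the helper is illustrative.
function validateInput(input) {
  if (typeof input.startUrl !== 'string' || !/^https?:\/\//.test(input.startUrl)) {
    throw new Error('startUrl is required and must be an http(s) URL');
  }
  const clamp = (value, min, max, fallback) =>
    value === undefined ? fallback : Math.min(max, Math.max(min, value));
  return {
    startUrl: input.startUrl,
    maxCrawlPages: clamp(input.maxCrawlPages, 1, 1000, 10),
    maxCrawlDepth: clamp(input.maxCrawlDepth, 1, 10, 3),
    proxyConfiguration: input.proxyConfiguration ?? { useApifyProxy: true },
  };
}

// A value above the documented maximum of 1000 is clamped down.
console.log(validateInput({ startUrl: 'https://example.com', maxCrawlPages: 5000 }));
```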
Output
The Actor outputs a dataset with the following structure for each audited page:
| Field | Type | Description |
|---|---|---|
| url | string | URL of the audited page |
| title | string | Title of the page |
| available_languages | string | JSON array of all detected languages on the page |
| expected_language | string | ISO 639-1 language code (e.g., 'en', 'es', 'fr') |
| language_source | string | Source of the expected language: 'url' (from a URL path like /en/) or 'html' (from the `<html lang>` attribute) |
| url_language | string \| null | Language code extracted from the URL path (e.g., 'en' from /en/page) |
| html_language | string \| null | Language code from the HTML lang attribute (e.g., 'en' from `<html lang='en'>`) |
| html_url_mismatch | string | 'true' if the URL language and HTML lang attribute indicate different languages |
| discrepancies | string | JSON array of language mismatches with element location and top 3 confidence scores |
Example Output
```json
{
  "url": "https://example.com/en/about",
  "title": "About Us",
  "available_languages": "[\"en\"]",
  "expected_language": "en",
  "language_source": "url",
  "url_language": "en",
  "html_language": "en",
  "html_url_mismatch": "false",
  "discrepancies": "[]"
}
```
Example with Discrepancies
```json
{
  "url": "https://example.com/fr/contact",
  "title": "Contact",
  "available_languages": "[\"en\",\"fr\"]",
  "expected_language": "fr",
  "language_source": "url",
  "url_language": "fr",
  "html_language": "en",
  "html_url_mismatch": "true",
  "discrepancies": "[{\"element\":\"p\",\"expectedLang\":\"fr\",\"detectedLang\":\"en\",\"score\":{\"en\":0.89,\"de\":0.72,\"nl\":0.68},\"text\":\"Contact us today\"}]"
}
```
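Note that `available_languages` and `discrepancies` are JSON-encoded strings, so consumers must parse them before use. A minimal sketch of decoding one record, using the example record above (`hasMismatch` is an illustrative derived flag, not a dataset field):

```javascript
// Sketch: decode the JSON-string fields of one dataset record.
// The record mirrors the "Example with Discrepancies" above.
const record = {
  url: 'https://example.com/fr/contact',
  expected_language: 'fr',
  available_languages: '["en","fr"]',
  html_url_mismatch: 'true',
  discrepancies:
    '[{"element":"p","expectedLang":"fr","detectedLang":"en","score":{"en":0.89,"de":0.72,"nl":0.68},"text":"Contact us today"}]',
};

const languages = JSON.parse(record.available_languages);
const discrepancies = JSON.parse(record.discrepancies);
const hasMismatch = record.html_url_mismatch === 'true' || discrepancies.length > 0;

console.log(languages);                     // [ 'en', 'fr' ]
console.log(discrepancies[0].detectedLang); // 'en'
console.log(hasMismatch);                   // true
```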
Use Cases
- i18n Audits: Verify multilingual websites have correct language content in each language version
- Quality Assurance: Detect language inconsistencies before going live with new content
- SEO Monitoring: Ensure language-specific pages have appropriate content for search engines
- Content Migration: Validate language content after CMS migrations or website redesigns
- Compliance: Ensure multilingual websites meet accessibility and localization standards
How It Works
- Initialization: Starts from the provided URL with depth level 0
- Language Extraction:
  - Extracts the expected language from URL patterns (e.g., /en/, /es/, en.example.com)
  - Extracts the language from the HTML lang attribute (e.g., `<html lang="en">`)
  - Detects discrepancies between URL and HTML language indicators
- Content Analysis: Scans text from various HTML elements (headings, paragraphs, lists, etc.)
- Deduplication: Skips processing identical text to reduce redundant analysis
- Language Detection: Uses ELD library to detect actual language of text content
- Score Optimization: Keeps only top 3 language scores for each element to reduce data size
- Smart Comparison: Biases toward target language when scores are within 1% margin
- Validation: Compares detected language vs expected language and tracks discrepancies
- Consistency Check: Calculates if 80%+ of page content matches expected language
- Reporting: Outputs comprehensive validation results to the dataset
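The language-extraction and smart-comparison steps above can be sketched as follows. The regexes, function names, and score format are illustrative assumptions; only the URL patterns and the 1% margin come from the description above.

```javascript
// Sketch of the Language Extraction and Smart Comparison steps.
// Regexes and names are illustrative; the 1% margin mirrors the description.
function languageFromUrl(url) {
  const { hostname, pathname } = new URL(url);
  const pathMatch = pathname.match(/^\/([a-z]{2})(\/|$)/); // e.g. /en/about
  if (pathMatch) return pathMatch[1];
  const subMatch = hostname.match(/^([a-z]{2})\./);        // e.g. en.example.com
  return subMatch ? subMatch[1] : null;
}

function pickLanguage(scores, expectedLang) {
  // Keep the expected language when the best score beats it
  // by less than the 1% margin.
  const [bestLang, bestScore] = Object.entries(scores)
    .sort((a, b) => b[1] - a[1])[0];
  const expectedScore = scores[expectedLang] ?? 0;
  return bestScore - expectedScore < 0.01 ? expectedLang : bestLang;
}

console.log(languageFromUrl('https://example.com/en/about'));   // 'en'
console.log(languageFromUrl('https://fr.example.com/contact')); // 'fr'
console.log(pickLanguage({ en: 0.89, fr: 0.885 }, 'fr'));       // 'fr' (within margin)
console.log(pickLanguage({ en: 0.89, fr: 0.7 }, 'fr'));         // 'en'
```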
Supported Languages
The Actor supports 30+ languages including: English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese, Arabic, Hindi, Dutch, Swedish, Polish, Turkish, Czech, Danish, Finnish, Norwegian, Ukrainian, Romanian, Hungarian, Greek, Hebrew, Thai, Vietnamese, Indonesian, Malay, Slovak, Bulgarian, Croatian, Serbian, Catalan, and more.
Crawl Control
maxCrawlPages
Limits the total number of pages processed. Strictly enforced: when this limit is reached, no new links are enqueued.
- maxCrawlPages: 1 = crawl only the start URL
- maxCrawlPages: 10 = crawl a maximum of 10 pages total
- maxCrawlPages: 50 = crawl a maximum of 50 pages total
maxCrawlDepth
Limits how deep links are followed from the start URL.
- Depth 0: Only the start URL
- Depth 1: Start URL + links found on it (1 level deep)
- Depth 2: Start URL + links + links from those pages (2 levels deep)
- Depth 3: Three levels of link following
Example:
- maxCrawlPages: 1, maxCrawlDepth: 1 = only crawl the start URL (no link following)
- maxCrawlPages: 50, maxCrawlDepth: 2 = crawl up to 50 pages, following links up to 2 levels deep
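How the two limits interact when deciding whether to follow a link can be sketched as a single check. This is an illustrative sketch, not the Actor's actual enqueue logic; `shouldEnqueue` and its arguments are assumed names.

```javascript
// Sketch of how maxCrawlPages and maxCrawlDepth combine per discovered link.
// Names are illustrative, not the Actor's internals.
function shouldEnqueue(pagesEnqueued, linkDepth, maxCrawlPages, maxCrawlDepth) {
  if (pagesEnqueued >= maxCrawlPages) return false; // hard page cap reached
  if (linkDepth > maxCrawlDepth) return false;      // link is too deep
  return true;
}

// maxCrawlPages: 1 — the start URL exhausts the budget, nothing is followed.
console.log(shouldEnqueue(1, 1, 1, 1)); // false
// maxCrawlPages: 50, maxCrawlDepth: 2 — a depth-2 link fits, a depth-3 link does not.
console.log(shouldEnqueue(10, 2, 50, 2)); // true
console.log(shouldEnqueue(10, 3, 50, 2)); // false
```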

