Unicode Text Inspector

Scan text for hidden Unicode characters: zero-width spaces, RTL override attacks, homoglyphs, and control characters. Get risk level + full codepoint details per character.

Pricing: Pay per event
Rating: 0.0 (0)
Developer: Stas Persiianenko (maintained by Community)
Actor stats: 0 bookmarked, 2 total users, 0 monthly active users
Last modified: 18 days ago

Detect hidden Unicode characters, homoglyphs, invisible markers, and security threats in any text. Analyze strings for zero-width spaces, RTL override attacks, Cyrillic/Greek look-alikes, and control characters, and get a full Unicode category breakdown, all without any external dependencies.

What does Unicode Text Inspector do?

Unicode Text Inspector scans text strings for characters that are invisible, look deceptively similar to ASCII, or can manipulate text rendering. It covers:

  • Zero-width characters (U+200B, U+200C, U+200D, U+FEFF): invisible characters used for fingerprinting, SEO manipulation, and bypassing keyword filters
  • Bidirectional control characters (U+202A–U+202E, U+2066–U+2069): the building blocks of the Trojan Source attack, where the displayed text differs from the actual logical content
  • Control characters (C0: U+0000–U+001F and U+007F; C1: U+0080–U+009F): null bytes, escape sequences, and C1 controls that can signal data corruption or injection attempts
  • Homoglyphs: Cyrillic а (U+0430) vs Latin a, Greek Η (U+0397) vs ASCII H, fullwidth Latin characters (U+FF21–U+FF5A), and typographic quotes masquerading as ASCII
  • Unicode category breakdown: count of letters, numbers, symbols, marks, separators, format characters, and control characters per text

Each text string produces one output record listing every flagged character with its position, codepoint (e.g. U+200B), Unicode name, category, and a plain-English description of the risk.
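As a rough illustration of what such a scan involves, here is a minimal Python sketch that flags zero-width, bidi, and control characters using only the standard library. It is not the actor's implementation: the function names (`classify`, `inspect_text`) are hypothetical, and the codepoint sets are simply copied from the list above.

```python
import unicodedata
from typing import Optional

# Codepoint sets copied from the list above; the actor's real tables may differ.
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0xFEFF}
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def classify(cp: int) -> Optional[str]:
    """Return an issue type for a codepoint, or None if it is not flagged."""
    if cp in ZERO_WIDTH:
        return "zero-width"
    if cp in BIDI_CONTROLS:
        return "bidi-control"
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control-character"
    return None


def inspect_text(text: str) -> list:
    """List flagged characters with position, codepoint, name, and category."""
    issues = []
    for position, char in enumerate(text):
        issue_type = classify(ord(char))
        if issue_type is None:
            continue
        issues.append({
            "position": position,  # 0-based codepoint index
            "codepoint": f"U+{ord(char):04X}",
            # Control characters have no Unicode name, so fall back to a marker.
            "name": unicodedata.name(char, "<unnamed>"),
            "category": unicodedata.category(char),
            "issueType": issue_type,
        })
    return issues


if __name__ == "__main__":
    for issue in inspect_text("Hello\u200b World\u202e"):
        print(issue)
```

Note that the control range as listed also covers tab and newline, so a real pipeline would probably want an allowlist for ordinary whitespace.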

Who is Unicode Text Inspector for?

πŸ” Security engineers and threat analysts

  • Detect homoglyph phishing domains in email headers (e.g., pΠ°ypal.com with a Cyrillic Π°)
  • Catch Trojan Source bidi attacks in code review pipelines
  • Identify null-byte injection attempts in web form inputs

🗄️ Data quality and ETL engineers

  • Scrub invisible characters from user-generated content before indexing in Elasticsearch or Solr
  • Validate imported datasets for hidden formatting characters that break string matching
  • Clean CRM records that silently contain zero-width spaces from copy-paste operations

🛡️ Content moderation teams

  • Detect attempts to bypass keyword filters using look-alike characters
  • Identify text fingerprinting (watermarking with zero-width patterns)
  • Find suspicious Unicode in usernames, product titles, and forum posts

🔎 SEO and marketing professionals

  • Check scraped competitor content for invisible characters that could cause duplicate-content issues
  • Validate structured data fields before submission to Google Search Console
  • Ensure brand names and product titles are free of invisible markers

Why use Unicode Text Inspector?

  • ✅ No external dependencies: pure Unicode detection using built-in string operations. No rate limits, no API keys, no external services.
  • ✅ Covers all major threat vectors: zero-width, bidi, homoglyphs, and control characters in one pass
  • ✅ Security-grade detection: includes Trojan Source bidi patterns (CVE-2021-42574), not just simple invisible character checks
  • ✅ Rich output: character position, codepoint, Unicode name, category, issue type, and description for every flagged character
  • ✅ Configurable detection: enable or disable each detection type independently; turn off homoglyphs for performance-critical pipelines
  • ✅ Risk level scoring: each text gets a none/low/medium/high/critical risk level for easy filtering
  • ✅ Batch processing: analyze hundreds of strings in a single run; output is one dataset record per input text
  • ✅ Schedule and monitor: run on a schedule to continuously audit new content in your database

What data can you extract?

| Field | Description |
| --- | --- |
| textIndex | Position in the input array (1-based) |
| label | Optional source tag you provide |
| textPreview | First 100 chars with invisible chars stripped |
| totalCharacters | Full Unicode codepoint count |
| issueCount | Total number of flagged characters |
| hasSuspiciousContent | Boolean quick-filter |
| riskLevel | none / low / medium / high / critical |
| issues[].position | 0-based index of the flagged character |
| issues[].codepoint | Unicode codepoint string (e.g. U+200B) |
| issues[].codepointDecimal | Integer codepoint value |
| issues[].character | The actual character (may be invisible) |
| issues[].name | Full Unicode character name |
| issues[].category | Unicode general category abbreviation |
| issues[].categoryName | Human-readable category |
| issues[].issueType | zero-width / bidi-control / control-character / homoglyph / format-character |
| issues[].description | Plain-English risk explanation |
| categoryBreakdown.letters | Total letter count (all scripts) |
| categoryBreakdown.uppercaseLetters | Uppercase letter count |
| categoryBreakdown.lowercaseLetters | Lowercase letter count |
| categoryBreakdown.numbers | Decimal digit count |
| categoryBreakdown.punctuation | Punctuation character count |
| categoryBreakdown.symbols | Symbol character count |
| categoryBreakdown.separators | Space separator count |
| categoryBreakdown.marks | Combining mark count |
| categoryBreakdown.controlChars | Control character count |
| categoryBreakdown.formatChars | Format/invisible character count |
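To make the categoryBreakdown fields concrete, the sketch below computes similar counts with Python's `unicodedata` module. The mapping from field names to Unicode general categories (for example `numbers` to `Nd`, `separators` to `Zs`) is an assumption inferred from the descriptions in the table, and `category_breakdown` is a hypothetical helper, not part of the actor.

```python
import unicodedata
from collections import Counter


def category_breakdown(text: str) -> dict:
    """Count characters per Unicode general category (assumed field mapping)."""
    # Two-letter categories such as "Lu" or "Cf".
    exact = Counter(unicodedata.category(ch) for ch in text)
    # Major classes: first letter only, e.g. "L" for all letters.
    major = Counter(unicodedata.category(ch)[0] for ch in text)
    return {
        "letters": major["L"],
        "uppercaseLetters": exact["Lu"],
        "lowercaseLetters": exact["Ll"],
        "numbers": exact["Nd"],
        "punctuation": major["P"],
        "symbols": major["S"],
        "separators": exact["Zs"],
        "marks": major["M"],
        "controlChars": exact["Cc"],
        "formatChars": exact["Cf"],
    }


if __name__ == "__main__":
    print(category_breakdown("Hello\u200b World 42!"))
```

A zero-width space falls in category `Cf`, which is why it shows up under formatChars rather than separators.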

How much does it cost to analyze Unicode text?

Unicode Text Inspector uses pay-per-event pricing, so you only pay for what you use:

| Event | FREE / BRONZE | SILVER | GOLD | PLATINUM | DIAMOND |
| --- | --- | --- | --- | --- | --- |
| Run started (one-time) | $0.001 | $0.001 | $0.001 | $0.001 | $0.001 |
| Per text analyzed | $0.00069 | $0.000552 | $0.000449 | $0.000345 | $0.000276 |

Real-world cost examples:

  • 100 texts (FREE tier): ~$0.070
  • 1,000 texts (FREE tier): ~$0.691
  • 10,000 texts (FREE tier): ~$6.90
  • 100 texts (DIAMOND tier): ~$0.029

Free plan estimate: Apify's free $5 credit gives you approximately 7,200 texts at FREE tier pricing, more than enough for most one-off audits.

Tip: DIAMOND tier users get a 60% discount on per-text charges compared with the FREE tier. Upgrade your Apify plan to reduce costs at scale.
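The arithmetic behind these estimates is one run-start fee plus the per-text price times the number of texts. A small sketch, using the prices from the table above (the `estimate_cost` helper is hypothetical, and prices may change, so check the Pricing tab before relying on it):

```python
from decimal import Decimal

# Prices copied from the pricing table above.
RUN_START_FEE = Decimal("0.001")
PER_TEXT_PRICE = {
    "FREE": Decimal("0.00069"),
    "BRONZE": Decimal("0.00069"),
    "SILVER": Decimal("0.000552"),
    "GOLD": Decimal("0.000449"),
    "PLATINUM": Decimal("0.000345"),
    "DIAMOND": Decimal("0.000276"),
}


def estimate_cost(text_count: int, tier: str = "FREE") -> Decimal:
    """Estimated USD cost of one run: start fee + per-text price * count."""
    return RUN_START_FEE + PER_TEXT_PRICE[tier] * text_count


if __name__ == "__main__":
    for count in (100, 1_000, 10_000):
        print(count, "texts on FREE:", estimate_cost(count))
    print("100 texts on DIAMOND:", estimate_cost(100, "DIAMOND"))
```

`Decimal` is used instead of floats so that fractions of a cent add up exactly.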

How to inspect text for Unicode issues

  1. Go to Unicode Text Inspector on Apify Store
  2. Click Try for free
  3. Paste your text strings into the Texts to inspect field (one per line, or as a JSON array)
  4. Configure detection options; all four detectors are enabled by default
  5. Click Start and wait for the run to complete (typically 2–10 seconds)
  6. Download results as JSON, CSV, or Excel from the Dataset tab

Input JSON example: basic inspection

{
  "texts": [
    "Hello\u200b World",
    "paypal.com (p\u0430ypal with Cyrillic a)",
    "Normal safe text"
  ],
  "detectHomoglyphs": true,
  "detectInvisible": true
}

Input JSON example: security audit of email subjects

{
  "texts": [
    "Your account has been suspended",
    "Urgent: verify your p\u0430ssword",
    "Click here to reset access"
  ],
  "label": "email_subjects_2024_01",
  "detectHomoglyphs": true,
  "detectBidi": true,
  "detectControl": true,
  "detectInvisible": true,
  "includeCategoryBreakdown": false
}

Input JSON example: data quality audit (performance mode)

{
  "texts": ["item 1", "item 2", "item 3"],
  "detectHomoglyphs": false,
  "detectInvisible": true,
  "detectControl": true,
  "detectBidi": true,
  "includeCategoryBreakdown": false
}

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| texts | array | required | Array of text strings to analyze. Each becomes one output record. |
| detectHomoglyphs | boolean | true | Flag characters that look like ASCII but are different Unicode codepoints (Cyrillic, Greek, fullwidth Latin) |
| detectInvisible | boolean | true | Flag zero-width spaces, zero-width joiners, BOM, and other invisible/format characters |
| detectControl | boolean | true | Flag ASCII and C1 control characters (null bytes, escape, etc.) |
| detectBidi | boolean | true | Flag bidirectional control characters (Trojan Source attack vectors) |
| includeCategoryBreakdown | boolean | true | Include Unicode category counts per text (letters, numbers, symbols, etc.) |
| label | string | null | Optional tag to attach to all output records (e.g., "email_subjects", "user_input") |

Output examples

Text with zero-width space:

{
  "textIndex": 1,
  "label": "demo",
  "textPreview": "Hello World",
  "totalCharacters": 12,
  "issueCount": 1,
  "hasSuspiciousContent": true,
  "riskLevel": "low",
  "issues": [
    {
      "position": 5,
      "codepoint": "U+200B",
      "codepointDecimal": 8203,
      "character": "\u200b",
      "name": "ZERO WIDTH SPACE",
      "category": "Cf",
      "categoryName": "Format",
      "issueType": "zero-width",
      "description": "Invisible zero-width character that can hide text, break search, or be used for text fingerprinting."
    }
  ]
}

Text with bidi override (critical / Trojan Source; record abridged):

{
  "textIndex": 2,
  "riskLevel": "critical",
  "issueCount": 2,
  "issues": [
    {
      "position": 6,
      "codepoint": "U+202E",
      "name": "RIGHT-TO-LEFT OVERRIDE",
      "issueType": "bidi-control",
      "description": "Bidirectional control character that can reorder displayed text (Trojan Source attack vector)."
    }
  ]
}

Clean text:

{
  "textIndex": 3,
  "textPreview": "Normal clean text",
  "issueCount": 0,
  "hasSuspiciousContent": false,
  "riskLevel": "none",
  "issues": []
}

Tips for best results

  • 🚀 Start small: test with 5–10 strings first to verify the detection settings match your needs before running large batches
  • 🏷️ Use the label field: tag batches with a source identifier (e.g., "product_titles_jan") to track which dataset was audited
  • ⚡ Disable includeCategoryBreakdown for large batches where you only care about security issues; it reduces output size
  • 🔇 Disable detectHomoglyphs if your content legitimately contains Cyrillic or Greek text (e.g., multilingual apps)
  • 🎯 Filter by riskLevel in downstream processing: critical and high need human review; low may be benign copy-paste artifacts
  • 📅 Schedule regular audits: run on a daily or weekly schedule against new user-generated content, imported data, or crawled text
  • 🔗 Combine with webhooks: trigger automated alerts when runs find critical or high risk texts
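Filtering by riskLevel downstream can be as simple as ranking the five documented levels and keeping records at or above a threshold. The sketch below assumes the records are plain dicts shaped like the output examples above; `at_or_above` is a hypothetical helper name.

```python
# The five risk levels from the output table, lowest to highest.
RISK_ORDER = ["none", "low", "medium", "high", "critical"]


def at_or_above(items: list, threshold: str = "high") -> list:
    """Keep only records whose riskLevel is at least the given threshold."""
    minimum = RISK_ORDER.index(threshold)
    return [
        item for item in items
        if RISK_ORDER.index(item["riskLevel"]) >= minimum
    ]


if __name__ == "__main__":
    # Sample records for illustration only, not real actor output.
    sample = [
        {"textIndex": 1, "riskLevel": "low"},
        {"textIndex": 2, "riskLevel": "critical"},
        {"textIndex": 3, "riskLevel": "none"},
    ]
    for record in at_or_above(sample, "high"):
        print("Needs review:", record["textIndex"])
```

The same function works on the items returned by the Apify client, since each dataset item carries a riskLevel field.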

Integrations

Unicode Text Inspector → Google Sheets (content moderation audit): Use the Apify → Google Sheets integration to automatically append flagged texts to a review spreadsheet. Filter rows where riskLevel = "critical" for priority review.

Unicode Text Inspector → Slack (security alerts): Connect via Make or Zapier. When a run dataset contains any record with riskLevel = "critical", post an alert to your #security Slack channel with the text preview and codepoints found.

Unicode Text Inspector → Elasticsearch (data quality pipeline): Use the JSON output as a pre-indexing filter. Strip or reject texts where hasSuspiciousContent = true before feeding them to your search index to prevent invisible character search poisoning.
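If you choose to strip rather than reject, the cleanup itself can be done locally before indexing. A minimal sketch, assuming you only want to remove the zero-width and bidi codepoints listed earlier in this README (`strip_invisible` is a hypothetical helper, not something the actor provides):

```python
# Codepoints from the detection list in this README; extend as needed.
ZERO_WIDTH = [0x200B, 0x200C, 0x200D, 0xFEFF]
BIDI_CONTROLS = list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))

# str.translate deletes every codepoint that maps to None.
_STRIP_TABLE = dict.fromkeys(ZERO_WIDTH + BIDI_CONTROLS)


def strip_invisible(text: str) -> str:
    """Remove zero-width and bidi control characters from a string."""
    return text.translate(_STRIP_TABLE)


if __name__ == "__main__":
    cleaned = strip_invisible("Hello\u200b World")
    print(repr(cleaned))
```

Be careful with blanket stripping: U+200D and U+200C are used legitimately in emoji sequences and in scripts such as Persian, so multilingual content may be better served by rejecting or reviewing than by stripping.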

Scheduled runs (continuous monitoring): Schedule this actor to run nightly against your CRM contact names, product catalog titles, or user profile fields. Export results to your data warehouse to track zero-width character prevalence over time.

Unicode Text Inspector → Make/Zapier (form validation): Trigger a run on new form submissions via webhook. If any field returns riskLevel != "none", flag the submission for review or reject it automatically.

Using the Apify API

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/unicode-text-inspector').call({
  texts: [
    'Hello\u200b World',
    'p\u0430ypal.com (phishing domain candidate)',
  ],
  detectHomoglyphs: true,
  detectBidi: true,
});

// Print one summary line per analyzed text.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  console.log(`Text ${item.textIndex}: risk=${item.riskLevel}, issues=${item.issueCount}`);
}

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("automation-lab/unicode-text-inspector").call(run_input={
    "texts": [
        "Hello\u200b World",
        "p\u0430ypal.com (Cyrillic a)",
    ],
    "detectHomoglyphs": True,
    "detectBidi": True,
})

# Print one summary line per analyzed text.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Text {item['textIndex']}: risk={item['riskLevel']}, issues={item['issueCount']}")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~unicode-text-inspector/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Hello\u200b World", "Normal text"],
    "detectHomoglyphs": true,
    "detectBidi": true
  }'

Use with AI agents via MCP

Unicode Text Inspector is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client. This gives you access to all Apify actors, including this one:

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/unicode-text-inspector"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "type": "http",
      "url": "https://mcp.apify.com?tools=automation-lab/unicode-text-inspector",
      "headers": { "Authorization": "Bearer YOUR_APIFY_TOKEN" }
    }
  }
}

Example prompts for AI agents

  • "Check this text for hidden Unicode characters: 'Hello​ World'"
  • "Scan these domain names for homoglyph spoofing: paypal.com, pΠ°ypal.com, amazon.com"
  • "Analyze this CSV column of user-submitted names and flag any with bidirectional Unicode or zero-width spaces"

Unicode Text Inspector is a text analysis utility β€” it performs no web scraping, makes no external requests, and does not interact with any website. You provide text strings; the actor processes them locally on Apify infrastructure.

There are no legal concerns with Unicode character detection on text you own or have rights to process. Always ensure your data handling complies with applicable privacy regulations (GDPR, CCPA) when processing user-generated content.

FAQ

How fast is Unicode Text Inspector? Very fast. Pure in-memory string processing with no I/O or network calls. A batch of 1,000 strings typically completes in under 5 seconds. The per-run timeout of 300 seconds can handle hundreds of thousands of texts.

How much does it cost to analyze 10,000 texts? At FREE/BRONZE tier pricing: $0.001 (start) + 10,000 × $0.00069 = approximately $6.90. At DIAMOND tier: approximately $2.76.

Does it detect all Unicode homoglyphs? The current detector covers the most common script-based confusables: Cyrillic, Greek, and fullwidth Latin characters. This covers the vast majority of real-world phishing and spoofing cases. Rare confusables from other scripts (Armenian, Georgian, etc.) are not yet included. The detector is pattern-based, not a comprehensive Unicode confusables database; it prioritizes precision over recall.
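To show what "pattern-based" can mean in practice, here is a minimal sketch of a table-driven homoglyph check. The table is a tiny illustrative sample chosen for this example, not the actor's detection table, and `find_homoglyphs` is a hypothetical name. Fullwidth characters are handled by NFKC normalization, which folds them to their ASCII counterparts.

```python
import unicodedata

# Tiny illustrative table; a real detector would cover many more characters.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def find_homoglyphs(text: str) -> list:
    """Return (position, codepoint, ASCII look-alike) for each suspicious char."""
    hits = []
    for position, ch in enumerate(text):
        ascii_twin = CONFUSABLES.get(ch)
        if ascii_twin is None and 0xFF21 <= ord(ch) <= 0xFF5A:
            # Fullwidth forms in this range fold to ASCII under NFKC.
            ascii_twin = unicodedata.normalize("NFKC", ch)
        if ascii_twin is not None:
            hits.append((position, f"U+{ord(ch):04X}", ascii_twin))
    return hits


if __name__ == "__main__":
    print(find_homoglyphs("p\u0430ypal.com"))
```

A fixed table like this never flags a character it does not know about, which is the precision-over-recall trade-off described above: few false positives, but unlisted scripts pass through silently.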

Why are some results marked risk=low even though there's a zero-width space? Zero-width spaces are sometimes inserted legitimately by word processors, CMS platforms, and copy-paste operations (especially from web pages). low risk means an issue was found but it may not be malicious. Only bidirectional override characters are marked critical because they have no legitimate use in most text contexts.

Why is a text showing 0 issues when I can see something strange in it? Most likely the character is not in the current detection tables. Try enabling all detection options. If the character is in a script that isn't covered (e.g., Armenian lookalikes), it won't be detected. You can check the character manually by pasting it into a Unicode inspector like unicode.org.

Can I analyze very long texts? Yes. The actor processes texts of any length. Very long texts (millions of characters) may take a few seconds each. The 300-second timeout is sufficient for typical use cases. If you need to analyze extremely long documents, split them into chunks before passing to the actor.
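One simple way to split a long document is into fixed-size chunks, keeping each chunk's starting offset so that a reported issues[].position can be mapped back to the original text. This is a sketch under assumptions: the chunk size is arbitrary, and `chunk_text` is a hypothetical helper.

```python
from typing import Iterator, Tuple


def chunk_text(text: str, chunk_size: int = 100_000) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs covering the text in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(text), chunk_size):
        yield offset, text[offset:offset + chunk_size]


if __name__ == "__main__":
    document = "A" * 250_000 + "\u200b" + "B" * 50_000
    chunks = list(chunk_text(document))
    # Pass the chunk strings as the "texts" input; keep offsets on your side.
    texts = [chunk for _, chunk in chunks]
    offsets = [offset for offset, _ in chunks]
    print(len(texts), offsets)
```

The original position of an issue is then the chunk's offset plus the reported position. Since each chunk becomes its own output record, a pattern that straddles a chunk boundary (for example a bidi override and its terminator) would be split across two records.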

Other text and data quality tools

Looking for related utilities? Check these automation-lab actors: