Unicode Text Inspector
Pricing
Pay per event
Scan text for hidden Unicode characters: zero-width spaces, RTL override attacks, homoglyphs, and control characters. Get risk level + full codepoint details per character.
Developer
Stas Persiianenko
Maintained by Community
Detect hidden Unicode characters, homoglyphs, invisible markers, and security threats in any text. Analyze strings for zero-width spaces, RTL override attacks, Cyrillic/Greek look-alikes, and control characters, and get a full Unicode category breakdown, all without any external dependencies.
What does Unicode Text Inspector do?
Unicode Text Inspector scans text strings for characters that are invisible, look deceptively similar to ASCII, or can manipulate text rendering. It covers:
- Zero-width characters (U+200B, U+200C, U+200D, U+FEFF): invisible characters used for fingerprinting, SEO manipulation, and bypassing keyword filters
- Bidirectional control characters (U+202A–U+202E, U+2066–U+2069): the building blocks of the Trojan Source attack, where displayed text looks different from the actual logical content
- ASCII and C1 control characters (U+0000–U+001F, U+007F, U+0080–U+009F): null bytes, escape sequences, and C1 controls that can signal data corruption or injection attempts
- Homoglyphs: Cyrillic `а` (U+0430) vs Latin `a`, Greek `Η` (U+0397) vs ASCII `H`, fullwidth Latin characters (U+FF21–U+FF5A), and typographic quotes masquerading as ASCII
- Unicode category breakdown: count of letters, numbers, symbols, marks, separators, format characters, and control characters per text
Each text string produces one output record listing every flagged character with its position, codepoint (e.g. U+200B), Unicode name, category, and a plain-English description of the risk.
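The per-character scan described above can be approximated with Python's standard `unicodedata` module. This is only a minimal sketch under stated assumptions: the codepoint sets come from the list above, the `issueType` labels mirror the documented output values, and the function names `classify` and `inspect` are hypothetical. The actor's real implementation is not shown in this document and may differ.

```python
import unicodedata
from typing import Optional

# Codepoint sets copied from the list above (assumption: the actor may cover more).
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0xFEFF}
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def classify(cp: int) -> Optional[str]:
    """Return an issue type for a codepoint, or None if it is not flagged."""
    if cp in ZERO_WIDTH:
        return "zero-width"
    if cp in BIDI_CONTROLS:
        return "bidi-control"
    # C0 controls, DEL, and C1 controls; note this also flags tab and newline.
    if cp <= 0x1F or 0x7F <= cp <= 0x9F:
        return "control-character"
    return None


def inspect(text: str) -> list:
    """List flagged characters with 0-based position, codepoint, and name."""
    issues = []
    for position, char in enumerate(text):
        issue_type = classify(ord(char))
        if issue_type is None:
            continue
        issues.append({
            "position": position,
            "codepoint": f"U+{ord(char):04X}",
            # Control characters have no Unicode name, so fall back to a marker.
            "name": unicodedata.name(char, "<unnamed>"),
            "category": unicodedata.category(char),
            "issueType": issue_type,
        })
    return issues


print(inspect("Hello\u200b World"))
```

Homoglyph detection is left out of this sketch because it needs a lookup table rather than codepoint ranges.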
Who is Unicode Text Inspector for?
π Security engineers and threat analysts
- Detect homoglyph phishing domains in email headers (e.g.,
pΠ°ypal.comwith a CyrillicΠ°) - Catch Trojan Source bidi attacks in code review pipelines
- Identify null-byte injection attempts in web form inputs
Data quality and ETL engineers
- Scrub invisible characters from user-generated content before indexing in Elasticsearch or Solr
- Validate imported datasets for hidden formatting characters that break string matching
- Clean CRM records that silently contain zero-width spaces from copy-paste operations
Content moderation teams
- Detect attempts to bypass keyword filters using look-alike characters
- Identify text fingerprinting (watermarking with zero-width patterns)
- Find suspicious Unicode in usernames, product titles, and forum posts
SEO and marketing professionals
- Check scraped competitor content for invisible characters that could cause duplicate-content issues
- Validate structured data fields before submission to Google Search Console
- Ensure brand names and product titles are free of invisible markers
Why use Unicode Text Inspector?
- No external dependencies: pure Unicode detection using built-in string operations. No rate limits, no API keys, no external services.
- Covers all major threat vectors: zero-width, bidi, homoglyphs, and control characters in one pass
- Security-grade detection: includes Trojan Source bidi patterns (CVE-2021-42574), not just simple invisible character checks
- Rich output: character position, codepoint, Unicode name, category, issue type, and description for every flagged character
- Configurable detection: enable or disable each detection type independently; turn off homoglyphs for performance-critical pipelines
- Risk level scoring: each text gets a `none`/`low`/`medium`/`high`/`critical` risk level for easy filtering
- Batch processing: analyze hundreds of strings in a single run; output is one dataset record per input text
- Schedule and monitor: run on a schedule to continuously audit new content in your database
What data can you extract?
| Field | Description |
|---|---|
| `textIndex` | Position in the input array (1-based) |
| `label` | Optional source tag you provide |
| `textPreview` | First 100 chars with invisible chars stripped |
| `totalCharacters` | Full Unicode codepoint count |
| `issueCount` | Total number of flagged characters |
| `hasSuspiciousContent` | Boolean quick-filter |
| `riskLevel` | none / low / medium / high / critical |
| `issues[].position` | 0-based index of the flagged character |
| `issues[].codepoint` | Unicode codepoint string (e.g. U+200B) |
| `issues[].codepointDecimal` | Integer codepoint value |
| `issues[].character` | The actual character (may be invisible) |
| `issues[].name` | Full Unicode character name |
| `issues[].category` | Unicode general category abbreviation |
| `issues[].categoryName` | Human-readable category |
| `issues[].issueType` | zero-width / bidi-control / control-character / homoglyph / format-character |
| `issues[].description` | Plain-English risk explanation |
| `categoryBreakdown.letters` | Total letter count (all scripts) |
| `categoryBreakdown.uppercaseLetters` | Uppercase letter count |
| `categoryBreakdown.lowercaseLetters` | Lowercase letter count |
| `categoryBreakdown.numbers` | Decimal digit count |
| `categoryBreakdown.punctuation` | Punctuation character count |
| `categoryBreakdown.symbols` | Symbol character count |
| `categoryBreakdown.separators` | Space separator count |
| `categoryBreakdown.marks` | Combining mark count |
| `categoryBreakdown.controlChars` | Control character count |
| `categoryBreakdown.formatChars` | Format/invisible character count |
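To make the `categoryBreakdown` fields concrete, here is a rough sketch of how such counts could be derived from Unicode general categories using `unicodedata.category`. The mapping from field names to categories (for example `numbers` to `Nd`, `separators` to `Zs`) is an assumption inferred from the descriptions in the table, and `category_breakdown` is a hypothetical helper, not the actor's code.

```python
import unicodedata
from collections import Counter


def category_breakdown(text: str) -> dict:
    """Count characters per Unicode general category (assumed mapping)."""
    # Two-letter categories such as "Lu", "Nd", "Cf".
    counts = Counter(unicodedata.category(ch) for ch in text)
    # Major classes keyed by the first letter: "L", "P", "S", "M", ...
    major = Counter()
    for category, n in counts.items():
        major[category[0]] += n
    return {
        "letters": major["L"],
        "uppercaseLetters": counts["Lu"],
        "lowercaseLetters": counts["Ll"],
        "numbers": counts["Nd"],
        "punctuation": major["P"],
        "symbols": major["S"],
        "separators": counts["Zs"],
        "marks": major["M"],
        "controlChars": counts["Cc"],
        "formatChars": counts["Cf"],
    }


print(category_breakdown("Hello\u200b World 42!"))
```

A `Counter` returns 0 for missing keys, so categories that never occur still appear in the result with a zero count.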
How much does it cost to analyze Unicode text?
Unicode Text Inspector uses pay-per-event pricing, so you only pay for what you use:
| Event | FREE / BRONZE | SILVER | GOLD | PLATINUM | DIAMOND |
|---|---|---|---|---|---|
| Run started (one-time) | $0.001 | $0.001 | $0.001 | $0.001 | $0.001 |
| Per text analyzed | $0.00069 | $0.000552 | $0.000449 | $0.000345 | $0.000276 |
Real-world cost examples:
- 100 texts analyzed: ~$0.070 (FREE tier)
- 1,000 texts: ~$0.691
- 10,000 texts: ~$6.90
- 100 texts (DIAMOND tier): ~$0.029
Free plan estimate: Apify's free $5 credit gives you approximately 7,200 texts at FREE tier pricing, which is more than enough for most one-off audits.
Tip: DIAMOND tier users get a 60% discount on per-text charges. Upgrade your Apify plan to reduce costs at scale.
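The cost figures above follow from one formula: the one-time run fee plus the number of texts times the per-text price for your tier. The sketch below reproduces that arithmetic using the prices from the table; the helper names `estimate_cost` and `texts_within_budget` are made up for illustration.

```python
# Prices copied from the pricing table above (USD).
RUN_START_FEE = 0.001
PER_TEXT_PRICE = {
    "FREE": 0.00069,
    "BRONZE": 0.00069,
    "SILVER": 0.000552,
    "GOLD": 0.000449,
    "PLATINUM": 0.000345,
    "DIAMOND": 0.000276,
}


def estimate_cost(n_texts: int, tier: str = "FREE") -> float:
    """Cost of a single run: start fee plus per-text charges."""
    return RUN_START_FEE + n_texts * PER_TEXT_PRICE[tier]


def texts_within_budget(budget: float, tier: str = "FREE") -> int:
    """Largest number of texts one run can analyze within a budget."""
    remaining = budget - RUN_START_FEE
    if remaining <= 0:
        return 0
    return int(remaining / PER_TEXT_PRICE[tier])


for n in (100, 1_000, 10_000):
    print(f"{n} texts at FREE tier: ${estimate_cost(n):.3f}")
print(f"Texts covered by a $5 credit: {texts_within_budget(5.0)}")
```

The budget helper assumes a single run. Splitting the same texts across several runs adds one start fee per run.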
How to inspect text for Unicode issues
- Go to Unicode Text Inspector on Apify Store
- Click Try for free
- Paste your text strings into the Texts to inspect field (one per line, or as JSON array)
- Configure detection options β all four detectors are enabled by default
- Click Start and wait for the run to complete (typically 2–10 seconds)
- Download results as JSON, CSV, or Excel from the Dataset tab
Input JSON example: basic inspection
{"texts": ["Hello\u200b World","paypal.com (p\u0430ypal with Cyrillic a)","Normal safe text"],"detectHomoglyphs": true,"detectInvisible": true}
Input JSON example: security audit of email subjects
{"texts": ["Your account has been suspended","Urgent: verify your p\u0430ssword","Click here to reset access"],"label": "email_subjects_2024_01","detectHomoglyphs": true,"detectBidi": true,"detectControl": true,"detectInvisible": true,"includeCategoryBreakdown": false}
Input JSON example: data quality audit (performance mode)
{"texts": ["item 1", "item 2", "item 3"],"detectHomoglyphs": false,"detectInvisible": true,"detectControl": true,"detectBidi": true,"includeCategoryBreakdown": false}
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `texts` | array | required | Array of text strings to analyze. Each becomes one output record. |
| `detectHomoglyphs` | boolean | true | Flag characters that look like ASCII but are different Unicode codepoints (Cyrillic, Greek, fullwidth Latin) |
| `detectInvisible` | boolean | true | Flag zero-width spaces, zero-width joiners, BOM, and other invisible/format characters |
| `detectControl` | boolean | true | Flag ASCII and C1 control characters (null bytes, escape, etc.) |
| `detectBidi` | boolean | true | Flag bidirectional control characters (Trojan Source attack vectors) |
| `includeCategoryBreakdown` | boolean | true | Include Unicode category counts per text (letters, numbers, symbols, etc.) |
| `label` | string | null | Optional tag to attach to all output records (e.g., "email_subjects", "user_input") |
Output examples
Text with zero-width space:
{"textIndex": 1,"label": "demo","textPreview": "Hello World","totalCharacters": 12,"issueCount": 1,"hasSuspiciousContent": true,"riskLevel": "low","issues": [{"position": 5,"codepoint": "U+200B","codepointDecimal": 8203,"character": "\u200b","name": "ZERO WIDTH SPACE","category": "Cf","categoryName": "Format","issueType": "zero-width","description": "Invisible zero-width character that can hide text, break search, or be used for text fingerprinting."}]}
Text with bidi override (critical / Trojan Source):
{"textIndex": 2,"riskLevel": "critical","issueCount": 2,"issues": [{"position": 6,"codepoint": "U+202E","name": "RIGHT-TO-LEFT OVERRIDE","issueType": "bidi-control","description": "Bidirectional control character that can reorder displayed text (Trojan Source attack vector)."}]}
Clean text:
{"textIndex": 3,"textPreview": "Normal clean text","issueCount": 0,"hasSuspiciousContent": false,"riskLevel": "none","issues": []}
Tips for best results
- Start small: test with 5–10 strings first to verify the detection settings match your needs before running large batches
- Use the `label` field: tag batches with a source identifier (e.g., `"product_titles_jan"`) to track which dataset was audited
- Disable `includeCategoryBreakdown` for large batches where you only care about security issues; it reduces output size
- Disable `detectHomoglyphs` if your content legitimately contains Cyrillic or Greek text (e.g., multilingual apps)
- Filter by `riskLevel` in downstream processing: `critical` and `high` need human review; `low` may be benign copy-paste artifacts
- Schedule regular audits: run on a daily or weekly schedule against new user-generated content, imported data, or crawled text
- Combine with webhook: trigger automated alerts when runs find `critical` or `high` risk texts
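As a rough illustration of the `riskLevel` filtering tip, the snippet below sorts output records into buckets. The bucket names and the choice to group `medium` with `low` are assumptions made for this sketch; adjust them to your own review policy.

```python
# Levels that the tips above say need human review.
REVIEW_LEVELS = {"critical", "high"}


def triage(records: list) -> dict:
    """Split actor output records into review, lower-priority, and clean."""
    buckets = {"review": [], "lower_priority": [], "clean": []}
    for record in records:
        level = record.get("riskLevel", "none")
        if level in REVIEW_LEVELS:
            buckets["review"].append(record)
        elif level == "none":
            buckets["clean"].append(record)
        else:
            # "low" and "medium" (assumption: both can wait for a later pass).
            buckets["lower_priority"].append(record)
    return buckets


sample = [
    {"textIndex": 1, "riskLevel": "low"},
    {"textIndex": 2, "riskLevel": "critical"},
    {"textIndex": 3, "riskLevel": "none"},
]
for name, items in triage(sample).items():
    print(name, [item["textIndex"] for item in items])
```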
Integrations
Unicode Text Inspector → Google Sheets (content moderation audit)
Use the Apify → Google Sheets integration to automatically append flagged texts to a review spreadsheet. Filter rows where riskLevel = "critical" for priority review.
Unicode Text Inspector → Slack (security alerts)
Connect via Make or Zapier: when a run dataset contains any record with riskLevel = "critical", post an alert to your #security Slack channel with the text preview and codepoints found.
Unicode Text Inspector → Elasticsearch (data quality pipeline)
Use the JSON output as a pre-indexing filter. Strip or reject texts where hasSuspiciousContent = true before feeding them to your search index to prevent invisible character search poisoning.
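One possible shape for such a pre-indexing filter is sketched below. It assumes that `issues[].position` counts Unicode codepoints from 0, which matches how Python indexes strings, and that only invisible issue types should be stripped while texts containing homoglyphs are rejected. Both the policy and the helper name `prepare_for_index` are illustrative choices for this sketch.

```python
from typing import Optional

# Issue types that can be deleted without losing visible text (assumption).
# Removing a homoglyph would delete a visible letter, so those are rejected.
STRIPPABLE = {"zero-width", "bidi-control", "control-character", "format-character"}


def prepare_for_index(text: str, record: dict) -> Optional[str]:
    """Return cleaned text, or None if the text should be rejected."""
    if not record.get("hasSuspiciousContent"):
        return text
    issues = record.get("issues", [])
    if any(issue["issueType"] not in STRIPPABLE for issue in issues):
        return None
    flagged = {issue["position"] for issue in issues}
    # Keep every character whose position was not flagged.
    return "".join(ch for i, ch in enumerate(text) if i not in flagged)


record = {
    "hasSuspiciousContent": True,
    "issues": [{"position": 5, "issueType": "zero-width"}],
}
print(repr(prepare_for_index("Hello\u200b World", record)))
```

The function needs the original input text because `textPreview` is truncated to 100 characters.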
Scheduled runs (continuous monitoring)
Schedule this actor to run nightly against your CRM contact names, product catalog titles, or user profile fields. Export results to your data warehouse to track zero-width character prevalence over time.
Unicode Text Inspector → Make/Zapier (form validation)
Trigger a run on new form submissions via webhook. If any field returns riskLevel != "none", flag the submission for review or reject it automatically.
Using the Apify API
Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('automation-lab/unicode-text-inspector').call({
    texts: [
        'Hello\u200b World',
        'p\u0430ypal.com (phishing domain candidate)',
    ],
    detectHomoglyphs: true,
    detectBidi: true,
});

// Read one result record per input text
const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
    console.log(`Text ${item.textIndex}: risk=${item.riskLevel}, issues=${item.issueCount}`);
}
Python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish
run = client.actor("automation-lab/unicode-text-inspector").call(run_input={
    "texts": [
        "Hello\u200b World",
        "p\u0430ypal.com (Cyrillic a)",
    ],
    "detectHomoglyphs": True,
    "detectBidi": True,
})

# Read one result record per input text
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Text {item['textIndex']}: risk={item['riskLevel']}, issues={item['issueCount']}")
cURL
curl -X POST "https://api.apify.com/v2/acts/automation-lab~unicode-text-inspector/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello\u200b World", "Normal text"], "detectHomoglyphs": true, "detectBidi": true}'
Use with AI agents via MCP
Unicode Text Inspector is available as a tool for AI assistants that support the Model Context Protocol (MCP).
Add the Apify MCP server to your AI client. This gives you access to all Apify actors, including this one:
Setup for Claude Code
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/unicode-text-inspector"
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
{"mcpServers": {"apify": {"type": "http","url": "https://mcp.apify.com?tools=automation-lab/unicode-text-inspector","headers": { "Authorization": "Bearer YOUR_APIFY_TOKEN" }}}}
Example prompts for AI agents
- "Check this text for hidden Unicode characters: 'Helloβ World'"
- "Scan these domain names for homoglyph spoofing: paypal.com, pΠ°ypal.com, amazon.com"
- "Analyze this CSV column of user-submitted names and flag any with bidirectional Unicode or zero-width spaces"
Is it legal to analyze text with this tool?
Unicode Text Inspector is a text analysis utility β it performs no web scraping, makes no external requests, and does not interact with any website. You provide text strings; the actor processes them locally on Apify infrastructure.
There are no legal concerns with Unicode character detection on text you own or have rights to process. Always ensure your data handling complies with applicable privacy regulations (GDPR, CCPA) when processing user-generated content.
FAQ
How fast is Unicode Text Inspector?
It uses pure in-memory string processing with no I/O or network calls, so a batch of 1,000 strings typically completes in under 5 seconds. The per-run timeout of 300 seconds can handle hundreds of thousands of texts.
How much does it cost to analyze 10,000 texts?
At FREE/BRONZE tier pricing: $0.001 (start) + 10,000 × $0.00069 = approximately $6.90. At DIAMOND tier: approximately $2.76.
Does it detect all Unicode homoglyphs?
The current detector covers the most common script-based confusables: Cyrillic, Greek, and fullwidth Latin characters. This covers the vast majority of real-world phishing and spoofing cases. Rare confusables from other scripts (Armenian, Georgian, etc.) are not yet included. The detector is pattern-based, not a comprehensive Unicode confusables database, so it prioritizes precision over recall.
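To show what "pattern-based" can mean in practice, here is a small sketch of a table-driven homoglyph check. The table is a tiny illustrative sample chosen for this example, not the actor's actual detection table, and `find_homoglyphs` is a hypothetical name. Fullwidth Latin letters are handled by arithmetic because they sit at a fixed offset (0xFEE0) from their ASCII counterparts.

```python
# Illustrative sample only; the actor's real table is not documented here.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
}


def find_homoglyphs(text: str) -> list:
    """Return (position, codepoint, ASCII look-alike) for each match."""
    hits = []
    for position, char in enumerate(text):
        cp = ord(char)
        if char in HOMOGLYPHS:
            lookalike = HOMOGLYPHS[char]
        elif 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A:
            # Fullwidth A-Z and a-z map to ASCII by subtracting 0xFEE0.
            lookalike = chr(cp - 0xFEE0)
        else:
            continue
        hits.append((position, f"U+{cp:04X}", lookalike))
    return hits


print(find_homoglyphs("p\u0430ypal.com"))
```

A lookup table like this only flags what it lists, which is why confusables from uncovered scripts pass through undetected.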
Why are some results marked risk=low even though there's a zero-width space?
Zero-width spaces are sometimes inserted legitimately by word processors, CMS platforms, and copy-paste operations (especially from web pages). low risk means an issue was found but it may not be malicious. Only bidirectional override characters are marked critical because they have no legitimate use in most text contexts.
Why is a text showing 0 issues when I can see something strange in it?
Most likely the character is not in the current detection tables. Try enabling all detection options. If the character is in a script that isn't covered (e.g., Armenian lookalikes), it won't be detected. You can check the character manually by pasting it into a Unicode inspector like unicode.org.
Can I analyze very long texts?
Yes. The actor processes texts of any length. Very long texts (millions of characters) may take a few seconds each. The 300-second timeout is sufficient for typical use cases. If you need to analyze extremely long documents, split them into chunks before passing them to the actor.
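A simple way to chunk a long document before submitting it is sketched below. The chunk size of 50,000 characters is an arbitrary assumption, and `chunk_text` is a hypothetical helper. Each chunk keeps its starting offset so that a reported `issues[].position` can be mapped back to the original document as `offset + position`.

```python
def chunk_text(text: str, size: int = 50_000) -> list:
    """Split text into (offset, chunk) pairs of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [(start, text[start:start + size]) for start in range(0, len(text), size)]


document = "abcdefghij"
chunks = chunk_text(document, size=4)
print(chunks)

# The chunk strings become the actor's `texts` input; offsets stay local.
texts = [chunk for _, chunk in chunks]
```

Because the checks described in this document work on individual characters, splitting at arbitrary boundaries should not hide a flagged character, though each chunk is billed as a separate text.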
Other text and data quality tools
Looking for related utilities? Check these automation-lab actors:
- Color Contrast Checker: WCAG 2.1 AA/AAA contrast ratio validation for UI design
- Accessibility Checker: WCAG accessibility audit for web pages
- JSON Schema Generator: Generate JSON Schema from example JSON data
- Ads.txt Checker: Validate ads.txt files for publisher compliance
- Base64 Converter: Encode and decode Base64 strings