Pricing

Pay per event

Unicode Text Inspector

Scan text for hidden Unicode characters: zero-width spaces, RTL override attacks, homoglyphs, and control characters. Get risk level + full codepoint details per character.

Developer

Stas Persiianenko

Last modified: 9 days ago

What does Unicode Text Inspector do?

Unicode Text Inspector scans text strings for characters that are invisible, look deceptively similar to ASCII, or can manipulate text rendering. It covers:

Zero-width characters (U+200B, U+200C, U+200D, U+FEFF) — invisible spaces used for fingerprinting, SEO manipulation, and bypassing keyword filters
Bidirectional control characters (U+202A–U+202E, U+2066–U+2069) — the building blocks of the Trojan Source attack, where displayed text looks different from actual logical content
ASCII and C1 control characters (C0: U+0000–U+001F, DEL: U+007F, C1: U+0080–U+009F) — null bytes, escape sequences, and C1 controls that can signal data corruption or injection attempts
Homoglyphs — Cyrillic а (U+0430) vs Latin a, Greek Η (U+0397) vs ASCII H, fullwidth Latin characters (U+FF21–U+FF5A), and typographic quotes masquerading as ASCII
Unicode category breakdown — count of letters, numbers, symbols, marks, separators, format characters, and control characters per text

Each text string produces one output record listing every flagged character with its position, codepoint (e.g. U+200B), Unicode name, category, and a plain-English description of the risk.
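
To make the detection categories above concrete, here is a minimal Python sketch of how zero-width, bidi, and control characters could be flagged with the standard `unicodedata` module. It is not the actor's actual implementation: the names `classify` and `find_issues` are hypothetical, the output shape only loosely mirrors the record fields, and this sketch flags every `Cc` character (including tab and newline), which may differ from what the actor does.

```python
import unicodedata
from typing import Optional

# Codepoint sets copied from the ranges listed above. Function names and the
# returned dict shape are illustrative assumptions, not the actor's internals.
ZERO_WIDTH = {0x200B, 0x200C, 0x200D, 0xFEFF}
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))


def classify(ch: str) -> Optional[str]:
    """Return an issue type for one character, or None if it is not flagged."""
    cp = ord(ch)
    if cp in ZERO_WIDTH:
        return "zero-width"
    if cp in BIDI_CONTROLS:
        return "bidi-control"
    if unicodedata.category(ch) == "Cc":  # C0 controls, DEL, and C1 controls
        return "control-character"
    return None


def find_issues(text: str) -> list:
    """List flagged characters with 0-based position, codepoint, and name."""
    issues = []
    for position, ch in enumerate(text):
        issue_type = classify(ch)
        if issue_type is None:
            continue
        issues.append({
            "position": position,
            "codepoint": f"U+{ord(ch):04X}",
            # Control characters have no Unicode name, so fall back to a marker.
            "name": unicodedata.name(ch, "<unnamed>"),
            "category": unicodedata.category(ch),
            "issueType": issue_type,
        })
    return issues


for issue in find_issues("Hello\u200b World"):
    print(issue)
```

Python indexes strings by codepoint, so `position` here counts codepoints; whether the actor counts codepoints or UTF-16 code units is not stated in this description.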

Who is Unicode Text Inspector for?

🔐 Security engineers and threat analysts

Detect homoglyph phishing domains in email headers (e.g., pаypal.com with a Cyrillic а)
Catch Trojan Source bidi attacks in code review pipelines
Identify null-byte injection attempts in web form inputs

🗄️ Data quality and ETL engineers

Scrub invisible characters from user-generated content before indexing in Elasticsearch or Solr
Validate imported datasets for hidden formatting characters that break string matching
Clean CRM records that silently contain zero-width spaces from copy-paste operations

🛡️ Content moderation teams

Detect attempts to bypass keyword filters using look-alike characters
Identify text fingerprinting (watermarking with zero-width patterns)
Find suspicious Unicode in usernames, product titles, and forum posts

🔎 SEO and marketing professionals

Check scraped competitor content for invisible characters that could cause duplicate-content issues
Validate structured data fields before submission to Google Search Console
Ensure brand names and product titles are free of invisible markers

Why use Unicode Text Inspector?

✅ No external dependencies — pure Unicode detection using built-in string operations. No rate limits, no API keys, no external services.
✅ Covers all major threat vectors — zero-width, bidi, homoglyphs, and control characters in one pass
✅ Security-grade detection — includes Trojan Source bidi patterns (CVE-2021-42574), not just simple invisible character checks
✅ Rich output — character position, codepoint, Unicode name, category, issue type, and description for every flagged character
✅ Configurable detection — enable/disable each detection type independently; turn off homoglyphs for performance-critical pipelines
✅ Risk level scoring — each text gets a none/low/medium/high/critical risk level for easy filtering
✅ Batch processing — analyze hundreds of strings in a single run; output is one dataset record per input text
✅ Schedule and monitor — run on a schedule to continuously audit new content in your database

What data can you extract?

Field	Description
`textIndex`	Position in the input array (1-based)
`label`	Optional source tag you provide
`textPreview`	First 100 chars with invisible chars stripped
`totalCharacters`	Full Unicode codepoint count
`issueCount`	Total number of flagged characters
`hasSuspiciousContent`	Boolean quick-filter
`riskLevel`	`none` / `low` / `medium` / `high` / `critical`
`issues[].position`	0-based index of the flagged character
`issues[].codepoint`	Unicode codepoint string (e.g. `U+200B`)
`issues[].codepointDecimal`	Integer codepoint value
`issues[].character`	The actual character (may be invisible)
`issues[].name`	Full Unicode character name
`issues[].category`	Unicode general category abbreviation
`issues[].categoryName`	Human-readable category
`issues[].issueType`	`zero-width` / `bidi-control` / `control-character` / `homoglyph` / `format-character`
`issues[].description`	Plain-English risk explanation
`categoryBreakdown.letters`	Total letter count (all scripts)
`categoryBreakdown.uppercaseLetters`	Uppercase letter count
`categoryBreakdown.lowercaseLetters`	Lowercase letter count
`categoryBreakdown.numbers`	Decimal digit count
`categoryBreakdown.punctuation`	Punctuation character count
`categoryBreakdown.symbols`	Symbol character count
`categoryBreakdown.separators`	Space separator count
`categoryBreakdown.marks`	Combining mark count
`categoryBreakdown.controlChars`	Control character count
`categoryBreakdown.formatChars`	Format/invisible character count

How much does it cost to analyze Unicode text?

Unicode Text Inspector uses pay-per-event pricing — you only pay for what you use:

Event	FREE / BRONZE	SILVER	GOLD	PLATINUM	DIAMOND
Run started (one-time)	$0.001	$0.001	$0.001	$0.001	$0.001
Per text analyzed	$0.00069	$0.000552	$0.000449	$0.000345	$0.000276

Real-world cost examples:

100 texts analyzed: ~$0.070 (FREE tier)
1,000 texts: ~$0.691
10,000 texts: ~$6.90
100 texts (DIAMOND tier): ~$0.029

Free plan estimate: Apify's free $5 credit gives you approximately 7,200 texts at FREE tier pricing — more than enough for most one-off audits.

Tip: DIAMOND tier users get a 60% discount on per-text charges compared with the FREE/BRONZE rate. Upgrade your Apify plan to reduce costs at scale.
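
The arithmetic behind these estimates can be reproduced with a short Python sketch. The prices are copied from the table above; the function name `estimate_cost` is hypothetical, and the sketch assumes one start fee per run and no other platform charges.

```python
from decimal import Decimal

# Prices in USD, copied from the pricing table above.
RUN_START_FEE = Decimal("0.001")
PER_TEXT_PRICE = {
    "FREE": Decimal("0.00069"),
    "BRONZE": Decimal("0.00069"),
    "SILVER": Decimal("0.000552"),
    "GOLD": Decimal("0.000449"),
    "PLATINUM": Decimal("0.000345"),
    "DIAMOND": Decimal("0.000276"),
}


def estimate_cost(text_count: int, tier: str = "FREE") -> Decimal:
    """Estimate the cost of a single run: one start fee plus a per-text charge."""
    return RUN_START_FEE + PER_TEXT_PRICE[tier] * text_count


for count, tier in [(100, "FREE"), (1_000, "FREE"), (10_000, "FREE"), (100, "DIAMOND")]:
    print(f"{count:>6} texts at {tier}: ${estimate_cost(count, tier):.4f}")
```

`Decimal` is used instead of floats so that the sub-cent prices add up exactly.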

How to inspect text for Unicode issues

Go to Unicode Text Inspector on Apify Store
Click Try for free
Paste your text strings into the Texts to inspect field (one per line, or as JSON array)
Configure detection options — all four detectors are enabled by default
Click Start and wait for the run to complete (typically 2–10 seconds)
Download results as JSON, CSV, or Excel from the Dataset tab

Input JSON example — basic inspection

{
    "texts": [
        "Hello\u200b World",
        "paypal.com (p\u0430ypal with Cyrillic a)",
        "Normal safe text"
    ],
    "detectHomoglyphs": true,
    "detectInvisible": true
}

Input JSON example — security audit of email subjects

{
    "texts": [
        "Your account has been suspended",
        "Urgent: verify your p\u0430ssword",
        "Click here to reset access"
    ],
    "label": "email_subjects_2024_01",
    "detectHomoglyphs": true,
    "detectBidi": true,
    "detectControl": true,
    "detectInvisible": true,
    "includeCategoryBreakdown": false
}

Input JSON example — data quality audit (performance mode)

{
    "texts": ["item 1", "item 2", "item 3"],
    "detectHomoglyphs": false,
    "detectInvisible": true,
    "detectControl": true,
    "detectBidi": true,
    "includeCategoryBreakdown": false
}

Input parameters

Parameter	Type	Default	Description
`texts`	array	required	Array of text strings to analyze. Each becomes one output record.
`detectHomoglyphs`	boolean	`true`	Flag characters that look like ASCII but are different Unicode codepoints (Cyrillic, Greek, fullwidth Latin)
`detectInvisible`	boolean	`true`	Flag zero-width spaces, zero-width joiners, BOM, and other invisible/format characters
`detectControl`	boolean	`true`	Flag ASCII and C1 control characters (null bytes, escape, etc.)
`detectBidi`	boolean	`true`	Flag bidirectional control characters (Trojan Source attack vectors)
`includeCategoryBreakdown`	boolean	`true`	Include Unicode category counts per text (letters, numbers, symbols, etc.)
`label`	string	`null`	Optional tag to attach to all output records (e.g., `"email_subjects"`, `"user_input"`)
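
If you build the input programmatically, a small helper can merge your overrides into the documented defaults. The sketch below is illustrative only: `build_input` is a hypothetical name, and the validation rules (non-empty list of strings, rejecting unknown option names) are assumptions rather than the actor's actual input schema checks.

```python
import json

# Defaults copied from the parameter table above.
DEFAULT_OPTIONS = {
    "detectHomoglyphs": True,
    "detectInvisible": True,
    "detectControl": True,
    "detectBidi": True,
    "includeCategoryBreakdown": True,
    "label": None,
}


def build_input(texts, **overrides):
    """Merge caller overrides into the documented defaults."""
    if not texts or not all(isinstance(t, str) for t in texts):
        raise ValueError("texts must be a non-empty list of strings")
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        # Catch typos such as detectHomoglyph before starting a paid run.
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return {"texts": list(texts), **DEFAULT_OPTIONS, **overrides}


payload = build_input(["item 1", "item 2"], detectHomoglyphs=False, label="crm_names")
print(json.dumps(payload, indent=4))
```

The resulting dictionary can be passed as `run_input` in the Python client example further down.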

Output examples

Text with zero-width space:

{
    "textIndex": 1,
    "label": "demo",
    "textPreview": "Hello World",
    "totalCharacters": 12,
    "issueCount": 1,
    "hasSuspiciousContent": true,
    "riskLevel": "low",
    "issues": [
        {
            "position": 5,
            "codepoint": "U+200B",
            "codepointDecimal": 8203,
            "character": "",
            "name": "ZERO WIDTH SPACE",
            "category": "Cf",
            "categoryName": "Format",
            "issueType": "zero-width",
            "description": "Invisible zero-width character that can hide text, break search, or be used for text fingerprinting."
        }
    ]
}

Text with bidi override (critical / Trojan Source):

{
    "textIndex": 2,
    "riskLevel": "critical",
    "issueCount": 1,
    "issues": [
        {
            "position": 6,
            "codepoint": "U+202E",
            "name": "RIGHT-TO-LEFT OVERRIDE",
            "issueType": "bidi-control",
            "description": "Bidirectional control character that can reorder displayed text (Trojan Source attack vector)."
        }
    ]
}

Clean text:

{
    "textIndex": 3,
    "textPreview": "Normal clean text",
    "issueCount": 0,
    "hasSuspiciousContent": false,
    "riskLevel": "none",
    "issues": []
}

Tips for best results

🚀 Start small — test with 5–10 strings first to verify the detection settings match your needs before running large batches
🏷️ Use the label field — tag batches with a source identifier (e.g., "product_titles_jan") to track which dataset was audited
⚡ Disable includeCategoryBreakdown for large batches where you only care about security issues — it reduces output size
🔇 Disable detectHomoglyphs if your content legitimately contains Cyrillic or Greek text (e.g., multilingual apps)
🎯 Filter by riskLevel in downstream processing: critical and high need human review; low may be benign copy-paste artifacts
📅 Schedule regular audits — run on a daily or weekly schedule against new user-generated content, imported data, or crawled text
🔗 Combine with webhooks — trigger automated alerts when a run finds critical- or high-risk texts
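
The riskLevel filtering tip above could look like this in downstream Python code. This is a sketch under assumptions: the bucket names (`clean`, `monitor`, `review`) and the `triage` function are hypothetical, and the default threshold of `high` simply follows the tip that critical and high need human review.

```python
# Risk levels in ascending order of severity, as listed in the output fields.
RISK_ORDER = ["none", "low", "medium", "high", "critical"]


def triage(records, review_from="high"):
    """Split output records into clean / monitor / review buckets by riskLevel."""
    threshold = RISK_ORDER.index(review_from)
    buckets = {"clean": [], "monitor": [], "review": []}
    for record in records:
        level = RISK_ORDER.index(record["riskLevel"])
        if level >= threshold:
            buckets["review"].append(record)   # needs a human look
        elif level == 0:
            buckets["clean"].append(record)    # nothing flagged
        else:
            buckets["monitor"].append(record)  # possibly benign artifacts
    return buckets


sample = [
    {"textIndex": 1, "riskLevel": "low", "issueCount": 1},
    {"textIndex": 2, "riskLevel": "critical", "issueCount": 1},
    {"textIndex": 3, "riskLevel": "none", "issueCount": 0},
]
for name, group in triage(sample).items():
    print(name, [r["textIndex"] for r in group])
```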

Integrations

Unicode Text Inspector → Google Sheets (content moderation audit)
Use the Apify → Google Sheets integration to automatically append flagged texts to a review spreadsheet. Filter rows where riskLevel = "critical" for priority review.

Unicode Text Inspector → Slack (security alerts)
Connect via Make or Zapier: when a run dataset contains any record with riskLevel = "critical", post an alert to your #security Slack channel with the text preview and codepoints found.

Unicode Text Inspector → Elasticsearch (data quality pipeline)
Use the JSON output as a pre-indexing filter. Strip or reject texts where hasSuspiciousContent = true before feeding them to your search index to prevent invisible character search poisoning.

Scheduled runs (continuous monitoring)
Schedule this actor to run nightly against your CRM contact names, product catalog titles, or user profile fields. Export results to your data warehouse to track zero-width character prevalence over time.

Unicode Text Inspector → Make/Zapier (form validation)
Trigger a run on new form submissions via webhook. If any field returns riskLevel != "none", flag the submission for review or reject it automatically.
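
As one way to implement the "strip or reject" step of the Elasticsearch pre-indexing filter, the Python sketch below removes invisible characters by position and rejects texts that contain homoglyphs. It rests on several assumptions: `clean_for_index` is a hypothetical helper, you keep the original text yourself (the output record described here only carries a preview, so match records back via `textIndex`), and `issues[].position` is taken to be a codepoint index.

```python
# Issue types that are safe to delete outright; homoglyphs are visible letters,
# so removing them would corrupt the text and they are rejected instead.
STRIPPABLE = {"zero-width", "bidi-control", "control-character", "format-character"}


def clean_for_index(text, record):
    """Return text that is safe to index, or None if it should be rejected."""
    if not record["hasSuspiciousContent"]:
        return text
    if any(issue["issueType"] not in STRIPPABLE for issue in record["issues"]):
        return None  # route to manual review rather than guessing a replacement
    drop = {issue["position"] for issue in record["issues"]}
    return "".join(ch for pos, ch in enumerate(text) if pos not in drop)


original = "Hello\u200b World"
record = {
    "hasSuspiciousContent": True,
    "issues": [{"position": 5, "issueType": "zero-width"}],
}
print(repr(clean_for_index(original, record)))
```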

API usage with the Apify API

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/unicode-text-inspector').call({
    texts: [
        'Hello\u200b World',
        'p\u0430ypal.com (phishing domain candidate)',
    ],
    detectHomoglyphs: true,
    detectBidi: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
    console.log(`Text ${item.textIndex}: risk=${item.riskLevel}, issues=${item.issueCount}`);
}

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("automation-lab/unicode-text-inspector").call(run_input={
    "texts": [
        "Hello\u200b World",
        "p\u0430ypal.com (Cyrillic a)",
    ],
    "detectHomoglyphs": True,
    "detectBidi": True,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Text {item['textIndex']}: risk={item['riskLevel']}, issues={item['issueCount']}")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~unicode-text-inspector/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Hello\u200b World", "Normal text"],
    "detectHomoglyphs": true,
    "detectBidi": true
  }'

Use with AI agents via MCP

Unicode Text Inspector is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/unicode-text-inspector"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "type": "http",
            "url": "https://mcp.apify.com?tools=automation-lab/unicode-text-inspector",
            "headers": { "Authorization": "Bearer YOUR_APIFY_TOKEN" }
        }
    }
}

Example prompts for AI agents

"Check this text for hidden Unicode characters: 'Hello World'"
"Scan these domain names for homoglyph spoofing: paypal.com, pаypal.com, amazon.com"
"Analyze this CSV column of user-submitted names and flag any with bidirectional Unicode or zero-width spaces"

Legality: is it legal to analyze text with this tool?

Unicode Text Inspector is a text analysis utility — it performs no web scraping, makes no external requests, and does not interact with any website. You provide text strings; the actor processes them locally on Apify infrastructure.

Unicode character detection on text you own or have the right to process does not generally raise legal concerns. Always ensure your data handling complies with applicable privacy regulations (GDPR, CCPA) when processing user-generated content.

FAQ

How fast is Unicode Text Inspector? Very fast. Pure in-memory string processing with no I/O or network calls. A batch of 1,000 strings typically completes in under 5 seconds. The per-run timeout of 300 seconds can handle hundreds of thousands of texts.

How much does it cost to analyze 10,000 texts? At FREE/BRONZE tier pricing: $0.001 (start) + 10,000 × $0.00069 = approximately $6.90. At DIAMOND tier: approximately $2.76.

Does it detect all Unicode homoglyphs? The current detector covers the most common script-based confusables: Cyrillic, Greek, and fullwidth Latin characters. This covers the vast majority of real-world phishing and spoofing cases. Rare confusables from other scripts (Armenian, Georgian, etc.) are not yet included. The detector is pattern-based, not a comprehensive Unicode confusables database — it prioritizes precision over recall.
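
To illustrate what "pattern-based" means here, the following Python sketch combines a small hand-written lookalike table with NFKC folding for fullwidth characters. The table is a tiny illustrative sample chosen for this example, not the actor's real detection table, and `ascii_lookalike` and `find_homoglyphs` are hypothetical names.

```python
import unicodedata

# A few well-known Cyrillic and Greek lookalikes; illustrative, not exhaustive.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u039f": "O",  # GREEK CAPITAL LETTER OMICRON
}


def ascii_lookalike(ch):
    """Return the ASCII character that ch imitates, or None."""
    if ch in CONFUSABLES:
        return CONFUSABLES[ch]
    # Fullwidth forms in U+FF21 to U+FF5A fold to ASCII under NFKC normalization.
    if "\uff21" <= ch <= "\uff5a":
        folded = unicodedata.normalize("NFKC", ch)
        if folded.isascii():
            return folded
    return None


def find_homoglyphs(text):
    """Return (position, character, ASCII lookalike) for each suspect character."""
    hits = []
    for position, ch in enumerate(text):
        lookalike = ascii_lookalike(ch)
        if lookalike is not None:
            hits.append((position, ch, lookalike))
    return hits


print(find_homoglyphs("p\u0430ypal.com"))
```

A table like this has no false positives on characters it does not list, which matches the precision-over-recall trade-off described above, but it also misses any confusable nobody added to it.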

Why are some results marked risk=low even though there's a zero-width space? Zero-width spaces are sometimes inserted legitimately by word processors, CMS platforms, and copy-paste operations (especially from web pages). low risk means an issue was found but it may not be malicious. Only bidirectional override characters are marked critical because they have no legitimate use in most text contexts.

Why is a text showing 0 issues when I can see something strange in it? Most likely the character is not in the current detection tables. First make sure all detection options are enabled. If the character belongs to a script that isn't covered (e.g., Armenian lookalikes), it won't be detected. You can check the character manually by looking up its codepoint in the Unicode code charts at unicode.org.

Can I analyze very long texts? Yes. The actor processes texts of any length. Very long texts (millions of characters) may take a few seconds each. The 300-second timeout is sufficient for typical use cases. If you need to analyze extremely long documents, split them into chunks before passing to the actor.
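
Splitting a long document before submission might look like the Python sketch below. The chunk size is an arbitrary placeholder and both function names are hypothetical. Fixed-size splitting is adequate if each character is judged on its own, which is an assumption about the detectors; note also that each chunk counts as a separate text for per-text billing.

```python
def chunk_text(text, size=100_000):
    """Split text into (start_offset, chunk) pairs of at most `size` codepoints."""
    return [(start, text[start:start + size]) for start in range(0, len(text), size)]


def to_document_position(chunk_start, position):
    """Map an issue position inside a chunk back to the original document."""
    return chunk_start + position


document = "abcdefghij"
chunks = chunk_text(document, size=4)
for start, chunk in chunks:
    print(start, repr(chunk))
```

Keeping the start offset next to each chunk lets you translate `issues[].position` values back to positions in the full document.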

Other text and data quality tools

Looking for related utilities? Check these automation-lab actors:

🎨 Color Contrast Checker — WCAG 2.1 AA/AAA contrast ratio validation for UI design
🔬 Accessibility Checker — WCAG accessibility audit for web pages
📋 JSON Schema Generator — Generate JSON Schema from example JSON data
🔗 Ads.txt Checker — Validate ads.txt files for publisher compliance
📊 Base64 Converter — Encode and decode Base64 strings
