What Is Unicode Normalization?
Unicode normalization converts equivalent character sequences to a single, consistent representation. The same visible character can be stored as different code point sequences, and therefore as different bytes. Normalizing text to one form is essential for correct string comparison, search indexing, and database deduplication.
The Core Problem: Multiple Representations
The letter é can be stored in two ways that render identically but consist of different code points and bytes:
Precomposed (NFC):
Character: é
Code point: U+00E9
UTF-8 bytes: C3 A9
Byte length: 2
Decomposed (NFD):
Characters: e + ◌́
Code points: U+0065 + U+0301
UTF-8 bytes: 65 CC 81
Byte length: 3
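The two representations above can be inspected directly. The following is a minimal Python sketch, not a definitive tool; the helper name `describe` is hypothetical and exists only for this illustration.

```python
def describe(text: str) -> str:
    """Return the code points and UTF-8 bytes of text, for inspection."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_bytes = text.encode("utf-8").hex(" ").upper()  # bytes.hex(sep) needs Python 3.8+
    return f"{code_points} | {utf8_bytes}"

precomposed = "\u00E9"  # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

print(describe(precomposed))  # U+00E9 | C3 A9
print(describe(decomposed))   # U+0065 U+0301 | 65 CC 81
```

Writing the strings with `\u` escapes keeps the example independent of how the editor or file system stored the source file.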
String comparison without normalization:
"café" (NFC) === "café" (NFD) → FALSE ❌
After normalizing both to NFC:
"café" (NFC) === "café" (NFC) → TRUE ✓
NFC — Canonical Decomposition, then Composition
NFC is the most common form for web and application storage. It produces precomposed characters:
Input (NFD, decomposed):
e + ́ → é (U+0065 + U+0301 → U+00E9)
a + ̈ → ä (U+0061 + U+0308 → U+00E4)
n + ̃ → ñ (U+006E + U+0303 → U+00F1)
After NFC normalization:
Precomposed wherever Unicode defines a composite: one code point per accented character in these examples
Recommended for: HTML, JSON, databases, APIs
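As a small illustration of NFC in Python, the sketch below uses the standard `unicodedata` module. The wrapper name `to_nfc` is hypothetical. It also shows one limit worth knowing: NFC can only compose a pair when Unicode defines a precomposed code point for it.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Normalize text to NFC (canonical decomposition, then composition)."""
    return unicodedata.normalize("NFC", text)

composed = to_nfc("e\u0301")
print(len(composed), f"U+{ord(composed):04X}")  # 1 U+00E9

# Unicode has no precomposed "x with acute accent",
# so this pair stays as two code points even in NFC.
print(len(to_nfc("x\u0301")))  # 2

# NFC is idempotent: normalizing already-normalized text changes nothing.
print(to_nfc(composed) == composed)  # True
```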
NFD — Canonical Decomposition
NFD decomposes all precomposed characters into base + combining marks:
Input (NFC, precomposed):
é → e + ́ (U+00E9 → U+0065 + U+0301)
ä → a + ̈ (U+00E4 → U+0061 + U+0308)
ñ → n + ̃ (U+00F1 → U+006E + U+0303)
After NFD normalization:
All decomposed — base character + combining accent
Recommended for: text processing, accent stripping, macOS file system
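Accent stripping is a common reason to use NFD: once accents are separate combining marks, they can be filtered out. The following is a sketch under the assumption that dropping every nonspacing mark (general category `Mn`) is acceptable for the text at hand; the function name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose to NFD, drop nonspacing combining marks, recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks)

# "café señor über", written with escapes so the input form is unambiguous
print(strip_accents("caf\u00E9 se\u00F1or \u00FCber"))  # cafe senor uber
```

Letters that have no canonical decomposition, such as ø (U+00F8), pass through unchanged, so this approach is not a general transliteration scheme.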
NFKC — Compatibility Decomposition, then Composition
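NFKC additionally folds compatibility characters: characters that carry the same underlying meaning as a plain character but differ in appearance or formatting. The mapping is lossy (x² becomes x2), so it suits comparison keys better than stored display text: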
Compatibility mappings applied by NFKC:
Ligatures:
ﬁ (fi ligature, U+FB01) → fi
ﬀ (ff ligature, U+FB00) → ff
Fullwidth characters:
Ａ (fullwidth A, U+FF21) → A
１ (fullwidth 1, U+FF11) → 1
Superscripts/subscripts:
² (superscript 2, U+00B2) → 2
₃ (subscript 3, U+2083) → 3
Mathematical variants:
𝐀 (math bold A, U+1D400) → A
𝑨 (math bold italic A, U+1D468) → A
Recommended for: search indexing, username normalization, password hashing
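The compatibility mappings listed above can be checked with Python's `unicodedata` module. This is a minimal sketch; `nfkc` and `username_key` are hypothetical helper names, and folding case with `casefold()` is an assumption about what a username comparison would want rather than part of NFKC itself.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Normalize text to NFKC (compatibility decomposition, then composition)."""
    return unicodedata.normalize("NFKC", text)

def username_key(name: str) -> str:
    """Build a comparison key: NFKC, then case folding (an assumed policy)."""
    return nfkc(name).casefold()

print(nfkc("\uFB01"))      # fi  (fi ligature)
print(nfkc("\uFF21"))      # A   (fullwidth A)
print(nfkc("x\u00B2"))     # x2  (superscript two is flattened)
print(nfkc("\U0001D400"))  # A   (mathematical bold A)

# Fullwidth "John" and ASCII "john" map to the same key.
print(username_key("\uFF2A\uFF4F\uFF48\uFF4E") == username_key("john"))  # True
```

Because the key loses information, a typical design stores the original string for display and uses the key only for uniqueness checks and lookups.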
NFKD — Compatibility Decomposition
NFKD applies all compatibility decompositions without recomposition — the most decomposed form:
Input: ﬁnancé (starts with the ﬁ ligature, U+FB01; ends with precomposed é, U+00E9)
NFKD output:
ﬁ → f + i (ligature decomposed)
é → e + ́ (accent decomposed)
Result: f + i + n + a + n + c + e + ́
(7 base characters + 1 combining accent)
Recommended for: text analysis, character-level processing
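The worked example above can be reproduced in Python. This sketch assumes the input word starts with the single ligature code point U+FB01 and ends with precomposed U+00E9; the helper name `code_points` is hypothetical.

```python
import unicodedata

def code_points(text: str) -> list:
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

word = "\uFB01nanc\u00E9"  # ligature fi + n + a + n + c + precomposed é
nfkd = unicodedata.normalize("NFKD", word)
nfkc = unicodedata.normalize("NFKC", word)

print(len(word), len(nfkd), len(nfkc))  # 6 8 7
print(code_points(nfkd))
# ['U+0066', 'U+0069', 'U+006E', 'U+0061', 'U+006E', 'U+0063', 'U+0065', 'U+0301']
```

The NFKC result is one code point shorter than NFKD because the final e and its combining accent are recomposed into é.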
Normalization in Practice: String Comparison
// JavaScript — normalize before comparing
const str1 = "caf\u00E9"; // NFC: café (precomposed)
const str2 = "cafe\u0301"; // NFD: café (decomposed)
console.log(str1 === str2); // false ❌
// normalize() defaults to NFC; naming the form makes the intent explicit
console.log(str1.normalize('NFC') === str2.normalize('NFC')); // true ✓
// Python
import unicodedata
s1 = "caf\u00E9"
s2 = "cafe\u0301"
print(unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2))
# True ✓
Normalization for Search Indexing
-- PostgreSQL 13+ (UTF8 database): normalize text before indexing
CREATE OR REPLACE FUNCTION normalize_text(input TEXT)
RETURNS TEXT AS $$
BEGIN
    -- NFKC normalization plus lowercasing for search
    RETURN lower(normalize(input, NFKC));
END;
$$ LANGUAGE plpgsql IMMUTABLE; -- expression indexes require IMMUTABLE functions
-- Index on normalized form
CREATE INDEX idx_products_name
    ON products (normalize_text(name));
-- Search matches regardless of normalization form or letter case
SELECT * FROM products
WHERE normalize_text(name) = normalize_text('Café');
-- Matches: "Café", "Cafe\u0301", "CAFÉ"
-- Does not match unaccented "cafe": NFKC keeps accents
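The same idea can be sketched outside the database. The Python example below builds a small in-memory index keyed on an NFKC, case-folded form. The names `search_key` and `build_index` are hypothetical, and the sample product names are made up for illustration.

```python
import unicodedata
from collections import defaultdict

def search_key(text: str) -> str:
    """Key used for both indexing and querying: NFKC, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()

def build_index(names):
    """Map each search key to the original names that share it."""
    index = defaultdict(list)
    for name in names:
        index[search_key(name)].append(name)
    return index

# Precomposed, decomposed, uppercase, and unaccented spellings
names = ["Caf\u00E9", "Cafe\u0301", "CAF\u00C9", "cafe"]
index = build_index(names)

print(len(index[search_key("caf\u00E9")]))  # 3
print(index[search_key("cafe")])            # ['cafe']
```

Normalization alone keeps accents, so the unaccented "cafe" lands under a different key. Accent-insensitive search needs an extra step that removes combining marks.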
Password Hashing with Normalization
Normalize passwords before hashing to prevent authentication failures across devices:
// The same visible password can arrive precomposed (NFC) or decomposed (NFD),
// depending on the OS, keyboard layout, and input method.
// Without normalization, the same password can produce different hashes.
// Correct approach: normalize to NFKC before hashing
import { createHash } from 'crypto';
function hashPassword(password: string): string {
    const normalized = password.normalize('NFKC');
    // SHA-256 keeps this example short; real password storage needs a salted, slow KDF (scrypt, bcrypt, Argon2)
    return createHash('sha256').update(normalized).digest('hex');
}
// "café" entered as NFD or as NFC now produces the same hash
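For comparison, here is a hedged Python sketch that combines NFKC normalization with a salted key-derivation function from the standard library. The function names, the 16-byte salt, and the iteration count are assumptions chosen for illustration, not recommendations taken from this article.

```python
import hashlib
import hmac
import os
import unicodedata

ITERATIONS = 200_000  # illustrative work factor; tune for your hardware

def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest) for the NFKC-normalized password using PBKDF2."""
    if salt is None:
        salt = os.urandom(16)
    normalized = unicodedata.normalize("NFKC", password)
    digest = hashlib.pbkdf2_hmac(
        "sha256", normalized.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the digest with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

salt, stored = hash_password("caf\u00E9")               # registered as NFC
print(verify_password("cafe\u0301", salt, stored))      # True: NFD input verifies
print(verify_password("cafe", salt, stored))            # False: different password
```

Normalizing inside `hash_password` means registration and login cannot drift apart, because both paths go through the same function.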
Cross-Platform Normalization Issues
- macOS HFS+ stores file names decomposed, in an Apple-specific variant of NFD
- Windows NTFS does not normalize; it stores whatever UTF-16 sequence it is given, and names created on Windows are usually precomposed (NFC)
- Linux ext4 stores whatever bytes you give it, with no normalization
- This causes issues when syncing files between platforms (e.g., Git on macOS vs Windows)
# Git normalization config
git config core.precomposeunicode true # macOS: store as NFC in repo
git config core.quotepath false # show Unicode in file paths
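Before syncing a directory across platforms, it can help to look for names that differ only in normalization form. The sketch below is one possible check, not a complete tool. The function name `find_normalization_collisions` is hypothetical, and it works on a plain list of names so it can be fed from `os.listdir()` or any other source.

```python
import unicodedata
from collections import defaultdict

def find_normalization_collisions(names):
    """Group distinct names that become identical after NFC normalization."""
    groups = defaultdict(list)
    for name in dict.fromkeys(names):  # drop exact duplicates, keep order
        groups[unicodedata.normalize("NFC", name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}

names = [
    "re\u0301sume\u0301.txt",  # decomposed, as a decomposing file system stores it
    "r\u00E9sum\u00E9.txt",    # precomposed
    "notes.txt",
]
collisions = find_normalization_collisions(names)
print(len(collisions))  # 1
```

Each reported group contains names that look the same to a user but are distinct byte sequences on a file system that does not normalize.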
Normalization Form Quick Reference
Form  Full Name                                  Use Case
----  -----------------------------------------  ------------------------------
NFC   Canonical Decomposition + Composition      Web, databases, APIs (default)
NFD   Canonical Decomposition                    Text processing, macOS FS
NFKC  Compatibility Decomposition + Composition  Search, usernames, passwords
NFKD  Compatibility Decomposition                Analysis, character processing
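To see which of the four forms a given string already satisfies, Python 3.8+ offers `unicodedata.is_normalized`. The following sketch wraps it in a hypothetical helper named `normalization_report`.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalization_report(text: str) -> dict:
    """Report, for each form, whether text is already normalized to it."""
    return {form: unicodedata.is_normalized(form, text) for form in FORMS}

print(normalization_report("caf\u00E9"))
# {'NFC': True, 'NFD': False, 'NFKC': True, 'NFKD': False}

print(normalization_report("cafe\u0301"))
# {'NFC': False, 'NFD': True, 'NFKC': False, 'NFKD': True}

print(normalization_report("\uFB01"))
# {'NFC': True, 'NFD': True, 'NFKC': False, 'NFKD': False}
```

Plain ASCII text is already in all four forms, which is why normalization bugs often surface only once accented or compatibility characters appear in real data.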