Unicode Normalization Forms
// The letter "é" can be represented two ways in Unicode:
// Precomposed: U+00E9 (single character)
// Decomposed: U+0065 + U+0301 (e + combining accent)
// Both look identical but are different byte sequences:
"\u00e9" // NFC form — precomposed é
"e\u0301" // NFD form — e + combining accent
// NFC normalization (most common for storage/exchange):
Input: "café" (decomposed)
Output: "café" (precomposed, NFC)
// NFD normalization (for sorting/comparison):
Input: "café" (precomposed)
Output: "cafe\u0301" (decomposed, NFD)
// NFKC normalization (compatibility + composition):
Input: "\ufb01le" (fi ligature U+FB01)
Output: "file" (decomposed to f + i)
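In JavaScript, the built-in String.prototype.normalize accepts all four form names. The following is a minimal sketch of the comparisons above; the variable names are illustrative only.

```javascript
// Two encodings of "é" that render identically.
const precomposed = "\u00e9";  // U+00E9
const decomposed = "e\u0301";  // U+0065 + U+0301

// As raw strings they are different...
const equalRaw = precomposed === decomposed; // false

// ...but they compare equal once both are in the same form.
const equalNfc =
  precomposed.normalize("NFC") === decomposed.normalize("NFC"); // true

// NFD splits the precomposed letter into base letter + combining mark.
const nfdLength = "caf\u00e9".normalize("NFD").length; // 5

// NFKC additionally replaces compatibility characters such as the fi ligature.
const ligature = "\ufb01le".normalize("NFKC"); // "file"
```

Normalizing both sides to the same form before comparing or storing avoids mismatches between strings that look the same.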
Whitespace Normalization
// Input with various whitespace issues
"Hello    World" // multiple spaces
"   leading spaces" // leading whitespace
"trailing spaces   " // trailing whitespace
"tab\there" // tab character
"non\u00a0breaking" // non-breaking space (U+00A0)
"thin\u2009space" // thin space (U+2009)
// After whitespace normalization:
"Hello World" // collapsed to single space
"leading spaces" // leading whitespace trimmed
"trailing spaces" // trailing whitespace trimmed
"tab here" // tab → space
"non breaking" // non-breaking space → regular space
"thin space" // thin space → regular space
Case Normalization
// Lowercase (for search indexing)
Input: "The Quick Brown Fox"
Output: "the quick brown fox"
// Uppercase
Input: "hello world"
Output: "HELLO WORLD"
// Title Case
Input: "the quick brown fox"
Output: "The Quick Brown Fox"
// Sentence case
Input: "hello world. this is a test."
Output: "Hello world. This is a test."
// Unicode-aware lowercase (handles non-ASCII)
Input: "ÜBER STRASSE"
Output: "über strasse" (lowercasing "SS" gives "ss", never "ß")
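Lowercase and uppercase map directly to the built-in toLowerCase and toUpperCase. Title case and sentence case have no built-in, so the two helpers below (toTitleCase and toSentenceCase, both hypothetical names) are a simple sketch that assumes words are separated by whitespace and sentences end in ., ! or ?.

```javascript
// Hypothetical helper: lowercase everything, then uppercase the first
// letter of each whitespace-delimited word.
function toTitleCase(text) {
  return text
    .toLowerCase()
    .replace(/(^|\s)(\p{L})/gu, (match, sep, letter) => sep + letter.toUpperCase());
}

// Hypothetical helper: lowercase everything, then uppercase the first
// letter of the text and the first letter after ., ! or ?
function toSentenceCase(text) {
  return text
    .toLowerCase()
    .replace(/(^\s*|[.!?]\s+)(\p{L})/gu, (match, lead, letter) => lead + letter.toUpperCase());
}

toTitleCase("the quick brown fox");             // "The Quick Brown Fox"
toSentenceCase("hello world. this is a test."); // "Hello world. This is a test."
```

Both helpers lowercase the whole input first, so proper nouns and acronyms inside a sentence lose their capitals. Real titles and names usually need an exception list.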
Punctuation Normalization
// Curly quotes → straight quotes (for code/data)
Input: “Hello” and ‘World’
Output: "Hello" and 'World'
// Straight quotes → curly quotes (for typography)
Input: "Hello" and 'World'
Output: “Hello” and ‘World’
// Em dash normalization
Input: "word—word" (em dash U+2014)
Output: "word - word" (spaced hyphen)
// Ellipsis normalization
Input: "Wait…" (ellipsis character U+2026)
Output: "Wait..." (three periods)
// Apostrophe normalization
Input: "it\u2019s" (right single quotation mark)
Output: "it's" (ASCII apostrophe)
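The typographic-to-ASCII direction is a set of fixed character replacements. A minimal sketch follows; asciiPunctuation is a hypothetical helper name, and the mapping covers only the characters listed above.

```javascript
// Hypothetical helper: map typographic punctuation to ASCII equivalents.
function asciiPunctuation(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'") // curly single quotes and apostrophe
    .replace(/[\u201c\u201d]/g, '"') // curly double quotes
    .replace(/\u2014/g, " - ")       // em dash → spaced hyphen
    .replace(/\u2026/g, "...");      // ellipsis character → three periods
}

asciiPunctuation("it\u2019s");      // "it's"
asciiPunctuation("word\u2014word"); // "word - word"
asciiPunctuation("Wait\u2026");     // "Wait..."
```

The reverse direction (straight quotes to curly) is not a plain substitution, because the choice between an opening and a closing quote depends on the surrounding characters. It is left out of this sketch.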
Line Ending Normalization
// Windows CRLF → Unix LF
Input: "line1\r\nline2\r\nline3\r\n"
Output: "line1\nline2\nline3\n"
// Old Mac CR → Unix LF
Input: "line1\rline2\rline3\r"
Output: "line1\nline2\nline3\n"
// Mixed → Unix LF
Input: "line1\r\nline2\nline3\r"
Output: "line1\nline2\nline3\n"
// Unix LF → Windows CRLF (for Windows compatibility)
Input: "line1\nline2\nline3\n"
Output: "line1\r\nline2\r\nline3\r\n"
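All four conversions can be handled by one replacement, as long as CRLF is matched before a lone CR or LF. The helper name normalizeLineEndings and its target parameter are assumptions made for this sketch.

```javascript
// Hypothetical helper: convert every line break to the target sequence.
// "\r\n" comes first in the alternation so that a Windows line ending is
// replaced once instead of being treated as two separate breaks.
function normalizeLineEndings(text, target = "\n") {
  return text.replace(/\r\n|\r|\n/g, target);
}

normalizeLineEndings("line1\r\nline2\nline3\r");     // "line1\nline2\nline3\n"
normalizeLineEndings("line1\nline2\n", "\r\n");      // "line1\r\nline2\r\n"
```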
Diacritic Removal (ASCII Folding)
// Remove accent marks for URL slugs and search
Input: "café résumé naïve"
Output: "cafe resume naive"
Input: "Ångström über Straße"
Output: "Angstrom uber Strasse"
Input: "Ñoño señor"
Output: "Nono senor"
// Useful for:
// - Generating URL slugs from titles
// - Creating ASCII-safe identifiers
// - Search indexing for accent-insensitive search
// - Normalizing names for comparison
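One common way to implement this in JavaScript is to decompose with NFD and then delete the combining marks. The sketch below uses two hypothetical helpers, foldToAscii and slugify. It assumes that ß should become "ss", which has to be handled separately because ß has no Unicode decomposition.

```javascript
// Hypothetical helper: strip accents from Latin letters.
function foldToAscii(text) {
  return text
    .normalize("NFD")                // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .replace(/ß/g, "ss");            // ß has no decomposition, so map it by hand
}

// Hypothetical helper: build a URL slug from a title.
function slugify(title) {
  return foldToAscii(title)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")     // any run of other characters → one hyphen
    .replace(/^-+|-+$/g, "");        // strip hyphens at both ends
}

foldToAscii("Ångström über Straße"); // "Angstrom uber Strasse"
slugify("Café Résumé!");             // "cafe-resume"
```

Letters that are not a base letter plus a combining mark (for example ø or æ) pass through foldToAscii unchanged and would need their own mapping entries.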
Search Index Normalization
// Normalize text before indexing for search
function normalizeForSearch(text) {
  return text
    .normalize('NFKD') // Unicode compatibility decomposition
    .replace(/[\u0300-\u036f]/g, '') // Remove combining marks
    .toLowerCase() // Lowercase
    .replace(/ß/g, 'ss') // ß has no decomposition, so fold it explicitly
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // Replace punctuation with spaces
    .replace(/\s+/g, ' ') // Collapse whitespace
    .trim();
}
normalizeForSearch("Café Résumé") // → "cafe resume"
normalizeForSearch("Hello, World!") // → "hello world"
normalizeForSearch("über straße") // → "uber strasse"
Data Cleaning for Database Import
// Raw data from CSV export
" Alice Smith " → "Alice Smith" (trim)
"BOB JONES" → "Bob Jones" (title case)
"carol\u00a0white" → "carol white" (non-breaking space)
"dave\r\njohnson" → "dave johnson" (line ending in field)
"Eve\u2019s" → "Eve's" (curly apostrophe)
Normalization Options Reference
- Unicode form: NFC (default), NFD, NFKC, NFKD
- Whitespace: trim, collapse multiple spaces, normalize Unicode spaces
- Case: lowercase, uppercase, title case, sentence case
- Punctuation: normalize quotes, dashes, ellipsis, apostrophes
- Line endings: convert all line endings to LF, CRLF, or CR
- Diacritics: remove accent marks (ASCII folding)
Paste your text into the Text Normalizer, select the normalization operations you need, and get clean, consistent output ready for processing.
Full Normalization Pipeline Example
// Raw input from user form submission
" Hello, World!\u00a0 It\u2019s a \u201cgreat\u201d day\u2026 "
// Step 1: Unicode NFC normalization
" Hello, World!\u00a0 It\u2019s a \u201cgreat\u201d day\u2026 "
// Step 2: Normalize special Unicode characters (the non-breaking space becomes a second regular space)
" Hello, World!  It's a \"great\" day... "
// Step 3: Trim leading/trailing whitespace
"Hello, World!  It's a \"great\" day..."
// Step 4: Collapse multiple spaces
"Hello, World! It's a \"great\" day..."
// Step 5: Lowercase
"hello, world! it's a \"great\" day..."
// Final normalized output:
"hello, world! it's a \"great\" day..."