What Is a Character Set?
A character set (charset) is a mapping between characters and their numeric codes. A character encoding defines how those codes are stored as bytes. The two concepts are often conflated, but they're distinct: Unicode is a character set (assigns code points to over 140,000 characters); UTF-8, UTF-16, and UTF-32 are encodings (define how to store those code points as bytes).
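To make the distinction concrete, here is a minimal sketch that takes one character, looks up its single Unicode code point, and then encodes it three ways. The sample character and the variable names are chosen only for illustration; the big-endian codec variants are used so that no byte-order mark is added to the output.

```python
# Illustrative only: one character, one code point, three different byte sequences.
ch = "é"

code_point = ord(ch)                   # the character's number in the Unicode character set
utf8_bytes = ch.encode("utf-8")        # the same code point stored as UTF-8
utf16_bytes = ch.encode("utf-16-be")   # stored as UTF-16 (big-endian, no BOM)
utf32_bytes = ch.encode("utf-32-be")   # stored as UTF-32 (big-endian, no BOM)

print(f"U+{code_point:04X}")   # U+00E9
print(utf8_bytes.hex())        # c3a9
print(utf16_bytes.hex())       # 00e9
print(utf32_bytes.hex())       # 000000e9
```

The code point never changes; only the bytes used to store it do.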
Common Encodings Compared
| Encoding | Bytes per char | Coverage | Use case |
|---|---|---|---|
| ASCII | 1 | 128 chars (English only) | Legacy systems, protocols |
| ISO-8859-1 (Latin-1) | 1 | 256 chars (Western European) | Legacy web pages |
| Windows-1252 | 1 | 256 chars (similar to Latin-1) | Windows legacy files |
| UTF-8 | 1–4 | All Unicode (1.1M+ code points) | Web standard, modern default |
| UTF-16 | 2 or 4 | All Unicode | Windows APIs, Java strings |
| UTF-32 | 4 | All Unicode | Internal processing (fixed-width code points) |
| Shift-JIS | 1–2 | Japanese characters | Legacy Japanese systems |
| GB2312/GBK | 1–2 | Chinese characters | Legacy Chinese systems |
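The byte counts and coverage limits in the table can be checked directly. The sketch below is an illustration rather than a benchmark: `encoded_length` is a hypothetical helper name, the sample characters are arbitrary, and the codec names are the ones Python's standard library uses for these encodings.

```python
# Sample characters: plain ASCII, accented Latin, a kanji, and an emoji.
samples = ["A", "é", "日", "😀"]
encodings = ["ascii", "latin-1", "cp1252", "utf-8",
             "utf-16-le", "utf-32-le", "shift_jis", "gbk"]

def encoded_length(text, encoding):
    """Return how many bytes `text` needs in `encoding`, or None if it cannot be represented."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    sizes = {enc: encoded_length(ch, enc) for enc in encodings}
    print(repr(ch), sizes)
```

A `None` in the output marks a character that falls outside that encoding's coverage, which is why the single-byte legacy encodings cannot hold text such as "日" or "😀".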
Converting Encodings in Python
# Read a file in one encoding, write it out in another
def convert_encoding(input_path, output_path, from_enc, to_enc):
    # errors='replace' substitutes U+FFFD for undecodable bytes instead of raising
    with open(input_path, encoding=from_enc, errors='replace') as f:
        content = f.read()
    # Open in write mode; the default mode is 'r', so f.write() would fail without 'w'
    with open(output_path, 'w', encoding=to_enc) as f:
        f.write(content)

# Convert Windows-1252 to UTF-8
convert_encoding('legacy.txt', 'modern.txt', 'windows-1252', 'utf-8')
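# Detect encoding with chardet (third-party package: pip install chardet)
import chardet

with open('unknown.txt', 'rb') as f:
    raw = f.read()
result = chardet.detect(raw)
# The result is a statistical guess with a confidence score, not a guarantee
print(result)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}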
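When a detection library is not available, or its confidence is low, one common fallback is to try a short list of likely encodings in order. The following is a sketch under that assumption and is not part of chardet; `decode_with_fallback` and its default candidate list are hypothetical and should be adapted to the data you expect.

```python
def decode_with_fallback(raw, candidates=("utf-8", "windows-1252")):
    """Try each candidate encoding in order and return (text, encoding_used)."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Latin-1 maps every byte to a code point, so this decode never fails,
    # but the result may still be wrong for the real source encoding.
    return raw.decode("latin-1"), "latin-1"

text, used = decode_with_fallback("café".encode("utf-8"))
print(text, used)  # café utf-8
```

UTF-8 goes first because it is strict: bytes from most legacy encodings rarely form valid UTF-8 by accident, so a successful UTF-8 decode is a strong signal.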
The "Mojibake" Problem
Mojibake (文字化け) is the garbled text that appears when a file is read with the wrong encoding. For example, reading a UTF-8 file as Latin-1 turns "café" into "cafÃ©", because the two UTF-8 bytes of "é" (0xC3 0xA9) are each displayed as a separate Latin-1 character. The fix is to identify the correct source encoding and re-read the original bytes with it. Never try to "fix" mojibake by re-encoding already-garbled text: you can lose data.
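The effect is easy to reproduce. This minimal sketch, with variable names chosen only for illustration, decodes the same UTF-8 bytes with the wrong and the right encoding, then shows one way data is lost for good: decoding under the wrong encoding with `errors="replace"`.

```python
original = "café"
utf8_bytes = original.encode("utf-8")     # b'caf\xc3\xa9'

# Wrong encoding: each UTF-8 byte of "é" becomes its own Latin-1 character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)                            # cafÃ©

# Right encoding: go back to the original bytes and decode them correctly.
fixed = utf8_bytes.decode("utf-8")
print(fixed)                              # café

# Lossy case: Latin-1 bytes decoded as UTF-8 with errors="replace".
# The invalid byte is swapped for U+FFFD and the original value is gone.
latin1_bytes = original.encode("latin-1") # b'caf\xe9'
lossy = latin1_bytes.decode("utf-8", errors="replace")
print(lossy == "caf\ufffd")               # True
```

Once a replacement character has been written back to disk, nothing in the file records which byte it stood for, so keep the original bytes until the correct encoding is confirmed.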