Character encoding
(February 2005)
Text files can be encoded in several ways. This matters most for languages that use accented characters (Spanish, French, German, ...).
This article compares four character encodings:
- ASCII
- ISO 8859-1
- ISO 8859-15
- UTF-8
Here are examples of how characters are encoded. The first column gives the character; the other columns give its encoded value in hexadecimal.
character | ASCII   | ISO 8859-1 | ISO 8859-15 | UTF-8
a         | 61      | 61         | 61          | 61
{         | 7B      | 7B         | 7B          | 7B
é         | N/A (1) | E9         | E9          | C3A9
€         | N/A (1) | N/A (1)    | A4          | E282AC
(1) This character does not exist in the encoding considered.
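The values in the table can be checked with a short Python 3 sketch. The helper name `hex_in` is hypothetical, and the codec names are the ones Python uses for these four encodings. Characters are written as Unicode escapes so that the script does not depend on the encoding of its own source file.

```python
# Encodings compared in the table, by their Python codec names.
ENCODINGS = ["ascii", "iso-8859-1", "iso-8859-15", "utf-8"]


def hex_in(char, encoding):
    """Return the bytes of `char` in `encoding` as uppercase hex, or "N/A"."""
    try:
        return char.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        # The character does not exist in this encoding.
        return "N/A"


# a, {, e with acute accent, euro sign
for char in ["a", "{", "\u00e9", "\u20ac"]:
    row = [hex_in(char, enc) for enc in ENCODINGS]
    print(f"U+{ord(char):04X}", row)
```

Each printed row lists the hexadecimal value for ASCII, ISO 8859-1, ISO 8859-15 and UTF-8, in that order.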
Characteristics of these encodings:
- ASCII has 128 characters only.
- ISO 8859-1 and ISO 8859-15 differ in 8 characters.
For instance, the following have been added in ISO 8859-15: € (euro sign) and Œ
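The positions where the two ISO encodings differ can be listed by decoding every byte value with both codecs. This is a minimal Python 3 sketch; the variable names are illustrative only.

```python
# Every byte value decodes to some character in both encodings,
# so the two tables can be compared byte by byte.
differences = []
for value in range(256):
    byte = bytes([value])
    latin1 = byte.decode("iso-8859-1")
    latin9 = byte.decode("iso-8859-15")
    if latin1 != latin9:
        differences.append((value, latin1, latin9))

# Print the byte value and the code point it maps to in each encoding.
for value, latin1, latin9 in differences:
    print(f"{value:02X}: U+{ord(latin1):04X} -> U+{ord(latin9):04X}")
```

The first line printed concerns byte A4, which is the currency sign U+00A4 in ISO 8859-1 and the euro sign U+20AC in ISO 8859-15.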
A text written with ISO 8859-1 encoding will not display correctly if the software showing it interprets it as UTF-8.
This often happens with emails and can usually be corrected by selecting the encoding in a menu.
[Figures not available: an example of text displayed with a wrong encoding, and the same text displayed with the correct encoding.]
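The effect of a wrong encoding can be reproduced in Python 3 by encoding a string one way and decoding it another. The sample text "déjà vu" is only an illustration. Results are printed with `ascii()` so that the output does not depend on the terminal encoding.

```python
text = "d\u00e9j\u00e0 vu"  # "deja vu" with its two accented letters

# Case 1: written as UTF-8, read as ISO 8859-1.
# Each accented letter becomes two unrelated characters.
utf8_read_as_latin1 = text.encode("utf-8").decode("iso-8859-1")

# Case 2: written as ISO 8859-1, read as UTF-8.
# The lone bytes E9 and E0 are not valid UTF-8, so a strict decode fails;
# errors="replace" substitutes U+FFFD for each invalid byte instead.
latin1_read_as_utf8 = text.encode("iso-8859-1").decode("utf-8", errors="replace")

# Reading with the encoding that was used for writing restores the text.
round_trip = text.encode("iso-8859-1").decode("iso-8859-1")

print(ascii(utf8_read_as_latin1))
print(ascii(latin1_read_as_utf8))
print(ascii(round_trip))
```

In the first case the text grows from 7 to 9 characters; in the second, each accented letter is replaced by the Unicode replacement character.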
In programming languages where a string is a sequence of bytes (C, PHP, Python 2, ...), the character encoding of the source code determines the length of string literals.
The following Python 2 code gives a different result depending on whether the source file is encoded in ISO 8859-1 or UTF-8:
len("déjà vu")
This gives the length of the string in bytes:
- 7 if the encoding is ISO 8859-1
- 9 if the encoding is UTF-8, because é and à are each encoded on 2 bytes
In Python 3, strings are sequences of Unicode characters, so the same call returns 7 whatever the encoding of the source file.
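The two lengths can also be obtained explicitly. The sketch below assumes Python 3, where `len()` on a string counts characters and the byte length only appears once the string is encoded.

```python
text = "d\u00e9j\u00e0 vu"  # "deja vu" with its two accented letters

chars = len(text)                               # number of characters
latin1_bytes = len(text.encode("iso-8859-1"))   # one byte per character
utf8_bytes = len(text.encode("utf-8"))          # accented letters take 2 bytes

print(chars, latin1_bytes, utf8_bytes)  # prints: 7 7 9
```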
On Linux systems, iconv is a useful tool for converting a text document from one encoding to another:
shell> iconv -f UTF-8 -t ISO-8859-15 my_file_UTF_8.txt > my_file_ISO_8859_15.txt
shell> iconv -f UTF-8 -t ASCII my_file_UTF_8.txt > my_file_ASCII.txt
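Where iconv is not available, the same kind of conversion can be sketched in Python 3. The function name `convert_file` is hypothetical, and the demonstration writes its own temporary file rather than relying on an existing one. The text is encoded before the destination is opened, so a character with no equivalent in the target encoding raises an error without leaving a partial output file.

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_encoding, dst_encoding):
    """Rewrite src_path into dst_path, changing the character encoding."""
    with open(src_path, "rb") as src:
        text = src.read().decode(src_encoding)
    # Raises UnicodeEncodeError if a character has no equivalent
    # in dst_encoding (for example the euro sign in ASCII).
    data = text.encode(dst_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(data)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "my_file_UTF_8.txt")
    dst = os.path.join(tmp, "my_file_ISO_8859_15.txt")
    with open(src, "wb") as f:
        f.write("10 \u20ac".encode("utf-8"))  # "10" followed by a euro sign
    convert_file(src, dst, "utf-8", "iso-8859-15")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted.hex())
```

The euro sign, three bytes in the UTF-8 input, becomes the single byte A4 in the ISO 8859-15 output.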
In the Linux environment, the LANG variable specifies the locale, which includes the character encoding used:
shell> echo $LANG
fr_FR.UTF-8
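A program can extract the encoding from such a value. The sketch below assumes the usual locale name form `language[_territory][.codeset][@modifier]`; the function name `codeset_from_locale` is hypothetical.

```python
import os


def codeset_from_locale(value):
    """Return the codeset part of a locale name such as "fr_FR.UTF-8".

    Returns None when no codeset is given (for example "C" or "fr_FR").
    """
    name = value.split("@", 1)[0]  # drop an optional "@modifier"
    if "." not in name:
        return None
    return name.split(".", 1)[1]


print(codeset_from_locale("fr_FR.UTF-8"))
# Value for the current environment; None if LANG is unset or has no codeset.
print(codeset_from_locale(os.environ.get("LANG", "")))
```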
More complete information is available on Wikipedia and at:
www.utf-8.com