Text files can be encoded in several ways. This is most important for languages which use accented characters (Spanish, French, German, ...).
In this article we focus on 4 character encodings and compare them:
- ASCII
- ISO 8859-1
- ISO 8859-15
- UTF-8
Here are examples of how characters are encoded:
(the first column gives the glyph; the values in the other columns are in hexadecimal)
(1) This symbol does not exist in the considered character encoding.
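As a rough illustration of the kind of table described above, the following Python 3 sketch prints the hexadecimal bytes of a few glyphs in each of the 4 encodings. The sample glyphs and the helper name `hex_bytes` are choices made for this example and are not taken from the original table. A glyph that does not exist in an encoding is marked with (1).

```python
# Sample glyphs chosen for illustration only.
SAMPLES = ["A", "é", "à", "€"]
# Python codec names for the 4 encodings discussed in this article.
ENCODINGS = ["ascii", "iso-8859-1", "iso-8859-15", "utf-8"]


def hex_bytes(char, encoding):
    """Return the bytes of `char` in `encoding` as hex, or "(1)" if absent."""
    try:
        return " ".join(f"{byte:02X}" for byte in char.encode(encoding))
    except UnicodeEncodeError:
        return "(1)"


# Header row, then one row per glyph.
print("glyph  " + "  ".join(f"{name:<11}" for name in ENCODINGS))
for char in SAMPLES:
    cells = "  ".join(f"{hex_bytes(char, name):<11}" for name in ENCODINGS)
    print(f"{char:<5}  {cells}")
```

The accented letters take one octet in both ISO encodings and two in UTF-8, while the euro sign only exists in ISO 8859-15 and UTF-8.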
Characteristics of these encodings:
- ASCII has 128 characters only.
- Differences between ISO 8859-1 and ISO 8859-15 occur for 8 characters.
For instance, the following have been added in ISO 8859-15:
€ (euro sign), Š, š, Ž, ž, Œ, œ, Ÿ.
A text which has been written with ISO 8859-1 encoding will not look correct if it is displayed by software which treats it as UTF-8.
This happens often with emails and can usually be corrected by setting the encoding in a menu.
Example of a wrong encoding:
Example of the same text with a correct encoding:
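The wrong and correct displays can be reproduced with a short Python 3 sketch. The sample text `"déjà vu"` is an assumption made for this example, since the original illustrations are not available here. Decoding bytes with an encoding other than the one used to write them is what produces the garbled output.

```python
text = "déjà vu"  # sample text, chosen for illustration

# Wrong: bytes written as ISO 8859-1 but read as UTF-8.
# The lone accented bytes are invalid UTF-8 and become U+FFFD.
latin1_bytes = text.encode("iso-8859-1")
wrong_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Wrong the other way round: UTF-8 bytes read as ISO 8859-1.
# Each accented letter turns into two unrelated characters.
utf8_bytes = text.encode("utf-8")
wrong_latin1 = utf8_bytes.decode("iso-8859-1")

# Correct: decode with the encoding that was used to write the bytes.
right = latin1_bytes.decode("iso-8859-1")

print(wrong_utf8)
print(wrong_latin1)
print(right)
```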
When using any programming language (C, PHP, Python, ...), the character encoding of the source code is important to determine the length of strings.
The following Python code will give a different result depending on whether the character encoding is ISO 8859-1 or UTF-8:
This will give the length of the string:
- 7 if the encoding is ISO 8859-1
- 9 if the encoding is UTF-8, because é and à are each coded by 2 octets
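The 7 versus 9 result can be checked with a minimal Python 3 sketch. The string `"déjà vu"` is a hypothetical sample chosen because it has 7 characters including é and à; it is not necessarily the string used in the original code. Note that in Python 3, `len` on a `str` counts characters whatever the source encoding, so the difference only shows up once the text is encoded to bytes.

```python
text = "déjà vu"  # hypothetical sample: 7 characters, 2 of them accented

# Number of characters, independent of any encoding.
print(len(text))                       # 7

# Number of octets once the text is encoded.
print(len(text.encode("iso-8859-1")))  # 7: one octet per character
print(len(text.encode("utf-8")))       # 9: é and à take 2 octets each
```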
On Linux systems, iconv is a useful tool to convert a text document from one encoding to another:
shell> iconv -f UTF-8 -t ISO-8859-15 my_file_UTF_8.txt > my_file_ISO_8859_15.txt
shell> iconv -f UTF-8 -t ASCII my_file_UTF_8.txt > my_file_ASCII.txt
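For comparison, a similar conversion can be sketched in Python 3. The function name `convert_encoding` and the sample sentence are assumptions made for this example, and the sketch reads the whole file into memory, so it is only meant for small documents.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, src_encoding, dst_encoding):
    """Re-encode a text file, roughly like `iconv -f SRC -t DST`."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    # Raises UnicodeEncodeError if a character has no equivalent in the
    # target encoding (for example "é" when converting to ASCII).
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


# Usage with throwaway files; the file names mirror the shell example.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "my_file_UTF_8.txt")
    dst = os.path.join(tmp, "my_file_ISO_8859_15.txt")
    with open(src, "w", encoding="utf-8", newline="") as f:
        f.write("5 € for a café\n")
    convert_encoding(src, dst, "utf-8", "iso-8859-15")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted)
```

In the converted file the euro sign and the é each occupy a single octet, whereas they took 3 and 2 octets in the UTF-8 source.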
In the Linux environment, the LANG variable specifies the encoding used.
shell> echo $LANG
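The value of LANG typically has the form language_TERRITORY.codeset, for example fr_FR.UTF-8. The following Python 3 sketch extracts the encoding part; the helper name `encoding_from_lang` is hypothetical, and the sketch ignores the LC_ALL and LC_CTYPE variables, which take precedence over LANG when they are set.

```python
import os


def encoding_from_lang(lang):
    """Extract the codeset from a locale string such as "fr_FR.UTF-8"."""
    if not lang or "." not in lang:
        return None  # e.g. "C" or "POSIX" carry no explicit codeset
    codeset = lang.split(".", 1)[1]
    # Drop an optional modifier such as "@euro".
    return codeset.split("@", 1)[0]


# Prints the codeset of the current environment, or None if LANG is
# unset or does not name one.
print(encoding_from_lang(os.environ.get("LANG")))
```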
There is more complete information on Wikipedia: