Character encoding
(February 2005)

Text files can be encoded in several ways. This is most important for languages which use accented characters (Spanish, French, German, ...).
In this article we focus on four character encodings (ASCII, ISO 8859-1, ISO 8859-15 and UTF-8) and compare them.
Table of contents

Comparison

Display Issues

More information

Comparison

Here are examples of how characters are encoded. The first column gives the glyph; the other columns give the encoded value in hexadecimal.

character   ASCII     ISO 8859-1   ISO 8859-15   UTF-8
a           61        61           61            61
{           7B        7B           7B            7B
é           N/A (1)   E9           E9            C3A9
€           N/A (1)   N/A (1)      A4            E282AC

(1) This symbol does not exist in this character encoding.
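The values in the table can be checked with a few lines of Python 3. This is only a minimal sketch: the helper name encoded_hex is made up for the illustration, and the last character is assumed to be the euro sign (U+20AC), which is the symbol ISO 8859-15 places at A4.

```python
# Codec names as Python spells them, in the same order as the table columns.
ENCODINGS = ["ascii", "iso-8859-1", "iso-8859-15", "utf-8"]

# a, {, e with acute accent (U+00E9), euro sign (U+20AC)
CHARACTERS = ["a", "{", "\u00e9", "\u20ac"]


def encoded_hex(char, encoding):
    """Return the bytes of char in the given encoding as hex, or 'N/A'."""
    try:
        return char.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        # The character does not exist in this encoding.
        return "N/A"


# One row per character, one value per encoding.
table = {ch: [encoded_hex(ch, enc) for enc in ENCODINGS] for ch in CHARACTERS}

for ch, row in table.items():
    # Print the code point rather than the glyph to keep the output ASCII-only.
    print(f"U+{ord(ch):04X}", *row)
```

The UTF-8 column shows why this encoding is called variable-length: one byte for "a", two for "é", three for "€".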
Characteristics of these encodings:

Display Issues

Text written in ISO 8859-1 will not display correctly in software that interprets it as UTF-8.
This often happens with emails and can usually be corrected by selecting the right encoding in a menu.
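The mismatch described above can be reproduced in Python 3 by encoding a string one way and decoding it another way. This is a sketch for illustration only; the variable names are arbitrary, and errors="replace" is chosen here so that invalid bytes show up as the replacement character U+FFFD instead of raising an exception.

```python
text = "d\u00e9j\u00e0 vu"  # "déjà vu", written with escapes to be unambiguous

# Case 1: ISO 8859-1 bytes read as UTF-8.
# E9 and E0 are not valid UTF-8 on their own, so each becomes U+FFFD.
latin1_bytes = text.encode("iso-8859-1")
wrong_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Case 2: UTF-8 bytes read as ISO 8859-1.
# Every byte is valid ISO 8859-1, so each accented letter turns into two characters.
utf8_bytes = text.encode("utf-8")
wrong_as_latin1 = utf8_bytes.decode("iso-8859-1")

# Decoding with the encoding that was used for writing gives the text back.
right = latin1_bytes.decode("iso-8859-1")

# ascii() makes the otherwise invisible differences explicit.
print(ascii(wrong_as_utf8))
print(ascii(wrong_as_latin1))
print(ascii(right))
```

The second case is the one most often seen in practice: each accented letter is shown as "Ã" followed by another character.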
Example of a wrong encoding:

Example of the same text with a correct encoding:


In any programming language (C, PHP, Python, ...), the character encoding of the source code matters when determining the length of strings that are stored as bytes.
The following Python code gives a different result depending on whether the character encoding is ISO 8859-1 or UTF-8:
len("déjà vu")
In Python 2, where a plain string literal is a byte string, this gives 7 with ISO 8859-1 (one byte per character) and 9 with UTF-8 (é and à take two bytes each).
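In Python 3, strings are sequences of Unicode characters, so len() on a string no longer depends on the encoding of the source file. The byte counts can still be compared by encoding the string explicitly, as in this short sketch (variable names are illustrative):

```python
text = "d\u00e9j\u00e0 vu"  # "déjà vu"

n_chars = len(text)                        # number of characters
n_latin1 = len(text.encode("iso-8859-1"))  # one byte per character
n_utf8 = len(text.encode("utf-8"))         # two bytes for each accented letter

print(n_chars, n_latin1, n_utf8)  # 7 7 9
```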
On Linux systems, iconv is a useful tool to convert a text document from one encoding to another:
shell> iconv -f UTF-8 -t ISO-8859-15 my_file_UTF_8.txt > my_file_ISO_8859_15.txt
shell> iconv -f UTF-8 -t ASCII my_file_UTF_8.txt > my_file_ASCII.txt
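Where iconv is not available, the same kind of conversion can be sketched in Python 3. The function convert_encoding and the file names below are made up for this example; it is not a replacement for iconv, only an illustration of the decode-then-encode step. With the default errors="strict", a character that does not exist in the target encoding raises UnicodeEncodeError, which is what happens when accented text is converted to ASCII.

```python
import tempfile
from pathlib import Path


def convert_encoding(src, dst, from_enc, to_enc, errors="strict"):
    """Read src as from_enc and write the same text to dst as to_enc."""
    data = Path(src).read_bytes()
    text = data.decode(from_enc)
    Path(dst).write_bytes(text.encode(to_enc, errors))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "my_file_UTF_8.txt"
    dst = Path(tmp) / "my_file_ISO_8859_15.txt"
    ascii_dst = Path(tmp) / "my_file_ASCII.txt"

    # Sample content: "déjà vu, 10 €" followed by a newline, stored as UTF-8.
    src.write_bytes("d\u00e9j\u00e0 vu, 10 \u20ac\n".encode("utf-8"))

    convert_encoding(src, dst, "utf-8", "iso-8859-15")
    converted = dst.read_bytes()

    # ASCII has no accented letters and no euro sign, so strict mode fails.
    try:
        convert_encoding(src, ascii_dst, "utf-8", "ascii")
        ascii_failed = False
    except UnicodeEncodeError:
        ascii_failed = True

    # errors="replace" substitutes "?" for each unsupported character.
    convert_encoding(src, ascii_dst, "utf-8", "ascii", errors="replace")
    ascii_converted = ascii_dst.read_bytes()

print(converted)
print(ascii_failed)
print(ascii_converted)
```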
On Linux, the LANG environment variable sets the default locale, which includes the character encoding:
shell> echo $LANG
fr_FR.UTF-8
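The encoding is the part of the value after the dot. A program can extract it as in the sketch below; encoding_from_lang is a hypothetical helper written for this example, and it only looks at the string it is given. On a real system the LC_ALL and LC_CTYPE variables take precedence over LANG, which this sketch ignores.

```python
import os


def encoding_from_lang(lang):
    """Return the encoding part of a value such as 'fr_FR.UTF-8', or None."""
    if "." not in lang:
        # Values such as "C" or "fr_FR" do not name an encoding.
        return None
    codeset = lang.split(".", 1)[1]
    # Drop an optional modifier such as "@euro".
    return codeset.split("@", 1)[0] or None


# Prints the encoding named by the current LANG value, or None if it is unset.
print(encoding_from_lang(os.environ.get("LANG", "")))
```

For the value shown above, encoding_from_lang("fr_FR.UTF-8") returns "UTF-8".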

More information

More complete information is available on Wikipedia and at www.utf-8.com