Character encoding

(February 2005)

Text files can be encoded in several ways. The choice matters most for languages that use accented characters (Spanish, French, German, ...).
This article compares four character encodings:

ASCII
ISO 8859-1
ISO 8859-15
UTF-8

Table of contents

Comparison

Display Issues

More information

Comparison

Here are examples of how characters are encoded.
The first column gives the glyph; the other columns give the encoded bytes in hexadecimal.

Character	ASCII	ISO 8859-1	ISO 8859-15	UTF-8
a	61	61	61	61
{	7B	7B	7B	7B
é	N/A ⁽¹⁾	E9	E9	C3A9
€	N/A ⁽¹⁾	N/A ⁽¹⁾	A4	E282AC

(1) This character does not exist in the given encoding.
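
The byte values for each encoding can be checked with a few lines of Python 3. This is a minimal sketch: the helper name hex_bytes is only illustrative, and the codec names are the ones Python's standard library uses for these four encodings.

```python
# Encode a few characters in each of the four encodings and print the bytes in hex.
ENCODINGS = ["ascii", "iso-8859-1", "iso-8859-15", "utf-8"]

def hex_bytes(char, encoding):
    """Return the encoded bytes as uppercase hex, or "N/A" if not encodable."""
    try:
        return char.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        # The character does not exist in this encoding.
        return "N/A"

for char in ["a", "{", "é", "€"]:
    row = [hex_bytes(char, enc) for enc in ENCODINGS]
    print(char, *row, sep="\t")
```

Note that the euro sign takes three bytes in UTF-8, while é takes two.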

Characteristics of these encodings:

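ASCII defines only 128 characters (7 bits), none of them accented.
ISO 8859-1 and ISO 8859-15 are single-byte encodings that differ in only 8 code positions.
For instance, ISO 8859-15 adds € (euro sign) and Œ, which do not exist in ISO 8859-1.
UTF-8 can encode every Unicode character, using 1 to 4 bytes per character; ASCII characters keep their 1-byte ASCII values.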
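
The positions where ISO 8859-1 and ISO 8859-15 disagree can be listed by decoding every byte value with both encodings. The sketch below assumes Python 3 and its built-in codecs; the variable names are illustrative.

```python
# Decode every byte value with both encodings and keep the ones that disagree.
differences = []
for value in range(256):
    raw = bytes([value])
    latin1 = raw.decode("iso-8859-1")
    latin9 = raw.decode("iso-8859-15")
    if latin1 != latin9:
        differences.append((value, latin1, latin9))

# Print: byte value in hex, character in ISO 8859-1, character in ISO 8859-15.
for value, latin1, latin9 in differences:
    print(f"{value:02X}\t{latin1}\t{latin9}")
```

The loop reports 8 positions, among them A4, which is ¤ in ISO 8859-1 and € in ISO 8859-15.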

Display Issues

Text written in ISO 8859-1 will not display correctly in software that interprets it as UTF-8: accented characters are replaced by wrong or placeholder symbols.
This often happens with emails, and can usually be corrected by selecting the right encoding in one of the application's menus.
[Figure: example of text displayed with a wrong encoding]

[Figure: the same text displayed with the correct encoding]
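
The effect can also be reproduced in Python 3 by encoding a string with one encoding and decoding it with another. This is only a sketch of what a viewer does internally; the sample text and variable names are assumptions chosen for illustration.

```python
text = "déjà vu"

# ISO 8859-1 bytes read as UTF-8: the lone bytes E9 and E0 are invalid UTF-8,
# so each one is shown as the replacement character U+FFFD.
latin1_as_utf8 = text.encode("iso-8859-1").decode("utf-8", errors="replace")

# UTF-8 bytes read as ISO 8859-1: each 2-byte sequence appears as 2 characters.
utf8_as_latin1 = text.encode("utf-8").decode("iso-8859-1")

# Decoding with the encoding that was used for writing restores the text.
correct = text.encode("iso-8859-1").decode("iso-8859-1")

print(latin1_as_utf8)
print(utf8_as_latin1)
print(correct)
```

In the second case the text shows the familiar "Ã©" in place of "é", which is a good hint that UTF-8 data is being read as ISO 8859-1.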

In programming languages where strings are sequences of bytes (C, PHP, Python 2, ...), the character encoding of the source code determines the length of string literals.
The following Python 2 code gives a different result depending on whether the source file is encoded in ISO 8859-1 or UTF-8 (in Python 3, len counts characters and always returns 7):

len("déjà vu")

This returns the length of the string in bytes:

7 if the encoding is ISO 8859-1, where every character takes 1 byte
9 if the encoding is UTF-8, because é and à are each encoded as 2 bytes
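
Both lengths can be obtained explicitly in Python 3, where a str holds characters and the byte count depends on the encoding chosen when calling encode. The sketch assumes the script itself is saved as UTF-8, the Python 3 default.

```python
text = "déjà vu"

char_count = len(text)                        # number of characters
latin1_bytes = len(text.encode("iso-8859-1")) # number of bytes in ISO 8859-1
utf8_bytes = len(text.encode("utf-8"))        # number of bytes in UTF-8

print(char_count, latin1_bytes, utf8_bytes)
```

The two byte counts match the values 7 and 9 given above.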

On Linux systems, iconv is a useful tool for converting a text document from one encoding to another:

shell> iconv -f UTF-8 -t ISO-8859-15 my_file_UTF_8.txt > my_file_ISO_8859_15.txt
shell> iconv -f UTF-8 -t ASCII my_file_UTF_8.txt > my_file_ASCII.txt
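
Where iconv is not available, a similar conversion can be sketched in Python 3. The function convert_file below is a hypothetical helper written for this example, not part of iconv or of the standard library, and the sample text is made up.

```python
import tempfile
from pathlib import Path

def convert_file(src, dst, from_encoding, to_encoding, errors="strict"):
    """Re-encode the text file src into dst (hypothetical helper)."""
    text = Path(src).read_bytes().decode(from_encoding)
    # With errors="strict", a character missing from the target encoding
    # raises UnicodeEncodeError instead of being silently altered.
    Path(dst).write_bytes(text.encode(to_encoding, errors=errors))

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "my_file_UTF_8.txt"
    dst = Path(tmp) / "my_file_ISO_8859_15.txt"
    src.write_bytes("déjà vu: 10 €\n".encode("utf-8"))
    convert_file(src, dst, "utf-8", "iso-8859-15")
    converted = dst.read_bytes()
    print(converted)
```

Converting the same file to ASCII with the default errors="strict" raises an error, because é, à and € have no ASCII equivalent; passing errors="replace" substitutes a question mark for each of them.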

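On Linux, the LANG environment variable sets the default locale, including the character encoding (the part after the dot). The LC_ALL and LC_CTYPE variables, when set, take precedence over LANG.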

shell> echo $LANG
fr_FR.UTF-8
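
A program can read the same information. The Python 3 sketch below splits a locale name of the usual form language_TERRITORY.codeset@modifier; parse_lang is a hypothetical helper for illustration, while os.environ and locale.getpreferredencoding come from the standard library.

```python
import locale
import os

def parse_lang(value):
    """Split a locale name such as "fr_FR.UTF-8" into (language, territory, codeset).

    Hypothetical helper for illustration; missing parts are returned as None.
    """
    name, _, _modifier = value.partition("@")   # drop an optional @modifier
    name, _, codeset = name.partition(".")
    language, _, territory = name.partition("_")
    return language, territory or None, codeset or None

print(parse_lang("fr_FR.UTF-8"))
print(os.environ.get("LANG"))               # raw value, None if LANG is unset
print(locale.getpreferredencoding(False))   # encoding Python derives from the locale
```

The last two lines depend on the machine where the script runs, so their output will vary.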

More information

There is more complete information on Wikipedia: