Diff Algorithm
(April 2006)
This page describes a diff algorithm which uses a slightly modified Levenshtein matrix.
View the Diff Tool demo
The following diagram shows 2 ways to transform a source string to another string (the destination string).
Allowed transformations are:
- keep the letter (when no change is needed) - a diagonal line is drawn
- insert a letter - a vertical line
- remove a letter - a horizontal line
The source string is 'THE CALL OF THE WILD' (indexed by i).
The destination string is 'THE WOLF IS WILD' (indexed by j).
- At the beginning, 4 letters are the same and need not be changed.
- At this point there are two possibilities: delete the 'C' or insert a 'W'.
- 'W' and 'O' have been inserted.
- 'C' and 'A' have been removed.
- At this point the next character (a space) is a space and need not be changed.
- -
- This is the end: the source string has been transformed in the destination string.
- On the blue path, 'CALL OF THE' is removed.
- 'WOLF IS' is inserted.
What we want is find the simpliest path.
There we need the Levenshtein matrix.
The Levenshtein matrix gives the minimum cost to transform the source string to the destination string.
If a letter is inserted or removed, the cost is increased by 1. If the letter is not changed, the cost is not changed.
| |
T |
H |
E |
|
C |
A |
L |
L |
|
O |
F |
|
T |
H |
E |
|
W |
I |
L |
D |
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
T |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
H |
2 |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
E |
3 |
2 |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
|
4 |
3 |
2 |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
W |
5 |
4 |
3 |
2 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
12 |
13 |
14 |
15 |
O |
6 |
5 |
4 |
3 |
2 |
3 |
4 |
5 |
6 |
7 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
L |
7 |
6 |
5 |
4 |
3 |
4 |
5 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
14 |
15 |
F |
8 |
7 |
6 |
5 |
4 |
5 |
6 |
5 |
6 |
7 |
8 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
|
9 |
8 |
7 |
6 |
5 |
6 |
7 |
6 |
7 |
6 |
7 |
8 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
I |
10 |
9 |
8 |
7 |
6 |
7 |
8 |
7 |
8 |
7 |
8 |
9 |
8 |
9 |
10 |
11 |
12 |
13 |
12 |
13 |
14 |
S |
11 |
10 |
9 |
8 |
7 |
8 |
9 |
8 |
9 |
8 |
9 |
10 |
9 |
10 |
11 |
12 |
13 |
14 |
13 |
14 |
15 |
|
12 |
11 |
10 |
9 |
8 |
9 |
10 |
9 |
10 |
9 |
10 |
11 |
10 |
11 |
12 |
13 |
12 |
13 |
14 |
15 |
16 |
W |
13 |
12 |
11 |
10 |
9 |
10 |
11 |
10 |
11 |
10 |
11 |
12 |
11 |
12 |
13 |
14 |
13 |
12 |
13 |
14 |
15 |
I |
14 |
13 |
12 |
11 |
10 |
11 |
12 |
11 |
12 |
11 |
12 |
13 |
12 |
13 |
14 |
15 |
14 |
13 |
12 |
13 |
14 |
L |
15 |
14 |
13 |
12 |
11 |
12 |
13 |
12 |
11 |
12 |
13 |
14 |
13 |
14 |
15 |
16 |
15 |
14 |
13 |
12 |
13 |
D |
16 |
15 |
14 |
13 |
12 |
13 |
14 |
13 |
12 |
13 |
14 |
15 |
14 |
15 |
16 |
17 |
16 |
15 |
14 |
13 |
12 |
How to read this matrix ?
The blue cell indicates the minimum cost to transform 'THE CAL' into 'TH' (5 characters to remove).
Each cell contains the minimum cost.
And now, to find the simpliest path, we walk through this matrix starting from the right-bottom corner, and follow the lowest values (in green): if the letters are the same, go up-left in diagonal, else, go to the lowest value (up or left). Several paths are possible and we chose one:
| |
T |
H |
E |
|
C |
A |
L |
L |
|
O |
F |
|
T |
H |
E |
|
W |
I |
L |
D |
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
T |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
H |
2 |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
E |
3 |
2 |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
|
4 |
3 |
2 |
1 |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
W |
5 |
4 |
3 |
2 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
12 |
13 |
14 |
15 |
O |
6 |
5 |
4 |
3 |
2 |
3 |
4 |
5 |
6 |
7 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
L |
7 |
6 |
5 |
4 |
3 |
4 |
5 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
14 |
15 |
F |
8 |
7 |
6 |
5 |
4 |
5 |
6 |
5 |
6 |
7 |
8 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
|
9 |
8 |
7 |
6 |
5 |
6 |
7 |
6 |
7 |
6 |
7 |
8 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
I |
10 |
9 |
8 |
7 |
6 |
7 |
8 |
7 |
8 |
7 |
8 |
9 |
8 |
9 |
10 |
11 |
12 |
13 |
12 |
13 |
14 |
S |
11 |
10 |
9 |
8 |
7 |
8 |
9 |
8 |
9 |
8 |
9 |
10 |
9 |
10 |
11 |
12 |
13 |
14 |
13 |
14 |
15 |
|
12 |
11 |
10 |
9 |
8 |
9 |
10 |
9 |
10 |
9 |
10 |
11 |
10 |
11 |
12 |
13 |
12 |
13 |
14 |
15 |
16 |
W |
13 |
12 |
11 |
10 |
9 |
10 |
11 |
10 |
11 |
10 |
11 |
12 |
11 |
12 |
13 |
14 |
13 |
12 |
13 |
14 |
15 |
I |
14 |
13 |
12 |
11 |
10 |
11 |
12 |
11 |
12 |
11 |
12 |
13 |
12 |
13 |
14 |
15 |
14 |
13 |
12 |
13 |
14 |
L |
15 |
14 |
13 |
12 |
11 |
12 |
13 |
12 |
11 |
12 |
13 |
14 |
13 |
14 |
15 |
16 |
15 |
14 |
13 |
12 |
13 |
D |
16 |
15 |
14 |
13 |
12 |
13 |
14 |
13 |
12 |
13 |
14 |
15 |
14 |
15 |
16 |
17 |
16 |
15 |
14 |
13 |
12 |
In others words:
= THE
- CALL
= (space)
+ W
= O
+ L
= F
- THE
+ IS
= WILD
The following PHP function computes the diff between 2 strings.
You can test it in the demo.
<?
// implementation of a diff between 2 strings
// example of behaviour
// source = "abcdefghiZ..."
// dest = "abcdXefghi..."
// this function returns an array of this sort :
// '=', "abcd"
// '+', "X"
// '=', "efghi"
// '-', "Z"
// '=', "..."
function diff($source, $dest) {
$source_len = strlen($source);
$dest_len = strlen($dest);
// matrix of Levenshtein distance
// build the matrix
$d = array();
for ($i=0; $i<$source_len+1; $i++) $d[] = array_fill(0, $dest_len, 0);
// initialize the matrix
for ($i=0; $i<$source_len+1; $i++) $d[$i][0] = $i;
for ($j=0; $j<$dest_len+1; $j++) $d[0][$j] = $j;
// compute the matrix
for ($i=0; $i<$source_len; $i++) {
for ($j=0; $j<$dest_len; $j++) {
if ($source[$i] == $dest[$j]) $d[$i+1][$j+1] = $d[$i][$j];
else $d[$i+1][$j+1] = min($d[$i][$j+1], $d[$i+1][$j])+1;
}
}
// follow the path starting from the end
// in order to find the optimized path
$stack = array();
$j = $dest_len;
$i = $source_len;
while ($i>0 and $j>0) {
if ($source[$i-1] == $dest[$j-1]) {
array_unshift($stack, array('=', $source[$i-1]));
$j = $j-1;
$i = $i-1;
} else {
// find the best cell
if ($d[$i-1][$j] < $d[$i][$j-1]) {
array_unshift($stack, array('-', $source[$i-1]));
$i = $i-1;
} else {
array_unshift($stack, array('+', $dest[$j-1]));
$j = $j-1;
}
}
}
if ($i>0) { // source characters to be removed
array_unshift($stack, array('-', substr($source, 0, $i-1)));
}
if ($j>0) { // dest characters to be added
array_unshift($stack, array('+', substr($dest, 0, $j-1)));
}
// simplify the stack (put +, - or = together)
$simplified_stack = array();
$simplified_stack[0] = $stack[0];
$i = 1;
$i_s = 0; // index in the simplified stack
while ($i < count($stack)) {
if ($stack[$i][0] == $simplified_stack[$i_s][0]) {
$simplified_stack[$i_s][1] .= $stack[$i][1];
} else {
$i_s ++;
$simplified_stack[$i_s] = $stack[$i];
}
$i ++;
}
return $simplified_stack;
} // end of function diff
?>