Diff Algorithm
(April 2006)

Table of contents

Introduction

The Matrix

The Simpliest Path

A PHP implementation

Introduction

This page describes a diff algorithm which uses a slightly modified Levenshtein matrix.

View the Diff Tool demo

The Matrix

The following diagram shows 2 ways to transform a source string to another string (the destination string).
Allowed transformations are: The source string is 'THE CALL OF THE WILD' (indexed by i).
The destination string is 'THE WOLF IS WILD' (indexed by j).

  1. At the beginning, 4 letters are the same and need not be changed.
  2. At this point there are two possibilities: delete the 'C' or insert a 'W'.
  3. 'W' and 'O' have been inserted.
  4. 'C' and 'A' have been removed.
  5. At this point the next character (a space) is a space and need not be changed.
  6. -
  7. This is the end: the source string has been transformed in the destination string.
  8. On the blue path, 'CALL OF THE' is removed.
  9. 'WOLF IS' is inserted.
What we want is find the simpliest path.

The Simpliest Path

There we need the Levenshtein matrix.
The Levenshtein matrix gives the minimum cost to transform the source string to the destination string.
If a letter is inserted or removed, the cost is increased by 1. If the letter is not changed, the cost is not changed.
T H E   C A L L   O F   T H E   W I L D
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
T 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
H 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
E 3 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 3 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
W 5 4 3 2 1 2 3 4 5 6 7 8 9 10 11 12 13 12 13 14 15
O 6 5 4 3 2 3 4 5 6 7 6 7 8 9 10 11 12 13 14 15 16
L 7 6 5 4 3 4 5 4 5 6 7 8 9 10 11 12 13 14 15 14 15
F 8 7 6 5 4 5 6 5 6 7 8 7 8 9 10 11 12 13 14 15 16
9 8 7 6 5 6 7 6 7 6 7 8 7 8 9 10 11 12 13 14 15
I 10 9 8 7 6 7 8 7 8 7 8 9 8 9 10 11 12 13 12 13 14
S 11 10 9 8 7 8 9 8 9 8 9 10 9 10 11 12 13 14 13 14 15
12 11 10 9 8 9 10 9 10 9 10 11 10 11 12 13 12 13 14 15 16
W 13 12 11 10 9 10 11 10 11 10 11 12 11 12 13 14 13 12 13 14 15
I 14 13 12 11 10 11 12 11 12 11 12 13 12 13 14 15 14 13 12 13 14
L 15 14 13 12 11 12 13 12 11 12 13 14 13 14 15 16 15 14 13 12 13
D 16 15 14 13 12 13 14 13 12 13 14 15 14 15 16 17 16 15 14 13 12
How to read this matrix ?
The blue cell indicates the minimum cost to transform 'THE CAL' into 'TH' (5 characters to remove).
Each cell contains the minimum cost.
And now, to find the simpliest path, we walk through this matrix starting from the right-bottom corner, and follow the lowest values (in green): if the letters are the same, go up-left in diagonal, else, go to the lowest value (up or left). Several paths are possible and we chose one:
T H E   C A L L   O F   T H E   W I L D
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
T 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
H 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
E 3 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4 3 2 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
W 5 4 3 2 1 2 3 4 5 6 7 8 9 10 11 12 13 12 13 14 15
O 6 5 4 3 2 3 4 5 6 7 6 7 8 9 10 11 12 13 14 15 16
L 7 6 5 4 3 4 5 4 5 6 7 8 9 10 11 12 13 14 15 14 15
F 8 7 6 5 4 5 6 5 6 7 8 7 8 9 10 11 12 13 14 15 16
9 8 7 6 5 6 7 6 7 6 7 8 7 8 9 10 11 12 13 14 15
I 10 9 8 7 6 7 8 7 8 7 8 9 8 9 10 11 12 13 12 13 14
S 11 10 9 8 7 8 9 8 9 8 9 10 9 10 11 12 13 14 13 14 15
12 11 10 9 8 9 10 9 10 9 10 11 10 11 12 13 12 13 14 15 16
W 13 12 11 10 9 10 11 10 11 10 11 12 11 12 13 14 13 12 13 14 15
I 14 13 12 11 10 11 12 11 12 11 12 13 12 13 14 15 14 13 12 13 14
L 15 14 13 12 11 12 13 12 11 12 13 14 13 14 15 16 15 14 13 12 13
D 16 15 14 13 12 13 14 13 12 13 14 15 14 15 16 17 16 15 14 13 12
In others words:
= THE
- CALL
=   (space)
+ W
= O
+ L
= F
- THE
+ IS
=  WILD

A PHP implementation

The following PHP function computes the diff between 2 strings.
You can test it in the demo.

<?
// implementation of a diff between 2 strings

// example of behaviour
// source = "abcdefghiZ..."
// dest =   "abcdXefghi..."
// this function returns an array of this sort :
// '=', "abcd"
// '+', "X"
// '=', "efghi"
// '-', "Z"
// '=', "..."
function diff($source, $dest) {
	$source_len = strlen($source);
	$dest_len = strlen($dest);

	// matrix of Levenshtein distance
	// build the matrix
	$d = array();
	for ($i=0; $i<$source_len+1; $i++) $d[] = array_fill(0, $dest_len, 0);

	// initialize the matrix
	for ($i=0; $i<$source_len+1; $i++) $d[$i][0] = $i;

	for ($j=0; $j<$dest_len+1; $j++) $d[0][$j] = $j;

	// compute the matrix
	for ($i=0; $i<$source_len; $i++) {
		for ($j=0; $j<$dest_len; $j++) {
			if ($source[$i] == $dest[$j]) $d[$i+1][$j+1] = $d[$i][$j];
			else $d[$i+1][$j+1] = min($d[$i][$j+1], $d[$i+1][$j])+1;
		}
	}

	// follow the path starting from the end
	// in order to find the optimized path
	$stack = array();
	$j = $dest_len;
	$i = $source_len;
	while ($i>0 and $j>0) {
		if ($source[$i-1] == $dest[$j-1]) {
			array_unshift($stack, array('=', $source[$i-1]));
			$j = $j-1;
			$i = $i-1;
		} else {
			// find the best cell
			if ($d[$i-1][$j] < $d[$i][$j-1]) {
				array_unshift($stack, array('-', $source[$i-1]));
				$i = $i-1;
			} else {
				array_unshift($stack, array('+', $dest[$j-1]));
				$j = $j-1;
			}
		}
	}
	if ($i>0) { // source characters to be removed
		array_unshift($stack, array('-', substr($source, 0, $i-1)));
	}
	if ($j>0) { // dest characters to be added
		array_unshift($stack, array('+', substr($dest, 0, $j-1)));
	}

	// simplify the stack (put +, - or = together)
	$simplified_stack = array();
	$simplified_stack[0] = $stack[0];
	$i = 1;
	$i_s = 0; // index in the simplified stack
	while ($i < count($stack)) {
		if ($stack[$i][0] == $simplified_stack[$i_s][0]) {
			$simplified_stack[$i_s][1] .= $stack[$i][1];
		} else {
			$i_s ++;
			$simplified_stack[$i_s] = $stack[$i];
		}
		$i ++;
	}
	return $simplified_stack;

} // end of function diff

?>