Sequence Alignment

Input Two finite words $x_1x_2...x_m$ and $y_1y_2...y_n$, the gap and mismatch penalties

Output The minimum cost alignment and its distance

The minimum edit distance or Levenshtein distance between two words is a measure of their similarity. Consider the word occurrence and the misspelling ocurrance. These could be aligned as follows:

o-currance
occurrence

where - indicates a gap. This distance between the two is a gap and one mismatch. They could also be aligned as

o-curr-ance
occurre-nce

which has three gaps and no mismatches. To determine which of these is better, there are different penalties for gaps and mismatches.

Let $\delta$ denote the gap penalty and $\alpha_{ab}$ denote the mismatch penalty for two letters $a$ and $b$.

A possible definition of $\alpha_{ab}$ is $$ \alpha_{ab} = \begin{cases} 0 & \text{if } a=b\\ 1 & \text{if } a \text{ and } b \text{ are both vowels or consonants}\\ 3 & \text{otherwise} \end{cases} $$

The cost of an alignment is therefore the minimum sum of gap and mismatch costs. The aim is to find the alignment of minimum cost between two words.

In an optimal alignment either

  • the last letters of $x$ and $y$ are matched
  • the last letter of $x$ is aligned with a gap
  • the last letter of $y$ is aligned with a gap

As a recurrence, this is $$ D[i,j] = \begin{cases} j \times \delta & i = 0\\ i \times \delta & j = 0\\ \min(\alpha_{x_iy_i}+D[i-1, j-1], \delta + D[i-1, j], \delta + D[i, j-1]) & \text{if } i \geq 1 \text{ and } j \geq 1 \end{cases} $$

Then

for j from 1 to n:
    for i from 1 to m:
        calculate D[i,j]
return D[m,n]

This takes $O(mn)$ time.

Finding an alignment in linear space

[TODO]