Sequence Alignment
Input Two finite words $x_1x_2...x_m$ and $y_1y_2...y_n$, the gap and mismatch penalties
Output The minimum cost alignment and its distance
The minimum edit distance or Levenshtein distance between two words is a measure of their similarity. Consider the word occurrence and the misspelling ocurrance. These could be aligned as follows:
o-currance
occurrence
where - indicates a gap. This distance between the two is a gap and one mismatch. They could also be aligned as
o-curr-ance
occurre-nce
which has three gaps and no mismatches. To determine which of these is better, there are different penalties for gaps and mismatches.
Let $\delta$ denote the gap penalty and $\alpha_{ab}$ denote the mismatch penalty for two letters $a$ and $b$.
A possible definition of $\alpha_{ab}$ is $$ \alpha_{ab} = \begin{cases} 0 & \text{if } a=b\\ 1 & \text{if } a \text{ and } b \text{ are both vowels or consonants}\\ 3 & \text{otherwise} \end{cases} $$
The cost of an alignment is therefore the minimum sum of gap and mismatch costs. The aim is to find the alignment of minimum cost between two words.
In an optimal alignment either
- the last letters of $x$ and $y$ are matched
- the last letter of $x$ is aligned with a gap
- the last letter of $y$ is aligned with a gap
As a recurrence, this is $$ D[i,j] = \begin{cases} j \times \delta & i = 0\\ i \times \delta & j = 0\\ \min(\alpha_{x_iy_i}+D[i-1, j-1], \delta + D[i-1, j], \delta + D[i, j-1]) & \text{if } i \geq 1 \text{ and } j \geq 1 \end{cases} $$
Then
for j from 1 to n:
for i from 1 to m:
calculate D[i,j]
return D[m,n]
This takes $O(mn)$ time.
Finding an alignment in linear space
[TODO]