Sequence alignment is a technique for defining the degree of similarity between two sequences. The essence of sequence alignment is best described by its premiere use case: genomics (there are many applications, but genomics is the most commonly cited). There are four types of bases in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T)1. For computation purposes, strands of DNA can be represented as strings using these four characters (e.g. AGGGCT…). Using this technique geneticist can compare regions of DNA. Comparing an unknown genomic region to a well known one can provide useful insights. Additionally, it can divine proximity in a phylogenetic tree in order to determine evolutionary descendants.
Suppose there are two strands of DNA:
AGGCA. Obviously, they are
different; however, consider the similarities once they are aligned.
A G G G C T
A G G - C A
Notice the gap placed between
C in the second string. Alignment
expertly inserts gaps into the strings to equalized their lengths. It may occur
that gaps are places in both strings thus increasing both of their lengths.
After alignment, each gap and mismatched character represents a penalty. In the
example above, there are two penalties: the gap inserted into the second string
and the disparate characters at the end of the strings. Each type of penalty has
a different associated cost as defined by domain experts. The sequence
alignment algorithm generates strings with the minimum possible sum of
The minimum penalty generated by sequence alignment is know as the Needleman-Wunsch (NW) score after the algorithm’s inventors. It was first published in the Journal of Molecular Biology in 1970.
- Natural language processing
- Analysing series of purchases over time
- Reconstructing dead languages
$O(nm)$ where $n$ is the length of the first string and $m$ is the length of the second2
inputs: x string 1 x_len length of x y string 2 y_len length of y gap_penalty penalty incurred by a gap penalty_matrix penalty for each possible combination of mismatched characters sequence_alignment: // Define a two dimensional array to hold sub problems solutions[x_len + 1][y_len + 1] // Base case 1 for i = 0 to len_x: solutions[i] = i * gap_penalty // Base case 2 for j = 0 to len_y: solutions[j] = j * gap_penalty for i = 1 to len_x: for j = 1 to len_y: if x[i - 1] == y [i - 1]: solutions[i][j] = solutions[i - 1][j - 1] else: case1 = solutions[i - 1][j - 1] + penalty_matrix[x[i - 1]][y[i - 1]] case2 = solutions[i - 1][j] + gap_penalty case3 = solutions[i][j - 1] + gap_penalty solutions[i][j] = min(case1, case2, case3) return solutions[x_len][y_len]
Click here for build and run instructions