Sequence alignment is a technique for defining the degree of similarity between two sequences. The essence of sequence alignment is best described by its premiere use case: genomics (there are many applications, but genomics is the most commonly cited). There are four types of bases in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T)^{1}. For computation purposes, strands of DNA can be represented as strings using these four characters (e.g. AGGGCT…). Using this technique geneticist can compare regions of DNA. Comparing an unknown genomic region to a well known one can provide useful insights. Additionally, it can divine proximity in a phylogenetic tree in order to determine evolutionary descendants.
Suppose there are two strands of DNA: AGGGCT
and AGGCA
. Obviously, they are
different; however, consider the similarities once they are aligned.
A G G G C T
A G G  C A
Notice the gap placed between G
and C
in the second string. Alignment
expertly inserts gaps into the strings to equalized their lengths. It may occur
that gaps are places in both strings thus increasing both of their lengths.
After alignment, each gap and mismatched character represents a penalty. In the
example above, there are two penalties: the gap inserted into the second string
and the disparate characters at the end of the strings. Each type of penalty has
a different associated cost as defined by domain experts. The sequence
alignment algorithm generates strings with the minimum possible sum of
penalties.
The minimum penalty generated by sequence alignment is know as the NeedlemanWunsch (NW) score after the algorithm’s inventors. It was first published in the Journal of Molecular Biology in 1970.
Applications
 Genomics
 Natural language processing
 Analysing series of purchases over time
 Reconstructing dead languages
Asymptotic Complexity
$O(nm)$ where $n$ is the length of the first string and $m$ is the length of the second^{2}
Pseudo Code
inputs:
x string 1
x_len length of x
y string 2
y_len length of y
gap_penalty penalty incurred by a gap
penalty_matrix penalty for each possible combination of mismatched characters
sequence_alignment:
// Define a two dimensional array to hold sub problems
solutions[x_len + 1][y_len + 1]
// Base case 1
for i = 0 to len_x:
solutions[i][0] = i * gap_penalty
// Base case 2
for j = 0 to len_y:
solutions[0][j] = j * gap_penalty
for i = 1 to len_x:
for j = 1 to len_y:
if x[i  1] == y [i  1]:
solutions[i][j] = solutions[i  1][j  1]
else:
case1 = solutions[i  1][j  1] + penalty_matrix[x[i  1]][y[i  1]]
case2 = solutions[i  1][j] + gap_penalty
case3 = solutions[i][j  1] + gap_penalty
solutions[i][j] = min(case1, case2, case3)
return solutions[x_len][y_len]
Source Code
Relevant Files:
Click here for build and run instructions

This is impressive given that with two strings of 500 characters each, there are 10^125 possible alignments. ↩