Under Construction: Consider this content a preview of the real thing which is coming soon (hopefully).
Sequence alignment is a technique for defining the degree of similarity between two sequences. The essence of sequence alignment is best described by its premiere use case: genomics (there are many applications, but genomics is the most commonly cited). There are four types of bases in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T)1. For computation purposes, strands of DNA can be represented as strings using these four characters (e.g. AGGGCT…). Using this technique geneticist can compare regions of DNA. Comparing an unknown genomic region to a well known one can provide useful insights. Additionally, it can divine proximity in a phylogenetic tree in order to determine evolutionary descendants.
Suppose there are two strands of DNA: AGGGCT
and AGGCA
. Obviously, they are
different; however, consider the similarities once they are aligned.
A G G G C T
A G G - C A
Notice the gap placed between G
and C
in the second string. Alignment
expertly inserts gaps into the strings to equalized their lengths. It may occur
that gaps are places in both strings thus increasing both of their lengths.
After alignment, each gap and mismatched character represents a penalty. In the
example above, there are two penalties: the gap inserted into the second string
and the disparate characters at the end of the strings. Each type of penalty has
a different associated cost as defined by domain experts. The sequence
alignment algorithm generates strings with the minimum possible sum of
penalties.
The minimum penalty generated by sequence alignment is know as the Needleman-Wunsch (NW) score after the algorithm’s inventors. It was first published in the Journal of Molecular Biology in 1970.
Applications
- Genomics
- Natural language processing
- Analysing series of purchases over time
- Reconstructing dead languages
Asymptotic Complexity
$O(nm)$ where $n$ is the length of the first string and $m$ is the length of the second2
Pseudo Code
\begin{algorithm}
\caption{Sequence Alignment}
\begin{algorithmic}
\REQUIRE $x$ = string with length $\gt 0$
\REQUIRE $y$ = string with length $\gt 0$
\REQUIRE $gp$ = penalty incurred by a gap in the string
\REQUIRE $P$ = matrix defining penalties for each combination of mismatched symbols
\OUTPUT sum of penalties for each mismatch
\FUNCTION{sequenceAlignment}{$x, y, gp, P$}
\STATE S $\gets$ 2-dimensional array of size $(x.\text{length} + 1) \times (y.\text{length} + 1)$
\STATE
\FOR{$i \gets 0$ to $x.\text{length}$}
\STATE $S_{i,0} \gets i \times gp$
\ENDFOR
\STATE
\FOR{$j \gets 0$ to $y.\text{length}$}
\STATE $S_{0,j} \gets j \times gp$
\ENDFOR
\STATE
\FOR{$i \gets 0$ to $x.\text{length}$}
\FOR{$j \gets 0$ to $y.\text{length}$}
\IF{$x_{i-1} = y_{i-1}$}
\STATE $S_{i,j} \gets S_{i-1,j-1}$
\ELSE
\STATE $c1 \gets S_{i-1,j-1} + P_{x_{i-1},y_{i - 1}}$
\STATE $c2 \gets S_{i-1,j} + gp$
\STATE $c3 \gets S_{i,j-1} + gp$
\STATE
\STATE $S_{i,j} = \min\{c1, c2, c3\}$
\ENDIF
\ENDFOR
\ENDFOR
\STATE
\RETURN $S_{x.\text{length},y.\text{length}}$
\ENDFUNCTION
\end{algorithmic}
\end{algorithm}
Source Code
Relevant Files:
Click here for build and run instructions
-
This is impressive given that with two strings of 500 characters each, there are 10^125 possible alignments. ↩