Sequence Alignment

Under Construction: Consider this content a preview of the real thing which is coming soon (hopefully).

Sequence alignment is a technique for defining the degree of similarity between two sequences. The essence of sequence alignment is best described by its premiere use case: genomics (there are many applications, but genomics is the most commonly cited). There are four types of bases in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T)1. For computation purposes, strands of DNA can be represented as strings using these four characters (e.g. AGGGCT…). Using this technique geneticist can compare regions of DNA. Comparing an unknown genomic region to a well known one can provide useful insights. Additionally, it can divine proximity in a phylogenetic tree in order to determine evolutionary descendants.

Suppose there are two strands of DNA: AGGGCT and AGGCA. Obviously, they are different; however, consider the similarities once they are aligned.

A G G G C T
A G G - C A

Notice the gap placed between G and C in the second string. Alignment expertly inserts gaps into the strings to equalized their lengths. It may occur that gaps are places in both strings thus increasing both of their lengths. After alignment, each gap and mismatched character represents a penalty. In the example above, there are two penalties: the gap inserted into the second string and the disparate characters at the end of the strings. Each type of penalty has a different associated cost as defined by domain experts. The sequence alignment algorithm generates strings with the minimum possible sum of penalties.

The minimum penalty generated by sequence alignment is know as the Needleman-Wunsch (NW) score after the algorithm’s inventors. It was first published in the Journal of Molecular Biology in 1970.

Applications

  • Genomics
  • Natural language processing
  • Analysing series of purchases over time
  • Reconstructing dead languages

Asymptotic Complexity

$O(nm)$ where $n$ is the length of the first string and $m$ is the length of the second2

Pseudo Code

\begin{algorithm}
\caption{Sequence Alignment}
\begin{algorithmic}
\REQUIRE $x$ = string with length $\gt 0$
\REQUIRE $y$ = string with length $\gt 0$
\REQUIRE $gp$ = penalty incurred by a gap in the string
\REQUIRE $P$ = matrix defining penalties for each combination of mismatched symbols
\OUTPUT sum of penalties for each mismatch
\FUNCTION{sequenceAlignment}{$x, y, gp, P$}
    \STATE S $\gets$ 2-dimensional array of size $(x.\text{length} + 1) \times (y.\text{length} + 1)$
    \STATE
    \FOR{$i \gets 0$ to $x.\text{length}$}
        \STATE $S_{i,0} \gets i \times gp$
    \ENDFOR
    \STATE
    \FOR{$j \gets 0$ to $y.\text{length}$}
        \STATE $S_{0,j} \gets j \times gp$
    \ENDFOR
    \STATE
    \FOR{$i \gets 0$ to $x.\text{length}$}
        \FOR{$j \gets 0$ to $y.\text{length}$}
            \IF{$x_{i-1} = y_{i-1}$}
                \STATE $S_{i,j} \gets S_{i-1,j-1}$
            \ELSE
                \STATE $c1 \gets S_{i-1,j-1} + P_{x_{i-1},y_{i - 1}}$
                \STATE $c2 \gets S_{i-1,j} + gp$
                \STATE $c3 \gets S_{i,j-1} + gp$
                \STATE
                \STATE $S_{i,j} = \min\{c1, c2, c3\}$
            \ENDIF
        \ENDFOR
    \ENDFOR
    \STATE
    \RETURN $S_{x.\text{length},y.\text{length}}$
\ENDFUNCTION
\end{algorithmic}
\end{algorithm}

Source Code

Full Repo

Relevant Files:

Click here for build and run instructions

  1. https://www.genome.gov/genetics-glossary/acgt 

  2. This is impressive given that with two strings of 500 characters each, there are 10^125 possible alignments.