Sequence Alignment

Sequence alignment is a technique for defining the degree of similarity between two sequences. The essence of sequence alignment is best described by its premiere use case: genomics (there are many applications, but genomics is the most commonly cited). There are four types of bases in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T)1. For computation purposes, strands of DNA can be represented as strings using these four characters (e.g. AGGGCT…). Using this technique geneticist can compare regions of DNA. Comparing an unknown genomic region to a well known one can provide useful insights. Additionally, it can divine proximity in a phylogenetic tree in order to determine evolutionary descendants.

Suppose there are two strands of DNA: AGGGCT and AGGCA. Obviously, they are different; however, consider the similarities once they are aligned.

A G G G C T
A G G - C A

Notice the gap placed between G and C in the second string. Alignment expertly inserts gaps into the strings to equalized their lengths. It may occur that gaps are places in both strings thus increasing both of their lengths. After alignment, each gap and mismatched character represents a penalty. In the example above, there are two penalties: the gap inserted into the second string and the disparate characters at the end of the strings. Each type of penalty has a different associated cost as defined by domain experts. The sequence alignment algorithm generates strings with the minimum possible sum of penalties.

The minimum penalty generated by sequence alignment is know as the Needleman-Wunsch (NW) score after the algorithm’s inventors. It was first published in the Journal of Molecular Biology in 1970.

Applications

  • Genomics
  • Natural language processing
  • Analysing series of purchases over time
  • Reconstructing dead languages

Asymptotic Complexity

$O(nm)$ where $n$ is the length of the first string and $m$ is the length of the second2

Pseudo Code

inputs:
    x               string 1
    x_len           length of x
    y               string 2
    y_len           length of y
    gap_penalty     penalty incurred by a gap
    penalty_matrix  penalty for each possible combination of mismatched characters

sequence_alignment:
    // Define a two dimensional array to hold sub problems
    solutions[x_len + 1][y_len + 1]

    // Base case 1
    for i = 0 to len_x:
        solutions[i][0] = i * gap_penalty

    // Base case 2
    for j = 0 to len_y:
        solutions[0][j] = j * gap_penalty

    for i = 1 to len_x:
        for j = 1 to len_y:
            if x[i - 1] == y [i - 1]:
                solutions[i][j] = solutions[i - 1][j - 1]
            else:
                case1 = solutions[i - 1][j - 1] + penalty_matrix[x[i - 1]][y[i - 1]]
                case2 = solutions[i - 1][j] + gap_penalty
                case3 = solutions[i][j - 1] + gap_penalty

                solutions[i][j] = min(case1, case2, case3)

    return solutions[x_len][y_len]

Source Code

Full Repo

Relevant Files:

Click here for build and run instructions

  1. https://www.genome.gov/genetics-glossary/acgt 

  2. This is impressive given that with two strings of 500 characters each, there are 10^125 possible alignments.