Identifying Promoter Regions Using the Viterbi Algorithm

  • Promoter regions: regions upstream of (or adjacent to) a gene's transcription start site that RNA polymerase binds to
  • Coding regions: the portions of the DNA or RNA that code for proteins
  • Termination regions: regions downstream of a gene that signal RNA polymerase to stop transcribing and detach

Hidden Markov Models

Example of a Hidden Markov Model.
  • Observations/emissions: the visible outputs that the model is given. For this particular model, these are “Walk”, “Shop”, and “Clean”.
  • States: the “hidden” values that we are trying to infer from the observations using probability analysis. For this particular model, these are “Rainy” and “Sunny”.
  • Initial probabilities: the probabilities of each state at the first step. For example, there is a 30% chance of starting in the Rainy state; in other words, P(Rainy) = 0.3.
  • Transition probabilities: the probabilities of moving from one state to another. For example, if t_k is the state at step k, the chance of going from Rainy to Sunny is P(t_{k+1} = “Sunny” | t_k = “Rainy”) = 0.6.
  • Emission probabilities: the probabilities that a state produces a particular emission. For instance, P(Shop | Rainy) = 0.4.
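
To see how these pieces fit together, consider the probability that the weather is Rainy then Sunny while we observe Shop then Walk: multiplying the initial, emission, and transition probabilities along the path gives P(Rainy) × P(Shop | Rainy) × P(Sunny | Rainy) × P(Walk | Sunny) = 0.3 × 0.4 × 0.6 × 0.6 = 0.0432. The Viterbi algorithm discussed below searches over all such paths for the one with the highest probability.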

Converting to Code

from Bio.HMM.MarkovModel import HiddenMarkovModel
from Bio.Seq import Seq

# Hidden states and possible emissions
transition_alphabet = ['Rainy', 'Sunny']
emission_alphabet = ['W', 'S', 'C']  # W - walk, S - shop, C - clean

# Probability of each state at the first step
initial_prob = {'Rainy': 0.3, 'Sunny': 0.7}

# Probability of moving from one state to another
transition_prob = {('Rainy', 'Rainy'): 0.4,
                   ('Rainy', 'Sunny'): 0.6,
                   ('Sunny', 'Rainy'): 0.2,
                   ('Sunny', 'Sunny'): 0.8}

# Probability of each state producing each emission
emission_prob = {('Rainy', 'W'): 0.1,
                 ('Rainy', 'S'): 0.4,
                 ('Rainy', 'C'): 0.5,
                 ('Sunny', 'W'): 0.6,
                 ('Sunny', 'S'): 0.3,
                 ('Sunny', 'C'): 0.1}

model = HiddenMarkovModel(transition_alphabet, emission_alphabet, initial_prob,
                          transition_prob, emission_prob, 1, 1)

Viterbi Algorithm

model.viterbi(Seq('WSC'), transition_alphabet)
(Seq('SunnySunnyRainy'), -4.5972020163389145)

The returned tuple contains the most likely sequence of hidden states, Sunny → Sunny → Rainy, and the natural log of that path's probability.
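We can check this score by hand: the probability of the path Sunny → Sunny → Rainy emitting W, S, C is 0.7 × 0.6 × 0.8 × 0.3 × 0.2 × 0.5 = 0.01008, and ln(0.01008) ≈ -4.5972, which is exactly the score returned above.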

Identifying Promoter Regions

Example of a Hidden Markov Model in biology.
from Bio.HMM.MarkovModel import HiddenMarkovModel

# Hidden states: B - background, P - promoter
transition_alphabet = ['B', 'P']
emission_alphabet = ['A', 'T', 'C', 'G']

initial_prob = {'B': 0.5, 'P': 0.5}

# Regions tend to persist: background stays background 85% of the time,
# promoters stay promoters 75% of the time
transition_prob = {('B', 'B'): 0.85,
                   ('B', 'P'): 0.15,
                   ('P', 'P'): 0.75,
                   ('P', 'B'): 0.25}

# Background emits all four bases uniformly; promoters are GC-rich
emission_prob = {('B', 'A'): 0.25,
                 ('B', 'T'): 0.25,
                 ('B', 'C'): 0.25,
                 ('B', 'G'): 0.25,
                 ('P', 'A'): 0.15,
                 ('P', 'T'): 0.13,
                 ('P', 'C'): 0.42,
                 ('P', 'G'): 0.3}

model = HiddenMarkovModel(transition_alphabet, emission_alphabet, initial_prob,
                          transition_prob, emission_prob, 1, 1)
model.viterbi(Seq('AGTACACTGGT'), transition_alphabet)
(Seq('BBBBBBBBBBB'), -17.567574447856494)

model.viterbi(Seq('GCGCGCGCAA'), transition_alphabet)
(Seq('PPPPPPPPBB'), -15.314217188702496)

The mixed first sequence is labeled entirely as background, while the first eight GC-rich bases of the second sequence are labeled as a promoter region.
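Why does the second sequence decode as promoter? Under the emission probabilities above, the P state gives C and G a combined probability of 0.72 versus 0.5 under B, so GC-rich stretches score higher as promoter. A quick illustrative check of the GC content of the two test sequences:

for s in ('AGTACACTGGT', 'GCGCGCGCAA'):
    gc = (s.count('G') + s.count('C')) / len(s)
    print(s, 'GC fraction = %.2f' % gc)  # 0.45 for the first, 0.80 for the second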

HMMs with “Memory”

The two-state model treats every base independently, but the outputs below come from a richer model that remembers the previous nucleotide: each region is split into four states, one per base ('A+' through 'T+' for promoter, 'A-' through 'T-' for background). Each array below holds 16 dinucleotide probabilities of the form P(next base | current base) (each group of four sums to 1), and scaling them by the region-switching probabilities from before (0.75/0.25 for promoter, 0.85/0.15 for background) yields the transitions between the new states.

# Dinucleotide probabilities P(next base | current base) inside promoters
array_promoter = [0.18, 0.274, 0.426, 0.12, 0.171, 0.368, 0.274, 0.188, 0.161, 0.339, 0.375, 0.125, 0.079, 0.355, 0.384, 0.182]
vals_pp = [0.75 * val for val in array_promoter]  # stay in promoter
vals_pb = [0.25 * val for val in array_promoter]  # promoter -> background

# Dinucleotide probabilities inside background regions
array_background = [0.3, 0.205, 0.285, 0.21, 0.322, 0.298, 0.078, 0.302, 0.248, 0.246, 0.298, 0.208, 0.177, 0.239, 0.292, 0.292]
vals_bp = [0.15 * val for val in array_background]  # background -> promoter
vals_bb = [0.85 * val for val in array_background]  # stay in background

model.viterbi(Seq('AGTACACTGGT'), transition_alphabet)
(Seq('A-G-T-A-C-A-C-T-G-G-T-'), -17.77362274426406)

model.viterbi(Seq('GCGCGCGCAA'), transition_alphabet)
(Seq('G+C+G+C+G+C+G+C-A-A-'), -16.064944938458574)

The decoded states now carry both the emitted base and the region label, and the model still flags the first eight GC-rich bases as promoter. The construction of the eight-state model itself is not shown here; one possible sketch follows.
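
A minimal sketch of how these weights could be assembled into a model, assuming the 16 values are flattened row by row in the A, T, C, G order of the emission alphabet; the state names, uniform initial distribution, and deterministic emissions are illustrative assumptions:

nucleotides = ['A', 'T', 'C', 'G']  # assumed ordering of the 16 values
state_alphabet = [nt + '+' for nt in nucleotides] + [nt + '-' for nt in nucleotides]

# One transition per (current base, region) -> (next base, region) pair
transition_prob = {}
for i, cur in enumerate(nucleotides):
    for j, nxt in enumerate(nucleotides):
        k = 4 * i + j  # index of the (cur, nxt) dinucleotide
        transition_prob[(cur + '+', nxt + '+')] = vals_pp[k]
        transition_prob[(cur + '+', nxt + '-')] = vals_pb[k]
        transition_prob[(cur + '-', nxt + '+')] = vals_bp[k]
        transition_prob[(cur + '-', nxt + '-')] = vals_bb[k]

# Each state deterministically emits the base in its own name
emission_prob = {(st, nt): (1.0 if st[0] == nt else 0.0)
                 for st in state_alphabet for nt in nucleotides}

# Assumed: equal chance of starting in any of the eight states
initial_prob = {st: 1.0 / len(state_alphabet) for st in state_alphabet}

model = HiddenMarkovModel(state_alphabet, nucleotides, initial_prob,
                          transition_prob, emission_prob, 1, 1)
# The viterbi() calls above would then pass state_alphabet as the state list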

Viterbi Algorithm

def viterbi_algorithm(observations, states, start_p, trans_p, emit_p):
    # V[t][state] stores the best probability of any state path ending in
    # `state` at step t, plus a pointer to the previous state on that path
    V = [{}]
    for st in states:
        V[0][st] = {"prob": start_p[st] * emit_p[st][observations[0]], "prev": None}

    # Fill in the dynamic programming table one observation at a time
    for t in range(1, len(observations)):
        V.append({})
        for st in states:
            max_tr_prob = V[t - 1][states[0]]["prob"] * trans_p[states[0]][st]
            prev_st_selected = states[0]
            for prev_st in states[1:]:
                tr_prob = V[t - 1][prev_st]["prob"] * trans_p[prev_st][st]
                if tr_prob > max_tr_prob:
                    max_tr_prob = tr_prob
                    prev_st_selected = prev_st
            max_prob = max_tr_prob * emit_p[st][observations[t]]
            V[t][st] = {"prob": max_prob, "prev": prev_st_selected}

    # Find the most probable final state...
    opt = []
    max_prob = 0.0
    best_st = None
    for st, data in V[-1].items():
        if data["prob"] > max_prob:
            max_prob = data["prob"]
            best_st = st
    opt.append(best_st)

    # ...and trace the stored pointers back to recover the full path
    previous = best_st
    for t in range(len(V) - 2, -1, -1):
        opt.insert(0, V[t + 1][previous]["prev"])
        previous = V[t + 1][previous]["prev"]
    print(" ".join(opt) + ", probability = %s" % max_prob)
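
As a usage sketch, here is the two-state promoter model from earlier rearranged into the nested-dict layout this function expects:

states = ['B', 'P']
start_p = {'B': 0.5, 'P': 0.5}
trans_p = {'B': {'B': 0.85, 'P': 0.15},
           'P': {'B': 0.25, 'P': 0.75}}
emit_p = {'B': {'A': 0.25, 'T': 0.25, 'C': 0.25, 'G': 0.25},
          'P': {'A': 0.15, 'T': 0.13, 'C': 0.42, 'G': 0.3}}
viterbi_algorithm('GCGCGCGCAA', states, start_p, trans_p, emit_p)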
Final result of the self-coded Viterbi algorithm, run on the memory model from the previous section:

HiddenMarkovModel Implementation
G+C+G+C+G+C+G+C-A-A-, probability = -16.064944938458574

Own Implementation
G+C+G+C+G+C+G+C-A-A-, probability = 1.0545885728228732e-07

Both implementations recover the same state path. The scores differ only in scale: Biopython reports the natural log of the path probability, while the self-coded version prints the raw probability; the two agree, as the quick check below shows.
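
import math
math.log(1.0545885728228732e-07)  # ≈ -16.0649, matching Biopython's log score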

TL;DR

  • Hidden Markov Models have both hidden states and observable emissions, linked by initial, transition, and emission probabilities.
  • Hidden Markov Models can be used to infer whether a stretch of DNA is a promoter or background region from its nucleotide sequence.
  • The Viterbi algorithm is a dynamic programming algorithm that finds the most likely state path by combining transition and emission probabilities and keeping a pointer matrix to trace back through.
