Hidden Markov Models to Generate Shakespearean Sonnets
During the winter quarter of my sophomore year at Caltech, I took CS 155, a graduate class on Machine Learning and Data Mining taught by Professor Yisong Yue. One of our last week-long assignments was to individually implement a Hidden Markov Model (HMM) using the Viterbi, forward, and backward algorithms. Following this, I formed a group with my friends (Andrew, Basel, and Julen), and we used the HMMs we had individually constructed to learn the transition and emission matrices of Shakespearean words from a corpus of texts, using the CMU Pronouncing Dictionary to determine each word's stresses. We then used the Baum-Welch algorithm to train the model and generate somewhat meaningful Shakespearean sonnets while preserving the stress-unstress syllable pattern as an invariant. Here, I showcase some of the sonnets generated by our algorithm, explain (in some technical detail) the underlying algorithms involved in these tasks, and embed the final reports that we presented to the class.
Sonnet 1
Thou art too great with gentle work did frame
The lovely gaze where every eye doth dwell
Will play the tyrants to the very same,
And that unfair which fairly doth excel:
For never-resting time leads summer on
To hideous winter and confounds him there,
Sap checked with frost and lusty leaves quite gone,
Beauty o’er-snowed and bareness every where:
Then were not summer’s distillation left
A liquid prisoner pent in walls of glass,
Beauty’s effect with beauty were bereft,
Nor it nor no remembrance what it was.
But flowers distilled though they with winter meet,
Leese but their show, their substance still lives sweet.
Sonnet 2
Still doth both painted cheek presents
Spite how some bring thine predicts
Glad it now beseeches pleasantly for
Earth such o'ercharged cries and pricked
Charge or praise ensconce gluttoning attending
I need thou to see another's
Enmity in her will sees bending
And what about the untrimmed others'
In dost eye sees savoured smoke
Your cry now catch'th that twice
Hath she sometime although shamefully broke
Thou devise at night a form
For hours affairs off my told
Is thine you fiery of fold
Sonnet 3
The sight 'tis not sweet as seeing three alone
Death's never comforted of thy acquainted sort
Not henceforth o for mark three is told greetings
Let so all the men in this world seem attained
But hours and hours upon keen eyes
She never said thy luck to thee
Said I when thou an accident unseen
Who felt nothing when thy sweeet been down in me
How seek himself rank the clouds i praise
Up in strange short of thy fair back
Cloud be the soul doth keep times of old days
On where you created air upon me self
To give the argument to yield foiled friends
So edge thine secret win some strange gems
Sonnet 4
Rather than the torment which touches
Memorials which courses some freezing rides
Dost why does me not orient
He writs on to my perforce
Have some fine time now alone
The rude policy they do follow
Verse dost I sight thy line
His food now short and flatter
In the rain you are beyond
What dost fair deserving nights for
Bond with ransom on thy offenders
Of the might for sweet light's
Do excuse towards your fair woman's
Though who should affort this plain
Sonnet 5
Can unto him and mortal gathered hold
Remembered ransom and for do asleep
That ills; how did I not cross afold
The stewards and that stranger are for steep
Is of consumed with others not in shine
And poet's you with which to fair and pride
In any way dost thine and my will bind
For hours it is to wit the rebel-eyed
In how it flattery answers smothered where
The kind of master; a painting's fault
It is his and motion to thine must bear
Be my will thine shall do all to halt
The lusty eyes familiar; my vows they remember
These are thine and to though I will be loud
Sonnet 6
Could you let the youth deep in
Heaven's translated you as doth
Exchanged those prime but kind among
And yet it sets to see me make
Men look canst at that so hugely
Lest forgot my least all time's
Alas hate leads thee to your heart i
Which I need beguiled to make crystal
Ah what jacks the man makes
Seen my fair hair he knows what of me
Thee pebbled you distempered deceived but sweet
Hell see love for hours of mine
For brought; let that eclipse of mine dear
Thus so all and so then
Consider the following set-up. It's actually the simplest Hidden Markov Model. It initializes itself to one state (or one vertex), and then stays at that vertex with probability \(P(A|A)\) or \(P(B|B)\), or transitions to the other vertex with probability \(P(A|B)\) or \(P(B|A)\). In this model, vertex A emits a sample from the uniform distribution over \(\{1,...,6\}\), and vertex B emits a sample from the uniform distribution over \(\{1,...,4\}\). The initial distribution is \(\pi = (\pi_A, \pi_B)\). In real-world applications, the vertices (hidden states) of the Hidden Markov Model are categories that describe how the output of the model evolves over time.
The questions that the Hidden Markov Model answers are these: given an observed sequence sampled from this model, say \(1,4,3,6,6,4\), can you predict the most likely hidden state sequence, and what is the probability of this observation occurring? Finally, for any element of the observation that could have come from either \(A\) or \(B\), for instance any number in \(\{1,2,3,4\}\), what is the probability that the element came from \(A\) (and, by contrast, from \(B\))?
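To make this concrete, here is a minimal sketch of the two-vertex dice model in Python. The transition probabilities and variable names are illustrative assumptions for the sake of the example, not values from the original assignment.

import numpy as np

# Hypothetical two-state HMM: vertex A emits a fair 6-sided die, vertex B a fair 4-sided die.
states = ['A', 'B']
pi = np.array([0.5, 0.5])              # initial distribution (pi_A, pi_B), assumed uniform
A = np.array([[0.7, 0.3],              # row A: P(A|A), P(B|A)
              [0.4, 0.6]])             # row B: P(A|B), P(B|B); rows index the current vertex
O = np.array([[1/6] * 6,               # vertex A: uniform over {1,...,6}
              [1/4] * 4 + [0, 0]])     # vertex B: uniform over {1,...,4}, never emits 5 or 6
x = [1, 4, 3, 6, 6, 4]                 # the observed sequence from the question above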
These questions are actually well-answered by the Hidden Markov Model algorithms. The most likely sequence, given some observations, is predicted by the Viterbi algorithm, and its corresponding likelihood is determined by the Forward algorithm. Finally, the probability that an element from an observation belonged to a specific 'vertex' can be determined by the Backward algorithm. To explain these solutions (the forward, backward and Viterbi algorithms) better, I'm going to introduce some essential notation.
$$\alpha^t(i) = \mathbb{P}(\text{observed sequence up to time } t\text{, ending in state } i \text{ at time } t)$$
$$\beta^t(i) = \mathbb{P}(\text{observations after time } t \mid \text{ending in state } i \text{ at time } t)$$
$$\delta^t(i) = \max\limits_{\text{state paths}}\,\mathbb{P}(\text{observed sequence up to time } t\text{, ending in state } i \text{ at time } t)$$
Before I continue, there is one natural question to ask: this 2-vertex Hidden Markov Model is pretty cool, but obviously it doesn't capture the complexity of large datasets. So, consider instead the \(m\)-vertex Hidden Markov Model. Does a unique solution for the most likely forecasted sequence given some observations even exist? It turns out that the answer is yes! The Hammersley-Clifford Theorem (or the fundamental theorem of random fields) states that the graphical model \(V_\Delta(M_D)\) is equal to the image of the parameter space \(\theta\) under the map \(F_D\). Specifically, it states that a strictly positive probability distribution satisfies the Markov properties of a graph if and only if it is a Gibbs random field: its probability density must factorize over the cliques (or complete subgraphs) of the graph. The implication of this result is that for all \(m\) and for any observed sequence, the \(m\)-vertex Hidden Markov Model has a well-defined solution for a predicted future sequence, which can be approximated by common algorithmic tools with neat twists.
So, first things first, can we convert our knowledge of these observations into knowledge about the hidden variables behind the system? Can we find out which vertex each of these elements came from? Yes again, and that's exactly what the Viterbi algorithm does.
Implementing the Viterbi Algorithm
The Viterbi algorithm first uses the observations to find the path of maximum probability, which is characterized by:
$$ \text{Path} = \theta'_{\sigma_1 \tau_1}\theta_{\sigma_1 \sigma_2} \theta'_{\sigma_2 \tau_2} \theta_{\sigma_2 \sigma_3} \theta'_{\sigma_3 \tau_3} \theta_{\sigma_3 \sigma_4} \theta'_{\sigma_4 \tau_4}$$
Essentially, the Viterbi algorithm seeks to find the \(\arg\max\limits_y P(y|x)\). There's actually an explicit formula for this value.
$$\begin{aligned}
\arg\max\limits_y P(y|x) &= \arg\max\limits_y \frac{P(y,x)}{P(x)}\\
&= \arg\max\limits_y P(y,x) \\
&= \arg\max\limits_y \log P(x|y) + \log P(y)
\end{aligned}
$$
So, for \(k=1,...,M\), we can use Dynamic Programming to iteratively solve for each \(\log(\hat{Y}^k(Z))\), where \(Z\) ranges over every possible state, to produce the best \(\hat{Y}^M(Z)\), which is also known as Maximum A Posteriori (MAP) inference. So, the Viterbi algorithm models the pairwise transitions between states. The reason it works is completely justified by the Bayesian principle. For instance, consider a 1st-order HMM, characterized by the joint probability distribution
$$
\begin{aligned}
P(x,y) &= P(\text{End}|y^M) \prod\limits_{i=1}^M P(y^i | y^{i-1}) \prod\limits_{i=1}^M P(x^i | y^i) \\
P(x|y) &= \prod\limits_{i=1}^M P(x^i | y^i)
\end{aligned}
$$
Since we know that \(P(y) = P(\text{End}|y^M)\prod\limits_{i=1}^M P(y^i | y^{i-1})\), we can use this characterization to recover the original Bayes' formula, which states that:
$$ P(x|y) = \frac{P(x,y)}{P(y)} $$
So, what is the actual algorithm? Given an input of the observation space \(O = \{o_1, ..., o_N\}\), a state space \(S = \{s_1, ..., s_K\}\), an array of initial probabilities \(\Pi = (\pi_1,...,\pi_K)\) such that \(\pi_i\) is the probability that \(x_1 = s_i\), a sequence of observations \(Y = (y_1, ..., y_T)\) such that \(y_t = o_i\) if the observation at time \(t\) is \(o_i\), a transition matrix \(A\) of size \(K\times K\) such that \(A_{ij}\) stores the probability of transitioning from state \(s_i\) to state \(s_j\), and an emission matrix \(B\) of size \(K\times N\) such that \(B_{ij}\) stores the probability of observing \(o_j\) from state \(s_i\), the Viterbi algorithm outputs the most likely hidden state sequence \(X = (x_1, ..., x_T)\).
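Here is a minimal sketch of that procedure in Python, following the notation above; the function name, the use of log-probabilities, and the assumption that \(Y\) holds 0-indexed observation indices are my own choices rather than details from the original assignment code.

import numpy as np

def viterbi(Pi, A, B, Y):
    # Viterbi DP: delta[t][k] is the log-probability of the best state path that
    # ends in state k after observations Y[0..t] (Y holds column indices into B).
    K, T = len(Pi), len(Y)
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = np.log(Pi) + np.log(B[:, Y[0]])
    for t in range(1, T):
        for k in range(K):
            scores = delta[t - 1] + np.log(A[:, k]) + np.log(B[k, Y[t]])
            back[t, k] = int(np.argmax(scores))
            delta[t, k] = scores[back[t, k]]
    # Follow the backpointers to recover the most likely hidden state sequence.
    path = [int(np.argmax(delta[T - 1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

On the dice example above, viterbi(pi, A, O, [0, 3, 2, 5, 5, 3]) (the rolls converted to 0-indexed symbols) would label each roll with its most likely hidden state; zero emission probabilities simply become \(-\infty\) scores.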
Implementing the Forward and Backward Algorithms
Here is a conceptual sketch of what happens in the forward and backward algorithms, respectively.
So, for the forward algorithm, the goal is to solve for every \(\alpha_z(i) = P(x^{1:i},y^i=Z|A,O)\). One naive (exponential time) solution is to let
$$ \alpha_z(i) = P(x^{1:i},y^i=Z|A,O) = \sum\limits_{y^{1:i-1}} P(x^{1:i},y^i=Z,y^{1:i-1}|A,O)$$
A better solution is to use DP and recursively solve for \(\alpha_z(i) \). So,
$$
\begin{aligned}
\alpha_z(1) &= P(y^1 = z|y^0) P(x^1 | y^1 = z) = O_{x^1, z} A_{\text{z,start}} \\
\alpha_z(i+1) &= O_{x^{i+1},z} \sum\limits_{j=1}^L \alpha_j(i) A_{z,j}
\end{aligned}
$$
Similarly, for the backward algorithm, the goal is to solve for every \(\beta_z(i) = P(x^{1+i:M}|y^i=Z,A,O)\). The naive (exponential time) solution is to let
$$ \beta_z(i) = P(x^{i+1:M}|y^i=Z,A,O) = \sum\limits_{y^{i+1:M}} P(x^{i+1:M},y^{i+1:M} | y^i = Z, A, O)$$
A better solution is to use DP and recursively solve for \(\beta_z(i) \). So,
$$
\begin{aligned}
\beta_z(M) &= 1 \\
\beta_z(i) &= \sum\limits_{j=1}^L \beta_j(i+1) A_{j,z} O_{x^{i+1},j}
\end{aligned}
$$
Here's code that computes the forward probabilities, following the recurrence above (a matching sketch of the backward pass comes right after):
def forward(self, x, normalize=False):
    alphas = [[0. for _ in range(self.L)] for _ in range(len(x) + 1)]  # alphas[i][z] = P(x^{1:i}, y^i = z)
    alphas[1] = [self.A_start[z] * self.O[z][x[0]] for z in range(self.L)]  # assumes self.A_start holds A_{z,start}
    for i in range(1, len(x)):
        alphas[i + 1] = [self.O[z][x[i]] * sum(alphas[i][j] * self.A[j][z] for j in range(self.L)) for z in range(self.L)]
        if normalize:  # rescale each row to avoid numerical underflow on long sequences
            alphas[i + 1] = [a / sum(alphas[i + 1]) for a in alphas[i + 1]]
    return alphas
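And here is a matching sketch of the backward pass under the same conventions (rows of self.A index the current state, rows of self.O index the hidden state); as with the forward pass, the exact structure here is an assumption rather than the original implementation.

def backward(self, x, normalize=False):
    # betas[i][z] = P(x^{i+1:M} | y^i = z); the base case betas[M] is all ones.
    M = len(x)
    betas = [[0. for _ in range(self.L)] for _ in range(M + 1)]
    betas[M] = [1. for _ in range(self.L)]
    for i in range(M - 1, 0, -1):
        betas[i] = [sum(betas[i + 1][j] * self.A[z][j] * self.O[j][x[i]]
                        for j in range(self.L)) for z in range(self.L)]
        if normalize:  # rescale to avoid underflow, mirroring forward()
            betas[i] = [b / sum(betas[i]) for b in betas[i]]
    return betas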
Supervised Learning
We can use the supervised learning framework to train the HMM. So, given \(S = \{(x_i,y_i)\}_{i=1}^N\), our goal is to use \(S\) to find the maximum-likelihood estimate of \(P(x,y)\), where
$$P(x,y) = P(\text{End}|y^M)\prod\limits_{i=1}^M P(y^i | y^{i-1}) \prod\limits_{i=1}^M P(x^i | y^i)$$
So, to do this, we define the Transition matrix \(A\) and the Observation matrix \(O\), where
$$ A_{ab} = P(y^{i+1}=a|y^i = b)$$
$$ O_{wz} = P(x^i=w|y^i=z)$$
Using this notation, we have that
$$ \begin{aligned}
P(x,y) &= P(\text{End}|y^M)\prod\limits_{i=1}^M P(y^i|y^{i-1})\prod\limits_{i=1}^M P(x^i|y^i) \\
&= A_{\text{End},y^M}\prod\limits_{i=1}^M A_{y^i,y^{i-1}} \prod\limits_{i=1}^M O_{x^i,y^i}
\end{aligned}
$$
To find the maximum-likelihood estimates, we solve
$$ \arg\max\limits_{A,O} \prod\limits_{(x,y)\in S} P(x,y) = \arg \max\limits_{A,O} \prod\limits_{(x,y)\in S} P(\text{End}|y^M)\prod\limits_{i=1}^M P(y^i|y^{i-1})\prod\limits_{i=1}^M P(x^i | y^i)$$
We can use supervised learning to estimate each component separately. So,
$$ A_{ab} = \frac{\sum\limits_{j=1}^N \sum\limits_{i=0}^{M_j} \mathbb{1}_{[(y_j^{i+1}=a)\wedge (y_j^i=b)]}}{\sum\limits_{j=1}^N \sum\limits_{i=0}^{M_j}\mathbb{1}_{[y^i_j=b]}}$$
$$ O_{wz} = \frac{\sum\limits_{j=1}^N \sum\limits_{i=1}^{M_j} \mathbb{1}_{[(x_j^{i}=w)\wedge (y_j^i=z)]}}{\sum\limits_{j=1}^N \sum\limits_{i=1}^{M_j}\mathbb{1}_{[y^i_j=z]}}$$
Here's the code that executes the supervised learning framework for the Hidden Markov Model:
import numpy as np

def supervised_learning(self, X, Y):
    # Count state transitions: A[i][j] counts occurrences of y^t = i followed by y^{t+1} = j.
    A = np.zeros((self.L, self.L))
    for state_seq in Y:
        for i in range(len(state_seq) - 1):
            A[state_seq[i]][state_seq[i + 1]] += 1
    # Normalize each row so that A[i][j] = P(next state = j | current state = i).
    self.A = A / A.sum(axis=1, keepdims=True)
    # Count emissions: O[z][w] counts observation w emitted from state z.
    O = [[0 for _ in range(self.D)] for _ in range(self.L)]
    for a in range(len(X)):
        for i in range(len(Y[a])):
            O[Y[a][i]][X[a][i]] += 1
    # Normalize each row so that O[z][w] = P(x^i = w | y^i = z).
    self.O = [[count / sum(row) for count in row] for row in O]
There are some glaring assumptions that go along with the supervised learning framework which are, in most cases, undesirable. For instance, we assume that the joint probability decomposes into a product of pairwise terms: that each transition \(P(y^{i+1}=a|y^i=b)\) depends only on the previous state and each observation depends only on the current state. This is a crucial assumption, since it gives us that
$$P(x,y) = P(\text{End}|y^M)\prod\limits_{i=1}^M P(y^i | y^{i-1}) \prod\limits_{i=1}^M P(x^i | y^i)$$
Another crucial, albeit undesirable, assumption is that the model can learn (to arbitrarily high precision) the frequentist statistics of how often \(y^{i+1}=a\) when \(y^i=b\) over the training set, which requires the hidden states \(y\) to be labeled in the first place.
Unsupervised Learning
Due to the undesirable assumptions of supervised learning mentioned in the previous paragraph, we instead consider the framework of unsupervised learning. Consider the case in which there are no \(y\)'s, so \(S = \{x_i\}_{i=1}^N\). Can we still estimate \(P(x,y)\)? Again, the answer is yes! Note that
$$ \arg \max \prod\limits_i P(x_i) = \arg \max \prod\limits_i \sum\limits_y P(x_i, y)$$
So, we now re-define our matrix protagonists \(A\) and \(O\) to:
$$ A_{ab} = P(y^{i+1}=a|y^i = b)$$
$$ O_{wz} = P(x^i=w|y^i=z)$$
We then use the Baum-Welch algorithm, an expectation-maximization procedure built on the forward and backward passes, to train the Hidden Markov Model. Basically, it initializes \(A\) and \(O\) randomly using the framework above. It then predicts the probabilities of \(y\) for each training \(x\), in what's called the expectation step. It then uses these probabilities to estimate new \(A\) and \(O\) matrices, in what's called the maximization step. It repeats this procedure until the estimates converge.
So, in the expectation step, we are given \(A, O\) and \(x=(x^1,...,x^M)\), and we need to compute \(P(y^i)\) for each position \(i\) of \(y=(y^1,...,y^M)\), encoding the current model's beliefs and marginal distribution over \(y\).
Next, in the maximization step, we re-estimate \(A\) and \(O\) by maximum likelihood over the marginal distributions computed in the expectation step, using a dynamic programming approach:
$$ A_{ab} = \frac{\sum\limits_{j=1}^N \sum\limits_{i=0}^{M_j} P(y_j^i=b, y_j^{i+1}=a)}{\sum\limits_{j=1}^N \sum\limits_{i=0}^{M_j} P(y^i_j=b)}$$
$$ O_{wz} = \frac{\sum\limits_{j=1}^N \sum\limits_{i=1}^{M_j} \mathbb{1}_{[x_j^{i}=w]} P(y_j^i=z)}{\sum\limits_{j=1}^N \sum\limits_{i=1}^{M_j} P(y^i_j=z)}$$
To explain the underlying algorithm further, I'm going to introduce some notation. Let \(\alpha_z(i)\) be the probability of observing prefix \(x^{1:i}\) and having the i-th state be \(y^i=z\), and let \(\beta_z(i)\) be the probability of observing suffix \(x^{1+i:m}\) given the i-th state being \(y^i = z\), where
$$ \alpha_z(i) = P(x^{1:i},y^i=Z|A,O)$$
$$ \beta_z(i) = P(x^{1+i:M}|y^i=Z,A,O)$$
So, to compute the marginals, we can combine these two terms to get:
$$P(y^i = z | x) = \frac{\alpha_z(i) \beta_z(i)}{\sum\limits_{z'} \alpha_{z'}(i)\beta_{z'}(i)}$$
$$P(y^i = b, y^{i-1}=a | x) = \frac{\alpha_a(i-1)P(y^i=b|y^{i-1}=a)P(x^i|y^i=b)\beta_b(i)}{\sum\limits_{a',b'}\alpha_{a'}(i-1)P(y^i=b'|y^{i-1}=a')P(x^i|y^i=b')\beta_{b'}(i)}$$
Here is a minimal sketch of code that does exactly that, assuming \(A\) and \(O\) have already been randomly initialized and that the forward() and backward() methods above return the \(\alpha\) and \(\beta\) tables indexed from 1; the iteration count and accumulator names below are illustrative rather than taken verbatim from our implementation:
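def unsupervised_learning(self, X, n_iters=100):
    # Minimal Baum-Welch sketch; assumes self.A and self.O are already randomly initialized.
    for _ in range(n_iters):
        A_num = [[0. for _ in range(self.L)] for _ in range(self.L)]
        A_den = [0. for _ in range(self.L)]
        O_num = [[0. for _ in range(self.D)] for _ in range(self.L)]
        O_den = [0. for _ in range(self.L)]
        for x in X:
            M = len(x)
            alphas = self.forward(x, normalize=True)
            betas = self.backward(x, normalize=True)
            # E-step, single marginals: gamma = P(y^i = z | x).
            for i in range(1, M + 1):
                norm = sum(alphas[i][z] * betas[i][z] for z in range(self.L))
                for z in range(self.L):
                    gamma = alphas[i][z] * betas[i][z] / norm
                    O_num[z][x[i - 1]] += gamma
                    O_den[z] += gamma
                    if i < M:
                        A_den[z] += gamma
            # E-step, pairwise marginals: xi = P(y^i = b, y^{i+1} = a | x).
            for i in range(1, M):
                norm = sum(alphas[i][b] * self.A[b][a] * self.O[a][x[i]] * betas[i + 1][a]
                           for b in range(self.L) for a in range(self.L))
                for b in range(self.L):
                    for a in range(self.L):
                        A_num[b][a] += (alphas[i][b] * self.A[b][a] * self.O[a][x[i]]
                                        * betas[i + 1][a]) / norm
        # M-step: re-estimate A and O from the expected counts.
        self.A = [[A_num[b][a] / A_den[b] for a in range(self.L)] for b in range(self.L)]
        self.O = [[O_num[z][w] / O_den[z] for w in range(self.D)] for z in range(self.L)]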
We use the unsupervised learning framework in conjunction with the forward-backward algorithm, rather than the separate forward and backward passes that are typically used in conjunction with the Viterbi algorithm. The forward-backward algorithm has three key traits:
It runs forward: \(\alpha_z(i) = P(x^{1:i},y^i=Z|A,O)\)
It runs backward: \(\beta_z(i) = P(x^{1+i:M}|y^i=Z,A,O)\)
For each training \(x = (x^1,...,x^M)\), it computes each \(P(y^i)\) for each \(y = (y^1, ..., y^M)\)
$$ P(y^i=z|x) = \frac{\alpha_z(i)\beta_z(i)}{\sum\limits_{z'}\alpha_{z'}(i)\beta_{z'}(i)}$$
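As a small illustration, the single-position marginal above can be computed directly from the forward and backward tables; this helper is my own addition rather than part of the original class.

def posterior_marginal(self, x, i):
    # P(y^i = z | x) for every state z, for a position i between 1 and len(x).
    alphas = self.forward(x, normalize=True)
    betas = self.backward(x, normalize=True)
    weights = [alphas[i][z] * betas[i][z] for z in range(self.L)]
    return [w / sum(weights) for w in weights]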
Generating Emissions
We then use these algorithms to determine the probabilities of forecasted sequences and to find the maximum-likelihood sequence.
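Here is a minimal sketch of how new emissions could be sampled from the trained matrices; the method name and the assumed self.A_start start distribution are illustrative rather than the code we actually used.

import random

def generate_emission(self, M):
    # Sample a hidden state path and M observations from the trained A and O.
    emission, states = [], []
    state = random.choices(range(self.L), weights=self.A_start)[0]  # assumed start distribution
    for _ in range(M):
        emission.append(random.choices(range(self.D), weights=self.O[state])[0])
        states.append(state)
        state = random.choices(range(self.L), weights=self.A[state])[0]
    return emission, states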
We then trained the HMM on the corpus of the Constitution of the United States and generated some sample sentences.
Sample Sentence 1
Hundred and not its public states shall state between number of thing but approved their common prescribe consequence iv he shall regulate the conventions and no...
Sample Sentence 2
A they state of no have but thereof declare of in of such be laws shall a to day entitled proceedings of enumeration any privileged...
Sample Sentence 3
From foreign of all two to prescribed as whereof laws and to not first states objections elected publish south state and senator prince all no...
Sample Sentence 4
Electors shall jersey taken have thousand on whose and the officer constitute to be weights the privilege a for of the the bill adhering subject...
Sample Sentence 5
From be shall given under if shall reserving the united may public and on both protect to any united of the constitution shall as of...
Comments about the Sparsity of the A and O Matrices
An interesting insight about the transition matrix \(A\) and the observation matrix \(O\) is that they are extremely sparse. This is actually an expected result, since the HMM enforces a strong regularization through the Baum-Welch (and even Viterbi) algorithms. The sparsity of these matrices suggests that the algorithms are not overfitting to the dataset, and are thus still capable of generating unique samples from the distribution.
Visualizations: The Data Wordcloud and How the HMM Transitions between Genres
Here is a word-cloud that we generated from the dataset (the corpus of the United States Constitution).
We then generated word-clouds of the categories of words corresponding to each hidden state, after allowing the HMM to discover 10 hidden states from the corpus.
I then mapped how the HMM transitioned between categories of words as it began generating unique phrases.
Finally, we applied the Hidden Markov Model to a new dataset: a corpus of everything Shakespeare ever wrote. We applied it in conjunction with the CMU Pronouncing Dictionary to determine the stresses of each word and to enforce further constraints (a 10-syllable count per line, the stress-unstress pattern, and iambic pentameter). We then built an LSTM (long short-term memory) recurrent neural network to check that moderately longer stretches of words made sense (within sufficiently large windows).
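For reference, here is a rough sketch of how a word's stress pattern and syllable count can be read off the CMU Pronouncing Dictionary via NLTK; the helper name is mine, and this is not the code from our project.

from nltk.corpus import cmudict  # requires a one-time nltk.download('cmudict')

pron = cmudict.dict()

def stress_pattern(word):
    # Vowel phonemes in the CMU dictionary end in a digit: 0 = unstressed, 1 or 2 = stressed.
    phones = pron[word.lower()][0]               # take the first listed pronunciation
    return [0 if p[-1] == '0' else 1 for p in phones if p[-1].isdigit()]

print(stress_pattern('winter'))        # e.g. [1, 0]: stressed, then unstressed
print(len(stress_pattern('beauty')))   # syllable count, used for the 10-syllable constraint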