Hidden Markov Models#

Today we will take a look at Hidden Markov Models (HMMs). HMMs deal with Markov processes in which the states are unobservable or hidden but influence an observable process.

_images/HMM_chain.svg

HMMs are used in various fields that include bioinformatics, finance, robotics, developmental studies, speech recognition, and Natural Language Processing (NLP).

  • A concrete example from a developmental study is the modelling of infant free-play regimes. While it is not possible to directly observe whether an infant is focused or exploring, a prediction can be made based on the number of toys they interact with during a certain period of time.

  • One of the applications of HMMs in computational finance is modeling stock market states, such as a bull market (which occurs when stock prices are rising and investors are optimistic) and a bear market (which occurs when prices decline). It is impossible to directly observe the state of the market, but put/call ratios, which are indicators of investor sentiment and are associated with short-term stock market returns, can be used to predict it.

  • In NLP, HMMs are often used for Part-of-Speech (POS) tagging, the process of assigning a grammatical category, such as noun, verb, or adjective, to each word in a piece of text. While this task seems relatively easy to humans who speak the language of the text, it is much harder for a machine. For instance, consider the following sentences:

  1. I should book my flight to Paris.

  2. I am reading an interesting book.
    The word book is a verb in the first sentence and a noun in the second. There are many ambiguous words, like book, for which POS tagging is not a trivial task.
    We will look into HMMs for POS tagging in more detail in the next lecture.

Problem definition#

For now we will focus on the following example.

Example: Umbrella problem. A student is studying for an exam in a room without any windows. Every day she wants to know whether it is rainy or sunny outside. However, her only access to the outside world is when she sees her housemate leaving the house with or without an umbrella each morning.

This problem is a little unrealistic but has the components we are interested in.

  • The student doesn’t have direct access to the outside world, but wants to know whether it is sunny or rainy. Therefore, the set \(\left\{\text{Sunny}, \text{Rainy} \right\}\) represents the hidden states.

  • Instead of observing the hidden states directly, the student observes a signal emitted by the hidden states - whether the housemate carries an umbrella or not. Thus, \(\left\{\text{No umbrella}, \text{Umbrella}\right\}\) is the set of possible observations.

  • The signal that the student observes is noisy. For example, even if it's not raining, the housemate might bring an umbrella because he forgot to check the weather report.

Model assumptions#

To be able to model the umbrella problem, we need to represent it as a discrete-time process. That is, we need to specify a time step between the events.

Furthermore, we need the following assumptions to create an HMM.

  1. Markov property: the weather at day \(n+1\) depends only on the weather at day \(n\).

  2. Stationarity: the probability of transitioning from one hidden state to another is the same for every time step.

  3. Output independence: the observation at day \(n+1\) depends only on the hidden state at day \(n+1\).
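
In symbols, writing \(x_n\) for the hidden state and \(o_n\) for the observation at day \(n\), these assumptions read

\[P\left(x_{n+1} | x_n, x_{n-1}, \dots, x_0\right) = P\left(x_{n+1} | x_n\right),\]
\[P\left(o_n | x_n, \dots, x_0, o_{n-1}, \dots, o_0\right) = P\left(o_n | x_n\right),\]

with the transition probability \(P\left(x_{n+1} | x_n\right)\) being the same for every time step \(n\) (stationarity).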

State-transition diagram#

Let us visualize the model and assign the probabilities.

_images/HMM_diagram_no_start.svg

The diagram shows that, for instance, the probability of the housemate bringing an umbrella on a sunny day is 0.2.
In addition, we assume that the initial probabilities are 0.6 and 0.4 for a sunny day and a rainy day, respectively.

Decoding#

In general, there are three types of questions that can be asked about HMMs.

  1. Evaluation: What is the probability of an observed sequence?

  2. Decoding: What is the most likely series of states to generate an observed sequence?

  3. Learning: How can we learn the parameters of an HMM given some data?

We will focus on decoding problems. In particular, the question we are going to address in this lecture is the following:
Given the model depicted in the state-transition diagram and the sequence of observations \(O=\left(o_0,o_1,o_2,o_3\right)=(0,1,1,0)\), find the sequence of hidden states \(X=\left(x_0,x_1,x_2,x_3\right)\) that best describes the observations \(O\).

Transition and emission matrices#

The state-transition diagram provides an accessible way to present the HMM for the umbrella problem. However, this only works for problems with a small number of hidden and observable states. In general, matrix notation is used to describe the model.

Let us start with the transition matrix \(A\). For \(N\) hidden states, \(Q=\left\{q_1, q_2, \dots, q_{N}\right\}\),
\(A\) is an \(N\times N\) matrix with

\[A_{ij} = P\left(\text{state } q_i \text{ at time } n+1 | \text{ state } q_j \text{ at time } n \right).\]

The transition matrix for the umbrella problem is equal to

\[\begin{split}A = \begin{bmatrix} 0.7 & 0.4 \\ 0.3 & 0.6 \end{bmatrix}.\end{split}\]

Note that \(A\) is column stochastic: each of its columns sums up to 1.

The initial probabilities are given by

\[\begin{split}\pi = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix}.\end{split}\]

Next, we construct matrix \(B\) that stores the observation-probability distribution. For a general set of possible observations \(V = \left\{v_1, v_2, \dots, v_{M} \right\}\), \(B\) is an \(M\times N\) matrix with

\[B_{ij} = P\left( \text{observation } v_i \text{ at time } n | \text{ state } q_j \text{ at time } n \right).\]

\(B\) is also column stochastic and is known as the emission matrix.

For the umbrella example \(B\) becomes

\[\begin{split}B = \begin{bmatrix} 0.8 & 0.4\\ 0.2 & 0.6 \end{bmatrix}.\end{split}\]

Together \(A, B,\) and \(\pi\) fully define an HMM.
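
As a quick sanity check, here is a minimal NumPy sketch that stores \(A\), \(B\), and \(\pi\) and verifies that both matrices are column stochastic. The same values reappear in the Python implementation further below; here \(\pi\) is stored as a flat array for simplicity.

import numpy as np

A = np.array([[0.7, 0.4], [0.3, 0.6]])   # transition matrix
B = np.array([[0.8, 0.4], [0.2, 0.6]])   # emission matrix
Pi = np.array([0.6, 0.4])                # initial state distribution

print(A.sum(axis=0))   # [1. 1.]  (every column of A sums to 1)
print(B.sum(axis=0))   # [1. 1.]  (every column of B sums to 1)
print(Pi.sum())        # 1.0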

Brute-force approach#

Recall that we want to decode the following sequence of observations \(O=(0,1,1,0).\)
Of course, it is possible to answer this question by directly computing the joint probability of \(O\) with each sequence of hidden states of length 4. This strategy is called the brute-force approach.

Let’s find an expression for the joint probability for a general sequence of observations of length 4, \(O=(o_0,o_1,o_2,o_3)\), and a corresponding sequence of hidden states, \(X = (x_0, x_1, x_2, x_3).\)

\[P(O,X) = P(o_0, o_1, o_2, o_3, x_0, x_1, x_2, x_3)\]
\[= P(o_3|o_0, o_1, o_2, x_0, x_1, x_2, x_3)P(o_0, o_1, o_2, x_0, x_1, x_2, x_3)\]
\[ = P(o_3 | x_3)P(o_0, o_1, o_2, x_0, x_1, x_2, x_3)\]
\[= P(o_3 | x_3)P(o_2 | o_0, o_1, x_0, x_1, x_2, x_3)P(o_0, o_1, x_0, x_1, x_2, x_3) \]
\[ = P(o_3 | x_3)P(o_2 | x_2)P(o_0, o_1, x_0, x_1, x_2, x_3) = \dots \]
\[ = P(o_3 | x_3)P(o_2 | x_2)P(o_1 | x_1)P(o_0 | x_0)P(x_0, x_1, x_2, x_3)\]
\[= P(o_3 | x_3)P(o_2 | x_2)P(o_1 | x_1)P(o_0 | x_0)P(x_3 | x_0, x_1, x_2) P(x_0, x_1, x_2)\]
\[= P(o_3 | x_3)P(o_2 | x_2)P(o_1 | x_1)P(o_0 | x_0)P(x_3 | x_2) P(x_0, x_1, x_2)= \dots\]
\[ = P(o_3 | x_3)P(o_2 | x_2)P(o_1 | x_1)P(o_0 | x_0)P(x_3 | x_2) P(x_2 | x_1) P(x_1 | x_0)P(x_0).\]

This can be written as

\[P(O,X) = \prod_{n=0}^3 P(o_n | x_n) \prod_{n=1}^3 P(x_n | x_{n-1})P(x_0). \]

All we used to obtain this expression were the definition of conditional probability, the output-independence assumption, and the Markov property.

Similarly, for \(O\) and \(X\) of length \(T\), the expression becomes

\[P(O,X) = \prod_{n=0}^{T-1} P(o_n | x_n) \prod_{n=1}^{T-1} P(x_n | x_{n-1})P(x_0). \]

For example, let us look at the joint probability of observation sequence \((0,1,1,0)\) and hidden-state sequence \((S,R,S,S)\):

\[ P(0,1,1,0,S,R,S,S) = P(0|S)P(1|S)P(1|R)P(0|S)P(S|S)P(S|R)P(R|S)P(S) \]
\[ = B_{11}B_{21}B_{22}B_{11}A_{11}A_{12}A_{21}\pi_1 \]
\[ =0.8\cdot0.2\cdot0.6\cdot0.8\cdot0.7\cdot0.4\cdot0.3\cdot0.6\approx0.00387.\]

Imagine a problem with \(N=10\) hidden states and an observation sequence of length \(T=100\). We would have to compute \(N^T = 10^{100}\) joint probabilities, one for each possible sequence of hidden states! Thus, the brute-force approach is generally infeasible.
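
For the short umbrella sequence, however, brute force is still perfectly feasible, and it provides a useful sanity check for the more efficient algorithm introduced in the next section. Below is a minimal sketch under the conventions used above (column-stochastic \(A\) and \(B\); state index 0 for Sunny, 1 for Rainy); the helper function joint_probability and the labels list are purely illustrative.

import itertools
import numpy as np

A = np.array([[0.7, 0.4], [0.3, 0.6]])   # transition matrix (column stochastic)
B = np.array([[0.8, 0.4], [0.2, 0.6]])   # emission matrix (column stochastic)
Pi = np.array([0.6, 0.4])                # initial state distribution
O = [0, 1, 1, 0]                         # observed sequence (0 = no umbrella, 1 = umbrella)

def joint_probability(O, X):
    # P(O, X) for an observation sequence O and a hidden-state sequence X (0-based indices)
    p = Pi[X[0]] * B[O[0], X[0]]
    for n in range(1, len(O)):
        p *= A[X[n], X[n - 1]] * B[O[n], X[n]]
    return p

print(joint_probability(O, (0, 1, 0, 0)))   # the (S, R, S, S) example above, ~0.00387

# enumerate all N^T = 2^4 hidden-state sequences and keep the most likely one
best = max(itertools.product(range(2), repeat=len(O)), key=lambda X: joint_probability(O, X))
labels = ["S", "R"]
print([labels[x] for x in best])            # most likely sequence; compare with the Viterbi result below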

The Viterbi algorithm#

Fortunately, there are algorithms that are considerably less computationally expensive than the brute-force approach. One of them is the Viterbi algorithm, which has a computational time complexity of \(\mathcal{O}(N^2T)\).

The Viterbi algorithm finds the solution to the decoding problem in a step-wise manner. Each step \(i\) of the algorithm corresponds to an observation \(o_i\) in the sequence \(O\). For every step the algorithm maximizes the joint probability for each possible hidden state and stores the hidden state that led to the optimal joint probability.

The Viterbi algorithm is recursive, because the optimization is performed based on the HMM and the preceding step. The stored hidden states are used at the final stage of the Viterbi algorithm to find the optimal path. This path is known as the Viterbi path.

The Viterbi algorithm can be split into three main stages:

  1. Initialization,

  2. Forward pass,

  3. Backward pass.

Before we proceed with the discussion of the stages of the Viterbi algorithm, we need to introduce its auxiliary matrices, \(C\) and \(D\). Matrix \(C\) is used to store the intermediate probabilities, while matrix \(D\) contains the previously visited hidden states. Both \(C\) and \(D\) are \(N \times T\) matrices, where, as before, \(N\) is the number of possible hidden states and \(T\) is the length of the observed sequence.

Initialization#

The initialization stage populates the first column of matrix \(C\), \(C_{i1}\), and the first column of matrix \(D\), \(D_{i1}\).

To find the first column of \(C\), we use the following expression:

\[C_{i1} = P(o_0 | x_0 = q_i) P(x_0 = q_i).\]

Equivalently,

\[C_{i1} = B_{cindex(o_0)i}\pi_i,\]

where \(cindex(o_0)\) is the row index of observation \(o_0\) in the emission matrix.

Since for the umbrella problem \(o_0\) is equal to \(0\), \(B_{cindex(o_0)i}\) becomes \(B_{1i}\). This implies that the components of \(C_{i1}\) are given by

\[C_{11} = P(0|S)P(S) = B_{11}\pi_1 = 0.8 \cdot 0.6 = 0.48,\]
\[C_{21} = P(0|R)P(R) = B_{12}\pi_2 = 0.4 \cdot 0.4 = 0.16.\]

The first column of \(D\) contains only zeros for any problem, because there are no hidden states preceding the initial state:

\[D_{i1} = 0.\]
_images/Viterbi_diagram_initialization.svg

Therefore, matrices \(C\) and \(D\) for the umbrella problem can be written as

\[\begin{split}C = \begin{bmatrix}0.48 & \ast & \ast & \ast \\ 0.16 & \ast & \ast & \ast \\ \end{bmatrix} \text{ and } D = \begin{bmatrix}0 & \ast & \ast & \ast \\ 0 & \ast & \ast & \ast \\ \end{bmatrix},\end{split}\]

where \(\ast\) denotes the components that haven’t been computed yet.
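
This initialization step can be reproduced in a couple of lines of NumPy. The snippet below is only a sketch of the first column of \(C\), using the matrices defined earlier; note that NumPy indices are 0-based, so \(B_{1i}\) corresponds to the row B[0, :].

import numpy as np

B = np.array([[0.8, 0.4], [0.2, 0.6]])   # emission matrix
Pi = np.array([0.6, 0.4])                # initial state distribution
O = [0, 1, 1, 0]                         # observed sequence

C_first_column = B[O[0], :] * Pi         # C_{i1} = B_{cindex(o_0)i} * pi_i
print(C_first_column)                    # [0.48 0.16]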

Forward pass#

The forward pass completes both auxiliary matrices. The components of matrix \(C\) are found in the following way:

\[C_{ij} = \max_k P(o_{j-1} | x_{j-1} = q_i)P(x_{j-1} = q_i | x_{j-2} = q_k)C_{k(j-1)}.\]

Equivalently, we can write

\[C_{ij} = \max_k B_{cindex(o_{j-1})i} A_{ik} C_{k(j-1)},\]

where \(cindex(o_{j-1})\) is the row index of observation \(o_{j-1}\) in the emission matrix.

The components of matrix \(D\) are strongly interconnected with those of matrix \(C\). In fact, \(D_{ij}\) simply stores the value of \(k\) that maximizes \(C_{ij}\). Mathematically this can be expressed as

\[D_{ij} = \operatorname{argmax}_k P(o_{j-1} | x_{j-1} = q_i)P(x_{j-1} = q_i | x_{j-2} = q_k)C_{k(j-1)}\]
\[= \operatorname{argmax}_k B_{cindex(o_{j-1})i} A_{ik} C_{k(j-1)}.\]

Let us perform the forward pass for the umbrella problem.

\[C_{12} = \max_k B_{cindex(o_1)1}A_{1k}C_{k1}.\]

Since the second observed value \(o_1\) is equal to 1, \(B_{cindex(o_1)1}\) becomes \(B_{21}\):

\[C_{12} = \max_k B_{21}A_{1k}C_{k1}.\]

The above expression can be written as

\[C_{12} = \max \left( B_{21} A_{11} C_{11}, B_{21} A_{12} C_{21}\right).\]

Substituting the values from matrices \(A\), \(B\), and \(C\) leads to

\[C_{12} = \max (0.2 \cdot 0.7 \cdot 0.48, 0.2 \cdot 0.4 \cdot 0.16) = \max(0.0672, 0.0128) = 0.0672.\]

Note that the value of \(k\) that maximized \(C_{12}\) was 1. Therefore,

\[D_{12} = 1.\]
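
This single forward step can also be checked numerically. Below is a minimal sketch, reusing the matrices above and remembering that NumPy uses 0-based indices, so \(B_{21}\) is B[1, 0] and the row \(A_{1k}\) is A[0, :].

import numpy as np

A = np.array([[0.7, 0.4], [0.3, 0.6]])   # transition matrix
B = np.array([[0.8, 0.4], [0.2, 0.6]])   # emission matrix
C_col1 = np.array([0.48, 0.16])          # first column of C from the initialization stage

candidates = B[1, 0] * A[0, :] * C_col1  # B_{21} A_{1k} C_{k1} for k = 1, 2
print(candidates)                        # [0.0672 0.0128]; the maximum is C_{12}
print(candidates.argmax() + 1)           # D_{12} = 1 (back to the 1-based index used in the text)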

The remaining components of \(C\) and \(D\) are computed in a similar way. The final result of the forward pass consists of fully defined matrices \(C\) and \(D\):

\[\begin{split}C = \begin{bmatrix}0.48 & 0.0672 & 0.009408 & 0.009953 \\ 0.16 & 0.0864 & 0.031104 & 0.007465 \end{bmatrix},\end{split}\]
\[\begin{split}D = \begin{bmatrix} 0 & 1 & 1 & 2\\ 0 & 1 & 2 & 2\end{bmatrix}.\end{split}\]
_images/Viterbi_diagram_forward.svg

Backward pass#

The backward pass constructs the path that provides the optimal sequence of hidden states for the observed sequence.

First, the backward pass finds the row that contains the highest probability in the last column of matrix \(C\), \(C_{iT}:\)

\[s = \operatorname{argmax}_{i} C_{iT}. \]

Once \(s\) is known, the backward pass proceeds with matrix \(D\). The algorithm walks through the columns of \(D\) in reverse order. That is, it starts with the last column, \(D_{iT}\). The value of \(s\) simply specifies the row in \(D_{iT}\).

In our case, it is the first row, which corresponds to hidden state \(Sunny\). Thus, the fourth state in the hidden state sequence \(X = (x_0,x_1, x_2, x_3)\) is \(Sunny\):

\[x_3 = S.\]

Moreover, \(D_{14}\) is equal to 2:

\[\begin{split}D = \begin{bmatrix} 0 & 1 & 1 & \mathbf{2}\\ 0 & 1 & 2 & 2\end{bmatrix}.\end{split}\]

This value directs us to the second row of matrix \(D\), which corresponds to hidden state \(Rainy\). Hence, \(Rainy\) becomes the third state in the hidden-state sequence:

\[x_2 = R.\]

Since \(D_{23}\) is also equal to 2:

\[\begin{split}D = \begin{bmatrix} 0 & 1 & 1 & \mathbf{2}\\ 0 & 1 & \mathbf{2} & 2\end{bmatrix},\end{split}\]

we find that \(x_1\) is \(Rainy\) and so on.

The complete path through matrix \(D\) is shown below:

\[\begin{split}D = \begin{bmatrix} \mathbf{0} & 1 & 1 & \mathbf{2}\\ 0 & \mathbf{1} & \mathbf{2} & 2\end{bmatrix}.\end{split}\]

The corresponding sequence of hidden states is \((S, R, R, S),\) which is the solution to our decoding problem.
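
The backtrace itself takes only a few lines of code. The fragment below is an illustrative sketch (the full implementation follows in the next section): it starts from the matrices \(C\) and \(D\) computed in the forward pass and uses the 1-based state labels stored in \(D\).

import numpy as np

C = np.array([[0.48, 0.0672, 0.009408, 0.00995328],
              [0.16, 0.0864, 0.031104, 0.00746496]])
D = np.array([[0, 1, 1, 2],
              [0, 1, 2, 2]])

T = C.shape[1]
path = np.empty(T, dtype=int)
path[-1] = np.argmax(C[:, -1]) + 1   # s: row of the largest entry in the last column (1-based)
for j in range(T - 1, 0, -1):        # walk through the columns of D in reverse order
    path[j - 1] = D[path[j] - 1, j]  # previously visited state stored for the chosen state

labels = {1: "S", 2: "R"}
print([labels[s] for s in path])     # ['S', 'R', 'R', 'S']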

Python implementation of the Viterbi algorithm#

import numpy as np

def Viterbi(y, A, B, Pi):
    N = A.shape[1] # cardinality of the state space
    T = len(y) # length of the observed sequence
    # Initialize the auxiliary matrices C & D
    C = np.empty((N, T), 'd') # 'd' stands for type double
    D = np.empty((N, T), 'B') # 'B' stands for type unsigned 8-bit integer

    # Initialization stage
    C[:, 0] = B[y[0], :] * Pi.T # C_{i1} = B_{cindex(o_0)i} * pi_i
    D[:, 0] = 0

    # Forward pass: maximize over the previous hidden state k for every column
    for i in range(1, T):
        C[:, i] = np.max(B[y[i], :, np.newaxis] * A * C[:, i - 1], 1)
        D[:, i] = np.argmax(B[y[i], :, np.newaxis] * A * C[:, i - 1], 1)
    D[:, 1:] = D[:, 1:] + 1 # hidden-state indices start at 1

    # Backward pass
    x = np.empty(T, 'B')
    x[-1] = np.argmax(C[:, T - 1]) + 1 # finds the value of s
    for i in reversed(range(1, T)):
        x[i - 1] = D[x[i] - 1, i] # previously visited hidden state stored in D

    return x, C, D
# HMM and observed sequence
A = np.array([[0.7,0.4],[0.3,0.6]])
B = np.array([[0.8, 0.4],[0.2, 0.6]])
Pi = np.array([[0.6],[0.4]])

print("Transition matrix: \n", A)
print("Emission matrix: \n", B)
print("Initial state distribution: \n", Pi)

O = np.array([0,1,1,0])
print("Observed sequence: \n", O)
Transition matrix: 
 [[0.7 0.4]
 [0.3 0.6]]
Emission matrix: 
 [[0.8 0.4]
 [0.2 0.6]]
Initial state distribution: 
 [[0.6]
 [0.4]]
Observed sequence: 
 [0 1 1 0]
# Result
X, C, D = Viterbi(O,A,B,Pi)
print("Matrix C: \n", C)
print("Matrix D: \n", D)
print("Answer: \n", X)
Matrix C: 
 [[0.48       0.0672     0.009408   0.00995328]
 [0.16       0.0864     0.031104   0.00746496]]
Matrix D: 
 [[0 1 1 2]
 [0 1 2 2]]
Answer: 
 [1 2 2 1]
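
The answer is returned as 1-based state indices. A small lookup table (the labels dictionary below is illustrative and not part of the original code) translates it back into weather states:

labels = {1: "Sunny", 2: "Rainy"}      # 1-based indices returned by Viterbi
print([labels[int(s)] for s in X])     # ['Sunny', 'Rainy', 'Rainy', 'Sunny']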

Summary#

  • HMMs deal with Markov processes in which the states are hidden but influence an observable process. HMMs satisfy the following assumptions:

  1. Markov property,

  2. stationarity,

  3. output independence.
    HMMs are fully described by the initial state distribution, the transition matrix, and the emission matrix.

  • A decoding problem answers the question “What is the most likely series of states to generate an observed sequence?”
    It can be solved using a computationally expensive brute-force approach or by more advanced algorithms, such as the Viterbi algorithm.

  • The Viterbi algorithm is a recursive, step-wise procedure that can be split into three main stages:

  1. initialization,

  2. forward pass,

  3. backward pass.