Hidden Markov Model and Part of Speech Tagging

Sat 19 Mar 2016 by Tianlong Song Tags Natural Language Processing Machine Learning Data Mining

In a Markov model, we generally assume that the states are directly observable or one state corresponds to one observation/event only. However, this is not always true. A good example would be: in speech recognition, we are supposed to identify a sequence of words given a sequence of utterances, in which case the states (words) are not directly observable and one single state (word) could have different observations (utterances). This is a perfect example that could be treated as a hidden Markov model (HMM), by which the hidden states can be inferred from the observations.

Elements of a Hidden Markov Model (HMM)1

A hidden Markov model, \(\Phi\), typically includes the following elements:

  • Time: \(t=\{1,2,...,T\}\);
  • \(N\) States: \(Q=\{1,2,...,N\}\);
  • \(M\) Observations: \(O=\{1,2,...,M\}\);
  • Initial Probabilities: \(\pi_i=p(q_1=i),~1 \leq i \leq N\);
  • Transition Probabilities: \(a_{ij}=p(q_{t+1}=j|q_t=i),~1 \leq i,j \leq N\);
  • Observation Probabilities: \(b_j(k)=p(o_t=k|q_t=j)~1 \leq j \leq N, 1 \leq k \leq M\).

The entire model can be characterized by \(\Phi=(A,B,\pi)\), where \(A=\{a_{ij}\}\), \(B=\{b_j(k)\}\) and \(\pi=\{\pi_i\}\). The states are "hidden", since they are not directly observable, but reflected in observations with uncertainty.

Three Basic Problems for HMMs1

There are three basic problems that are very important to real-world applications of HMMs:

Problem 1: Evaluation Problem

Given the observation sequence \(O=o_1o_2...o_T\) and a model \(\Phi=(A,B,\pi)\), how to efficiently compute the probability of the observation sequence given the model, i.e., \(p(O|\Phi)\)?


\begin{equation} \alpha_t(i)=p(o_1o_2...o_t,q_t=i|\Phi) \end{equation}

denote the probability that the state is \(i\) at time \(t\) and we have a sequence of observations \(o_1o_2...o_t\). The evaluation problem can be solved by the forward algorithm as illustrated below:

  1. Base case:
    \begin{equation} \alpha_1(i)=p(o_1,q_1=i|\Phi)=p(o_1|q_1=i,\Phi)p(q_1=i|\Phi)=\pi_ib_i(o_1),~1 \leq i \leq N; \end{equation}
  2. Induction:
    \begin{equation} \alpha_{t+1}(j)=\left[\sum_{i=1}^N{\alpha_{t}(i)a_{ij}}\right]b_j(o_{t+1}),~1 \leq j \leq N; \end{equation}
  3. Termination:
    \begin{equation} p(O|\Phi)=\sum_{i=1}^N{\alpha_T(i)}, \end{equation}
    \begin{equation} p(q_T=i|O,\Phi)=\frac{\alpha_T(i)}{\sum_{j=1}^N{\alpha_T(j)}}. \end{equation}

The algorithm above essentially applies dynamic programming, and its complexity is \(O(N^2T)\).

Problem 2: Decoding Problem

Given the observation sequence \(O=o_1o_2...o_T\) and a model \(\Phi=(A,B,\pi)\), how to choose the "best" state sequence \(Q=q_1q_2...q_T\) (the most probable path) in terms of how good it explains the observations?


\begin{equation} v_t(i)=\max_{q_1q_2...q_{t-1}}{p(q_1q_2...q_{t-1},q_t=i,o_1o_2...o_t|\Phi)} \end{equation}

as the best state sequence through which the state arrives at \(i\) at time \(t\) with a sequence of observations \(o_1o_2...o_t\). The decoding problem can be solved by the Viterbi algorithm as illustrated below:

  1. Base case:
    \begin{equation} v_1(i)=p(q_1=i,o_1|\Phi)=p(o_1|q_1=i,\Phi)p(q_1=i|\Phi)=\pi_ib_i(o_1),~1 \leq i \leq N; \end{equation}
  2. Induction:
    \begin{equation} v_{t+1}(j)=\left[\max_i{v_{t}(i)a_{ij}}\right]b_j(o_{t+1}),~1 \leq j \leq N, \end{equation}
    in which the optimal \(i\) from the maximization should be stored properly for backtracking;
  3. Termination: The best state sequence can be determined by first finding the optimal final state
    \begin{equation} q_T=\max_i{v_T(i)}, \end{equation}
    and then backtracking all the way to the initial state.

The algorithm above also applies dynamic programming, and its complexity is \(O(N^2T)\) as well.

Problem 3: Model Learning

Given the observation sequence \(O=o_1o_2...o_T\), how to find the model \(\Phi=(A,B,\pi)\) that maximizes \(p(O|\Phi)\)?

A general maximum likelihood (ML) learning approach could determine the optimal \(\Phi\) as

\begin{equation} \hat{\Phi}=\max_{\Phi}{p(O|\Phi)}. \end{equation}

It is much easier to perform supervised learning, where the true state are tagged to each observation. Given \(V\) training sequences in total, the model parameters can be estimated as

\begin{equation} \label{Eqn:Supervised_Learning} \hat{a}_{ij}=\frac{Count(q:i \rightarrow j)}{Count(q:i)},~~\hat{b}_j(k)=\frac{Count(q:j,o:k)}{Count(q:j)},~~\hat{\pi}_i=\frac{Count(q_1=i)}{V}. \end{equation}

It becomes a little bit tricky for unsupervised learning, where the true state are not tagged. To facilitate our model learning, we need to first introduce the following definition/calculation:

\begin{equation} \label{Eqn:Episilon} \varepsilon_t(i,j)=p(q_t=i,q_{t+1}=j|O,\Phi)=\frac{p(q_t=i,q_{t+1}=j,O|\Phi)}{p(O|\Phi)}=\frac{\alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{\sum_{i=1}^N{\alpha_T(i)}}, \end{equation}

where \(p(O|\Phi)\) is exactly Problem 1 we have yet talked about. \(\beta_{t+1}(j)\) can be calculated using the backward algorithm, which is very similar to the forward algorithm in Problem 1 to calculate \(\alpha_t(i)\) except the difference in the direction of calculation. Following (\ref{Eqn:Episilon}), we further introduce

\begin{equation} \label{Eqn:Gamma} \gamma_t(i)=p(q_t=i|O,\Phi)=\sum_{j=1}^N{\varepsilon_t(i,j)}. \end{equation}

Then the model parameters can be recomputed as

\begin{equation} \begin{split} \hat{a}_{ij}&=\frac{Expected~number~of~transitions~from~state~i~to~j}{Expected~number~of~transitions~from~state~i}\\ &=\frac{\sum_{t=1}^{T-1}{\varepsilon_t(i,j)}}{\sum_{t=1}^{T-1}{\gamma_t(i)}}, \end{split} \end{equation}
\begin{equation} \begin{split} \hat{b}_j(k)&=\frac{Expected~number~of~times~in~state~j~and~observing~k}{Expected~number~of~times~in~state~j}\\ &=\frac{\sum_{t=1,~s.t.~o_t=k}^{T}{\gamma_t(j)}}{\sum_{t=1}^{T}{\gamma_t(j)}}, \end{split} \end{equation}
\begin{equation} \begin{split} \hat{\pi}_i&=Expected~number~of~times~in~state~i~at~time~t=1\\ &=\gamma_1(i). \end{split} \end{equation}

Now we are ready to apply the expectation maximization (EM) algorithm for HMM learning. More specifically:

  1. Initialize the HMM, \(\Phi\);
  2. Repeat the two steps below until convergence:
    • E Step: Given observations \(o_1o_2...o_T\) and the model \(\Phi\), compute \(\varepsilon_t(i,j)\) by (\ref{Eqn:Episilon}) and \(\gamma_t(i)\) by (\ref{Eqn:Gamma});
    • M Step: Update the model \(\Phi\) by recomputing parameters using the three equations right above.

Part of Speech (POS) Tagging

In natural language processing, part of speech (POS) tagging is to associate with each word in a sentence a lexical tag. As an example, Janet (NNP) will (MD) back (VB) the (DT) bill (NN), in which each POS tag describes what its corresponding word is about. In this particular example, "VB" tells that "back" is a verb, and "NN" tells that "bill" is a noun, etc.

POS tagging is very useful, because it is usually the first step of many practical tasks, e.g., speech synthesis, grammatical parsing and information extraction. For instance, if we want to pronounce the word "record" correctly, we need to first learn from context if it is a noun or verb and then determine where the stress is in its pronunciation. A similar argument applies to grammatical parsing and information extraction as well.

We need to do some preprocessing before performing POS tagging using HMM. First, because the vocabulary size could be very large while most of the words are not frequently used, we replace each low-frequency word with a special word "UNKA". This is very helpful to reduce the vocabulary size, and thus reduce the memory cost on storing the probability matrix. Second, for each sentence, we add two tags to represent sentence boundaries, e.g., "START" and "END".

Now we are ready to apply HMM to perform POS tagging. The model can be characterized by:

  • Time: length of each sentence;
  • \(N\) States: POS tags, e.g., 45 POS tags from Penn Treebank;
  • \(M\) Observations: vocabulary (compressed by replacing low-frequency words with "UNKA");
  • Initial Probabilities: probability of each tag associated to the first word;
  • Transition Probabilities: \(p(t_{i+1}|t_i)\), where \(t_i\) represents the tag for the \(i\)th word;
  • Observation Probabilities: \(p(w|t)\), where \(t\) stands for a tag and \(w\) stands for a word.

Once we finish training the model, e.g., under supervised learning by (\ref{Eqn:Supervised_Learning}), we will then be able to tag new sentences applying the Viterbi algorithm as previously illustrated in Problem 2 for HMM. To see details about implementing POS tagging using HMM, click here for demo codes.


  1. L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, in Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb 1989.