Tianlong's Blog

Language Detection from Speech: Chinese or English?

2017-10-15T00:00:00-07:00

In language processing, detecting which language is being spoken is an essential step before speech recognition and machine translation. This blog post presents an approach to distinguishing Chinese from English in speech (an audio sample) using a neural network model. Spark is used for data preprocessing, and TensorFlow is used for model training and evaluation.

Raw Data Collection

YouTube videos are downloaded, and their audio tracks are extracted and converted to wav format. The data are collected from two representative interview shows, one in each language (Chinese and English):

635 minutes of Chinese interviews from Luyu Official (i.e., 《鲁豫有约》)
534 minutes of English interviews from Ellen Show

Data Preprocessing

The data preprocessing converts a wav audio file into a spectrogram image by the following steps:

Split the audio into one-second pieces;
Re-sample (down-sample) each piece to the same sampling rate (16 kHz);
Apply a Mel Frequency Cepstral Coefficient (MFCC) filter to obtain the spectrogram of the audio;
Convert the spectrogram into a gray-scale image;
Improve the contrast of the images by applying histogram equalization and a levels filter;
Cut or pad each image to the same size.

To enable parallelization, Spark is used to execute the steps described above. Most steps are self-explanatory, but Step 3 deserves a little more explanation. The MFCC filter essentially mimics the human cochlea: for each audio sample, it splits the signal into frames, computes the power spectrum of each frame, and sums the energy within each mel-spaced filter bank. See here for more details, especially how mel-spaced filter banks are generated.
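To give a feel for what "mel-spaced" means, here is a minimal sketch that computes the centre frequencies of a bank of triangular filters. It assumes the commonly used conversion mel = 2595 * log10(1 + f / 700); the function names and the choice of 26 filters are illustrative assumptions, not taken from the project's code.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, sample_rate=16000, f_min=0.0):
    """Centre frequencies (Hz) of n_filters filters spaced evenly on the
    mel scale between f_min and the Nyquist frequency."""
    mel_lo = hz_to_mel(f_min)
    mel_hi = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 equally spaced mel points; the first and last points
    # are the outer edges of the first and last triangles.
    step = (mel_hi - mel_lo) / (n_filters + 1)
    points = [mel_to_hz(mel_lo + i * step) for i in range(n_filters + 2)]
    return points[1:-1]

# Hypothetical choice of 26 filters for 16 kHz audio
centers = mel_center_frequencies(n_filters=26)
print([round(c) for c in centers])
```

The filters are packed densely at low frequencies and spread out at high frequencies, which is roughly how human hearing resolves pitch.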

Finally, all spectrogram images are labelled, mixed, shuffled and then split into train/test sets by 80%/20%. After that, we have:

Train set: 30497 spectrogram images for Chinese, and 25663 spectrogram images for English
Test set: 7625 spectrogram images for Chinese, and 6416 spectrogram images for English

Model Training and Evaluation

A Berlinnet neural network model is adopted from here to perform the classification. The model consists of the following layers (in the order they appear in the network):

One input layer
One convolutional layer
One local response normalization layer
One pooling layer
One convolutional layer
One local response normalization layer
One pooling layer
One convolutional layer
One local response normalization layer
One pooling layer
One fully connected layer
One fully connected layer
One output layer

The model is trained and evaluated using the TensorFlow framework. The configuration of the training as well as the model itself can be found here, and the results are discussed in the next section.

Results & Discussion

Due to the limited resources of a regular PC (which is what I have), only 19,300 iterations were performed during training, which took around 24 hours. Nevertheless, the evaluation on the test set delivered an accuracy as high as 92.7%. It should be noted that each classification is based on a very short audio sample (lasting one second only).

There are at least three potential ways to make the accuracy even higher:

Collect more data to train the model;
Apply more iterations during training, if more resources are available;
Most likely, an utterance lasts for more than one second, which gives us a chance to apply majority voting across the classification results drawn independently from multiple one-second audio pieces. Weighted voting is even better and is straightforward to implement, as each classification returns not only a label but also a confidence.

The last approach turns out to be very effective: it consumes barely any additional resources, yet it can reduce the classification error dramatically.
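The voting idea can be sketched as follows. This is only an illustration: the labels, the confidences and the function names are hypothetical, and the accuracy estimate assumes that the one-second pieces are classified independently, which real audio only approximates.

```python
from collections import defaultdict
from math import comb

def weighted_vote(predictions):
    """Combine per-second (label, confidence) pairs into a single label
    by summing the confidence received by each label."""
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence
    return max(totals, key=totals.get)

def majority_vote(predictions):
    """Plain majority voting: every piece counts equally."""
    return weighted_vote([(label, 1.0) for label, _ in predictions])

def majority_accuracy(p, n):
    """Probability that a majority of n independent votes is correct,
    given per-vote accuracy p (n is assumed to be odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Hypothetical classifier outputs for a five-second utterance
pieces = [("zh", 0.55), ("en", 0.95), ("zh", 0.60), ("en", 0.90), ("zh", 0.52)]
print(majority_vote(pieces))   # "zh": three of the five pieces
print(weighted_vote(pieces))   # "en": total confidence 1.85 versus 1.67
print(majority_accuracy(0.927, 5))
```

Under the independence assumption, five votes at 92.7% per-piece accuracy already push the accuracy of the combined decision above 99%.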

Code Repository

Check out the code if you are interested in the implementation.

Acknowledgment

This project is inspired by the great work here, and a large portion of the code comes from it.

An Approach Of Scaling Airflow To A Corporate Level

2017-07-15T00:00:00-07:00

The last post on Airflow provides step-by-step instructions on how to build an Airflow cluster from scratch. That setup serves development purposes well, but it lacks features that are critical in production, e.g., CI/CD compliance, resource monitoring, service recovery, and so on.

I have been leading the efforts to build the Airflow backbone at Zillow's Data Science and Engineering (DSE) team, and I would like to introduce a post from Zillow's tech blog site. It describes how Airflow has been adopted and is used at Zillow, and it may give you an idea of how Airflow can be configured to run at a corporate level.

A Guide On How To Build An Airflow Server/Cluster

2016-10-23T00:00:00-07:00

Airflow is an open-source platform to author, schedule and monitor workflows and data pipelines. If you have periodic jobs that involve data transfers and/or depend on each other, you should consider Airflow. This blog post briefly introduces Airflow and provides instructions for building an Airflow server/cluster from scratch.

A Glimpse at Airflow under the Hood

Generally, Airflow works in a distributed environment, as you can see in the diagram below. The airflow scheduler schedules jobs according to the dependencies defined in directed acyclic graphs (DAGs), and the airflow workers pick up and run jobs with their loads properly balanced. All job information is stored in the meta DB, which is updated in a timely manner. The users can monitor their jobs via a shiny Airflow web UI and/or the logs.

Fig. 1: Airflow Diagram.

Although you do not necessarily need to run a fully distributed version of Airflow, this page will go through all three modes: standalone, pseudo-distributed and distributed modes.

Phase 1: Start with Standalone Mode Using Sequential Executor

Under the standalone mode with a sequential executor, the executor picks up and runs jobs sequentially, which means there is no parallelism for this choice. Although not often used in production, it enables you to get familiar with Airflow quickly.
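As a toy illustration of what running jobs sequentially according to their dependencies means, the sketch below orders a small task graph with Python's standard `graphlib` module (Python 3.9+). The task names are hypothetical, and this is not how Airflow itself is implemented; it only shows the idea of respecting a DAG's dependencies one task at a time.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def sequential_run_order(deps):
    """Return an order in which the tasks could run one after another
    without violating any dependency."""
    return list(TopologicalSorter(deps).static_order())

order = sequential_run_order(dependencies)
print(order)
```

With no parallelism, a task that fails or runs long blocks everything behind it, which is why this mode is mainly useful for getting started.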

Install and configure airflow

# Set the airflow home
export AIRFLOW_HOME=~/airflow

# Install from pypi using pip
pip install airflow

# Install necessary sub-packages
pip install airflow[crypto] # For connection credentials protection
pip install airflow[postgres] # For PostgreSQL DBs
pip install airflow[celery] # For distributed mode: celery executor
pip install airflow[rabbitmq] # For message queuing and passing between airflow server and workers
... # Anything more you need

# Configure airflow: modify AIRFLOW_HOME/airflow.cfg if necessary
# For the standalone mode, we will leave the configuration to default

Initialize the meta database (home for almost all airflow information)

# For the standalone mode, it could be a sqlite database, which applies to sequential executor only
airflow initdb

Start the airflow webserver and explore the web UI

airflow webserver -p 8080 # Test it out by opening a web browser and going to localhost:8080

Create your dags and place them into your DAGS_FOLDER (AIRFLOW_HOME/dags by default); refer to this tutorial for how to create a dag, and keep the key commands below in mind

# Check syntax errors for your dag
python ~/airflow/dags/tutorial.py

# Print the list of active DAGs
airflow list_dags

# Print the list of tasks in the "tutorial" DAG
airflow list_tasks tutorial

# Print the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial --tree

# Test your tasks in your dag
airflow test [DAG_ID] [TASK_ID] [EXECUTION_DATE]
airflow test tutorial sleep 2015-06-01

# Backfill: execute the jobs for past dates that have not been run yet
airflow backfill tutorial -s 2015-06-01 -e 2015-06-07

Start the airflow scheduler and monitor the tasks via the web UI

airflow scheduler # Monitor your tasks via the web UI (success/failure/scheduling, etc.)

# Remember to turn on the dags you want to run via the web UI, if they are not on yet

[Optional] Put your dags in remote storage, and sync them with your local dag folder

# Create a daemon using crons to sync up dags; below is an example for remote dags in S3 (you can also put them in remote repo)
# Note: you need to have the aws command line tool installed and your AWS credentials properly configured
crontab -e
* * * * * /usr/local/bin/aws s3 sync s3://your_bucket/your_prefix YOUR_AIRFLOW_HOME/dags # Sync up every minute

[Optional] Add access control to the web UI; add users with password protection, see here. You may need to install the dependency below
```
pip install flask-bcrypt
```

Phase 2: Adopt Pseudo-distributed Mode Using Local Executor

Under the pseudo-distributed mode with a local executor, the local workers pick up and run jobs locally via multiprocessing. If you have only a moderate amount of scheduled jobs, this could be the right choice.

Adopt another DB server to support executors other than the sequential executor; MySQL and PostgreSQL are recommended; here PostgreSQL is used as an example

# Install postgres
brew install postgresql # For Mac, the command varies for different OS

# Connect to the database
psql -d postgres # This will open a prompt

# Operate on the database server
\l # List all databases
\du # List all users/roles
\dt # Show all tables in database
\h # List help information
\q # Quit the prompt

# Create a meta db for airflow
CREATE DATABASE database_name;
\l # Check for success

Modify the configuration in AIRFLOW_HOME/airflow.cfg

# Change the executor to Local Executor
executor = LocalExecutor

# Change the meta db configuration
# Note: the postgres username and password do not matter for now, since the database server and clients are still on the same host
sql_alchemy_conn = postgresql+psycopg2://your_postgres_user_name:your_postgres_password@host_name/database_name

Restart airflow to test your dags

airflow initdb
airflow webserver
airflow scheduler

Establish your own connections via the web UI; you can test your DB connections via the Ad Hoc Query (see here)

# Go to the web UI: Admin -> Connection -> Create
Connection ID: name it
Connection Type: e.g., database/AWS
Host: e.g., your database server name or address
Schema: e.g., your database
Username: your user name
Password: will be encrypted if airflow[crypto] is installed
Extra: additional configuration in JSON, e.g., AWS credentials

# Encrypt your credentials
# Generate a valid Fernet key and place it into airflow.cfg
FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")

Phase 3: Extend to Distributed Mode Using Celery Executor

Under the distributed mode with a celery executor, remote workers pick up and run jobs as scheduled and load-balanced. Because it is highly scalable, it is the right choice when you expect heavy and growing loads.

Install and configure the message queuing/passing engine on the airflow server: RabbitMQ, Redis, etc.; RabbitMQ is used here as an example (resources: link1 and link2)

# Install RabbitMQ
brew install rabbitmq # For Mac, the command varies for different OS

# Add the following path to your .bash_profile or .profile
PATH=$PATH:/usr/local/sbin

# Start the RabbitMQ server
sudo rabbitmq-server # run in foreground; or
sudo rabbitmq-server -detached # run in background

# Configure RabbitMQ: create user and grant privileges
rabbitmqctl add_user rabbitmq_user_name rabbitmq_password
rabbitmqctl add_vhost rabbitmq_virtual_host_name
rabbitmqctl set_user_tags rabbitmq_user_name rabbitmq_tag_name
rabbitmqctl set_permissions -p rabbitmq_virtual_host_name rabbitmq_user_name ".*" ".*" ".*"

# Make the RabbitMQ server open to remote connections
Go to /usr/local/etc/rabbitmq/rabbitmq-env.conf, and change NODE_IP_ADDRESS from 127.0.0.1 to 0.0.0.0 (development only, restrict access for prod)

Modify the configuration in AIRFLOW_HOME/airflow.cfg

# Change the executor to Celery Executor
executor = CeleryExecutor

# Set up the RabbitMQ broker url and celery result backend
broker_url = amqp://rabbitmq_user_name:rabbitmq_password@host_name/rabbitmq_virtual_host_name # host_name=localhost on server
celery_result_backend = meta db url (as configured in step 2 of Phase 2), or RabbitMQ broker url (same as above), or any other eligible result backend

Open the meta DB (PostgreSQL) to remote connections

# Modify /usr/local/var/postgres/pg_hba.conf to add Client Authentication Record
host    all         all         0.0.0.0/0          md5 # 0.0.0.0/0 stands for all ips; use CIDR address to restrict access; md5 for pwd authentication

# Change the Listen Address in /usr/local/var/postgres/postgresql.conf
listen_addresses = '*'

# Create a user and grant privileges (run the commands below under superuser of postgres)
CREATE USER your_postgres_user_name WITH ENCRYPTED PASSWORD 'your_postgres_pwd';
GRANT ALL PRIVILEGES ON DATABASE your_database_name TO your_postgres_user_name;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO your_postgres_user_name;

# Restart the PostgreSQL server and test it out
brew services restart postgresql
psql -U [postgres_user_name] -h [postgres_host_name] -d [postgres_database_name]

# IMPORTANT: update your sql_alchemy_conn string in airflow.cfg

Configure your airflow workers; follow most steps for the airflow server, except that they do not have PostgreSQL and RabbitMQ servers

Test it out

# Start your airflow workers, on each worker, run:
airflow worker # The prompt will show the worker is ready to pick up tasks if everything goes well

# Start your airflow server
airflow webserver
airflow scheduler
airflow worker # [Optional] Let your airflow server be a worker as well

Your airflow workers should now be picking up and running jobs from the airflow server!

Monte Carlo Tree Search and Its Application in AlphaGo

2016-04-09T00:00:00-07:00

As one of the most important methods in artificial intelligence (AI), especially for playing games, Monte Carlo tree search (MCTS) has received considerable interest due to its spectacular success in the difficult problem of computer Go. In fact, most successful computer Go algorithms are powered by MCTS, including the recent success of Google's AlphaGo¹. This post introduces MCTS and explains how it is used in AlphaGo.

Warm Up: Bandit-Based Methods²

Bandit problems are a well-known class of sequential decision problems, in which one needs to choose among \(K\) actions (e.g. the \(K\) arms of a multi-armed bandit slot machine) in order to maximize the cumulative reward by consistently taking the optimal action. The choice of action is difficult as the underlying reward distributions are unknown, and potential rewards must be estimated based on past observations. This leads to the exploitation-exploration dilemma: one needs to balance the exploitation of the action currently believed to be optimal with the exploration of other actions that currently appear suboptimal but may turn out to be superior in the long run.

The primary goal is to find a policy that minimizes the player's regret after \(n\) plays, which is the difference between: (i) the best possible total reward, had the player known the reward distributions from the start; and (ii) the actual total reward from the \(n\) finished plays. In other words, the regret is the expected loss due to not playing the best bandit. An upper confidence bound (UCB) policy has been proposed, whose expected regret grows logarithmically, uniformly over the total number of plays \(n\), without any prior knowledge of the reward distributions. According to the UCB policy, to minimize the regret, the player should choose for the current play the arm \(j\) that maximizes:

\begin{equation} \overline{X}_j+\sqrt{\frac{2\ln{n}}{n_j}}, \end{equation}

where \(\overline{X}_j\) is the average reward from arm \(j\), \(n_j\) is the number of times arm \(j\) was played and \(n\) is the total number of plays so far. The physical meaning is that: the term \(\overline{X}_j\) encourages the exploitation of higher-rewarded choices, while the term \(\sqrt{\frac{2\ln{n}}{n_j}}\) encourages the exploration of less-visited choices.
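The UCB rule above can be sketched in a few lines. The simulation below is an illustration only: the Bernoulli arms, their win probabilities and the function names are assumptions made for the example.

```python
import math
import random

def ucb1_choose(counts, rewards, total_plays):
    """Pick the arm j that maximizes mean reward + sqrt(2 ln n / n_j)."""
    # Play every arm once first so that n_j > 0 in the formula.
    for arm, n_j in enumerate(counts):
        if n_j == 0:
            return arm
    scores = [
        rewards[j] / counts[j]
        + math.sqrt(2.0 * math.log(total_plays) / counts[j])
        for j in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(win_probs, n_plays, seed=0):
    """Simulate n_plays pulls on Bernoulli arms with the given win probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(win_probs)
    rewards = [0.0] * len(win_probs)
    for n in range(n_plays):
        arm = ucb1_choose(counts, rewards, n)
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += reward
    return counts

# Hypothetical three-armed bandit; the last arm is the best one
counts = run_bandit([0.2, 0.5, 0.8], n_plays=5000)
print(counts)
```

Most plays should end up on the best arm, while the weaker arms are still revisited occasionally because their exploration term keeps growing with \(\ln{n}\).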

Monte Carlo Tree Search (MCTS)²

Let us use a board game as an example. Given a board state, the primary goal is to find the best action to take now, which should naturally be chosen according to some estimated value of each action. The purpose of MCTS is to approximate the (true) values of the actions that may be taken from the current board state. This is achieved by iteratively building a partial search tree.

Four Fundamental Steps in Each Iteration

The basic algorithm involves iteratively building a search tree until some predefined computational budget (e.g., time, memory or iteration constraint) is reached, at which point the search is halted and the best-performing root action returned. Each node in the search tree represents a state, and directed links to child nodes represent actions leading to subsequent states.

Fig. 1: Four steps in one iteration of MCTS.

As illustrated in Fig. 1, four steps are applied in each iteration²:

Selection: Starting from the root node (i.e., current state), a tree policy for child selection is recursively applied to descend through the tree until an expandable node is reached. A node is expandable if it represents a non-terminal state and has unvisited (i.e., expandable) children.
Expansion: For the expandable node we reached in the selection step, one child node is added to expand the tree, according to the available actions.
Simulation: A simulation is run from the newly expanded node according to the default policy to produce an outcome (e.g., win or lose when reaching a terminal state).
Backpropagation: The simulated result is backpropagated through the selected nodes in the selection step to update their statistics.

There are two essential ideas that should be highlighted here:

The tree policy for child selection should be able to give the high-value nodes priorities in value approximation, and meanwhile explore the less-visited nodes. This is quite similar to the bandit problem, so we can apply the UCB policy to choose the child node.
The value of each node is approximated in an incremental way. That is, its initial value is obtained from a random simulation by the default policy (e.g., a win/lose result along a random path), and then refined by the backpropagation steps during the following iterations.

The Full Algorithm Description

Before describing the algorithm, let us define some notations first.

\(s(v)\): the associated state to node \(v\)

\(a(v)\): the incoming action that leads to node \(v\)

\(N(v)\): the visit count of node \(v\)

\(Q(v)\): the vector of total simulation rewards of node \(v\) for all players

The main procedure of the MCTS algorithm is described below, which essentially executes the four fundamental steps for each iteration until the computational budget is reached. It returns the best action that should be taken for the current state.

The selection step is described below, which returns the expandable node according to the tree policy.

The child selection is described below, which returns the best child of a given node. It essentially applies the UCB method, which uses a constant \(c\) to balance the exploitation with the exploration. It should be noted that there might be multiple players, but the best child is selected as per the interest of the player who is supposed to play in this state.
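In the notation defined above, the child selection rule used by the survey's UCT algorithm² can be written as follows, where \(Q(v')\) is understood as the reward component of the player who moves at \(v\):

\begin{equation} \mathrm{BestChild}(v,c)=\arg\max_{v' \in \mathrm{children}(v)}\left(\frac{Q(v')}{N(v')}+c\sqrt{\frac{2\ln{N(v)}}{N(v')}}\right). \end{equation}

Setting \(c=0\) turns off exploration, which is what one wants when picking the final action at the root.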

The selected node after the selection step is expanded by choosing one of its unvisited children, and then adding the associated data to the new node. The procedure is described below.

Given the state associated with the newly expanded node, a random simulation is run as indicated below, which follows a random path to a terminal state and returns the simulated reward.

Once the simulated reward of the newly expanded node is obtained, it is backpropagated through the selected nodes in the selection step. The visit counts are updated at the same time.

Recall the board game example, and assume that the rewards of winning and losing a game are 1 and 0, respectively. After applying the MCTS algorithm, for each node \(v\) in the tree, \(Q(v)\) is the number of wins accumulated over the \(N(v)\) visits of this node, and thus \(\frac{Q(v)}{N(v)}\) is the winning rate. This is exactly the information we can rely on to choose the best action to take in the current state.
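The procedures described in this section can be put together in a compact sketch. To keep it self-contained, it plays a toy game that is an assumption of this example (not from the survey): a pile of stones, each player removes 1 to 3 stones in turn, and whoever takes the last stone wins. As a simplification, \(Q(v)\) is stored as a scalar counting the wins of the player who moved into \(v\), instead of a vector over all players.

```python
import math
import random

def legal_actions(state):
    stones, _ = state
    return [k for k in (1, 2, 3) if k <= stones]

def step(state, action):
    stones, player = state
    return (stones - action, 1 - player)

def is_terminal(state):
    return state[0] == 0

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state                    # s(v): (stones left, player to move)
        self.parent = parent
        self.action = action                  # a(v): incoming action
        self.children = []
        self.untried = legal_actions(state)   # actions not expanded yet
        self.N = 0                            # N(v): visit count
        self.Q = 0.0                          # Q(v): wins of the player who moved into v

def best_child(v, c):
    """UCT child selection; c balances exploitation and exploration."""
    return max(v.children,
               key=lambda u: u.Q / u.N + c * math.sqrt(2 * math.log(v.N) / u.N))

def expand(v):
    action = v.untried.pop()
    child = Node(step(v.state, action), parent=v, action=action)
    v.children.append(child)
    return child

def tree_policy(v, c):
    """Selection (and expansion): descend until an expandable or terminal node."""
    while not is_terminal(v.state):
        if v.untried:
            return expand(v)
        v = best_child(v, c)
    return v

def default_policy(state, rng):
    """Simulation: random playout; returns the player who took the last stone."""
    while not is_terminal(state):
        state = step(state, rng.choice(legal_actions(state)))
    return 1 - state[1]

def backup(v, winner):
    """Backpropagation: update visit counts and win totals up to the root."""
    while v is not None:
        v.N += 1
        if winner == 1 - v.state[1]:   # the player who moved into v won
            v.Q += 1.0
        v = v.parent

def uct_search(root_state, iterations=2000, c=1 / math.sqrt(2), seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(iterations):
        v = tree_policy(root, c)
        backup(v, default_policy(v.state, rng))
    return best_child(root, 0).action   # c = 0: pick the highest winning rate

print(uct_search((6, 0)))
```

With 6 stones, the winning move is to take 2 and leave the opponent with a multiple of 4, and the search should settle on that action once the small tree has been explored.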

How is MCTS used by Google's AlphaGo?¹

We now understand how MCTS works. Can MCTS be directly applied to computer Go? Yes, but there is a better way. The reason is that Go is a game with a very high branching factor. Consider the number of all possible sequences of moves, \(b^d\), where \(b\) is the game's breadth (number of legal moves per state, \(b \approx 250\) for Go), and \(d\) is the game's depth (game length, \(d \approx 150\) for Go). As a result, exhaustive search is computationally impossible. Applying MCTS to Go in a straightforward way helps, but the benefits of MCTS are not fully exploited, since a limited number of simulations can only scratch the surface of the giant search space.
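To get a sense of the scale, the estimate above works out to

\begin{equation} b^d \approx 250^{150}=10^{150\log_{10}{250}}\approx 10^{150\times 2.398}\approx 10^{360}, \end{equation}

which is far beyond what any exhaustive enumeration could cover.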

Aided by the useful information learned by two deep convolutional neural networks (a.k.a., deep learning), policy network and value network, Google's AlphaGo applies MCTS in an innovative way.

First, for the tree policy that selects a child, instead of using the UCB method, AlphaGo takes into account the prior probability of actions learned by the policy network. More specifically, for node \(v_0\), the child \(v\) is selected by maximizing
\begin{equation} \frac{Q(v)}{N(v)}+\frac{P(v|v_0)}{1+N(v)}, \end{equation}
where \(P(v|v_0)\) is the prior probability provided by the policy network. This greatly improves the child selection policy, and thus gives moves that resemble those of human experts priority in MCTS simulation.
Second, for the default policy to evaluate expanded nodes, AlphaGo combines the outcomes from simulation steps and node values learned by the value network, and their weights are balanced by a constant \(\lambda\).
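In the notation of the AlphaGo paper¹ (these symbols are the paper's, not the ones used earlier in this post), the evaluation of a leaf state \(s_L\) mixes the value network output \(v_\theta(s_L)\) with the outcome \(z_L\) of a fast rollout:

\begin{equation} V(s_L)=(1-\lambda)v_\theta(s_L)+\lambda z_L. \end{equation}

With \(\lambda=0\) only the value network is used, and with \(\lambda=1\) only the rollouts are used; the paper reports that an equal mix, \(\lambda=0.5\), performed best.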

Note that both the policy network and value network are trained offline, which greatly reduces the time cost in a real-time contest.

Acknowledgement

A large majority of this post, including Fig. 1 and the pseudocode, comes from the survey paper². More details about MCTS and its variants can be found there.

References

D. Silver, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 2016.
C. Browne, et al., A Survey of Monte Carlo Tree Search Methods, IEEE Transactions on Computational Intelligence and AI in Games, 2012.

Neural Networks and Deep Learning

2016-04-03T00:00:00-07:00

It has been a long time since the idea of neural networks was proposed, but it is really during the last few years that neural networks have become widely used. One of the major enablers is the infrastructure with high computational capability (e.g., cloud computing), which makes the training of large and deep (multilayer) neural networks possible. This post is in no way an exhaustive review of neural networks or deep learning, but rather an entry-level introduction excerpted from a very popular book¹.

Neural Network (NN) Basics

Let us start with sigmoid neurons, and then find out how they are used in NNs.

Sigmoid Neurons

As the smallest unit in NNs, a sigmoid neuron mimics the behaviour of a real neuron in human neural systems. It takes multiple inputs and generates one single output, as in a process of local or partial decision making. More specifically, given a series of inputs \([x_1,x_2,...]\), a neuron applies the sigmoid function to the weighted sum of the inputs plus a bias, i.e., the output of the neuron is computed as

\begin{equation} \sigma(z)=\sigma(\sum_j{w_jx_j}+b)=\frac{1}{1+\exp(-\sum_j{w_jx_j}-b)}, \end{equation}

where \(z=\sum_j{w_jx_j}+b\) is the weighted input to the neuron, and the sigmoid function, \(\sigma(z)=\frac{1}{1+\exp(-z)}\), approximates the step function commonly used in binary decision making. A natural question is: why don't we just use the step function? The answer is that the step function is not smooth: it is discontinuous at the origin and flat everywhere else, so its derivative carries no useful information, which rules out gradient-based learning. With the smoothed version of the step function, small changes in the weights and bias produce a small change in the output, which to first order is

\begin{equation} \Delta{output}\approx\sum_j{\frac{\partial{output}}{\partial{w_j}}\Delta{w_j}}+\frac{\partial{output}}{\partial{b}}\Delta{b}. \end{equation}
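The relation above can be checked numerically. The sketch below computes the output of a single sigmoid neuron and compares the actual change in the output with the first-order estimate; the weights, inputs and perturbations are arbitrary illustrative numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, bias):
    """Sigmoid of the weighted sum of the inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def linear_change(weights, inputs, bias, d_weights, d_bias):
    """First-order estimate of the change in the output caused by
    small changes d_weights and d_bias."""
    a = neuron_output(weights, inputs, bias)
    slope = a * (1.0 - a)   # sigma'(z) = sigma(z) * (1 - sigma(z))
    # d(output)/d(w_j) = sigma'(z) * x_j and d(output)/d(b) = sigma'(z)
    return sum(slope * x * dw for x, dw in zip(inputs, d_weights)) + slope * d_bias

weights, inputs, bias = [0.5, -0.3], [1.0, 2.0], 0.1
d_weights, d_bias = [0.001, -0.002], 0.0005

before = neuron_output(weights, inputs, bias)
after = neuron_output([w + dw for w, dw in zip(weights, d_weights)],
                      inputs, bias + d_bias)
actual = after - before
estimate = linear_change(weights, inputs, bias, d_weights, d_bias)
print(actual, estimate)
```

The two numbers agree up to a second-order term, and this is the property that lets gradient methods nudge the weights and biases towards a better output.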

The Architecture of NNs

The architecture of a typical NN is depicted in Fig. 1. As shown, the leftmost layer in this network is called the input layer, and the neurons within this layer are called input neurons. The rightmost or output layer contains the output neuron(s). The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs.

Fig. 1: The neural network architecture.

It has been proved that NNs with a single hidden layer can approximate any continuous function to any desired precision. Click here to see how. It can be expected that an NN with more neurons/layers could be more accurate in the approximation.

Learning with Gradient Descent

Before training a model, we first need a way to quantify how well the model is doing. That is, we need to introduce a cost function. Although there are many cost functions available, we will start with the quadratic cost function, which is defined as

\begin{equation} C(w,b)=\frac{1}{2n}\sum_x{||y(x)-a||^2}, \end{equation}

where \(w\) denotes the collection of all weights in the network, \(b\) all the biases, \(n\) is the total number of training inputs, \(y(x)\) is the desired output for input \(x\), \(a\) is the vector of outputs from the network when \(x\) is input, and the sum is over all training inputs, \(x\).

Applying the gradient descent method, we can learn the weights and biases by

\begin{equation} w_k'=w_k-\eta\frac{\partial{C}}{\partial{w_k}}, \end{equation}

\begin{equation} b_l'=b_l-\eta\frac{\partial{C}}{\partial{b_l}}. \end{equation}

To compute each gradient exactly, we need to take all the training inputs \(x\) into account in each iteration, which slows learning down when the training set is large. An idea called stochastic gradient descent (a.k.a. mini-batch learning) can be used to speed up learning: estimate the gradient by computing it over a small sample of randomly chosen training inputs.
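The mini-batch idea can be sketched on the simplest possible model. To keep the example short, it fits a single linear unit \(wx+b\) (not a sigmoid network) to noise-free data with the quadratic cost; the data, the learning rate and the batch size are illustrative assumptions.

```python
import random

def sgd_linear_fit(data, eta=0.1, epochs=300, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent
    on the quadratic cost."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)   # a new random partition into mini-batches
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            m = len(batch)
            # Gradient of (1 / 2m) * sum (w x + b - y)^2 over the mini-batch only
            grad_w = sum((w * x + b - y) * x for x, y in batch) / m
            grad_b = sum((w * x + b - y) for x, y in batch) / m
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

# Hypothetical noise-free training set drawn from y = 2x + 1
points = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]
w, b = sgd_linear_fit(points)
print(w, b)
```

Each update looks at only four examples instead of all twenty, yet the estimates still drift towards \(w=2\) and \(b=1\), because the mini-batch gradient is an unbiased estimate of the full gradient.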

The Backpropagation Algorithm: How to Compute Gradients of the Cost Function?

So far we have seen that the model parameters can be learned by the gradient descent method, but the computation of the gradients can itself be challenging, since both the network and the data set can be very large. In this section, we will see how the backpropagation algorithm helps compute the gradients efficiently.

Matrix Notation for NNs

For ease of presentation, let us define some notations first.

\(w_{jk}^l\): weight from the \(k\)th neuron in layer \(l-1\) to the \(j\)th neuron in layer \(l\);

\(w^l=\{w_{jk}^l\}\): matrix including all weights from each neuron in layer \(l-1\) to each neuron in layer \(l\);

\(b_j^l\): bias for the \(j\)th neuron in layer \(l\);

\(b^l=\{b_j^l\}\): column vector including all biases for each neuron in layer \(l\);

\(a_j^l=\sigma(\sum_k{w_{jk}^la_k^{l-1}}+b_j^l)\): activation of the \(j\)th neuron in layer \(l\);

\(a^l=\{a_j^l\}=\sigma(w^la^{l-1}+b^l)\): column vector including all activations of each neuron in layer \(l\);

\(z_j^l=\sum_k{w_{jk}^la_k^{l-1}}+b_j^l\): weighted input to the \(j\)th neuron in layer \(l\);

\(z^l=\{z_j^l\}=w^la^{l-1}+b^l\): column vector including all weighted inputs to each neuron in layer \(l\);

\(\delta_j^l=\frac{\partial{C}}{\partial{z_j^l}}\): gradient of the cost function w.r.t. the weighted input to the \(j\)th neuron in layer \(l\), \(z_j^l\);

\(\delta^l=\{\delta_j^l\}\): column vector including all gradients of the cost function w.r.t. the weighted input to each neuron in layer \(l\).

Four Fundamental Equations behind Backpropagation

There are four fundamental equations behind backpropagation, which will be explained one by one as below.

First, the gradient of the cost function w.r.t. the weighted input to each neuron in output layer \(L\) can be computed as

\begin{equation} \delta^L=\nabla_{a^L}C \odot \sigma'(z^L), \end{equation}

where \(\nabla_{a^L}C=\{\frac{\partial{C}}{\partial{a_j^L}}\}\) is defined to be a column vector whose components are the partial derivatives \(\frac{\partial{C}}{\partial{a_j^L}}\), \(\sigma'(z)\) is the first-order derivative of the sigmoid function \(\sigma(z)\), and \(\odot\) represents an element-wise product.

Second, the gradient of the cost function w.r.t. the weighted input to each neuron in layer \(l\) (\(l<L\)) can be computed from the results of layer \(l+1\) (backpropagation), i.e.,

\begin{equation} \delta^l=((w^{l+1})^T\delta^{l+1}) \odot \sigma'(z^l). \end{equation}

Third, the gradient of the cost function w.r.t. each bias can be computed as

\begin{equation} \frac{\partial{C}}{\partial{b_j^l}}=\delta_j^l. \end{equation}

Fourth, the gradient of the cost function w.r.t. each weight can be computed as

\begin{equation} \frac{\partial{C}}{\partial{w_{jk}^l}}=\delta_j^la_k^{l-1}. \end{equation}

The four equations above are not straightforward at first sight, but they are all consequences of the chain rule from multivariable calculus. The proof can be found here.
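As a small taste of the proof, the third and fourth equations follow directly from the definition \(z_j^l=\sum_k{w_{jk}^la_k^{l-1}}+b_j^l\) and the chain rule:

\begin{equation} \frac{\partial{C}}{\partial{b_j^l}}=\frac{\partial{C}}{\partial{z_j^l}}\frac{\partial{z_j^l}}{\partial{b_j^l}}=\delta_j^l\cdot 1,~~\frac{\partial{C}}{\partial{w_{jk}^l}}=\frac{\partial{C}}{\partial{z_j^l}}\frac{\partial{z_j^l}}{\partial{w_{jk}^l}}=\delta_j^la_k^{l-1}. \end{equation}

Only \(z_j^l\) depends on \(b_j^l\) and \(w_{jk}^l\) within layer \(l\), which is why no sum over neurons appears.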

The Backpropagation Algorithm

The backpropagation algorithm essentially includes a feedforward process and a backpropagation process. More specifically, in each iteration:

Input a mini-batch of \(m\) training examples;
For each training example \(x\):
- Initialization: set the activations of the input layer by \(a^{x,1}=x\);
- Feedforward: for each \(l=2,3,...,L\), compute \(z^{x,l}=w^la^{x,l-1}+b^l\) and \(a^{x,l}=\sigma(z^{x,l})\);
- Output error: compute the error vector \(\delta^{x,L}=\nabla_{a^L}C_x \odot \sigma'(z^{x,L})\);
- Backpropagate the error: for \(l=L-1,L-2,...,2\), compute \(\delta^{x,l}=((w^{l+1})^T\delta^{x,l+1}) \odot \sigma'(z^{x,l})\).
Compute gradients and apply the gradient descent method by
\begin{equation} \frac{\partial{C_x}}{\partial{w_{jk}^l}}=\delta_j^{x,l}a_k^{x,l-1},~~\frac{\partial{C_x}}{\partial{b_j^l}}=\delta_j^{x,l},~~w^l=w^l-\frac{\eta}{m}\sum_x{\delta^{x,l}(a^{x,l-1})^T},~~b^l=b^l-\frac{\eta}{m}\sum_x{\delta^{x,l}}. \end{equation}
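The algorithm above can be sketched in plain Python for a small fully connected sigmoid network with the quadratic cost. This is a minimal illustration rather than an efficient implementation: the network shape, the initial parameters and the two training examples are arbitrary assumptions, and lists are indexed from 0, so `weights[l]` connects `activations[l]` to `activations[l + 1]`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(weights, biases, x):
    """Return all activations, starting with the input layer."""
    activations = [x]
    for w, b in zip(weights, biases):
        z = [sum(w_jk * a_k for w_jk, a_k in zip(row, activations[-1])) + b_j
             for row, b_j in zip(w, b)]
        activations.append([sigmoid(z_j) for z_j in z])
    return activations

def quadratic_cost(weights, biases, x, y):
    a = feedforward(weights, biases, x)[-1]
    return 0.5 * sum((a_j - y_j) ** 2 for a_j, y_j in zip(a, y))

def backprop(weights, biases, x, y):
    """Gradients of the quadratic cost C_x w.r.t. all weights and biases."""
    activations = feedforward(weights, biases, x)
    # Output error: (a^L - y) * sigma'(z^L), with sigma'(z) = a * (1 - a)
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(activations[-1], y)]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta
        grad_w[l] = [[d_j * a_k for a_k in activations[l]] for d_j in delta]
        if l > 0:
            # Backpropagate the error: (w^T delta) * sigma'(z) of the previous layer
            a_prev = activations[l]
            delta = [sum(weights[l][j][k] * delta[j] for j in range(len(delta)))
                     * a_prev[k] * (1.0 - a_prev[k])
                     for k in range(len(a_prev))]
    return grad_w, grad_b

def update_mini_batch(weights, biases, batch, eta):
    """One gradient descent step averaged over a mini-batch (in place)."""
    grads = [backprop(weights, biases, x, y) for x, y in batch]
    m = len(batch)
    for l in range(len(weights)):
        for j in range(len(weights[l])):
            biases[l][j] -= eta / m * sum(gb[l][j] for _, gb in grads)
            for k in range(len(weights[l][j])):
                weights[l][j][k] -= eta / m * sum(gw[l][j][k] for gw, _ in grads)

# Hypothetical 2-3-1 network and a mini-batch of two examples
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
batch = [([0.5, -1.0], [1.0]), ([1.0, 0.5], [0.0])]

cost_before = sum(quadratic_cost(weights, biases, x, y) for x, y in batch)
for _ in range(100):
    update_mini_batch(weights, biases, batch, eta=0.5)
cost_after = sum(quadratic_cost(weights, biases, x, y) for x, y in batch)
print(cost_before, cost_after)
```

A useful sanity check for any backpropagation code is to compare one of its gradients with a finite-difference estimate of the cost, since the two should agree closely.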

Deep Learning

Everything seems to be going well so far! But what if our NNs are deep, i.e., have many hidden layers? Typically we expect a deep NN to deliver better performance than a shallow one. Unfortunately, it has been observed that in deep NNs the learning speeds of the first few layers of neurons can be much higher or lower than those of the last few layers, in which case the NNs cannot be trained well. Click here to see why. This is known as the vanishing/exploding gradient problem. To address this problem, it is suggested that we resort to convolutional networks.

Convolutional Networks

Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. A typical convolutional network is depicted in Fig. 2, where the network includes one input layer, one convolutional layer, one pooling layer and one output layer (fully connected). We will try to understand the three ideas using this network as an example.

Fig. 2: The convolutional networks.

Local receptive fields: Given a 28x28 image as the input layer, instead of directly connecting to all 28x28 pixels, the convolutional layer only connects to small, localized regions (local receptive fields) of the input image in the hope of detecting some localized features. In this example, the small region size is 5x5, so exploring all those regions with a stride length of one results in 24x24 neurons (28 - 5 + 1 = 24 positions along each dimension). Here we have three feature maps, so the convolutional layer eventually has 3x24x24 neurons.

Shared weights and biases: We use the same weights and bias for each neuron in a 24x24 feature map. It makes sense to use the same parameters in detecting the same feature but only in different local regions, and it can greatly reduce the number of parameters involved in a convolutional network compared to a fully connected layer.

The pooling layer: A pooling layer is usually used immediately after a convolutional layer. What the pooling layer does is simplify the information in the output from the convolutional layer. In this example, each unit in the pooling layer summarizes a 2x2 region of the previous convolutional layer, which results in 3x12x12 neurons. The pooling can be max-pooling (taking the maximum activation of the region to be summarized) or L2 pooling (taking the square root of the sum of the squares of the activations in the region to be summarized).
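
To make the shapes in this example concrete, here is a small NumPy sketch of one convolutional layer followed by 2x2 pooling. The function names `conv2d_valid` and `pool2x2` are made up for illustration, the kernels are random rather than learned, and the activation function is left out to keep the focus on receptive fields, weight sharing and pooling.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide one shared kernel over the image with stride 1 and no padding."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output neuron only sees its local receptive field,
            # and all neurons in the map share the same kernel and bias.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out

def pool2x2(feature_map, mode="max"):
    """Summarize non-overlapping 2x2 regions (height and width must be even)."""
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    # L2 pooling: square root of the sum of squares in each region.
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernels = rng.normal(size=(3, 5, 5))   # three feature maps, 5x5 weights each
conv = np.stack([conv2d_valid(image, k) for k in kernels])
pooled = np.stack([pool2x2(fm, "max") for fm in conv])
print(conv.shape, pooled.shape)        # (3, 24, 24) (3, 12, 12)
```

With shared weights, the three feature maps need only 3x(5x5+1)=78 parameters in total, independent of the image size.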

To summarize, a convolutional network tries to detect localized features and at the same time greatly reduces the number of parameters. In principle, we could include any number of convolutional/pooling layers and also fully connected layers in our networks. What is exciting is that the backpropagation algorithm can be applied to convolutional networks with only minor modifications.

Do convolutional networks avoid the vanishing/exploding gradient problem? Not really, but they help us to proceed anyway, e.g., by reducing the number of parameters. The problem can be considerably alleviated by convolutional/pooling layers together with other techniques, such as powerful regularization, advanced artificial neurons and more training epochs made affordable by GPUs.

Deep Learning in Practice

It may not be that hard to construct a very basic (deep) NN and achieve decent performance. However, there are still many techniques that could help to improve the NN's performance. A few insightful ideas are listed below, which should be kept in mind if we are working towards a better NN.

Use more convolutional/pooling layers: reduce the number of parameters and make learning faster;
Use the right cost function: resolve the learning slowdown problem with the cross-entropy cost function;
Use regularization: combat overfitting by L2/L1/drop-out regularization;
Use advanced neurons: for example, tanh neurons and rectified linear units;
Use good weight initialization: avoid neuron saturation and learn faster;
Use an expanded training data set: for example, rotate/shift images as new training data in image classification;
Use an ensemble of NNs: heavy computation, but multiple models beat one;
Use GPUs: afford more training epochs.

Acknowledgement

A large majority of this post comes from Michael Nielsen's book¹ entitled "Neural Networks and Deep Learning", which I strongly recommend to anyone interested in discovering how neural networks actually work.

References

M. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.

Latent Dirichlet Allocation and Topic Modeling

2016-03-26T00:00:00-07:00

When reading an article, we humans are able to easily identify the topics the article talks about. An interesting question is: can we automate this process, i.e., train a machine to find out the underlying topics in articles? In this post, a very popular topic modeling method, Latent Dirichlet allocation (LDA), will be discussed.

Latent Dirichlet Allocation (LDA) Topic Model¹

Given a library of \(M\) documents, \(\mathcal{L}=\{d_1,d_2,...,d_M\}\), where each document \(d_m\) contains a sequence of words, \(d_m=\{w_{m,1},w_{m,2},...,w_{m,N_m}\}\), we need to think of a model which describes how essentially these documents are generated. Considering \(K\) topics and a vocabulary \(V\), the LDA topic model assumes that the documents are generated by the following two steps:

For each document \(d_m\), use a doc-to-topic model parameterized by \(\boldsymbol\vartheta_m\) to generate the topic for the \(n\)th word and denote it as \(z_{m,n}\), for all \(1 \leq n \leq N_m\);
For each generated topic \(k=z_{m,n}\) corresponding to each word in each document, use a topic-to-word model parameterized by \(\boldsymbol\varphi_k\) to generate the word \(w_{m,n}\).

Fig. 1: LDA topic model.

The two steps are graphically illustrated in Fig. 1. Considering that the doc-to-topic model and the topic-to-word model essentially follow multinomial distributions (counts of each topic in a document or each word in a topic), a good prior for their parameters, \(\boldsymbol\vartheta_m\) and \(\boldsymbol\varphi_k\), would be the conjugate prior of multinomial distribution, Dirichlet distribution.

A conjugate prior, \(p(\boldsymbol\varphi)\), of a likelihood, \(p(\textbf{x}|\boldsymbol\varphi)\), is a distribution that results in a posterior distribution, \(p(\boldsymbol\varphi|\textbf{x})\) with the same functional form as the prior (but different parameters). For example, the conjugate prior of a multinomial distribution is Dirichlet distribution. That is, for a multinomial distribution parameterized by \(\boldsymbol\varphi\), if the prior for \(\boldsymbol\varphi\) is a Dirichlet distribution characterized by \(Dir(\boldsymbol\varphi|\boldsymbol\alpha)\), after observing \(\textbf{x}\), the posterior for \(\boldsymbol\varphi\) still follows a Dirichlet distribution \(Dir(\boldsymbol\varphi|\textbf{n}_x+\boldsymbol\alpha)\), but incorporating the counting result \(\textbf{n}_x\) of observation \(\textbf{x}\).
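
As a tiny numerical illustration of this conjugacy (the numbers below are made up), the posterior parameters are simply the prior pseudo counts plus the observed counts:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])            # Dirichlet prior (pseudo counts)
x = np.array([0, 2, 2, 1, 2, 2])             # observed outcomes (category indices)
n_x = np.bincount(x, minlength=len(alpha))   # counting result n_x
posterior = n_x + alpha                      # parameters of Dir(phi | n_x + alpha)
posterior_mean = posterior / posterior.sum() # expected value of phi under the posterior
print(n_x, posterior, posterior_mean)
```

Here the counts are (1, 1, 4), so the posterior is \(Dir(\boldsymbol\varphi|(2,2,5))\) with mean (2/9, 2/9, 5/9).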

Keeping this in mind, let us take a closer look at the two steps:

In the first step, for the \(m\)th document, assume that the prior for the doc-to-topic model's parameter \(\boldsymbol\vartheta_m\) follows \(Dir(\boldsymbol\vartheta_m|\boldsymbol\alpha)\). After observing the topics in the document and obtaining the counting result \(\textbf{n}_m\), we have the posterior for \(\boldsymbol\vartheta_m\) as \(Dir(\boldsymbol\vartheta_m|\textbf{n}_m+\boldsymbol\alpha)\). After some calculation, we can obtain the topic distribution for the \(m\)th document as
\begin{equation} p(\textbf{z}_m|\boldsymbol\alpha)=\frac{\Delta(\textbf{n}_m+\boldsymbol\alpha)}{\Delta(\boldsymbol\alpha)}, \end{equation}
where \(\Delta(\boldsymbol\alpha)\) is the normalization factor for \(Dir(\textbf{p}|\boldsymbol\alpha)\), i.e., \(\Delta(\boldsymbol\alpha)=\int{\prod_{k=1}^K{p_k^{\alpha_k-1}}}d\textbf{p}\). Taking all documents into account,
\begin{equation} \label{Eqn:Doc2Topic} p(\textbf{z}|\boldsymbol\alpha)=\prod_{m=1}^M{p(\textbf{z}_m|\boldsymbol\alpha)}=\prod_{m=1}^M{\frac{\Delta(\textbf{n}_m+\boldsymbol\alpha)}{\Delta(\boldsymbol\alpha)}}. \end{equation}
In the second step, similarly, for the \(k\)th topic, assume that the prior for the topic-to-word model's parameter \(\boldsymbol\varphi_k\) follows \(Dir(\boldsymbol\varphi_k|\boldsymbol\beta)\). After observing the words in the topic and obtaining the counting result \(\textbf{n}_k\), we have the posterior for \(\boldsymbol\varphi_k\) as \(Dir(\boldsymbol\varphi_k|\textbf{n}_k+\boldsymbol\beta)\). After some calculation, we can obtain the word distribution for the \(k\)th topic as
\begin{equation} p(\textbf{w}_k|\textbf{z}_k,\boldsymbol\beta)=\frac{\Delta(\textbf{n}_k+\boldsymbol\beta)}{\Delta(\boldsymbol\beta)}. \end{equation}
Taking all topics into account,
\begin{equation} \label{Eqn:Topic2Word} p(\textbf{w}|\textbf{z},\boldsymbol\beta)=\prod_{k=1}^K{p(\textbf{w}_k|\textbf{z}_k,\boldsymbol\beta)}=\prod_{k=1}^K{\frac{\Delta(\textbf{n}_k+\boldsymbol\beta)}{\Delta(\boldsymbol\beta)}}. \end{equation}
Combining (\ref{Eqn:Doc2Topic}) and (\ref{Eqn:Topic2Word}), we have
\begin{equation} \label{Eqn:Joint_Distribution} p(\textbf{w},\textbf{z}|\boldsymbol\alpha,\boldsymbol\beta)=p(\textbf{w}|\textbf{z},\boldsymbol\beta)p(\textbf{z}|\boldsymbol\alpha)=\prod_{k=1}^K{\frac{\Delta(\textbf{n}_k+\boldsymbol\beta)}{\Delta(\boldsymbol\beta)}}\prod_{m=1}^M{\frac{\Delta(\textbf{n}_m+\boldsymbol\alpha)}{\Delta(\boldsymbol\alpha)}}. \end{equation}

Joint Distribution Emulation by Gibbs Sampling²

So far we know that the documents can be characterized by a joint distribution of topics and words as shown in (\ref{Eqn:Joint_Distribution}). The words are given, but the associated topics are not. The question now is how to properly assign a topic to each word in each document, such that the result best fits the joint distribution in (\ref{Eqn:Joint_Distribution}). This is a typical problem that can be solved by Gibbs sampling.

Gibbs sampling, a special case of Markov chain Monte Carlo (MCMC) sampling, is a method to emulate a high-dimensional probability distribution \(p(\textbf{x})\) by the stationary behaviour of a Markov chain. A typical Gibbs sampler works as follows: (i) initialize \(\textbf{x}\); (ii) repeat until convergence: for all \(i\), sample \(x_i\) from \(p(x_i|\textbf{x}_{\neg i})\), where \(\neg i\) indicates excluding the \(i\)th dimension. According to the stationary behaviour of a Markov chain, a sufficiently large collection of samples after convergence would approximate the desired distribution \(p(\textbf{x})\) well.

In light of the observations above, to apply Gibbs sampling, it is essential to calculate \(p(x_i|\textbf{x}_{\neg i})\). In our case, it is \(p(z_i=k|\textbf{z}_{\neg i},\textbf{w})\), where \(i=(m,n)\) is a two dimensional coordinate indicating the \(n\)th word of the \(m\)th document. Since \(z_i=k,w_i=t\) involves the \(m\)th document and the \(k\)th topic only, \(p(z_i=k|\textbf{z}_{\neg i},\textbf{w})\) eventually depends only on two probabilities: (i) the probability of document \(m\) emitting topic \(k\), \(\hat{\vartheta}_{mk}\); (ii) the probability of topic \(k\) emitting word \(t\), \(\hat{\varphi}_{kt}\). Formally,

\begin{equation} \label{Eqn:Gibbs_Sampling} p(z_i=k|\textbf{z}_{\neg i},\textbf{w}) \propto \hat{\vartheta}_{mk}\hat{\varphi}_{kt}=\frac{n_{m,\neg i}^{(k)}+\alpha_k}{\sum_{k=1}^K{(n_{m,\neg i}^{(k)}+\alpha_k)}}\frac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{t=1}^V{(n_{k,\neg i}^{(t)}+\beta_t)}}, \end{equation}

where \(n_m^{(k)}\) is the count of topic \(k\) in document \(m\), \(n_k^{(t)}\) is the count of word \(t\) for topic \(k\), and \(\neg i\) indicates that \(w_i\) should not be counted. Besides, \(\alpha_k\) and \(\beta_t\) are the prior knowledge (pseudo counts) for topic \(k\) and word \(t\), respectively. The underlying physical meaning is that (\ref{Eqn:Gibbs_Sampling}) actually characterizes a word-generating path \(p(topic~k|doc~m)p(word~t|topic~k)\).

LDA: Training and Inference²

With the LDA model built, we want to: (i) estimate the model parameters, \(\boldsymbol\vartheta_m\) and \(\boldsymbol\varphi_k\), from training documents; (ii) find out the topic distribution, \(\boldsymbol\vartheta_{new}\), for each new document.

The training procedure is:

Initialization: assign a topic to each word in each document randomly;
For each word in each document, update its topic by the Gibbs sampling equation (\ref{Eqn:Gibbs_Sampling});
Repeat 2 until the Gibbs sampling converges;
Calculate the topic-to-word model parameters by \(\hat{\varphi}_{kt}=\frac{n_k^{(t)}+\beta_t}{\sum_{v=1}^V{(n_k^{(v)}+\beta_v)}}\), and save them as the model parameters.
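
The training procedure above can be sketched in a few lines of NumPy. This is a simplified illustration, not a reference implementation: it assumes symmetric priors (a single scalar `alpha` and `beta`), runs a fixed number of sweeps `n_iter` instead of testing for convergence, and the function name `lda_gibbs` is made up. The document-side denominator in (\ref{Eqn:Gibbs_Sampling}) does not depend on \(k\), so it is dropped before normalizing.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of lists of word ids in [0, V).
    Returns (phi, theta, z): topic-to-word and doc-to-topic estimates,
    and the final topic assignment of every word.
    """
    rng = np.random.default_rng(seed)
    n_mk = np.zeros((len(docs), K))   # count of topic k in document m
    n_kt = np.zeros((K, V))           # count of word t under topic k
    n_k = np.zeros(K)                 # total number of words under topic k
    z = []
    # Step 1: assign a random topic to every word.
    for m, doc in enumerate(docs):
        z_m = rng.integers(K, size=len(doc))
        z.append(z_m)
        for t, k in zip(doc, z_m):
            n_mk[m, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1
    # Steps 2-3: resample every topic assignment, n_iter sweeps in total.
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]
                # Exclude the current word from the counts (the "not i" part).
                n_mk[m, k] -= 1
                n_kt[k, t] -= 1
                n_k[k] -= 1
                p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][n] = k
                n_mk[m, k] += 1
                n_kt[k, t] += 1
                n_k[k] += 1
    # Step 4: point estimates of the model parameters.
    phi = (n_kt + beta) / (n_kt.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta, z
```

For example, `phi, theta, z = lda_gibbs([[0, 1, 0, 1], [2, 3, 3, 2]], K=2, V=4)` returns a 2x4 matrix `phi` whose rows are the estimated word distributions of the two topics.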

Once the LDA model is trained, we are ready to analyze the topic distribution of any new document. The inference works in the following procedure:

Initialization: assign a topic to each word in the new document randomly;
For each word in the new document, update its topic by the Gibbs sampling equation (\ref{Eqn:Gibbs_Sampling}) (Note that the \(\hat{\varphi}_{kt}\) part is directly available from the trained model, and only \(\hat{\vartheta}_{mk}\) needs to be calculated regarding the new document);
Repeat 2 until the Gibbs sampling converges;
Calculate the topic distribution by \(\hat{\vartheta}_{new,k}=\frac{n_{new}^{(k)}+\alpha_k}{\sum_{k=1}^K{(n_{new}^{(k)}+\alpha_k)}}\).
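
The inference procedure can be sketched in the same way. The snippet below is again only an illustration with assumptions: a symmetric scalar `alpha`, a fixed number of sweeps, a made-up function name `lda_infer`, and a hand-written 2x4 matrix `phi` standing in for a trained topic-to-word model.

```python
import numpy as np

def lda_infer(doc, phi, alpha=0.1, n_iter=100, seed=0):
    """Estimate the topic distribution of a new document given a trained phi (K x V)."""
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    # Step 1: random initial topic for each word of the new document.
    z = rng.integers(K, size=len(doc))
    n_k = np.bincount(z, minlength=K).astype(float)   # topic counts in this document
    # Steps 2-3: resample topics; phi stays fixed, only the document counts change.
    for _ in range(n_iter):
        for n, t in enumerate(doc):
            n_k[z[n]] -= 1
            p = (n_k + alpha) * phi[:, t]
            z[n] = rng.choice(K, p=p / p.sum())
            n_k[z[n]] += 1
    # Step 4: topic distribution of the new document.
    return (n_k + alpha) / (n_k.sum() + K * alpha)

# Hypothetical trained model: topic 0 prefers words 0 and 1, topic 1 prefers words 2 and 3.
phi = np.array([[0.45, 0.45, 0.05, 0.05],
                [0.05, 0.05, 0.45, 0.45]])
theta_new = lda_infer([0, 1, 0, 0, 1], phi)
print(theta_new)
```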

There are multiple open-source LDA implementations available online. To learn how LDA could be implemented, a Python implementation can be found here.

LDA vs. Probabilistic Latent Semantic Analysis (PLSA)

PLSA is a maximum likelihood (ML) model, while LDA is a maximum a posteriori (MAP) model (Bayesian estimation). That said, LDA would reduce to PLSA if a uniform Dirichlet prior is used. LDA is actually more complex than PLSA, so what could be its key advantage? The answer is the PRIOR! LDA would outperform PLSA if there is a good prior for the data and the data by itself is not sufficient to train a model well.

References

G. Heinrich, Parameter estimation for text analysis, 2005.
Z. Jin, LDA Topic Modeling (in Chinese), accessed on Mar 26, 2016.

Hidden Markov Model and Part of Speech Tagging

2016-03-19T00:00:00-07:00

In a Markov model, we generally assume that the states are directly observable or one state corresponds to one observation/event only. However, this is not always true. A good example would be: in speech recognition, we are supposed to identify a sequence of words given a sequence of utterances, in which case the states (words) are not directly observable and one single state (word) could have different observations (utterances). This is a perfect example that could be treated as a hidden Markov model (HMM), by which the hidden states can be inferred from the observations.

Elements of a Hidden Markov Model (HMM)¹

A hidden Markov model, \(\Phi\), typically includes the following elements:

Time: \(t=\{1,2,...,T\}\);
\(N\) States: \(Q=\{1,2,...,N\}\);
\(M\) Observations: \(O=\{1,2,...,M\}\);
Initial Probabilities: \(\pi_i=p(q_1=i),~1 \leq i \leq N\);
Transition Probabilities: \(a_{ij}=p(q_{t+1}=j|q_t=i),~1 \leq i,j \leq N\);
Observation Probabilities: \(b_j(k)=p(o_t=k|q_t=j)~1 \leq j \leq N, 1 \leq k \leq M\).

The entire model can be characterized by \(\Phi=(A,B,\pi)\), where \(A=\{a_{ij}\}\), \(B=\{b_j(k)\}\) and \(\pi=\{\pi_i\}\). The states are "hidden", since they are not directly observable, but reflected in observations with uncertainty.

Three Basic Problems for HMMs¹

There are three basic problems that are very important to real-world applications of HMMs:

Problem 1: Evaluation Problem

Given the observation sequence \(O=o_1o_2...o_T\) and a model \(\Phi=(A,B,\pi)\), how to efficiently compute the probability of the observation sequence given the model, i.e., \(p(O|\Phi)\)?

Let

\begin{equation} \alpha_t(i)=p(o_1o_2...o_t,q_t=i|\Phi) \end{equation}

denote the probability that the state is \(i\) at time \(t\) and we have a sequence of observations \(o_1o_2...o_t\). The evaluation problem can be solved by the forward algorithm as illustrated below:

Base case:
\begin{equation} \alpha_1(i)=p(o_1,q_1=i|\Phi)=p(o_1|q_1=i,\Phi)p(q_1=i|\Phi)=\pi_ib_i(o_1),~1 \leq i \leq N; \end{equation}
Induction:
\begin{equation} \alpha_{t+1}(j)=\left[\sum_{i=1}^N{\alpha_{t}(i)a_{ij}}\right]b_j(o_{t+1}),~1 \leq j \leq N; \end{equation}
Termination:
\begin{equation} p(O|\Phi)=\sum_{i=1}^N{\alpha_T(i)}, \end{equation}

\begin{equation} p(q_T=i|O,\Phi)=\frac{\alpha_T(i)}{\sum_{j=1}^N{\alpha_T(j)}}. \end{equation}

The algorithm above essentially applies dynamic programming, and its complexity is \(O(N^2T)\).
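
A minimal NumPy sketch of the forward algorithm is given below. States and observations are indexed from 0 rather than 1, and the two-state model in the example is made up purely for illustration.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return alpha (T x N) and p(O | Phi) for a discrete HMM.

    pi: (N,) initial probabilities, A[i, j] = a_ij, B[j, k] = b_j(k),
    obs: sequence of observation indices in [0, M).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # base case
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction
    return alpha, alpha[-1].sum()                     # termination

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
alpha, prob = forward(pi, A, B, [0, 1])
print(prob)   # approximately 0.2156
```

Each induction step is a vector-matrix product costing \(O(N^2)\), which gives the \(O(N^2T)\) total. For long sequences the \(\alpha\) values become very small, so practical implementations usually rescale them or work in log space.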

Problem 2: Decoding Problem

Given the observation sequence \(O=o_1o_2...o_T\) and a model \(\Phi=(A,B,\pi)\), how to choose the "best" state sequence \(Q=q_1q_2...q_T\) (the most probable path) in terms of how well it explains the observations?

Define

\begin{equation} v_t(i)=\max_{q_1q_2...q_{t-1}}{p(q_1q_2...q_{t-1},q_t=i,o_1o_2...o_t|\Phi)} \end{equation}

as the probability of the best state sequence that ends in state \(i\) at time \(t\) and produces the observations \(o_1o_2...o_t\). The decoding problem can be solved by the Viterbi algorithm as illustrated below:

Base case:
\begin{equation} v_1(i)=p(q_1=i,o_1|\Phi)=p(o_1|q_1=i,\Phi)p(q_1=i|\Phi)=\pi_ib_i(o_1),~1 \leq i \leq N; \end{equation}
Induction:
\begin{equation} v_{t+1}(j)=\left[\max_i{v_{t}(i)a_{ij}}\right]b_j(o_{t+1}),~1 \leq j \leq N, \end{equation}
in which the optimal \(i\) from the maximization should be stored properly for backtracking;
Termination: The best state sequence can be determined by first finding the optimal final state
\begin{equation} q_T={\arg \, \max}_i{v_T(i)}, \end{equation}
and then backtracking all the way to the initial state.

The algorithm above also applies dynamic programming, and its complexity is \(O(N^2T)\) as well.
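
The Viterbi recursion with backtracking can be sketched as follows. Indices start at 0, and the two-state model is a made-up example, not taken from any data set.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most probable state path and its joint probability with obs."""
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)        # stored argmax for backtracking
    v[0] = pi * B[:, obs[0]]                  # base case
    for t in range(1, T):
        scores = v[t - 1][:, None] * A        # scores[i, j] = v_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    # Termination: best final state, then follow the stored pointers backwards.
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], v[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
path, best_prob = viterbi(pi, A, B, [0, 1, 1])
print(path, best_prob)   # [0, 1, 1] with probability approximately 0.04374
```

Compared with the forward algorithm, the only changes are that the sum over previous states becomes a maximum and that the maximizing state is remembered at every step.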

Problem 3: Model Learning

Given the observation sequence \(O=o_1o_2...o_T\), how to find the model \(\Phi=(A,B,\pi)\) that maximizes \(p(O|\Phi)\)?

A general maximum likelihood (ML) learning approach could determine the optimal \(\Phi\) as

\begin{equation} \hat{\Phi}={\arg \, \max}_{\Phi}{p(O|\Phi)}. \end{equation}

It is much easier to perform supervised learning, where the true states are tagged to each observation. Given \(V\) training sequences in total, the model parameters can be estimated as

\begin{equation} \label{Eqn:Supervised_Learning} \hat{a}_{ij}=\frac{Count(q:i \rightarrow j)}{Count(q:i)},~~\hat{b}_j(k)=\frac{Count(q:j,o:k)}{Count(q:j)},~~\hat{\pi}_i=\frac{Count(q_1=i)}{V}. \end{equation}
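
The counting estimates above can be sketched with plain Python counters. Two choices here are assumptions made for this illustration: the function name `supervised_hmm` is made up, and \(Count(q:i)\) in the denominator of \(\hat{a}_{ij}\) is taken as the number of transitions leaving state \(i\), so that each row of transition probabilities sums to one. No smoothing is applied.

```python
from collections import Counter

def supervised_hmm(tagged_sequences):
    """Estimate (pi, a, b) by counting over sequences of (state, observation) pairs."""
    trans, emit, first = Counter(), Counter(), Counter()
    from_count, state_count = Counter(), Counter()
    for seq in tagged_sequences:
        states = [q for q, _ in seq]
        first[states[0]] += 1
        for q, o in seq:
            emit[(q, o)] += 1
            state_count[q] += 1
        for q, q_next in zip(states, states[1:]):
            trans[(q, q_next)] += 1
            from_count[q] += 1
    V = len(tagged_sequences)
    pi = {q: c / V for q, c in first.items()}
    a = {(i, j): c / from_count[i] for (i, j), c in trans.items()}
    b = {(j, k): c / state_count[j] for (j, k), c in emit.items()}
    return pi, a, b

data = [[("DT", "the"), ("NN", "bill")],
        [("DT", "the"), ("NN", "dog"), ("VB", "barks")]]
pi, a, b = supervised_hmm(data)
print(pi, a, b)
```

Pairs that never occur in the training data are simply absent from the dictionaries, i.e., they have probability zero; a real tagger would usually smooth these estimates.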

It becomes a little bit tricky for unsupervised learning, where the true states are not tagged. To facilitate our model learning, we need to first introduce the following definition/calculation:

\begin{equation} \label{Eqn:Episilon} \varepsilon_t(i,j)=p(q_t=i,q_{t+1}=j|O,\Phi)=\frac{p(q_t=i,q_{t+1}=j,O|\Phi)}{p(O|\Phi)}=\frac{\alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{\sum_{i=1}^N{\alpha_T(i)}}, \end{equation}

where \(p(O|\Phi)\) is exactly what Problem 1 computes. \(\beta_{t+1}(j)\), with \(\beta_t(i)=p(o_{t+1}o_{t+2}...o_T|q_t=i,\Phi)\), can be calculated using the backward algorithm, which is very similar to the forward algorithm used in Problem 1 to calculate \(\alpha_t(i)\), except that the calculation runs in the opposite direction. Following (\ref{Eqn:Episilon}), we further introduce

\begin{equation} \label{Eqn:Gamma} \gamma_t(i)=p(q_t=i|O,\Phi)=\sum_{j=1}^N{\varepsilon_t(i,j)}. \end{equation}

Then the model parameters can be recomputed as

\begin{equation} \begin{split} \hat{a}_{ij}&=\frac{Expected~number~of~transitions~from~state~i~to~j}{Expected~number~of~transitions~from~state~i}\\ &=\frac{\sum_{t=1}^{T-1}{\varepsilon_t(i,j)}}{\sum_{t=1}^{T-1}{\gamma_t(i)}}, \end{split} \end{equation}

\begin{equation} \begin{split} \hat{b}_j(k)&=\frac{Expected~number~of~times~in~state~j~and~observing~k}{Expected~number~of~times~in~state~j}\\ &=\frac{\sum_{t=1,~s.t.~o_t=k}^{T}{\gamma_t(j)}}{\sum_{t=1}^{T}{\gamma_t(j)}}, \end{split} \end{equation}

\begin{equation} \begin{split} \hat{\pi}_i&=Expected~number~of~times~in~state~i~at~time~t=1\\ &=\gamma_1(i). \end{split} \end{equation}

Now we are ready to apply the expectation maximization (EM) algorithm for HMM learning. More specifically:

Initialize the HMM, \(\Phi\);
Repeat the two steps below until convergence:
- E Step: Given observations \(o_1o_2...o_T\) and the model \(\Phi\), compute \(\varepsilon_t(i,j)\) by (\ref{Eqn:Episilon}) and \(\gamma_t(i)\) by (\ref{Eqn:Gamma});
- M Step: Update the model \(\Phi\) by recomputing parameters using the three equations right above.
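
One EM iteration (often called a Baum-Welch step) can be sketched in NumPy as below, for a single observation sequence. This is only a sketch under assumptions: the function name `baum_welch_step` is made up, no rescaling is done (so it only suits short sequences), and \(\gamma_t(i)\) is computed as \(\alpha_t(i)\beta_t(i)/p(O|\Phi)\), which agrees with (\ref{Eqn:Gamma}) for \(t<T\) and also covers \(t=T\).

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM iteration for a discrete HMM; returns updated (pi, A, B) and p(O | old model)."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    # Forward and backward passes.
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    # E step: xi[t, i, j] = eps_t(i, j) and gamma[t, i] = gamma_t(i).
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    gamma = alpha * beta / likelihood
    # M step: re-estimate the parameters from expected counts.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, likelihood

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
obs = [0, 1, 1, 0, 1]
for _ in range(10):
    pi, A, B, likelihood = baum_welch_step(pi, A, B, obs)
print(likelihood)
```

The returned likelihood should never decrease from one iteration to the next, which is a handy sanity check when implementing the updates.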

Part of Speech (POS) Tagging

In natural language processing, part of speech (POS) tagging associates a lexical tag with each word in a sentence. As an example, Janet (NNP) will (MD) back (VB) the (DT) bill (NN), in which each POS tag describes the role of its corresponding word. In this particular example, "VB" tells that "back" is a verb, and "NN" tells that "bill" is a noun, etc.

POS tagging is very useful, because it is usually the first step of many practical tasks, e.g., speech synthesis, grammatical parsing and information extraction. For instance, if we want to pronounce the word "record" correctly, we need to first learn from context if it is a noun or verb and then determine where the stress is in its pronunciation. A similar argument applies to grammatical parsing and information extraction as well.

We need to do some preprocessing before performing POS tagging using HMM. First, because the vocabulary size could be very large while most of the words are not frequently used, we replace each low-frequency word with a special word "UNKA". This is very helpful to reduce the vocabulary size, and thus reduce the memory cost on storing the probability matrix. Second, for each sentence, we add two tags to represent sentence boundaries, e.g., "START" and "END".
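
The two preprocessing steps can be sketched as follows. The threshold `min_count` and the function name `preprocess` are assumptions made for illustration, and for simplicity the boundary markers are added to the word sequence only; with tagged training data the same markers would be added to the tag sequence as well.

```python
from collections import Counter

def preprocess(sentences, min_count=2):
    """Replace rare words with "UNKA" and add sentence boundary markers."""
    counts = Counter(w for s in sentences for w in s)
    processed = []
    for s in sentences:
        words = [w if counts[w] >= min_count else "UNKA" for w in s]
        processed.append(["START"] + words + ["END"])
    return processed

sentences = [["the", "bill"], ["the", "dog"]]
print(preprocess(sentences))
# [['START', 'the', 'UNKA', 'END'], ['START', 'the', 'UNKA', 'END']]
```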

Now we are ready to apply HMM to perform POS tagging. The model can be characterized by:

Time: length of each sentence;
\(N\) States: POS tags, e.g., 45 POS tags from Penn Treebank;
\(M\) Observations: vocabulary (compressed by replacing low-frequency words with "UNKA");
Initial Probabilities: probability of each tag associated to the first word;
Transition Probabilities: \(p(t_{i+1}|t_i)\), where \(t_i\) represents the tag for the \(i\)th word;
Observation Probabilities: \(p(w|t)\), where \(t\) stands for a tag and \(w\) stands for a word.

Once we finish training the model, e.g., under supervised learning by (\ref{Eqn:Supervised_Learning}), we will be able to tag new sentences by applying the Viterbi algorithm as previously illustrated in Problem 2 for HMMs. To see details about implementing POS tagging using an HMM, click here for demo code.

References

L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, in Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb 1989.

Expectation Maximization Algorithm and Gaussian Mixture Model

2016-03-12T00:00:00-08:00

In statistical modeling, it is possible that some observations are just missing. For example, when flipping two biased coins with unknown biases, we only have a sequence of observations on heads and tails, but forgot to record which coin each observation comes from. In this case, the conventional maximum likelihood (ML) or maximum a posteriori (MAP) algorithm would no longer be able to work, and it is time for the expectation maximization (EM) algorithm to come into play.

A Motivating Example

Although the two-biased-coin example above works as a valid example, another example will be discussed here, as it is more relevant to practical needs. Let us assume that we have a collection of floating-point numbers, which come from two different Gaussian distributions. Unfortunately, we do not know which distribution each number comes from. Now we are supposed to learn the two Gaussian distributions (i.e., their means and variances) from the given data. This is the well-known Gaussian mixture model (GMM). What makes things difficult is that we have missing observations, i.e., the membership of each number in the two distributions. Though conventional ML or MAP would not work here, this is a perfect problem that EM can handle.

Expectation Maximization (EM) Algorithm¹

Let us consider a statistical model with a vector of unknown parameters \(\boldsymbol\theta\), which generates a set of observed data \(\textbf{X}\) and a set of missing observations \(\textbf{Z}\). The likelihood function, \(p(\textbf{X},\textbf{Z}|\boldsymbol\theta)\), characterizes the probability that \(\textbf{X}\) and \(\textbf{Z}\) appear given the model with parameters \(\boldsymbol\theta\). An intuitive idea to estimate \(\boldsymbol\theta\) would be trying to perform the maximum likelihood estimation (MLE) considering all possible \(\textbf{Z}\), i.e.,

\begin{equation} \max_{\boldsymbol\theta}{\ln{p(\textbf{X}|\boldsymbol\theta)}}=\max_{\boldsymbol\theta}{\ln{\sum_{\textbf{Z}}{p(\textbf{X},\textbf{Z}|\boldsymbol\theta)}}}=\max_{\boldsymbol\theta}{\ln{\sum_{\textbf{Z}}{p(\textbf{X}|\textbf{Z},\boldsymbol\theta)p(\textbf{Z}|\boldsymbol\theta)}}}. \end{equation}

Unfortunately, the problem above is not directly tractable, since we do not have any prior knowledge on the missing observations \(\textbf{Z}\).

The EM algorithm aims to solve the problem above by starting with a guess on \(\boldsymbol\theta=\boldsymbol\theta_{0}\) and then iteratively applying the two steps as indicated below:

Expectation Step (E Step): Calculate the expected log likelihood as a function of \(\boldsymbol\theta\), where the expectation over the missing observations \(\textbf{Z}\) is taken under the current estimate \(\boldsymbol\theta_{t}\), by
\begin{equation} \mathcal{L}(\boldsymbol\theta|\boldsymbol\theta_{t})=\sum_{\textbf{Z}}{p(\textbf{Z}|\textbf{X},\boldsymbol\theta_{t})\ln{p(\textbf{X},\textbf{Z}|\boldsymbol\theta)}}; \end{equation}
Maximization Step (M Step): Find the parameter vector that maximizes the expected log likelihood above and then update it as
\begin{equation} \boldsymbol\theta_{t+1}={\arg \, \max}_{\boldsymbol\theta}{\mathcal{L}(\boldsymbol\theta|\boldsymbol\theta_{t})}. \end{equation}

There are two things that should be noted here:

There are two categories of EM: hard EM and soft EM. The algorithm illustrated above is soft EM, because the log likelihood in the E step is weighted over all possible \(\textbf{Z}\) by their probabilities. In hard EM, by contrast, instead of using a weighted average, we simply select the most probable \(\textbf{Z}\) and then move forward. The k-means algorithm is a good example of a hard EM algorithm.
The EM algorithm typically converges to a local optimum, and cannot guarantee a global optimum. As a result, the solution might differ with different initializations, and it is often helpful to try more than one initialization when applying EM in practice.

Gaussian Mixture Model (GMM)

In the motivating example, a GMM with two Gaussian distributions was introduced. Here we are going to extend it to a general case with \(K\) Gaussian distributions, and the data points will be generalized to be multidimensional. At the same time, we will discuss how it can be used for clustering.

Consider a data set containing \(N\) data points, \(\mathcal{D}=\{\textbf{x}_1,\textbf{x}_2,...,\textbf{x}_N\}\), in which each data point is an \(M\)-dimensional column vector and comes from one of \(K\) Gaussian distributions. Here we will introduce \(\mathcal{Z}=\{z_1,z_2,...,z_N\}\) with \(z_i\in\{1,2,...,K\}\) as latent (hidden) variables to represent the cluster membership of the data points in \(\mathcal{D}\). The \(K\) Gaussian distributions are characterized by \(\mathcal{N}(\boldsymbol\mu_j,\boldsymbol\Sigma_j)\) for \(j=1,2,...,K\), and the \(j\)th distribution has a weight of \(\pi_j\) in the overall distribution. Let us first try to map this GMM to the EM algorithm component by component:

\(\mathcal{D}\) is the observed data;
\(\mathcal{Z}\) is the missing observations;
\(\boldsymbol\mu_j\), \(\boldsymbol\Sigma_j\) and \(\pi_j\) are the unknown model parameters.

Following the EM algorithm, we will start with a guess on the unknown parameters, and then iteratively apply the E step and the M step until convergence. In the E step, we calculate the log likelihood based on the given model parameters by

\begin{align} \begin{aligned} LL&=\ln{p(\textbf{x}_1,\textbf{x}_2,...,\textbf{x}_N)}\\ &=\ln{\prod_{i=1}^{N}{p(\textbf{x}_i)}}\\ &=\ln{\prod_{i=1}^{N}{\sum_{j=1}^{K}{p(z_i=j)p(\textbf{x}_i|z_i=j)}}}\\ &=\sum_{i=1}^{N}{\ln\left(\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}\right)}. \end{aligned} \end{align}

In the M step, we maximize the log likelihood by solving the optimization problem below:

\begin{align} \begin{aligned} \max_{\boldsymbol\mu_j,\boldsymbol\Sigma_j,\pi_j}~~&{LL}\\ &s.t.~\sum_{j=1}^{K}{\pi_j}=1. \end{aligned} \end{align}

We can apply Lagrange multiplier to solve the problem above. Let

\begin{equation} L=\sum_{i=1}^{N}{\ln\left(\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}\right)}-\lambda\left(\sum_{j=1}^{K}{\pi_j}-1\right), \end{equation}

where \(\lambda\) is the Lagrange multiplier. Taking partial derivatives and setting them to zero, we can obtain the optimal parameters as below:

\begin{equation} \label{Eqn:Maximization_1} \boldsymbol\mu_j=\frac{\sum_{i=1}^{N}{\gamma_{ij}}\textbf{x}_i}{\sum_{i=1}^{N}{\gamma_{ij}}}, \end{equation}

\begin{equation} \label{Eqn:Maximization_2} \boldsymbol\Sigma_j=\frac{\sum_{i=1}^{N}{\gamma_{ij}}(\textbf{x}_i-\boldsymbol\mu_j)(\textbf{x}_i-\boldsymbol\mu_j)^T}{\sum_{i=1}^{N}{\gamma_{ij}}}, \end{equation}

\begin{equation} \label{Eqn:Maximization_3} \pi_j=\frac{1}{N}{\sum_{i=1}^{N}{\gamma_{ij}}}, \end{equation}

where \(\gamma_{ij}=p(z_i=j|\textbf{x}_i)\) is the cluster membership, which can be calculated using Bayes' theorem,

\begin{equation} \begin{split} \gamma_{ij}&=p(z_i=j|\textbf{x}_i)\\ &=\frac{p(z_i=j)p(\textbf{x}_i|z_i=j)}{\sum_{l=1}^{K}{p(z_i=l)p(\textbf{x}_i|z_i=l)}}\\ &=\frac{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}{\sum_{l=1}^{K}{\pi_l\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_l,\boldsymbol\Sigma_l)}}. \end{split} \end{equation}

To summarize, the GMM can be learned using the EM algorithm in the following steps:

Initialize \(\boldsymbol\mu_j\), \(\boldsymbol\Sigma_j\) and \(\pi_j\) for \(j=1,2,...,K\);
Repeat the following two steps until the log likelihood converges:
- E Step: Estimate the cluster membership \(\gamma_{ij}\) by the equation right above for every data point \(\textbf{x}_i\) and every cluster \(j\);
- M Step: Maximize the log likelihood and update the model parameters by (\ref{Eqn:Maximization_1})-(\ref{Eqn:Maximization_3}) based on cluster membership \(\gamma_{ij}\).
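
The steps above can be sketched in NumPy as follows. Several details are assumptions of this sketch rather than part of the derivation: the means are initialized from randomly chosen data points, every covariance starts from the overall data covariance, a small ridge of `1e-6` is added to each covariance for numerical stability, and the names `gaussian_pdf` and `gmm_em` are made up.

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at each row of X (shape N x M)."""
    M = X.shape[1]
    diff = X - mu
    solved = np.linalg.solve(Sigma, diff.T).T
    expo = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2 * np.pi) ** M * np.linalg.det(Sigma))
    return np.exp(expo) / norm

def gmm_em(X, K, n_iter=100, tol=1e-6, seed=0):
    """Fit a K-component GMM by EM; returns (pi, mu, Sigma, gamma)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    # Initialization (an assumption of this sketch, see the text above).
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T).reshape(M, M) + 1e-6 * np.eye(M)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: cluster membership gamma_ij and the current log likelihood.
        dens = np.stack([pi[j] * gaussian_pdf(X, mu[j], Sigma[j])
                         for j in range(K)], axis=1)
        ll = np.log(dens.sum(axis=1)).sum()
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form updates of the means, covariances and weights.
        Nj = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(M)
        pi = Nj / N
        if ll - prev_ll < tol:   # stop once the log likelihood has converged
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(200, 2))])
pi, mu, Sigma, gamma = gmm_em(X, K=2)
labels = gamma.argmax(axis=1)   # hard cluster assignment from soft memberships
print(pi, mu)
```

Since EM only finds a local optimum, running `gmm_em` with several values of `seed` and keeping the fit with the highest log likelihood is a common precaution.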

References

Wikipedia, Expectation–maximization algorithm, accessed on Mar 12, 2016.

Locating and Filling Missing Words in Sentences

2016-03-05T00:00:00-08:00

There are many occasions on which we have incomplete sentences that need to be completed. One example is that in speech recognition a noisy environment can lead to unrecognizable words, but we still hope to recover and understand the complete sentence (e.g., by inference); another example is the sentence completion questions that appear in language tests (e.g., SAT, GRE, etc.).

What Exactly is the Problem?

Generally, the problem we are aiming to solve is locating and filling any missing words in incomplete sentences. However, this problem seems too ambitious so far, and we direct ourselves to a simplified version of this problem. To simplify the problem, we assume that there is only one missing word in a sentence, and the missing word is neither the first word nor the last word of the sentence. This problem originally comes from here.

Locating the Missing Word

Two approaches to locating the missing word are presented here.

N-gram Model

For a given training data set, define \(C(w_1,w_2)\) as the number of occurrences of the bigram pattern \((w_1,w_2)\), and \(C(w_1,w,w_2)\) the number of occurrences of the trigram pattern \((w_1,w,w_2)\). Then, the number of occurrences of the pattern, where there is one and only one word between \(w_1\) and \(w_2\), can be calculated by

\begin{equation} D(w_1,w_2)=\sum_{w\in{V}}C(w_1,w,w_2), \end{equation}

where \(V\) is the vocabulary.

Consider a particular location, \(l\), of an incomplete sentence of length \(L\), and let \(w_l\) be the \(l\)th word in the sentence. \(D(w_{l-1},w_{l})\) would be the number of positive votes from the training data set for a missing word at this location, while \(C(w_{l-1},w_{l})\) would correspondingly be the number of negative votes. We define the score indicating there is a missing word at location \(l\) as

\begin{equation} \label{Eqn:Score} S_l=\frac{D(w_{l-1},w_{l})^{1+\gamma}}{C(w_{l-1},w_{l})+D(w_{l-1},w_{l})}-\frac{C(w_{l-1},w_{l})^{1+\gamma}}{C(w_{l-1},w_{l})+D(w_{l-1},w_{l})}, \end{equation}

where \(\gamma\) is a small positive constant. Hence, the missing word location can be identified by

\begin{equation} \hat{l}={\arg \, \max}_{1 \leq l \leq L-1} S_l. \end{equation}

Note that in (\ref{Eqn:Score}), if we set \(\gamma=0\), the left part would be exactly the percentage of positive votes for a missing word at that location, and the right part the percentage of negative votes. This seems a fairly reasonable score, so why do we still need a positive \(\gamma\)? The underlying reason is that, intuitively, the more votes there are for a particular decision, the more confident we are in that decision. This trend is reflected in a positive \(\gamma\), which can be viewed as a sparse-vote penalty and is useful in breaking ties in the missing word location voting. That is, if we have exactly the same ratio of positive votes to negative votes for two candidate locations, e.g., 80 positive votes vs. 20 negative votes for location A, and 8 positive votes vs. 2 negative votes for location B, we would believe that location A is more likely to be the missing word location than location B.
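
The tie-breaking effect of \(\gamma\) is easy to check numerically. In the sketch below, the value \(\gamma=0.01\), the function name `missing_word_score` and the handling of the no-votes case are assumptions made for illustration.

```python
def missing_word_score(c, d, gamma=0.01):
    """Score S_l from negative votes c = C(w_{l-1}, w_l) and positive votes d = D(w_{l-1}, w_l)."""
    total = c + d
    if total == 0:
        return 0.0   # assumption: no evidence either way
    return (d ** (1 + gamma) - c ** (1 + gamma)) / total

# Location A: 80 positive vs. 20 negative votes; location B: 8 positive vs. 2 negative votes.
tie_a = missing_word_score(c=20, d=80, gamma=0.0)
tie_b = missing_word_score(c=2, d=8, gamma=0.0)
score_a = missing_word_score(c=20, d=80)
score_b = missing_word_score(c=2, d=8)
print(tie_a, tie_b)        # both equal 0.6 when gamma is zero
print(score_a > score_b)   # True: the location with more votes wins
```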

Word Distance Statistics (WDS)

In view of the fact that the statistics of the two words immediately adjacent to a given location contribute a lot in deciding whether the location has a word missing, we tentatively guess that all the words within a window centered at that location would more or less contribute some information as well. As a result, we introduce the concept of word distance statistics (WDS).

More specifically, we use \(\widetilde{C}(w_1,w_2,m)\) to denote the number of occurrences of the pattern in which there are exactly \(m\) words between \(w_1\) and \(w_2\), i.e., the word distance between \(w_1\) and \(w_2\) is \(m\). For a given location \(l\) in an incomplete sentence and a word window size \(W\), we are interested in the word distance statistics of each word pair in which one word, \(w_i\), is on the left of the location \(l\), and the other word, \(w_j\), is on the right, as illustrated in Fig. 1.

Fig. 1: Word distance illustration.

Formally, for any \(l-W/2 \leq i \leq l-1\) and \(l \leq j \leq l+W/2-1\), \(\widetilde{C}(w_i,w_j,j-i)\) is the number of positive votes for a missing word at this location, while \(\widetilde{C}(w_i,w_j,j-i-1)\) is the number of negative votes. Applying the idea in (\ref{Eqn:Score}), for each word pair \((w_i,w_j)\), we extract its feature as the score indicating that there is a missing word at location \(l\), i.e.,

\begin{equation} \label{Eqn:ScoreGeneralized} S_l(i,j)=\frac{\widetilde{C}(w_i,w_j,j-i)^{1+\gamma}}{\widetilde{C}(w_i,w_j,j-i)+\widetilde{C}(w_i,w_j,j-i-1)}-\frac{\widetilde{C}(w_i,w_j,j-i-1)^{1+\gamma}}{\widetilde{C}(w_i,w_j,j-i)+\widetilde{C}(w_i,w_j,j-i-1)}. \end{equation}

As a special case, with \(i=l-1\) and \(j=l\), (\ref{Eqn:ScoreGeneralized}) reduces to (\ref{Eqn:Score}).

To find the missing word location, we need to assign different weights to the extracted features, \(S_l(i,j)\). Then, the missing word location can be determined by

\begin{equation} \label{Eqn:LocationDetermination} \hat{l}={\arg \, \max}_{1 \leq l \leq L-1} \sum_{l-\frac{W}{2} \leq i \leq l-1}\sum_{l \leq j \leq l+\frac{W}{2}-1}v(i,j)S_l(i,j), \end{equation}

where the weight, \(v(i,j)\), should be monotonically decreasing with respect to \(|j-i|\).
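The location rule in (\ref{Eqn:LocationDetermination}) can be sketched on a toy corpus as follows. This is only an illustration under stated assumptions: words are indexed from 0, the weight \(v(i,j)=1/(j-i)\) is a hypothetical choice that merely satisfies the monotonicity requirement, pairs falling outside the sentence are skipped, and the helper names (`wds_counts`, `pair_score`, `locate_missing`) are invented for this example.

```python
from collections import Counter


def wds_counts(sentences, max_gap):
    """Count C~(w1, w2, m): pairs with exactly m words in between, m <= max_gap."""
    counts = Counter()
    for words in sentences:
        for a in range(len(words)):
            for m in range(max_gap + 1):
                b = a + m + 1
                if b >= len(words):
                    break
                counts[(words[a], words[b], m)] += 1
    return counts


def pair_score(positive, negative, gamma):
    """Per-pair score S_l(i, j); returns 0 for unseen pairs (an assumption)."""
    total = positive + negative
    if total == 0:
        return 0.0
    return (positive ** (1 + gamma) - negative ** (1 + gamma)) / total


def locate_missing(words, counts, window=4, gamma=0.01):
    """Return l such that the missing word sits between words[l-1] and words[l]."""
    half = window // 2
    best_l, best_total = None, float("-inf")
    for l in range(1, len(words)):
        total = 0.0
        for i in range(max(0, l - half), l):
            for j in range(l, min(len(words), l + half)):
                positive = counts[(words[i], words[j], j - i)]
                negative = counts[(words[i], words[j], j - i - 1)]
                weight = 1.0 / (j - i)  # hypothetical weight, decreasing in |j - i|
                total += weight * pair_score(positive, negative, gamma)
        if total > best_total:
            best_l, best_total = l, total
    return best_l


# Toy training corpus and an incomplete sentence with "sat" removed.
corpus = [s.split() for s in ("the cat sat on the mat",
                              "the dog sat on the rug",
                              "a cat sat on a mat")]
counts = wds_counts(corpus, max_gap=3)  # a window of W = 4 needs gaps up to W - 1
guess = locate_missing("the cat on the mat".split(), counts, window=4)
```

On this toy input, `guess` is 2, i.e., the gap between "cat" and "on". Setting `window=2` leaves only the adjacent pair, which corresponds to the N-gram score in (\ref{Eqn:Score}).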

Filling the Missing Word

To find the most probable word in the given missing word location, we take into account five conditional probabilities, as shown in Table 1, to explore the statistical connection between the candidate words and the surrounding words at the missing word location. Ultimately, the most probable missing word can be determined by

\begin{equation} \hat{w}={\arg \, \max}_{w\in{B}} \sum_{1 \leq i \leq 5} v_iP_i, \end{equation}

where \(B\) is the candidate word space (detailed here), and the weight \(v_i\) is used to reflect the importance of each conditional probability in contributing to the final score.

Table 1: Conditional probabilities considered in missing word filling, in which "*" denotes an arbitrary word.

Experimental Results

The training data contains \(30,301,028\) complete sentences, of which the average sentence length is approximately \(25\). In the vocabulary with a size of \(2,425,337\), \(14,216\) words that have occurred in at least \(0.1\%\) of total sentences are labeled as high-frequency words, and the remaining \(58,417,315\) words are labeled as 'UNKA'. To perform cross validation in our experiments, the training data is split into two parts, TRAIN and DEV. The TRAIN set is used to train our models, and the DEV set is used to test them.

Missing Word Location

Table 2 shows the estimation accuracy of the missing word locations for the two proposed approaches, N-gram and WDS. For comparison, we list the corresponding probabilities by chance as well. Each entry shows the probability that the correct location is included in the ranked candidate location list returned by each approach, where the list size varies from \(1\) to \(10\). The sparse vote penalty coefficient, \(\gamma\), is set to 0.01. In the WDS approach, we consider a word window size \(W=4\), i.e., four pairs of words are taken into account.

	Top 1	Top 2	Top 3	Top 5	Top 10
Chance	4%	8%	12%	20%	40%
N-gram	51.47%	63.70%	71.00%	80.26%	91.54%
WDS	52.06%	64.50%	71.76%	80.91%	91.93%

Table 2: Accuracy of missing word location.
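The "Top k" columns can be read as a hit rate over ranked candidate lists. The following is a minimal sketch of that metric, assuming each test sentence comes with a ranked list of candidate locations; the function name and the sample data are hypothetical. The chance row also appears consistent with \(k/25\), given the average sentence length of about 25.

```python
def top_k_accuracy(ranked_candidates, truths, k):
    """Fraction of samples whose true answer is among the first k candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_candidates, truths)
               if truth in ranked[:k])
    return hits / len(truths)


# Hypothetical ranked location lists for three sentences and their true locations.
ranked = [[2, 5, 1], [4, 3, 7], [6, 1, 2]]
truths = [2, 3, 9]

top1 = top_k_accuracy(ranked, truths, k=1)  # only the first sentence is a hit
top2 = top_k_accuracy(ranked, truths, k=2)  # the second sentence is now a hit too
```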

Missing Word Filling

Table 3 shows the accuracy of filling the missing word given the location. Each entry shows the probability that the correct word is included in the ranked candidate word list returned by the proposed approach, where the list size varies from \(1\) to \(10\).

	Top 1	Top 2	Top 3	Top 5	Top 10
Accuracy	32.15%	41.49%	46.23%	52.02%	59.15%

Table 3: Accuracy of missing word filling.

Acknowledgement

I did this project with my partner, Zhe Wang. To see the code and/or the report, click here for more information.

Binary and Multiclass Logistic Regression Classifiers

2016-02-28T00:00:00-08:00

Generative classification models, such as Naive Bayes, learn the class-conditional and prior probabilities and then predict by using Bayes' rule to calculate the posterior, \(p(y|\textbf{x})\). In contrast, discriminative classifiers model the posterior directly. As one of the most popular discriminative classifiers, logistic regression directly models a linear decision boundary.

Binary Logistic Regression Classifier¹

Let us start with the binary case. For an M-dimensional feature vector \(\textbf{x}=[x_1,x_2,...,x_M]^T\), the posterior probability of class \(y\in\{\pm{1}\}\) given \(\textbf{x}\) is assumed to satisfy

\begin{equation} \ln{\frac{p(y=1|\textbf{x})}{p(y=-1|\textbf{x})}}=\textbf{w}^T\textbf{x}, \end{equation}

where \(\textbf{w}=[w_1,w_2,...,w_M]^T\) is the weighting vector to be learned. Given the constraint that \(p(y=1|\textbf{x})+p(y=-1|\textbf{x})=1\), it follows that

\begin{equation} \label{Eqn:Prob_Binary} p(y|\textbf{x})=\frac{1}{1+\exp(-y\textbf{w}^T\textbf{x})}=\sigma(y\textbf{w}^T\textbf{x}), \end{equation}

in which we can observe the logistic sigmoid function \(\sigma(a)=\frac{1}{1+\exp(-a)}\).

Based on the assumptions above, the weighting vector, \(\textbf{w}\), can be learned by maximum likelihood estimation (MLE). More specifically, given training data set \(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\),

\begin{align} \begin{aligned} \textbf{w}^*&=\arg\max_{\textbf{w}}{\mathcal{L}(\textbf{w})}\\ &=\arg\max_{\textbf{w}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\ &=\arg\max_{\textbf{w}}{\sum_{i=1}^N{\ln{\frac{1}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}}}\\ &=\arg\min_{\textbf{w}}{\sum_{i=1}^N{\ln{(1+\exp(-y_i\textbf{w}^T\textbf{x}_i))}}}. \end{aligned} \end{align}

Let \(E(\textbf{w})=-\mathcal{L}(\textbf{w})=\sum_{i=1}^N{\ln{(1+\exp(-y_i\textbf{w}^T\textbf{x}_i))}}\) denote the objective being minimized. It is convex, so we can find the optimal solution by applying gradient descent. The gradient can be derived as

\begin{align} \begin{aligned} \nabla{E(\textbf{w})}&=\sum_{i=1}^N{\frac{-y_i\textbf{x}_i\exp(-y_i\textbf{w}^T\textbf{x}_i)}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}\\ &=-\sum_{i=1}^N{y_i\textbf{x}_i(1-p(y_i|\textbf{x}_i))}. \end{aligned} \end{align}

Then, we can learn the optimal \(\textbf{w}\) by starting with an initial \(\textbf{w}_0\) and iterating as follows:

\begin{equation} \label{Eqn:Iteration_Binary} \textbf{w}_{t+1}=\textbf{w}_{t}-\eta_t\nabla{E(\textbf{w}_t)}, \end{equation}

where \(\eta_t\) is the learning step size. It can be constant over time, but time-varying step sizes could potentially reduce the convergence time, e.g., setting \(\eta_t\propto{1/\sqrt{t}}\) such that the step size decreases as \(t\) increases.
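The update rule above can be sketched in plain Python as follows. This is a minimal illustration rather than a tuned implementation: the function names, the initial step size `eta0`, the number of steps, and the toy data (with a constant first feature acting as a bias term) are all assumptions made for this example.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wm * xm for wm, xm in zip(w, x))


def gradient(w, data):
    """Gradient of sum_i ln(1 + exp(-y_i w^T x_i)), the minimized objective."""
    grad = [0.0] * len(w)
    for x, y in data:
        coeff = -y * (1.0 - sigmoid(y * dot(w, x)))  # -y_i (1 - p(y_i | x_i))
        for m, xm in enumerate(x):
            grad[m] += coeff * xm
    return grad


def fit_binary(data, steps=500, eta0=0.5):
    """Gradient descent with step size eta_t = eta0 / sqrt(t)."""
    w = [0.0] * len(data[0][0])
    for t in range(1, steps + 1):
        eta = eta0 / math.sqrt(t)
        grad = gradient(w, data)
        w = [wm - eta * gm for wm, gm in zip(w, grad)]
    return w


# Toy data: labels in {-1, +1}; the first feature is a constant bias input.
data = [([1.0, -2.0], -1), ([1.0, -1.0], -1), ([1.0, 1.0], 1), ([1.0, 2.0], 1)]
w = fit_binary(data)
posteriors = [sigmoid(y * dot(w, x)) for x, y in data]  # p(y_i | x_i)
```

Because the toy data is linearly separable, the weights keep growing slowly with more steps; in practice a stopping criterion or regularization would usually be added.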

Multiclass Logistic Regression Classifier¹

When generalized to the multiclass case, the logistic regression model needs to be adapted accordingly. Now we have \(K\) possible classes, that is, \(y\in\{1,2,...,K\}\). It is assumed that the posterior probability of class \(y=k\) given \(\textbf{x}\) satisfies

\begin{equation} p(y=k|\textbf{x})\propto\exp(\textbf{w}_k^T\textbf{x}), \end{equation}

where \(\textbf{w}_k\) is a column weighting vector corresponding to class \(k\). Considering all classes \(k=1,2,...,K\), we would have a weighting matrix that includes all \(K\) weighting vectors. That is, \(\textbf{W}=[\textbf{w}_1,\textbf{w}_2,...,\textbf{w}_K]\). Under the constraint

\begin{equation} \sum_{k=1}^K{p(y=k|\textbf{x})}=1, \end{equation}

it then follows that

\begin{equation} \label{Eqn:Prob_Multiple} p(y=k|\textbf{x})=\frac{\exp(\textbf{w}_k^T\textbf{x})}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x})}}. \end{equation}

The weighting matrix, \(\textbf{W}\), can be similarly learned by maximum likelihood estimation (MLE). More specifically, given training data set \(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\),

\begin{align} \begin{aligned} \textbf{W}^*&=\arg\max_{\textbf{W}}{\mathcal{L}(\textbf{W})}\\ &=\arg\max_{\textbf{W}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\ &=\arg\max_{\textbf{W}}{\sum_{i=1}^N{\ln{\frac{\exp(\textbf{w}_{y_i}^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}}}}. \end{aligned} \end{align}

The gradient of the objective function with respect to each \(\textbf{w}_k\) can be calculated as

\begin{align} \begin{aligned} \frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}&=\sum_{i=1}^N{\textbf{x}_i\left(I(y_i=k)-\frac{\exp(\textbf{w}_k^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}\right)}\\ &=\sum_{i=1}^N{\textbf{x}_i(I(y_i=k)-p(y=k|\textbf{x}_i))}, \end{aligned} \end{align}

where \(I(\cdot)\) is a binary indicator function. Applying gradient ascent, since we are maximizing the log-likelihood, the optimal solution can be obtained by iterating as follows:

\begin{equation}\label{Eqn:Iteration_Multiple} \textbf{w}_{k,t+1}=\textbf{w}_{k,t}+\eta_{t}\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}. \end{equation}

Note that we have "\(+\)" in (\ref{Eqn:Iteration_Multiple}) instead of "\(-\)" in (\ref{Eqn:Iteration_Binary}), because the maximum likelihood estimation in the binary case is eventually converted to a minimization problem, while here we keep performing maximization.
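A minimal sketch of the multiclass update in plain Python is given below. It assumes classes are indexed \(0,\ldots,K-1\) rather than \(1,\ldots,K\), stores \(\textbf{w}_k\) as the row `W[k]`, and uses the same \(1/\sqrt{t}\) step size schedule as the binary case; the function names, step size, and toy data are hypothetical.

```python
import math


def softmax_probs(W, x):
    """Posterior p(y = k | x) for every class k, with W[k] playing the role of w_k."""
    scores = [sum(wm * xm for wm, xm in zip(w_k, x)) for w_k in W]
    top = max(scores)  # subtract the maximum score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_multiclass(data, num_classes, steps=300, eta0=0.5):
    """Gradient ascent on the log-likelihood with eta_t = eta0 / sqrt(t)."""
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(num_classes)]
    for t in range(1, steps + 1):
        eta = eta0 / math.sqrt(t)
        grads = [[0.0] * dim for _ in range(num_classes)]
        for x, y in data:
            probs = softmax_probs(W, x)
            for k in range(num_classes):
                coeff = (1.0 if y == k else 0.0) - probs[k]  # I(y_i = k) - p(k | x_i)
                for m, xm in enumerate(x):
                    grads[k][m] += coeff * xm
        for k in range(num_classes):
            for m in range(dim):
                W[k][m] += eta * grads[k][m]  # "+" because we maximize
    return W


# Toy data with three classes; the first feature is a constant bias input.
data = [([1.0, 2.0, 0.0], 0), ([1.0, 3.0, 0.5], 0),
        ([1.0, -2.0, 2.0], 1), ([1.0, -1.0, 3.0], 1),
        ([1.0, -2.0, -2.0], 2), ([1.0, -1.0, -3.0], 2)]
W = fit_multiclass(data, num_classes=3)
predictions = [max(range(3), key=lambda k: softmax_probs(W, x)[k]) for x, _ in data]
```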

How to Perform Predictions?

Once the optimal weights are learned from the logistic regression model, for any new feature vector \(\textbf{x}\), we can easily calculate the probability that it is associated with each class label \(k\) by (\ref{Eqn:Prob_Binary}) in the binary case or (\ref{Eqn:Prob_Multiple}) in the multiclass case. With the probabilities for each class label available, we can then perform:

a hard decision by identifying the class label with the highest probability, or
a soft decision by showing the top \(k\) most probable class labels with their corresponding probabilities.
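The two decision rules can be sketched as follows, assuming the class probabilities have already been computed and are stored in a list indexed by class label; the function names and the sample probabilities are hypothetical.

```python
def hard_decision(probs):
    """Return the class label with the highest probability."""
    return max(range(len(probs)), key=lambda k: probs[k])


def soft_decision(probs, top_k):
    """Return the top_k (label, probability) pairs, most probable first."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


probs = [0.1, 0.6, 0.3]
best = hard_decision(probs)             # 1
top_two = soft_decision(probs, top_k=2)  # [(1, 0.6), (2, 0.3)]
```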

An Example Applying Multiclass Logistic Regression

To see an example applying multiclass logistic regression classification, click here for more information.

References

C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
