Tianlong's Blog (https://stlong0521.github.io/)

<h2>An Approach Of Scaling Airflow To A Corporate Level</h2>
<p>Tianlong Song, 2017-07-15</p>
<p><a href="https://stlong0521.github.io/20161023%20-%20Airflow.html">The last post on Airflow</a> provides step-by-step instructions on how to build an Airflow cluster from scratch. It serves development purposes well, but lacks critical features needed in production, e.g., CI/CD compliance, resource monitoring, service recovery, and so on.</p>
<p>I have been leading the effort to build the Airflow backbone on Zillow's Data Science and Engineering (DSE) team, and I would like to introduce <a href="https://www.zillow.com/data-science/airflow-at-zillow/">a post</a> from Zillow's tech blog. It describes how Airflow is adopted and operated at Zillow, and may give you an idea of how Airflow can be configured to run at a corporate level.</p>

<h2>A Guide On How To Build An Airflow Server/Cluster</h2>
<p>Tianlong Song, 2016-10-23</p>
<p><a href="https://github.com/apache/incubator-airflow">Airflow</a> is an open-source platform to author, schedule and monitor workflows and data pipelines. When you have periodic jobs, which most likely involve various data transfers and/or depend on each other, you should consider Airflow. This blog post briefly introduces Airflow and provides instructions to build an Airflow server/cluster from scratch.</p>
<h3>A Glimpse at Airflow under the Hood</h3>
<p>Generally, Airflow works in a distributed environment, as you can see in the diagram below. The airflow scheduler schedules jobs according to the dependencies defined in directed acyclic graphs (DAGs), and the airflow workers pick up and run jobs with their loads properly balanced. All job information is stored in the meta DB, which is updated in a timely manner. Users can monitor their jobs via the shiny Airflow web UI and/or the logs.</p>
<figure align="center">
<img src="/figures/20161023/Airflow.png" alt="Airflow">
<figcaption align="center">Fig. 1: Airflow Diagram.</figcaption>
</figure>
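<p>The scheduling behavior described above is, at its core, a topological ordering of the DAG: a task becomes runnable only once all of its upstream tasks have finished. The sketch below illustrates the idea in plain Python with a made-up four-task DAG (this is what Airflow does internally; it is not Airflow code):</p>

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A hypothetical DAG: each task maps to the set of tasks it depends on
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load", "transform"},
}

ts = TopologicalSorter(dag)
ts.prepare()
waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # tasks whose dependencies are all met
    waves.append(ready)             # workers could run each wave in parallel
    ts.done(*ready)

print(waves)  # [['extract'], ['transform'], ['load'], ['report']]
```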
<p>Although you do not necessarily need to run a fully distributed version of Airflow, this page will go through all three modes: standalone, pseudo-distributed and distributed.</p>
<h3>Phase 1: Start with Standalone Mode Using Sequential Executor</h3>
<blockquote>
<p>Under the standalone mode with a sequential executor, the executor picks up and runs jobs sequentially, which means there is no parallelism for this choice. Although not often used in production, it enables you to get familiar with Airflow quickly.</p>
</blockquote>
<ol>
<li>
<p>Install and configure airflow</p>
<div class="highlight"><pre># Set the airflow home
export AIRFLOW_HOME=~/airflow
# Install from pypi using pip
pip install airflow
# Install necessary sub-packages
pip install airflow[crypto] # For connection credentials protection
pip install airflow[postgres] # For PostgreSQL DBs
pip install airflow[celery] # For distributed mode: celery executor
pip install airflow[rabbitmq] # For message queuing and passing between airflow server and workers
... # Anything more you need
# Configure airflow: modify AIRFLOW_HOME/airflow.cfg if necessary
# For the standalone mode, we will leave the configuration to default
</pre></div>
</li>
<li>
<p>Initialize the meta database (home for almost all airflow information)</p>
<div class="highlight"><pre># For the standalone mode, it could be a sqlite database, which applies to sequential executor only
airflow initdb
</pre></div>
</li>
<li>
<p>Start the airflow webserver and explore the web UI</p>
<div class="highlight"><pre>airflow webserver -p 8080 # Test it out by opening a web browser and go to localhost:8080
</pre></div>
</li>
<li>
<p>Create your dags and place them into your DAGS_FOLDER (AIRFLOW_HOME/dags by default); refer to this <a href="https://pythonhosted.org/airflow/tutorial.html">tutorial</a> for how to create a dag, and keep the key commands below in mind</p>
<div class="highlight"><pre># Check syntax errors for your dag
python ~/airflow/dags/tutorial.py
# Print the list of active DAGs
airflow list_dags
# Print the list of tasks in the "tutorial" DAG
airflow list_tasks tutorial
# Print the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial --tree
# Test your tasks in your dag
airflow test [DAG_ID] [TASK_ID] [EXECUTION_DATE]
airflow test tutorial sleep 2015-06-01
# Backfill: execute jobs that are not done in the past
airflow backfill tutorial -s 2015-06-01 -e 2015-06-07
</pre></div>
</li>
<li>
<p>Start the airflow scheduler and monitor the tasks via the web UI</p>
<div class="highlight"><pre>airflow scheduler # Monitor the your tasks via the web UI (success/failure/scheduling, etc.)
# Remember to turn on the dags you want to run via the web UI, if they are not on yet
</pre></div>
</li>
<li>
<p>[Optional] Put your dags in remote storage, and sync them with your local dag folder</p>
<div class="highlight"><pre># Create a daemon using crons to sync up dags; below is an example for remote dags in S3 (you can also put them in remote repo)
# Note: you need to have the aws command line tool installed and your AWS credentials properly configured
crontab -e
* * * * * /usr/local/bin/aws s3 sync s3://your_bucket/your_prefix YOUR_AIRFLOW_HOME/dags # Sync up every minute
</pre></div>
</li>
<li>
<p>[Optional] Add access control to the web UI; add users with password protection, see <a href="https://pythonhosted.org/airflow/security.html">here</a>. You may need to install the dependency below</p>
<div class="highlight"><pre>pip install flask-bcrypt
</pre></div>
</li>
</ol>
<h3>Phase 2: Adopt Pseudo-distributed Mode Using Local Executor</h3>
<blockquote>
<p>Under the pseudo-distributed mode with a local executor, the local workers pick up and run jobs locally via multiprocessing. If you have only a moderate amount of scheduled jobs, this could be the right choice.</p>
</blockquote>
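<p>To see what the local executor buys you, here is a minimal stand-in: Airflow's LocalExecutor uses multiprocessing, but a thread pool keeps the sketch self-contained and runnable; the task names and the 4-slot pool (cf. the <code>parallelism</code> setting in airflow.cfg) are made up for illustration:</p>

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_task(name):
    time.sleep(0.2)               # stand-in for real task work
    return name + ": done"

tasks = ["task_a", "task_b", "task_c", "task_d"]
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:   # 4 parallel task slots
    results = list(pool.map(run_task, tasks))
elapsed = time.perf_counter() - start  # roughly 0.2s, versus ~0.8s sequentially
```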
<ol>
<li>
<p>Adopt another DB server to support executors other than the sequential executor; MySQL and PostgreSQL are recommended; here PostgreSQL is used as an example</p>
<div class="highlight"><pre># Install postgres
brew install postgresql # For Mac, the command varies for different OS
# Connect to the database
psql -d postgres # This will open a prompt
# Operate on the database server
\l # List all databases
\du # List all users/roles
\dt # Show all tables in database
\h # List help information
\q # Quit the prompt
# Create a meta db for airflow
CREATE DATABASE database_name;
\l # Check for success
</pre></div>
</li>
<li>
<p>Modify the configuration in AIRFLOW_HOME/airflow.cfg</p>
<div class="highlight"><pre># Change the executor to Local Executor
executor = LocalExecutor
# Change the meta db configuration
# Note: the postgres username and password do not matter for now, since the database server and clients are still on the same host
sql_alchemy_conn = postgresql+psycopg2://your_postgres_user_name:your_postgres_password@host_name/database_name
</pre></div>
</li>
<li>
<p>Restart airflow to test your dags</p>
<div class="highlight"><pre>airflow initdb
airflow webserver
airflow scheduler
</pre></div>
</li>
<li>
<p>Establish your own connections via the web UI; you can test your DB connections via the Ad Hoc Query (see <a href="https://pythonhosted.org/airflow/profiling.html">here</a>)</p>
<div class="highlight"><pre># Go to the web UI: Admin -> Connections -> Create
Connection ID: name it
Connection Type: e.g., database/AWS
Host: e.g., your database server name or address
Schema: e.g., your database name
Username: your user name
Password: will be encrypted if airflow[crypto] is installed
Extra: additional configuration in JSON, e.g., AWS credentials
# Encrypt your credentials
# Generate a valid Fernet key and place it into airflow.cfg
FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")
</pre></div>
</li>
</ol>
<h3>Phase 3: Extend to Distributed Mode Using Celery Executor</h3>
<blockquote>
<p>Under the distributed mode with a celery executor, remote workers pick up and run jobs as scheduled and load-balanced. Being highly scalable, this is the right choice when you expect heavy and expanding loads.</p>
</blockquote>
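<p>The broker-plus-workers topology can be mimicked in miniature with a thread-safe queue standing in for RabbitMQ and threads standing in for remote workers (the task names are invented; real Airflow workers are separate machines pulling from the broker):</p>

```python
import queue
import threading

broker = queue.Queue()            # stands in for RabbitMQ
results = []
lock = threading.Lock()

def worker():
    while True:
        task = broker.get()
        if task is None:          # poison pill: shut this worker down
            broker.task_done()
            return
        with lock:                # record the finished task
            results.append(task + ": done")
        broker.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()
for task in ["t1", "t2", "t3", "t4", "t5"]:
    broker.put(task)              # the scheduler enqueues task instances
broker.join()                     # wait until every task is acknowledged
for _ in workers:
    broker.put(None)
for t in workers:
    t.join()
```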
<ol>
<li>
<p>Install and configure a message queuing/passing engine on the airflow server, e.g., RabbitMQ or Redis; RabbitMQ is used here (resources: <a href="http://docs.celeryproject.org/en/latest/getting-started/brokers/rabbitmq.html">link1</a> and <a href="http://blog.genesino.com/2016/05/airflow/">link2</a>)</p>
<div class="highlight"><pre># Install RabbitMQ
brew install rabbitmq # For Mac, the command varies for different OS
# Add the following path to your .bash_profile or .profile
PATH=$PATH:/usr/local/sbin
# Start the RabbitMQ server
sudo rabbitmq-server # run in foreground; or
sudo rabbitmq-server -detached # run in background
# Configure RabbitMQ: create user and grant privileges
rabbitmqctl add_user rabbitmq_user_name rabbitmq_password
rabbitmqctl add_vhost rabbitmq_virtual_host_name
rabbitmqctl set_user_tags rabbitmq_user_name rabbitmq_tag_name
rabbitmqctl set_permissions -p rabbitmq_virtual_host_name rabbitmq_user_name ".*" ".*" ".*"
# Make the RabbitMQ server open to remote connections
# Go to /usr/local/etc/rabbitmq/rabbitmq-env.conf, and change NODE_IP_ADDRESS from 127.0.0.1 to 0.0.0.0 (development only; restrict access for prod)
</pre></div>
</li>
<li>
<p>Modify the configuration in AIRFLOW_HOME/airflow.cfg</p>
<div class="highlight"><pre># Change the executor to Celery Executor
executor = CeleryExecutor
# Set up the RabbitMQ broker url and celery result backend
broker_url = amqp://rabbitmq_user_name:rabbitmq_password@host_name/rabbitmq_virtual_host_name # host_name=localhost on server
celery_result_backend = same as above
</pre></div>
</li>
<li>
<p>Open the meta DB (PostgreSQL) to remote connections</p>
<div class="highlight"><pre># Modify /usr/local/var/postgres/pg_hba.conf to add Client Authentication Record
host all all 0.0.0.0/0 md5 # 0.0.0.0/0 stands for all ips; use CIDR address to restrict access; md5 for pwd authentication
# Change the Listen Address in /usr/local/var/postgres/postgresql.conf
listen_addresses = '*'
# Create a user and grant privileges (run the commands below under superuser of postgres)
CREATE USER your_postgres_user_name WITH ENCRYPTED PASSWORD 'your_postgres_pwd';
GRANT ALL PRIVILEGES ON DATABASE your_database_name TO your_postgres_user_name;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO your_postgres_user_name;
# Restart the PostgreSQL server and test it out
brew services restart postgresql
psql -U [postgres_user_name] -h [postgres_host_name] -d [postgres_database_name]
# IMPORTANT: update your sql_alchemy_conn string in airflow.cfg
</pre></div>
</li>
<li>
<p>Configure your airflow workers; follow most steps for the airflow server, except that they do not have PostgreSQL and RabbitMQ servers</p>
</li>
<li>
<p>Test it out</p>
<div class="highlight"><pre># Start your airflow workers, on each worker, run:
airflow worker # The prompt will show the worker is ready to pick up tasks if everything goes well
# Start your airflow server
airflow webserver
airflow scheduler
airflow worker # [Optional] Let your airflow server be a worker as well
</pre></div>
</li>
<li>
<p>Your airflow workers should now be picking up and running jobs from the airflow server!</p>
</li>
</ol>
<h2>Monte Carlo Tree Search and Its Application in AlphaGo</h2>
<p>Tianlong Song, 2016-04-09</p>
<p>As one of the most important methods in artificial intelligence (AI), especially for playing games, Monte Carlo tree search (MCTS) has received considerable interest due to its spectacular success in the difficult problem of computer Go. In fact, most successful computer Go algorithms are powered by MCTS, including the recent success of Google's AlphaGo<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>. This post introduces MCTS and explains how it is used in AlphaGo.</p>
<h3>Warm Up: Bandit-Based Methods<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></h3>
<p><em>Bandit problems</em> are a well-known class of sequential decision problems, in which one needs to choose among <span class="math">\(K\)</span> actions (e.g. the <span class="math">\(K\)</span> arms of a multi-armed bandit slot machine) in order to maximize the cumulative reward by consistently taking the optimal action. The choice of action is difficult as the underlying reward distributions are unknown, and potential rewards must be estimated based on past observations. This leads to the <em>exploitation-exploration dilemma</em>: one needs to balance the exploitation of the action currently believed to be optimal with the exploration of other actions that currently appear suboptimal but may turn out to be superior in the long run.</p>
<p>The primary goal is to find a policy that minimizes the player's regret after <span class="math">\(n\)</span> plays, defined as the difference between: (i) the best possible total reward, had the player known the reward distributions from the beginning; and (ii) the actual total reward from the <span class="math">\(n\)</span> finished plays. In other words, the regret is the expected loss due to not always playing the best bandit. An upper confidence bound (UCB) policy has been proposed, which achieves an expected logarithmic growth of regret uniformly over the total number of plays <span class="math">\(n\)</span>, without any prior knowledge of the reward distributions. According to the UCB policy, to minimize regret, the player should choose, for the current play, the arm <span class="math">\(j\)</span> that maximizes:
</p>
<div class="math">\begin{equation}
\overline{X}_j+\sqrt{\frac{2\ln{n}}{n_j}},
\end{equation}</div>
<p>
where <span class="math">\(\overline{X}_j\)</span> is the average reward from arm <span class="math">\(j\)</span>, <span class="math">\(n_j\)</span> is the number of times arm <span class="math">\(j\)</span> was played and <span class="math">\(n\)</span> is the total number of plays so far. The physical meaning is that: the term <span class="math">\(\overline{X}_j\)</span> encourages the exploitation of higher-rewarded choices, while the term <span class="math">\(\sqrt{\frac{2\ln{n}}{n_j}}\)</span> encourages the exploration of less-visited choices.</p>
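<p>A quick numerical illustration of the formula (the arm statistics below are invented): even though arm 0 has the best average reward, the rarely played arm 2 gets the current play because of its large exploration bonus.</p>

```python
import math

def ucb(avg_reward, n_j, n):
    # average reward (exploitation) + confidence radius (exploration)
    return avg_reward + math.sqrt(2 * math.log(n) / n_j)

n = 100                                   # total plays so far
arms = [(0.6, 50), (0.2, 45), (0.5, 5)]   # (average reward, times played)
scores = [ucb(x_bar, n_j, n) for x_bar, n_j in arms]
choice = max(range(len(arms)), key=lambda j: scores[j])  # arm 2
```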
<h3>Monte Carlo Tree Search (MCTS)<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></h3>
<p>Let us use a board game as an example. Given a board state, the primary goal is to find the best action to take next, which should naturally be chosen according to some precomputed value of each action. The purpose of MCTS is to approximate the (true) values of the actions that may be taken from the current board state. This is achieved by iteratively building a partial search tree.</p>
<h4>Four Fundamental Steps in Each Iteration</h4>
<p>The basic algorithm involves iteratively building a search tree until some predefined computational budget (e.g., time, memory or iteration constraint) is reached, at which point the search is halted and the best-performing root action returned. Each node in the search tree represents a state, and directed links to child nodes represent actions leading to subsequent states.</p>
<figure align="center">
<img src="/figures/20160409/MCTS.png" alt="MCTS">
<figcaption align="center">Fig. 1: Four steps in one iteration of MCTS.</figcaption>
</figure>
<p>As illustrated in Fig. 1, four steps are applied in each iteration<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>:</p>
<ol>
<li>Selection: Starting from the root node (i.e., current state), a tree policy for child selection is recursively applied to descend through the tree until an expandable node is reached. A node is expandable if it represents a non-terminal state and has unvisited (i.e., expandable) children.</li>
<li>Expansion: For the expandable node we reached in the selection step, one child node is added to expand the tree, according to the available actions.</li>
<li>Simulation: A simulation is run from the newly expanded node according to the default policy to produce an outcome (e.g., win or lose when reaching a terminal state).</li>
<li>Backpropagation: The simulated result is backpropagated through the selected nodes in the selection step to update their statistics.</li>
</ol>
<p>There are two essential ideas that should be highlighted here:</p>
<ul>
<li>The tree policy for child selection should give high-value nodes priority in value approximation, while still exploring the less-visited nodes. This is quite similar to the bandit problem, so we can apply the UCB policy to choose the child node.</li>
<li>The value of each node is approximated in an incremental way. That is, its initial value is obtained from a random simulation by the default policy (e.g., a win/lose result along a random path), and then refined by the backpropagation steps during the following iterations.</li>
</ul>
<h4>The Full Algorithm Description</h4>
<p>Before describing the algorithm, let us define some notations first.</p>
<blockquote>
<p><span class="math">\(s(v)\)</span>: the associated state to node <span class="math">\(v\)</span></p>
<p><span class="math">\(a(v)\)</span>: the incoming action that leads to node <span class="math">\(v\)</span></p>
<p><span class="math">\(N(v)\)</span>: the visit count of node <span class="math">\(v\)</span></p>
<p><span class="math">\(Q(v)\)</span>: the vector of total simulation rewards of node <span class="math">\(v\)</span> for all players</p>
</blockquote>
<p>The main procedure of the MCTS algorithm is described below, which essentially executes the four fundamental steps for each iteration until the computational budget is reached. It returns the best action that should be taken for the current state.</p>
<p><img src="/figures/20160409/Main.png" alt="Main"></p>
<p>The selection step is described below, which returns the expandable node according to the tree policy.</p>
<p><img src="/figures/20160409/TreePolicy.png" alt="TreePolicy"></p>
<p>The child selection is described below, which returns the best child of a given node. It essentially applies the UCB method, which uses a constant <span class="math">\(c\)</span> to balance the exploitation with the exploration. It should be noted that there might be multiple players, but the best child is selected as per the interest of the player who is supposed to play in this state.</p>
<p><img src="/figures/20160409/BestChild.png" alt="BestChild"></p>
<p>The selected node after the selection step is expanded by choosing one of its unvisited children, and then adding the associated data to the new node. The procedure is described below.</p>
<p><img src="/figures/20160409/Expand.png" alt="Expand"></p>
<p>Given the state associated with the newly expanded node, a random simulation is run as indicated below, which finds a random path to a terminal state and returns the simulated reward.</p>
<p><img src="/figures/20160409/DefaultPolicy.png" alt="DefaultPolicy"></p>
<p>Once the simulated reward of the newly expanded node is obtained, it is backpropagated through the selected nodes in the selection step. The visit counts are updated at the same time.</p>
<p><img src="/figures/20160409/Backup.png" alt="Backup"></p>
<p>Recalling the board game example, assume that the rewards of winning and losing a game are 1 and 0, respectively. After applying the MCTS algorithm, for each node <span class="math">\(v\)</span> in the tree, <span class="math">\(Q(v)\)</span> is the number of wins accumulated over the <span class="math">\(N(v)\)</span> visits of this node, and thus <span class="math">\(\frac{Q(v)}{N(v)}\)</span> is the winning rate. This is exactly the information we can rely on to choose the best action to take in the current state.</p>
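<p>The four steps can be made concrete with a toy implementation. The sketch below (an illustration under simplifying assumptions, not the survey's pseudo code) plays a simple subtraction game: players alternately remove 1 or 2 objects, and whoever takes the last one wins. From 4 objects, taking 1 leaves the opponent at 3, a lost position under perfect play, and MCTS discovers this.</p>

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state        # objects left; the players alternate moves
        self.parent = parent
        self.action = action      # incoming action a(v)
        self.children = []
        self.untried = [a for a in (1, 2) if a <= state]
        self.N = 0                # visit count N(v)
        self.Q = 0.0              # total reward for the player who moved into v

def best_child(v, c=1.0):
    # Tree policy: UCB over the children, for the player to move at v
    return max(v.children, key=lambda u:
               u.Q / u.N + c * math.sqrt(2 * math.log(v.N) / u.N))

def default_policy(state):
    # Random playout: returns 1 if the player who just moved into `state` wins
    if state == 0:
        return 1                  # the last object was just taken
    a = random.choice([x for x in (1, 2) if x <= state])
    return 1 - default_policy(state - a)

def backup(v, reward):
    # Backpropagation: flip the reward's perspective at every ply
    while v is not None:
        v.N += 1
        v.Q += reward
        reward = 1 - reward
        v = v.parent

def mcts(root_state, iterations=3000):
    root = Node(root_state)
    for _ in range(iterations):
        v = root
        while not v.untried and v.children:    # 1. Selection
            v = best_child(v)
        if v.untried:                          # 2. Expansion
            a = v.untried.pop()
            v.children.append(Node(v.state - a, parent=v, action=a))
            v = v.children[-1]
        reward = default_policy(v.state)       # 3. Simulation
        backup(v, reward)                      # 4. Backpropagation
    return max(root.children, key=lambda u: u.N).action  # best root action

random.seed(0)
best = mcts(4)   # take 1 object, leaving the opponent at a lost position
```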
<h3>How is MCTS used by Google's AlphaGo?<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>We now understand how MCTS works. Can MCTS be directly applied to computer Go? Yes, but there is a better way to do it. The reason is that Go is a very high-branching game. Consider the number of all possible sequences of moves, <span class="math">\(b^d\)</span>, where <span class="math">\(b\)</span> is the game's breadth (number of legal moves per state, <span class="math">\(b \approx 250\)</span> for Go), and <span class="math">\(d\)</span> is the game's depth (game length, <span class="math">\(d \approx 150\)</span> for Go). As a result, exhaustive search is computationally impossible. Applying MCTS to Go in a straightforward way helps, but the benefits of MCTS are not fully exploited, since the limited number of simulations can only scratch the surface of the giant search space.</p>
<p>Aided by the useful information learned by two deep convolutional neural networks (a.k.a., <a href="https://stlong0521.github.io/20160403%20-%20NN%20and%20DL.html">deep learning</a>), a policy network and a value network, Google's AlphaGo applies MCTS in an innovative way.</p>
<ul>
<li>First, for the tree policy to select child, instead of using the UCB method, AlphaGo takes into account the prior probability of actions learned by the policy network. More specifically, for node <span class="math">\(v_0\)</span>, the child <span class="math">\(v\)</span> is selected by maximizing
<div class="math">\begin{equation}
\frac{Q(v)}{N(v)}+\frac{P(v|v_0)}{1+N(v)},
\end{equation}</div>
where <span class="math">\(P(v|v_0)\)</span> is the prior probability provided by the policy network. This greatly improves the child selection policy, giving professional moves (e.g., those played by human experts) priority in the MCTS search.</li>
<li>Second, for the default policy to evaluate expanded nodes, AlphaGo combines the outcomes from simulation steps and node values learned by the value network, and their weights are balanced by a constant <span class="math">\(\lambda\)</span>.</li>
</ul>
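<p>In code form, with invented child statistics, the modified selection rule looks like this (the function name is ours, not AlphaGo's):</p>

```python
def selection_score(Q, N, prior):
    # Q/N: observed value; prior/(1 + N): policy-network guidance, fading with visits
    exploit = Q / N if N > 0 else 0.0
    return exploit + prior / (1 + N)

# Hypothetical children of v0: (total reward Q, visit count N, prior P(v|v0))
children = [(6.0, 10, 0.1), (1.0, 2, 0.6), (0.0, 0, 0.3)]
scores = [selection_score(Q, N, P) for Q, N, P in children]
selected = max(range(len(children)), key=lambda i: scores[i])  # child 1
```

<p>The second child has a mediocre observed value (0.5) but a strong prior, so it is selected over the well-explored first child; as its visit count grows, the prior term fades and <span class="math">\(\frac{Q(v)}{N(v)}\)</span> takes over.</p>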
<p>Note that both the policy network and value network are trained offline, which greatly reduces the time cost in a real-time contest.</p>
<h3>Acknowledgement</h3>
<p>A large majority of this post, including Fig. 1 and the pseudo codes, comes from the survey paper<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup>. More details about MCTS and its variants can be found there.</p>
<h3>References</h3>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>D. Silver, et al., <em><a href="http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html">Mastering the game of Go with deep neural networks and tree search</a></em>, Nature, 2016. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>C. Browne, et al., <em><a href="http://www.cameronius.com/cv/mcts-survey-master.pdf">A Survey of Monte Carlo Tree Search Methods</a></em>, IEEE Transactions on Computational Intelligence and AI in Games, 2012. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>
<h2>Neural Networks and Deep Learning</h2>
<p>Tianlong Song, 2016-04-03</p>
<p>It has been a long time since the idea of neural networks was proposed, but it is really during the last few years that neural networks have become widely used. One of the major enablers is the infrastructure with high computational capability (e.g., cloud computing), which makes the training of large and deep (multilayer) neural networks possible. This post is in no way an exhaustive review of neural networks or deep learning, but rather an entry-level introduction excerpted from a very popular book<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup>.</p>
<h3>Neural Network (NN) Basics</h3>
<p>Let us start with sigmoid neurons, and then find out how they are used in NNs.</p>
<h4>Sigmoid Neurons</h4>
<p>As the smallest unit in NNs, a sigmoid neuron mimics the behaviour of a real neuron in human neural systems. It takes multiple inputs and generates one single output, as in a process of local or partial decision making. More specifically, given a series of inputs <span class="math">\([x_1,x_2,...]\)</span>, a neuron applies the sigmoid function to the weighted sum of the inputs plus a bias, i.e., the output of the neuron is computed as
</p>
<div class="math">\begin{equation}
\sigma(z)=\sigma(\sum_j{w_jx_j}+b)=\frac{1}{1+\exp(-\sum_j{w_jx_j}-b)},
\end{equation}</div>
<p>
where <span class="math">\(z=\sum_j{w_jx_j}+b\)</span> is the weighted input to the neuron, and the sigmoid function, <span class="math">\(\sigma(z)=\frac{1}{1+\exp(-z)}\)</span>, is to approximate the step function as usually used in binary decision making. A natural question is: why do not we just use the step function? The answer is that the step function is not smooth (not differentiable at origin), which disables the gradient method in model learning. With the smoothed version of the step function, we are safe to relate the change at the output to the weight/bias changes by
</p>
<div class="math">\begin{equation}
\Delta{output}=\sum_j{\frac{\partial{output}}{\partial{w_j}}\Delta{w_j}}+\frac{\partial{output}}{\partial{b}}\Delta{b}.
\end{equation}</div>
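<p>The neuron's computation is short enough to write out directly; the inputs, weights and bias below are made up for illustration:</p>

```python
import math

def sigmoid(z):
    # smooth approximation of the step function
    return 1.0 / (1.0 + math.exp(-z))

def neuron(xs, ws, b):
    # weighted input z = sum_j w_j * x_j + b, squashed into (0, 1)
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return sigmoid(z)

out = neuron([1.0, 0.5], [0.4, -0.6], 0.2)   # z = 0.4 - 0.3 + 0.2 = 0.3
```

<p>Because the sigmoid is differentiable everywhere, small changes in the weights and bias produce small, predictable changes in the output, which is exactly what gradient-based learning needs.</p>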
<h4>The Architecture of NNs</h4>
<p>The architecture of a typical NN is depicted in Fig. 1. As shown, the leftmost layer in this network is called the input layer, and the neurons within this layer are called input neurons. The rightmost or output layer contains the output neuron(s). The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs.</p>
<figure align="center">
<img src="/figures/20160403/NN.png" alt="Neural Network Architecture">
<figcaption align="center">Fig. 1: The neural network architecture.</figcaption>
</figure>
<p>It was proved that NNs with a single hidden layer can approximate any continuous function to any desired precision. Click <a href="http://neuralnetworksanddeeplearning.com/chap4.html">here</a> to see how. It can be expected that an NN with more neurons/layers could be more accurate in the approximation.</p>
<h4>Learning with Gradient Descent</h4>
<p>Before training a model, we first need to find a way to quantify how well we are achieving. That is, we need to introduce a cost function. Although there are many cost functions available, we will start with the quadratic cost function, which is defined as
</p>
<div class="math">\begin{equation}
C(w,b)=\frac{1}{2n}\sum_x{||y(x)-a||^2},
\end{equation}</div>
<p>
where <span class="math">\(w\)</span> denotes the collection of all weights in the network, <span class="math">\(b\)</span> all the biases, <span class="math">\(n\)</span> is the total number of training inputs, <span class="math">\(a\)</span> is the vector of outputs from the network when <span class="math">\(x\)</span> is input, and the sum is over all training inputs, <span class="math">\(x\)</span>.</p>
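<p>The cost above is straightforward to compute once the network outputs are available. Here is a minimal NumPy sketch; the training pairs and the stand-in "network" below are hypothetical placeholders for illustration only:</p>

```python
import numpy as np

# Toy training set: inputs x and desired outputs y (hypothetical values).
xs = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]

def network(x):
    # Placeholder for a(x): a fixed affine map through a sigmoid.
    W = np.array([[0.5, -0.3], [0.2, 0.8]])
    return 1.0 / (1.0 + np.exp(-(W @ x)))

def quadratic_cost(xs, ys):
    """C(w, b) = (1 / 2n) * sum_x ||y(x) - a||^2 over all training inputs."""
    n = len(xs)
    return sum(np.sum((y - network(x)) ** 2) for x, y in zip(xs, ys)) / (2 * n)

print(quadratic_cost(xs, ys))
```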
<p>Applying the gradient descent method, we can learn the weights and biases by
</p>
<div class="math">\begin{equation}
w_k'=w_k-\eta\frac{\partial{C}}{\partial{w_k}},
\end{equation}</div>
<div class="math">\begin{equation}
b_l'=b_l-\eta\frac{\partial{C}}{\partial{b_l}}.
\end{equation}</div>
<p>
To compute each gradient exactly, we need to take into account all the training inputs <span class="math">\(x\)</span> in each iteration, which slows learning down when the training set is large. An idea called <em>stochastic gradient descent (a.k.a. mini-batch learning)</em> can be used to speed up learning: the gradient is estimated by computing it for a small sample of randomly chosen training inputs.</p>
<h3>The Backpropagation Algorithm: How to Compute Gradients of the Cost Function?</h3>
<p>So far we have seen that the model parameters can be learned by the gradient descent method, but computing the gradients can be challenging in itself, since the network size and the data size can both be very large. In this section, we will see how the backpropagation algorithm helps compute the gradients efficiently.</p>
<h4>Matrix Notation for NNs</h4>
<p>For ease of presentation, let us define some notations first.</p>
<blockquote>
<p><span class="math">\(w_{jk}^l\)</span>: weight from the <span class="math">\(k\)</span>th neuron in layer <span class="math">\(l-1\)</span> to the <span class="math">\(j\)</span>th neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(w^l=\{w_{jk}^l\}\)</span>: matrix including all weights from each neuron in layer <span class="math">\(l-1\)</span> to each neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(b_j^l\)</span>: bias for the <span class="math">\(j\)</span>th neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(b^l=\{b_j^l\}\)</span>: column vector including all biases for each neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(a_j^l=\sigma(\sum_k{w_{jk}^la_k^{l-1}+b_j^l})\)</span>: activation of the <span class="math">\(j\)</span>th neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(a^l=\{a_j^l\}=\sigma(w^la^{l-1}+b^l)\)</span>: column vector including all activations of each neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(z_j^l=\sum_k{w_{jk}^la_k^{l-1}+b_j^l}\)</span>: weighted input to the <span class="math">\(j\)</span>th neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(z^l=\{z_j^l\}=w^la^{l-1}+b^l\)</span>: column vector including all weighted inputs to each neuron in layer <span class="math">\(l\)</span>;</p>
<p><span class="math">\(\delta_j^l=\frac{\partial{C}}{\partial{z_j^l}}\)</span>: gradient of the cost function w.r.t. the weighted input to the <span class="math">\(j\)</span>th neuron in layer <span class="math">\(l\)</span>, <span class="math">\(z_j^l\)</span>;</p>
<p><span class="math">\(\delta^l=\{\delta_j^l\}\)</span>: column vector including all gradients of the cost function w.r.t. the weighted input to each neuron in layer <span class="math">\(l\)</span>.</p>
</blockquote>
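<p>The matrix notation above maps directly to code: a feedforward pass is just a loop computing <span class="math">\(z^l\)</span> and <span class="math">\(a^l\)</span> layer by layer. Below is a minimal NumPy sketch, assuming arbitrary toy layer sizes and randomly initialized parameters:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy network: 3 input neurons, 4 hidden, 2 output (sizes are arbitrary).
sizes = [3, 4, 2]
# weights[l] has shape (neurons in layer l+1, neurons in layer l),
# matching the w^l_{jk} convention above.
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def feedforward(x):
    """Compute z^l = w^l a^{l-1} + b^l and a^l = sigma(z^l) layer by layer."""
    a = x
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

x = rng.standard_normal((3, 1))        # one input, as a column vector
zs, activations = feedforward(x)
print(activations[-1].shape)            # (2, 1): the network output a^L
```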
<h4>Four Fundamental Equations behind Backpropagation</h4>
<p>There are four fundamental equations behind backpropagation, which are explained one by one below.</p>
<p><em>First</em>, the gradient of the cost function w.r.t. the weighted input to each neuron in output layer <span class="math">\(L\)</span> can be computed as
</p>
<div class="math">\begin{equation}
\delta^L=\nabla_{a^L}C \odot \sigma'(z^L),
\end{equation}</div>
<p>
where <span class="math">\(\nabla_{a^L}C=\{\frac{\partial{C}}{\partial{a_j^L}}\}\)</span> is defined to be a column vector whose components are the partial derivatives <span class="math">\(\frac{\partial{C}}{\partial{a_j^L}}\)</span>, <span class="math">\(\sigma'(z)\)</span> is the first-order derivative of the sigmoid function <span class="math">\(\sigma(z)\)</span>, and <span class="math">\(\odot\)</span> represents an element-wise product.</p>
<p><em>Second</em>, the gradient of the cost function w.r.t. the weighted input to each neuron in layer <span class="math">\(l(l<L)\)</span> can be computed from the results of layer <span class="math">\(l+1\)</span> (backpropagation), i.e.,
</p>
<div class="math">\begin{equation}
\delta^l=((w^{l+1})^T\delta^{l+1}) \odot \sigma'(z^l).
\end{equation}</div>
<p><em>Third</em>, the gradient of the cost function w.r.t. each bias can be computed as
</p>
<div class="math">\begin{equation}
\frac{\partial{C}}{\partial{b_j^l}}=\delta_j^l.
\end{equation}</div>
<p><em>Fourth</em>, the gradient of the cost function w.r.t. each weight can be computed as
</p>
<div class="math">\begin{equation}
\frac{\partial{C}}{\partial{w_{jk}^l}}=\delta_j^la_k^{l-1}.
\end{equation}</div>
<p>The four equations above are not straightforward at first sight, but they are all consequences of the chain rule from multivariable calculus. The proof can be found <a href="http://neuralnetworksanddeeplearning.com/chap2.html#proof_of_the_four_fundamental_equations_(optional)">here</a>.</p>
<h4>The Backpropagation Algorithm</h4>
<p>The backpropagation algorithm essentially includes a feedforward process and a backpropagation process. More specifically, in each iteration:</p>
<ol>
<li>Input a mini-batch of <span class="math">\(m\)</span> training examples;</li>
<li>For each training example <span class="math">\(x\)</span>:<ul>
<li>Initialization: set the activations of the input layer by <span class="math">\(a^{x,1}=x\)</span>;</li>
<li>Feedforward: for each <span class="math">\(l=2,3,...,L\)</span>, compute <span class="math">\(z^{x,l}=w^la^{x,l-1}+b^l\)</span> and <span class="math">\(a^{x,l}=\sigma(z^{x,l})\)</span>;</li>
<li>Output error: compute the error vector <span class="math">\(\delta^{x,L}=\nabla_{a^L}C_x \odot \sigma'(z^{x,L})\)</span>;</li>
<li>Backpropagate the error: for <span class="math">\(l=L-1,L-2,...,2\)</span>, compute <span class="math">\(\delta^{x,l}=((w^{l+1})^T\delta^{x,l+1}) \odot \sigma'(z^{x,l})\)</span>.</li>
</ul>
</li>
<li>Compute gradients and apply the gradient descent method by
<div class="math">\begin{equation}
\frac{\partial{C_x}}{\partial{w_{jk}^l}}=\delta_j^{x,l}a_k^{x,l-1},~~\frac{\partial{C_x}}{\partial{b_j^l}}=\delta_j^{x,l},~~w^l=w^l-\frac{\eta}{m}\sum_x{\delta^{x,l}(a^{x,l-1})^T},~~b^l=b^l-\frac{\eta}{m}\sum_x{\delta^{x,l}}.
\end{equation}</div>
</li>
</ol>
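<p>The full algorithm above can be sketched in NumPy as follows. This is a toy illustration rather than a production implementation: the layer sizes, mini-batch, and learning rate are arbitrary, and the quadratic cost is assumed, so <span class="math">\(\nabla_{a^L}C_x=a^L-y\)</span>:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

sizes = [3, 4, 2]                       # toy layer sizes (arbitrary)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def backprop(x, y):
    """Per-example gradients (nabla_w, nabla_b) via the four equations."""
    # Feedforward, caching z^l and a^l.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Output error (first equation); for the quadratic cost nabla_a C = a - y.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [delta]                           # third equation
    nabla_w = [delta @ activations[-2].T]       # fourth equation
    # Backpropagate the error (second equation), layer by layer.
    for l in range(2, len(sizes)):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b.insert(0, delta)
        nabla_w.insert(0, delta @ activations[-l - 1].T)
    return nabla_w, nabla_b

def update_mini_batch(batch, eta):
    """Average per-example gradients and take one gradient-descent step."""
    sum_w = [np.zeros_like(w) for w in weights]
    sum_b = [np.zeros_like(b) for b in biases]
    for x, y in batch:
        nw, nb = backprop(x, y)
        sum_w = [acc + d for acc, d in zip(sum_w, nw)]
        sum_b = [acc + d for acc, d in zip(sum_b, nb)]
    for i in range(len(weights)):
        weights[i] -= (eta / len(batch)) * sum_w[i]
        biases[i] -= (eta / len(batch)) * sum_b[i]

# One update on a random mini-batch of m = 4 examples.
batch = [(rng.standard_normal((3, 1)), rng.standard_normal((2, 1)))
         for _ in range(4)]
update_mini_batch(batch, eta=0.5)
```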
<h3>Deep Learning</h3>
<p>Everything seems to be going well so far! What if our NNs are deep, i.e., have many hidden layers? Typically, we expect a deep NN to deliver better performance than a shallow one. Unfortunately, it has been observed that for deep NNs, the learning speeds of the first few layers of neurons can be much higher/lower than those of the last few layers, in which case the NNs cannot be trained well. Click <a href="http://neuralnetworksanddeeplearning.com/chap5.html#the_vanishing_gradient_problem">here</a> to see why. This is known as the <em>vanishing/exploding gradient problem</em>. To mitigate this problem, it is suggested that we resort to convolutional networks.</p>
<h4>Convolutional Networks</h4>
<p>Convolutional neural networks use three basic ideas: local receptive fields, shared weights, and pooling. A typical convolutional network is depicted in Fig. 2, where the network includes one input layer, one convolutional layer, one pooling layer and one output layer (fully connected). We will try to understand the three ideas using this network as an example.</p>
<figure align="center">
<img src="/figures/20160403/ConvN.png" alt="Convolutional Networks">
<figcaption align="center">Fig. 2: The convolutional network.</figcaption>
</figure>
<p><em>Local receptive fields</em>: Given a 28x28 image as the input layer, instead of connecting directly to all 28x28 pixels, the convolutional layer connects only to small, localized regions (local receptive fields) of the input image, in the hope of detecting localized features. In this example, the region size is 5x5, so exploring all such regions with a stride length of one results in 24x24 neurons. Here we have three feature maps, so the convolutional layer eventually has 3x24x24 neurons.</p>
<p><em>Shared weights and biases</em>: We use the same weights and bias for each neuron in a 24x24 feature map. It makes sense to use the same parameters in detecting the same feature but only in different local regions, and it can greatly reduce the number of parameters involved in a convolutional network compared to a fully connected layer.</p>
<p><em>The pooling layer</em>: A pooling layer is usually used immediately after a convolutional layer. What the pooling layer does is simplify the information in the output from the convolutional layer. In this example, each unit in the pooling layer summarizes a 2x2 region of the preceding convolutional layer, which results in 3x12x12 neurons. The pooling can be max-pooling (taking the maximum activation in the summarized region) or L2 pooling (taking the square root of the sum of the squares of the activations in the summarized region).</p>
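<p>The three ideas can be illustrated with a few lines of NumPy. The sketch below builds a single 24x24 feature map from a 28x28 input with one shared 5x5 kernel (all values random/hypothetical), then max-pools 2x2 regions into a 12x12 output; three feature maps would simply repeat this with three kernels:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

image = rng.standard_normal((28, 28))   # a toy 28x28 input "image"
kernel = rng.standard_normal((5, 5))    # shared weights for one feature map
bias = 0.1                              # shared bias

# Convolutional layer: slide the 5x5 local receptive field with stride 1,
# applying the SAME weights and bias everywhere -> a 24x24 feature map.
feature_map = np.empty((24, 24))
for i in range(24):
    for j in range(24):
        patch = image[i:i + 5, j:j + 5]
        feature_map[i, j] = sigmoid(np.sum(patch * kernel) + bias)

# Max-pooling layer: each unit summarizes a 2x2 region -> 12x12 outputs.
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))

print(feature_map.shape, pooled.shape)  # (24, 24) (12, 12)
```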
<p>To summarize, a convolutional network tries to detect localized features while greatly reducing the number of parameters. In principle, we could include any number of convolutional/pooling layers as well as fully connected layers in our networks. What is exciting is that the backpropagation algorithm applies to convolutional networks with only minor modifications.</p>
<p>Do convolutional networks avoid the vanishing/exploding gradient problem? Not really, but they help us proceed anyway, e.g., by reducing the number of parameters. The problem can be considerably alleviated by convolutional/pooling layers as well as other techniques, such as powerful regularization, advanced artificial neurons, and more training epochs enabled by GPUs.</p>
<h4>Deep Learning in Practice</h4>
<p>It may not be that hard to construct a very basic (deep) NN and achieve decent performance. However, there are still many techniques that can help improve an NN's performance further. A couple of insightful ideas are listed below, which should be kept in mind when working towards a better NN.</p>
<ul>
<li>Use more convolutional/pooling layers: reduce the number of parameters and make learning faster;</li>
<li>Use <a href="http://neuralnetworksanddeeplearning.com/chap3.html#the_cross-entropy_cost_function">right cost functions</a>: resolve the problem of learning slow down by the cross-entropy cost function;</li>
<li>Use <a href="http://neuralnetworksanddeeplearning.com/chap3.html#overfitting_and_regularization">regularization</a>: combat overfitting by L2/L1/drop-out regularization;</li>
<li>Use <a href="http://neuralnetworksanddeeplearning.com/chap3.html#other_models_of_artificial_neuron">advanced neurons</a>: for example, tanh neurons and rectified linear units;</li>
<li>Use <a href="http://neuralnetworksanddeeplearning.com/chap3.html#weight_initialization">good weight initialization</a>: avoid neuron saturation and learn faster;</li>
<li>Use an expanded training data set: for example, rotate/shift images as new training data in image classification;</li>
<li>Use an ensemble of NNs: heavy computation, but multiple models beat one;</li>
<li>Use GPUs: gain more training epochs.</li>
</ul>
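<p>As a small illustration of the expanded-training-data idea, here is a NumPy sketch that generates shifted copies of a toy image; zero-padded translation is only one of many possible augmentations, and the image values are hypothetical:</p>

```python
import numpy as np

# A toy 4x4 "image" (hypothetical pixel values).
image = np.arange(16, dtype=float).reshape(4, 4)

def shift(img, dy, dx):
    """Shift an image by (dy, dx) pixels, padding with zeros: a simple way
    to expand the training set with translated copies."""
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):min(h, h + dy), max(dx, 0):min(w, w + dx)] = \
        img[max(-dy, 0):min(h, h - dy), max(-dx, 0):min(w, w - dx)]
    return out

# Each training image yields several shifted variants.
augmented = [shift(image, dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
print(len(augmented))  # 9
```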
<h3>Acknowledgement</h3>
<p>A large majority of this post comes from Michael Nielsen's book<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup> entitled "Neural Networks and Deep Learning", which I strongly recommend to anyone interested in discovering how neural networks essentially work.</p>
<h3>References</h3>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>M. Nielsen, <em><a href="http://neuralnetworksanddeeplearning.com/">Neural Networks and Deep Learning</a></em>, Determination Press, 2015. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
var location_protocol = (false) ? 'https' : document.location.protocol;
if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' }, Macros: {} }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Latent Dirichlet Allocation and Topic Modeling2016-03-26T00:00:00-04:00Tianlong Songtag:stlong0521.github.io,2016-03-26:20160326 - LDA.html<p>When reading an article, we humans are able to easily identify the topics the article talks about. An interesting question is: can we automate this process, i.e., train a machine to find out the underlying topics in articles? In this post, a very popular topic modeling method, Latent Dirichlet allocation (LDA), will be discussed.</p>
<h3>Latent Dirichlet Allocation (LDA) Topic Model<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>Given a library of <span class="math">\(M\)</span> documents, <span class="math">\(\mathcal{L}=\{d_1,d_2,...,d_M\}\)</span>, where each document <span class="math">\(d_m\)</span> contains a sequence of words, <span class="math">\(d_m=\{w_{m,1},w_{m,2},...,w_{m,N_m}\}\)</span>, we need a model that describes how these documents are essentially generated. Considering <span class="math">\(K\)</span> topics and a vocabulary <span class="math">\(V\)</span>, the LDA topic model assumes that the documents are generated by the following two steps:</p>
<ul>
<li>For each document <span class="math">\(d_m\)</span>, use a doc-to-topic model parameterized by <span class="math">\(\boldsymbol\vartheta_m\)</span> to generate the topic for the <span class="math">\(n\)</span>th word and denote it as <span class="math">\(z_{m,n}\)</span>, for all <span class="math">\(1 \leq n \leq N_m\)</span>;</li>
<li>For each generated topic <span class="math">\(k=z_{m,n}\)</span> corresponding to each word in each document, use a topic-to-word model parameterized by <span class="math">\(\boldsymbol\varphi_k\)</span> to generate the word <span class="math">\(w_{m,n}\)</span>.</li>
</ul>
<figure align="center">
<img src="/figures/20160326/LDA.png" alt="LDA Model">
<figcaption align="center">Fig. 1: LDA topic model.</figcaption>
</figure>
<p>The two steps are graphically illustrated in Fig. 1. Considering that the doc-to-topic model and the topic-to-word model essentially follow multinomial distributions (counts of each topic in a document or each word in a topic), a good prior for their parameters, <span class="math">\(\boldsymbol\vartheta_m\)</span> and <span class="math">\(\boldsymbol\varphi_k\)</span>, would be the conjugate prior of multinomial distribution, <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet distribution</a>.</p>
<blockquote>
<p>A conjugate prior, <span class="math">\(p(\boldsymbol\varphi)\)</span>, of a likelihood, <span class="math">\(p(\textbf{x}|\boldsymbol\varphi)\)</span>, is a distribution that results in a posterior distribution, <span class="math">\(p(\boldsymbol\varphi|\textbf{x})\)</span>, with the same functional form as the prior (but different parameters). For example, the conjugate prior of a multinomial distribution is the Dirichlet distribution. That is, for a multinomial distribution parameterized by <span class="math">\(\boldsymbol\varphi\)</span>, if the prior for <span class="math">\(\boldsymbol\varphi\)</span> is a Dirichlet distribution characterized by <span class="math">\(Dir(\boldsymbol\varphi|\boldsymbol\alpha)\)</span>, then after observing <span class="math">\(\textbf{x}\)</span>, the posterior for <span class="math">\(\boldsymbol\varphi\)</span> still follows a Dirichlet distribution, <span class="math">\(Dir(\boldsymbol\varphi|\textbf{n}_x+\boldsymbol\alpha)\)</span>, which incorporates the counting result <span class="math">\(\textbf{n}_x\)</span> of the observation <span class="math">\(\textbf{x}\)</span>.</p>
</blockquote>
<p>Keeping this in mind, let us take a closer look at the two steps:</p>
<ol>
<li>In the first step, for the <span class="math">\(m\)</span>th document, assume the prior for the doc-to-topic model's parameter <span class="math">\(\boldsymbol\vartheta_m\)</span> follows <span class="math">\(Dir(\boldsymbol\vartheta_m|\boldsymbol\alpha)\)</span>; then, after observing the topics in the document and obtaining the counting result <span class="math">\(\textbf{n}_m\)</span>, the posterior for <span class="math">\(\boldsymbol\vartheta_m\)</span> is <span class="math">\(Dir(\boldsymbol\vartheta_m|\textbf{n}_m+\boldsymbol\alpha)\)</span>. After some calculation, we can obtain the topic distribution for the <span class="math">\(m\)</span>th document as
<div class="math">\begin{equation}
p(\textbf{z}_m|\boldsymbol\alpha)=\frac{\Delta(\textbf{n}_m+\boldsymbol\alpha)}{\Delta(\boldsymbol\alpha)},
\end{equation}</div>
where <span class="math">\(\Delta(\boldsymbol\alpha)\)</span> is the normalization factor for <span class="math">\(Dir(\textbf{p}|\boldsymbol\alpha)\)</span>, i.e., <span class="math">\(\Delta(\boldsymbol\alpha)=\int{\prod_{k=1}^K{p_k^{\alpha_k-1}}}d\textbf{p}\)</span>. Taking all documents into account,
<div class="math">\begin{equation} \label{Eqn:Doc2Topic}
p(\textbf{z}|\boldsymbol\alpha)=\prod_{m=1}^M{p(\textbf{z}_m|\boldsymbol\alpha)}=\prod_{m=1}^M{\frac{\Delta(\textbf{n}_m+\boldsymbol\alpha)}{\Delta(\boldsymbol\alpha)}}.
\end{equation}</div>
</li>
<li>In the second step, similarly, for the <span class="math">\(k\)</span>th topic, assume the prior for the topic-to-word model's parameter <span class="math">\(\boldsymbol\varphi_k\)</span> follows <span class="math">\(Dir(\boldsymbol\varphi_k|\boldsymbol\beta)\)</span>; then, after observing the words in the topic and obtaining the counting result <span class="math">\(\textbf{n}_k\)</span>, the posterior for <span class="math">\(\boldsymbol\varphi_k\)</span> is <span class="math">\(Dir(\boldsymbol\varphi_k|\textbf{n}_k+\boldsymbol\beta)\)</span>. After some calculation, we can obtain the word distribution for the <span class="math">\(k\)</span>th topic as
<div class="math">\begin{equation}
p(\textbf{w}_k|\textbf{z}_k,\boldsymbol\beta)=\frac{\Delta(\textbf{n}_k+\boldsymbol\beta)}{\Delta(\boldsymbol\beta)}.
\end{equation}</div>
Taking all topics into account,
<div class="math">\begin{equation} \label{Eqn:Topic2Word}
p(\textbf{w}|\textbf{z},\boldsymbol\beta)=\prod_{k=1}^K{p(\textbf{w}_k|\textbf{z}_k,\boldsymbol\beta)}=\prod_{k=1}^K{\frac{\Delta(\textbf{n}_k+\boldsymbol\beta)}{\Delta(\boldsymbol\beta)}}.
\end{equation}</div>
Combining (\ref{Eqn:Doc2Topic}) and (\ref{Eqn:Topic2Word}), we have
<div class="math">\begin{equation} \label{Eqn:Joint_Distribution}
p(\textbf{w},\textbf{z}|\boldsymbol\alpha,\boldsymbol\beta)=p(\textbf{w}|\textbf{z},\boldsymbol\beta)p(\textbf{z}|\boldsymbol\alpha)=\prod_{k=1}^K{\frac{\Delta(\textbf{n}_k+\boldsymbol\beta)}{\Delta(\boldsymbol\beta)}}\prod_{m=1}^M{\frac{\Delta(\textbf{n}_m+\boldsymbol\alpha)}{\Delta(\boldsymbol\alpha)}}.
\end{equation}</div>
</li>
</ol>
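<p>The normalization factor <span class="math">\(\Delta(\boldsymbol\alpha)\)</span> above is the multivariate beta function, <span class="math">\(\Delta(\boldsymbol\alpha)=\frac{\prod_k{\Gamma(\alpha_k)}}{\Gamma(\sum_k{\alpha_k})}\)</span>, so quantities such as <span class="math">\(p(\textbf{z}_m|\boldsymbol\alpha)\)</span> are convenient to evaluate in log space. A minimal Python sketch with hypothetical counts and a hypothetical symmetric prior:</p>

```python
from math import lgamma

def log_delta(a):
    """log Delta(a) = sum_k log Gamma(a_k) - log Gamma(sum_k a_k), the log
    normalization factor of the Dirichlet distribution Dir(p | a)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))

# Hypothetical example: K = 3 topics, topic counts n_m for one document,
# and a symmetric prior alpha.
alpha = [0.5, 0.5, 0.5]
n_m = [4, 1, 0]

# log p(z_m | alpha) = log Delta(n_m + alpha) - log Delta(alpha)
log_p = log_delta([n + a for n, a in zip(n_m, alpha)]) - log_delta(alpha)
print(log_p)
```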
<h3>Joint Distribution Emulation by Gibbs Sampling<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></h3>
<p>So far we know that the documents can be characterized by the joint distribution of topics and words in (\ref{Eqn:Joint_Distribution}). The words are given, but the associated topics are not. We now consider how to properly associate a topic with each word in each document, such that the result best fits the joint distribution in (\ref{Eqn:Joint_Distribution}). This is a typical problem that can be solved by <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampling</a>.</p>
<blockquote>
<p>Gibbs sampling, a special case of Markov chain Monte Carlo (MCMC) sampling, is a method to emulate a high-dimensional probability distribution <span class="math">\(p(\textbf{x})\)</span> via the stationary behaviour of a Markov chain. A typical Gibbs sampler works as follows: (i) initialize <span class="math">\(\textbf{x}\)</span>; (ii) repeat until convergence: for all <span class="math">\(i\)</span>, sample <span class="math">\(x_i\)</span> from <span class="math">\(p(x_i|\textbf{x}_{\neg i})\)</span>, where <span class="math">\(\neg i\)</span> indicates excluding the <span class="math">\(i\)</span>th dimension. By the stationary behaviour of the Markov chain, a sufficiently large collection of samples taken after convergence approximates the desired distribution <span class="math">\(p(\textbf{x})\)</span> well.</p>
</blockquote>
<p>In light of the observations above, to apply Gibbs sampling, it is essential to calculate <span class="math">\(p(x_i|\textbf{x}_{\neg i})\)</span>. In our case, it is <span class="math">\(p(z_i=k|\textbf{z}_{\neg i},\textbf{w})\)</span>, where <span class="math">\(i=(m,n)\)</span> is a two dimensional coordinate indicating the <span class="math">\(n\)</span>th word of the <span class="math">\(m\)</span>th document. Since <span class="math">\(z_i=k,w_i=t\)</span> involves the <span class="math">\(m\)</span>th document and the <span class="math">\(k\)</span>th topic only, <span class="math">\(p(z_i=k|\textbf{z}_{\neg i},\textbf{w})\)</span> eventually depends only on two probabilities: (i) the probability of document <span class="math">\(m\)</span> emitting topic <span class="math">\(k\)</span>, <span class="math">\(\hat{\vartheta}_{mk}\)</span>; (ii) the probability of topic <span class="math">\(k\)</span> emitting word <span class="math">\(t\)</span>, <span class="math">\(\hat{\varphi}_{kt}\)</span>. Formally,
</p>
<div class="math">\begin{equation} \label{Eqn:Gibbs_Sampling}
p(z_i=k|\textbf{z}_{\neg i},\textbf{w}) \propto \hat{\vartheta}_{mk}\hat{\varphi}_{kt}=\frac{n_{m,\neg i}^{(k)}+\alpha_k}{\sum_{k=1}^K{(n_{m,\neg i}^{(k)}+\alpha_k)}}\frac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{t=1}^V{(n_{k,\neg i}^{(t)}+\beta_t)}},
\end{equation}</div>
<p>
where <span class="math">\(n_m^{(k)}\)</span> is the count of topic <span class="math">\(k\)</span> in document <span class="math">\(m\)</span>, <span class="math">\(n_k^{(t)}\)</span> is the count of word <span class="math">\(t\)</span> for topic <span class="math">\(k\)</span>, and <span class="math">\(\neg i\)</span> indicates that <span class="math">\(w_i\)</span> should not be counted. Besides, <span class="math">\(\alpha_k\)</span> and <span class="math">\(\beta_t\)</span> are the prior knowledge (pseudo counts) for topic <span class="math">\(k\)</span> and word <span class="math">\(t\)</span>, respectively. The underlying physical meaning is that (\ref{Eqn:Gibbs_Sampling}) actually characterizes a word-generating path <span class="math">\(p(topic~k|doc~m)p(word~t|topic~k)\)</span>.</p>
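<p>Concretely, given the count tables, evaluating this full conditional takes only a few lines of NumPy. The sketch below uses small hypothetical counts (with the current word's assignment already excluded) and hypothetical symmetric priors:</p>

```python
import numpy as np

# Hypothetical sizes: K = 3 topics, V = 5 vocabulary words.
K, V = 3, 5
n_mk = np.array([4.0, 1.0, 0.0])        # topic counts in document m
n_kt = np.array([[2.0, 0, 1, 0, 3],     # word counts per topic (K x V)
                 [0.0, 4, 0, 1, 0],
                 [1.0, 1, 2, 2, 0]])
alpha = np.full(K, 0.5)                 # symmetric Dirichlet priors
beta = np.full(V, 0.1)

t = 2  # the current word w_i is vocabulary item t
# Full conditional: p(z_i = k | ...) proportional to theta_hat * phi_hat.
theta_hat = (n_mk + alpha) / (n_mk + alpha).sum()
phi_hat = (n_kt[:, t] + beta[t]) / (n_kt + beta).sum(axis=1)
p = theta_hat * phi_hat
p /= p.sum()                            # normalize into a distribution
new_topic = np.random.default_rng(3).choice(K, p=p)
print(p.round(3), new_topic)
```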
<h3>LDA: Training and Inference<sup id="fnref:2"><a class="footnote-ref" href="#fn:2" rel="footnote">2</a></sup></h3>
<p>With the LDA model built, we want to: (i) estimate the model parameters, <span class="math">\(\boldsymbol\vartheta_m\)</span> and <span class="math">\(\boldsymbol\varphi_k\)</span>, from training documents; (ii) find out the topic distribution, <span class="math">\(\boldsymbol\vartheta_{new}\)</span>, for each new document.</p>
<p>The training procedure is:</p>
<ol>
<li>Initialization: assign a topic to each word in each document randomly;</li>
<li>For each word in each document, update its topic by the Gibbs sampling equation (\ref{Eqn:Gibbs_Sampling});</li>
<li>Repeat 2 until the Gibbs sampling converges;</li>
<li>Calculate the topic-to-word model parameters by <span class="math">\(\hat{\varphi}_{kt}=\frac{n_k^{(t)}+\beta_t}{\sum_{t=1}^V{(n_k^{(t)}+\beta_t)}}\)</span>, and save them as the model parameters.</li>
</ol>
<p>Once the LDA model is trained, we are ready to analyze the topic distribution of any new document. The inference works in the following procedure:</p>
<ol>
<li>Initialization: assign a topic to each word in the new document randomly;</li>
<li>For each word in the new document, update its topic by the Gibbs sampling equation (\ref{Eqn:Gibbs_Sampling}) (Note that the <span class="math">\(\hat{\varphi}_{kt}\)</span> part is directly available from the trained model, and only <span class="math">\(\hat{\vartheta}_{mk}\)</span> needs to be calculated regarding the new document);</li>
<li>Repeat 2 until the Gibbs sampling converges;</li>
<li>Calculate the topic distribution by <span class="math">\(\hat{\vartheta}_{new,k}=\frac{n_{new}^{(k)}+\alpha_k}{\sum_{k=1}^K{(n_{new}^{(k)}+\alpha_k)}}\)</span>.</li>
</ol>
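<p>Putting the training procedure together, a toy collapsed Gibbs sampler can be sketched as follows. The corpus, hyperparameters, and number of sweeps are all hypothetical, and a real implementation would use a convergence check rather than a fixed sweep count:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# A tiny toy corpus: documents as lists of word ids (hypothetical data).
docs = [[0, 1, 2, 0, 0], [3, 4, 3, 3], [0, 2, 2, 1], [4, 3, 4]]
K, V = 2, 5
alpha, beta = 0.5, 0.1                  # symmetric Dirichlet hyperparameters

# Count tables and random initialization of topic assignments (step 1).
n_mk = np.zeros((len(docs), K))         # topic counts per document
n_kt = np.zeros((K, V))                 # word counts per topic
n_k = np.zeros(K)                       # total words per topic
z = []
for m, doc in enumerate(docs):
    zm = rng.integers(K, size=len(doc))
    z.append(zm)
    for t, k in zip(doc, zm):
        n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

# Steps 2-3: sweep over all words, resampling each topic from the full
# conditional; 200 fixed sweeps stand in for "until convergence".
for sweep in range(200):
    for m, doc in enumerate(docs):
        for n, t in enumerate(doc):
            k = z[m][n]                 # remove the current assignment
            n_mk[m, k] -= 1; n_kt[k, t] -= 1; n_k[k] -= 1
            p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + V * beta)
            p /= p.sum()
            k = rng.choice(K, p=p)      # sample a new topic
            z[m][n] = k
            n_mk[m, k] += 1; n_kt[k, t] += 1; n_k[k] += 1

# Step 4: point estimates of the topic-to-word parameters phi_hat.
phi_hat = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)
print(phi_hat.round(2))
```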
<p>There are multiple open-source LDA implementations available online. To learn how LDA could be implemented, a Python implementation can be found <a href="https://github.com/nrolland/pyLDA/blob/master/src/pyLDA.py">here</a>.</p>
<h3>LDA vs. Probabilistic Latent Semantic Analysis (PLSA)</h3>
<p>PLSA is a maximum likelihood (ML) model, while LDA is a maximum a posteriori (MAP) model (Bayesian estimation). With that said, LDA would reduce to PLSA if a uniform Dirichlet prior were used. LDA is actually more complex than PLSA, so what could be the key advantage of LDA? The answer is the PRIOR! LDA will defeat PLSA when there is a good prior for the data and the data by itself is not sufficient to train the model well.</p>
<h3>References</h3>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>G. Heinrich, <em><a href="http://www.arbylon.net/publications/text-est.pdf">Parameter estimation for text analysis</a></em>, 2005. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
<li id="fn:2">
<p>Z. Jin, <a href="http://cos.name/2013/03/lda-math-lda-text-modeling/"><em>LDA Topic Modeling (in Chinese)</em></a>, accessed on Mar 26, 2016. <a class="footnote-backref" href="#fnref:2" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
</li>
</ol>
</div>
Hidden Markov Model and Part of Speech Tagging2016-03-19T00:00:00-04:00Tianlong Songtag:stlong0521.github.io,2016-03-19:20160319 - HMM and POS.html<p>In a Markov model, we generally assume that the states are directly observable or one state corresponds to one observation/event only. However, this is not always true. A good example would be: in speech recognition, we are supposed to identify a sequence of words given a sequence of utterances, in which case the states (words) are not directly observable and one single state (word) could have different observations (utterances). This is a perfect example that could be treated as a hidden Markov model (HMM), by which the hidden states can be inferred from the observations.</p>
<h3>Elements of a Hidden Markov Model (HMM)<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>A hidden Markov model, <span class="math">\(\Phi\)</span>, typically includes the following elements:</p>
<ul>
<li>Time: <span class="math">\(t=\{1,2,...,T\}\)</span>;</li>
<li><span class="math">\(N\)</span> States: <span class="math">\(Q=\{1,2,...,N\}\)</span>;</li>
<li><span class="math">\(M\)</span> Observations: <span class="math">\(O=\{1,2,...,M\}\)</span>;</li>
<li>Initial Probabilities: <span class="math">\(\pi_i=p(q_1=i),~1 \leq i \leq N\)</span>;</li>
<li>Transition Probabilities: <span class="math">\(a_{ij}=p(q_{t+1}=j|q_t=i),~1 \leq i,j \leq N\)</span>;</li>
<li>Observation Probabilities: <span class="math">\(b_j(k)=p(o_t=k|q_t=j),~1 \leq j \leq N, 1 \leq k \leq M\)</span>.</li>
</ul>
<p>The entire model can be characterized by <span class="math">\(\Phi=(A,B,\pi)\)</span>, where <span class="math">\(A=\{a_{ij}\}\)</span>, <span class="math">\(B=\{b_j(k)\}\)</span> and <span class="math">\(\pi=\{\pi_i\}\)</span>. The states are "hidden", since they are not directly observable, but reflected in observations with uncertainty.</p>
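<p>As a concrete toy instance (the numbers below are made up purely for illustration), a two-state, two-observation model <span class="math">\(\Phi=(A,B,\pi)\)</span> can be written as arrays:</p>

```python
import numpy as np

# A two-state, two-observation toy HMM (made-up numbers, for illustration only).
pi = np.array([0.6, 0.4])                 # initial probabilities pi_i
A = np.array([[0.7, 0.3],                 # a_ij = p(q_{t+1}=j | q_t=i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],                 # b_j(k) = p(o_t=k | q_t=j)
              [0.1, 0.9]])

# Each row is a probability distribution and must sum to one.
assert np.allclose(pi.sum(), 1)
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
```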
<h3>Three Basic Problems for HMMs<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>There are three basic problems that are very important to real-world applications of HMMs:</p>
<h4>Problem 1: Evaluation Problem</h4>
<blockquote>
<p>Given the observation sequence <span class="math">\(O=o_1o_2...o_T\)</span> and a model <span class="math">\(\Phi=(A,B,\pi)\)</span>, how to efficiently compute the probability of the observation sequence given the model, i.e., <span class="math">\(p(O|\Phi)\)</span>?</p>
</blockquote>
<p>Let
</p>
<div class="math">\begin{equation}
\alpha_t(i)=p(o_1o_2...o_t,q_t=i|\Phi)
\end{equation}</div>
<p>
denote the probability that the state is <span class="math">\(i\)</span> at time <span class="math">\(t\)</span> and we have a sequence of observations <span class="math">\(o_1o_2...o_t\)</span>. The evaluation problem can be solved by the forward algorithm as illustrated below:</p>
<ol>
<li>Base case:
<div class="math">\begin{equation}
\alpha_1(i)=p(o_1,q_1=i|\Phi)=p(o_1|q_1=i,\Phi)p(q_1=i|\Phi)=\pi_ib_i(o_1),~1 \leq i \leq N;
\end{equation}</div>
</li>
<li>Induction:
<div class="math">\begin{equation}
\alpha_{t+1}(j)=\left[\sum_{i=1}^N{\alpha_{t}(i)a_{ij}}\right]b_j(o_{t+1}),~1 \leq j \leq N;
\end{equation}</div>
</li>
<li>Termination:
<div class="math">\begin{equation}
p(O|\Phi)=\sum_{i=1}^N{\alpha_T(i)},
\end{equation}</div>
<div class="math">\begin{equation}
p(q_T=i|O,\Phi)=\frac{\alpha_T(i)}{\sum_{j=1}^N{\alpha_T(j)}}.
\end{equation}</div>
</li>
</ol>
<p>The algorithm above essentially applies dynamic programming, and its complexity is <span class="math">\(O(N^2T)\)</span>.</p>
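<p>The forward recursion above can be sketched in Python as follows (a minimal illustration; the array layout and names are our own choices, not taken from any linked demo code):</p>

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns p(O | model) and the full alpha table.

    pi:  (N,) initial state probabilities
    A:   (N, N) transition probabilities, A[i, j] = p(q_{t+1}=j | q_t=i)
    B:   (N, M) observation probabilities, B[j, k] = p(o_t=k | q_t=j)
    obs: list of observation indices o_1 ... o_T
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # base case
    for t in range(1, T):                         # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum(), alpha                 # termination
```

<p>The returned table also yields the posterior of the final state, <span class="math">\(\alpha_T(i)/\sum_j{\alpha_T(j)}\)</span>, as in the termination step. The two nested loops over states are vectorized into the matrix product, so the cost is still <span class="math">\(O(N^2T)\)</span>.</p>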
<h4>Problem 2: Decoding Problem</h4>
<blockquote>
<p>Given the observation sequence <span class="math">\(O=o_1o_2...o_T\)</span> and a model <span class="math">\(\Phi=(A,B,\pi)\)</span>, how to choose the "best" state sequence <span class="math">\(Q=q_1q_2...q_T\)</span> (the most probable path) in terms of how well it explains the observations?</p>
</blockquote>
<p>Define
</p>
<div class="math">\begin{equation}
v_t(i)=\max_{q_1q_2...q_{t-1}}{p(q_1q_2...q_{t-1},q_t=i,o_1o_2...o_t|\Phi)}
\end{equation}</div>
<p>
as the probability along the best state sequence by which the state arrives at <span class="math">\(i\)</span> at time <span class="math">\(t\)</span> with a sequence of observations <span class="math">\(o_1o_2...o_t\)</span>. The decoding problem can be solved by the Viterbi algorithm as illustrated below:</p>
<ol>
<li>Base case:
<div class="math">\begin{equation}
v_1(i)=p(q_1=i,o_1|\Phi)=p(o_1|q_1=i,\Phi)p(q_1=i|\Phi)=\pi_ib_i(o_1),~1 \leq i \leq N;
\end{equation}</div>
</li>
<li>Induction:
<div class="math">\begin{equation}
v_{t+1}(j)=\left[\max_i{v_{t}(i)a_{ij}}\right]b_j(o_{t+1}),~1 \leq j \leq N,
\end{equation}</div>
in which the optimal <span class="math">\(i\)</span> from the maximization should be stored properly for backtracking;</li>
<li>Termination: The best state sequence can be determined by first finding the optimal final state
<div class="math">\begin{equation}
q_T^*={\arg \, \max}_i{v_T(i)},
\end{equation}</div>
and then backtracking all the way to the initial state.</li>
</ol>
<p>The algorithm above also applies dynamic programming, and its complexity is <span class="math">\(O(N^2T)\)</span> as well.</p>
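<p>A minimal Python sketch of the Viterbi recursion with backpointers (again, the array layout is our own choice):</p>

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: most probable state path for an observation sequence.

    Shapes follow the notation in the text: pi (N,), A (N, N), B (N, M).
    """
    T, N = len(obs), len(pi)
    v = np.zeros((T, N))          # v[t, j]: best path score ending in state j at t
    back = np.zeros((T, N), int)  # backpointers for path recovery
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A            # scores[i, j] = v_t(i) * a_ij
        back[t] = scores.argmax(axis=0)           # store optimal i for backtracking
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    # backtrack from the optimal final state
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```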
<h4>Problem 3: Model Learning</h4>
<blockquote>
<p>Given the observation sequence <span class="math">\(O=o_1o_2...o_T\)</span>, how to find the model <span class="math">\(\Phi=(A,B,\pi)\)</span> that maximizes <span class="math">\(p(O|\Phi)\)</span>?</p>
</blockquote>
<p>A general maximum likelihood (ML) learning approach could determine the optimal <span class="math">\(\Phi\)</span> as
</p>
<div class="math">\begin{equation}
\hat{\Phi}=\max_{\Phi}{p(O|\Phi)}.
\end{equation}</div>
<p>
It is much easier to perform supervised learning, where the true state is tagged for each observation. Given <span class="math">\(V\)</span> training sequences in total, the model parameters can be estimated as
</p>
<div class="math">\begin{equation} \label{Eqn:Supervised_Learning}
\hat{a}_{ij}=\frac{Count(q:i \rightarrow j)}{Count(q:i)},~~\hat{b}_j(k)=\frac{Count(q:j,o:k)}{Count(q:j)},~~\hat{\pi}_i=\frac{Count(q_1=i)}{V}.
\end{equation}</div>
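<p>These count-based estimates are straightforward to compute; a minimal sketch, assuming each training sequence is given as a list of (state, observation) pairs (a representation of our own choosing):</p>

```python
from collections import Counter

def supervised_hmm(sequences):
    """Count-based MLE of (A, B, pi) from state-tagged sequences.

    Each sequence is a list of (state, observation) pairs; the estimates
    are returned as dicts keyed by state/observation pairs.
    """
    trans, emit, occ, first = Counter(), Counter(), Counter(), Counter()
    for seq in sequences:
        first[seq[0][0]] += 1                      # Count(q_1 = i)
        for s, o in seq:
            occ[s] += 1                            # Count(q: j)
            emit[s, o] += 1                        # Count(q: j, o: k)
        for (s1, _), (s2, _) in zip(seq, seq[1:]):
            trans[s1, s2] += 1                     # Count(q: i -> j)
    src = Counter()                                # transitions leaving each state
    for (s1, _), n in trans.items():
        src[s1] += n
    A = {(i, j): n / src[i] for (i, j), n in trans.items()}
    B = {(j, k): n / occ[j] for (j, k), n in emit.items()}
    pi = {i: n / len(sequences) for i, n in first.items()}
    return A, B, pi
```

<p>Note that the transition denominator counts state <span class="math">\(i\)</span> only where a transition actually leaves it, so each row of <span class="math">\(\hat{A}\)</span> sums to one.</p>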
<p>It becomes a little trickier for unsupervised learning, where the true states are not tagged. To facilitate the model learning, we first introduce the following quantity:
</p>
<div class="math">\begin{equation} \label{Eqn:Episilon}
\varepsilon_t(i,j)=p(q_t=i,q_{t+1}=j|O,\Phi)=\frac{p(q_t=i,q_{t+1}=j,O|\Phi)}{p(O|\Phi)}=\frac{\alpha_t(i)a_{ij}b_j(o_{t+1})\beta_{t+1}(j)}{\sum_{k=1}^N{\alpha_T(k)}},
\end{equation}</div>
<p>
where <span class="math">\(p(O|\Phi)\)</span> is exactly the quantity computed in Problem 1. <span class="math">\(\beta_{t+1}(j)\)</span> can be calculated using the backward algorithm, which mirrors the forward algorithm used to calculate <span class="math">\(\alpha_t(i)\)</span> in Problem 1, except that the recursion runs in the opposite direction. Following (\ref{Eqn:Episilon}), we further introduce
</p>
<div class="math">\begin{equation} \label{Eqn:Gamma}
\gamma_t(i)=p(q_t=i|O,\Phi)=\sum_{j=1}^N{\varepsilon_t(i,j)}.
\end{equation}</div>
<p>
Then the model parameters can be recomputed as
</p>
<div class="math">\begin{equation}
\begin{split}
\hat{a}_{ij}&=\frac{Expected~number~of~transitions~from~state~i~to~j}{Expected~number~of~transitions~from~state~i}\\
&=\frac{\sum_{t=1}^{T-1}{\varepsilon_t(i,j)}}{\sum_{t=1}^{T-1}{\gamma_t(i)}},
\end{split}
\end{equation}</div>
<div class="math">\begin{equation}
\begin{split}
\hat{b}_j(k)&=\frac{Expected~number~of~times~in~state~j~and~observing~k}{Expected~number~of~times~in~state~j}\\
&=\frac{\sum_{t=1,~s.t.~o_t=k}^{T}{\gamma_t(j)}}{\sum_{t=1}^{T}{\gamma_t(j)}},
\end{split}
\end{equation}</div>
<div class="math">\begin{equation}
\begin{split}
\hat{\pi}_i&=Expected~number~of~times~in~state~i~at~time~t=1\\
&=\gamma_1(i).
\end{split}
\end{equation}</div>
<p>Now we are ready to apply the <a href="20160312 - EM and GMM.html">expectation maximization (EM) algorithm</a> for HMM learning. More specifically:</p>
<ol>
<li>Initialize the HMM, <span class="math">\(\Phi\)</span>;</li>
<li>Repeat the two steps below until convergence:<ul>
<li>E Step: Given observations <span class="math">\(o_1o_2...o_T\)</span> and the model <span class="math">\(\Phi\)</span>, compute <span class="math">\(\varepsilon_t(i,j)\)</span> by (\ref{Eqn:Episilon}) and <span class="math">\(\gamma_t(i)\)</span> by (\ref{Eqn:Gamma});</li>
<li>M Step: Update the model <span class="math">\(\Phi\)</span> by recomputing parameters using the three equations right above.</li>
</ul>
</li>
</ol>
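<p>For a single observation sequence, one iteration of this EM (Baum-Welch) procedure can be sketched as follows. This is a minimal illustration in our own array layout; a practical implementation would scale the tables or work in log space to avoid numerical underflow on long sequences.</p>

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) iteration on a single observation sequence.

    Returns updated (pi, A, B); shapes as in the text: pi (N,), A (N, N), B (N, M).
    """
    T, N = len(obs), len(pi)
    # E step: forward (alpha) and backward (beta) tables
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()                       # p(O | model)
    # eps[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / p(O)
    eps = np.array([
        alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
        for t in range(T - 1)
    ]) / p_obs
    gamma = eps.sum(axis=2)                       # gamma_t(i), t = 1..T-1
    # M step: recompute the parameters
    new_A = eps.sum(axis=0) / gamma.sum(axis=0)[:, None]
    gamma_full = alpha * beta / p_obs             # gamma_t(i) for all t, incl. T
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        mask = (np.array(obs) == k)
        new_B[:, k] = gamma_full[mask].sum(axis=0) / gamma_full.sum(axis=0)
    return gamma_full[0], new_A, new_B            # new pi is gamma_1(i)
```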
<h3>Part of Speech (POS) Tagging</h3>
<p>In natural language processing, part of speech (POS) tagging is the task of associating a lexical tag with each word in a sentence. As an example, consider Janet (NNP) will (MD) back (VB) the (DT) bill (NN), in which each POS tag describes what its corresponding word is. In this particular example, "VB" tells us that "back" is a verb, and "NN" tells us that "bill" is a noun, etc.</p>
<p>POS tagging is very useful, because it is usually the first step of many practical tasks, e.g., speech synthesis, grammatical parsing and information extraction. For instance, if we want to pronounce the word "record" correctly, we need to first learn from context if it is a noun or verb and then determine where the stress is in its pronunciation. A similar argument applies to grammatical parsing and information extraction as well.</p>
<p>We need to do some preprocessing before performing POS tagging using HMM. First, because the vocabulary size could be very large while most of the words are not frequently used, we replace each low-frequency word with a special word "UNKA". This is very helpful to reduce the vocabulary size, and thus reduce the memory cost on storing the probability matrix. Second, for each sentence, we add two tags to represent sentence boundaries, e.g., "START" and "END".</p>
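<p>This preprocessing can be sketched as follows (a minimal illustration; the absolute count threshold below is our own simplification, whereas the text implies a frequency cutoff relative to corpus size):</p>

```python
from collections import Counter

def preprocess(sentences, min_count=2):
    """Replace low-frequency words with 'UNKA' and add sentence boundary tokens.

    sentences: list of word lists; min_count is a hypothetical absolute
    threshold below which a word is considered low-frequency.
    """
    freq = Counter(w for s in sentences for w in s)
    return [
        ['START'] + [w if freq[w] >= min_count else 'UNKA' for w in s] + ['END']
        for s in sentences
    ]
```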
<p>Now we are ready to apply HMM to perform POS tagging. The model can be characterized by:</p>
<ul>
<li>Time: length of each sentence;</li>
<li><span class="math">\(N\)</span> States: POS tags, e.g., 45 POS tags from Penn Treebank;</li>
<li><span class="math">\(M\)</span> Observations: vocabulary (compressed by replacing low-frequency words with "UNKA");</li>
<li>Initial Probabilities: probability of each tag associated to the first word;</li>
<li>Transition Probabilities: <span class="math">\(p(t_{i+1}|t_i)\)</span>, where <span class="math">\(t_i\)</span> represents the tag for the <span class="math">\(i\)</span>th word;</li>
<li>Observation Probabilities: <span class="math">\(p(w|t)\)</span>, where <span class="math">\(t\)</span> stands for a tag and <span class="math">\(w\)</span> stands for a word.</li>
</ul>
<p>Once we finish training the model, e.g., under supervised learning by (\ref{Eqn:Supervised_Learning}), we will then be able to tag new sentences applying the Viterbi algorithm as previously illustrated in Problem 2 for HMM. To see details about implementing POS tagging using HMM, <a href="https://github.com/stlong0521/hmm-pos">click here</a> for demo codes.</p>
<h3>References</h3>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>L. R. Rabiner, <em>A tutorial on hidden Markov models and selected applications in speech recognition</em>, in Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb 1989. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<h2>Expectation Maximization Algorithm and Gaussian Mixture Model</h2>
<p>2016-03-12, by Tianlong Song</p>
<p>In statistical modeling, it is possible that some observations are simply missing. For example, when flipping two biased coins with unknown biases, we may only have a sequence of observations of heads and tails, but no record of which coin each observation comes from. In this case, the conventional maximum likelihood (ML) or maximum a posteriori (MAP) algorithm would no longer work, and it is time for the expectation maximization (EM) algorithm to come into play.</p>
<h3>A Motivating Example</h3>
<p>Although the two-biased-coin example above is a valid example, another example will be discussed here, as it is more relevant to practical needs. Let us assume that we have a collection of real numbers, which come from two different Gaussian distributions. Unfortunately, we do not know which distribution each number comes from. We are now supposed to learn the two Gaussian distributions (i.e., their means and variances) from the given data. This is the well-known Gaussian mixture model (GMM). What makes things difficult is that we have missing observations, namely the membership of each number in the two distributions. Though conventional ML or MAP would not work here, this is a perfect problem for EM to handle.</p>
<h3>Expectation Maximization (EM) Algorithm<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>Let us consider a statistical model with a vector of unknown parameters <span class="math">\(\boldsymbol\theta\)</span>, which generates a set of observed data <span class="math">\(\textbf{X}\)</span> and a set of missing observations <span class="math">\(\textbf{Z}\)</span>. The likelihood function, <span class="math">\(p(\textbf{X},\textbf{Z}|\boldsymbol\theta)\)</span>, characterizes the probability that <span class="math">\(\textbf{X}\)</span> and <span class="math">\(\textbf{Z}\)</span> appear given the model with parameters <span class="math">\(\boldsymbol\theta\)</span>. An intuitive idea to estimate <span class="math">\(\boldsymbol\theta\)</span> would be trying to perform the maximum likelihood estimation (MLE) considering all possible <span class="math">\(\textbf{Z}\)</span>, i.e.,
</p>
<div class="math">\begin{equation}
\max_{\boldsymbol\theta}{\ln{p(\textbf{X}|\boldsymbol\theta)}}=\max_{\boldsymbol\theta}{\ln{\sum_{\textbf{Z}}{p(\textbf{X},\textbf{Z}|\boldsymbol\theta)}}}=\max_{\boldsymbol\theta}{\ln{\sum_{\textbf{Z}}{p(\textbf{X}|\textbf{Z},\boldsymbol\theta)p(\textbf{Z}|\boldsymbol\theta)}}}.
\end{equation}</div>
<p>
Unfortunately, the problem above is not directly tractable, since we do not have any prior knowledge on the missing observations <span class="math">\(\textbf{Z}\)</span>.</p>
<p>The EM algorithm aims to solve the problem above by starting with a guess on <span class="math">\(\boldsymbol\theta=\boldsymbol\theta_{0}\)</span> and then iteratively applying the two steps as indicated below:</p>
<ul>
<li><em>Expectation Step (E Step):</em> Compute the expected complete-data log likelihood with respect to the conditional distribution of <span class="math">\(\textbf{Z}\)</span> given <span class="math">\(\textbf{X}\)</span> under the current estimate <span class="math">\(\boldsymbol\theta_{t}\)</span>:
<div class="math">\begin{equation}
\mathcal{L}(\boldsymbol\theta|\boldsymbol\theta_{t})=\sum_{\textbf{Z}}{p(\textbf{Z}|\textbf{X},\boldsymbol\theta_{t})\ln{p(\textbf{X},\textbf{Z}|\boldsymbol\theta)}};
\end{equation}</div>
</li>
<li><em>Maximization Step (M Step):</em> Find the parameter vector that maximizes the log likelihood above and then update it as
<div class="math">\begin{equation}
\boldsymbol\theta_{t+1}={\arg \, \max}_{\boldsymbol\theta}{\mathcal{L}(\boldsymbol\theta|\boldsymbol\theta_{t})}.
\end{equation}</div>
</li>
</ul>
<p>There are two things that should be noted here:</p>
<ul>
<li>There are two categories of EM: <em>hard</em> EM and <em>soft</em> EM. The algorithm illustrated above is soft EM, because the log likelihood in the E step is weighted over all possible <span class="math">\(\textbf{Z}\)</span> according to their probabilities. In hard EM, instead of using a weighted average, we simply select the most probable <span class="math">\(\textbf{Z}\)</span> and then move forward. The <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means algorithm</a> is a good example of a hard EM algorithm.</li>
<li>The EM algorithm typically converges to a local optimum, and <em>cannot</em> guarantee a global optimum. That said, the solution may differ across initializations, so it is often helpful to try more than one initialization when applying EM in practice.</li>
</ul>
<h3>Gaussian Mixture Model (GMM)</h3>
<p>In the motivating example, a GMM with two Gaussian distributions was introduced. Here we are going to extend it to a general case with <span class="math">\(K\)</span> Gaussian distributions, and the data points will be generalized to be multidimensional. At the same time, we will discuss how it can be used for clustering.</p>
<p>Consider a data set containing <span class="math">\(N\)</span> data points, <span class="math">\(\mathcal{D}=\{\textbf{x}_1,\textbf{x}_2,...,\textbf{x}_N\}\)</span>, in which each data point is an <span class="math">\(M\)</span>-dimensional column vector and comes from one of <span class="math">\(K\)</span> Gaussian distributions. We introduce <span class="math">\(\mathcal{Z}=\{z_1,z_2,...,z_N\}\)</span> with <span class="math">\(z_i\in\{1,2,...,K\}\)</span> as latent (hidden) variables to represent the cluster membership of the data points in <span class="math">\(\mathcal{D}\)</span>. The <span class="math">\(K\)</span> Gaussian distributions are characterized by <span class="math">\(\mathcal{N}(\boldsymbol\mu_j,\boldsymbol\Sigma_j)\)</span> for <span class="math">\(j=1,2,...,K\)</span>, and the <span class="math">\(j\)</span>th distribution carries a weight of <span class="math">\(\pi_j\)</span> in the overall mixture. Let us first map this GMM to the EM algorithm component by component:</p>
<ul>
<li><span class="math">\(\mathcal{D}\)</span> is the observed data;</li>
<li><span class="math">\(\mathcal{Z}\)</span> is the missing observations;</li>
<li><span class="math">\(\boldsymbol\mu_j\)</span>, <span class="math">\(\boldsymbol\Sigma_j\)</span> and <span class="math">\(\pi_j\)</span> are the unknown model parameters.</li>
</ul>
<p>Following the EM algorithm, we start with a guess on the unknown parameters, and then iteratively apply the E step and the M step until convergence. In the E step, we calculate the log likelihood based on the given model parameters by
</p>
<div class="math">\begin{align}
\begin{aligned}
LL&=\ln{p(\textbf{x}_1,\textbf{x}_2,...,\textbf{x}_N)}\\
&=\ln{\prod_{i=1}^{N}{p(\textbf{x}_i)}}\\
&=\ln{\prod_{i=1}^{N}{\sum_{j=1}^{K}{p(z_i=j)p(\textbf{x}_i|z_i=j)}}}\\
&=\sum_{i=1}^{N}{\ln\left(\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}\right)}.
\end{aligned}
\end{align}</div>
<p>In the M step, we maximize the log likelihood by solving the optimization problem below:
</p>
<div class="math">\begin{align}
\begin{aligned}
\max_{\boldsymbol\mu_j,\boldsymbol\Sigma_j,\pi_j}~~&{LL}\\
&s.t.~\sum_{j=1}^{K}{\pi_j}=1.
\end{aligned}
\end{align}</div>
<p>
We can apply Lagrange multiplier to solve the problem above. Let
</p>
<div class="math">\begin{equation}
L=\sum_{i=1}^{N}{\ln\left(\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}\right)}-\lambda\left(\sum_{j=1}^{K}{\pi_j}-1\right),
\end{equation}</div>
<p>
where <span class="math">\(\lambda\)</span> is the Lagrange multiplier. Taking partial derivatives and setting them to zero, we can obtain the optimal parameters as below:
</p>
<div class="math">\begin{equation} \label{Eqn:Maximization_1}
\boldsymbol\mu_j=\frac{\sum_{i=1}^{N}{\gamma_{ij}}\textbf{x}_i}{\sum_{i=1}^{N}{\gamma_{ij}}},
\end{equation}</div>
<div class="math">\begin{equation} \label{Eqn:Maximization_2}
\boldsymbol\Sigma_j=\frac{\sum_{i=1}^{N}{\gamma_{ij}}(\textbf{x}_i-\boldsymbol\mu_j)(\textbf{x}_i-\boldsymbol\mu_j)^T}{\sum_{i=1}^{N}{\gamma_{ij}}},
\end{equation}</div>
<div class="math">\begin{equation} \label{Eqn:Maximization_3}
\pi_j=\frac{1}{N}{\sum_{i=1}^{N}{\gamma_{ij}}},
\end{equation}</div>
<p>
where <span class="math">\(\gamma_{ij}=p(z_i=j|\textbf{x}_i)\)</span> is the cluster membership, which can be calculated using Bayes theorem,
</p>
<div class="math">\begin{equation}
\begin{split}
\gamma_{ij}&=p(z_i=j|\textbf{x}_i)\\
&=\frac{p(z_i=j)p(\textbf{x}_i|z_i=j)}{\sum_{j=1}^{K}{p(z_i=j)p(\textbf{x}_i|z_i=j)}}\\
&=\frac{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}{\sum_{j=1}^{K}{\pi_j\mathcal{N}(\textbf{x}_i|\boldsymbol\mu_j,\boldsymbol\Sigma_j)}}.
\end{split}
\end{equation}</div>
<p>To summarize, the GMM model can be learned using EM algorithm as in the following steps:</p>
<ol>
<li>Initialize <span class="math">\(\boldsymbol\mu_j\)</span>, <span class="math">\(\boldsymbol\Sigma_j\)</span> and <span class="math">\(\pi_j\)</span> for <span class="math">\(j=1,2,...,K\)</span>;</li>
<li>Repeat the following two steps until the log likelihood converges:<ul>
<li>E Step: Estimate the cluster memberships <span class="math">\(\gamma_{ij}\)</span> by the equation right above for all data points <span class="math">\(\textbf{x}_i\)</span> and clusters <span class="math">\(j\)</span>;</li>
<li>M Step: Maximize the log likelihood and update the model parameters by (\ref{Eqn:Maximization_1})-(\ref{Eqn:Maximization_3}) based on cluster membership <span class="math">\(\gamma_{ij}\)</span>.</li>
</ul>
</li>
</ol>
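<p>Under these update equations, a compact NumPy sketch of the full EM loop might look as follows. The optional explicit mean initialization, the random default, and the small regularization term added to each covariance are our own choices, not part of the derivation above.</p>

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)) / norm

def gmm_em(X, K, n_iter=100, mu0=None, seed=0):
    """Fit a K-component GMM with the E and M steps derived above."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = np.array(mu0, float) if mu0 is not None \
        else X[rng.choice(N, K, replace=False)]    # init means from the data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibilities gamma_ij via Bayes theorem
        dens = np.stack([pi[j] * gaussian_pdf(X, mu[j], Sigma[j])
                         for j in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form updates for mu_j, Sigma_j and pi_j
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] \
                + 1e-6 * np.eye(d)
        pi = Nk / N
    return pi, mu, Sigma
```

<p>On well-separated clusters the responsibilities quickly become nearly hard assignments, and the fitted means approach the per-cluster sample means.</p>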
<h3>References</h3>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>Wikipedia, <a href="https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm"><em>Expectation–maximization algorithm</em></a>, accessed on Mar 12, 2016. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<h2>Locating and Filling Missing Words in Sentences</h2>
<p>2016-03-05, by Tianlong Song</p>
<p>There have been many occasions on which we have incomplete sentences that need to be completed. One example is speech recognition, where a noisy environment can lead to unrecognizable words, but we still hope to recover and understand the complete sentence (e.g., by inference); another example is the sentence completion questions that appear in language tests (e.g., SAT, GRE, etc.).</p>
<h3>What Exactly Is the Problem?</h3>
<p>Generally, the problem we aim to solve is locating and filling any missing words in incomplete sentences. This general problem is too ambitious for now, however, so we restrict ourselves to a simplified version: we assume that there is exactly one missing word per sentence, and that the missing word is neither the first nor the last word of the sentence. This problem originally comes from <a href="https://www.kaggle.com/c/billion-word-imputation">here</a>.</p>
<h3>Locating the Missing Word</h3>
<p>Two approaches are presented here for locating the missing word.</p>
<h4>N-gram Model</h4>
<p>For a given training data set, define <span class="math">\(C(w_1,w_2)\)</span> as the number of occurrences of the bigram pattern <span class="math">\((w_1,w_2)\)</span>, and <span class="math">\(C(w_1,w,w_2)\)</span> the number of occurrences of the trigram pattern <span class="math">\((w_1,w,w_2)\)</span>. Then, the number of occurrences of the pattern, where there is one and only one word between <span class="math">\(w_1\)</span> and <span class="math">\(w_2\)</span>, can be calculated by
</p>
<div class="math">\begin{equation}
D(w_1,w_2)=\sum_{w\in{V}}C(w_1,w,w_2),
\end{equation}</div>
<p>
where <span class="math">\(V\)</span> is the vocabulary.</p>
<p>Consider a particular location, <span class="math">\(l\)</span>, of an incomplete sentence of length <span class="math">\(L\)</span>, and let <span class="math">\(w_l\)</span> be the <span class="math">\(l\)</span>th word in the sentence. <span class="math">\(D(w_{l-1},w_{l})\)</span> would be the number of positive votes from the training data set for a missing word at this location, while <span class="math">\(C(w_{l-1},w_{l})\)</span> would correspondingly be the number of negative votes. We define the score indicating there is a missing word at location <span class="math">\(l\)</span> as
</p>
<div class="math">\begin{equation} \label{Eqn:Score}
S_l=\frac{D(w_{l-1},w_{l})^{1+\gamma}}{C(w_{l-1},w_{l})+D(w_{l-1},w_{l})}-\frac{C(w_{l-1},w_{l})^{1+\gamma}}{C(w_{l-1},w_{l})+D(w_{l-1},w_{l})},
\end{equation}</div>
<p>
where <span class="math">\(\gamma\)</span> is a small positive constant. Hence, the missing word location can be identified by
</p>
<div class="math">\begin{equation}
\hat{l}={\arg \, \max}_{1 \leq l \leq L-1} S_l.
\end{equation}</div>
<p>Note that in (\ref{Eqn:Score}), if we set <span class="math">\(\gamma=0\)</span>, the left part would be exactly the fraction of positive votes for a missing word at that location, and the right part the fraction of negative votes. This already seems a fairly reasonable score, so <em>why do we still need a positive <span class="math">\(\gamma\)</span></em>? The underlying intuition is that the more votes support a particular decision, the more confident we are in that decision. This is reflected in a positive <span class="math">\(\gamma\)</span>, which acts as a <em>sparse vote penalty</em> and helps break ties in the missing word location voting. That is, if two candidate locations have exactly the same ratio of positive to negative votes, e.g., 80 positive vs. 20 negative votes for location A, and 8 positive vs. 2 negative votes for location B, we would believe that location A is more likely to be the missing word location than location B.</p>
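<p>A minimal sketch of this scoring scheme (the dictionary-of-counts representation is our own assumption):</p>

```python
def missing_word_scores(sentence, C, D, gamma=0.1):
    """Score every interior gap of a sentence for a missing word.

    C[(w1, w2)]: count of w1 immediately followed by w2 (bigram).
    D[(w1, w2)]: count of w1 and w2 separated by exactly one word.
    Returns S_l for each gap between consecutive words (0-indexed:
    gap l sits between words l-1 and l); larger scores indicate a
    more likely missing word.  gamma is the sparse vote penalty.
    """
    scores = []
    for l in range(1, len(sentence)):
        pair = (sentence[l - 1], sentence[l])
        c, d = C.get(pair, 0), D.get(pair, 0)
        total = c + d
        if total == 0:
            scores.append(0.0)        # no evidence either way
            continue
        scores.append((d ** (1 + gamma) - c ** (1 + gamma)) / total)
    return scores
```

<p>With the 80-vs-20 and 8-vs-2 vote counts from the example above, the denser location does score higher, exactly as intended by the sparse vote penalty.</p>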
<h4>Word Distance Statistics (WDS)</h4>
<p>In view of the fact that the statistics of the two words immediately adjacent to a given location contribute a lot in deciding whether the location has a word missing, we tentatively guess that all the words within a window centered at that location would more or less contribute some information as well. As a result, we introduce the concept of word distance statistics (WDS).</p>
<p>More specifically, we use <span class="math">\(\widetilde{C}(w_1,w_2,m)\)</span> to denote the number of occurrences of the pattern where there are exactly <span class="math">\(m\)</span> words between <span class="math">\(w_1\)</span> and <span class="math">\(w_2\)</span>, i.e., the word distance between <span class="math">\(w_1\)</span> and <span class="math">\(w_2\)</span> is <span class="math">\(m\)</span>. For a given location <span class="math">\(l\)</span> in an incomplete sentence and a word window size <span class="math">\(W\)</span>, we are interested in the word distance statistics of each word pair, in which one word <span class="math">\(w_i\)</span> is on the left of the location <span class="math">\(l\)</span>, and the other word <span class="math">\(w_j\)</span> is on the right, as illustrated in Fig. 1.</p>
<figure align="center">
<img src="/figures/20160305/WDS.png" alt="WDS">
<figcaption align="center">Fig. 1: Word distance illustration.</figcaption>
</figure>
<p>Formally, for any <span class="math">\(l-W/2 \leq i \leq l-1\)</span> and <span class="math">\(l \leq j \leq l+W/2-1\)</span>, <span class="math">\(\widetilde{C}(w_i,w_j,j-i)\)</span> would be the number of positive votes for missing word at this location, while <span class="math">\(\widetilde{C}(w_i,w_j,j-i-1)\)</span> is the number of negative votes. Applying the idea in (\ref{Eqn:Score}), for each word pair <span class="math">\((w_i,w_j)\)</span>, we extract its feature as the score indicating there is a missing word at location <span class="math">\(l\)</span>, i.e.,
</p>
<div class="math">\begin{equation} \label{Eqn:ScoreGeneralized}
S_l(i,j)=\frac{\widetilde{C}(w_i,w_j,j-i)^{1+\gamma}}{\widetilde{C}(w_i,w_j,j-i)+\widetilde{C}(w_i,w_j,j-i-1)}-\frac{\widetilde{C}(w_i,w_j,j-i-1)^{1+\gamma}}{\widetilde{C}(w_i,w_j,j-i)+\widetilde{C}(w_i,w_j,j-i-1)}.
\end{equation}</div>
<p>
As a special example, let <span class="math">\(i=l-1\)</span> and <span class="math">\(j=l\)</span>, (\ref{Eqn:ScoreGeneralized}) would be reduced to (\ref{Eqn:Score}).</p>
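<p>As a minimal sketch, the per-pair score above can be computed directly from the two raw counts. The function name <code>pair_score</code> and the default value of the penalty coefficient are illustrative only; the counts are assumed to be available as plain integers:</p>

```python
def pair_score(count_m, count_m_minus_1, gamma=0.01):
    """Score of one word pair: positive votes minus negative votes,
    each normalized by the total and raised to (1 + gamma) to
    penalize sparse votes. Names and defaults are illustrative."""
    total = count_m + count_m_minus_1
    if total == 0:
        return 0.0  # no evidence from this word pair
    pos = count_m ** (1 + gamma) / total       # votes for a missing word
    neg = count_m_minus_1 ** (1 + gamma) / total  # votes against
    return pos - neg
```

<p>A pair with many "gap" observations and few "no-gap" observations yields a large positive score, and vice versa.</p>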
<p>To find the missing word location, we need to assign different weights to the extracted features, <span class="math">\(S_l(i,j)\)</span>. Then, the missing word location can be determined by
</p>
<div class="math">\begin{equation} \label{Eqn:LocationDetermination}
\hat{l}={\arg \, \max}_{1 \leq l \leq L-1} \sum_{l-\frac{W}{2} \leq i \leq l-1}\sum_{l \leq j \leq l+\frac{W}{2}-1}v(i,j)S_l(i,j),
\end{equation}</div>
<p>
where the weight, <span class="math">\(v(i,j)\)</span>, should be monotonically decreasing with respect to <span class="math">\(|j-i|\)</span>.</p>
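<p>Putting the two equations together, the location search can be sketched as follows. This is an illustrative implementation, not the project's actual code: <code>wds_count</code> is a hypothetical lookup returning <span class="math">\(\widetilde{C}(w_1,w_2,m)\)</span>, and the weight <span class="math">\(v(i,j)=1/(j-i)\)</span> is one assumed choice of a monotonically decreasing weight:</p>

```python
def locate_missing_word(words, wds_count, W=4, gamma=0.01):
    """Return the location l maximizing the weighted sum of pair scores.
    `wds_count(w1, w2, m)` is assumed to return the corpus count of
    w1 and w2 occurring exactly m words apart."""
    best_l, best_score = None, float("-inf")
    for l in range(1, len(words)):  # candidate gap between words[l-1] and words[l]
        score = 0.0
        for i in range(max(0, l - W // 2), l):               # words left of the gap
            for j in range(l, min(len(words), l + W // 2)):  # words right of the gap
                d = j - i                     # their distance if a word is missing
                c_pos = wds_count(words[i], words[j], d)      # votes for a missing word
                c_neg = wds_count(words[i], words[j], d - 1)  # votes against
                total = c_pos + c_neg
                if total == 0:
                    continue                  # no evidence from this pair
                s = (c_pos ** (1 + gamma) - c_neg ** (1 + gamma)) / total
                score += s / d                # weight v(i,j) = 1/(j-i), assumed
        if score > best_score:
            best_l, best_score = l, score
    return best_l
```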
<h3>Filling the Missing Word</h3>
<p>To find the most probable word in the given missing word location, we take into account five conditional probabilities, as shown in Table 1, to explore the statistical connection between the candidate words and the surrounding words at the missing word location. Ultimately, the most probable missing word can be determined by
</p>
<div class="math">\begin{equation}
\hat{w}={\arg \, \max}_{w\in{B}} \sum_{1 \leq i \leq 5} v_iP_i,
\end{equation}</div>
<p>
where <span class="math">\(B\)</span> is the candidate word space (detailed <a href="https://github.com/stlong0521/missing-word/blob/master/Project%20Report.pdf">here</a>), and the weight <span class="math">\(v_i\)</span> is used to reflect the importance of each conditional probability in contributing to the final score.</p>
<figure align="center">
<img src="/figures/20160305/CondProb.png" alt="CondProb">
<figcaption align="center">Table 1: Conditional probabilities considered in missing word filling, in which "*" denotes an arbitrary word.</figcaption>
</figure>
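<p>The weighted combination of the five conditional probabilities can be sketched in a few lines; here <code>cond_probs</code> is a hypothetical callable returning the list <span class="math">\([P_1,...,P_5]\)</span> for a candidate word, and the uniform weights in the usage below are an assumption, not the values used in the project:</p>

```python
def fill_missing_word(candidates, cond_probs, weights):
    """Return the candidate maximizing the weighted sum of the five
    conditional probabilities in Table 1. `cond_probs(w)` is assumed
    to return [P_1, ..., P_5] for candidate word w."""
    return max(candidates,
               key=lambda w: sum(v * p for v, p in zip(weights, cond_probs(w))))
```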
<h3>Experimental Results</h3>
<p>The training data contains <span class="math">\(30,301,028\)</span> complete sentences, with an average sentence length of approximately <span class="math">\(25\)</span>. In the vocabulary with a size of <span class="math">\(2,425,337\)</span>, <span class="math">\(14,216\)</span> words that have occurred in at least <span class="math">\(0.1\%\)</span> of total sentences are labeled as high-frequency words, and the remaining <span class="math">\(58,417,315\)</span> words are labeled as 'UNKA'. For cross validation, the training data is split into two parts, TRAIN and DEV. The TRAIN set is used to train our models, and the DEV set is used to test them.</p>
<h4>Missing Word Location</h4>
<p>Table 2 shows the estimation accuracy of the missing word locations for the two proposed approaches, N-gram and WDS. For comparison, we list the corresponding probabilities by chance as well. Each entry shows the probability that the correct location is included in the ranked candidate location list returned by each approach, where the list size varies from <span class="math">\(1\)</span> to <span class="math">\(10\)</span>. The sparse vote penalty coefficient, <span class="math">\(\gamma\)</span>, is set to 0.01. In the WDS approach, we consider a word window size <span class="math">\(W=4\)</span>, i.e., four pairs of words are taken into account.</p>
<table class="table table-striped table-bordered table-hover">
<thead>
<tr>
<th align="center"></th>
<th align="center">Top 1</th>
<th align="center">Top 2</th>
<th align="center">Top 3</th>
<th align="center">Top 5</th>
<th align="center">Top 10</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Chance</td>
<td align="center">4%</td>
<td align="center">8%</td>
<td align="center">12%</td>
<td align="center">20%</td>
<td align="center">40%</td>
</tr>
<tr>
<td align="center">N-gram</td>
<td align="center">51.47%</td>
<td align="center">63.70%</td>
<td align="center">71.00%</td>
<td align="center">80.26%</td>
<td align="center">91.54%</td>
</tr>
<tr>
<td align="center">WDS</td>
<td align="center">52.06%</td>
<td align="center">64.50%</td>
<td align="center">71.76%</td>
<td align="center">80.91%</td>
<td align="center">91.93%</td>
</tr>
</tbody>
</table>
<figcaption align="center">Table 2: Accuracy of missing word location.</figcaption>
<h4>Missing Word Filling</h4>
<p>Table 3 shows the accuracy of filling the missing word given the location. Each entry shows the probability that the correct word is included in the ranked candidate word list returned by the proposed approach.</p>
<table class="table table-striped table-bordered table-hover">
<thead>
<tr>
<th align="center"></th>
<th align="center">Top 1</th>
<th align="center">Top 2</th>
<th align="center">Top 3</th>
<th align="center">Top 5</th>
<th align="center">Top 10</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Accuracy</td>
<td align="center">32.15%</td>
<td align="center">41.49%</td>
<td align="center">46.23%</td>
<td align="center">52.02%</td>
<td align="center">59.15%</td>
</tr>
</tbody>
</table>
<figcaption align="center">Table 3: Accuracy of missing word filling.</figcaption>
<h3>Acknowledgement</h3>
<p>I did this project with my partner, <a href="https://zhwa.github.io/">Zhe Wang</a>. To see the code and/or the report, <a href="https://github.com/stlong0521/missing-word">click here</a>.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
var location_protocol = (false) ? 'https' : document.location.protocol;
if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' }, Macros: {} }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Binary and Multiclass Logistic Regression Classifiers2016-02-28T00:00:00-05:00Tianlong Songtag:stlong0521.github.io,2016-02-28:20160228 - Logistic Regression.html<p>A generative classification model, such as Naive Bayes, tries to learn the underlying probability distributions and then predicts by applying Bayes' rule to calculate the posterior, <span class="math">\(p(y|\textbf{x})\)</span>. Discriminative classifiers, in contrast, model the posterior directly. As one of the most popular discriminative classifiers, logistic regression directly models the linear decision boundary.</p>
<h3>Binary Logistic Regression Classifier<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>Let us start with the binary case. For an M-dimensional feature vector <span class="math">\(\textbf{x}=[x_1,x_2,...,x_M]^T\)</span>, the posterior probability of class <span class="math">\(y\in\{\pm{1}\}\)</span> given <span class="math">\(\textbf{x}\)</span> is assumed to satisfy
</p>
<div class="math">\begin{equation}
\ln{\frac{p(y=1|\textbf{x})}{p(y=-1|\textbf{x})}}=\textbf{w}^T\textbf{x},
\end{equation}</div>
<p>
where <span class="math">\(\textbf{w}=[w_1,w_2,...,w_M]^T\)</span> is the weighting vector to be learned. Given the constraint that <span class="math">\(p(y=1|\textbf{x})+p(y=-1|\textbf{x})=1\)</span>, it follows that
</p>
<div class="math">\begin{equation} \label{Eqn:Prob_Binary}
p(y|\textbf{x})=\frac{1}{1+\exp(-y\textbf{w}^T\textbf{x})}=\sigma(y\textbf{w}^T\textbf{x}),
\end{equation}</div>
<p>
in which we can observe the logistic sigmoid function <span class="math">\(\sigma(a)=\frac{1}{1+\exp(-a)}\)</span>.</p>
<p>Based on the assumptions above, the weighting vector, <span class="math">\(\textbf{w}\)</span>, can be learned by maximum likelihood estimation (MLE). More specifically, given training data set <span class="math">\(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\)</span>,
</p>
<div class="math">\begin{align}
\begin{aligned}
\textbf{w}^*&=\max_{\textbf{w}}{\mathcal{L}(\textbf{w})}\\
&=\max_{\textbf{w}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\
&=\max_{\textbf{w}}{\sum_{i=1}^N{\ln{\frac{1}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}}}\\
&=\min_{\textbf{w}}{\sum_{i=1}^N{\ln{(1+\exp(-y_i\textbf{w}^T\textbf{x}_i))}}}.
\end{aligned}
\end{align}</div>
<p>
We have a convex objective function here, so we can find the optimal solution by applying gradient descent. The gradient can be derived as
</p>
<div class="math">\begin{align}
\begin{aligned}
\nabla{\mathcal{L}(\textbf{w})}&=\sum_{i=1}^N{\frac{-y_i\textbf{x}_i\exp(-y_i\textbf{w}^T\textbf{x}_i)}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}\\
&=-\sum_{i=1}^N{y_i\textbf{x}_i(1-p(y_i|\textbf{x}_i))}.
\end{aligned}
\end{align}</div>
<p>
Then, we can learn the optimal <span class="math">\(\textbf{w}\)</span> by starting with an initial <span class="math">\(\textbf{w}_0\)</span> and iterating as follows:
</p>
<div class="math">\begin{equation} \label{Eqn:Iteration_Binary}
\textbf{w}_{t+1}=\textbf{w}_{t}-\eta_t\nabla{\mathcal{L}(\textbf{w})},
\end{equation}</div>
<p>
where <span class="math">\(\eta_t\)</span> is the learning step size. It can be constant over time, but a time-varying step size can reduce the convergence time, e.g., setting <span class="math">\(\eta_t\propto{1/\sqrt{t}}\)</span> so that the step size decreases as <span class="math">\(t\)</span> increases.</p>
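<p>The binary training loop above can be sketched as follows. The function name, hyperparameter defaults, and the <span class="math">\(1/\sqrt{t}\)</span> schedule are illustrative choices, not taken from a reference implementation:</p>

```python
import numpy as np

def train_binary_lr(X, y, eta0=0.1, n_iters=500):
    """Minimize the negative log-likelihood by gradient descent.
    X: (N, M) feature matrix; y: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))  # p(y_i | x_i) = sigma(y_i w^T x_i)
        grad = -(y * (1.0 - p)) @ X             # gradient of the minimized objective
        w -= (eta0 / np.sqrt(t)) * grad         # decreasing step size, eta_t ~ 1/sqrt(t)
    return w
```

<p>Note the "<span class="math">\(-\)</span>" in the update: the MLE has been converted to a minimization, so we step against the gradient.</p>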
<h3>Multiclass Logistic Regression Classifier<sup id="fnref:1"><a class="footnote-ref" href="#fn:1" rel="footnote">1</a></sup></h3>
<p>When it is generalized to multiclass case, the logistic regression model needs to adapt accordingly. Now we have <span class="math">\(K\)</span> possible classes, that is, <span class="math">\(y\in\{1,2,..,K\}\)</span>. It is assumed that the posterior probability of class <span class="math">\(y=k\)</span> given <span class="math">\(\textbf{x}\)</span> follows
</p>
<div class="math">\begin{equation}
\ln{p(y=k|\textbf{x})}\propto\textbf{w}_k^T\textbf{x},
\end{equation}</div>
<p>
where <span class="math">\(\textbf{w}_k\)</span> is a column weighting vector corresponding to class <span class="math">\(k\)</span>. Considering all classes <span class="math">\(k=1,2,...,K\)</span>, we would have a weighting matrix that includes all <span class="math">\(K\)</span> weighting vectors. That is, <span class="math">\(\textbf{W}=[\textbf{w}_1,\textbf{w}_2,...,\textbf{w}_K]\)</span>.
Under the constraint
</p>
<div class="math">\begin{equation}
\sum_{k=1}^K{p(y=k|\textbf{x})}=1,
\end{equation}</div>
<p>
it then follows that
</p>
<div class="math">\begin{equation} \label{Eqn:Prob_Multiple}
p(y=k|\textbf{x})=\frac{\exp(\textbf{w}_k^T\textbf{x})}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x})}}.
\end{equation}</div>
<p>The weighting matrix, <span class="math">\(\textbf{W}\)</span>, can similarly be learned by maximum likelihood estimation (MLE). More specifically, given training data set <span class="math">\(\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}\)</span>,
</p>
<div class="math">\begin{align}
\begin{aligned}
\textbf{W}^*&=\max_{\textbf{W}}{\mathcal{L}(\textbf{W})}\\
&=\max_{\textbf{W}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\
&=\max_{\textbf{W}}{\sum_{i=1}^N{\ln{\frac{\exp(\textbf{w}_{y_i}^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}}}}.
\end{aligned}
\end{align}</div>
<p>
The gradient of the objective function with respect to each <span class="math">\(\textbf{w}_k\)</span> can be calculated as
</p>
<div class="math">\begin{align}
\begin{aligned}
\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}&=\sum_{i=1}^N{\textbf{x}_i\left(I(y_i=k)-\frac{\exp(\textbf{w}_k^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}\right)}\\
&=\sum_{i=1}^N{\textbf{x}_i(I(y_i=k)-p(y_i=k|\textbf{x}_i))},
\end{aligned}
\end{align}</div>
<p>
where <span class="math">\(I(\cdot)\)</span> is a binary indicator function. Applying gradient descent, the optimal solution can be obtained by iterating as follows:
</p>
<div class="math">\begin{equation}\label{Eqn:Iteration_Multiple}
\textbf{w}_{k,t+1}=\textbf{w}_{k,t}+\eta_{t}\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}.
\end{equation}</div>
<p>
Note that we have "<span class="math">\(+\)</span>" in (\ref{Eqn:Iteration_Multiple}) instead of "<span class="math">\(-\)</span>" in (\ref{Eqn:Iteration_Binary}), because the maximum likelihood estimation in the binary case is eventually converted to a minimization problem, while here we keep performing maximization.</p>
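<p>The multiclass update can be sketched as below. This is an illustrative implementation: labels are taken as <span class="math">\(0,...,K-1\)</span> (the post uses <span class="math">\(1,...,K\)</span>), and the step-size schedule is an assumed choice. Subtracting the row maximum before exponentiating is a standard numerical-stability trick and leaves the softmax unchanged:</p>

```python
import numpy as np

def train_multiclass_lr(X, y, K, eta0=0.1, n_iters=500):
    """Maximize the log-likelihood by gradient ascent.
    X: (N, M) features; y: (N,) labels in {0, ..., K-1}."""
    N, M = X.shape
    W = np.zeros((M, K))                 # one weighting column per class
    Y = np.eye(K)[y]                     # one-hot indicators I(y_i = k)
    for t in range(1, n_iters + 1):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)            # softmax posteriors
        grad = X.T @ (Y - P)                         # gradients for all k at once
        W += (eta0 / np.sqrt(t)) * grad              # note "+": gradient ascent
    return W
```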
<h3>How to Perform Predictions?</h3>
<p>Once the optimal weights are learned from the logistic regression model, for any new feature vector <span class="math">\(\textbf{x}\)</span>, we can easily calculate the probability that it is associated with each class label <span class="math">\(k\)</span> by (\ref{Eqn:Prob_Binary}) in the binary case or (\ref{Eqn:Prob_Multiple}) in the multiclass case. With the probabilities for each class label available, we can then perform:</p>
<ul>
<li>a hard decision by identifying the class label with the highest probability, or</li>
<li>a soft decision by showing the top <span class="math">\(k\)</span> most probable class labels with their corresponding probabilities.</li>
</ul>
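<p>Both decision types follow directly from the class posteriors. A small sketch for the multiclass case, assuming a learned <span class="math">\(M \times K\)</span> weighting matrix <code>W</code> (the function and parameter names are illustrative):</p>

```python
import numpy as np

def predict(X, W, top_k=3):
    """Hard and soft decisions from learned multiclass weights W (M x K)."""
    logits = X @ W
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)          # class posteriors per row
    hard = P.argmax(axis=1)                    # most probable class label
    soft = np.argsort(-P, axis=1)[:, :top_k]   # top-k labels, most probable first
    return hard, soft, P
```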
<h3>An Example Applying Multiclass Logistic Regression</h3>
<p>To see an example applying multiclass logistic regression classification, <a href="https://github.com/stlong0521/logistic-classification">click here</a> for more information.</p>
<h3>References</h3>
<div class="footnote">
<hr />
<ol>
<li id="fn:1">
<p>C. M. Bishop, <em>Pattern Recognition and Machine Learning</em>. New York: Springer, 2006. <a class="footnote-backref" href="#fnref:1" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
</ol>
</div>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
var location_protocol = (false) ? 'https' : document.location.protocol;
if (location_protocol !== 'http' && location_protocol !== 'https') location_protocol = 'https:';
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = location_protocol + '//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' }, Macros: {} }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>About Tianlong Song2016-02-25T00:00:00-05:00Tianlong Songtag:stlong0521.github.io,2016-02-25:about.html<p>Tianlong Song is currently a software development engineer in the Data Science and Engineering team at Zillow. His interests are primarily focused on software engineering, big data platforms and artificial intelligence.</p>
<p>His blog keeps the records on how he moved forward little by little in these areas, and he would love to share them with anyone who might be interested. He is happy to exchange ideas in any way, so please do not hesitate to reach him via the email below.</p>
<p>Tianlong's Email: stlong0521@gmail.com</p>