# Binary and Multiclass Logistic Regression Classifiers

Sun 28 Feb 2016 by Tianlong Song Tags Machine Learning Data Mining

Generative classification models, such as Naive Bayes, learn the underlying probabilities and then apply Bayes' rule to compute the posterior, $$p(y|\textbf{x})$$. Discriminative classifiers, by contrast, model the posterior directly. As one of the most popular discriminative classifiers, logistic regression directly models a linear decision boundary.

### Binary Logistic Regression Classifier

Let us start with the binary case. For an M-dimensional feature vector $$\textbf{x}=[x_1,x_2,...,x_M]^T$$, the posterior probability of class $$y\in\{\pm{1}\}$$ given $$\textbf{x}$$ is assumed to satisfy

$$\ln{\frac{p(y=1|\textbf{x})}{p(y=-1|\textbf{x})}}=\textbf{w}^T\textbf{x},$$

where $$\textbf{w}=[w_1,w_2,...,w_M]^T$$ is the weighting vector to be learned. Given the constraint that $$p(y=1|\textbf{x})+p(y=-1|\textbf{x})=1$$, it follows that

$$\label{Eqn:Prob_Binary} p(y|\textbf{x})=\frac{1}{1+\exp(-y\textbf{w}^T\textbf{x})}=\sigma(y\textbf{w}^T\textbf{x}),$$

in which we can observe the logistic sigmoid function $$\sigma(a)=\frac{1}{1+\exp(-a)}$$.
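The sigmoid posterior in (\ref{Eqn:Prob_Binary}) is straightforward to evaluate numerically. A minimal sketch (the weight and feature values below are hypothetical, chosen only for illustration):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weight vector w and feature vector x.
w = np.array([0.5, -0.25])
x = np.array([1.0, 2.0])
p_pos = sigmoid(np.dot(w, x))   # p(y = +1 | x) = sigma(w^T x)
p_neg = sigmoid(-np.dot(w, x))  # p(y = -1 | x) = sigma(-w^T x)
# By construction the two posteriors sum to one.
```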

Based on the assumptions above, the weighting vector, $$\textbf{w}$$, can be learned by maximum likelihood estimation (MLE). More specifically, given training data set $$\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}$$,

\begin{align} \begin{aligned} \textbf{w}^*&=\max_{\textbf{w}}{\mathcal{L}(\textbf{w})}\\ &=\max_{\textbf{w}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\ &=\max_{\textbf{w}}{\sum_{i=1}^N{\ln{\frac{1}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}}}\\ &=\min_{\textbf{w}}{\sum_{i=1}^N{\ln{(1+\exp(-y_i\textbf{w}^T\textbf{x}_i))}}}. \end{aligned} \end{align}

This objective function is convex, so the optimal solution can be found by gradient descent. The gradient of the objective being minimized is

\begin{align} \begin{aligned} \nabla{\mathcal{L}(\textbf{w})}&=\sum_{i=1}^N{\frac{-y_i\textbf{x}_i\exp(-y_i\textbf{w}^T\textbf{x}_i)}{1+\exp(-y_i\textbf{w}^T\textbf{x}_i)}}\\ &=-\sum_{i=1}^N{y_i\textbf{x}_i(1-p(y_i|\textbf{x}_i))}. \end{aligned} \end{align}

Then, we can learn the optimal $$\textbf{w}$$ by starting with an initial $$\textbf{w}_0$$ and iterating as follows:

$$\label{Eqn:Iteration_Binary} \textbf{w}_{t+1}=\textbf{w}_{t}-\eta_t\nabla{\mathcal{L}(\textbf{w})},$$

where $$\eta_t$$ is the learning step size. It can be constant over time, but a time-varying step size could potentially reduce the convergence time, e.g., setting $$\eta_t\propto{1/\sqrt{t}}$$ so that the step size decreases as $$t$$ grows.

### Multiclass Logistic Regression Classifier

When generalized to the multiclass case, the logistic regression model needs to adapt accordingly. Now we have $$K$$ possible classes, that is, $$y\in\{1,2,...,K\}$$. It is assumed that the posterior probability of class $$y=k$$ given $$\textbf{x}$$ follows

$$\ln{p(y=k|\textbf{x})}\propto\textbf{w}_k^T\textbf{x},$$

where $$\textbf{w}_k$$ is a column weighting vector corresponding to class $$k$$. Considering all classes $$k=1,2,...,K$$, we would have a weighting matrix that includes all $$K$$ weighting vectors. That is, $$\textbf{W}=[\textbf{w}_1,\textbf{w}_2,...,\textbf{w}_K]$$. Under the constraint

$$\sum_{k=1}^K{p(y=k|\textbf{x})}=1,$$

it then follows that

$$\label{Eqn:Prob_Multiple} p(y=k|\textbf{x})=\frac{\exp(\textbf{w}_k^T\textbf{x})}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x})}}.$$
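This posterior is the softmax function. A numerically stable way to evaluate it (a sketch; the function name is hypothetical, and the max-subtraction trick is a standard safeguard not mentioned above):

```python
import numpy as np

def softmax_posterior(W, x):
    """p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x).
    W: (M, K) matrix with one column w_k per class; x: (M,) feature vector."""
    scores = W.T @ x
    scores -= scores.max()   # subtract the max score for numerical stability
    e = np.exp(scores)
    return e / e.sum()       # (K,) vector of class posteriors
```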

The weighting matrix, $$\textbf{W}$$, can be similarly learned by maximum likelihood estimation (MLE). More specifically, given training data set $$\mathcal{D}=\{(\textbf{x}_1,y_1),(\textbf{x}_2,y_2),...,(\textbf{x}_N,y_N)\}$$,

\begin{align} \begin{aligned} \textbf{W}^*&=\max_{\textbf{W}}{\mathcal{L}(\textbf{W})}\\ &=\max_{\textbf{W}}{\sum_{i=1}^N\ln{{p(y_i|\textbf{x}_i)}}}\\ &=\max_{\textbf{W}}{\sum_{i=1}^N{\ln{\frac{\exp(\textbf{w}_{y_i}^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}}}}. \end{aligned} \end{align}

The gradient of the objective function with respect to each $$\textbf{w}_k$$ can be calculated as

\begin{align} \begin{aligned} \frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}&=\sum_{i=1}^N{\textbf{x}_i\left(I(y_i=k)-\frac{\exp(\textbf{w}_k^T\textbf{x}_i)}{\sum_{j=1}^K{\exp(\textbf{w}_j^T\textbf{x}_i)}}\right)}\\ &=\sum_{i=1}^N{\textbf{x}_i(I(y_i=k)-p(y_i=k|\textbf{x}_i))}, \end{aligned} \end{align}

where $$I(\cdot)$$ is a binary indicator function. Applying gradient descent, the optimal solution can be obtained by iterating as follows:

$$\label{Eqn:Iteration_Multiple} \textbf{w}_{k,t+1}=\textbf{w}_{k,t}+\eta_{t}\frac{\partial{\mathcal{L}(\textbf{W})}}{\partial{\textbf{w}_k}}.$$

Note that we have "$$+$$" in (\ref{Eqn:Iteration_Multiple}) instead of "$$-$$" in (\ref{Eqn:Iteration_Binary}), because the maximum likelihood estimation in the binary case is eventually converted to a minimization problem, while here we keep performing maximization.
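The multiclass training procedure can be sketched as gradient ascent over all columns of $$\textbf{W}$$ at once. This is an illustrative NumPy implementation assuming integer labels re-indexed to $$\{0,...,K-1\}$$ for array indexing; the function name and hyperparameter defaults are hypothetical:

```python
import numpy as np

def train_multiclass_lr(X, y, K, eta0=0.1, n_iters=500):
    """Gradient ascent on the log-likelihood.
    X: (N, M) feature matrix; y: (N,) integer labels in {0, ..., K-1}."""
    N, M = X.shape
    W = np.zeros((M, K))
    Y = np.eye(K)[y]                             # one-hot: Y[i, k] = I(y_i = k)
    for t in range(1, n_iters + 1):
        scores = X @ W
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        E = np.exp(scores)
        P = E / E.sum(axis=1, keepdims=True)     # P[i, k] = p(y = k | x_i)
        grad = X.T @ (Y - P)                     # column k matches dL/dw_k above
        eta = eta0 / np.sqrt(t)
        W += eta * grad                          # "+": we maximize the likelihood
    return W
```

Note the `+=` update, mirroring the sign discussion above: here the gradient ascends the log-likelihood rather than descending a loss.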

### How to Perform Predictions?

Once the optimal weights are learned from the logistic regression model, for any new feature vector $$\textbf{x}$$, we can easily calculate the probability that it is associated with each class label $$k$$ by (\ref{Eqn:Prob_Binary}) in the binary case or (\ref{Eqn:Prob_Multiple}) in the multiclass case. With the probabilities for each class label available, we can then perform:

• a hard decision by identifying the class label with the highest probability, or
• a soft decision by showing the top $$k$$ most probable class labels with their corresponding probabilities.
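Both decision rules can be wrapped in one small helper. This sketch assumes the multiclass weight matrix $$\textbf{W}$$ with one column per class; the function name and `top_k` parameter are hypothetical:

```python
import numpy as np

def predict(W, x, top_k=None):
    """Hard decision (top_k=None): return the most probable class index.
    Soft decision: return the top_k classes with their probabilities."""
    scores = W.T @ x
    scores -= scores.max()              # for numerical stability
    p = np.exp(scores)
    p /= p.sum()
    if top_k is None:
        return int(np.argmax(p))        # hard decision
    order = np.argsort(p)[::-1][:top_k]
    return [(int(k), float(p[k])) for k in order]  # soft decision
```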