Training Classifiers via Stochastic Gradient Descent
Abstract
This report describes my implementation of two algorithms tailored to large-scale classification: Logistic Regression and Ridge Regression (used as a classifier). Given datasets with a large number of examples, we train the classifiers via stochastic gradient descent (SGD) to make the training phase more efficient.
Introduction to Datasets
There are two datasets, both from LIBSVM. Dataset 1 (training) contains 32561 rows and 124 columns, and Dataset 1 (testing) contains 16281 rows and 124 columns; Dataset 2 (training) contains 290507 rows and 54 columns, and Dataset 2 (testing) contains 290505 rows and 54 columns. Each row represents an example: the last column is the label of that example, and the remaining columns are its features. For both datasets the label set is $\mathcal{Y} = \{-1, +1\}$. After inspecting the data, I found that Dataset 1 contains only 0/1 values, while Dataset 2 contains both floating-point numbers between 0 and 1 and integer 0/1 values, suggesting that Dataset 2 has not been scaled. I therefore applied Min-Max scaling to map every feature to [0, 1]; Dataset 1 is unaffected by this transformation because its features are already 0 or 1.
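For reference, a minimal sketch of the scaling step (variable names are illustrative, not taken from my actual code), assuming the data are already loaded into NumPy arrays:

```python
import numpy as np

def min_max_scale(X_train, X_test):
    """Scale every feature to [0, 1] using statistics of the training split."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return (X_train - col_min) / col_range, (X_test - col_min) / col_range
```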
Logistic Regression via SGD with L1-Regularization
Logistic Regression is one of the most popular classification algorithms in both daily life and industrial applications. In this section I first give a brief profile of Logistic Regression, then present the pseudo-code of my implementation, report the results of running my code on the two datasets, compare them with the results of Scikit-Learn, a Python machine learning library, and finally show some figures that illustrate the whole training process.
What is Logistic Regression?
In linear regression we write the model as $$y = w^Tx + b$$ but can we let the prediction approximate some quantity derived from $y$? The answer is yes; for example, we can use the logarithm of $y$ as the target that the linear model should approximate, which gives log-linear regression.
$$\log{y} = w^Tx + b$$ More generally, consider a differentiable monotone function $ g(\cdot) $ and let $$y = g^{-1}(w^Tx+b)$$ Models of this form are called generalized linear models, and logistic regression can be seen as a special case of the generalized linear model, analogous to linear regression. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features) [@wiki]; its $ g^{-1}(\cdot) $ gives $$y = \frac {1} {1 + e^{-(w^Tx+b)} }$$ This equation uses the prediction of a linear regression model to approximate the log-odds of the true label, so the corresponding model is called "Logistic Regression"; although its name contains "Regression", it is in fact a classifier. Logistic regression has several advantages for classification: it models the data directly, without requiring any hypothesis about the data distribution, which avoids the problems caused by an inaccurate distributional assumption; it predicts not only the category but also an approximation of the category's probability; and, last but not least, its loss function is convex, which has many desirable properties.
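For reference, the logistic function $g^{-1}$ above can be written as a small helper; this is only a sketch, with clipping added to avoid overflow in the exponential:

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), with z clipped to avoid overflow."""
    z = np.clip(z, -30.0, 30.0)
    return 1.0 / (1.0 + np.exp(-z))
```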
We can rewrite Eq.4 as $$\ln{\frac {y} {1-y} } = w^Tx+b$$ If we view $y$ as the posterior probability estimate $ p(y=1|x) $, Eq.5 can be rewritten as $$\ln{\frac {p(y=1|x)} {p(y=-1|x)} } = w^Tx+b$$ from which we can infer the following two equations:
$$p(y=1|x) = \frac{e^{w^Tx+b} } {1+e^{w^Tx+b} } = \frac{1}{1+e^{-(w^Tx+b)} }$$
$$p(y=-1|x) = \frac{1}{1+e^{w^Tx+b} }$$ Thus we can use the maximum likelihood
method to estimate $w$ and $b$. For convenience we write $ \beta = (b;w) $ and $ x = (1;x) $, and we add L1-regularization to the loss function, so the loss can be written as
$$ {Loss = \frac{1}{N} \sum_{i=1}^N {\log(1+e^{-y_i\beta^Tx_i}) } + \lambda||\beta||_1} $$
After adding the L1-regularization term, the final goal is
$$ \min_{\beta} {\frac{1}{N} \sum_{i=1}^N {\log(1+e^{-y_i\beta^Tx_i}) } + \lambda||\beta||_1} $$
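For reference, this objective can be evaluated over the whole training set as follows (a sketch; variable names are mine, $X$ is assumed to contain the constant bias column, and $y \in \{-1,+1\}$):

```python
import numpy as np

def l1_logistic_objective(beta, X, y, lam):
    """Average logistic loss over all N examples plus the L1 penalty."""
    margins = y * (X @ beta)
    data_loss = np.logaddexp(0.0, -margins).mean()  # stable log(1 + exp(-m))
    return data_loss + lam * np.abs(beta).sum()
```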
Stochastic gradient descent is usually used to solve the above optimization problem. It is derived from gradient descent (GD): because GD is inefficient on large-scale datasets, one wants to speed up convergence without losing accuracy, so at each update a single randomly selected example is used to update the parameters; in other words, the batch size (the number of training examples) used to approximate Eq.10 is 1. To use SGD on Eq.10, we first compute the gradient of Eq.10 for a single sample $x_i$:
$$\frac {\partial{loss(\beta, x_i)} }{ {\partial\beta} } = \frac{-y_ix_i} {1+e^{y_i\beta^Tx_i} } + \lambda sign(\beta) $$
so we can update the weights $\beta$ at every iteration, for every data sample, using the following equation:
$$\beta_{k+1} = \beta_k - \gamma \frac {\partial{loss(\beta, x_i)} } {\partial\beta}
= \beta_k + \gamma\frac{y_ix_i} {1+e^{y_i\beta^Tx_i} } - \gamma\lambda sign(\beta) $$
where $sign(x) = 1$ if $x > 0$, $sign(x) = -1$ if $x < 0$, and $sign(x) = 0$ if $x = 0$. This update rule is called "SGD-L1 (Naive)" [@cumulative], and it causes two problems. First, at each update we have to apply the L1 penalty to all features, including features that do not appear in the current training sample. Second, it does not produce a compact model: even when a dimension of $\beta$ is so small that its impact is negligible, the next update still moves it by the regularization lambda times the learning rate (with sign $\pm 1$), regardless of its size, which leaves many non-zero values in $\beta$; what we actually want is a model that achieves the same effect with as few features as possible.
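For concreteness, a minimal sketch of this naive per-sample update (not my actual implementation; variable names are illustrative):

```python
import numpy as np

def sgd_l1_naive_step(beta, x_i, y_i, gamma, lam):
    """One "SGD-L1 (Naive)" step: logistic gradient plus the lambda * sign(beta) term."""
    margin = np.clip(y_i * (x_i @ beta), -30.0, 30.0)  # clip to avoid overflow in exp
    grad = -y_i * x_i / (1.0 + np.exp(margin)) + lam * np.sign(beta)
    return beta - gamma * grad
```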
Occam’s Razor: Entities should not be multiplied unnecessarily.
For this reason, I tried some of the methods in Yoshimasa's paper, including clipping, which clips any weight that crosses zero (the update rule can be found in the paper). In the end I used L1 regularization with a cumulative penalty to apply the L1 penalty to the objective function; the details can be found in the paper [@cumulative].
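A sketch of the cumulative-penalty step as I understand it from [@cumulative]: a global counter $u$ accumulates the total L1 penalty each weight could have received so far ($\gamma\lambda$ per update), $q_j$ records how much penalty weight $j$ has actually received, and after the plain gradient step each touched weight is clipped toward zero by at most the outstanding amount (see the paper for the exact algorithm; names below are mine):

```python
def apply_cumulative_l1(beta, q, u, active):
    """Clip weights toward zero by the outstanding cumulative L1 penalty.

    beta   : weight vector, already updated by the unpenalized gradient step
    q      : per-weight total penalty actually applied so far
    u      : total penalty each weight could have received so far
    active : indices of the features present in the current example
    """
    for j in active:
        z = beta[j]
        if beta[j] > 0:
            beta[j] = max(0.0, beta[j] - (u + q[j]))
        elif beta[j] < 0:
            beta[j] = min(0.0, beta[j] + (u - q[j]))
        q[j] += beta[j] - z  # record the penalty actually applied to weight j
    return beta, q
```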
Implementation
With the above analysis and derivation, we can write the algorithm that solves L1-regularized logistic regression via SGD ($wh$ is a temporary variable in the pseudo-code).
To evaluate the effectiveness of my implementation, I also wrote an implementation using Scikit-Learn, a Python machine learning library; the comparison is shown in Table.2 (a baseline sketch follows the hyperparameter table below). After a simple cross-validation and some tricks [@leonbottou] on the dataset, I set the hyperparameters to the following values:
Hyperparameters | Value |
---|---|
Learning Rate $\gamma$ | 0.001 |
Regularization Lambda $\lambda$ | 0.0001 |
Max Passes over Dataset | 200 (Dataset 1), 30 (Dataset 2) |
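For reference, the Scikit-Learn baseline used in the comparison can be approximated with a sketch like the one below; the exact estimator and settings behind the reported numbers are not recorded here, and in scikit-learn versions before 1.1 the log loss is named "log" rather than "log_loss":

```python
from sklearn.linear_model import SGDClassifier

# Hypothetical baseline: SGD-trained logistic regression with an L1 penalty,
# roughly matching the hyperparameters in the table above.
clf = SGDClassifier(loss="log_loss", penalty="l1", alpha=0.0001,
                    learning_rate="constant", eta0=0.001, max_iter=200)
clf.fit(X_train, y_train)                     # X_train, y_train assumed loaded
test_error = 1.0 - clf.score(X_test, y_test)  # error rate = 1 - accuracy
```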
Results of Experiment
After the implementation was finished, I evaluated the effectiveness and accuracy of my training algorithm on both datasets. Figure.1 and Figure.2 show how the key quantities (training error rate, test error rate, objective function, and sparsity) change during training.
Figure.1 shows the results of logistic regression with L1-regularization on Dataset 1. As the number of training iterations increases, the training error rate and the test error rate decrease together, almost as one line; after roughly 500000 iterations they become stable and smooth, with only small fluctuations, indicating that the loss function has converged. The objective function shows the same trend. The red line shows that the sparsity of the weight vector keeps increasing with the number of training iterations; after more than 6000000 iterations, 31 of the 124 dimensions of the weight vector are zero, about a quarter. In other words, we only need about three quarters of the original features to predict the class labels without losing prediction accuracy, so a form of feature selection has happened as a by-product.
Figure.2 shows the results of the algorithm on Dataset 2, and we see almost the same trends as in Figure.1: the training error rate, test error rate, and objective function decrease as the number of iterations increases, and the sparsity increases. The difference is that the test error rate on Dataset 2 is about 0.45, much higher than the stable training error rate of 0.1879. The following table gives the final results of logistic regression with L1 regularization on the two datasets.
Dataset | Training error rate | Test error rate | Sparsity | Scikit-Learn Test error rate |
---|---|---|---|---|
Dataset 1 | 0.1504 | 0.1501 | 31/124 | 0.1500 |
Dataset 2 | 0.1879 | 0.4475 | 9/55 | 0.4100 |
Comparison of Naive and Cumulative Penalty L1 regularization in LR
As stated above, there are several ways to apply an L1 penalty to the logistic regression loss. One is the so-called "Naive" method [@cumulative], which directly uses $\lambda\,sign(\beta)$ as the sub-gradient term in the weight update; I used the cumulative penalty method instead, and the two are compared below. From Table.3 and Table.4 we can see that the two methods have nearly the same error rates, but the cumulative penalty method yields substantial sparsity while the Naive method yields none. Figures showing the performance of the Naive method are given in the Appendix.
Methods | Training error rate | Test error rate | Sparsity | Scikit-Learn Test error rate |
---|---|---|---|---|
Naive | 0.1507 | 0.1493 | 0/124 | 0.1500 |
Cumulative | 0.1504 | 0.1501 | 31/124 | 0.1500 |
Table. Comparison of Naive and Cumulative method in Dataset 1
Methods | Training error rate | Test error rate | Sparsity | Scikit-Learn Test error rate |
---|---|---|---|---|
Naive | 0.1878 | 0.4467 | 0/55 | 0.4100 |
Cumulative | 0.1879 | 0.4475 | 9/55 | 0.4100 |
Table. Comparison of Naive and Cumulative method in Dataset 2
Ridge Regression via SGD with L2-regularization
Ridge Regression is a variant of linear regression. There are many different methods for fitting a linear model to a set of training data, but by far the most popular is the method of least squares: we pick the coefficients $w$ to minimize the residual sum of squares [@elmml]
$$RSS(w) = \sum_{i=1}^N {(y_i - w^Tx_i)^2}$$
Because $RSS(w)$ is a quadratic function of the parameters, its minimum always exists, and we can use many numerical optimization algorithms, such as SGD, to minimize it. So what is ridge regression? It is actually very simple: ridge regression shrinks the regression coefficients by imposing a penalty on their size, and the ridge coefficients minimize a penalized residual sum of squares
$$\min_{w} {\frac{1}{N} \sum_{i=1}^N{ (y_i-w^Tx_i)^2 } + \lambda||w||_2^2} $$
and that is our objective. To solve this optimization problem with SGD, we only need the gradient of the objective at a single training sample, which is
$$\nabla loss(w, x_i) = -2x_i(y_i-w^Tx_i) + 2\lambda w$$ So, according to Eq.15, the update rule for the weights at each iteration, for one training example, is:
$$w_{k+1} = w_k - \gamma \frac {\partial{loss(w, x_i)} }{ {\partial w} }
= w_k + 2\gamma {x_i(y_i-w^Tx_i)} - 2 \gamma\lambda w$$
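A minimal sketch of this per-sample ridge update (variable names are illustrative, not taken from my actual code):

```python
import numpy as np

def ridge_sgd_step(w, x_i, y_i, gamma, lam):
    """One SGD step on the L2-regularized squared loss for a single example."""
    residual = y_i - x_i @ w
    grad = -2.0 * x_i * residual + 2.0 * lam * w
    return w - gamma * grad
```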
Implementation
Given the update rule for the ridge coefficients, the pseudo-code is easy to write; see Algorithm 2.
To evaluate the effectiveness of my implementation, I again wrote an implementation using Scikit-Learn, a Python machine learning library; the comparison is shown in Table.6 (a baseline sketch follows the hyperparameter table below). After a simple cross-validation and some tricks [@leonbottou] on the dataset, I set the hyperparameters to the following values ($t$ is the iteration count):
Hyperparameters | Values |
---|---|
Learning Rate $\gamma$ | $0.001 / (1.0 + 0.001 \cdot 0.003 \cdot t)$ |
Regularization Lambda $\lambda$ | 0.0001 |
Max Passes over Dataset | 200 (Dataset 1), 30 (Dataset 2) |
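For reference, a plausible Scikit-Learn baseline for this comparison is sketched below; whether the reported numbers came from RidgeClassifier or from SGDClassifier with a squared loss is not recorded here, so treat this as an assumption:

```python
from sklearn.linear_model import RidgeClassifier

# Hypothetical baseline: ridge classifier with the same regularization lambda.
clf = RidgeClassifier(alpha=0.0001)
clf.fit(X_train, y_train)                     # X_train, y_train assumed loaded
test_error = 1.0 - clf.score(X_test, y_test)  # error rate = 1 - accuracy
```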
Results of Experiment
After the implementation was finished, I evaluated the effectiveness and accuracy of my training algorithm on both datasets. Figure.3 and Figure.4 show how the key quantities (training error rate, test error rate, objective function, and sparsity) change during training. Figure.3 shows the results of ridge regression with L2-regularization on Dataset 1: as the number of training iterations increases, the training error rate and the test error rate decrease together, almost as one line; after roughly 350000 iterations they become stable and smooth, with only small fluctuations, indicating that the loss function has converged. The objective function shows the same trend. Because the algorithm applies L2 regularization to the loss function, there is no sparsity, which is a property of L2 regularization. From this we can see that L1 regularization yields a sparser weight vector than L2 regularization, which is the main reason L1 regularization is adopted in so many scenarios, in both industry and academia. Figure.4 shows the results of the algorithm on Dataset 2, and we see almost the same trends as in Figure.3: the training error rate, test error rate, and objective function decrease as the number of iterations increases. The difference is that the test error rate on Dataset 2 is about 0.45, much higher than the stable training error rate of 0.1921. The following table gives the final results of ridge regression with L2 regularization on the two datasets.
Dataset | Training error rate | Test error rate | Scikit-Learn Test error rate |
---|---|---|---|
Dataset 1 | 0.1547 | 0.1548 | 0.1500 |
Dataset 2 | 0.1921 | 0.4449 | 0.4100 |