Monday, June 11, 2012

Predictive Analytics: Evaluate Model Performance

In previous posts, we have discussed various machine learning techniques including linear regression with regularization, SVM, Neural network, Nearest neighbor, Naive Bayes, Decision Tree and Ensemble models.  How do we pick which model is the best ?  or even whether the model we pick is better than a random guess ?  In this posts, we will cover how we evaluate the performance of the model and what can we do next to improve the performance.

Best guess with no model 

First of all, we need to understand the goal of our evaluation.  Are we trying to pick the best model ?  Are we trying to quantify the improvement of each model ?  Regardless of our goal, I found it is always useful to think about what the baseline should be.  Usually the baseline is what is your best guess if you don't have a model.

For classification problem, one approach is to do a random guess (with uniform probability) but a better approach is to guess the output class that has the largest proportion in the training samples.  For regression problem, the best guess will be the mean of output of training samples.

Prepare data for evaluation

In a typical setting, the set of data is divided into 3 disjoint groups; 20% data is set aside as testing data to evaluate the model we've trained.  The remaining 80% of data is dividing into k partitions.  k-1 partitions will be used as training data to train a model with a particular parameter value setting and 1 partition will be used as cross-validation data to pick the best parameter value that minimize the error of the cross-validation data.

As a concrete example, lets say we have 100 records available.  We'll set aside 20% which is 20 records for testing purposes and use the remaining 80 records to train the model.  Lets say the model has some tunable parameters (e.g. k in KNN, λ in linear model regularization).  For each particular parameter value, we'll conduct 10 rounds of training (ie: k = 10).  Within each round, we randomly select 90% which is 72 records to train a model and compute the error of prediction against the 8 unselected records.  Then we take the average error of these 10 rounds and pick the optimal parameter value that gives the minimal average error.  After picking the optimal tuning parameter,  we retrain the model using the whole 80 records.

To evaluate the predictive performance of the model, we'll test it against the 20 testing records we set aside at the beginning.  The details will be described below.

Measuring Regression Performance

For regression problem, measuring the distance between the estimated output from the actual output is used to quantify the model's performance.  Three measures are commonly used:  Root Mean Square Error, Relative Square Error and Coefficient of Determination.  Typically root mean square is used for measuring the absolute quantity of accuracy.

Mean Square Error MSE = (1/N) * ∑(yest – yactual)2
Root Mean Square Error RMSE = (MSE)1/2

To measure the accuracy with respect to the baseline, we use the ratio of MSE

Relative Square Error RSE = MSE / MSEbaseline
RSE = ∑(yest – yactual)2 / ∑(ymean – yactual)2

Coefficient Of Determination (also called R square) measures the variance that is explained by the model, which is the reduction of variance when using the model.  R square ranges from 0 to 1 while the model has strong predictive power when it is close to 1 and is not explaining anything when it is close to 0.

R2 = (MSEbaseline – MSE) / MSEbaseline
R2 = 1 – RSE

Here are some R code to compute these measures

> Prestige_clean <- Prestige[!$type),]
> model <- lm(prestige~., data=Prestige_clean)
> score <- predict(model, newdata=Prestige_clean)
> actual <- Prestige_clean$prestige
> rmse <- (mean((score - actual)^2))^0.5
> rmse
[1] 6.780719
> mu <- mean(actual)
> rse <- mean((score - actual)^2) / mean((mu - actual)^2) 
> rse
[1] 0.1589543
> rsquare <- 1 - rse
> rsquare
[1] 0.8410457

The Mean Square Error penalize the bigger difference more because of the square effect.  On the other hand, if we want to reduce the penalty of bigger difference, we can log transform the numeric quantity first.

Mean Square Log Error MSLE = (1/N) * ∑(log(y)est – log(y)actual)2
Root Mean Square Log Error RMSLE = (MSLE)1/2

Measuring Classification Performance

For classification problem, there are a couple of measures.
  • TP = Predict +ve when Actual +ve
  • TN = Predict -ve when Actual -ve
  • FP = Predict +ve when Actual -ve
  • FN = Predict -ve when Actual +ve
Precision = TP / Predict +ve = TP / (TP + FP)
Recall or Sensitivity = TP / Actual +ve = TP / (TP + FN)
Specificity = TN / Actual -ve = TN / (FP + TN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy alone is not sufficient to represent the quality of prediction because the cost of making a FP may be different from the cost of making a FN.  F measures provides a tunable assigned weight for computing a final score and is commonly used to measure the quality of a classification model.

1/Fmeasure = α/recall + (1-α)/precision

Notice that most classification model is based on estimating a numeric score for each output class.  By choosing the cutoff point of this score, we can control the tradeoff between precision and recall.  We can plot the relationship between precision and recall at various cutoff points as follows ...

> library(ROCR)
> library(e1071)
> nb_model <- naiveBayes(Species~., data=iristrain)
> nb_prediction <- predict(nb_model,
> score <- nb_prediction[, c("virginica")]
> actual_class <- iristest$Species == 'virginica'
> # Mix some noise to the score
> # Make the score less precise for illustration
> score <- (score + runif(length(score))) / 2
> pred <- prediction(score, actual_class)
> perf <- performance(pred, "prec", "rec")
> plot(perf)

Another common plot is the ROC curve which plot the "sensitivity" (true positive rate) against 1 - "specificity" (false positive rate).  The area under curve "auc" is used to compare the quality between different models with varying cutoff points.  Here is how we produce the ROC curve.

> library(ROCR)
> library(e1071)
> nb_model <- naiveBayes(Species~., 
> nb_prediction <- predict(nb_model, 
> score <- nb_prediction[, c("virginica")]
> actual_class <- iristest$Species == 'virginica'
> # Mix some noise to the score
> # Make the score less precise for illustration
> score <- (score + runif(length(score))) / 2
> pred <- prediction(score, actual_class)
> perf <- performance(pred, "tpr", "fpr")
> auc <- performance(pred, "auc")
> auc <- unlist(slot(auc, "y.values"))
> plot(perf)
> legend(0.6,0.3,c(c(paste('AUC is', auc)),"\n"),
         box.col = "white")

We can also assign the relative cost of making a false +ve and false -ve decision to find the best cutoff threshold.  Here is how we plot the cost curve

> # Plot the cost curve to find the best cutoff
> # Assign the cost for False +ve and False -ve
> perf <- performance(pred, "cost", cost.fp=4, cost.fn=1)
> plot(perf)

From the curve, the best cutoff point 0.6 is where the cost is minimal.

Source of error: Bias and Variance

In model-based machine learning, we are making assumption that the underlying data follows some underlying mathematical model and during the training we try to fit the training data into this assumed model and determine the best model parameters which gives the minimal error.

One source of error is when our assumed model is fundamentally wrong (e.g. if the output has a non-linear relationship with the input but we are assuming a linear model).  This is known as the High Bias problem which we use an over-simplified model to represent the underlying data.

Another source of error is when the model parameters fits too specifically to the training data and not generalizing well to the underlying data pattern.  This is known as the High Variance problem and usually happen when there is insufficient training data compare to the number of model parameters.

High bias problem has the symptom that both training and cross-validation shows a high error rate and both error rate drops as the model complexity increases.  While the training error keep decreasing as the model complexity increases, the cross-validation error will increase after certain model complexity and this indicates the beginning of a high variance problem.  When the size of training data is fixed and the only thing we can choose is the model complexity, then we should choose the model complexity at the point where the cross-validation error is minimal.

Will getting more data help ?

When collecting more data cost both time and money, we need to carefully assess the situation before we spend our effort to do so.  There is a very pragmatic technique suggested by Andrew Ng from Stanford by plotting the error against the size of data.  In this approach, we sample different size of training data to train up different models and plot both the cross-validation error and the training error with respect to the training sample size.
If the problem is a high-bias problem, the error curve will look like the following.

In this case, collecting more data would not help.  We should spend our effort to do the following instead.
  • Add more input features by talking to domain experts to identify more input signals and collect them.
  • Add more input features by combining existing input features in a non-linear way.
  • Try more complex, machine learning models. (e.g. increase the number of hidden layers in Neural Network, or increase the number of Neurons at each layer)
If the problem is a high-variance problem because we over-estimate the model complexity, then we should reduce the model complexity by throwing away attributes that has low influence to the output.  We don't have to gather more data.

The only situation where having more data will be helpful is when the underlying data model is in fact complex.  Therefore we cannot just reduce the complexity as this will immediately results in a high-bias problem.  In this case, the error curve will have the following shape.

And the only way is to collect more training data such that overfitting is less likely to happen.

Evaluate the performance of a model is very important in the overall cycle of predictive analytics.  Hopefully this introductory post gives a basic idea on how this can be done.

Wednesday, June 6, 2012

Predictive Analytics: Decision Tree and Ensembles

Continue from my last post of walking down the list of machine learning technique.  In this post, I will covered Decision Tree and Ensemble methods.  We'll continue using the iris data we prepare in this earlier post.

Decision Tree

Decision Tree model is one of the oldest machine learning model and is usually used to illustrate the very basic idea of machine learning.  Based on a tree of decision nodes, the learning approach is to recursively divide the training data into buckets of homogeneous members through the most discriminative dividing criteria.  The measurement of "homogeneity" is based on the output label; when it is a numeric value, the measurement will be the variance of the bucket; when it is a category, the measurement will be the entropy or gini index of the bucket..

During the training, various dividing criteria based on the input will be tried (using in a greedy manner); when the input is a category (Mon, Tue, Wed ...), it will first be turned into binary (isMon, isTue, isWed ...) and then use the true/false as a decision boundary to evaluate the homogeneity; when the input is a numeric or ordinal value, the lessThan, greaterThan at each training data input value will be used as the decision boundary.

The training process stops when there is no significant gain in homogeneity by further split the Tree. The members of the bucket represented at leaf node will vote for the prediction; majority wins when the output is a category and member’s average is taken when the output is a numeric.

Here is an example in R

> install.packages('rpart')
> library(rpart)
> #Train the decision tree
> treemodel <- rpart(Species~., data=iristrain)
> plot(treemodel)
> text(treemodel, use.n=T)
> #Predict using the decision tree
> prediction <- predict(treemodel, newdata=iristest, type='class')
> #Use contingency table to see how accurate it is
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         3
  virginica       0          0         7

Here is the decision tree that we have learned.

The good part of Tree is that it can take different data type of input and output variables which can be categorical, binary and numeric value.  It can handle missing attributes and outliers well.  The decision tree is also good in explaining reasoning for its prediction and therefore gives good insight about the underlying data.

The limitation of decision tree is that each decision boundary at each split point is a concrete binary decision. Also the decision criteria only consider one input attributes at a time but not a combination of multiple input variables. Another weakness of Tree is that once learned it cannot be updated incrementally. When new training data arrives, you have to throw away the old tree and retrain every data from scratch.  In practice, standalone decision tree and rarely used in practice as its predictive accuracy and relatively low.  Tree ensembles (described below) are the common way to use decision trees.

Tree Ensembles

Instead of picking a single model, Ensemble Method combines multiple models in a certain way to fit the training data.  There are two primary ways: “bagging” and “boosting”.  In “bagging”, we take a subset of training data (pick n random sample out of N training data with replacement) to train up each model.  After multiple models are trained, we use a voting scheme to predict future data.

“Boosting” is another approach in Ensemble Method.  Instead of sampling the input features, it samples the training data records, but it puts more emphasis on the training data that is wrongly predicted in previous iterations.  Initially each training data is equally weighted.  At each iteration, the data that is wrongly classified will have its weight increased. The final prediction will be voted by each tree learned at each iteration weighted by the accuracy of that tree.

Random Forest

Random Forest is one of the most popular bagging models; in additional to selecting n training data out of N, at each decision node of the tree, it randomly selects m input features from the total M input features (m ~ M^0.5) and learns a decision tree from it.  Finally each tree in the forest vote for the result.

Here is the R code to use Random Forest.

> install.packages(‘randomForest')
> library(randomForest)
> #Train 100 trees, random selected attributes
> model <- randomForest(Species~., data=iristrain, nTree=500)
> #Predict using the forest
> prediction <- predict(model, newdata=iristest, type='class')
> table(prediction, iristest$Species)
> importance(model)
Sepal.Length         7.807602
Sepal.Width          1.677239
Petal.Length        31.145822
Petal.Width         38.617223

Gradient Boosted Trees

Gradient Boosting Method is one of the most popular boosting methods. It is based on incrementally add a function that fits the gradient of a lost function of residuals
  1. Set i = 0 at the beginning, repeat until converge
  2. Learn a function F(X) to predict Y.  Basically find F that minimize expected(L(F(X) – Y)), where L is the lost function of the residual
  3. Learning another function g(X) to predict the gradient of above function
  4. Update F = F + a.g(X), where a is the learning rate
 Gradient Boosted Tree using decision tree as the learning model F.

Here is the sample code in R.

> library(gbm)
> iris2 <- iris
> newcol = data.frame(isVersicolor=(iris2$Species=='versicolor'))
> iris2 <- cbind(iris2, newcol)
> iris2[45:55,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species isVersicolor
45          5.1         3.8          1.9         0.4     setosa        FALSE
46          4.8         3.0          1.4         0.3     setosa        FALSE
47          5.1         3.8          1.6         0.2     setosa        FALSE
48          4.6         3.2          1.4         0.2     setosa        FALSE
49          5.3         3.7          1.5         0.2     setosa        FALSE
50          5.0         3.3          1.4         0.2     setosa        FALSE
51          7.0         3.2          4.7         1.4 versicolor         TRUE
52          6.4         3.2          4.5         1.5 versicolor         TRUE
53          6.9         3.1          4.9         1.5 versicolor         TRUE
54          5.5         2.3          4.0         1.3 versicolor         TRUE
55          6.5         2.8          4.6         1.5 versicolor         TRUE
> formula <- isVersicolor ~ Sepal.Length 
                              + Sepal.Width
                              + Petal.Length
                              + Petal.Width
> model <- gbm(formula, data=iris2, 
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.2714         -1.#IND     0.0010    0.0008
     2        1.2705         -1.#IND     0.0010    0.0004
     3        1.2688         -1.#IND     0.0010    0.0007
     4        1.2671         -1.#IND     0.0010    0.0008
     5        1.2655         -1.#IND     0.0010    0.0008
     6        1.2639         -1.#IND     0.0010    0.0007
     7        1.2621         -1.#IND     0.0010    0.0008
     8        1.2614         -1.#IND     0.0010    0.0003
     9        1.2597         -1.#IND     0.0010    0.0008
    10        1.2580         -1.#IND     0.0010    0.0008
   100        1.1295         -1.#IND     0.0010    0.0008
   200        1.0090         -1.#IND     0.0010    0.0005
   300        0.9089         -1.#IND     0.0010    0.0005
   400        0.8241         -1.#IND     0.0010    0.0004
   500        0.7513         -1.#IND     0.0010    0.0004
   600        0.6853         -1.#IND     0.0010    0.0003
   700        0.6266         -1.#IND     0.0010    0.0003
   800        0.5755         -1.#IND     0.0010    0.0002
   900        0.5302         -1.#IND     0.0010    0.0002
  1000        0.4901         -1.#IND     0.0010    0.0002

> prediction <- predict.gbm(model, iris2[45:55,], 
> round(prediction, 3)
 [1] 0.127 0.131 0.127 0.127 0.127 0.127 0.687 0.688 
     0.572 0.734 0.722
> summary(model)
           var       rel.inf
1 Petal.Length 61.4203761582
2  Petal.Width 34.7557511871
3  Sepal.Width  3.5407662531
4 Sepal.Length  0.2831064016

GBM R package also gave the relative importance of the input features.
I have gone through most of the common machine learning techniques.  In this next post, I will covered how to evaluate the performance of these models in their predictive power.

Friday, June 1, 2012

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN

Continuing from my previous blog in walking down the list of Machine Learning techniques.  In this post, we'll be covering Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor.  Again, we'll be using the same iris data set that we prepared in the last blog.

Neural Network

Neural Network emulates how the human brain works by having a network of neurons that are interconnected and sending stimulating signal to each other.

In the Neural Network model, each neuron is equivalent to a logistic regression unit.  Neurons are organized in multiple layers where every neuron at layer i connects out to every neuron at layer i+1 and nothing else.

The tuning parameters in Neural network includes the number of hidden layers, number of neurons in each layer, as well as the learning rate.

There are no fixed rules to set these parameters and depends a lot in the problem domain.  My default choice is to use a single hidden layer and set the number of neurons to be the same as the input variables. The number of neurons at the output layer depends on how many binary outputs need to be learned.  In a classification problem, this is typically the number of possible values at the output category.

The learning happens via an iterative feedback mechanism where the error of training data output is used to adjusted the corresponding weights of input.  This adjustment will be propagated back to previous layers and the learning algorithm is known as back-propagation.

Here is an example in R

> library(neuralnet)
> nnet_iristrain <-iristrain
> #Binarize the categorical output
> nnet_iristrain <- cbind(nnet_iristrain, 
                          iristrain$Species == 'setosa')
> nnet_iristrain <- cbind(nnet_iristrain,
                          iristrain$Species == 'versicolor')
> nnet_iristrain <- cbind(nnet_iristrain, 
                          iristrain$Species == 'virginica')
> names(nnet_iristrain)[6] <- 'setosa'
> names(nnet_iristrain)[7] <- 'versicolor'
> names(nnet_iristrain)[8] <- 'virginica'
> nn <- neuralnet(setosa+versicolor+virginica ~ 
> plot(nn)
> mypredict <- compute(nn, iristest[-5])$net.result
> # Put multiple binary output to categorical output
> maxidx <- function(arr) {
      return(which(arr == max(arr)))
> idx <- apply(mypredict, c(1), maxidx)
> prediction <- c('setosa', 'versicolor', 'virginica')[idx]
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         3
  virginica       0          0         7

Here is the plot of the Neural network we learn
Neural network is very good at learning non-linear function and also multiple outputs can be learnt at the same time.  However, the training time is relatively long and it is also susceptible to local minimum traps.  This can be mitigated by doing multiple rounds and pick the best learned model.

Support Vector Machine

Support Vector Machine provides a binary classification mechanism based on finding a dividing hyperplane between a set of samples with +ve and -ve outputs.  It assumes the data is linearly separable.

The problem can be structured as a quadratic programming optimization problem as maximizing the margin subjected to a set of linear constraints (ie: data output on one side of the line must be +ve while the other side is -ve).  This can be solved by quadratic programming technique.

If the data is not linearly separable due to noise (majority is still linearly separable), then an error term will be added to penalize the optimization.

If the data distribution is fundamentally non-linear, the trick is to transform the data to a higher dimension and hopefully the data is linearly separable in that higher dimension.  The optimization term turns out to be a dot product of the transformed points in the high dimensional space, which was found to be equivalent to perform a kernal function in the original (before transformation) space.

The kernal function provides a cheap way to equivalently transform the original point to a high dimension (since we don’t actually transform it) and perform the quadratic optimization in that high dimension space.

There are a couple tuning parameters (like the penalty / cost) and so the usual way is to do it in 2 steps, find the optimal parameter and then train the SVM model using that.  Here is the example code in R.

> library(e1071)
> tune <- tune.svm(Species~., 
> summary(tune)
Parameter tuning of ‘svm’:
- sampling method: 10-fold cross validation 
- best parameters:
 gamma  cost
 0.001 10000
- best performance: 0.03333333 
> model <- svm(Species~., 
> prediction <- predict(model, iristest)
> table(iristest$Species, prediction)
             setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          3         7

SVM with Kernal function is a highly effective model and works well across a wide range of problem sets.  Although it is a binary classifier, it can be easily extended to multi-class classification by training a group of binary classifiers and using “one vs all” or “one vs one” to predict.

SVM predicts the output based on the distance to the dividing hyperplane, which doesn’t directly provide a probability estimation of its prediction.  Calibration technique can be used which basically learn a logistic regression model between the distance of the hyperplane and the binary output and using the logistic regression to estimate the probability.

SVM is a very powerful technique and perform good in a wide range of non-linear classification problems.  It works best when you have a small set of input features because it will expand the features into higher dimension space, providing that you also have a good size of training data (otherwise, overfit can happen).  However, SVM is not very scalable in dealing with large number (billions) of training data, so in that case, Logistic Regression with manually expanded feature set will be more pragmatic. 

Naive Bayes

From a probabilistic viewpoint, the predictive problem can be viewed as a conditional probability estimation; trying to find Y where P(Y | X) is maximized.

From Bayesian rule, P(Y | X) == P(X | Y) * P(Y) / P(X)

This is equivalent to finding Y where P(X | Y) * P(Y) is maximized.

Lets say input X contains 3 categorical features, X1, X2, X3.  In the general case, we assume each variable can potentially influence any other variable and therefore the joint distribution becomes:
P(X | Y) = P(X1 | Y) * P(X2 | X1, Y) * P(X3 | X1, X2, Y)

Notice the last term of the above equation has number of entries exponentially proportional to the number of input variables.

To reduce the complexity, we can make some independence assumption based on domain knowledge and remove some edges.  We form a Bayesian Network where the arrows indicate causal relationship.

In Naïve Bayes model, we take our independence assumption even further and assume each input variable are independent of each other given Y.

Since P(X | Y) == P(X1 | Y) * P(X2 | Y) * P(X3 | Y).
The problem is to find Y to maximize P(X1 | Y) * P(X2 | Y) * P(X3 | Y) * P(Y)

Each term can be learned by counting the training data.  Therefore we can estimate P(Y | X) and pick Y that maximize its value.

It is possible that some patterns never show up in training data ie: P(X1=a | Y=y) is 0.  To deal with this situation, we apply the smoothing technique by assuming we have seen the data of each possible value one more than actual.  Now ...

P(X1=a | Y=y) == (count(a, y) + 1) / (count(y) + m)
where m is the number of possible values in X1.

When the input features are numeric, say a = 2.75, we can assume X1 is normal distribution.  Find out the mean and standard deviation of X1 and then estimate P(X1=a) using the normal distribution function.

Here is how we use Naïve Bayes in R
> library(e1071)
> # Can handle both categorical and numeric input, 
> # but output must be categorical
> model <- naiveBayes(Species~., data=iristrain)
> prediction <- predict(model, iristest[,-5])
> table(prediction, iristest[,5])
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         2
  virginica       0          0         8

Notice the independence assumption is not true in most cases, nevertheless the system still perform incredibly well.  One strength of Naïve Bayes is it is highly scalable and can learn incrementally because all we do is to count the observed variables and update the probability distribution.

K Nearest Neighbor

K Nearest neighbor is also called instance-based learning, in contrast to model-based learning, because it is not learning any model at all.  The training process is basically memorizing all the training data.  To predict a new data point, we found the closest K (a tunable parameter) neighbors from the training set and let them vote for the final prediction.

To determine the “nearest neighbors”, a distance function need to be defined (e.g. Euclidean distance is a common one for numeric input variables).  The voting can also be weighted among the K-neighbors based on their distance from the new data point.

Here is the R code of using K-nearest neighbor for classification.

> library(class)
> train_input <- as.matrix(iristrain[,-5])
> train_output <- as.vector(iristrain[,5])
> test_input <- as.matrix(iristest[,-5])
> prediction <- knn(train_input, test_input, 
                    train_output, k=5)
> table(prediction, iristest$Species)
prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         1
  virginica       0          0         9

The strength of K nearest neighbor is its simplicity as no model needs to be trained. Incremental learning is automatic when more data arrives (and old data can be deleted as well). The weakness of KNN is it doesn't handle high number of dimensions well.

In my next post, I will finish up the other machine learning techniques on decision trees and ensemble methods.