AI News, Automatic Weighting of Imbalanced Datasets

Automatic Weighting of Imbalanced Datasets

When you build a statistical machine-learning model of an imbalanced dataset, the majority (i.e., most prevalent) class will outweigh the minority classes.

This problem is known as the class-imbalance problem and occurs in a multitude of domains (fraud prevention, intrusion detection, churn prediction, etc.).

However, basic undersampling usually removes instances that might turn out to be informative, and basic oversampling does not add any new information to your model.
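Weighting offers a third option: keep every instance, but make the minority classes count for more during training. As a rough sketch (plain Python, not BigML's internals), balanced class weights can be derived from the class counts alone, so that each class contributes roughly the same total weight:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Give each class a weight inversely proportional to its frequency,
    so that every class contributes roughly the same total weight."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    # weight(c) = total / (n_classes * count(c)); the majority class gets
    # a weight below 1, minority classes get weights above 1.
    return {c: total / (n_classes * n) for c, n in counts.items()}

labels = ["ok"] * 950 + ["fraud"] * 50
print(inverse_frequency_weights(labels))
# {'ok': 0.526..., 'fraud': 10.0}
```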

Let me illustrate the impact of weighting on model creation by means of two sunburst visualizations for models of the Forest Covertype dataset. This dataset has 581,012 instances that belong to 7 different classes with a heavily imbalanced distribution. The first sunburst below corresponds to a single (512-node) model, colored by prediction, that I created without using any weighting. The second one corresponds to a single (512-node) weighted model created using BigML’s new balance objective option (more on it below).

As you can see, weighting surfaces predictions for classes that are under-represented in the input data and that would otherwise be overshadowed by the over-represented classes.

If every weight ends up being zero (this can happen, for instance, if sampling the dataset produces only instances of classes with zero weight), then the resulting model will have a single node with a nil output.

That is, if you know the cost of a false positive and the cost of a false negative for your problem, you will want to weight each class so that the model minimizes the overall misclassification cost.
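As a sketch, assuming a binary problem with known costs (the class names and cost values below are hypothetical), the mapping from misclassification costs to class weights can be as direct as this:

```python
# Illustrative only: map misclassification costs to class weights for a
# binary problem. cost_fn is the cost of missing a positive instance
# (false negative), cost_fp the cost of flagging a negative by mistake.
def cost_based_weights(cost_fn, cost_fp):
    # Weighting the positive class by cost_fn and the negative class by
    # cost_fp makes the training objective proportional to the expected
    # misclassification cost.
    return {"positive": cost_fn, "negative": cost_fp}

# A false negative is 5x as expensive as a false positive:
print(cost_based_weights(cost_fn=5.0, cost_fp=1.0))
```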

How can I change the instance weights when training a model or an ensemble?

To configure the weight parameters through the BigML API, please click here.

In classification models and ensembles, you can combine instance weights with the balance objective or objective weights options. If balance objective is chosen, then BigML automatically balances all the classes evenly.

Weights of zero are valid as long as there are some positive-valued weights. Furthermore, you can read this blog post to learn how to correctly apply weights to your datasets to tune the results of your model.
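For example, a minimal sketch with the BigML Python bindings; the dataset ID is hypothetical, and the argument names (balance_objective, objective_weights, weight_field) reflect the weighting options discussed above and should be double-checked against the API documentation:

```python
# Minimal sketch using the BigML Python bindings; argument names should be
# verified against the API docs linked above.
from bigml.api import BigML

api = BigML()  # reads credentials from BIGML_USERNAME / BIGML_API_KEY

dataset = "dataset/4f66a80803ce8940c5000006"  # hypothetical dataset id

# Option 1: let BigML balance all classes evenly.
balanced_model = api.create_model(dataset, {"balance_objective": True})

# Option 2: set explicit per-class weights for the objective field.
weighted_model = api.create_model(
    dataset, {"objective_weights": [["fraud", 10], ["ok", 1]]})

# Option 3: use a numeric field of the dataset as a per-instance weight.
field_weighted_model = api.create_model(dataset, {"weight_field": "weight"})
```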

Using a Customized Cost Function to Deal with Unbalanced Data

As pointed out in this KDnuggets article, it’s often the case that we only have a few examples of the thing we want to predict in our data.

That’s a problem because the algorithms usually need many examples of each class to extract the general rules in your data; instances in minority classes can be discarded as noise, causing some useful rules to never be found.

However, the last method suggested there takes a different approach: adapting the algorithm to the data by designing a per-instance cost function that penalizes the more abundant classes and favors the less populated ones.

The model configuration panel offers different options for this purpose (balance objective, objective weights, or a weight field, as described above). By using any of these options, you are telling the model how to compensate for the lack of instances in each class.

The KDnuggets article, however, goes one step further and introduces the technique of using a cost function to also penalize the instances that lead to bad predictions.

This means we need to tell our model when it’s not performing well, either because it’s not finding the less common classes or because its predictions are wrong.

For starters, we can add a new field to our dataset containing the quantity used to penalize (or increase the importance of) each row according to our cost function.

WhizzML is perfect for this task, so we’ll use it to build a model that depends on a cost function and check whether it performs better than the models built from raw (or automatically balanced) data.
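The post builds this field with WhizzML; the sketch below is an analogous Python/pandas version, with hypothetical column names, file names, and cost values:

```python
# Analogous sketch in Python (the post itself uses WhizzML). The column
# names ("class", "weight"), file names, and cost values are hypothetical.
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical input file

# Per-class cost: minority classes get a larger penalty when misclassified.
class_cost = {"fraud": 10.0, "ok": 1.0}

# New field that a weight-aware model can use as a per-instance weight.
df["weight"] = df["class"].map(class_cost)
df.to_csv("training_data_weighted.csv", index=False)
```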

The way to prove that weighting our instances improves our model is to evaluate its results and compare them to the ones you’d obtain from a model built on the original data.

80% of the data will form a training dataset used to build the models, and we will hold out the remaining 20% to evaluate their performance.

The create-dataset-split procedure in WhizzML takes a dataset ID, a sample rate and a seed, which is simply a string value of our choice that will be used to randomly select the instances that go into the test dataset.
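A rough Python equivalent of such a deterministic, seed-based split (not the WhizzML implementation) could look like this:

```python
import hashlib

def seeded_split(rows, rate=0.8, seed="my-seed"):
    """Deterministically split rows into training/test sets: each row goes
    into the training set when a hash of (seed, row index) falls below the
    sample rate, so the same seed always yields the same split."""
    train, test = [], []
    for i, row in enumerate(rows):
        digest = hashlib.md5(f"{seed}-{i}".encode()).hexdigest()
        # Map the first 8 hex digits to a number in [0, 1).
        u = int(digest[:8], 16) / 0xFFFFFFFF
        (train if u < rate else test).append(row)
    return train, test

rows = list(range(1000))
train, test = seeded_split(rows, rate=0.8, seed="my-seed")
print(len(train), len(test))  # roughly 800 / 200
```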

The idea here is that we want to improve our model’s performance, so besides uniformly assigning a higher weight to the instances of the minority class, we would like to give a higher weight to the instances that help the model predict correctly.

There are two ways for a prediction to be wrong: instances that are predicted to be of the positive class but are not (false positives, FP), and instances of the positive class that are predicted as negative (false negatives, FN).

We won’t discuss here the details of how to build this expression, but you can see that in each row we compare the value of the objective field (f {{objective-id}}) to the predicted value (f '__prediction__') and use the confidence of the prediction (f '__confidence__'), the total number of instances {{total}} and the instances in the objective field class {{class-inst}} to compute the weight.
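The exact expression is not reproduced here; the sketch below only combines the same ingredients (actual vs. predicted class, confidence, class frequency) in one plausible, hypothetical way:

```python
def instance_weight(actual, predicted, confidence, total, class_inst):
    """One plausible way to combine the ingredients mentioned above
    (hypothetical formula, not the exact expression from the post):
    start from the inverse class frequency, then boost instances that
    the previous model got wrong, scaled by how confident it was."""
    base = total / class_inst          # rarer classes start with a higher weight
    if actual == predicted:
        return base                    # correctly predicted: keep the base weight
    return base * (1 + confidence)     # confidently wrong: up to double the weight

# A misclassified minority-class instance (50 out of 1000 rows) predicted
# with confidence 0.9 gets weight 1000/50 * 1.9 = 38.0
print(instance_weight("fraud", "ok", 0.9, total=1000, class_inst=50))
```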

The process is repeated with a different holdout each time, so every instance is weighted and the models that create the predictions are built on data completely independent from any particular test set.

After testing with some unbalanced datasets, we achieved better performance using the weight field model than with either the raw or the automatically balanced ones.

What kind of algorithm does BigML use to build decision tree models and how does it work?

When generating candidate split points for numeric fields, BigML does not consider every possible split; instead, it uses streaming histograms to choose the split candidates (normally 32 candidates per split). When encountering missing data in input fields during training, there are two possible approaches: either ignoring the instances with missing values when generating candidate splits, or explicitly including them with the MIA approach (e.g., a split of the form "age > threshold or age is missing").
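A toy illustration of both ideas (not BigML's actual code): candidate thresholds are taken from histogram bin edges, and the MIA-style predicate routes missing values down an explicitly chosen branch.

```python
# Rough illustration only: a fixed-size histogram stands in for the
# streaming histogram that BigML maintains over the data.
import numpy as np

def histogram_split_candidates(values, n_candidates=32):
    values = np.asarray([v for v in values if v is not None], dtype=float)
    # Interior bin edges serve as the limited set of candidate thresholds.
    _, edges = np.histogram(values, bins=n_candidates)
    return edges[1:-1]

def mia_split(value, threshold, missing_goes_right=True):
    """MIA-style predicate: 'x > threshold or x is missing' when the
    missing branch is the right one."""
    if value is None:
        return missing_goes_right
    return value > threshold

ages = [23, 31, None, 45, 52, None, 38, 61, 29, 47]
for t in histogram_split_candidates(ages, n_candidates=4):
    right = [a for a in ages if mia_split(a, t)]
    print(f"age > {t:.1f} or missing -> {len(right)} instances go right")
```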

When encountering missing data at prediction time, there are also two strategies: either BigML returns the prediction at the tree node whose split encounters the missing value (ignoring that node's children), or BigML evaluates both of the node's children and combines the predictions from the two subtrees (similar to the C4.5 algorithm and sometimes referred to as 'distribution-based imputation').
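The two strategies can be illustrated on a toy tree (again, a sketch rather than BigML's code):

```python
# Toy tree: leaves carry a class distribution, internal nodes split on a field.
tree = {
    "field": "age", "threshold": 40,
    "prediction": "ok", "distribution": {"ok": 70, "fraud": 30},
    "left":  {"prediction": "ok",    "distribution": {"ok": 55, "fraud": 5}},
    "right": {"prediction": "fraud", "distribution": {"ok": 15, "fraud": 25}},
}

def predict_last(node, x):
    """'Last prediction' strategy: stop at the node whose split field is missing."""
    while "field" in node:
        value = x.get(node["field"])
        if value is None:
            return node["prediction"]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["prediction"]

def predict_proportional(node, x):
    """'Proportional' strategy: when the split field is missing, combine both
    subtrees' class distributions instead of picking one branch."""
    if "field" not in node:
        return node["distribution"]
    value = x.get(node["field"])
    if value is None:
        combined = {}
        for child in (node["left"], node["right"]):
            for cls, count in predict_proportional(child, x).items():
                combined[cls] = combined.get(cls, 0) + count
        return combined
    child = node["left"] if value <= node["threshold"] else node["right"]
    return predict_proportional(child, x)

print(predict_last(tree, {"age": None}))          # 'ok'
print(predict_proportional(tree, {"age": None}))  # {'ok': 70, 'fraud': 30}
```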

How to Make a Prediction - Intro to Deep Learning #1

Welcome to Intro to Deep Learning! This course is for anyone who wants to become a deep learning engineer. I'll take you from the very basics of deep learning ...

Intro to Azure ML & Cloud Computing

Azure Machine Learning Studio is a fully featured graphical data science tool in the cloud. You will learn how to upload, analyze, visualize, manipulate, and ...

Ruby Conf 2013 - Thinking about Machine Learning with Ruby by Bryan Liles

Not sure where to cluster or where to classify? Have you seen a linear regression lately? Ever wanted to take a look into machine learning? Curious to what ...

MLMU.cz - FlowerChecker: Exciting journey of one ML startup – O. Veselý & J. Řihák

Machine Learning Meetup in Brno, Czech Republic Abstract: FlowerChecker — machine learning startup — was established three years ago by three PhD.