AI News

A “Data Science for Good” Machine Learning Project Walk-Through in Python: Part Two

Model optimization means searching for the model hyperparameters that yield the best performance — measured in cross-validation — for a given dataset.

(I also wrote an article on using Hyperopt for model tuning here.) The details are a little protracted (see the notebook), but we need four parts to implement Bayesian optimization in Hyperopt: an objective function, a domain space, an optimization algorithm, and a history of results. The basic idea of Bayesian Optimization (BO) is that the algorithm reasons from past results (how well previous hyperparameters have scored) and then chooses the next combination of values it thinks will do best.

Grid and random search are uninformed methods that don't use past results; the idea is that, by reasoning about what has already been tried, BO can find better values in fewer search iterations.

Unlike random search, where the scores are, well, random over time, in Bayesian Optimization the scores tend to improve over time as the algorithm learns a probability model of the best hyperparameters.

In this case, the final cross-validation results are shown below in dataframe form: the optimized model (denoted by OPT and using 10 cross-validation folds with the features after selection) places right in the middle of the non-optimized variations of the Gradient Boosting Machine (which used hyperparameters I had found worked well for previous problems). This indicates we haven't found the optimal hyperparameters yet, or that there could be multiple sets of hyperparameters that perform roughly the same.

Automated Machine Learning Hyperparameter Tuning in Python

Now, let's define the entire domain. Here we use a number of different domain distribution types (there are other distributions listed in the documentation as well). There is one important point to notice when we define the boosting type: here we are using a conditional domain, which means the value of one hyperparameter depends on the value of another, as in the sketch below.
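A minimal sketch of such a domain, assuming a LightGBM-style model; the specific hyperparameters and ranges are illustrative, not the notebook's exact values:

```python
from hyperopt import hp

# Illustrative search space; the ranges are assumptions, not the article's values.
space = {
    # Conditional domain: 'subsample' is nested inside the choice of boosting type,
    # because it only applies to some boosting types.
    'boosting_type': hp.choice('boosting_type', [
        {'boosting_type': 'gbdt', 'subsample': hp.uniform('gdbt_subsample', 0.5, 1.0)},
        {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1.0)},
        {'boosting_type': 'goss', 'subsample': 1.0},
    ]),
    # Different distribution types: log-uniform for the learning rate,
    # quantized uniform for integer-valued hyperparameters.
    'learning_rate': hp.loguniform('learning_rate', -7, 0),
    'num_leaves': hp.quniform('num_leaves', 20, 150, 1),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
}
```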

During optimization, the TPE algorithm constructs the probability model from the past results and decides the next set of hyperparameters to evaluate in the objective function by maximizing the expected improvement.

However, if we want to find out what is going on behind the scenes, we can use a Trials object, which will store basic training information and also the dictionary returned from the objective function (which includes the loss and params).
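Creating this results history is a one-liner; a minimal sketch (the variable name bayes_trials is just an illustrative choice):

```python
from hyperopt import Trials

# Records every evaluation made during optimization, including the full
# dictionary returned from the objective function (loss, params, etc.).
bayes_trials = Trials()

# After optimization, bayes_trials.results holds the returned dictionaries,
# and bayes_trials.best_trial the evaluation with the lowest loss.
```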

Before training, we open a new csv file and write the headers; then, within the objective function, we can add lines to write to the csv on every iteration (the complete objective function is in the notebook). Writing to a csv means we can check the progress by opening the file while training (although not in Excel, because this will cause an error in Python).
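A minimal sketch of this logging, assuming an illustrative file name and column layout:

```python
import csv

OUT_FILE = 'bayes_trials.csv'  # illustrative file name

# Before training: create the file and write the headers once.
with open(OUT_FILE, 'w', newline='') as f:
    csv.writer(f).writerow(['loss', 'params', 'iteration', 'train_time'])

# Inside the objective function: append one row per iteration.
def log_iteration(loss, params, iteration, train_time):
    with open(OUT_FILE, 'a', newline='') as f:
        csv.writer(f).writerow([loss, params, iteration, train_time])
```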

Once we have the four parts in place, optimization is run with fmin. On each iteration, the algorithm chooses new hyperparameter values from the surrogate function, which is constructed based on the previous results, and evaluates these values in the objective function.
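A minimal sketch of the call, assuming the space and bayes_trials objects sketched above and a stub objective function (the real one cross-validates the model and returns the validation loss):

```python
from hyperopt import fmin, tpe, STATUS_OK

MAX_EVALS = 500  # illustrative evaluation budget

def objective(params):
    # Stub: the real objective would cross-validate a model built with `params`
    # and return the resulting validation loss.
    loss = 0.0
    return {'loss': loss, 'params': params, 'status': STATUS_OK}

best = fmin(fn=objective,          # function to minimize
            space=space,           # domain to search over
            algo=tpe.suggest,      # Tree-structured Parzen Estimator
            max_evals=MAX_EVALS,   # number of evaluations
            trials=bayes_trials)   # record of all results
```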

The best object that is returned from fmin contains the hyperparameters that yielded the lowest loss on the objective function: Once we have these hyperparameters, we can use them to train a model on the full training data and then evaluate on the testing data (remember we can only use the test set once, when we evaluate the final model).
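One practical note, sketched below: for hyperparameters defined with hp.choice, the best dictionary returned by fmin contains index values rather than the actual settings, and hyperopt's space_eval can map them back:

```python
from hyperopt import space_eval

# Convert the index-valued entries in `best` back into the actual
# hyperparameter settings defined in `space`.
best_params = space_eval(space, best)
print(best_params)
```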

If the algorithm finds a local minimum of the objective function, it might concentrate on hyperparameter values around the local minimum rather than trying different values located far away in the domain space.

A Conceptual Explanation of Bayesian Hyperparameter Optimization for Machine Learning
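The equation discussed below is the standard TPE definition of the surrogate, which models the hyperparameters piecewise around a threshold value $y^{*}$ of the objective function:

$$p(x \mid y) = \begin{cases} l(x) & \text{if } y < y^{*} \\ g(x) & \text{if } y \geq y^{*} \end{cases}$$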

The explanation of this equation is that we make two different distributions for the hyperparameters: one where the value of the objective function is less than the threshold, l(x), and one where the value of the objective function is greater than the threshold, g(x).

Let's update our Random Forest graph to include a threshold. Now we construct two probability distributions for the number of estimators: one using the estimators that yielded values under the threshold and one using the estimators that yielded values above the threshold.

The Tree-structured Parzen Estimator works by drawing sample hyperparameters from l(x), evaluating them in terms of l(x) / g(x), and returning the set that yields the highest value under l(x) / g(x) corresponding to the greatest expected improvement.
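A toy numerical sketch of this selection step (an illustration of the idea, not Hyperopt's implementation; the observed values below are made up):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Pretend these are values of a single hyperparameter (e.g. n_estimators)
# observed so far, split by whether their score fell below (good) or
# above (bad) the threshold.
good = np.array([110.0, 120.0, 125.0, 130.0, 140.0])
bad = np.array([10.0, 30.0, 60.0, 200.0, 250.0])

l = gaussian_kde(good)  # density of hyperparameters with good scores, l(x)
g = gaussian_kde(bad)   # density of hyperparameters with bad scores, g(x)

# Draw candidates from l(x) and keep the one that maximizes l(x) / g(x),
# which corresponds to the greatest expected improvement.
candidates = l.resample(100, seed=1).ravel()
ratios = l(candidates) / g(candidates)
next_value = candidates[np.argmax(ratios)]
print(round(float(next_value), 1))
```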

Each time the algorithm proposes a new set of candidate hyperparameters, it evaluates them with the actual objective function and records the result in a pair (score, hyperparameters).

Eventually, with enough evaluations of the objective function, we hope that our model accurately reflects the objective function and the hyperparameters that yield the greatest Expected Improvement correspond to the hyperparameters that maximize the objective function.

Because the algorithm is proposing better candidate hyperparameters for evaluation, the score on the objective function improves much more rapidly than with random or grid search, leading to fewer overall evaluations of the objective function.

In a paper about using SMBO with TPE, the authors reported that finding the next proposed set of candidate hyperparameters took several seconds, while evaluating the actual objective function took hours.

Hyperparameter tuning for machine learning models.

When creating a machine learning model, you'll be presented with design choices as to how to define your model architecture.

In true machine learning fashion, we'll ideally ask the machine to perform this exploration and select the optimal model architecture automatically.

Model parameters are learned during training when we optimize a loss function using something like gradient descent. The general process for learning parameter values is shown below.
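A minimal sketch of the distinction, assuming scikit-learn and an illustrative dataset and model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# C and max_iter are hyperparameters: design choices we set before training.
model = LogisticRegression(C=1.0, max_iter=5000)

# coef_ and intercept_ are model parameters: values learned from the data
# by optimizing the loss function during fit().
model.fit(X, y)
print(model.coef_.shape, model.intercept_)
```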

The ultimate goal for any machine learning model is to learn from examples in such a manner that the model is capable of generalizing the learning to new instances which it has not yet seen.

At a very basic level, you should train on a subset of your total dataset, holding out the remaining data for evaluation to gauge the model's ability to generalize - in other words, "how well will my model do on data which it hasn't directly learned from during training?"

The introduction of a validation dataset allows us to evaluate the model on different data than it was trained on and select the best model architecture, while still holding out a subset of the data for the final evaluation at the end of our model development.
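A minimal sketch of this train / validation / test split, with illustrative proportions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that is touched only once, for the final evaluation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into training data (for learning parameters) and
# validation data (for comparing model architectures).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```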

You can also leverage more advanced techniques such as K-fold cross validation in order to essentially combine training and validation data for both learning the model parameters and evaluating the model without introducing data leakage.
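A minimal sketch of K-fold cross-validation in scikit-learn, with an illustrative model and dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Each fold is held out once for evaluation while the remaining folds are used
# for training, so every observation contributes to both without leakage.
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```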

Recall that I previously mentioned that the hyperparameter tuning methods relate to how we sample possible model architecture candidates from the space of possible hyperparameter values.

In the following visualization, the $x$ and $y$ dimensions represent two hyperparameters, and the $z$ dimension represents the model's score (defined by some evaluation metric) for the architecture defined by $x$ and $y$.

With this technique, we simply build a model for each possible combination of the hyperparameter values provided, evaluate each model, and select the architecture which produces the best results.
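A minimal sketch with scikit-learn's GridSearchCV; the grid itself is an illustrative assumption:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid: one model is built and cross-validated per combination.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [4, 8, None],
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```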

Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets.

- Bergstra, 2012

In the following example, we're searching over a hyperparameter space where one hyperparameter has significantly more influence on optimizing the model score; the distributions shown on each axis represent the model's score.
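A minimal sketch of random search with scikit-learn's RandomizedSearchCV, sampling from distributions rather than a fixed grid; the ranges are illustrative assumptions:

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative distributions: a fixed number of settings is sampled at random,
# which spreads the budget across values of the influential hyperparameters.
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 20),
}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions, n_iter=20, cv=5,
                            scoring='accuracy', random_state=42)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```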

This model will essentially serve to use the hyperparameter values $\lambda_{1,...i}$ and corresponding scores $v_{1,...i}$ we've observed thus far to approximate a continuous score function over the hyperparameter space.

This approximated function also includes the degree of certainty of our estimate, which we can use to identify the candidate hyperparameter values that would yield the largest expected improvement over the current score.

Hyperparameter Optimization - The Math of Intelligence #7

Hyperparameters are the magic numbers of machine learning. We're going to learn how to find them in a more intelligent way than just trial-and-error. We'll go ...

How to find the best model parameters in scikit-learn

In this video, you'll learn how to efficiently search for the optimal tuning parameters (or "hyperparameters") for your machine learning model in order to maximize ...

40. Parameters vs Hyperparameters

Selecting the best model in scikit-learn using cross-validation

In this video, we'll learn about K-fold cross-validation and how it can be used for selecting optimal tuning parameters, choosing between models, and selecting ...

Lecture 3 | Loss Functions and Optimization

Lecture 3 continues our discussion of linear classifiers. We introduce the idea of a loss function to quantify our unhappiness with a model's predictions, and ...

OpenML robot assistant for algorithm selection and hyperparameter tuning

Lecture 7 | Training Neural Networks II

Lecture 7 continues our discussion of practical issues for training neural networks. We discuss different update rules commonly used to optimize neural networks ...

Keras Tutorial TensorFlow | Deep Learning with Keras | Building Models with Keras | Edureka

TensorFlow Training - This Edureka Keras Tutorial TensorFlow video (Blog: ...

Lecture 13: Convolutional Neural Networks

Lecture 13 provides a mini tutorial on Azure and GPUs followed by research highlight "Character-Aware Neural Language Models." Also covered are CNN ...

Lesson 4: Practical Deep Learning for Coders

COLLABORATIVE FILTERING, EMBEDDINGS, AND MORE When we ran this class at the Data Institute, we asked what students were having the most trouble ...