# AI News, Monotonicity constraints in machine learning

- On Tuesday, September 25, 2018
- By Read More

## Monotonicity constraints in machine learning

In practical machine learning and data science tasks, an ML model is often used to quantify a global, semantically meaningful relationship between two or more values.

However what might easily happen is that upon building the model, the data scientist discovers that the model is behaving unexpectedly: for example the model predicts that on Tuesdays, the clients would rather pay $110 than $100 for a room!

And while monotonicity constraints have been a topic of academic research for a long time (see a survey paper on monotonocity constraints for tree based methods), there has been lack of support from libraries, making the problem hard to tackle for practitioners.

Luckily, in recent years there has been a lot of progress in various ML libraries to allow setting monotonicity constraints for the models, including in LightGBM and XGBoost, two of the most popular libraries for gradient boosted trees.

For tree based methods (decision trees, random forests, gradient boosted trees), monotonicity can be forced during the model learning phase by not creating splits on monotonic features that would break the monotonicity constraint.

One recent development in the field is Tensorflow Lattice, which implements lattice based models that are essentially interpolated look-up tables that can approximate arbitrary input-output relationships in the data and which can optionally be monotonic.

- On Tuesday, September 25, 2018
- By Read More

## Ideas on interpreting machine learning

You’ve probably heard by now that machine learning algorithms can use big data to predict whether a donor will give to a charity, whether an infant in a NICU will develop sepsis, whether a customer will respond to an ad, and on and on.

Although it is possible to enforce monotonicity constraints (a relationship that only changes in one direction) between independent variables and a machine-learned response function, machine learning algorithms tend to create nonlinear, non-monotonic, non-polynomial, and even non-continuous functions that approximate the relationship between independent and dependent variables in a data set.

(This relationship might also be referred to as the conditional distribution of the dependent variables, given the values of the independent variables.) These functions can then make very specific predictions about the values of dependent variables for new data—whether a donor will give to a charity, an infant in a NICU will develop sepsis, a customer will respond to an ad, etc.

These models will be referred to here as “linear and monotonic,” meaning that for a change in any given independent variable (or sometimes combination or function of an independent variable), the response function changes at a defined rate, in only one direction, and at a magnitude represented by a readily available coefficient.

For instance, if a lender rejects your credit card application, they can tell you why because their probability-of-default model often assumes your credit score, your account balances, and the length of your credit history are monotonically related to your ability to pay your credit card bill.

While there is no single coefficient that represents the change in the response function induced by a change in a single independent variable, nonlinear and monotonic functions do always change in one direction as a single input variable changes.

These functions are not highlighted here because they tend to be less accurate predictors than purely nonlinear, non-monotonic functions, while also lacking the high interpretability of their monotonic counterparts.) Nonlinear, non-monotonic functions: Most machine learning algorithms create nonlinear, non-monotonic response functions.

Global interpretability: Some of the presented techniques facilitate global interpretations of machine learning algorithms, their results, or the machine-learned relationship between the inputs and the dependent variable(s) (e.g., the model of the conditional distribution).

Another important way to classify model interpretability techniques is whether they are “model-agnostic,” meaning they can be applied to different types of machine learning algorithms, or “model-specific,” meaning techniques that are only applicable for a single type or class of algorithm.

Part 1 includes approaches for seeing and understanding your data in the context of training and interpreting machine learning algorithms, Part 2 introduces techniques for combining linear models and machine learning algorithms for situations where interpretability is of paramount importance, and Part 3 describes approaches for understanding and validating the most complex types of predictive models.

So, if a data set has more than two or three variables or more rows than can fit on a single page or screen, it’s realistically going to be hard to understand what’s going on in it without resulting to more advanced techniques than scrolling through countless rows of data.

Traditional univariate and bivariate tables and plots are still important and you should use them, but they are more relevant in the context of traditional linear models and slightly less helpful in understanding nonlinear models that can pick up on arbitrarily high-degree interactions between independent variables.

We can also see that, in general, operating system versions tend to be older than browser versions and that using Windows and Safari is correlated with using newer operating system and browser versions, whereas Linux users and bots are correlated with older operating system and browser versions.

While many details regarding the display of a correlation graph are optional and could be improved beyond those chosen for Figure 3, correlation graphs are a very powerful tool for seeing and understanding relationships (correlation) between variables in a data set.

The node size is determined by a node’s number of connections (node degree), node color is determined by a graph community calculation, and node position is defined by a graph force field algorithm.

In a supervised model built for the data represented in Figure 3, assuming one of the represented variables was an appropriate target, we would expect variable selection techniques to pick one or two variables from the light green, blue, and purple groups, we would expect variables with thick connections to the target to be important variables in the model, and we would expect a model to learn that unconnected variables like CHANNEL_R are not very important.

For instance, if known hierarchies, classes, or clusters exist in training or test data sets and these structures are visible in 2-D projections, it is possible to confirm that a machine learning model is labeling these structures correctly.

It is reasonable to expect a machine learning model to label older, richer customers differently than younger, less affluent customers—and moreover, to expect that these different groups should be relatively disjointed and compact in a projection, and relatively far from one another.

Such results should also be stable under minor perturbations of the training or test data, and projections from perturbed versus non-perturbed samples can be used to check for stability or for potential patterns of change over time.

In fact, the way partial dependence plots enhance understanding is exactly by showing the nonlinearity, non-monotonicity, and two-way interactions between independent variables and a dependent variable in complex models.

They can also enhance trust when displayed relationships conform to domain knowledge expectations, when the plots remain stable or change in expected ways over time, or when displayed relationships remain stable under minor perturbations of the input data.

Vice versa, if models are producing randomly distributed residuals, this a strong indication of a well-fit, dependable, trustworthy model, especially if other fit statistics (i.e., R2, AUC, etc.) are in the appropriate ranges.

As many machine learning algorithms seek to minimize squared residuals, observations with high residual values will have a strong impact on most models, and human analysis of the validity of these outliers can have a big impact on model accuracy.

For most people, visual representations of structures (clusters, hierarchy, sparsity, outliers) and relationships (correlation) in a data set are easier to understand than scrolling through plain rows of data and looking at each variable's values.

In general, visualizations themselves can sometimes be thought of as a type of sensitivity analysis when they are used to display data or models as they change over time, or as data are intentionally changed to test stability or important corner cases for your application.

For analysts and data scientists working in regulated industries, the potential boost in predictive accuracy provided by machine learning algorithms may not outweigh their current realities of internal documentation needs and external regulatory responsibilities.

These models produce linear, monotonic response functions (or at least monotonic ones) with globally interpretable results like those of traditional linear models, but often with a boost in predictive accuracy provided by machine learning algorithms.

Instead of solving the classic normal equation or using statistical tests for variable selection, penalized regression minimizes constrained objective functions to find the best set of regression parameters for a given data set.

Penalized regression has been applied widely across many research disciplines, but it is a great fit for business data with many columns, even data sets with more columns than rows, and for data sets with a lot of correlated variables.

L1/LASSO penalties drive unnecessary regression parameters to zero, selecting a small, representative subset of regression parameters for the regression model while avoiding potential multiple comparison problems that arise in forward, backward, and stepwise variable selection.

Quantile regression allows you to fit a traditional, interpretable, linear model to different percentiles of your training data, allowing you to find different sets of variables with different parameters for modeling different behaviors across a customer market or portfolio of accounts.

It’s quite possible that the lessened assumption burden, the ability to select variables without potentially problematic multiple statistical significance tests, the ability to incorporate important but correlated predictors, the ability to fit nonlinear phenomena, or the ability to fit different quantiles of the data's conditional distribution (and not just the mean of the conditional distribution) could lead to more accurate understanding of modeled phenomena.

Two of the main differences between machine learning algorithms and traditional linear models are that machine learning algorithms incorporate many implicit, high-degree variable interactions into their predictions and that machine learning algorithms create nonlinear, non-polynomial, non-monotonic, and even non-continuous response functions.

Building toward machine learning model benchmarks could lead to greater understanding if more data exploration or techniques such as GAMs, partial dependence plots, or multivariate adaptive regression splines lead to deeper understanding of interactions and nonlinear phenomena in a data set.

Building toward machine learning model benchmarks could lead to increased trust in models if additional data exploration or techniques such as GAMs, partial dependence plots, or multivariate adaptive regression splines create linear models that represent the phenomenon of interest in the data set more accurately.

Instead of using machine learning predictions directly for analytical decisions, traditional analytical lifecycle processes (such as data preparation and model deployment) can be augmented with machine learning techniques leading to potentially more accurate predictions from regulator-approved linear, monotonic models.

Understanding can be enhanced further if the process of adding nonlinear features to a linear model, using gated models, or forecasting model degradation leads to deeper knowledge of driving phenomena that create nonlinearity, trends, or changes in your data.

An analyst or data scientist could do experiments to determine the best weighting for the predictions of each model in a simple ensemble, and partial dependency plots could be used to ensure that the model inputs and predictions still behave monotonically with respect to one another.

To ensure interpretability is preserved, use the lowest possible number of individual constituent models, use simple, linear combinations of constituent models, and use partial dependence plots to check that linear or monotonic relationships have been preserved.

Figure 13 represents a process where carefully chosen and processed non-negative, monotonic independent variables are used in conjunction with a single hidden layer neural network training algorithm that is constrained to produce only positive parameters.

For tree-based models, monotonicity constraints are usually enforced by a uniform splitting strategy, where splits of a variable in one direction always increase the average value of the dependent variable in the resultant child node, and splits of the variable in the other direction always decrease the average value of the dependent variable in resultant child node.

They enable automatic generation of reason codes and for certain cases (i.e., single hidden-layer neural networks and single decision trees) important, high-degree variable interactions can also be automatically determined.

However, there is nothing to preclude fitting surrogate models to more local regions of a complex model's conditional distribution, such as clusters of input records and their corresponding predictions, or deciles of predictions and their corresponding input rows.

Surrogate models can increase trust when used in conjunction with sensitivity analysis to test that explanations remain stable and in line with human domain knowledge, and reasonable expectations when data is lightly and purposefully perturbed, when interesting scenarios are simulated, or as data changes over time.

It could certainly also be applied to business or customer data, for instance by explaining customers at every decile of predicted probabilities for default or churn, or by explaining representative customers from well-known market segments.

LIME can also enhance trust when used in conjunction with maximum activation analysis to see that a model treats obviously different records using different internal mechanisms and obviously similar records using similar internal mechanisms.

LIME can even be used as a type of sensitivity analysis to determine whether explanations remain stable and in line with human domain knowledge and expectations when data is intentionally and subtly perturbed, when pertinent scenarios are simulated, or as data changes over time.

In maximum activation analysis, examples are found or simulated that maximally activate certain neurons, layers, or filters in a neural network or certain trees in decision tree ensembles.

Maximum activation analysis elucidates internal mechanisms of complex models by determining the parts of the response function that specific observations or groups of similar observations excite to the highest degree, either by high-magnitude output from neurons or by low residual output from trees.

Maximum activation analysis enhances trust when a complex model handles obviously different records using different internal mechanisms and obviously similar records using similar internal mechanisms.

Output distributions, error measurements, plots, and interpretation techniques can be used to explore the way models behave in important scenarios, how they change over time, or if models remain stable when data is subtly and intentionally corrupted.

Sensitivity analysis is a global interpretation technique when global interpretation techniques are used, such as using a single, global surrogate model to ensure major interactions remain stable when data is lightly and purposely corrupted.

Sensitivity analysis is a local interpretation technique when local interpretation techniques are used, for instance using LIME to determine if the important variables in a credit allocation decision remain stable for a given customer segment under macroeconomic stress testing.

As illustrated in Figure 18, a simple heuristic rule for variable importance in a decision tree is related to the depth and frequency at which a variable is split on in a tree, where variables used higher in the tree and more frequently in the tree are more important.

For a random forest, variable importance is also calculated as it is for a single tree and aggregated, but an additional measure of variable importance is provided by the change in out-of-bag accuracy caused by shuffling the independent variable of interest, where larger decreases in accuracy are taken as larger indications of importance.

(Shuffling is seen as zeroing-out the effect of the given independent variable in the model, because other variables are not shuffled.) For neural networks, variable importance measures are typically associated with the aggregated, absolute magnitude of model parameters for a given variable of interest.

Global variable importance techniques are typically model specific, and practitioners should be aware that unsophisticated measures of variable importance can be biased toward larger scale variables or variables with a high number of categories.

LOCO also creates global variable importance measures by estimating the mean change in accuracy for each variable over an entire data set and can even provide confidence intervals for these global estimates of variable importance.

(Treeinterpreter simply outputs a list of the bias and individual contributions for a variable in a given model or the contributions the input variables in a single record make to a single prediction.) The eli5 package also has an implementation of treeinterpreter.

Treeinterpreter also enhances trust if displayed explanations remain stable when data is subtly and intentionally corrupted, and if explanations change in appropriate ways as data changes over time or when interesting scenarios are simulated.

I believe the article has put forward a useful ontology for understanding machine learning interpretability techniques moving into the future by categorizing them based on four criteria: their scope (local versus global), the complexity of response function they can explain, their application domain (model specific versus model agnostic), and how they can enhance trust and understanding.

If I had more time and could keep adding to the article indefinitely, I think two topics I would prioritize for readers today would be RuleFit and the multiple advances in making deep learning more interpretable, one example of many being learning deep k-Nearest Neighbor representations.

- On Tuesday, September 25, 2018
- By Read More

## Classification and regression

spark.ml logistic regression can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict a multiclass outcome by using multinomial logistic regression.

Multinomial logistic regression can be used for binary classification by setting the family param to “multinomial”.

When fitting LogisticRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns.

For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib.

Example The following example shows how to train binomial and multinomial logistic regression models for binary classification with elastic net regularization.

coefficients and intercept methods on a logistic regression model trained with multinomial family are not supported.

The conditional probabilities of the outcome classes $k \in {1, 2, …, K}$ are modeled using the softmax function.

\[ P(Y=k|\mathbf{X}, \boldsymbol{\beta}_k, \beta_{0k}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X} + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\boldsymbol{\beta}_{k'} \cdot \mathbf{X} + \beta_{0k'}}} \]

We minimize the weighted negative log-likelihood, using a multinomial response model, with elastic-net penalty to control for overfitting.

\[\min_{\beta, \beta_0} -\left[\sum_{i=1}^L w_i \cdot \log P(Y = y_i|\mathbf{x}_i)\right] + \lambda \left[\frac{1}{2}\left(1 - \alpha\right)||\boldsymbol{\beta}||_2^2 + \alpha ||\boldsymbol{\beta}||_1\right] \]

Example The following example shows how to train a multiclass logistic regression model with elastic net regularization.

Example The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. We

these help index categories for the label and categorical features, adding metadata to the DataFrame which the Decision Tree algorithm can recognize.

Example The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. We

these help index categories for the label and categorical features, adding metadata to the DataFrame which the tree-based algorithms can recognize.

Example The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set. We

these help index categories for the label and categorical features, adding metadata to the DataFrame which the tree-based algorithms can recognize.

All other nodes map inputs to outputs by a linear combination of the inputs with the node’s weights $\wv$ and bias $\bv$ and applying an activation function.

Example OneVsRest is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently.

For the base classifier it takes instances of Classifier and creates a binary classification problem for each of the k classes.

The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.

generalized linear models (GLMs) are specifications of linear models where the response variable $Y_i$ follows some distribution

The form of a natural exponential family distribution is given as: where $\theta$ is the parameter of interest and $\tau$ is a dispersion parameter.

In a GLM the response variable $Y_i$ is assumed to be drawn from a natural exponential family distribution: where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by Here, $A’(\theta_i)$ is defined by the form of the distribution selected.

use a feature transformer to index categorical features, adding metadata to the DataFrame which the tree-based algorithms can recognize.

In spark.ml, we implement the Accelerated failure time (AFT) model which is a parametric survival regression model for censored data.

It describes a model for the log of survival time, so it’s often called a log-linear model for survival analysis.

Given the values of the covariates $x^{‘}$, for random lifetime $t{i}$ of subjects i = 1, …, n, with possible right-censoring, the likelihood function under the AFT model is given as: \[ L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}(\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}} \] Where

The Weibull distribution for lifetime corresponds to the extreme value distribution for the log of the lifetime, and the $S{0}(\epsilon)$ function is: \[

the task of finding a minimizer of a convex function $-\iota(\beta,\sigma)$ that depends on the coefficients vector $\beta$ and the log of scale parameter $\log\sigma$. The

implementation matches the result from R’s survival function survreg When fitting AFTSurvivalRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns.

a finite set of real numbers $Y = {y_1, y_2, ..., y_n}$ representing observed responses and

a function that minimises \begin{equation} f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2 \end{equation}

\left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( \frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0 \] By

- On Wednesday, January 16, 2019

**Zeitgeist: Moving Forward (2011)**

For more information on the zeitgeist movie series, please visit: .