
QA – Blackbox Testing for Machine Learning Models

Simply speaking, machine learning models represent a class of software that learns from a given set of data and then makes predictions on a new data set based on that learning. In other words, machine learning models are trained on an existing data set in order to make predictions on new data.

[Figure: classes of machine learning algorithms; supervised vs. unsupervised learning.] Blackbox testing means testing the functionality of an application without knowing the details of its implementation, including internal program structure, data structures, etc.

[Figure: blackbox testing.] Applied to machine learning models, blackbox testing means testing a model without knowing its internal details, such as the features it uses or the algorithm used to create it.

In conventional software development, a frequently invoked assumption is the presence of a test oracle: human testers/test engineers, or some testing mechanism such as a test program, that can verify the output of the program under test against an expected value known beforehand.

Because the output of a machine learning model is a prediction, it is hard to verify the prediction against an expected value, since no expected value is known beforehand. That said, during the development (model-building) phase, data scientists do test model performance by comparing the model's outputs (predicted values) with the actual values. This is not the same as testing the model on arbitrary inputs for which the expected value is unknown.

Testing model performance means evaluating the model on test or new data sets and comparing its performance, in terms of metrics such as accuracy and recall, against the pre-determined benchmarks recorded when the model was built and moved into production.
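As a minimal sketch of such a check (the benchmark values, the tolerance, and the binary-classification setup below are assumptions for illustration, not from the article), this could look like the following in Python with scikit-learn:

```python
# Minimal sketch: compare a deployed model's accuracy/recall on a new
# data set against pre-determined production benchmarks.
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical benchmarks recorded when the model moved to production.
PRODUCTION_ACCURACY = 0.90
PRODUCTION_RECALL = 0.85

def performance_check(model, X_new, y_new, max_drop=0.02):
    """Return (ok, metrics); ok is False if either metric falls
    noticeably below its production benchmark."""
    y_pred = model.predict(X_new)
    accuracy = accuracy_score(y_new, y_pred)
    recall = recall_score(y_new, y_pred)  # assumes binary labels
    ok = (accuracy >= PRODUCTION_ACCURACY - max_drop and
          recall >= PRODUCTION_RECALL - max_drop)
    return ok, {"accuracy": accuracy, "recall": recall}
```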

For example, consider a model that predicts the likelihood of a person suffering from a particular disease based on predictors such as age, smoking habit, and gender. Suppose detailed analysis shows that, for a male smoker, the predicted likelihood of suffering from the disease increases by 5% for every 3-year increase in age. This property can be used for metamorphic testing, since age defines a metamorphic relationship between inputs and outputs.

[Figure: metamorphic testing.] In metamorphic testing, test cases that succeed generate another set of test cases that can be used to test the machine learning model further. [The source's sample test plan is omitted here.] Such test cases can be executed until all of them succeed, or until one fails at some step.
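As a hedged illustration of the relation above (the function name predict_risk, its signature, and the tolerance are hypothetical stand-ins, not part of the article):

```python
# Minimal metamorphic-testing sketch for a hypothetical disease-risk model.
# Assumed relation: for a male smoker, predicted risk rises by roughly
# 5 percentage points when age increases by 3 years.

def check_metamorphic_relation(predict_risk, ages, delta_years=3,
                               expected_increase=0.05, tolerance=0.01):
    """predict_risk(age, is_smoker, is_male) -> probability (assumed API)."""
    failures = []
    for age in ages:
        base = predict_risk(age, is_smoker=True, is_male=True)
        followup = predict_risk(age + delta_years, is_smoker=True, is_male=True)
        observed = followup - base
        if abs(observed - expected_increase) > tolerance:
            failures.append((age, observed))
    return failures  # empty list: every metamorphic test case passed

# Each passing case can seed the next one (age, age+3, age+6, ...),
# mirroring how successful metamorphic tests generate further tests.
```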

Given the testing techniques mentioned earlier, test engineers and QA professionals need a corresponding set of skills in order to play a vital role in performing quality control checks on AI and machine learning models.

QA: Blackbox Testing for Machine Learning Models

A Data Science/Machine Learning career has primarily been associated with building models that can make numerical or class-related predictions.

Simply speaking, Machine Learning models represent a class of software that learns from a given set of data and then makes predictions on a new data set based on that learning.

For example, a Machine Learning model could be trained on the data (diagnostic reports) of patients suffering from cardiac disease, in order to predict whether a new patient is suffering from cardiac disease when his or her diagnostic data is fed into the model.

Blackbox testing

When applied to Machine Learning models, blackbox testing means testing the model without knowing its internal details, such as the features of the model or the algorithm used to create it.

In conventional software development, a frequently invoked assumption is the presence of a test oracle: human testers/test engineers, or some testing mechanism such as a test program, that can verify the output of the program under test against an expected value known beforehand.

The following are some of the techniques that can be used to perform blackbox testing on Machine Learning models. The first is testing model performance: evaluating the model on test or new data sets and comparing metrics such as accuracy and recall against the pre-determined benchmarks for the model already built and moved into production.

For example, hypothetically speaking, an ML model is built that predicts the likelihood of a person suffering from a particular disease based on different predictor variables such as age, smoking habit, gender, exercise habits, etc.

Detailed analysis shows that, given the person is a male smoker, the predicted likelihood of suffering from the disease increases by 5% for every 3-year increase in age.

[Fig 3: Metamorphic testing of machine learning models.] In metamorphic testing, test cases that succeed generate another set of test cases that can be used to test the Machine Learning model further.

Another technique is to train several models using different algorithms on the same data set and compare their predictions. For inputs where the majority of the models other than the random forest give a prediction that does not match the random forest model's prediction, a bug/defect could be raised in the defect tracking system.
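A hedged sketch of that cross-check (the particular comparison models and the majority-vote rule here are illustrative assumptions):

```python
# Sketch: cross-check a random forest against several independently
# trained models; where the majority disagrees with the random forest,
# record the input as a candidate defect.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def find_disagreements(X_train, y_train, X_test):
    rf = RandomForestClassifier().fit(X_train, y_train)
    others = [
        LogisticRegression(max_iter=1000).fit(X_train, y_train),
        KNeighborsClassifier().fit(X_train, y_train),
        DecisionTreeClassifier().fit(X_train, y_train),
    ]
    rf_pred = rf.predict(X_test).astype(int)  # assumes integer class labels
    other_preds = np.array([m.predict(X_test) for m in others]).astype(int)
    # Majority vote of the non-random-forest models for each test input.
    majority = np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), 0, other_preds)
    # Indices where the majority contradicts the random forest are
    # candidates for a bug/defect report.
    return np.where(majority != rf_pred)[0]
```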

Given the testing techniques mentioned earlier, test engineers and QA professionals need a corresponding set of skills in order to play a vital role in performing quality control checks on AI or Machine Learning models. If you are a QA professional or test engineer working in a QA organization/department, you could explore a career in data science/Machine Learning, testing and performing quality control checks on Machine Learning models from a QA perspective.

Explaining Black-Box Machine Learning Models - Code Part 1: tabular data + caret + iml

The first metric to look at for Random Forest models (and many other algorithms) is feature importance: “Variable importance evaluation functions can be separated into two groups: those that use the model information and those that do not. The advantage of using a model-based approach is that it is more closely tied to the model performance and that it may be able to incorporate the correlation structure between the predictors into the importance calculation.” (https://topepo.github.io/caret/variable-importance.html)

The varImp() function from the caret package can be used to calculate feature importance measures for most methods.

For Random Forest classification models such as ours, the prediction error rate is calculated on the out-of-bag data of each tree, both before and after permuting each predictor variable. These two measures are averaged and normalized as described here: “Here are the definitions of the variable importance measures. …”

For multi-class outcomes, the problem is decomposed into all pair-wise problems and the area under the curve is calculated for each class pair (i.e., class 1 vs. class 2, class 2 vs. class 3, etc.).
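The article's code uses R (caret's varImp()); as a rough analogue only, here is a Python sketch with scikit-learn's permutation importance, which, like caret's non-model-based measures, does not rely on model internals (the dataset and model below are stand-ins, not from the article):

```python
# Rough Python analogue of model-agnostic variable importance:
# permutation importance measures the drop in score when a single
# feature's values are shuffled.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
# Features whose shuffling hurts the score most are the most important.
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:5], result.importances_mean[ranking[:5]])
```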

Why are Machine Learning models called black boxes?

While I agree with ncasas' answer on most points (+1), I beg to differ on some. [Figure: explaining the prediction of a black-box model via occlusion analysis, from “Why Should I Trust You?”.]

How people use the term: because such models do not represent the problem in a way that allows humans to directly say what happens for any given input.

How the term 'black-box model' should be used: An approach which makes more sense to me is to call the problem a 'black box problem', similar to what user144410 (+1) writes.

Hence any model which only treats the problem as a black box - that is, something you can put input into and get output out of - is a black-box model.

The “black box” metaphor in machine learning

It has become quite common these days to hear people refer to modern machine learning systems as “black boxes”.

Harris asks: “So, if I’m not mistaken, most, if not all of these deep learning approaches, or even more generally machine learning approaches are, essentially black boxes, in which you can’t really inspect how the algorithm is accomplishing what it is accomplishing.” Although this metaphor is appropriate for some particular situations, it is actually quite misleading in general, and may be causing a considerable amount of confusion.

When people reach for the black box metaphor, what they seem to be expressing is the fact that it is difficult to make sense of the purpose of the various components in a machine learning model.

Along the way, I’ll try to explain the difference between models and how they are trained, discuss scenarios in which the black box metaphor is appropriate, and suggest that in many ways, humans are the real black boxes, at least as far as machine learning is concerned.

The black box metaphor dates back to the early days of cybernetics and behaviourism, and typically refers to a system for which we can only observe the inputs and outputs, but not the internal workings.

Although he successfully demonstrated how certain learned behaviours could be explained by a reinforcement signal which linked certain inputs to certain outputs, he then famously made the mistake of thinking that this theory could easily explain all of human behaviour, including language.

As a simpler example of a black box, consider a thought experiment from Skinner: you are given a box with a set of inputs (switches and buttons) and a set of outputs (lights which are either on or off).

Even if we know the purpose of the overall system, however, there is not necessarily a simple explanation we can offer as to how the system works, other than the fact that each individual component operates according to its own rules, in response to the input.

Harris refers to “how the algorithm is accomplishing what it's accomplishing”, but there are really two parts here: a model, such as a deep learning system, and a learning algorithm, which we use to fit the model to data.

The model states that the output (the force of gravity between two objects) is determined by three input values: the mass of the first object, the mass of the second object, and the distance between them.
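In symbols (this is standard Newtonian gravity, stated here for reference rather than quoted from the essay):

$$F = G\,\frac{m_1 m_2}{r^2}$$

where $m_1$ and $m_2$ are the two masses, $r$ is the distance between them, and $G$ is the gravitational constant, the model's single free parameter.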

Even more impressively, the secondary predictions of relativity have been astounding, successfully predicting, for example, the existence of black holes before we could ever hope to test for their existence.

In addition to being an incredibly successful rebranding of neural networks and machine learning (itself arguably a rather successful rebranding of statistics), the term deep learning refers to a particular type of model, one in which the outputs are the results of a series of many simple transformations applied to the inputs (much like our wiring diagram from above).

If I give you a simple set of rules to follow in order to make a prediction, as long as there aren’t too many rules and the rules themselves are simple, you could pretty easily figure out the full set of input-to-output mappings in your mind.

This is also true, though to a lesser extent, with a class of models known as linear models, where the effect of changing any one input can be interpreted without knowing about the value of other inputs.

Deep learning models, by contrast, typically involve non-linearities and interactions between inputs, which means that not only is there no simple mapping from inputs to outputs, but the effect of changing one input may depend critically on the values of other inputs.
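A tiny numeric illustration of that contrast (the two functions below are invented for illustration; they are not from the essay):

```python
# In a linear model, the effect of changing one input is a fixed
# coefficient, independent of the other inputs.
def linear(x1, x2):
    return 2.0 * x1 + 3.0 * x2

# With an interaction term (as in deep models), the effect of x1
# depends on the value of x2.
def interacting(x1, x2):
    return 2.0 * x1 + 3.0 * x2 + 5.0 * x1 * x2

for x2 in (0.0, 10.0):
    d_lin = linear(1.0, x2) - linear(0.0, x2)            # always 2.0
    d_int = interacting(1.0, x2) - interacting(0.0, x2)  # 2.0, then 52.0
    print(x2, d_lin, d_int)
```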

In the example of gravity, once we have assumed a “good enough” model (proportional to mass and inversely proportional to distance squared), we just need to resolve the value of one parameter (G), by fitting the model to observations.
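A hedged sketch of that one-parameter fit (synthetic observations; scipy's curve_fit stands in for whatever fitting procedure one prefers):

```python
# Sketch: recover the single free parameter G of the gravity model
# F = G * m1 * m2 / r**2 by least-squares fitting to noisy observations.
import numpy as np
from scipy.optimize import curve_fit

def gravity(inputs, G):
    m1, m2, r = inputs
    return G * m1 * m2 / r**2

rng = np.random.default_rng(0)
m1 = rng.uniform(1e3, 1e5, 50)
m2 = rng.uniform(1e3, 1e5, 50)
r = rng.uniform(1.0, 10.0, 50)
true_G = 6.674e-11
F_obs = gravity((m1, m2, r), true_G) * (1 + 0.01 * rng.normal(size=50))

(G_hat,), _ = curve_fit(gravity, (m1, m2, r), F_obs, p0=[1e-10])
print(G_hat)  # should land near 6.674e-11
```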

In practice, nearly all of these deep learning models are trained using some variant of an algorithm called stochastic gradient descent (SGD), which takes random samples from the training data, and gradually adjusts all parameters to make the predicted output more like what we want.
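A minimal SGD sketch in plain NumPy (linear regression is the stand-in model here; the essay describes the algorithm, not this code):

```python
# Minimal stochastic gradient descent: sample a random mini-batch,
# nudge the parameters downhill on the batch's squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)        # initialization (often random in practice)
lr, batch = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch)     # random sample of data
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    w -= lr * grad                                # gradual adjustment
print(w)  # approaches true_w
```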

Although it is somewhat unsatisfying, the complete answer to why a machine learning system did something ultimately lies in the combination of the assumptions we made in designing the model, the data it was trained on, and various decisions made about how to learn the parameters, including the randomness in the initialization.

If asked to explain how we are able to recognize objects, by contrast, we might think we can provide some sort of explanation (something involving edges and colours), but in reality, this process operates well below the level of consciousness.

Although there are special circumstances in which we can actually inspect the inner workings of human or other mammalian systems, such as neuroscience experiments, in general we are trying to use machine learning to mimic human behaviour using only the inputs and the outputs.

In the popular imagination, the expectation seems to be that the car must have evaluated possible outcomes, assigned them probabilities, and chosen the one most likely to maximize some measure of goodness, where goodness is determined according to some sort of morality that has been programmed into it.

Rather, if we ask the car why it did what it did, the answer will be that it applied a transparent and deterministic computation using the values of its parameters, given its current input, and this determined its actions.

If we ask a human driver why they went off the road, they will likely be capable of responding in language, and giving some account of themselves — that they were drunk, or distracted, or had to swerve, or were blinded by the weather — and yet aside from providing some sort of narrative coherence, we don’t really know why they did it, and neither do they.

Introducing RapidMiner Auto Model

Unlike existing automated machine learning approaches, Auto Model is not a “black box” that prevents data scientists from understanding how the model works.

Friends Don’t Let Friends Deploy Black-Box Models

Rich Caruana, Microsoft Research — In machine learning often a trade-off must be made between accuracy and intelligibility: the most accurate models usually ...

Membership Inference Attacks against Machine Learning Models

Membership Inference Attacks against Machine Learning Models Reza Shokri (Cornell Tech) Presented at the 2017 IEEE Symposium on Security & Privacy May ...

State Transition Testing - test design techniques tutorial

In this test design video, I explain what a state is, what a transition is, and valid and invalid transitions, with an example state graph or state chart. Click the CC button to ...

Explaining Black-Box Machine Learning Predictions - Sameer Singh, UCI

Explaining Black-Box Machine Learning Predictions. The Cove at UCI Applied Innovation. Sameer Singh, Ph.D., Asst. Prof. of Computer ...

NIPS 2015 Workshop (Salakhutdinov) 15638 Black box learning and inference

Probabilistic models have traditionally co-evolved with tailored algorithms for efficient learning and inference. One of the exciting developments of recent years ...

Symbolic Execution and Model Checking for Testing

Google Tech Talks, November 16, 2007. This talk describes techniques that use model checking and symbolic execution for test input generation. Abstract state ...

Lecture 16 | Adversarial Examples and Adversarial Training

In Lecture 16, guest lecturer Ian Goodfellow discusses adversarial examples in deep learning. We discuss why deep networks and other machine learning ...

Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models

Author: Adam Perer, IBM Thomas J. Watson Research Center Abstract: Understanding predictive models, in terms of interpreting and identifying actionable ...

Neural Network Model - Deep Learning with Neural Networks and TensorFlow

Welcome to part three of Deep Learning with Neural Networks and TensorFlow, and part 45 of the Machine Learning tutorial series. In this tutorial, we're going to ...