
About Feature Scaling: Z-Score Standardization and Min-Max Scaling

The result of standardization (or Z-score normalization) is that the features will be rescaled so that they'll have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean.

With features being on different scales, certain weights may update faster than others, since the feature values $x_j$ play a role in the weight updates $$\Delta w_j = - \eta \frac{\partial J}{\partial w_j} = \eta \sum_i (t^{(i)} - o^{(i)})x^{(i)}_{j},$$ so that $$w_j := w_j + \Delta w_j,$$ where $\eta$ is the learning rate, $t^{(i)}$ is the target value, and $o^{(i)}$ is the model's output for the $i$-th training sample.
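As a rough illustration of why this matters, here is a minimal NumPy sketch of one batch update under this rule; the toy data, learning rate, and variable names are made up for illustration, and the update for the large-scale feature dominates unless the features are standardized first.

```python
import numpy as np

# Toy data: two features on very different scales (illustrative values only).
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
t = np.array([1.0, -1.0, 1.0])   # target values t^(i)
w = np.zeros(X.shape[1])         # weights w_j
eta = 0.01                       # learning rate

# One batch gradient-descent step: delta_w_j = eta * sum_i (t_i - o_i) * x_ij
o = X @ w                        # linear outputs o^(i)
delta_w = eta * (t - o) @ X      # negative gradient of the sum-of-squared-errors cost
w = w + delta_w

print(delta_w)  # the component for the large-scale feature dominates the update
```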

k-means clustering

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

k-means is similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, as both employ an iterative refinement approach.

However, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular machine learning technique for classification that is often confused with k-means due to the k in the name.

One can apply the 1-nearest neighbor classifier on the cluster centers obtained by k-means to classify new data into the existing clusters.

Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the within-cluster sum of squares (WCSS) (i.e., variance).

The WCSS can be rewritten using the identity $$\sum_{\mathbf{x} \in S_{i}} \left\|\mathbf{x} - {\boldsymbol{\mu}}_{i}\right\|^{2} = \sum_{\mathbf{x} \neq \mathbf{y} \in S_{i}} (\mathbf{x} - {\boldsymbol{\mu}}_{i})({\boldsymbol{\mu}}_{i} - \mathbf{y}),$$ where $\boldsymbol{\mu}_{i}$ is the mean (centroid) of the points in $S_i$.

Because the total variance is constant, this is also equivalent to maximizing the sum of squared deviations between points in different clusters (between-cluster sum of squares, BCSS).[1]
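As a quick sanity check of this claim, the following NumPy sketch verifies that WCSS + BCSS equals the (constant) total sum of squares for an arbitrary assignment of random points to three clusters; the data and labels are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # toy data points
labels = rng.integers(0, 3, size=100)   # an arbitrary assignment to k=3 clusters

overall_mean = X.mean(axis=0)
tss = ((X - overall_mean) ** 2).sum()   # total sum of squares (constant)

wcss = sum(((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum()
           for i in range(3))
bcss = sum(len(X[labels == i]) *
           ((X[labels == i].mean(axis=0) - overall_mean) ** 2).sum()
           for i in range(3))

print(np.isclose(wcss + bcss, tss))     # True: minimizing WCSS maximizes BCSS
```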

The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points.

However, in the worst case, k-means can be very slow to converge: in particular it has been shown that there exist certain point sets, even in 2 dimensions, on which k-means takes exponential time, that is 2Ω(n), to converge.[12]

These point sets do not seem to arise in practice: this is corroborated by the fact that the smoothed running time of k-means is polynomial.[13]

The 'assignment' step is also referred to as the expectation step and the 'update' step as the maximization step, making this algorithm a variant of the generalized expectation-maximization algorithm.
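A minimal NumPy sketch of this assignment/update iteration (naive Lloyd's algorithm) might look as follows; the Forgy-style initialization, iteration cap, and toy data are choices made here for illustration, not part of any particular library.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Naive Lloyd iteration: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # k points as initial means
    for _ in range(n_iter):
        # Assignment ("expectation") step: each point goes to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update ("maximization") step: each center becomes the mean of its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the converged centers.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)

# Toy usage: three well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
centers, labels = lloyd_kmeans(X, k=3)
print(centers)
```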

Regarding computational complexity, finding the optimal solution to the k-means clustering problem for observations in d dimensions is NP-hard in general Euclidean space, even for two clusters, and NP-hard for a general number of clusters k, even in the plane; if k and d are fixed, the problem can be solved exactly in polynomial time.

On data that does have a clustering structure, the number of iterations until convergence is often small, and results only improve slightly after the first dozen iterations.

Lloyd's algorithm is therefore often considered to be of 'linear' complexity in practice, although it is in the worst case superpolynomial when performed until convergence.[21]

Lloyd's algorithm is the standard approach for this problem. However, it spends a lot of processing time computing the distances between each of the k cluster centers and the n data points.

Since points usually stay in the same clusters after a few iterations, much of this work is unnecessary, making the naive implementation very inefficient.

Variants such as the Hartigan-Wong method instead work point by point, evaluating the change in the objective obtained by moving a point $x$ from its current cluster $S_n$ to another cluster $S_m$: $$\Delta(x,n,m) = \frac{\mid S_{n}\mid}{\mid S_{n}\mid - 1} \cdot \lVert \mu_{n} - x \rVert^{2} - \frac{\mid S_{m}\mid}{\mid S_{m}\mid + 1} \cdot \lVert \mu_{m} - x \rVert^{2}.$$
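Transcribing this expression directly into code gives a small helper for evaluating candidate moves; the function name and the toy numbers are illustrative, and the sign convention follows the reading that the first term is the decrease in WCSS from removing $x$ from $S_n$ and the second the increase from adding it to $S_m$.

```python
import numpy as np

def move_delta(x, mu_n, size_n, mu_m, size_m):
    """Net reduction in the within-cluster sum of squares from moving point x
    out of cluster n (mean mu_n, |S_n| = size_n) into cluster m (mean mu_m, |S_m| = size_m)."""
    removal_gain = size_n / (size_n - 1) * np.sum((mu_n - x) ** 2)
    insertion_cost = size_m / (size_m + 1) * np.sum((mu_m - x) ** 2)
    return removal_gain - insertion_cost  # positive => the move improves the objective

# Toy values (purely illustrative): a point far from its current cluster mean.
x = np.array([2.0, 0.0])
print(move_delta(x, mu_n=np.array([0.0, 0.0]), size_n=10,
                 mu_m=np.array([2.2, 0.0]), size_m=10))  # positive: worth moving
```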

k-means clustering is rather easy to implement and apply even on large data sets, particularly when using heuristics such as Lloyd's algorithm.

For example, in computer graphics, color quantization is the task of reducing the color palette of an image to a fixed number of colors k.
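One possible sketch of this idea uses scikit-learn's KMeans on the flattened pixel array; the random "image", the value of k, and the helper name are placeholders for real data.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, k=16, seed=0):
    """Reduce an (H, W, 3) RGB image to k colors via k-means on its pixels."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels)
    palette = km.cluster_centers_                 # the k representative colors
    return palette[km.labels_].reshape(h, w, 3)   # each pixel -> its nearest color

# Example on a random "image" (stand-in for real pixel data).
image = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3))
quantized = quantize_colors(image, k=8)
print(np.unique(quantized.reshape(-1, 3), axis=0).shape)  # at most 8 distinct colors
```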

Other uses of vector quantization include non-random sampling, as k-means can easily be used to choose k different but prototypical objects from a large data set for further analysis.

Then, to project any input datum into the new feature space, we have a choice of 'encoding' functions; for example, we can use the thresholded matrix product of the datum with the centroid locations, the distance from the datum to each centroid, or simply an indicator function for the nearest centroid.[40][41]
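A sketch of two such encodings (the distance from the datum to each centroid, and a one-hot indicator of the nearest centroid), assuming centroids have already been learned by k-means; the helper name and toy values are illustrative.

```python
import numpy as np

def centroid_encodings(X, centroids):
    """Encode each row of X against previously learned k-means centroids."""
    # Distance-based encoding: one feature per centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Indicator (one-hot) encoding of the nearest centroid.
    one_hot = np.zeros_like(dists)
    one_hot[np.arange(len(X)), dists.argmin(axis=1)] = 1.0
    return dists, one_hot

# Toy usage with made-up centroids.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
X = np.array([[0.1, 0.2], [0.9, 1.1]])
dist_features, indicator_features = centroid_encodings(X, centroids)
print(dist_features.shape, indicator_features)
```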

On an object recognition task, it was found to exhibit comparable performance with more sophisticated feature learning approaches such as autoencoders and restricted Boltzmann machines.[42]

k-means clustering, and its associated expectation-maximization algorithm, is a special case of a Gaussian mixture model, specifically, the limit of taking all covariances as diagonal, equal, and small.

Another generalization of the k-means algorithm is the K-SVD algorithm, which estimates data points as a sparse linear combination of 'codebook vectors'.

It has been shown that the relaxed solution of k-means clustering, specified by the cluster indicators, is given by principal component analysis (PCA), and that the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace.

By contrast, k-means restricts this updated set to k points, usually much fewer than the number of points in the input data set, and replaces each point in this set by the mean of all points in the input set that are closer to that point than to any other (e.g., within the Voronoi partition of each updating point).

A mean shift algorithm that is similar to k-means, called likelihood mean shift, replaces the set of points undergoing replacement by the mean of all points in the input set that are within a given distance of the changing set.[51]

One of the advantages of mean shift over k-means is that there is no need to choose the number of clusters, because mean shift is likely to find only a few clusters if indeed only a small number exist.

It has been shown in [52] that, under sparsity assumptions and when input data is pre-processed with the whitening transformation, k-means produces the solution to the linear independent component analysis (ICA) task.

However, the bilateral filter restricts the calculation of the (kernel weighted) mean to include only points that are close in the ordering of the input data.[51]

The set of squared error minimizing cluster functions also includes the k-medoids algorithm, an approach which forces the center point of each cluster to be one of the actual points, i.e., it uses medoids in place of centroids.

Different implementations of the same algorithm were found to exhibit enormous performance differences, with the fastest on a test data set finishing in 10 seconds, the slowest taking 25988 seconds (approximately 7.2 hours).[1]

The differences can be attributed to implementation quality, language and compiler differences, different termination criteria and precision levels, and the use of indexes for acceleration.

Supervised learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.[1]

In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances.

This requires the learning algorithm to generalize from the training data to unseen situations in a 'reasonable' way (see inductive bias).

There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).

The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.[4]

But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance.

A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).

The second issue is the amount of training data available relative to the complexity of the 'true' function (classifier or regression function).

If the true function is simple, then an 'inflexible' learning algorithm with high bias and low variance will be able to learn it from a small amount of data.

But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data and using a 'flexible' learning algorithm with low bias and high variance.

If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features.

Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias.

In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function.

In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones.

This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.

A fourth issue is the degree of noise in the desired output values (the supervisory target variables).

If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples.

You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model.

In such a situation, the part of the target function that cannot be modeled 'corrupts' your training data - this phenomenon has been called deterministic noise.

In practice, there are several approaches to alleviate noise in the output values, such as early stopping to prevent overfitting, as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm.

There are several algorithms that identify noisy training examples, and removing the suspected noisy training examples prior to training has been shown to decrease generalization error with statistical significance.[5][6]

When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross validation).

Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.

Given a set of $N$ training examples of the form $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ is the feature vector of the $i$-th example and $y_i$ is its label, a learning algorithm seeks a function $g: X \to Y$, where $X$ is the input space and $Y$ is the output space. In many cases $g$ is represented via a scoring function $f: X \times Y \to \mathbb{R}$, so that $g$ returns the highest-scoring label, $g(x) = \arg\max_{y} f(x, y)$. The scoring function may be a conditional probability model, $f(x, y) = P(y \mid x)$, or a joint probability model, $f(x, y) = P(x, y)$.

For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.

In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs, $(x_i, y_i)$.

In order to measure how well a function fits the training data, a loss function $L: Y \times Y \to \mathbb{R}^{\geq 0}$ is defined. For a training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$.

Hence, a supervised learning algorithm can be constructed by applying an optimization algorithm to find the function $g$ that minimizes the empirical risk, $\frac{1}{N} \sum_{i} L(y_i, g(x_i))$. When the hypothesis space contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization.

The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.

Popular regularization penalties for a weight vector $\beta$ include the squared Euclidean ($L_2$) norm, $\|\beta\|_2^2 = \sum_j \beta_j^2$, the $L_1$ norm, $\|\beta\|_1 = \sum_j |\beta_j|$, and the $L_0$ "norm", which counts the number of non-zero coefficients $\beta_j$.
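A small NumPy illustration of these three penalties on an arbitrary weight vector (the values are made up):

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.0, 3.0])    # arbitrary example weights

l2_penalty = np.sum(beta ** 2)            # squared Euclidean (L2) norm
l1_penalty = np.sum(np.abs(beta))         # L1 norm
l0_penalty = np.count_nonzero(beta)       # L0 "norm": number of non-zero weights

print(l2_penalty, l1_penalty, l0_penalty) # approximately 10.69, 4.7, 3
```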

The training methods described above are discriminative training methods, because they seek to find a function $g$ that discriminates well between the different output values. For the special case where $f(x, y) = P(x, y)$ is a joint probability distribution and the loss function is the negative log likelihood, $-\sum_i \log P(x_i, y_i)$, risk minimization is said to perform generative training.

About Feature Scaling and Normalization

The result of standardization (or Z-score normalization) is that the features will be rescaled so that they'll have the properties of a standard normal distribution with $\mu = 0$ and $\sigma = 1$, where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean.

Standard scores (also called z-scores) of the samples are calculated as follows: $$z = \frac{x - \mu}{\sigma}.$$ Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms.
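A minimal NumPy sketch of z-score standardization and min-max scaling applied column-wise to a toy feature matrix; scikit-learn's StandardScaler and MinMaxScaler implement the same transformations.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])        # toy feature matrix, columns = features

# Z-score standardization: z = (x - mean) / std, per feature (column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max scaling to [0, 1]: x' = (x - min) / (max - min), per feature.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std.mean(axis=0), X_std.std(axis=0))       # ~0 and 1 for each column
print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # 0 and 1 for each column
```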

With features being on different scales, certain weights may update faster than others, since the feature values $x_j$ play a role in the weight updates $\Delta w_j$, so that $w_j := w_j + \Delta w_j$, as in the update rule given earlier.

Without going into much depth regarding information gain and impurity measures, we can think of the decision as “is feature x_i >= some_val?” Intuitively, we can see that it really doesn’t matter on which scale this feature is (centimeters, Fahrenheit, a standardized scale – it really doesn’t matter).

For example, if we initialize the weights of a small multi-layer perceptron with tanh activation units to 0 or small random values centered around zero, we want to update the model weights “equally.” As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.

Another prominent example is Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question, and on whether the PCA computes the components via the correlation matrix instead of the covariance matrix).

In the following section, we will randomly divide the wine dataset into a training dataset and a test dataset, where the training dataset will contain 70% of the samples and the test dataset 30%, respectively.
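A sketch of such a 70/30 split, using scikit-learn's bundled wine dataset and train_test_split as a stand-in for the article's own data-loading code:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the wine dataset and split it 70% train / 30% test.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

print(X_train.shape, X_test.shape)  # roughly 70% and 30% of the 178 samples
```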

Let us think about whether it matters if the variables are centered for applications such as Principal Component Analysis (PCA) when the PCA is calculated from the covariance matrix (i.e., the k principal components are the eigenvectors of the covariance matrix that correspond to the k largest eigenvalues).

Let's assume we have the two variables $\mathbf{x}$ and $\mathbf{y}$. Then the covariance between the attributes is calculated as $$\sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$ Let us write the centered variables as $x' = x - \bar{x}$ and $y' = y - \bar{y}$. The centered covariance would then be calculated as follows: $$\sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} (x_i' - \bar{x}')(y_i' - \bar{y}').$$ But since after centering, $\bar{x}' = 0$ and $\bar{y}' = 0$, we have $$\sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} x_i' y_i',$$ which is our original covariance if we resubstitute the terms $x' = x - \bar{x}$ and $y' = y - \bar{y}$.

Let $c$ be the scaling factor for $\mathbf{x}$. Given that the "original" covariance is calculated as $$\sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$ the covariance after scaling would be calculated as $$\sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} (c \cdot x_i - c \cdot \bar{x})(y_i - \bar{y}) = \frac{c}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y}) = c \cdot \sigma_{xy}.$$ Therefore, scaling one attribute by the constant $c$ will rescale the covariance by $c$. So if we scaled $\mathbf{x}$ from pounds to kilograms, the covariance between $\mathbf{x}$ and $\mathbf{y}$ would be $0.453592$ times smaller.
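A quick NumPy check of this result, using made-up weight data and the pounds-to-kilograms factor $c = 0.453592$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_pounds = rng.normal(150, 20, size=1000)        # weights in pounds (toy data)
y = 0.5 * x_pounds + rng.normal(0, 5, size=1000)

c = 0.453592                                     # pounds -> kilograms
x_kg = c * x_pounds

cov_before = np.cov(x_pounds, y)[0, 1]
cov_after = np.cov(x_kg, y)[0, 1]

print(np.isclose(cov_after, c * cov_before))     # True: covariance scales by c
```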

Support Vector Machine (SVM) - Fun and Easy Machine Learning


Concepts of Algorithm, Flow Chart & C Programming

Concepts of Algorithm, Flow Chart & C Programming by Prof. Wongmulin | Dept. of Computer Science Garden City College-Bangalore.

The Best Way to Prepare a Dataset Easily

In this video, I go over the 3 steps you need to prepare a dataset to be fed into a machine learning model. (selecting the data, processing it, and transforming it).

12. Clustering

MIT 6.0002 Introduction to Computational Thinking and Data Science, Fall 2016 View the complete course: Instructor: John Guttag ..

Data Structures and Algorithms Complete Tutorial Computer Education for All

Computer Education for all provides complete lectures series on Data Structure and Applications which covers Introduction to Data Structure and its Types ...

Feature Scaling Formula Solution - Intro to Machine Learning

This video is part of an online course, Intro to Machine Learning. Check out the course here: This course was designed ..

TensorFlow in 5 Minutes (tutorial)

This video is all about building a handwritten digit image classifier in Python in under 40 lines of code (not including spaces and comments). We'll use the ...

Dimensionality Reduction - The Math of Intelligence #5

Most of the datasets you'll find will have more than 3 dimensions. How are you supposed to understand visualize n-dimensional data? Enter dimensionality ...

How MLE (Maximum Likelihood Estimation) algorithm works

In this video I show how the MLE algorithm works. We provide an animation where several points are classified considering three classes with mean and ...

How a Math Algorithm Could Educate the Whole World — for Free | Po-Shen Loh

Po-Shen Loh is a Princeton-educated mathematician, Carnegie Mellon professor, the head coach of the U.S. International Math Olympiad team, and now he's ...