
Open Machine Learning Course. Topic 6. Feature Engineering and Feature Selection

In this course, we have already seen several key machine learning algorithms.

Any experienced professional can recall numerous times when a simple model trained on high-quality data was proven to be better than a complicated multi-model ensemble built on data that wasn’t clean.

To start, I wanted to review three similar but different tasks: feature extraction (turning raw data into features suitable for modeling), feature transformation (changing features to improve the accuracy of an algorithm), and feature selection (removing unnecessary features). This article will contain almost no math, but there will be a fair amount of code.

There are ready-to-use tokenizers that take into account peculiarities of the language, but they make mistakes as well, especially when you work with specific sources of text (newspapers, slang, misspellings, typos).
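As an illustration (my choice of library, not necessarily the course's), here is how an off-the-shelf NLTK tokenizer handles English contractions:

import nltk
nltk.download("punkt", quiet=True)       # tokenizer models
nltk.download("punkt_tab", quiet=True)   # newer NLTK versions ship them under this name
from nltk.tokenize import word_tokenize

print(word_tokenize("Don't you think it's trickier than it looks?"))
# ['Do', "n't", 'you', 'think', 'it', "'s", 'trickier', ...]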

The easiest approach is called Bag of Words: we create a vector with the length of the dictionary, compute the number of occurrences of each word in the text, and place that number of occurrences in the appropriate position in the vector.

In practice, you need to consider stop words, the maximum length of the dictionary, more efficient data structures (usually text data is converted to a sparse vector), etc.
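A minimal Bag of Words sketch with scikit-learn's CountVectorizer, which already handles stop words, dictionary size, and sparse storage (the toy texts are my own):

from sklearn.feature_extraction.text import CountVectorizer

texts = ["no free lunch", "free lunch today", "no lunch for me"]
vectorizer = CountVectorizer(stop_words="english", max_features=1000)
bow = vectorizer.fit_transform(texts)          # sparse matrix, one row per text
print(vectorizer.get_feature_names_out())
print(bow.toarray())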

A more advanced variant is TF-IDF (term frequency-inverse document frequency), which down-weights words that appear in many documents. It cannot be explained in a couple of lines, so you should look into the details in references such as the Wikipedia article.
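For completeness, a short sketch with scikit-learn's ready-made implementation (the exact weighting and normalization options differ slightly from the textbook formula):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["no free lunch", "free lunch today", "no lunch for me"]
tfidf = TfidfVectorizer().fit_transform(texts)
print(tfidf.toarray().round(2))   # "lunch" occurs in every text, so it gets the lowest weight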

Using Word2Vec and similar models, we can not only vectorize words in a high-dimensional space (typically a few hundred dimensions) but also compare their semantic similarity.

This is a classic example of operations that can be performed on vectorized concepts: king - man + woman = queen.

It is worth noting that this model does not comprehend the meaning of the words but simply tries to position the vectors such that words used in common context are close to each other.
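The analogy above is easy to reproduce with gensim and pretrained vectors; the particular model name below is my pick for the sake of a runnable example, not necessarily the one used in the course:

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # pretrained vectors, sizable download
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected: [('queen', ...)]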

A common trick is to take a classifier trained on one dataset and adapt it to a different one by "detaching" the last fully connected layers of the network, adding new layers chosen for the specific task, and then training the network on the new data. If your task is just to vectorize the image (for example, to use some non-network classifier), you only need to remove the last layers and use the output of the previous ones.
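A minimal sketch of such image vectorization with a pretrained network; Keras and VGG16 are my illustrative choices, and the file name is hypothetical:

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet", include_top=False, pooling="avg")   # drop the fully connected head

img = image.load_img("apartment_photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = model.predict(x)    # a 512-dimensional descriptor of the image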

Features generated by hand are still very useful: for example, for predicting the popularity of a rental listing, we can assume that bright apartments attract more attention and create a feature such as "average pixel brightness".

For images, EXIF stores a lot of useful metadata: camera manufacturer and model, resolution, whether the flash was used, geographic coordinates of the shot, the software used to process the image, and more.
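Both ideas, the hand-made brightness feature and reading EXIF tags, take only a few lines with Pillow (the file name is hypothetical):

from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("apartment_photo.jpg")

# average pixel value of the grayscale version as a crude "brightness" feature
avg_brightness = sum(img.convert("L").getdata()) / (img.width * img.height)

# EXIF tags, mapped from numeric ids to human-readable names
exif = {TAGS.get(tag_id, tag_id): value for tag_id, value in img.getexif().items()}
print(avg_brightness, exif.get("Model"), exif.get("DateTime"))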

Geographic data is not so often found in problems, but it is still useful to master the basic techniques for working with it, especially since there are quite a number of ready-to-use solutions in this field.

If you have a small amount of data, enough time, and no desire to extract fancy features, you can use reverse_geocoder in lieu of OpenStreetMap. When working with geocoding, we must not forget that addresses may contain typos, which makes a data cleaning step necessary.

Coordinates contain fewer misprints, but a point's position can still be incorrect due to GPS noise or bad accuracy in places like tunnels, downtown areas, etc.
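A quick sketch of the reverse_geocoder mentioned above; the coordinates are simply an example point in Santiago:

import reverse_geocoder as revgc

print(revgc.search([(-33.4489, -70.6693)]))
# [OrderedDict([..., ('name', 'Santiago'), ..., ('cc', 'CL')])]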

Here, you can really unleash your imagination and invent features based on your life experience and domain knowledge: the proximity of a point to the subway, the number of stories in the building, the distance to the nearest store, the number of ATMs around, etc.

For route-based tasks, such as predicting the duration of a ride, distances (both the great-circle distance and the road distance computed on the routing graph), the number of turns along with the ratio of left to right turns, and the number of traffic lights, junctions, and bridges will be useful.
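The great-circle distance itself is just the textbook haversine formula; here is a small helper, with city coordinates rounded for the example:

from math import radians, sin, cos, asin, sqrt

def great_circle_distance(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Distance in kilometers between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

print(great_circle_distance(55.75, 37.62, 59.93, 30.34))   # Moscow to St. Petersburg, roughly 630 km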

In general, when working with time series data, it is a good idea to have a calendar with public holidays, abnormal weather conditions, and other important events.
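If the data happens to be from the US, pandas already ships such a calendar (for other countries, dedicated packages or a hand-made list are needed):

from pandas.tseries.holiday import USFederalHolidayCalendar

holidays = USFederalHolidayCalendar().holidays(start="2017-01-01", end="2017-12-31")
print(holidays[:3])   # DatetimeIndex of public holidays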

At the same time, if you encode values like the hour of the day as categorical variables, you will generate a large number of features and lose information about proximity: the difference between 22 and 23 will be the same as the difference between 22 and 7.

A more thoughtful option is to project such cyclical values onto a circle, encoding each one by its sine and cosine (a sketch follows below). This transformation preserves the distance between points, which is important for algorithms that estimate distance (kNN, SVM, k-means, ...). In practice, however, the difference between such coding methods usually comes down to the third decimal place in the metric.
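A sketch of that sine/cosine encoding; the period defaults to 24 for hours of the day:

import numpy as np

def make_harmonic_features(value, period=24):
    """Encode a cyclical value (e.g. hour of day) as a point on the unit circle."""
    value = value * 2 * np.pi / period
    return np.cos(value), np.sin(value)

# 23:00 and 01:00 end up close to each other, unlike with the raw hour numbers
print(np.linalg.norm(np.subtract(make_harmonic_features(23), make_harmonic_features(1))))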

Regarding time series — we will not go into too much detail here (mostly due to my personal lack of experience), but I will point you to a useful library that automatically generates features for time series.

By the way, data derived from the IP address combines well with http_accept_language: if the user is sitting behind a Chilean proxy while the browser locale is ru_RU, something is fishy and worth noting in the corresponding column of the table (is_traveler_or_proxy_user).

A simple example: suppose that the task is to predict the cost of an apartment from two variables, the distance from the city center and the number of rooms. The two features live on very different scales, which will mislead any distance-based method unless they are rescaled.

Standard (z-score) scaling does not make the data normally distributed but, to some extent, it protects against outliers. Another fairly popular option is MinMax scaling, which brings all the points within a predetermined interval (typically (0, 1)).
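A quick illustration with scikit-learn's ready-made scalers on data containing an outlier:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

data = np.array([[1.0], [3.0], [2.0], [1000.0]])       # note the outlier

print(StandardScaler().fit_transform(data).ravel())    # zero mean, unit variance
print(MinMaxScaler().fit_transform(data).ravel())      # squeezed into [0, 1]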

If we assume that some data is not normally distributed but is described by the log-normal distribution, it can easily be transformed to a normal distribution by taking the logarithm. The log-normal distribution is suitable for describing salaries, security prices, urban populations, the number of comments on articles on the internet, etc.
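A sketch with synthetic data: the raw sample fails a normality test, while its logarithm passes comfortably.

import numpy as np
from scipy import stats

data = stats.lognorm(s=1.0).rvs(1000, random_state=17)   # synthetic "salary-like" sample
print(stats.shapiro(data)[1])            # p-value near zero: clearly not normal
print(stats.shapiro(np.log(data))[1])    # large p-value: looks normal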

If there are a limited number of features, it is possible to generate all the possible interactions and then weed out the unnecessary ones using the techniques described in the next section.
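One convenient way to generate all pairwise interactions; the course may well build them by hand instead:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False).fit_transform(X)
print(interactions)   # original features plus the product of every pair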

Approaches to handling missing values are pretty straightforward. Easy-to-use library solutions sometimes suggest sticking to something like df = df.fillna(0) and not sweating over the gaps.

But this is not the best solution: data preparation takes more time than building models, so thoughtless gap-filling may hide a bug in processing and damage the model.
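A slightly more careful sketch than fillna(0): fill with the median and keep an explicit "was missing" flag (the column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"rooms": [1, 2, np.nan, 3], "price": [100, 150, 120, np.nan]})

for col in ["rooms", "price"]:
    df[col + "_was_missing"] = df[col].isna().astype(int)   # keep the missingness signal
    df[col] = df[col].fillna(df[col].median())
print(df)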

As long as we work with toy datasets, the size of the data is not a problem; but, for real heavily loaded production systems, hundreds of extra features will be quite tangible.

Two types of models are usually used: some "wooden" composition (a tree-based ensemble) such as Random Forest, or a linear model with Lasso regularization, which is prone to driving the weights of weak features to zero.
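A sketch of such model-based selection with scikit-learn; the synthetic dataset only makes the snippet self-contained:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=17)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=17))
X_reduced = selector.fit_transform(X, y)   # keeps features with above-average importance
print(X.shape, "->", X_reduced.shape)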

Train a model on a subset of features, store results, repeat for different subsets, and compare the quality of models to identify the best feature set.

Fix a small number N, iterate through all combinations of N features, choose the best combination, and then iterate through the combinations of (N + 1) features so that the previous best combination of features is fixed and only a single new feature is considered.

This algorithm can be reversed: start with the complete feature space and remove features one by one, as long as doing so does not impair the quality of the model or until the desired number of features is reached.
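A bare-bones version of the greedy forward selection described above, with cross-validation score as the quality measure and a synthetic dataset for self-containment:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=17)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):                                    # greedily pick the 3 best features
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
print(selected)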
