Feature Engineering¶
All of the examples so far assume that you have numerical data in a tidy [n_samples, n_features] format. That may be the case when we load data from a built-in Scikit-Learn dataset, but real-world data rarely comes in such a form.
Hence, one of the most important steps in using machine learning in practice is feature engineering: that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.
Note
Feature engineering is a fancy term for "creating new variables from existing variables". For example, if we have a dataset with a column containing the date of birth of a person, we can create a new column containing the age of the person by subtracting the date of birth from the current date.
Note
Proper data cleaning is also an important part of feature engineering. Some common data cleaning tasks include:
- Missing values: Most machine learning algorithms cannot handle missing values. There are several ways to deal with them, including removing the observations with missing values, imputing them with the mean or median (see the sketch after this list), or using a machine learning algorithm that can handle missing values.
- Outliers: Outliers are observations that are far away from the rest of the observations. Outliers can have a significant effect on the model, so it is important to detect and (if it is justified to do so) remove them.
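For example, here is a minimal sketch of mean imputation using Scikit-Learn's SimpleImputer (the toy array is made up for illustration):
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with a missing value in the first column
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imputer = SimpleImputer(strategy='mean')  # replace NaNs with the column mean
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# [[1. 2.]
#  [4. 3.]
#  [7. 6.]]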
Transformers¶
Scikit-Learn provides classes known as transformers, which are estimators (i.e., they inherit from the BaseEstimator base class) that can transform data instead of making predictions.
Note
Transformers are estimators, so they also have the fit method. However, for a transformer this method is used to learn the parameters of the transformation (e.g., the mean of the data in a StandardScaler), not to train a model. Transformers also don't have a predict method, since they don't make predictions; instead, they have a transform method, which is used to transform the data.
Transformers are typically used to preprocess the data before training the model, and hence they are an important part of feature engineering.
Note
We'll discuss some examples below, but bear in mind that it is not necessary to use transformers or Scikit-Learn to preprocess the data. Many of these transformations can be done "by hand", for example in Pandas, before the data goes into Scikit-Learn.
Note
These transformers have nothing to do with the Transformer architecture used in deep learning.
Standard scaler¶
The StandardScaler is a transformer used to standardize the data before training a model. It removes the mean of each feature and scales it to unit variance, computing the z-score of every value in the dataset:
$$
z = \frac{x - u}{s}
$$
where $u$ is the mean of the training samples (or zero if with_mean=False), and $s$ is the standard deviation of the training samples (or one if with_std=False):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
Note
The fit_transform method is a combination of the fit and transform methods: it first fits the transformer to the data (which in this case, although perhaps confusing, means computing the mean and standard deviation) and then transforms the data. This is equivalent to calling fit and then transform separately.
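In practice, the scaler should be fitted on the training data only, and the same learned parameters reused to transform the test data. A minimal sketch (the toy arrays are made up for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.5, 150.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform in one call
X_test_scaled = scaler.transform(X_test)        # reuse the training mean/std

print(scaler.mean_)   # per-feature means learned from X_train: [2. 200.]
print(scaler.scale_)  # per-feature standard deviations learned from X_train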
Ordinal encoder¶
The OrdinalEncoder is a transformer used to encode categorical features that have a natural ordering as a numeric array. For example, if we have a categorical feature with three possible values that have a meaningful order, e.g. low, medium, and high, the ordinal encoding would be:
low | medium | high |
---|---|---|
0 | 1 | 2 |
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({'feature': ['low', 'medium', 'high']})
# Pass the categories explicitly in their meaningful order; by default,
# OrdinalEncoder sorts them alphabetically (high=0, low=1, medium=2)
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_train_encoded = encoder.fit_transform(X_train)
print(X_train_encoded)
# Output:
[[0.]
 [1.]
 [2.]]
One-hot encoder¶
The OneHotEncoder is a transformer used to encode categorical features that do not have a natural ordering as a one-hot numeric array (i.e., a binary array with a single 1 and many 0s). For example, if we have a categorical feature with three possible values, a, b, and c, the one-hot encoding would be:
a | b | c |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
X_train = pd.DataFrame({'feature': ['a', 'b', 'c']})
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so we convert it to print
X_train_encoded = encoder.fit_transform(X_train)
print(X_train_encoded.toarray())
# Output:
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
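As noted earlier, the same transformation can also be done directly in Pandas with get_dummies; a minimal sketch:
import pandas as pd

X_train = pd.DataFrame({'feature': ['a', 'b', 'c']})
dummies = pd.get_dummies(X_train)  # one indicator column per category
print(dummies.columns.tolist())
# ['feature_a', 'feature_b', 'feature_c']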
Tfidf vectorizer¶
The TfidfVectorizer is a transformer used to convert text data into a numerical feature matrix. It is a combination of the CountVectorizer and TfidfTransformer transformers.
from sklearn.feature_extraction.text import TfidfVectorizer
X_train = ['This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?']
vectorizer = TfidfVectorizer()
X_train_encoded = vectorizer.fit_transform(X_train)
Note
You can read more about TF-IDF (term frequency-inverse document frequency) in the Scikit-Learn documentation.
To see the vocabulary that was learned by the vectorizer, we can use the get_feature_names_out method (named get_feature_names in Scikit-Learn versions before 1.0):
print(list(vectorizer.get_feature_names_out()))
# Output:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
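Once fitted, the vectorizer can reuse the learned vocabulary to transform unseen documents (continuing the example above; words that were not in the training vocabulary are simply ignored):
X_new = vectorizer.transform(['This is a brand new document.'])
print(X_new.shape)  # one row, one column per vocabulary term
# Output:
(1, 9)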
Types of features¶
The following sections describe a few general tricks of feature engineering.
Categorical features¶
One common type of non-numerical data is categorical data. To use this data in a machine learning model, we need to convert this categorical feature to a numerical feature.
If the categorical feature has no natural ordering, we can use one-hot encoding to convert it to a numerical feature.
If the categorical feature has a natural ordering, we can use ordinal encoding to map it to integers that preserve that order.
Text features¶
Another common type of non-numerical data is text data. To use this data in a machine learning model, we need to convert this text feature to a numerical feature. This can be done using several NLP techniques. Check bag-of-words, TF-IDF, and word embeddings for more information.
Date features¶
Another common type of non-numerical data is date data. To use this data in a machine learning model, we need to convert this date feature to a numerical feature. This can be done by extracting the year, month, day, etc. from the date and using them as numerical features.
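For example, with a Pandas datetime column (the dates are made up for illustration):
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2021-01-15', '2021-06-30'])})
df['year'] = df['date'].dt.year            # e.g., 2021
df['month'] = df['date'].dt.month          # 1-12
df['day'] = df['date'].dt.day              # 1-31
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6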
Date features are very important in time series, which naturally have seasonality and trends. For this reason, it is sometimes a good idea to add extra dummy variables that express which dates are close to each other (e.g., the day of the week, the week number, etc.).
Note
An example: we can use the week numbers as features, but with this ordinal encoding we can't express that week 52 and week 1 are actually close to one another. However, we could also add a new feature that is the week number shifted by 26 (modulo 52), and then the model might be able to understand that weeks "live" on a circle instead of on a line.
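A closely related and commonly used trick, not mentioned above, is to encode the circular structure directly with sine and cosine features:
import numpy as np
import pandas as pd

df = pd.DataFrame({'week': [1, 26, 52]})
df['week_sin'] = np.sin(2 * np.pi * df['week'] / 52)
df['week_cos'] = np.cos(2 * np.pi * df['week'] / 52)
# Weeks 52 and 1 now map to nearby points on the unit circle,
# unlike in the raw ordinal encoding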
Image features¶
Another common type of data are images. Although they might look like non-numerical data, images are actually matrices of numbers. To use this data in a machine learning, the easiest way is to flatten the image and use the pixels as numerical features (often even deleting the color channels).
Then, we can "spaghettify" the image by flattening the image and converting it to a 1D vector.
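A minimal sketch with NumPy (random noise stands in for a real image):
import numpy as np

image = np.random.rand(32, 32, 3)  # hypothetical 32x32 RGB image
gray = image.mean(axis=2)          # drop the color channels by averaging them
features = gray.reshape(-1)        # "spaghettify" into a 1D feature vector
print(features.shape)
# Output:
(1024,)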
Derived features¶
Sometimes, we can create new features from existing features. For example, if we have a dataset with a column containing the date of birth of a person, we can create a new column containing the age of the person by subtracting the date of birth from the current date.
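A minimal sketch of that derivation in Pandas (the birth dates are made up for illustration):
import pandas as pd

df = pd.DataFrame({'date_of_birth': pd.to_datetime(['1990-05-01', '1985-12-24'])})
# Approximate age in years (365.25 accounts for leap years)
df['age'] = (pd.Timestamp.today() - df['date_of_birth']).dt.days / 365.25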
For numerical data, we can often apply mathematical transformations to create new features that have a better behavior for the model, or that change the data distribution to one that is more suitable. This needs to be done on a case-by-case basis, and requires some domain knowledge.
Note
Example: imagine that we have a dataset with a column containing the price of a house. We can create a new column containing the logarithm of the price. This could make the data distribution more suitable for the model by shrinking the heavy right tail of the distribution.
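A minimal sketch of the log transformation from the note (the prices are made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100_000, 250_000, 1_500_000, 12_000_000]})
df['log_price'] = np.log(df['price'])  # compresses the long right tail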
Feature selection¶
In some cases, we might have too many features, which makes the model too complex and costly to maintain. In such cases we will need to select only the most important ones. This can be achieved in several ways:
- With dimensionality reduction techniques, such as PCA (see the sketch below).
- With feature importance techniques, such as the SHAP values (and simpler alternatives).
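As an illustration of the first option, a minimal PCA sketch (the random matrix is a stand-in for a real feature matrix):
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)  # hypothetical data: 100 samples, 10 features
pca = PCA(n_components=2)    # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
# Output:
(100, 2)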
Note
Rules of thumb to know if a feature is important:
1. Add noise to it and see if the model performance decreases. If it does, the feature is probably important.
2. Create new features with values drawn from a random distribution. Then, train a model and scrap all features that perform similarly to the random noise features.
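The first rule of thumb is essentially permutation importance (shuffling a feature rather than adding noise to it), which Scikit-Learn implements directly. A minimal sketch on synthetic data:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
# Shuffle each feature in turn and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # larger drop = more important feature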