βš–οΈ Data Scaling Methods#

πŸ‘¨β€πŸ« Vikesh K
πŸ““ Lab-06

💡 “The secret to getting ahead is getting started” 💡

πŸ“Lab Agenda#

  • Importance of Scaling

  • How to do Scaling and its types

  • Scaling comparison

Importance of Scaling#

Scaling is an important pre-processing step for machine learning because it helps to normalize the input data so that it can be more easily compared and processed by the learning algorithm. Here are some reasons why scaling is important:

  • Different scales: Features in the input data may have different scales, i.e., some features may have larger numerical values than others. This can cause the learning algorithm to place undue importance on features with larger numerical values, even if they are not actually more important for the problem at hand. Scaling the features to a common scale can help prevent this.

  • Convergence: Many machine learning algorithms use some form of gradient descent to optimize the model parameters. If the input features are not scaled, the optimization process may take longer to converge or may not converge at all. This is because features with large numerical values may cause the gradients to be too large or too small, making it difficult to find the optimal parameter values.

  • Distance-based algorithms: Some machine learning algorithms, such as k-nearest neighbors and clustering, are based on distances between data points. If the input features are not scaled, the distance calculations may be dominated by features with larger numerical values, leading to biased results (a short sketch below this list illustrates the effect).

  • Regularization: Regularization is a technique used to prevent overfitting in machine learning models. Some regularization techniques, such as L1 regularization, are based on the absolute values of the model parameters. If the input features are not scaled, the regularization may be biased towards features with larger numerical values.

Overall, scaling is important for machine learning because it can help improve the performance and stability of the learning algorithm, reduce training time, and ensure that the model makes meaningful comparisons between the input features.
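To make the distance-based point concrete, here is a minimal sketch. It uses the first two Titanic passengers (Age 22 with Fare 7.25, and Age 38 with Fare 71.28) and approximate standard deviations taken from this dataset; the exact numbers are only illustrative.

import numpy as np

# two passengers as (age in years, fare in dollars)
a = np.array([22.0, 7.25])
b = np.array([38.0, 71.28])

# raw Euclidean distance: the fare gap (~64) dwarfs the age gap (16)
print(np.linalg.norm(a - b))  # ~66.0

# divide each feature by its (approximate) standard deviation first,
# and both features contribute on a comparable scale
scale = np.array([14.5, 49.7])  # approx. std of Age and Fare in this dataset
print(np.linalg.norm(a / scale - b / scale))  # ~1.7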

Importing Modules and Data#

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

# suppress warning messages in the notebook output
import warnings
warnings.filterwarnings("ignore")

Note

We will use sample columns of the Titanic data to understand scaling and how it impacts the data.

df = pd.read_csv('https://raw.githubusercontent.com/vkoul/data/main/misc/titanic.csv')
df.head(2)
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171  7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599   71.2833  C85    C

How to do Scaling#

Note

For illustration, we will focus on the Age and Fare columns. The idea is to compare the distributions before and after scaling.

df[['Age', 'Fare']].plot(kind='kde', subplots=True, title="The x-axis ranges for Age and Fare are very different", figsize=(15, 8));
[Figure: KDE plots of Age and Fare; the x-axis ranges are very different]

The two most common scaling techniques are Min-Max Scaling and Standard Scaling. We will first apply Standard Scaling and see its impact on the variables.

df2 = df.copy()  # create a copy so that the original dataset remains untouched
from sklearn.preprocessing import StandardScaler

# instantiate the scaler
ss_scaler = StandardScaler()

# fit the scaler and transform the columns
df2[['Age', 'Fare']] = ss_scaler.fit_transform(df2[['Age', 'Fare']])
df2[['Age', 'Fare']].describe()
Age Fare
count 7.140000e+02 8.910000e+02
mean 2.388379e-16 3.987333e-18
std 1.000701e+00 1.000562e+00
min -2.016979e+00 -6.484217e-01
25% -6.595416e-01 -4.891482e-01
50% -1.170488e-01 -3.573909e-01
75% 5.718310e-01 -2.424635e-02
max 3.465126e+00 9.667167e+00
df2[['Age', 'Fare']].plot(kind='kde', subplots=True, xlim=(-6, 10), title="Distribution of columns post Standard Scaling", figsize=(15, 8));
[Figure: KDE plots of Age and Fare after Standard Scaling]
df3 = df.copy()

Min-Max Scaling

from sklearn.preprocessing import MinMaxScaler

# instantiate the scaler
mm_scaler = MinMaxScaler()

# fit the scaler and transform the columns
df3[['Age', 'Fare']] = mm_scaler.fit_transform(df3[['Age', 'Fare']])
df3[['Age', 'Fare']].describe()
Age Fare
count 714.000000 891.000000
mean 0.367921 0.062858
std 0.182540 0.096995
min 0.000000 0.000000
25% 0.247612 0.015440
50% 0.346569 0.028213
75% 0.472229 0.060508
max 1.000000 1.000000
df3[['Age', 'Fare']].plot(kind='kde', subplots=True, title="Distribution of columns post Min-Max Scaling", figsize=(15, 8), xlim=(0, 1));
[Figure: KDE plots of Age and Fare after Min-Max Scaling]

Comparison between the two methods#

We will take each column, Age and Fare, apply both techniques to it individually, and compare the results.

Age Analysis#

age = df[['Age']].dropna()

Min-Max Scaling#

Min-Max normalization is a technique used to rescale a feature in the data to a specific range, such as [0, 1]. The formula for min-max normalization is as follows:

X_norm = (X - X_min) / (X_max - X_min)

where X is the original feature, X_min is the minimum value for that feature, X_max is the maximum value for that feature, and X_norm is the normalized feature.

This formula transforms each feature in the data to a new scale, based on the minimum and maximum values for that feature. The resulting normalized feature will have values in the range [0, 1], with 0 representing the minimum value for that feature and 1 representing the maximum value.

It is important to note that min-max normalization is sensitive to outliers: a single extreme value stretches the range and compresses the remaining observations into a narrow band. If the data is not skewed and has a roughly Gaussian distribution, it may be more appropriate to use a technique such as standard scaling, which transforms the data so that it has a mean of 0 and a standard deviation of 1.
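A quick toy example with made-up numbers shows this sensitivity; one outlier pins the maximum and squeezes everything else toward 0.

import numpy as np

# one outlier (100) dominates the range, compressing the rest near 0
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.     0.0101 0.0202 0.0303 1.    ]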

from sklearn.preprocessing import MinMaxScaler

# instantiate the scaler
mm_scaler = MinMaxScaler()

# fit the scaler and transform the column
age['mm_scaled'] = mm_scaler.fit_transform(age[['Age']])
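As a sanity check, we can apply the formula above by hand and confirm that it matches the scaler's output; this is a minimal sketch using the age frame defined earlier.

# manual min-max: X_norm = (X - X_min) / (X_max - X_min)
manual = (age['Age'] - age['Age'].min()) / (age['Age'].max() - age['Age'].min())
print((manual - age['mm_scaled']).abs().max())  # ~0, the two agree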

Standard Scaling#

Standard scaling, also known as standardization, is a technique used in machine learning to transform a feature to have a mean of 0 and a standard deviation of 1. This is useful because it allows features with different units of measurement to be compared on the same scale.

The formula for standard scaling is as follows:

X_scaled = (X - X_mean) / X_std

where X is the original feature, X_mean is the mean of that feature, X_std is the standard deviation of that feature, and X_scaled is the standardized feature.

from sklearn.preprocessing import StandardScaler

# instantiate the scaler
ss_scaler = StandardScaler()

# fit the scaler and transform the column
age['ss_scaled'] = ss_scaler.fit_transform(age[['Age']])
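We can verify this formula by hand as well. One subtlety: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' .std() defaults to the sample standard deviation (ddof=1), so we pass ddof=0 to match.

# manual standardization; ddof=0 matches StandardScaler's denominator
manual = (age['Age'] - age['Age'].mean()) / age['Age'].std(ddof=0)
print((manual - age['ss_scaled']).abs().max())  # ~0, the two agree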
age.head()
Age mm_scaled ss_scaled
0 22.0 0.271174 -0.530377
1 38.0 0.472229 0.571831
2 26.0 0.321438 -0.254825
3 35.0 0.434531 0.365167
4 35.0 0.434531 0.365167
age.describe()
Age mm_scaled ss_scaled
count 714.000000 714.000000 7.140000e+02
mean 29.699118 0.367921 2.338621e-16
std 14.526497 0.182540 1.000701e+00
min 0.420000 0.000000 -2.016979e+00
25% 20.125000 0.247612 -6.595416e-01
50% 28.000000 0.346569 -1.170488e-01
75% 38.000000 0.472229 5.718310e-01
max 80.000000 1.000000 3.465126e+00
for col in age.columns:
    print(col)
    age[col].plot(kind='kde', figsize=(12, 8), xlim=(age[col].min(), age[col].max()))
    plt.axvline(age[col].mean(), color='red', label='mean')
    plt.axvline(age[col].median(), color='green', label='median')
    plt.axvline(age[col].max(), color='brown', label='max')
    plt.axvline(age[col].min(), color='orange', label='min')
    plt.legend()
    plt.show()
[Figures: KDE plots of Age, mm_scaled, and ss_scaled, each with vertical lines marking the mean, median, min, and max]

Fare Analysis#

The Fare column is heavily right-skewed: the mean (32.20) is more than twice the median (14.45), and the maximum (512.33) sits far out in the right tail.

fare = df[['Fare']].copy()  # copy so that the original dataset remains untouched
from sklearn.preprocessing import MinMaxScaler

# instantiate the scaler
mm_scaler = MinMaxScaler()

# fit the scaler and transform the column
fare['mm_scaled'] = mm_scaler.fit_transform(fare[['Fare']])
from sklearn.preprocessing import StandardScaler

# instantiate the scaler
ss_scaler = StandardScaler()

# fit the scaler and transform the column
fare['ss_scaled'] = ss_scaler.fit_transform(fare[['Fare']])
fare.describe()
Fare mm_scaled ss_scaled
count 891.000000 891.000000 8.910000e+02
mean 32.204208 0.062858 3.987333e-18
std 49.693429 0.096995 1.000562e+00
min 0.000000 0.000000 -6.484217e-01
25% 7.910400 0.015440 -4.891482e-01
50% 14.454200 0.028213 -3.573909e-01
75% 31.000000 0.060508 -2.424635e-02
max 512.329200 1.000000 9.667167e+00
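Both scalers are linear transformations, so they change only the location and spread of the distribution, never its shape. A quick check with pandas' skew() (a small sketch, not part of the original lab) confirms the skewness is identical across the raw and scaled versions:

# skewness is invariant under linear rescaling, so all three columns match
print(fare[['Fare', 'mm_scaled', 'ss_scaled']].skew())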
for col in fare.columns:
    print(col)
    # plot only the rows with a positive fare
    fare.query('Fare > 0')[col].plot(kind='kde', figsize=(12, 8))
    plt.show()
[Figures: KDE plots of Fare, mm_scaled, and ss_scaled for positive fares]