βš–οΈ Data Scaling Methods#

πŸ‘¨β€πŸ« Vikesh K
πŸ““ Lab-06

💡 “The secret to getting ahead is getting started” 💡

πŸ“Lab Agenda#

  • Importance of Scaling

  • How to do Scaling and its types

  • Scaling comparison

Importance of Scaling#

Scaling is an important pre-processing step for machine learning because it helps to normalize the input data so that it can be more easily compared and processed by the learning algorithm. Here are some reasons why scaling is important:

  • Different scales: Features in the input data may have different scales, i.e., some features may have larger numerical values than others. This can cause the learning algorithm to place undue importance on features with larger numerical values, even if they are not actually more important for the problem at hand. Scaling the features to a common scale can help prevent this.

  • Convergence: Many machine learning algorithms use some form of gradient descent to optimize the model parameters. If the input features are not scaled, the optimization process may take longer to converge or may not converge at all. This is because features with large numerical values may cause the gradients to be too large or too small, making it difficult to find the optimal parameter values.

  • Distance-based algorithms: Some machine learning algorithms, such as k-nearest neighbors and clustering, are based on distances between data points. If the input features are not scaled, the distance calculations may be dominated by features with larger numerical values, leading to biased results (a short sketch below this list illustrates the effect).

  • Regularization: Regularization is a technique used to prevent overfitting in machine learning models. Some regularization techniques, such as L1 regularization, are based on the absolute values of the model parameters. If the input features are not scaled, the regularization may be biased towards features with larger numerical values.

Overall, scaling is important for machine learning because it can help improve the performance and stability of the learning algorithm, reduce training time, and ensure that the model makes meaningful comparisons between the input features.
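To make the distance-based point concrete, here is a minimal sketch. It uses the first two Titanic passengers (Age 22 with Fare 7.25, and Age 38 with Fare 71.28) and approximate standard deviations taken from this dataset; the exact numbers are only illustrative.

import numpy as np

# two passengers as (age in years, fare in dollars)
a = np.array([22.0, 7.25])
b = np.array([38.0, 71.28])

# raw Euclidean distance: the fare gap (~64) dwarfs the age gap (16)
print(np.linalg.norm(a - b))  # ~66.0

# divide each feature by its (approximate) standard deviation first,
# and both features contribute on a comparable scale
scale = np.array([14.5, 49.7])  # approx. std of Age and Fare in this dataset
print(np.linalg.norm(a / scale - b / scale))  # ~1.7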

Importing Modules and Data#

import pandas as pd
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

# suppress warning messages in the notebook output
import warnings
warnings.filterwarnings("ignore")

Note

We will use sample columns of the Titanic data to understand scaling and how it impacts the data.

df = pd.read_csv('https://raw.githubusercontent.com/vkoul/data/main/misc/titanic.csv')
df.head(2)
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171  7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599   71.2833  C85    C

How to do Scaling#

Note

For illustration, we will focus on the Age and Fare columns. The idea is to compare the distributions before and after scaling.

df[['Age', 'Fare']].plot(kind='kde', subplots=True, title="The x-axis ranges for Age and Fare are very different", figsize=(15, 8));
[Figure: KDE plots of Age and Fare; the x-axis ranges are very different]

The two most common scaling techniques are Min-Max Scaling and Standard Scaling. We will first apply Standard Scaling and see its impact on the variables.

df2 = df.copy()  # create a copy so that the original dataset remains untouched
from sklearn.preprocessing import StandardScaler

# instantiate the scaler
ss_scaler = StandardScaler()

# fit the scaler and transform the columns
df2[['Age', 'Fare']] = ss_scaler.fit_transform(df2[['Age', 'Fare']])
df2[['Age', 'Fare']].describe()
Age Fare
count 7.140000e+02 8.910000e+02
mean 2.388379e-16 3.987333e-18
std 1.000701e+00 1.000562e+00
min -2.016979e+00 -6.484217e-01
25% -6.595416e-01 -4.891482e-01
50% -1.170488e-01 -3.573909e-01
75% 5.718310e-01 -2.424635e-02
max 3.465126e+00 9.667167e+00
df2[['Age', 'Fare']].plot(kind='kde', subplots=True, xlim=(-6, 10), title="Distribution of columns post Standard Scaling", figsize=(15, 8));
[Figure: KDE plots of Age and Fare after Standard Scaling]
df3 = df.copy()

Min-Max Scaling

from sklearn.preprocessing import MinMaxScaler

# instantiate the scaler
mm_scaler = MinMaxScaler()

# fit the scaler and transform the columns
df3[['Age', 'Fare']] = mm_scaler.fit_transform(df3[['Age', 'Fare']])
df3[['Age', 'Fare']].describe()
Age Fare
count 714.000000 891.000000
mean 0.367921 0.062858
std 0.182540 0.096995
min 0.000000 0.000000
25% 0.247612 0.015440
50% 0.346569 0.028213
75% 0.472229 0.060508
max 1.000000 1.000000
df3[['Age', 'Fare']].plot(kind='kde', subplots=True, title="Distribution of columns post Min-Max Scaling", figsize=(15, 8), xlim=(0, 1));
[Figure: KDE plots of Age and Fare after Min-Max Scaling]

Comparison between the two methods#

We will take each column, Age and Fare, apply both techniques to it individually, and compare the results.

Age Analysis#

age = df[['Age']].dropna()

Min-Max Scaling#

Min-Max normalization is a technique used to rescale a feature in the data to a specific range, such as [0, 1]. The formula for min-max normalization is as follows:

X_norm = (X - X_min) / (X_max - X_min)

where X is the original feature, X_min is the minimum value for that feature, X_max is the maximum value for that feature, and X_norm is the normalized feature.

This formula transforms each feature in the data to a new scale, based on the minimum and maximum values for that feature. The resulting normalized feature will have values in the range [0, 1], with 0 representing the minimum value for that feature and 1 representing the maximum value.

It is important to note that min-max normalization is sensitive to outliers: a single extreme value stretches the range and compresses the remaining observations into a narrow band. If the data is not skewed and has a roughly Gaussian distribution, it may be more appropriate to use a technique such as standard scaling, which transforms the data so that it has a mean of 0 and a standard deviation of 1.
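A quick toy example with made-up numbers shows this sensitivity; one outlier pins the maximum and squeezes everything else toward 0.

import numpy as np

# one outlier (100) dominates the range, compressing the rest near 0
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.     0.0101 0.0202 0.0303 1.    ]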

from sklearn.preprocessing import MinMaxScaler

# instantiate the scaler
mm_scaler = MinMaxScaler()

# fit the scaler and transform the column
age['mm_scaled'] = mm_scaler.fit_transform(age[['Age']])
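As a sanity check, we can apply the formula above by hand and confirm that it matches the scaler's output; this is a minimal sketch using the age frame defined earlier.

# manual min-max: X_norm = (X - X_min) / (X_max - X_min)
manual = (age['Age'] - age['Age'].min()) / (age['Age'].max() - age['Age'].min())
print((manual - age['mm_scaled']).abs().max())  # ~0, the two agree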

Standard Scaling#

Standard scaling, also known as standardization, is a technique used in machine learning to transform a feature to have a mean of 0 and a standard deviation of 1. This is useful because it allows features with different units of measurement to be compared on the same scale.

The formula for standard scaling is as follows:

X_scaled = (X - X_mean) / X_std

where X is the original feature, X_mean is the mean of that feature, X_std is the standard deviation of that feature, and X_scaled is the standardized feature.

from sklearn.preprocessing import StandardScaler

# instantiate the scaler
ss_scaler = StandardScaler()

# fit the scaler and transform the column
age['ss_scaled'] = ss_scaler.fit_transform(age[['Age']])
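We can verify this formula by hand as well. One subtlety: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' .std() defaults to the sample standard deviation (ddof=1), so we pass ddof=0 to match.

# manual standardization; ddof=0 matches StandardScaler's denominator
manual = (age['Age'] - age['Age'].mean()) / age['Age'].std(ddof=0)
print((manual - age['ss_scaled']).abs().max())  # ~0, the two agree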
age.head()
Age mm_scaled ss_scaled
0 22.0 0.271174 -0.530377
1 38.0 0.472229 0.571831
2 26.0 0.321438 -0.254825
3 35.0 0.434531 0.365167
4 35.0 0.434531 0.365167
age.describe()
Age mm_scaled ss_scaled
count 714.000000 714.000000 7.140000e+02
mean 29.699118 0.367921 2.338621e-16
std 14.526497 0.182540 1.000701e+00
min 0.420000 0.000000 -2.016979e+00
25% 20.125000 0.247612 -6.595416e-01
50% 28.000000 0.346569 -1.170488e-01
75% 38.000000 0.472229 5.718310e-01
max 80.000000 1.000000 3.465126e+00
for col in age.columns:
    print(col)
    age[col].plot(kind='kde', figsize=(12, 8), xlim=(age[col].min(), age[col].max()))
    plt.axvline(age[col].mean(), color='red', label='mean')
    plt.axvline(age[col].median(), color='green', label='median')
    plt.axvline(age[col].max(), color='brown', label='max')
    plt.axvline(age[col].min(), color='orange', label='min')
    plt.legend()
    plt.show()
[Figures: KDE plots of Age, mm_scaled, and ss_scaled, each with vertical lines marking the mean, median, min, and max]

Fare Analysis#

The Fare column is heavily right-skewed: the mean (32.20) is more than twice the median (14.45), and the maximum (512.33) sits far out in the right tail.

fare = df[['Fare']].copy()  # copy so that the original dataset remains untouched
from sklearn.preprocessing import MinMaxScaler

# instantiate the scaler
mm_scaler = MinMaxScaler()

# fit the scaler and transform the column
fare['mm_scaled'] = mm_scaler.fit_transform(fare[['Fare']])
from sklearn.preprocessing import StandardScaler

# instantiate the scaler
ss_scaler = StandardScaler()

# fit the scaler and transform the column
fare['ss_scaled'] = ss_scaler.fit_transform(fare[['Fare']])
fare.describe()
Fare mm_scaled ss_scaled
count 891.000000 891.000000 8.910000e+02
mean 32.204208 0.062858 3.987333e-18
std 49.693429 0.096995 1.000562e+00
min 0.000000 0.000000 -6.484217e-01
25% 7.910400 0.015440 -4.891482e-01
50% 14.454200 0.028213 -3.573909e-01
75% 31.000000 0.060508 -2.424635e-02
max 512.329200 1.000000 9.667167e+00
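Both scalers are linear transformations, so they change only the location and spread of the distribution, never its shape. A quick check with pandas' skew() (a small sketch, not part of the original lab) confirms the skewness is identical across the raw and scaled versions:

# skewness is invariant under linear rescaling, so all three columns match
print(fare[['Fare', 'mm_scaled', 'ss_scaled']].skew())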
for col in fare.columns:
    print(col)
    # plot only the rows with a positive fare
    fare.query('Fare > 0')[col].plot(kind='kde', figsize=(12, 8))
    plt.show()
[Figures: KDE plots of Fare, mm_scaled, and ss_scaled for positive fares]