Thursday, August 3, 2023

AI models

 https://alvinntnu.github.io/python-notes/nlp/sentiment-analysis-dl.html

https://medium.com/data-science-365

standard deviation

Normalization vs Standardization Explained

The Theta1 feature's scale is smaller than Theta2's.

The difference in feature scales causes the optimizer (gradient descent) to oscillate.

Compute the distance: the larger-scale (horizontal) feature dominates the smaller-scale (vertical) one, so the features need to be standardized.

Normalize along both axes.

Compute the distance again: both features now contribute roughly equally to the calculated distance, and the algorithm that uses the data is no longer dominated by the feature with the larger scale.
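
To make this concrete, here is a minimal sketch (my own illustration, not from the linked posts) of how the larger-scale feature dominates a Euclidean distance, and how standardizing removes that effect:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales:
# column 0 is in the thousands, column 1 in single digits.
X = np.array([[1000.0, 1.0],
              [2000.0, 2.0],
              [3000.0, 3.0]])

# Raw Euclidean distance between the first two samples is
# dominated almost entirely by the large-scale feature.
print(np.linalg.norm(X[0] - X[1]))   # ~1000.0

# Standardize each column to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Now both features contribute equally to the distance.
print(np.linalg.norm(X_std[0] - X_std[1]))   # ~1.73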


https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler

How and why to Standardize your data: A python tutorial

I assume that you have a matrix X where each row/line is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function by the way -- X.shape should be [number_of_samples, number_of_features]).

Core of method

The main idea is to normalize/standardize (i.e. μ = 0 and σ = 1) your features/variables/columns of X individually, before applying any machine learning model.

StandardScaler() will normalize the features, i.e. each column of X, INDIVIDUALLY, so that each column/feature/variable will have μ = 0 and σ = 1.

P.S.: I find the most upvoted answer on this page wrong. It says "each value in the dataset will have the sample mean value subtracted" -- this is not correct, because StandardScaler subtracts each column's own mean, not a single mean over the whole dataset.
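
A quick way to see this (my own snippet, not part of the quoted answer): StandardScaler stores one mean per column, which is not the same as the overall mean of the dataset.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 10.0],
              [2.0, 30.0]])

# One mean per column/feature...
print(StandardScaler().fit(X).mean_)   # [ 1. 20.]
# ...not a single mean over every value in the dataset.
print(X.mean())                        # 10.5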

See also: How and why to Standardize your data: A python tutorial

Example with code

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
scaler = StandardScaler()
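# fit computes each column's mean and standard deviation; transform applies (x - mean) / std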
scaled_data = scaler.fit_transform(data)

print(data)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])
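
The same result can be reproduced by hand with the standardization formula from the appendix below (a sanity check I added, not part of the original answer):

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

# z = (x - μ) / σ, computed per column
mu = data.mean(axis=0)     # [0.5, 0.5]
sigma = data.std(axis=0)   # [0.5, 0.5] (population std, which StandardScaler also uses)
manual = (data - mu) / sigma

print(np.allclose(manual, StandardScaler().fit_transform(data)))   # True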

Appendix: The maths

For each feature (column) x of X, StandardScaler computes

z = (x - μ) / σ

where μ is the mean of that column and σ is its standard deviation, so every standardized column ends up with μ = 0 and σ = 1.

PCA in Scikit-learn – Principal Component Analysis (with Python Example)


seaborn python plotting

https://medium.com/data-science-365/an-in-depth-guide-to-pca-with-numpy-1fb128535b3e

https://medium.com/data-science-365/principal-component-analysis-pca-with-scikit-learn-1e84a0c731b0
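
The two posts above go into the details; as a minimal, self-contained sketch (my own, with made-up data), standardizing first and then fitting scikit-learn's PCA looks like this:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # made-up data: 100 samples, 3 features
X[:, 2] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)    # make one feature correlated with another

# Standardize first so no feature dominates just because of its scale,
# then keep the top 2 principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)    # share of variance captured by each component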
