4/02/2025

Explanation of data normalization and min/max calculation.

Let me explain how a specific normalized feature value is calculated using one concrete example.

Let's take the feature "GroupSize" which has:

  • Min value: -0.045121
  • Max value: 103.032967

These values are post-normalization, but we can work backwards to understand how they were calculated.

The Normalization Formula

The normalization function you're using is:

normalized_features = (features - mean) / std

Where:

  • features are the original, raw values
  • mean is the average of all values for that feature in the training set
  • std is the standard deviation of all values for that feature in the training set
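
As a minimal sketch of that formula in plain Python: the statistics are computed once on the training values and then reused for any other data. The helper names fit_stats and normalize and the sample values are invented here for illustration, not taken from your code. The sketch assumes the population standard deviation (dividing by N); whether your pipeline divides by N or N - 1 is worth checking.

```python
from statistics import fmean, pstdev


def fit_stats(train_values):
    """Return (mean, std) computed from the training values only."""
    # pstdev is the population standard deviation (divides by N).
    return fmean(train_values), pstdev(train_values)


def normalize(values, mean, std):
    """Apply (x - mean) / std to each value."""
    return [(x - mean) / std for x in values]


# Hypothetical training values for a single feature.
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_stats(train)

train_norm = normalize(train, mean, std)
# Held-out data reuses the training mean and std; it is not re-fitted.
held_out_norm = normalize([5.0, 11.0], mean, std)

print(f"mean={mean}, std={std}")
print(held_out_norm)
```

A value equal to the training mean maps to 0, and every other value is expressed as a number of training standard deviations above or below it.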

Working Through An Example

Let's say we have these raw values for GroupSize in the training set:

  • Raw values: [0, 0, 0, 0, 0, 1, 1, 1, 32, 64]

First, we calculate the mean:

  • Mean = (0+0+0+0+0+1+1+1+32+64)/10 = 9.9

Then we calculate the standard deviation:

  • Each deviation: [-9.9, -9.9, -9.9, -9.9, -9.9, -8.9, -8.9, -8.9, 22.1, 54.1]
  • Squared deviations: [98.01, 98.01, 98.01, 98.01, 98.01, 79.21, 79.21, 79.21, 488.41, 2926.81]
  • Average squared deviation: 4142.9/10 = 414.29
  • Standard deviation = √414.29 ≈ 20.35

Now, we can normalize each value:

  • For 0: (0 - 9.9) / 20.35 ≈ -0.486
  • For 1: (1 - 9.9) / 20.35 ≈ -0.437
  • For 32: (32 - 9.9) / 20.35 ≈ 1.086
  • For 64: (64 - 9.9) / 20.35 ≈ 2.658
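
The same steps can be checked with Python's standard library. This is only a verification sketch: pstdev divides by N, matching the hand calculation above, and the full-precision results can differ in the last digit from hand-rounded figures.

```python
from statistics import fmean, pstdev

# Raw GroupSize values from the worked example.
raw = [0, 0, 0, 0, 0, 1, 1, 1, 32, 64]

mean = fmean(raw)    # arithmetic mean
std = pstdev(raw)    # population standard deviation (divides by N)
normalized = [(x - mean) / std for x in raw]

print(f"mean = {mean:.2f}, variance = {std ** 2:.2f}, std = {std:.3f}")
for x in sorted(set(raw)):
    print(f"{x:>2} -> {(x - mean) / std:+.3f}")
```

After normalization the values have mean 0 and standard deviation 1, which is a quick way to confirm the arithmetic.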

Explaining the Min/Max Values

Going back to your data:

  • The min value for GroupSize (-0.045121) is what the smallest raw value in your dataset becomes after normalization
  • The max value (103.032967) is what the largest raw value becomes after normalization

For GroupSize, this extreme range suggests:

  1. Your raw data has a wide range of values
  2. The high maximum suggests outliers that are far from the mean, creating a highly skewed distribution
  3. The standard deviation is small relative to the distance between the maximum value and the mean

Concrete Calculation

If we assume the mean of raw GroupSize is μ and standard deviation is σ, then:

  • Minimum normalized value: (min_raw - μ) / σ = -0.045121
  • Maximum normalized value: (max_raw - μ) / σ = 103.032967

This tells us that your maximum raw value is about 103 standard deviations above the mean, which is extremely far, while your minimum raw value is only about 0.045 standard deviations below it. Together these indicate that your raw data has a heavily skewed distribution with significant outliers.
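
The two equations contain four unknowns (min_raw, max_raw, μ, σ), so the normalized min and max alone cannot pin down the raw statistics. If you do know the raw minimum and maximum, subtracting one equation from the other gives σ, and μ follows. A sketch under that assumption: recover_stats is a hypothetical helper, and the raw range of 0 to 1000 is a made-up placeholder, not a value from your data.

```python
def recover_stats(min_norm, max_norm, min_raw, max_raw):
    """Solve (x - mean) / std = z for mean and std, given two (x, z) pairs."""
    # Subtracting the two equations removes the mean:
    #   (max_raw - min_raw) / std = max_norm - min_norm
    std = (max_raw - min_raw) / (max_norm - min_norm)
    # Substitute back into the minimum equation to get the mean.
    mean = min_raw - min_norm * std
    return mean, std


# Normalized min/max reported for GroupSize; the raw range is ASSUMED.
mean, std = recover_stats(-0.045121, 103.032967, min_raw=0.0, max_raw=1000.0)
print(f"mean = {mean:.4f}, std = {std:.4f}")

# Round trip: the assumed raw extremes map back to the reported values.
print((0.0 - mean) / std, (1000.0 - mean) / std)
```

Substituting the real raw minimum and maximum for the placeholders gives the mean and standard deviation that the normalization used.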

The fact that most normalized values for GroupSize are close to the minimum (-0.045121) suggests that the most common value is slightly below the mean, while a few extreme outliers are pulling the mean upward.
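
To see how this pattern can arise, here is a small experiment on synthetic data. The values are invented for illustration (mostly zeros plus a handful of large outliers) and are not meant to reproduce your GroupSize column.

```python
from statistics import fmean, median, pstdev

# Synthetic, invented data: 9,990 zeros plus ten large outliers.
raw = [0] * 9990 + [500] * 10

mean = fmean(raw)
std = pstdev(raw)    # population standard deviation (divides by N)
normalized = [(x - mean) / std for x in raw]

# Fraction of raw values that fall below the mean.
share_below_mean = sum(x < mean for x in raw) / len(raw)

print(f"mean={mean:.3f}  median={median(raw)}  std={std:.3f}")
print(f"normalized min={min(normalized):.4f}  max={max(normalized):.2f}")
print(f"share of values below the mean: {share_below_mean:.1%}")
```

In this sample the normalized minimum sits just below zero while the maximum lies dozens of standard deviations out. That is the same shape as the GroupSize figures, although the exact numbers differ.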

This type of skewed distribution is exactly why techniques like masking and autoencoder approaches are beneficial - they can help the model learn robust representations even with such extreme distributions.