1/07/2025

FP32, TF32, FP16, BFLOAT16, FP8

 


A floating-point number consists of three parts:

1. Sign bit (determines whether the number is positive or negative)

2. Exponent (controls how far the binary point is shifted)

3. Mantissa/Fraction (the significant digits of the number)


Basic Formula:

```

Number = (-1)^sign × (1 + mantissa) × 2^(exponent - bias)

```
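
To make the three parts concrete, here is a minimal sketch (standard library only) that pulls the raw sign, exponent, and mantissa bits out of an FP32 value; the helper name fp32_fields is just for illustration:

```python
import struct

def fp32_fields(x: float):
    # Reinterpret the float's 32-bit pattern as an unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 fraction bits (the implicit leading 1 is not stored)
    return sign, exponent, mantissa

print(fp32_fields(42.5))   # (0, 132, 2752512) -> unbiased exponent is 132 - 127 = 5
```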


Let's break down the number 42.5 into FP32 format:

1. First, convert 42.5 to binary:

   - 42 = 101010 (in binary)

   - 0.5 = 0.1 (in binary)

   - So 42.5 = 101010.1 (binary)


2. Normalize the binary (shift the binary point until a single 1 is left of it):

   - 101010.1 = 1.010101 × 2^5

   - Mantissa becomes: 010101

   - Exponent becomes: 5


3. For FP32:

   - Sign bit: 0 (positive number)

   - Exponent: 5 + 127 (bias) = 132 = 10000100

   - Mantissa: 01010100000000000000000
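
As a quick sanity check of these steps (standard library only), float.hex shows the normalized mantissa and the unbiased exponent directly:

```python
# 0x1.54p+5 means 1.01010100... (binary) x 2^5 -- the same normalization as above
print(float.hex(42.5))          # 0x1.5400000000000p+5

# The stored (biased) FP32 exponent: 5 + 127 = 132
print(format(5 + 127, "08b"))   # 10000100
```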


Example in different formats:


1. FP32 (32-bit: 1 sign, 8 exponent, 23 mantissa bits):

```

Sign    Exponent     Mantissa

0       10000100    01010100000000000000000

```


2. FP16 (16-bit: 1 sign, 5 exponent, 10 mantissa bits; bias 15):

```

Sign    Exponent  Mantissa

0       10100     0101010000

```


3. FP8 (8-bit, E4M3 layout: 1 sign, 4 exponent, 3 mantissa bits; bias 7):

```

Sign    Exponent  Mantissa

0       1100      011

```

(With only 3 mantissa bits, 42.5 is not exactly representable; the nearest E4M3 value is 1.011 × 2^5 = 44.)
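
A minimal check of the FP16 pattern, assuming NumPy is installed (NumPy has no built-in FP8 type, so the 8-bit case above is worked out by hand):

```python
import numpy as np

# Reinterpret the 16-bit half-precision value as a raw unsigned integer
bits = int(np.array(42.5, dtype=np.float16).view(np.uint16))
s = f"{bits:016b}"
print(s[0], s[1:6], s[6:])   # 0 10100 0101010000
```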


Real-world example:

```python

# Breaking down 42.5 in FP32

sign = 0  # positive

exponent = 5 + 127  # actual exponent + bias

mantissa = 0.328125  # fraction bits 010101 -> binary 0.010101 = 0.328125


# Calculation

value = (-1)**sign * (1 + mantissa) * (2**(exponent - 127))

# = 1 * (1 + 0.328125) * (2**5)

# = 1.328125 * 32

# = 42.5

```


The tradeoffs:

- More exponent bits = larger range of representable numbers (very big or very small)

- More mantissa bits = more precision (more significant digits)

- FP8 sacrifices both range and precision for memory and compute efficiency

- TF32 keeps FP32's 8 exponent bits but only a 10-bit mantissa (NVIDIA tensor-core format)

- BFLOAT16 keeps FP32's 8 exponent bits (same range) but cuts the mantissa to 7 bits (less precision)
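
A small illustration of the range-versus-precision tradeoff, assuming NumPy is available (bfloat16 is not built into NumPy, so it appears only in the comments):

```python
import numpy as np

# Range: FP16's largest finite value is about 65504, so this overflows
print(np.float16(70000.0))      # inf
print(np.float32(70000.0))      # 70000.0 (BF16 would also hold this, since it keeps 8 exponent bits)

# Precision: FP16's 10 mantissa bits keep roughly 3 decimal digits
print(np.float16(3.14159265))   # ~3.14
print(np.float32(3.14159265))   # ~3.1415927
```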


This is why different formats are used for different parts of ML models:

- Weights might use FP16/BF16 for a good balance of range and precision

- Activations might use FP8 for memory and bandwidth efficiency

- Final results (and accumulations) might use FP32 for accuracy
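
For context, this is roughly how these choices show up in practice; a sketch assuming PyTorch with a CUDA GPU (the model and numbers here are hypothetical, not from the post):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()    # FP32 "master" weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # guards FP16 gradients against underflow

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# Matrix multiplies inside the context run in FP16; numerically sensitive ops stay in FP32
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # backprop through the scaled loss
scaler.step(optimizer)          # unscale the gradients, then update the FP32 weights
scaler.update()
```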

