A floating-point number consists of three parts:
1. Sign bit (determines whether the number is positive or negative)
2. Exponent (controls how far the binary point is shifted, i.e. the number's scale)
3. Mantissa/Fraction (the significant bits of the number, after the implicit leading 1)
Basic Formula:
```
Number = (-1)^sign × (1 + mantissa) × 2^(exponent - bias)
```
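For example, here is a minimal sketch of this formula in Python (normal numbers only; zeros, subnormals, infinities and NaNs are ignored for simplicity):
```python
def fp_value(sign, exponent_field, mantissa_fraction, bias=127):
    """Apply (-1)^sign * (1 + mantissa) * 2^(exponent - bias)."""
    return (-1)**sign * (1 + mantissa_fraction) * 2**(exponent_field - bias)

# The bias is 127 for FP32 (use bias=15 for FP16)
print(fp_value(0, 127, 0.0))   # 1.0  ->  1.0 * 2^0
print(fp_value(1, 128, 0.25))  # -2.5 -> -1.25 * 2^1
```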
Let's break down the number 42.5 into FP32 format:
1. First, convert 42.5 to binary:
- 42 = 101010 (in binary)
- 0.5 = 0.1 (in binary)
- So 42.5 = 101010.1 (binary)
2. Normalize the binary (shift the binary point until a single 1 is left of it):
- 101010.1 = 1.010101 × 2^5
- Mantissa becomes: 010101
- Exponent becomes: 5
3. For FP32:
- Sign bit: 0 (positive number)
- Exponent: 5 + 127 (bias) = 132 = 10000100
- Mantissa: 01010100000000000000000
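You can double-check these three fields by dumping the raw bits with Python's standard struct module; a quick sketch:
```python
import struct

# Reinterpret the FP32 encoding of 42.5 as a 32-bit unsigned integer
bits = struct.unpack('>I', struct.pack('>f', 42.5))[0]
b = f"{bits:032b}"

# Slice into the 1 sign bit, 8 exponent bits and 23 mantissa bits
print(b[0], b[1:9], b[9:])
# 0 10000100 01010100000000000000000
```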
The same value, 42.5, in different formats:
1. FP32 (32-bit):
```
Sign  Exponent  Mantissa
0     10000100  01010100000000000000000
```
2. FP16 (16-bit, bias 15, so the exponent field is 5 + 15 = 20 = 10100):
```
Sign  Exponent  Mantissa
0     10100     0101010000
```
3. FP8 (8-bit, E4M3 variant, bias 7):
```
Sign  Exponent  Mantissa
0     1100      011
```
The exponent field is 5 + 7 = 12 = 1100, and with only 3 mantissa bits the fraction 010101 rounds up to 011, so FP8 actually stores 1.011 × 2^5 = 44 rather than 42.5.
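The FP16 encoding can be checked the same way, since struct supports half precision via the 'e' format; FP8 has no standard-library support, so its value is decoded by hand below (assuming the E4M3 layout with bias 7):
```python
import struct

# FP16: 1 sign bit, 5 exponent bits, 10 mantissa bits
h = struct.unpack('>H', struct.pack('>e', 42.5))[0]
b = f"{h:016b}"
print(b[0], b[1:6], b[6:])  # 0 10100 0101010000

# FP8 E4M3 (assumed layout): decode the fields from the table above
sign, exp_field, mant_field = 0, 0b1100, 0b011
print((-1)**sign * (1 + mant_field / 2**3) * 2**(exp_field - 7))  # 44.0
```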
Real-world example:
```python
# Breaking down 42.5 in FP32
sign = 0 # positive
exponent = 5 + 127 # actual exponent + bias
mantissa = 0.328125 # binary fraction .010101 = 1/4 + 1/16 + 1/64
# Calculation
value = (-1)**sign * (1 + mantissa) * (2**(exponent - 127))
# = 1 * (1 + 0.328125) * (2**5)
# = 1.328125 * 32
# = 42.5
```
The tradeoffs:
- More exponent bits = larger range of numbers (very big/small)
- More mantissa bits = more precision (more significant digits)
- FP8 sacrifices both for memory efficiency
- BFLOAT16 keeps FP32's 8 exponent bits (same range) but cuts the mantissa to 7 bits (less precision)
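A small sketch makes the precision side of this tradeoff visible, using struct's 'f' (FP32) and 'e' (FP16) codes to round-trip a value through each format:
```python
import struct

def roundtrip(value, fmt):
    """Pack a Python float into a narrower format and unpack it again."""
    return struct.unpack('>' + fmt, struct.pack('>' + fmt, value))[0]

print(roundtrip(0.1, 'f'))  # 0.10000000149011612 (FP32: ~7 significant digits)
print(roundtrip(0.1, 'e'))  # 0.0999755859375     (FP16: ~3 significant digits)

# Range also shrinks with fewer exponent bits: the largest finite FP16 value is 65504
print(roundtrip(65504.0, 'e'))  # 65504.0
```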
This is why different formats are used for different parts of ML models:
- Weights might use FP16/BF16 for good balance
- Activations might use FP8 for efficiency
- Final results might use FP32 for accuracy