1/07/2025

FP32, TF32, FP16, BFLOAT16, FP8

 


A floating-point number consists of three parts:

1. Sign bit (determines whether the number is positive or negative)

2. Exponent (controls how far the binary point is shifted)

3. Mantissa/Fraction (the significant digits of the number)


Basic Formula:

```

Number = (-1)^sign × (1 + mantissa) × 2^(exponent - bias)

```
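
To make the three parts concrete, here is a minimal sketch (standard library only) that pulls the raw sign, exponent, and mantissa bits out of an FP32 value; the helper name fp32_fields is just for illustration:

```python
import struct

def fp32_fields(x: float):
    # Reinterpret the float's 32-bit pattern as an unsigned integer
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 fraction bits (the implicit leading 1 is not stored)
    return sign, exponent, mantissa

print(fp32_fields(42.5))   # (0, 132, 2752512) -> unbiased exponent is 132 - 127 = 5
```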


Let's break down the number 42.5 into FP32 format:

1. First, convert 42.5 to binary:

   - 42 = 101010 (in binary)

   - 0.5 = 0.1 (in binary)

   - So 42.5 = 101010.1 (binary)


2. Normalize the binary (shift the binary point until a single 1 is left of it):

   - 101010.1 = 1.010101 × 2^5

   - Mantissa becomes: 010101

   - Exponent becomes: 5


3. For FP32:

   - Sign bit: 0 (positive number)

   - Exponent: 5 + 127 (bias) = 132 = 10000100

   - Mantissa: 01010100000000000000000
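
As a quick sanity check of these steps (standard library only), float.hex shows the normalized mantissa and the unbiased exponent directly:

```python
# 0x1.54p+5 means 1.01010100... (binary) x 2^5 -- the same normalization as above
print(float.hex(42.5))          # 0x1.5400000000000p+5

# The stored (biased) FP32 exponent: 5 + 127 = 132
print(format(5 + 127, "08b"))   # 10000100
```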


Example in different formats:


1. FP32 (32-bit: 1 sign, 8 exponent, 23 mantissa bits):

```

Sign    Exponent     Mantissa

0       10000100    01010100000000000000000

```


2. FP16 (16-bit: 1 sign, 5 exponent, 10 mantissa bits; bias 15):

```

Sign    Exponent  Mantissa

0       10100     0101010000

```


3. FP8 (8-bit, E4M3 layout: 1 sign, 4 exponent, 3 mantissa bits; bias 7):

```

Sign    Exponent  Mantissa

0       1100      011

```

(With only 3 mantissa bits, 42.5 is not exactly representable; the nearest E4M3 value is 1.011 × 2^5 = 44.)
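
A minimal check of the FP16 pattern, assuming NumPy is installed (NumPy has no built-in FP8 type, so the 8-bit case above is worked out by hand):

```python
import numpy as np

# Reinterpret the 16-bit half-precision value as a raw unsigned integer
bits = int(np.array(42.5, dtype=np.float16).view(np.uint16))
s = f"{bits:016b}"
print(s[0], s[1:6], s[6:])   # 0 10100 0101010000
```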


Real-world example:

```python

# Breaking down 42.5 in FP32

sign = 0  # positive

exponent = 5 + 127  # actual exponent + bias

mantissa = 0.328125  # fraction bits 010101 -> binary 0.010101 = 0.328125


# Calculation

value = (-1)**sign * (1 + mantissa) * (2**(exponent - 127))

# = 1 * (1 + 0.328125) * (2**5)

# = 1.328125 * 32

# = 42.5

```


The tradeoffs:

- More exponent bits = larger range of representable numbers (very big or very small)

- More mantissa bits = more precision (more significant digits)

- FP8 sacrifices both range and precision for memory and compute efficiency

- TF32 keeps FP32's 8 exponent bits but only a 10-bit mantissa (NVIDIA tensor-core format)

- BFLOAT16 keeps FP32's 8 exponent bits (same range) but cuts the mantissa to 7 bits (less precision)
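
A small illustration of the range-versus-precision tradeoff, assuming NumPy is available (bfloat16 is not built into NumPy, so it appears only in the comments):

```python
import numpy as np

# Range: FP16's largest finite value is about 65504, so this overflows
print(np.float16(70000.0))      # inf
print(np.float32(70000.0))      # 70000.0 (BF16 would also hold this, since it keeps 8 exponent bits)

# Precision: FP16's 10 mantissa bits keep roughly 3 decimal digits
print(np.float16(3.14159265))   # ~3.14
print(np.float32(3.14159265))   # ~3.1415927
```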


This is why different formats are used for different parts of ML models:

- Weights might use FP16/BF16 for a good balance of range and precision

- Activations might use FP8 for memory and bandwidth efficiency

- Final results (and accumulations) might use FP32 for accuracy
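
For context, this is roughly how these choices show up in practice; a sketch assuming PyTorch with a CUDA GPU (the model and numbers here are hypothetical, not from the post):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()    # FP32 "master" weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # guards FP16 gradients against underflow

x = torch.randn(32, 1024, device="cuda")
target = torch.randn(32, 1024, device="cuda")

# Matrix multiplies inside the context run in FP16; numerically sensitive ops stay in FP32
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # backprop through the scaled loss
scaler.step(optimizer)          # unscale the gradients, then update the FP32 weights
scaler.update()
```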

