Let's use a simplified example with just 2 data points and walk through the process with actual numbers. This will help illustrate how gradients are calculated and accumulated for a batch.
Let's assume we have a very simple model with a single parameter w, currently set to 1.0. The model's prediction is w * x, the loss is the squared error, and we're using basic gradient descent with a learning rate of 0.1.
Data points:
- x1 = 2, y1 = 4
- x2 = 3, y2 = 5
Batch size = 2 (both data points in one batch)
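If you want to follow along in code, here is a minimal plain-Python sketch of this setup (the variable names xs, ys, and learning_rate are just illustrative choices, not from any particular library):

```python
# Single-parameter model: prediction = w * x
w = 1.0             # current weight
learning_rate = 0.1
xs = [2.0, 3.0]     # inputs  x1, x2
ys = [4.0, 5.0]     # targets y1, y2
```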
Step 1: Forward pass
- For x1: prediction = w * x1 = 1.0 * 2 = 2
- For x2: prediction = w * x2 = 1.0 * 3 = 3
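In code, the forward pass is just the model applied to each input (values are hard-coded so the snippet runs on its own):

```python
w = 1.0
xs = [2.0, 3.0]
predictions = [w * x for x in xs]
print(predictions)   # [2.0, 3.0]
```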
Step 2: Calculate losses
- Loss1 = (prediction1 - y1)^2 = (2 - 4)^2 = 4
- Loss2 = (prediction2 - y2)^2 = (3 - 5)^2 = 4
- Total batch loss = (Loss1 + Loss2) / 2 = (4 + 4) / 2 = 4
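The per-example losses and the mean batch loss, computed from the predictions above:

```python
predictions = [2.0, 3.0]
ys = [4.0, 5.0]
losses = [(p - y) ** 2 for p, y in zip(predictions, ys)]
batch_loss = sum(losses) / len(losses)
print(losses, batch_loss)   # [4.0, 4.0] 4.0
```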
Step 3: Backward pass (calculate gradients)
- Gradient1 = 2 * (prediction1 - y1) * x1 = 2 * (2 - 4) * 2 = -8
- Gradient2 = 2 * (prediction2 - y2) * x2 = 2 * (3 - 5) * 3 = -12
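The per-example gradients come from differentiating the squared error with respect to w:

```python
xs = [2.0, 3.0]
predictions = [2.0, 3.0]   # w * x with w = 1.0
ys = [4.0, 5.0]
# d/dw (w*x - y)^2 = 2 * (w*x - y) * x
gradients = [2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)]
print(gradients)   # [-8.0, -12.0]
```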
Step 4: Accumulate gradients
- Total gradient = (Gradient1 + Gradient2) / 2 = (-8 + -12) / 2 = -10
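Averaging the per-example gradients gives the single gradient used for the whole batch:

```python
gradients = [-8.0, -12.0]
total_gradient = sum(gradients) / len(gradients)
print(total_gradient)   # -10.0
```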
Step 5: Update weight (once for the batch)
- New w = old w - learning_rate * total gradient
- New w = 1.0 - 0.1 * (-10) = 2.0
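And the one weight update for the batch:

```python
w = 1.0
learning_rate = 0.1
total_gradient = -10.0
w = w - learning_rate * total_gradient
print(w)   # 2.0
```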
So, after processing this batch of 2 data points:
- We calculated 2 individual gradients (-8 and -12)
- We accumulated these into one total gradient (-10)
- We performed one weight update, changing w from 1.0 to 2.0
This process would then repeat for the next batch. In this case, we've processed all our data, so this completes one epoch.
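If you'd like to cross-check these numbers with an autograd library, here is a sketch using PyTorch (not part of the hand calculation above; it assumes torch is installed and uses the mean squared error over the batch, which matches the averaging we did by hand):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
xs = torch.tensor([2.0, 3.0])
ys = torch.tensor([4.0, 5.0])

predictions = w * xs                         # forward pass
loss = torch.mean((predictions - ys) ** 2)   # batch loss = 4.0
loss.backward()                              # backward pass: w.grad holds the averaged gradient, -10.0

with torch.no_grad():
    w -= 0.1 * w.grad                        # one update for the batch: w becomes 2.0

print(loss.item(), w.grad.item(), w.item())  # 4.0 -10.0 2.0
```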