Loss Scaling Free
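During training, the loss value of a neural network can vary greatly, especially when using large batch sizes or complex models. In a low-precision format such as FP16, this can cause problems with the gradients: very small values underflow to zero, and very large values overflow to infinity.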
    # Train the model
    for epoch in range(num_epochs):
        # Forward pass
        predictions = model(inputs)
        loss = loss_fn(labels, predictions)
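There are several types of loss scaling techniques. The two most common are static scaling, which uses a fixed factor, and dynamic scaling, which adjusts the factor during training.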
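Dynamic loss scaling, one commonly used variant, can be sketched as follows. This is a minimal illustration, not any library's implementation: the class name `DynamicLossScaler`, its parameters, and the default values are assumptions chosen for the example. The idea is to halve the scale whenever a non-finite gradient appears (and skip that step), and to grow it again after a run of clean steps.

```python
import math


class DynamicLossScaler:
    """Hypothetical helper illustrating dynamic loss scaling."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without inf/NaN gradients

    def update(self, grads):
        """Adjust the scale; return True if the optimizer step should run."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow detected: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
print(scaler.update([0.1, float("inf")]), scaler.scale)
```

A static scaler is the special case where the scale never changes; the dynamic version trades a few skipped steps for not having to pick the factor by hand.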
To fix this, developers use loss scaling, which multiplies the loss by a large factor before backpropagation to push gradients into a representable range, then divides the gradients back down before the optimizer step.
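The effect of scaling can be seen on a single number. The sketch below uses Python's standard `struct` module, whose `"e"` format rounds a value to IEEE 754 half precision. The helper name `to_fp16`, the gradient value, and the scale factor are illustrative assumptions. Because of the chain rule, multiplying the loss by a factor multiplies every gradient by the same factor, so the example scales the gradient directly.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_grad = 1e-8        # below FP16's smallest subnormal (about 5.96e-8)
loss_scale = 2.0 ** 16  # illustrative scale factor

# Without scaling, the gradient is rounded to zero in FP16.
unscaled_fp16 = to_fp16(true_grad)

# With scaling, the value lands in FP16's representable range...
scaled_fp16 = to_fp16(true_grad * loss_scale)

# ...and is divided back down in higher precision before the update.
recovered = scaled_fp16 / loss_scale

print(unscaled_fp16, recovered)
```

The recovered value is not exact, since FP16 keeps only about three decimal digits, but it is no longer zero, which is what lets the weights keep updating.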
BF16 offers the speed of FP16 but generally eliminates the need for loss scaling. It allows you to treat mixed precision as a "set it and forget it" configuration, letting you focus on model architecture rather than floating-point arithmetic.
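BF16 keeps the same 8 exponent bits as FP32, so a value that underflows in FP16 usually survives in BF16 at reduced precision. The sketch below emulates a BF16 conversion by keeping only the upper 16 bits of the FP32 encoding. The helper name `to_bf16_truncate` is an assumption, and truncation is a simplification: real hardware typically rounds to nearest instead.

```python
import struct


def to_bf16_truncate(x: float) -> float:
    """Emulate BF16 by zeroing the low 16 bits of the FP32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFF0000  # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


small_grad = 1e-8
print("FP16:", to_fp16(small_grad))           # underflows to 0.0
print("BF16:", to_bf16_truncate(small_grad))  # nonzero, coarse precision
```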
Loss-scaling-free training refers to training in FP16 mixed precision without any explicit loss scaling, achieved by leveraging a combination of hardware features, data formats, and algorithmic insights.
To understand why we needed loss scaling in the first place, we have to look at the IEEE 754 standard for floating-point numbers.
This is the underflow problem: a gradient smaller than the smallest value FP16 can represent is rounded to zero. If your gradients zero out, your weights don't update, and your model stops learning.
| Format | Exponent Bits | Mantissa Bits | Dynamic Range (approx) |
|--------|---------------|---------------|------------------------|
| FP16 | 5 | 10 | 5.96e-8 to 65504 |
| BF16 | 8 | 7 | 1.18e-38 to 3.4e38 |
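The ranges in the table follow from the bit counts. The sketch below derives them for an IEEE-style binary format with the usual exponent bias; the function names are assumptions made for this example. Note that the table's FP16 lower bound corresponds to the smallest subnormal value, while its BF16 lower bound corresponds to the smallest normal value.

```python
def exponent_bias(exp_bits: int) -> int:
    """Standard IEEE-style bias: 2^(exp_bits - 1) - 1."""
    return 2 ** (exp_bits - 1) - 1


def max_finite(exp_bits: int, man_bits: int) -> float:
    """Largest finite value: all-ones mantissa at the top usable exponent."""
    return (2.0 - 2.0 ** -man_bits) * 2.0 ** exponent_bias(exp_bits)


def min_normal(exp_bits: int) -> float:
    """Smallest positive normal value."""
    return 2.0 ** (1 - exponent_bias(exp_bits))


def min_subnormal(exp_bits: int, man_bits: int) -> float:
    """Smallest positive subnormal value."""
    return 2.0 ** (1 - exponent_bias(exp_bits) - man_bits)


for name, e, m in [("FP16", 5, 10), ("BF16", 8, 7)]:
    print(name, min_subnormal(e, m), min_normal(e), max_finite(e, m))
```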