A Quick Guide to PyTorch Loss Functions

I keep running into the same problem when I train neural networks: my model learns something, but how do I actually measure whether it is learning the right thing? The loss function is the answer to that question. It is the single number that tells a model how far off its predictions are from the truth, and every weight update in the network tries to make that number smaller.
This article covers PyTorch’s built-in loss functions, from basic ones like MSELoss and CrossEntropyLoss to specialized losses like HuberLoss and TripletMarginLoss. By the end, you will know which loss to reach for in different training scenarios and how to wire them up in your training loop.
TLDR
- Loss functions measure how wrong a model’s predictions are
- Use MSELoss for regression, CrossEntropyLoss for multi-class classification
- BCEWithLogitsLoss combines sigmoid and binary cross-entropy in one numerically stable call
- Custom losses are just nn.Module subclasses with a forward method
- Pair NLLLoss with LogSoftmax, and use KL Divergence for distribution matching tasks
What are Loss Functions in Deep Learning?
A loss function takes the model’s predictions and the true labels, then boils them down to a single scalar value. During training, PyTorch computes this loss after each forward pass, then runs backpropagation to calculate gradients for every weight in the network. Those gradients tell each weight how much it should increase or decrease to reduce the loss on the next batch.
Smaller loss means better predictions. If the loss drops over time, the model is learning. If it plateaus or rises, something is wrong with the data, the learning rate, or the loss choice itself. The loss is the training signal, so picking the right one matters more than almost any other architectural decision.
PyTorch ships every common loss function under the torch.nn namespace. All of them inherit from nn.Module, which means they plug straight into the training loop just like any other layer.
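To make that feedback cycle concrete, here is a minimal sketch using a toy one-weight linear model (a hypothetical setup, not one of the article's examples): the loss is a single scalar, loss.backward() fills in a gradient for the weight, and the sign of that gradient says which direction the weight should move.
import torch
import torch.nn as nn
model = nn.Linear(1, 1, bias=False)     # one weight, randomly initialized
criterion = nn.MSELoss()
x = torch.tensor([[2.0]])
y_true = torch.tensor([[4.0]])
prediction = model(x)                   # forward pass
loss = criterion(prediction, y_true)    # single scalar measuring how wrong the prediction is
loss.backward()                         # backpropagation computes d(loss)/d(weight)
print(loss.item(), model.weight.grad)   # values vary with the random initial weight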
Regression Losses: MSELoss and L1Loss
Regression problems predict continuous values. If your model outputs a raw number (house prices, temperatures, stock returns), you typically reach for one of two main losses.
L1Loss, also called Mean Absolute Error, computes the average absolute difference between predicted and true values. It is robust to outliers because it does not square the errors, so a single wildly wrong prediction does not dominate the loss.
import torch
import torch.nn as nn
criterion = nn.L1Loss()
predictions = torch.tensor([1.2, 3.4, 2.1])
targets = torch.tensor([1.0, 3.0, 2.0])
loss = criterion(predictions, targets)
print(loss)
tensor(0.2333)
MSELoss, or Mean Squared Error, squares each error before averaging. This punishes large errors far more than small ones, which pushes the model to avoid big misses. The trade-off is that outliers in the training data can dominate the loss and its gradients. SmoothL1Loss, which behaves like L1 for large errors and L2 for small ones, is also available in torch.nn as a middle ground.
import torch
import torch.nn as nn
criterion = nn.MSELoss()
predictions = torch.tensor([1.2, 3.4, 2.1])
targets = torch.tensor([1.0, 3.0, 2.0])
loss = criterion(predictions, targets)
print(loss)
tensor(0.0700)
Classification Losses: CrossEntropyLoss, BCEWithLogitsLoss, NLLLoss
Classification tasks group predictions into discrete categories. The right loss depends on how many classes you have and how the model outputs its predictions.
CrossEntropyLoss is the workhorse for any classification task with more than two classes. It combines LogSoftmax and NLLLoss into a single call, which is both more numerically stable and more convenient than chaining them manually. It expects raw logits as input, not probabilities. Under the hood, cross-entropy loss measures how close the model’s predicted probability distribution is to the one-hot ground truth distribution.
import torch
import torch.nn as nn
criterion = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5, -1.0], [-0.5, 1.5, 0.0]])
targets = torch.tensor([0, 2])
loss = criterion(logits, targets)
print(loss)
tensor(1.0238)
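To see that combination claim directly, running the same logits through LogSoftmax and then NLLLoss reproduces the value above (a quick check, reusing the tensors from the previous snippet):
import torch
import torch.nn as nn
log_softmax = nn.LogSoftmax(dim=1)
nll = nn.NLLLoss()
logits = torch.tensor([[2.0, 0.5, -1.0], [-0.5, 1.5, 0.0]])
targets = torch.tensor([0, 2])
loss = nll(log_softmax(logits), targets)  # same result as CrossEntropyLoss above
print(loss)
tensor(1.0238)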
For binary classification, BCEWithLogitsLoss takes raw logits and applies the sigmoid and binary cross-entropy in one step. Passing logits directly to this loss is more numerically stable than applying a sigmoid yourself and feeding the resulting probabilities to BCELoss, because the sigmoid is fused with the loss calculation.
import torch
import torch.nn as nn
criterion = nn.BCEWithLogitsLoss()
logits = torch.tensor([1.5, -0.8, 3.2])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = criterion(logits, targets)
print(loss)
tensor(0.2042)
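As a sanity check on that fusion claim, applying torch.sigmoid by hand and passing the probabilities to nn.BCELoss gives the same number, just computed less stably for extreme logits (reusing the tensors above):
import torch
import torch.nn as nn
criterion = nn.BCELoss()
logits = torch.tensor([1.5, -0.8, 3.2])
targets = torch.tensor([1.0, 0.0, 1.0])
loss = criterion(torch.sigmoid(logits), targets)  # sigmoid applied manually
print(loss)
tensor(0.2042)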
NLLLoss expects log-probabilities (the output of LogSoftmax) as input. It is usually paired with a LogSoftmax layer in the model’s forward pass rather than being used standalone. The loss decreases as the model assigns higher probability to the correct class. Reach for it instead of CrossEntropyLoss when you want explicit control over the softmax step, for example to apply a temperature or extra regularization before taking the log.
import torch
import torch.nn as nn
criterion = nn.NLLLoss()
log_probs = torch.tensor([[-0.2, -1.5, -0.5], [-0.8, -0.1, -2.0]])  # hand-picked stand-ins for LogSoftmax output
targets = torch.tensor([0, 1])
loss = criterion(log_probs, targets)
print(loss)
tensor(0.1500)
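In practice the LogSoftmax usually lives inside the model rather than being applied by hand. A small sketch of that pairing (the layer sizes here are arbitrary, for illustration only):
import torch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(4, 3),       # raw logits
    nn.LogSoftmax(dim=1),  # convert to log-probabilities inside the forward pass
)
criterion = nn.NLLLoss()
inputs = torch.randn(2, 4)
targets = torch.tensor([0, 2])
loss = criterion(model(inputs), targets)  # value varies with the random weights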
KL Divergence and Advanced Losses
Beyond basic regression and classification, PyTorch provides losses for more specialized training scenarios.
KL Divergence measures how much one probability distribution diverges from a reference distribution. KLDivLoss expects the model output in log-probability form and, by default, the target as plain probabilities. It shows up in variational autoencoders, knowledge distillation, and any task where you want to match a predicted distribution to a target distribution rather than predict a single label.
import torch
import torch.nn as nn
criterion = nn.KLDivLoss(reduction='batchmean')
pred_log_probs = torch.log(torch.tensor([[0.4, 0.3, 0.3], [0.2, 0.7, 0.1]]))
targets = torch.tensor([[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]])
loss = criterion(pred_log_probs, targets)
print(loss)
tensor(0.0290)
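For the knowledge distillation case, a common pattern is to soften both the teacher’s and the student’s logits with a temperature before computing the KL term. A rough sketch, where teacher_logits, student_logits, and the temperature T are hypothetical stand-ins and full distillation recipes usually add a hard-label cross-entropy term as well:
import torch
import torch.nn as nn
import torch.nn.functional as F
teacher_logits = torch.tensor([[4.0, 1.0, 0.5]])  # pretend output of a large teacher model
student_logits = torch.tensor([[2.5, 1.5, 0.0]])  # pretend output of the student being trained
T = 2.0  # temperature: higher values soften both distributions
criterion = nn.KLDivLoss(reduction='batchmean')
loss = criterion(
    F.log_softmax(student_logits / T, dim=1),  # student prediction in log-probability form
    F.softmax(teacher_logits / T, dim=1),      # teacher target as plain probabilities
)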
HuberLoss behaves like L1Loss for errors beyond a threshold and like MSELoss for errors below it. This makes it more stable near the optimum, where gradients are small, while staying robust to outliers. The threshold is controlled by the delta parameter. SmoothL1Loss is closely related: with its default beta of 1.0 it produces the same values as HuberLoss with delta=1.0, and it is the box-regression loss used in the SSD object detection model.
import torch
import torch.nn as nn
criterion = nn.HuberLoss(delta=1.0)
predictions = torch.tensor([1.0, 2.5, 10.0])
targets = torch.tensor([1.1, 2.0, 8.0])
loss = criterion(predictions, targets)
print(loss)
tensor(0.5433)
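A quick check of that relationship, reusing the tensors above: nn.SmoothL1Loss with its default beta produces the same value as HuberLoss with delta=1.0.
import torch
import torch.nn as nn
criterion = nn.SmoothL1Loss()  # beta defaults to 1.0
predictions = torch.tensor([1.0, 2.5, 10.0])
targets = torch.tensor([1.1, 2.0, 8.0])
loss = criterion(predictions, targets)
print(loss)
tensor(0.5433)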
TripletMarginLoss is used in metric learning and face recognition. It takes an anchor, a positive example (same class as anchor), and a negative example (different class), then tries to make the anchor closer to the positive than to the negative by at least a margin. CosineEmbeddingLoss uses cosine distance instead of Euclidean distance, which makes it useful when the direction of the embedding vector matters more than its magnitude.
import torch
import torch.nn as nn
criterion = nn.TripletMarginLoss(margin=1.0, p=2)
anchor = torch.tensor([[1.0, 0.0, 0.0]])
positive = torch.tensor([[0.0, 1.0, 0.0]])
negative = torch.tensor([[0.0, 0.0, 1.0]])
loss = criterion(anchor, positive, negative)
print(loss)
tensor(1.0000)
import torch
import torch.nn as nn
criterion = nn.CosineEmbeddingLoss(margin=0.5)
input1 = torch.tensor([[1.0, 0.5, 0.2], [1.0, 0.0, 0.0]])
input2 = torch.tensor([[0.9, 0.6, 0.3], [0.0, 1.0, 0.0]])
target = torch.tensor([1.0, -1.0])
loss = criterion(input1, input2, target)
print(loss)
tensor(0.0058)
Custom Loss Functions and Training Loops
Sometimes none of the built-in losses fit your task. A custom loss is just an nn.Module with a forward method that takes predictions and targets and returns a scalar tensor. You can use any PyTorch operations inside, including other nn.Module layers if you need learnable parameters in the loss itself. These custom losses become especially useful when training GANs, where you need separate generator and discriminator losses, or in multi-task learning where different outputs should be weighted differently.
import torch
import torch.nn as nn
class WeightedMSELoss(nn.Module):
    def __init__(self, weights):
        super().__init__()
        self.weights = weights

    def forward(self, predictions, targets):
        squared_error = (predictions - targets) ** 2
        weighted_error = squared_error * self.weights
        return weighted_error.mean()
weights = torch.tensor([1.0, 2.0, 1.0])
criterion = WeightedMSELoss(weights)
predictions = torch.tensor([1.2, 3.5, 2.0])
targets = torch.tensor([1.0, 3.0, 2.0])
loss = criterion(predictions, targets)
print(loss)
tensor(0.1800)
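For the multi-task case mentioned above, the usual recipe is a weighted sum of per-head losses. A brief sketch, where the two heads, the tensors, and the 0.5/1.0 weights are all hypothetical:
import torch
import torch.nn as nn
reg_criterion = nn.MSELoss()           # loss for a regression head
cls_criterion = nn.CrossEntropyLoss()  # loss for a classification head
reg_preds = torch.randn(8, 1, requires_grad=True)
reg_targets = torch.randn(8, 1)
cls_logits = torch.randn(8, 3, requires_grad=True)
cls_targets = torch.randint(0, 3, (8,))
total_loss = 0.5 * reg_criterion(reg_preds, reg_targets) + 1.0 * cls_criterion(cls_logits, cls_targets)
total_loss.backward()  # one backward call sends gradients to both heads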
All PyTorch loss functions integrate the same way into a training loop. You compute the loss, call loss.backward() to compute gradients, then optimizer.step() to update weights. Here is a minimal example with CrossEntropyLoss for a multi-class classification problem; the printed value will vary from run to run because the model weights and inputs are random.
import torch
import torch.nn as nn
model = nn.Linear(10, 3)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
batch_inputs = torch.randn(32, 10)
batch_targets = torch.randint(0, 3, (32,))
optimizer.zero_grad()
outputs = model(batch_inputs)
loss = criterion(outputs, batch_targets)
print(f"Loss: {loss.item():.4f}")
loss.backward()
optimizer.step()
Loss: 1.1385
FAQ
Q: When should I use HuberLoss over MSELoss?
HuberLoss is preferred when the training data contains outliers. MSELoss squares errors, so a single extreme outlier can dominate the gradient and destabilize training. HuberLoss caps the squared penalty at a threshold, making it behave like L1Loss for large errors.
Q: What is the difference between CrossEntropyLoss and BCEWithLogitsLoss?
CrossEntropyLoss handles multi-class classification with more than two output classes. BCEWithLogitsLoss handles binary classification with a single output, and also multi-label setups where each output is an independent yes/no decision. Both accept raw logits, but the underlying math differs because multi-class and binary classification have different probability structures.
Q: Can I use multiple loss functions in one training step?
Yes. Multi-task learning often uses a weighted sum of multiple losses, one per output head. You compute each loss independently, then add them with appropriate weights. The combined scalar is what backpropagation sees.
Q: What loss should I use for object detection?
Object detection models typically combine multiple losses. A common setup is classification loss (CrossEntropyLoss) for the class prediction combined with regression loss (SmoothL1Loss) for the bounding box coordinates. The total loss is a weighted sum of both.
Q: How do I handle imbalanced classes in classification?
Pass a weight tensor to CrossEntropyLoss where underrepresented classes have higher weights. Alternatively, use weighted sampling to ensure each batch contains proportionally more examples from minority classes during training.
Summary
PyTorch’s torch.nn namespace covers the vast majority of training scenarios out of the box. MSELoss and L1Loss handle regression, CrossEntropyLoss handles multi-class classification, and BCEWithLogitsLoss handles binary classification. For more specialized tasks, HuberLoss, TripletMarginLoss, and KLDivLoss address specific training needs. When none of those fit, subclassing nn.Module gives you full control over the loss computation. The right loss function depends on the problem type, the data distribution, and what the model is expected to learn.



