Yes, that is completely possible. Instead of taking a plain mean of the per-token losses, you can assign higher weights to the tokens of your choice and compute a weighted mean.

PyTorch code for the same (Code Source):

import torch
import torch.nn as nn

# Log-probabilities for 2 samples over 2 classes
log_prob = torch.tensor([[-0.0141, -4.2669],
                         [-0.0141, -4.2669]])
target = torch.tensor([0, 1])        # true class of each sample
weight = torch.tensor([2.0, 3.0])    # one weight per class
criterion = nn.NLLLoss()
criterion_weighted = nn.NLLLoss(weight=weight)

print(criterion(log_prob, target))           # tensor(2.1405)
print(criterion_weighted(log_prob, target))  # tensor(2.5658)
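
To see where the two numbers come from, here is a small hand check in plain Python. It assumes the default `reduction='mean'`, under which the weighted loss is the sum of `weight[target_i] * loss_i` divided by the sum of `weight[target_i]`, not by the number of samples. The variable names `nll`, `w`, `plain_mean` and `weighted_mean` are mine, purely for illustration.

```python
# Same numbers as in the PyTorch snippet, as plain Python lists
log_prob = [[-0.0141, -4.2669],
            [-0.0141, -4.2669]]
target = [0, 1]
weight = [2.0, 3.0]  # one weight per class

# Negative log-likelihood of the true class for each sample
nll = [-row[t] for row, t in zip(log_prob, target)]  # [0.0141, 4.2669]
# Weight that applies to each sample, looked up by its target class
w = [weight[t] for t in target]                      # [2.0, 3.0]

plain_mean = sum(nll) / len(nll)
weighted_mean = sum(wi * li for wi, li in zip(w, nll)) / sum(w)

print(round(plain_mean, 4))     # 2.1405
print(round(weighted_mean, 4))  # 2.5658
```

The second sample has both the larger loss and the larger weight, which is why the weighted mean ends up above the plain mean.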

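The weights bias training toward the tokens you weighted more heavily: errors on those tokens contribute more to the loss and its gradient, so the transformer is pushed to model the relationships involving them more accurately. Note that the `weight` argument of `nn.NLLLoss` holds one weight per class (i.e. per vocabulary entry), and that the weights are a hyperparameter you will need to tune.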
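
If you would rather weight individual positions in a sequence than vocabulary entries, one possible approach is to ask for the unreduced per-token loss and take the weighted mean yourself. This is only a rough sketch: the helper `weighted_token_loss`, the toy shapes and the values in `token_weight` are all made up for illustration, and normalising by the sum of the weights is just one reasonable choice.

```python
import torch
import torch.nn.functional as F

def weighted_token_loss(logits, target, token_weight):
    """Weighted mean of per-token cross-entropy (hypothetical helper).

    logits:       (num_tokens, vocab_size) unnormalised scores
    target:       (num_tokens,) true token ids
    token_weight: (num_tokens,) one weight per position
    """
    # reduction="none" keeps one loss value per token instead of averaging
    per_token = F.cross_entropy(logits, target, reduction="none")
    return (token_weight * per_token).sum() / token_weight.sum()

torch.manual_seed(0)
logits = torch.randn(4, 5)                         # 4 positions, vocabulary of 5 (toy sizes)
target = torch.tensor([1, 0, 3, 2])
token_weight = torch.tensor([1.0, 1.0, 4.0, 1.0])  # assumed: emphasise position 2

loss = weighted_token_loss(logits, target, token_weight)
print(loss)
```

With all weights equal to 1 this reduces to the ordinary mean cross-entropy, which is a handy sanity check before you start tuning the weights.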