The key is understanding what mu and sigma represent and how they connect to the loss. For each text token, the encoder outputs a Gaussian distribution in latent space, defined by a mean mu and a standard deviation sigma. The mel spectrogram is transformed by the normalizing flow into a sequence of latent frames z, one per mel frame. The loss then asks: how likely is each latent frame z under each token's Gaussian? That likelihood computation is what drives alignment without labels, because each frame naturally gets assigned to the token whose Gaussian best explains it. The full loss combines three terms:

- Negative log-likelihood of z under the token Gaussians: this is the main alignment signal.
- Log-determinant of the Jacobian from the flow: the standard normalizing-flow correction for volume change.
- Duration predictor loss: learns how long each token lasts once the soft alignments stabilize.
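To make the first term concrete, here is a minimal NumPy sketch of the frame-token likelihood matrix. It assumes diagonal Gaussians, and the function name, shapes, and toy values are made up for illustration; they are not taken from any particular implementation.

```python
import numpy as np

def frame_token_log_likelihood(z, mu, sigma):
    """Log-density of every latent frame under every token's diagonal Gaussian.

    z:     (T_mel, D)  latent frames produced by the flow
    mu:    (T_text, D) per-token means from the text encoder
    sigma: (T_text, D) per-token standard deviations
    Returns an array of shape (T_text, T_mel).
    """
    diff = z[None, :, :] - mu[:, None, :]   # (T_text, T_mel, D)
    var = sigma[:, None, :] ** 2            # (T_text, 1, D), broadcasts over frames
    log_prob = -0.5 * (np.log(2 * np.pi * var) + diff ** 2 / var)
    return log_prob.sum(axis=-1)            # sum over the latent dimensions

# Toy data (hypothetical): 2 tokens, 4 frames, 2 latent dimensions.
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
sigma = np.ones_like(mu)
z = np.array([[0.1, -0.2], [0.3, 0.1], [4.8, 5.1], [5.2, 4.9]])

log_lik = frame_token_log_likelihood(z, mu, sigma)
best_token = log_lik.argmax(axis=0)  # token whose Gaussian best explains each frame
print(best_token)                    # [0 0 1 1]
```

Frames near a token's mean score highest under that token, so a sensible assignment falls out of the likelihoods alone. The NLL term would be the negative sum of the selected entries, with the flow's log-determinant added as the volume correction.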

The CTC constraint sits on top of this and enforces monotonic coverage of all tokens, which prevents degenerate alignments where the model skips tokens entirely. The reason it works without frame-level labels is that the flow is invertible and so gives you exact likelihoods: the model can compute an exact log-likelihood for every frame-token pair and let the alignment emerge from that signal alone.
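As a rough illustration of what a monotonic, no-skip constraint does to the likelihood matrix, here is a sketch that uses a Viterbi-style dynamic program. That is an assumption made for illustration: it is a simpler stand-in for the constraint rather than CTC itself, and the function name and toy values are hypothetical.

```python
import numpy as np

def monotonic_alignment(log_lik):
    """Best monotonic path through a (T_text, T_mel) log-likelihood matrix.

    The path starts at the first token, ends at the last token, and at each
    frame either stays on the current token or advances by exactly one,
    so no token can be skipped. Returns one token index per frame.
    """
    n_tok, n_frm = log_lik.shape
    assert n_frm >= n_tok, "need at least one frame per token"

    # Q[i, j] = best total log-likelihood of a valid path ending at token i, frame j.
    Q = np.full((n_tok, n_frm), -np.inf)
    Q[0, 0] = log_lik[0, 0]
    for j in range(1, n_frm):
        for i in range(min(j + 1, n_tok)):
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_lik[i, j]

    # Backtrack from the last token at the last frame.
    path = np.zeros(n_frm, dtype=int)
    i = n_tok - 1
    for j in range(n_frm - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return path

# Toy matrix (hypothetical): token 1 wins every frame if taken frame by frame.
toy_log_lik = np.array([[-1.0, -5.0, -5.0, -5.0],
                        [-0.5, -1.0, -1.0, -1.0]])

unconstrained = toy_log_lik.argmax(axis=0)      # [1 1 1 1], token 0 is skipped
constrained = monotonic_alignment(toy_log_lik)  # [0 1 1 1], token 0 is covered
print(unconstrained, constrained)
```

The per-frame argmax drops token 0 entirely, while the constrained path is forced to spend at least one frame on it, which is the degenerate case the constraint is there to rule out.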