bugfix in transmittance based early stop criteria #541

Closed

Conversation

wonjune23 commented Jan 17, 2025

I think there's a bug in when rendering stops once the transmittance gets near zero.
Intuitively, the stopping criterion checks whether the transmittance for rendering the "next Gaussian" would be very low.
Therefore, it should render the "current Gaussian" and then break out of the loop.
However, the current implementation checks the criterion and stops immediately, without rendering the current one.
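For clarity, here is a minimal sketch of the per-pixel compositing loop (Python-style pseudocode rather than the actual CUDA kernel; the names and the 1e-4 threshold are illustrative):

T = 1.0                                    # transmittance
pixel_color = 0.0
for alpha, color in gaussians:             # (alpha, color) pairs, sorted front to back
    next_T = T * (1.0 - alpha)
    if next_T < 1e-4:
        break                              # current behaviour: stops BEFORE blending this Gaussian
    pixel_color += T * alpha * color
    T = next_T

The proposed fix is to blend the current Gaussian first and only then break out of the loop.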

To be more specific, say there are only two Gaussians to be rendered for a pixel, where the first one to be rendered has an alpha of 0.5 and the other has an alpha of 1.0. Because the second Gaussian has an alpha of 1.0, the rendered alpha of the pixel should obviously be 1.0.
However, the current implementation only renders the first one: while processing the second Gaussian, it detects that T would drop near zero and skips it, leaving the pixel's rendered alpha at 0.5.

This can be verified by manually rendering two very bright Gaussians (with the second one being very large):

self.means = torch.tensor([[0, 0, 1],
                            [0, 0, 1.5]], device=self.device)
self.scales = torch.tensor([[1.5e1, 1.5e1, 1.5e1],
                            [1e9, 1e9, 1e9]], device=self.device)
d = 3
self.rgbs = torch.ones(self.num_points, d, device=self.device) * 1e9

u = torch.ones(self.num_points, 1, device=self.device)
v = torch.ones(self.num_points, 1, device=self.device)
w = torch.ones(self.num_points, 1, device=self.device)

self.quats = torch.cat(
    [
        torch.sqrt(1.0 - u) * torch.sin(2.0 * math.pi * v),
        torch.sqrt(1.0 - u) * torch.cos(2.0 * math.pi * v),
        torch.sqrt(u) * torch.sin(2.0 * math.pi * w),
        torch.sqrt(u) * torch.cos(2.0 * math.pi * w),
    ],
    -1,
)

self.opacities = torch.tensor([1e9, 1e9], device=self.device)

The current implementation renders this weird gray ball, while after applying this bugfix, it correctly renders a white image.

[images: current vs. bugfix rendering]

If you cut out and look at the middle section of the image above, it looks like this:

[images: current_midsection vs. bugfix_midsection]

Here I trained randomly initialized Gaussians to fit an all-white image. This behavior compromises training stability, and it means 3DGS can never learn a truly saturated image. You can see this from the loss never dropping below a certain point. The output image has a slightly different color because the current implementation struggles to render a truly white image.

[images: current_training vs. training; current_loss vs. bugfix_loss]

If you discard the color values below 250 and remap [250, 255] to [0, 255] to make the effect more visible, you can clearly see the current implementation struggling with the weird gray blobs.

[images: current_training_enhanced vs. training_enhanced]
  • (Please ignore the wild RGB values at the beginning: I didn't actually discard the values below 250, so they became negative and repeatedly wrapped around.)

Though I think this bug has minimal effect in practice, I still think it needs a fix ;)

yzslab (Contributor) commented Feb 3, 2025

There is a bug in your modification. The T = next_T should be placed before the break (always update T, whether it breaks or not); otherwise, render_alphas and render_colors will use the wrong transmittance.

wonjune23 (Author) commented Feb 10, 2025

@yzslab Sorry for the late reply, and thank you for letting me know. I think you're right, so I changed the PR accordingly.
(I simply added another T = next_T in the if statement.)
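So the relevant part of the loop now looks roughly like this (again a Python-style sketch rather than the actual CUDA code; names and the 1e-4 threshold are illustrative):

T = 1.0
pixel_color = 0.0
for alpha, color in gaussians:             # sorted front to back
    next_T = T * (1.0 - alpha)
    pixel_color += T * alpha * color       # blend the current Gaussian first
    if next_T < 1e-4:
        T = next_T                         # the extra T = next_T, so the final transmittance stays consistent
        break
    T = next_T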

lopeLH commented Feb 19, 2025

Note that the behaviour you are trying to fix is not a bug, but an intentional feature introduced by the original 3DGS authors to avoid numerical issues. With your changes, the final transmittance can be arbitrarily close to zero, causing numerical instability in the backward pass.

From annex C of the original 3DGS paper:

[image: excerpt from annex C of the 3DGS paper]

wonjune23 (Author) commented Mar 1, 2025

@lopeLH Hi, thank you for letting me know; I was not aware that it was intended!

However, I've thought about all this, and it looks like the only numerical-stability concern with T is that the backward pass needs to restore it, starting from T_final, by repeatedly dividing by 1 - alpha:

S ra = 1.0f / (1.0f - alpha);
T *= ra;

I argue that it's not really a problem in the current implementation, for the following reasons.
(Short summary: the bugfix renders at most one more Gaussian, and that Gaussian's alpha, and thus T, are still well bounded, so it doesn't affect the numerical stability of restoring T.)

  1. The maximum of alpha is clamped to 0.999:
S alpha = min(0.999f, opac * __expf(-sigma));
  2. The minimum of alpha is 1/255 (otherwise, the renderer just ignores it):
if (sigma < 0.f || alpha < 1.f / 255.f) {
    continue;
}
  3. The minimum possible value of T_final after my bugfix is 1e-7 (before the bugfix, it was 1e-4):

    • This happens when several Gaussians have already been rendered so that T is near 1e-4, and then the last Gaussian, which would not have been rendered before the bugfix, has alpha = 0.999. In that case T_final becomes near 1e-7, which is the worst case possible. If the last Gaussian's alpha were smaller, T_final would be closer to 1e-4, making it numerically more stable.
  4. The numerical error of restoring T is also well bounded and negligible.

    • Think of the backward pass: the numerical drift is largest when we start from the smallest possible T_final and then repeatedly divide T by values near 1.0 (i.e., small alpha), so that the backward pass performs as many divisions as possible. This maximum number of divisions is also bounded, because alpha is bounded below by 1/255.
    • If we start from the worst-case scenario, T_final = 1e-7, it implies that the last Gaussian's alpha was 0.999. So the first thing the backward pass does is divide T_final by 1 - 0.999 and (hopefully) get T = 1e-4 if the numerical error is negligible.
    • From here on, it's the same situation as before the bugfix, but I still tried the worst case: all the remaining Gaussians have alpha = 1/255. Even in this extreme case there can be fewer than 2400 such Gaussians, because dividing T = 1e-4 by (1 - 1/255) 2400 times makes it larger than 1.0.
    • Specifically, the worst case experiment looks like this:
S temp_T = (1e-4) * (1.0 - 0.999);
temp_T /= (1.0 - 0.999);
for (int l = 0; l < 2400; l++) {
    temp_T /= (1.0 - 1.f / 255.f);
}
printf("%.10f\n", temp_T);

In the current implementation, S is a single-precision float, and the experiment prints 1.2459857464.
If we perform the same computation in Python (which uses double-precision floats), it prints 1.2459859743472113.
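For reference, the double-precision check is just a few lines of Python mirroring the constants of the experiment above:

temp_T = 1e-4 * (1.0 - 0.999)              # worst-case T_final, about 1e-7
temp_T /= (1.0 - 0.999)                    # undo the last Gaussian (alpha = 0.999)
for _ in range(2400):                      # undo 2400 Gaussians with alpha = 1/255
    temp_T /= (1.0 - 1.0 / 255.0)
print(temp_T)                              # about 1.24598597 in double precision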

Therefore, my conclusion is that even in the worst-case scenario, the numerical error in T is negligible, as its scale is at most around 1e-7. (I assume this amount of numerical error was also present before the bugfix.)

I feel like this PR is becoming overkill for such a negligible bug, but again, I still think it's a bug. If something is wrong with my logic, please let me know :)

lopeLH commented Mar 3, 2025

I am still not fully convinced about the proposed changes, for a number of reasons:

  1. From a software engineering perspective, I find it dangerous that the change's numerical stability relies on the values of other hyper-parameters which might be changed in the future, inadvertently triggering numerical stability issues. In this regard, I think a safer solution could involve providing the backward pass with the penultimate transmittance, instead of the last one, to bypass the issue altogether.

  2. From an algorithmic point of view, I think the change is less trivial and safe than it seems, and might affect the training dynamics in unforeseen ways. Adopting it safely would require careful validation and QA, as it is very easy to miss small but critical details in such a complex codebase. For example, I think your numerical stability analysis overlooked the fact that transmittance is converted to opacity (1.0f - T) and back to transmittance when passed from the forward to the backward kernels. I think you will find the numerical error is not negligible if you account for that in your analysis.

  3. I don't know what the purpose of this library is, but if the goal is to reproduce the original implementation's results as faithfully as possible, then even if the other concerns are addressed, one could argue this change deviates from the original authors' proposal. This is more of an "improvement" proposal than a trivial bugfix.

That said, I am not a maintainer of this library, so these are just my two cents.

wonjune23 (Author) commented Mar 3, 2025

Hi, I agree on all three points, especially the second one: as you mentioned, I overlooked the 1.0f - T part, and it indeed makes the error non-negligible.

I also felt this was deviating too much from the original implementation. As I mentioned earlier, I was not aware that it was intended in the original paper. Even after learning that, I still thought this feature (although intended) constrained the rendering process more than necessary. However, as you pointed out, it turns out that it does not.

I am still not happy that, in some admittedly very rare cases, the rendered image can be very different from what's expected. However, it turns out that fixing this is not easy without further changes to the current method. Thank you for sharing all of this. Knowing it will likely be very helpful for what I was working on, and hopefully for others as well.

wonjune23 closed this Mar 3, 2025