bugfix in transmittance based early stop criteria #541

Closed

Conversation

wonjune23 commented Jan 17, 2025

I think there's a bug in when rendering stops once the transmittance gets near zero.
Intuitively, the stopping criterion checks whether the transmittance for rendering the "next Gaussian" would be very low.
Therefore, it should render the "current Gaussian" and then break out of the loop.
However, the current implementation checks the criterion and stops immediately, without rendering the current one.
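For clarity, here is a minimal sketch of the per-pixel compositing loop (Python-style pseudocode rather than the actual CUDA kernel; the names and the 1e-4 threshold are illustrative):

T = 1.0                                    # transmittance
pixel_color = 0.0
for alpha, color in gaussians:             # (alpha, color) pairs, sorted front to back
    next_T = T * (1.0 - alpha)
    if next_T < 1e-4:
        break                              # current behaviour: stops BEFORE blending this Gaussian
    pixel_color += T * alpha * color
    T = next_T

The proposed fix is to blend the current Gaussian first and only then break out of the loop.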

To be more specific, say there are only two Gaussians to be rendered for a pixel, where the first one to be rendered has an alpha of 0.5 and the other has an alpha of 1.0. Because the second Gaussian has an alpha of 1.0, the rendered alpha of the pixel should obviously be 1.0.
However, the current implementation only renders the first one: while processing the second Gaussian, it detects that T would drop near zero and skips it, leaving the pixel's rendered alpha at 0.5.

This can be verified by manually rendering two very bright Gaussians (with the second one being very large):

self.means = torch.tensor([[0, 0, 1],
                            [0, 0, 1.5]], device=self.device)
self.scales = torch.tensor([[1.5e1, 1.5e1, 1.5e1],
                            [1e9, 1e9, 1e9]], device=self.device)
d = 3
self.rgbs = torch.ones(self.num_points, d, device=self.device) * 1e9

u = torch.ones(self.num_points, 1, device=self.device)
v = torch.ones(self.num_points, 1, device=self.device)
w = torch.ones(self.num_points, 1, device=self.device)

self.quats = torch.cat(
    [
        torch.sqrt(1.0 - u) * torch.sin(2.0 * math.pi * v),
        torch.sqrt(1.0 - u) * torch.cos(2.0 * math.pi * v),
        torch.sqrt(u) * torch.sin(2.0 * math.pi * w),
        torch.sqrt(u) * torch.cos(2.0 * math.pi * w),
    ],
    -1,
)

self.opacities = torch.tensor([1e9, 1e9], device=self.device)

The current implementation renders this weird gray ball, while after applying this bugfix, it correctly renders a white image.

[images: current vs. bugfix rendering]

If you cut out and look at the middle section of the image above, it looks like this:

[images: current_midsection vs. bugfix_midsection]

Here I trained randomly initialized Gaussians to fit an all-white image. This behavior compromises training stability, and it means 3DGS can never learn a truly saturated image. You can see this from the loss never dropping below a certain point. The output image has a slightly different color because the current implementation struggles to render a truly white image.

[images: current_training vs. training; current_loss vs. bugfix_loss]

If you discard the color values below 250 and remap [250, 255] to [0, 255] to make the effect more visible, you can clearly see the current implementation struggling with the weird gray blobs.

[images: current_training_enhanced vs. training_enhanced]
  • (Please ignore the wild RGB values at the beginning: I didn't actually discard the values below 250, so they became negative and repeatedly wrapped around.)

Though I think this bug has minimal effect in practice, I still think it needs a fix ;)

yzslab (Contributor) commented Feb 3, 2025

There is a bug in your modification. The T = next_T should be placed before the break (always update T, whether it breaks or not); otherwise, render_alphas and render_colors will use the wrong transmittance.

wonjune23 (Author) commented Feb 10, 2025

@yzslab Sorry for the late reply, and thank you for letting me know. I think you're right, so I changed the PR accordingly.
(I simply added another T = next_T in the if statement.)
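So the relevant part of the loop now looks roughly like this (again a Python-style sketch rather than the actual CUDA code; names and the 1e-4 threshold are illustrative):

T = 1.0
pixel_color = 0.0
for alpha, color in gaussians:             # sorted front to back
    next_T = T * (1.0 - alpha)
    pixel_color += T * alpha * color       # blend the current Gaussian first
    if next_T < 1e-4:
        T = next_T                         # the extra T = next_T, so the final transmittance stays consistent
        break
    T = next_T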

lopeLH commented Feb 19, 2025

Note that the behaviour you are trying to fix is not a bug, but an intentional feature introduced by the original 3DGS authors to avoid numerical issues. With your changes, the final transmittance can be arbitrarily close to zero, causing numerical instability in the backward pass.

From annex C of the original 3DGS paper:

[image: excerpt from annex C of the 3DGS paper]

wonjune23 (Author) commented Mar 1, 2025

@lopeLH Hi, thank you for letting me know; I was not aware that it was intended!

However, I've thought about all this, and it looks like the only numerical-stability concern with T is that the backward pass needs to restore it, starting from T_final, by repeatedly dividing by 1 - alpha:

S ra = 1.0f / (1.0f - alpha);
T *= ra;

I argue that it's not really a problem in the current implementation, for the following reasons.
(Short summary: the bugfix renders at most one more Gaussian, and that Gaussian's alpha, and thus T, are still well bounded, so it doesn't affect the numerical stability of restoring T.)

  1. The maximum of alpha is clamped to 0.999:
S alpha = min(0.999f, opac * __expf(-sigma));
  2. The minimum of alpha is 1/255 (otherwise, the renderer just ignores it):
if (sigma < 0.f || alpha < 1.f / 255.f) {
    continue;
}
  3. The minimum possible value of T_final after my bugfix is 1e-7 (before the bugfix, it was 1e-4):

    • This happens when several Gaussians have already been rendered so that T is near 1e-4, and then the last Gaussian, which would not have been rendered before the bugfix, has alpha = 0.999. In that case T_final becomes near 1e-7, which is the worst case possible. If the last Gaussian's alpha were smaller, T_final would be closer to 1e-4, making it numerically more stable.
  4. The numerical error of restoring T is also well bounded and negligible.

    • Think of the backward pass: the numerical drift is largest when we start from the smallest possible T_final and then repeatedly divide T by values near 1.0 (i.e., small alpha), so that the backward pass performs as many divisions as possible. This maximum number of divisions is also bounded, because alpha is bounded below by 1/255.
    • If we start from the worst-case scenario, T_final = 1e-7, it implies that the last Gaussian's alpha was 0.999. So the first thing the backward pass does is divide T_final by 1 - 0.999 and (hopefully) get T = 1e-4 if the numerical error is negligible.
    • From here on, it's the same situation as before the bugfix, but I still tried the worst case: all the remaining Gaussians have alpha = 1/255. Even in this extreme case there can be fewer than 2400 such Gaussians, because dividing T = 1e-4 by (1 - 1/255) 2400 times makes it larger than 1.0.
    • Specifically, the worst case experiment looks like this:
S temp_T = (1e-4) * (1.0 - 0.999);
temp_T /= (1.0 - 0.999);
for (int l = 0; l < 2400; l++) {
    temp_T /= (1.0 - 1.f / 255.f);
}
printf("%.10f\n", temp_T);

In the current implementation, S is a single-precision float, and the experiment prints 1.2459857464.
If we perform the same computation in Python (which uses double-precision floats), it prints 1.2459859743472113.
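For reference, the double-precision check is just a few lines of Python mirroring the constants of the experiment above:

temp_T = 1e-4 * (1.0 - 0.999)              # worst-case T_final, about 1e-7
temp_T /= (1.0 - 0.999)                    # undo the last Gaussian (alpha = 0.999)
for _ in range(2400):                      # undo 2400 Gaussians with alpha = 1/255
    temp_T /= (1.0 - 1.0 / 255.0)
print(temp_T)                              # about 1.24598597 in double precision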

Therefore, my conclusion is that even in the worst-case scenario, the numerical error in T is negligible, as its scale is at most around 1e-7. (I assume this amount of numerical error was also present before the bugfix.)

I feel like this PR is becoming overkill for such a negligible bug, but again, I still think it's a bug. If something is wrong with my logic, please let me know :)

lopeLH commented Mar 3, 2025

I am still not fully convinced about the proposed changes, for a number of reasons:

  1. From a software engineering perspective, I find it dangerous that the change's numerical stability relies on the values of other hyper-parameters which might be changed in the future, inadvertently triggering numerical stability issues. In this regard, I think a safer solution could involve providing the backward pass with the penultimate transmittance, instead of the last one, to bypass the issue altogether.

  2. From an algorithmic point of view, I think the change is less trivial and safe than it seems, and might affect the training dynamics in unforeseen ways. Adopting it safely would require careful validation and QA, as it is very easy to miss small but critical details in such a complex codebase. For example, I think your numerical stability analysis overlooked the fact that transmittance is converted to opacity (1.0f - T) and back to transmittance when passed from the forward to the backward kernels. I think you will find the numerical error is not negligible if you account for that in your analysis.

  3. I don't know what the purpose of this library is, but if the goal is to reproduce the original implementation's results as faithfully as possible, then even if the other concerns are addressed, one could argue this change deviates from the original authors' proposal. This is more of an "improvement" proposal than a trivial bugfix.

That said, I am not a maintainer of this library, so these are just my two cents.

wonjune23 (Author) commented Mar 3, 2025

Hi, I agree on all three points, especially the second one: as you mentioned, I overlooked the 1.0f - T part, and it indeed makes the error non-negligible.

I also felt this was deviating too much from the original implementation. As I mentioned earlier, I was not aware that it was intended in the original paper. Even after learning that, I still thought this feature (although intended) constrained the rendering process more than necessary. However, as you pointed out, it turns out that it does not.

I am still not happy that, in some admittedly very rare cases, the rendered image can be very different from what's expected. However, it turns out that fixing this is not easy without further changes to the current method. Thank you for sharing all of this. Knowing it will likely be very helpful for what I was working on, and hopefully for others as well.

wonjune23 closed this Mar 3, 2025