option to make training more deterministic #143

Open
wants to merge 1 commit into base: main
Conversation

elliottzheng
Contributor

@elliottzheng commented Feb 14, 2023

#114 #140

I have been trying to make training more deterministic; here I share some of my experiences.

  1. Replace these lines with

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # this should be False instead of True (current code)
torch.use_deterministic_algorithms(True)  # will raise an error when nondeterministic functions are used

Check https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking for more details.

  2. You may need to run the program with the flag CUBLAS_WORKSPACE_CONFIG=:4096:8; if an error is raised after doing step 1, check here for details. (A combined sketch of steps 1 and 2 is shown after the class below.)

  3. Replace the bilinear F.interpolate here with the implementation below, as it is nondeterministic; check here for details.

import functools
from typing import Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F


class Interpolate(nn.Module):
    """Deterministic stand-in for bilinear F.interpolate: nearest-neighbor upsampling
    followed by a fixed per-channel box filter (grouped convolution)."""

    def __init__(self, channel: int, scale_factor: int):
        super().__init__()
        assert isinstance(scale_factor, int) and scale_factor > 1 and scale_factor % 2 == 0
        self.scale_factor = scale_factor
        kernel_size = scale_factor + 1  # keep the kernel size odd
        # one averaging kernel per channel, filled with 1/(k*k) so the conv acts as a box filter
        weight = torch.full((1, 1, kernel_size, kernel_size), 1.0 / (kernel_size * kernel_size), dtype=torch.float32)
        self.weight = nn.Parameter(weight.expand(channel, -1, -1, -1).contiguous())
        self.conv = functools.partial(
            F.conv2d, weight=self.weight, bias=None, padding=scale_factor // 2, groups=channel
        )

    def forward(self, t):
        if t is None:
            return t
        # nearest-neighbor upsampling is deterministic; the box filter smooths the blocky result
        return self.conv(F.interpolate(t, scale_factor=self.scale_factor, mode='nearest'))

    @staticmethod
    def naive(t: torch.Tensor, size: Tuple[int, int], **kwargs):
        # deterministic drop-in for size-based interpolation calls
        if t is None or t.shape[2:] == size:
            return t
        else:
            assert 'mode' not in kwargs and 'align_corners' not in kwargs
            return F.interpolate(t, size, mode='nearest', **kwargs)
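For convenience, here is a minimal sketch that bundles steps 1 and 2 into one setup call. The function name enable_determinism and the seed handling are illustrative, not part of this PR:

import os
import random

import numpy as np
import torch


def enable_determinism(seed: int = 0):
    # step 2: the cuBLAS workspace config must be set before cuBLAS is initialized,
    # so exporting it in the shell (as in step 2 above) is the safer option
    os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')

    # seed the common RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # step 1: force deterministic cuDNN kernels and fail loudly on nondeterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)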

However, the training is still non-deterministic.

The raymarching_train here is non-deterministic: the rays output by the function differ between runs. I am not familiar with CUDA extensions, so I don't know how to solve it; you might want to look at it.

Here I provide a bash script, test.sh, to run deterministic experiments for debugging:

#!/bin/bash
gpu_id=$1
echo "Running on GPU $gpu_id"
rm -rf "results/squirrel_seed0_size64_deterministic_run$gpu_id"
rm -f "deterministic_run_$gpu_id.txt"

CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=$gpu_id python main.py \
--text "A DSLR photo of a squirrel" \
--cuda_ray \
--fp16 \
--dir_text \
--sd_version "2.0" \
--eval_interval 1 \
--seed 0 \
--deterministic \
--iters 20 \
--workspace "results/squirrel_seed0_size64_deterministic_run$gpu_id" > deterministic_run_$gpu_id.txt

Run bash test.sh 0 and bash test.sh 1 to run on GPU 0 and GPU 1, then compare the outputs in deterministic_run_0.txt and deterministic_run_1.txt.
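To locate where the two runs diverge, one could diff the logs; a minimal sketch in Python (filenames as produced by test.sh above):

# report the first line at which the two deterministic-run logs differ
with open('deterministic_run_0.txt') as f0, open('deterministic_run_1.txt') as f1:
    for i, (a, b) in enumerate(zip(f0, f1), start=1):
        if a != b:
            print(f'first divergence at line {i}:')
            print(f'  run 0: {a.rstrip()}')
            print(f'  run 1: {b.rstrip()}')
            break
    else:
        print('logs are identical (up to the shorter file)')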

@ashawkey
Owner

Thanks for your efforts! Making it deterministic must be quite hard...
For the ray marching extension, there is a race over the order in which each ray's points are written to the output, but the overall outcome should be the same, i.e., the point-wise results are different, but the ray-wise results are the same:

# run 1
xyzs: [ray1's points] [ray2's points] [ray3's points] (point-wise, order of rays may vary)
colors: [ray1's color] [ray2's color] [ray3's color] (ray-wise, always ordered)
# run 2
xyzs: [ray3's points] [ray1's points] [ray2's points]
colors: [ray1's color] [ray2's color] [ray3's color]
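If it helps, here is a hedged sketch of how one might verify this ray-wise consistency between two runs; the names colors_a/colors_b and xyzs_a/xyzs_b are placeholders for the outputs of raymarching_train in run 1 and run 2:

import torch


def lexsort_rows(x: torch.Tensor) -> torch.Tensor:
    # sort rows lexicographically (stable sort column by column, least significant first)
    x = x.reshape(-1, x.shape[-1])
    for col in reversed(range(x.shape[-1])):
        order = torch.sort(x[:, col], stable=True).indices
        x = x[order]
    return x


def check_ray_wise_determinism(colors_a, colors_b, xyzs_a, xyzs_b):
    # ray-wise outputs keep a fixed ordering by ray, so they should match directly
    rays_match = torch.equal(colors_a, colors_b)
    # point-wise outputs may be permuted across rays, so compare them as a multiset of rows
    points_match = torch.equal(lexsort_rows(xyzs_a), lexsort_rows(xyzs_b))
    return rays_match, points_match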

So I guess there may be some other reasons. Also, you could first check if the non-cuda-ray mode can be deterministic.
Currently I'm not having enough resources to test, but I may help to check it later.
