Observed_masks not behaving as expected #28914
Comments
Sorry, wrong Mueller. Can't really help you 😄 But thanks for the mention!
Oops, sorry. I hope I fixed it now.
cc @niels as well 🤗
Checking, thanks!
@dparr005 can you try running the training on a single GPU to see if the issue persists? And since your data has somewhat sane magnitudes, perhaps also set your
I just used the basis from https://huggingface.co/blog/time-series-transformers with my own data, so I am not sure which portion of the code you are asking for. I was already able to run the code on a single GPU (in a local Jupyter Notebook), but when I run it on the HPC cluster with multiple GPUs it does not work. My hypothesis is that the samples are somehow being seeded differently, which is why it runs in the Jupyter Notebook but not in the multi-GPU configuration.
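One way to test that seeding hypothesis (a minimal sketch on my part, not code from this thread; the seed value is arbitrary) is to pin the seed on every process with Accelerate's set_seed helper before the dataloaders are built:

```python
from accelerate.utils import set_seed

# Seeds Python's `random`, NumPy and torch (CPU and CUDA) on the current
# process, so the random sampling of training windows becomes reproducible.
set_seed(42)
```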
Actually, the error looks more like the following: future_observed_mask=batch["future_observed_mask"].to(device)
packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
accelerate/utils/operations.py", line 553, in forward
return model_forward(*args, **kwargs)
packages/accelerate/utils/operations.py", line 541, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
packages/torch/nn/parallel/distributed.py", line 1026, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function.
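For reference, when a script is driven by 🤗 Accelerate, the find_unused_parameters flag mentioned in this error is normally passed through a kwargs handler rather than by wrapping the model in DistributedDataParallel by hand. A minimal sketch, assuming an Accelerate-based training script like the one from the blog post:

```python
from accelerate import Accelerator, DistributedDataParallelKwargs

# Tell DDP to tolerate parameters that receive no gradient in a given step.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# The model, optimizer and dataloaders are then wrapped as usual, e.g.:
# model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
```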
I see @dparr005, so the issue is multi-GPU training... perhaps I need to gather the losses etc. I have a multi-GPU setup now, so I can finally test.
Yes. That was my other hypothesis: that the code somehow expects a gather statement (from all GPUs once a single training epoch is done) before going to the next epoch. What do you need from me to test this hypothesis?
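For illustration only (this is my sketch, not whatever change the maintainer ends up making): Accelerate's gather collects a tensor from every process, which is the usual way to report a loss averaged over all GPUs:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Dummy one-element "loss" tensor standing in for the real training loss.
loss = torch.tensor([0.5], device=accelerator.device)

# gather() concatenates the tensor from all processes (on a single process it
# just returns the tensor), so the mean below is the loss averaged across GPUs.
mean_loss = accelerator.gather(loss).mean()
print(mean_loss.item())
```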
I just ran the given code (from GitLab) in a multi-GPU environment, but it gives the same type of errors. The distributed environment is:
I converted the Jupyter Notebook into a .py script and am calling it from SLURM. Can anyone help me? It seems to me that it is an issue with how accelerate is being used.
What does your exact code look like here?
This is the code from data_utils.py. The above code was from Github.py.
As a reminder, the code is taken from the GitHub code referenced above. It works as a Jupyter Notebook but not as a Python script launched via SLURM.
Gentle ping @muellerzr
Another ping @muellerzr
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
use_cpu: false
Who can help?
@pacman100 @muellerzr
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am using TimeSeriesTransformerForPrediction, but I am getting the following error when trying to train the model.
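The training script itself is not reproduced in this issue, but the failing step is the training loop from the blog post. A simplified sketch of that step (reconstructed, not copied verbatim; model, optimizer, train_dataloader and accelerator are assumed to have been set up and passed through accelerator.prepare beforehand):

```python
# Simplified training step; static features and device handling are omitted.
model.train()
for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(
        past_values=batch["past_values"],
        past_time_features=batch["past_time_features"],
        past_observed_mask=batch["past_observed_mask"],
        future_values=batch["future_values"],
        future_time_features=batch["future_time_features"],
        future_observed_mask=batch["future_observed_mask"],
    )
    loss = outputs.loss            # negative log-likelihood of the output distribution
    accelerator.backward(loss)     # replaces loss.backward() under Accelerate
    optimizer.step()
```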
Expected behavior
For context, this issue is happening for TimeSeriesTransformerForPrediction.
From what I can tell, it only happens when there are 0's at the beginning of the past_values segments. I believe the error is thrown because the past_observed_mask puts a corresponding 0 for every 0 that comes before the first non-zero value (see pictures below). I would like the algorithm to learn/train on those 0's, since they really are 0's and not NaN or missing values (as a 0 in the past_observed_mask description would imply).
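If the intent is simply to have those genuine leading zeros counted as observed data, one option (my assumption, not something confirmed by the maintainers in this thread) is to rebuild the observed mask from the values themselves, marking every non-NaN entry as observed:

```python
import torch

# Hypothetical example tensors; in practice these come from the dataloader.
past_values = torch.tensor([[0.0, 0.0, 0.0, 3.1, 4.2, 5.0]])
past_observed_mask = torch.tensor([[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])  # leading zeros masked out

# Mark every value that is not NaN as observed, so that real zeros contribute
# to the loss instead of being treated as missing.
past_observed_mask = (~torch.isnan(past_values)).to(past_values.dtype)
print(past_observed_mask)  # tensor([[1., 1., 1., 1., 1., 1.]])
```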
When I take the advice of the error message and set find_unused_parameters=True, I get the following error:
ValueError: Expected parameter df (Tensor of shape (256, 45)) of distribution Chi2() to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values.
Can someone please advise how to fix this issue?
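Not a confirmed fix, but for context: the Chi2 constraint in that message comes from the degrees-of-freedom parameter of the default Student-T output head, so one experiment (my suggestion, not advice given in this thread) would be to switch the model to a different output distribution in the config:

```python
from transformers import TimeSeriesTransformerConfig, TimeSeriesTransformerForPrediction

# Placeholder lengths; the real values come from the dataset configuration.
config = TimeSeriesTransformerConfig(
    prediction_length=24,
    context_length=48,
    distribution_output="normal",  # default is "student_t", whose sampling uses Chi2(df)
)
model = TimeSeriesTransformerForPrediction(config)
```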