[BUG] Enabling multiple gpu causes AssertionError #1385
The error goes away when I input at least 8 time series, which equals the GPU count. But then the predict DataLoader takes too much time, and predicting causes a None error. Do I have to provide at least as many time series as there are GPUs when multiple GPUs are enabled?
I'm having the same issue with a TFT model. As a minimal reproduction:

from darts.datasets import AirPassengersDataset
from darts.models import TFTModel

def main():
    series = AirPassengersDataset().load()
    model = TFTModel(
        input_chunk_length=12,
        output_chunk_length=1,
        n_epochs=1,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": 4,
        },
        add_relative_index=True,
    )
    model.fit(series=series)
    model.predict(12)  # AssertionError: assert self.num_samples >= 1 or self.total_size == 0

if __name__ == "__main__":
    main()

I'm using an AWS
While digging I found that this is actually tied to the trainer that gets reused for prediction. Here's a super hack-y workaround: overwrite the model's trainer after fitting:

from darts.datasets import AirPassengersDataset
from darts.models import TFTModel
import pytorch_lightning as pl

def main():
    series = AirPassengersDataset().load()
    model = TFTModel(
        input_chunk_length=12,
        output_chunk_length=1,
        n_epochs=1,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": 4,
        },
        add_relative_index=True,
    )
    model.fit(series=series)
    model.trainer = pl.Trainer(**{**model.trainer_params, "devices": 1})
    model.predict(12)

if __name__ == "__main__":
    main()

This will work for experimentation, but it's definitely not ideal.
Is this still an issue with 0.23.0?
I just confirmed that it's still happening in 0.23, with this script:

import darts
from darts.datasets import AirPassengersDataset
from darts.models import TFTModel

def main():
    series = AirPassengersDataset().load()
    model = TFTModel(
        input_chunk_length=12,
        output_chunk_length=1,
        n_epochs=1,
        pl_trainer_kwargs={
            "devices": 4,
        },
        add_relative_index=True,
    )
    model.fit(series=series)
    print(f"{darts.__version__=}")
    model.predict(12)  # AssertionError: assert self.num_samples >= 1 or self.total_size == 0

if __name__ == "__main__":
    main()
There is a temporary workaround here: #1287 (comment)
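For anyone who just needs predictions to run, the general idea behind these workarounds is to do inference on a single device. Here is a minimal sketch of that pattern (not necessarily what the linked comment does; it assumes a Darts version whose fit()/predict() accept an optional PyTorch Lightning trainer argument):

import pytorch_lightning as pl
from darts.datasets import AirPassengersDataset
from darts.models import TFTModel

series = AirPassengersDataset().load()
model = TFTModel(
    input_chunk_length=12,
    output_chunk_length=1,
    n_epochs=1,
    pl_trainer_kwargs={"accelerator": "gpu", "devices": 4},
    add_relative_index=True,
)
model.fit(series=series)  # training still runs on all 4 GPUs

# Use a separate single-device trainer just for inference, so the
# prediction dataset is never sharded across multiple replicas.
predict_trainer = pl.Trainer(accelerator="gpu", devices=1)
forecast = model.predict(12, trainer=predict_trainer)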
Describe the bug
Enabling multiple GPUs works fine when fitting the model, but it causes an error when calling predict, historical_forecasts, or backtest.
A single GPU works fine at both stages.
pytorch_lightning/overrides/distributed.py", line 91, in __init__
self.num_samples = len(range(self.rank, len(self.dataset), self.num_replicas))
self.total_size = len(self.dataset)
assert self.num_samples >= 1 or self.total_size == 0
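To make the failure mode concrete, here is a small illustration (mine, not from the traceback) of the distributed-sampler arithmetic above when a single prediction sample is split across 8 replicas:

# Illustrative only: mimic the num_samples computation from
# pytorch_lightning/overrides/distributed.py for a 1-sample dataset on 8 GPUs.
dataset_len = 1   # prediction dataset built from one short series
num_replicas = 8  # one replica per GPU

for rank in range(num_replicas):
    num_samples = len(range(rank, dataset_len, num_replicas))
    total_size = dataset_len
    print(rank, num_samples, total_size)
    # rank 0 gets 1 sample; ranks 1-7 get 0 samples while total_size == 1,
    # so `assert num_samples >= 1 or total_size == 0` fails on those ranks.

Fitting typically does not hit this because the training dataset has many more samples than replicas, so every rank receives at least one.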
To Reproduce
I have 8 GPUs, so I enabled the GPU option as follows:

pl_trainer_kwargs={
    "accelerator": "gpu",
    "devices": [0, 1, 2, 3, 4, 5, 6, 7]
}
Expected behavior
It should run without any problem on a single GPU or multiple GPUs.
System (please complete the following information):
Python 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] :: Anaconda, Inc. on linux
open-clip-torch 2.7.0
pytorch-lightning 1.8.3.post0
torch 1.13.0+cu117
torchaudio 0.13.0+cu117
torchmetrics 0.9.3
torchvision 0.14.0+cu117
darts 0.22.0