
MULTI_GPU DATA_PARALLEL #1287

Closed
gunewar opened this issue Oct 13, 2022 · 13 comments · Fixed by #1509
Labels
bug Something isn't working

Comments

@gunewar

gunewar commented Oct 13, 2022

Describe the bug
I tried to use darts with multiple GPUs but keep getting "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" when running in a Jupyter notebook.
I also tried with a .py file; this time the model fits, but during the second prediction the system exits without any error or warning.
"pred_series = model_nbeats.historical_forecasts(

series,

past_covariates=train_features,


num_samples=1,

start=0.7,

forecast_horizon=6,

stride=10,

retrain=False,

overlap_end=False,

last_points_only=True, 

verbose=True,"

)
To Reproduce
model_nbeats = NBEATSModel(
    input_chunk_length=1440,
    output_chunk_length=6,
    generic_architecture=True,
    num_stacks=50,
    num_blocks=1,
    num_layers=4,
    layer_widths=512,
    n_epochs=1,
    nr_epochs_val_period=1,
    batch_size=1024,
    model_name="nbeats_run",
    force_reset=True,
    random_state=None,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": [0, 1],  # use all available GPUs
        # "auto_select_gpus": True,
        "strategy": "ddp_notebook_find_unused_parameters_false",
    },
)

This is my model.

model_nbeats.fit(
    train, verbose=True, past_covariates=train_features, num_loader_workers=2
)

And this is how I fit it.
Expected behavior
I guess there is some problem with the random_method decorator in darts' utils/torch.py:

/miniconda3/envs/darts/lib/python3.9/site-packages/darts/utils/torch.py:112, in random_method.<locals>.decorator(self, *args, **kwargs)
    110 with fork_rng():
    111     manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
--> 112     return decorated(self, *args, **kwargs)
System (please complete the following information):

  • Python version: 3.9
  • darts version: 0.22.0

Additional context
Could you publish documentation for using GPUs, data_parallel, and distributed_data_parallel?

gunewar added the bug (Something isn't working) and triage (Issue waiting for triaging) labels on Oct 13, 2022
@dennisbader
Collaborator

Hi @gunewar, unfortunately I don't have the hardware to test this using multiple GPUs.

  • Let's start from your .py file: when you say it exits after the 2nd prediction without error or warning, do you mean it exits after the 2nd out of some_n historical forecasts?

  • Does the normal model.predict() work?

  • What exactly is the error message you get when running it as a Jupyter notebook?

@gunewar
Author

gunewar commented Oct 13, 2022

Hi Dennis, my .py file is below:

############################################################################################

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from darts import TimeSeries
from darts.models import NBEATSModel
from darts.dataprocessing.transformers import Scaler, MissingValuesFiller
from darts.metrics import mape, r2_score

import matplotlib
import time
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"


def display_forecast(pred_series, ts_transformed, forecast_type, start_date=None):
    plt.figure(figsize=(8, 5))
    if start_date:
        ts_transformed = ts_transformed.drop_before(start_date)
    ts_transformed.univariate_component(0).plot(label="actual")
    pred_series.plot(label=("historic " + forecast_type + " forecasts"))
    plt.title(
        "R2: {}".format(r2_score(ts_transformed.univariate_component(0), pred_series))
    )
    plt.legend()


df = pd.read_csv(r"/home/data.csv")
df.drop(df.index[range(680_000)], inplace=True)
df["date"] = pd.to_datetime(df["date"])
# df.set_index('date', inplace=True)
df = df.reset_index(drop=True)
df = df.dropna()
df.columns
df.shape

df_day_avg = df

filler = MissingValuesFiller()
scaler = Scaler()

series = scaler.fit_transform(
    filler.transform(
        TimeSeries.from_dataframe(
            df_day_avg, "date", ["value"], fill_missing_dates=True, freq="min"
        )
    )
).astype(np.float32)

series_feaures = scaler.fit_transform(
    filler.transform(
        TimeSeries.from_dataframe(
            df_day_avg, "date", fill_missing_dates=True, freq="min"
        )
    )
).astype(np.float32)

train, val = series.split_after(0.7)
# train_features, val_features = series_feaures.split_after(0.7)
train_features = series_feaures

import torch

print(torch.cuda.device_count())
print(torch.cuda.is_available())
print(torch.cuda.current_device())

model_nbeats = NBEATSModel(
    input_chunk_length=144,
    output_chunk_length=6,
    generic_architecture=True,
    num_stacks=10,
    num_blocks=1,
    num_layers=4,
    layer_widths=512,
    n_epochs=1,
    nr_epochs_val_period=1,
    batch_size=1024,
    model_name="nbeats_run",
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "strategy": "ddp",
        "devices": -1,
        "auto_select_gpus": True,
    },
)

model_nbeats.fit(train, verbose=True, past_covariates=train_features)

####################################################################

Model fit works well; after that, when prediction starts, I debugged the code:
####################################################################################
pred_series = model_nbeats.historical_forecasts(
    series,
    past_covariates=train_features,
    num_samples=1,
    start=0.7,
    forecast_horizon=6,
    stride=10,
    retrain=False,
    overlap_end=False,
    last_points_only=True,
    verbose=True,
)
###########################################################################

In the second predict loop, the system ends the process without an error code.

################################################################################

When I try to test in a Jupyter notebook, after defining the model with the code below
##########################################################

model_nbeats = NBEATSModel(
    input_chunk_length=1440,
    output_chunk_length=6,
    generic_architecture=True,
    num_stacks=50,
    num_blocks=1,
    num_layers=4,
    layer_widths=512,
    n_epochs=1,
    nr_epochs_val_period=1,
    batch_size=1024,
    model_name="nbeats_run",
    force_reset=True,
    random_state=None,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": [0, 1],  # use all available GPUs
        # "auto_select_gpus": True,
        # "strategy": "ddp_notebook_find_unused_parameters_false",
    },
)
##############################################################
Darts gives the following warning:
##################################################

/home/solymr/miniconda3/envs/darts/lib/python3.9/site-packages/torch/random.py:99: UserWarning: CUDA reports that you have 2 available devices, and you have used fork_rng without explicitly specifying which devices are being used. For safety, we initialize every CUDA device by default, which can be quite slow if you have a lot of GPUs. If you know that you are only making use of a few CUDA devices, set the environment variable CUDA_VISIBLE_DEVICES or the 'devices' keyword argument of fork_rng with the set of devices you are actually using. For example, if you are using CPU only, set CUDA_VISIBLE_DEVICES= or devices=[]; if you are using GPU 0 only, set CUDA_VISIBLE_DEVICES=0 or devices=[0]. To initialize all devices and suppress this warning, set the 'devices' keyword argument to range(torch.cuda.device_count()).
warnings.warn(

##############################################################################
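As a side note, the warning above spells out its own workaround: restrict fork_rng to the devices actually in use, or set CUDA_VISIBLE_DEVICES. A minimal sketch of what it is asking for, outside of darts (hypothetical):

import torch
from torch.random import fork_rng

# Forking the RNG state only for GPU 0 avoids initializing every
# visible CUDA device; setting CUDA_VISIBLE_DEVICES=0 before the
# process starts achieves the same effect.
with fork_rng(devices=[0]):
    torch.manual_seed(42)
    sample = torch.rand(3, device="cuda:0")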

When I ran the model fit with
###################################################################
model_nbeats.fit(
    train, verbose=True, past_covariates=train_features, num_loader_workers=2
)
###################################################################

I got this system error message:
#########################################################
2022-10-13 01:44:41 pytorch_lightning.utilities.rank_zero INFO: GPU available: True (cuda), used: True
2022-10-13 01:44:41 pytorch_lightning.utilities.rank_zero INFO: TPU available: False, using: 0 TPU cores
2022-10-13 01:44:41 pytorch_lightning.utilities.rank_zero INFO: IPU available: False, using: 0 IPUs
2022-10-13 01:44:41 pytorch_lightning.utilities.rank_zero INFO: HPU available: False, using: 0 HPUs
2022-10-13 01:44:41 pytorch_lightning.utilities.distributed INFO: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
2022-10-13 01:44:41 pytorch_lightning.utilities.distributed INFO: Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
2022-10-13 01:44:41 pytorch_lightning.utilities.rank_zero INFO: ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes


ProcessRaisedException                    Traceback (most recent call last)
/home/solymr/Desktop/Work_Space/DARTS.MODELS/Yuppii_Darts.ipynb Cell 26 in <cell line: 1>()
----> 1 model_nbeats.fit(train,verbose=True,past_covariates=train_features,
      2     num_loader_workers=2)

File ~/miniconda3/envs/darts/lib/python3.9/site-packages/darts/utils/torch.py:112, in random_method.<locals>.decorator(self, *args, **kwargs)
    110 with fork_rng():
    111     manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
--> 112     return decorated(self, *args, **kwargs)

File ~/miniconda3/envs/darts/lib/python3.9/site-packages/darts/models/forecasting/torch_forecasting_model.py:739, in TorchForecastingModel.fit(self, series, past_covariates, future_covariates, val_series, val_past_covariates, val_future_covariates, trainer, verbose, epochs, max_samples_per_ts, num_loader_workers)
    731 logger.info(f"Train dataset contains {len(train_dataset)} samples.")
    733 super().fit(
    734     series=seq2series(series),
    735     past_covariates=seq2series(past_covariates),
    736     future_covariates=seq2series(future_covariates),
    737 )
--> 739 return self.fit_from_dataset(
    740     train_dataset, val_dataset, trainer, verbose, epochs, num_loader_workers
    741 )

File ~/miniconda3/envs/darts/lib/python3.9/site-packages/darts/utils/torch.py:112, in random_method.<locals>.decorator(self, *args, **kwargs)
    110 with fork_rng():
    111     manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
...
    torch._C._cuda_setDevice(device)
  File "/home/esc/miniconda3/envs/darts/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
############################################################################################
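For context, this RuntimeError is what PyTorch raises whenever a process that has already initialized CUDA is forked and the child touches CUDA again. A minimal reproduction independent of darts (hypothetical; needs a CUDA machine):

import torch
import torch.multiprocessing as mp

def worker(rank):
    # The forked child tries to (re-)initialize CUDA.
    torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.cuda.init()  # the parent initializes CUDA first
    # With start_method="fork" this raises "Cannot re-initialize CUDA
    # in forked subprocess"; start_method="spawn" avoids the error.
    mp.start_processes(worker, nprocs=2, start_method="fork")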

@solalatus
Contributor

solalatus commented Dec 5, 2022

I am also having problems with multi-GPU training.

I have tried with Jupyter and without it (a .py file); all DDP variants give this:

RuntimeError                              Traceback (most recent call last)
<ipython-input-19-92384a3569c2> in <module>
      1 print("starting training...")
----> 2 stuff = model_nhits.fit(train_datasets, val_series=valid_datasets, verbose=True)

~/.local/lib/python3.8/site-packages/darts/utils/torch.py in decorator(self, *args, **kwargs)
    110         with fork_rng():
    111             manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
--> 112             return decorated(self, *args, **kwargs)
    113 
    114     return decorator

~/.local/lib/python3.8/site-packages/darts/models/forecasting/torch_forecasting_model.py in fit(self, series, past_covariates, future_covariates, val_series, val_past_covariates, val_future_covariates, trainer, verbose, epochs, max_samples_per_ts, num_loader_workers)
    737         )
    738 
--> 739         return self.fit_from_dataset(
    740             train_dataset, val_dataset, trainer, verbose, epochs, num_loader_workers
    741         )

~/.local/lib/python3.8/site-packages/darts/utils/torch.py in decorator(self, *args, **kwargs)
    110         with fork_rng():
    111             manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
--> 112             return decorated(self, *args, **kwargs)
    113 
    114     return decorator

~/.local/lib/python3.8/site-packages/darts/models/forecasting/torch_forecasting_model.py in fit_from_dataset(self, train_dataset, val_dataset, trainer, verbose, epochs, num_loader_workers)
    892 
    893         # Train model
--> 894         self._train(train_loader, val_loader)
    895         return self
    896 

~/.local/lib/python3.8/site-packages/darts/models/forecasting/torch_forecasting_model.py in _train(self, train_loader, val_loader)
    914         self.load_ckpt_path = None
    915 
--> 916         self.trainer.fit(
    917             self.model,
    918             train_dataloaders=train_loader,

~/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py in fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    580             raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}")
    581         self.strategy._lightning_module = model
--> 582         call._call_and_handle_interrupt(
    583             self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    584         )

~/.local/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     34     try:
     35         if trainer.strategy.launcher is not None:
---> 36             return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     37         else:
     38             return trainer_fn(*args, **kwargs)

~/.local/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py in launch(self, function, trainer, *args, **kwargs)
     94         self._check_torchdistx_support()
     95         if self._start_method in ("fork", "forkserver"):
---> 96             _check_bad_cuda_fork()
     97 
     98         # The default cluster environment in Lightning chooses a random free port number

~/.local/lib/python3.8/site-packages/lightning_lite/strategies/launchers/multiprocessing.py in _check_bad_cuda_fork()
    192     if _IS_INTERACTIVE:
    193         message += " You will have to restart the Python kernel."
--> 194     raise RuntimeError(message)

RuntimeError: Lightning can't create new processes if CUDA is already initialized. Did you manually call torch.cuda.* functions, have moved the model to the device, or allocated memory on the GPU any other way? Please remove any such calls, or change the selected strategy. You will have to restart the Python kernel.
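The check that raises this lives in Lightning's multiprocessing launcher: any torch.cuda call that initializes CUDA in the parent process before fit() trips it for fork/forkserver-based strategies. A minimal illustration of what counts as "already initialized" (hypothetical):

import torch

# Harmless-looking diagnostics like this are enough to initialize
# CUDA in the parent process:
torch.cuda.current_device()

print(torch.cuda.is_initialized())  # True

# Any fork/forkserver-launched Lightning strategy started after this
# point raises the error above; spawn-based strategies are unaffected.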

@solalatus
Contributor

I also tried to use DP in a .py file; the message then ends as follows (apologies, the trace was cut by my ssh client):

modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/anaconda3/envs/tensorml/lib/python3.9/site-packages/darts/models/forecasting/nhits.py", line 179, in forward
    x = self.layers(x)
  File "/home/user/anaconda3/envs/tensorml/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/anaconda3/envs/tensorml/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/user/anaconda3/envs/tensorml/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/anaconda3/envs/tensorml/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat1 in method wrapper_addmm)
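A common cause of this class of error under DataParallel is a tensor created on a fixed device inside forward(): DP replicates the module onto every GPU and scatters the batch, so anything hard-coded to cuda:0 collides with the replica running on cuda:1. A minimal illustration (hypothetical; needs at least two GPUs, and not necessarily what happens inside darts' nhits.py):

import torch
import torch.nn as nn

class BadModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        # Hard-coding a device inside forward() breaks the replicas
        # that DataParallel runs on the other GPUs.
        bias = torch.ones(4, device="cuda:0")
        return self.linear(x) + bias

if __name__ == "__main__":
    model = nn.DataParallel(BadModule().to("cuda"))
    x = torch.randn(8, 4, device="cuda")
    model(x)  # RuntimeError: Expected all tensors to be on the same device ...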


@solalatus
Contributor

I also tried installing DeepSpeed and using its strategies; that errors as well.

This was a fresh pip install of darts in a Conda environment, with Python 3.8.10 and Darts 0.22.0.

Any advice would be highly appreciated!

(By the way: the same code runs like a charm in the same environment on a single GPU.)

@hrzn
Contributor

hrzn commented Dec 6, 2022

Unfortunately we haven't yet had the chance to properly test Darts on a multi-GPU setup. There's a discussion going on here about this too: https://gitter.im/u8darts/darts?at=63847beabcdb0060b8408787

@gunewar
Author

gunewar commented Dec 9, 2022

Hi Julien, I am from Bilkent University and I can open a server with multiple GPUs (2 × 1080 Ti) for your work. Please contact me on Skype (live:f337e4b1889436b3) or at sakaryaemre@gmail.com if you are interested.

@solalatus
Contributor

Any news on this? Has anyone managed to train e.g. an N-HiTS model on multiple GPUs?

hrzn removed the triage (Issue waiting for triaging) label on Jan 5, 2023
@hrzn
Contributor

hrzn commented Jan 5, 2023

I added this to our backlog so one of us can take a look when we have some time (thanks for the kind proposal @gunewar!). In the meantime, any PR/fix proposal is welcome.

@hrzn
Contributor

hrzn commented Jan 6, 2023

See also: #1385

@solalatus
Contributor

solalatus commented Jan 17, 2023

I managed to get the multi-GPU setup working for me.

Steps needed:

  • I did a fork and had to change the logging in pl_forecasting_module.py a tiny bit (see here: master...solalatus:darts:master).
  • Running from Jupyter is not working yet, so put the code in a .py file and run it as a script.
  • You have to add the following pattern to your script:

import torch

if __name__ == '__main__':
    torch.multiprocessing.freeze_support()

(Mind you, the __main__ guard is necessary! See this even though it is NOT a Windows environment.)

Everything else is left at the defaults, so no special trainer args; only:

pl_trainer_kwargs = {"accelerator": "gpu", "devices": -1, "auto_select_gpus": True}

This results in using the ddp_spawn method, which is the default. I did not test other methods yet.
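For completeness, a minimal skeleton of the whole working script could look roughly like this (the sine-wave data and hyperparameters are made up for illustration; the darts calls are the standard 0.22 API):

import numpy as np
import pandas as pd
import torch

from darts import TimeSeries
from darts.models import NBEATSModel


def build_series():
    # Hypothetical toy data: two weeks of hourly sine values.
    idx = pd.date_range("2022-01-01", periods=24 * 14, freq="H")
    values = np.sin(np.arange(len(idx)) * 2 * np.pi / 24).astype(np.float32)
    return TimeSeries.from_times_and_values(idx, values)


if __name__ == "__main__":
    # Everything lives under the __main__ guard so the ddp_spawn worker
    # processes can re-import this module without side effects.
    torch.multiprocessing.freeze_support()

    series = build_series()
    train, val = series.split_after(0.8)

    model = NBEATSModel(
        input_chunk_length=48,
        output_chunk_length=12,
        n_epochs=1,
        batch_size=256,
        pl_trainer_kwargs={
            "accelerator": "gpu",
            "devices": -1,             # all visible GPUs
            "auto_select_gpus": True,  # default strategy: ddp_spawn
        },
    )
    model.fit(train, val_series=val, verbose=True)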

If you think it helps, I'm happy to open a PR for this "wonderful" two-liner from my fork, @hrzn.

@solalatus
Contributor

solalatus commented Jan 19, 2023

Update: This proved to be more stable.
74477d9

@hrzn
Contributor

hrzn commented Jan 25, 2023

Nice, thanks for sharing your solution and opening a PR, @solalatus!
@gunewar, are you by any chance able to confirm whether this fix also resolves your issue?
