MULTI_GPU DATA_PARALLEL #1287
Hi @gunewar, unfortunately I don't have the hardware to test this using multiple GPUs.
Hi Dennis, my .py file is below:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from darts import TimeSeries
from darts.models import NBEATSModel
from darts.dataprocessing.transformers import Scaler, MissingValuesFiller
from darts.metrics import mape, r2_score
import matplotlib
import time as time
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

def display_forecast(pred_series, ts_transformed, forecast_type, start_date=None):
    ...

df = pd.read_csv(r'/home/data.csv')
df.drop(df.index[range(680_000)], inplace=True)
df['date'] = pd.to_datetime(df['date'])
# df.set_index('date', inplace=True)
df = df.reset_index(drop=True)
df = df.dropna()
df.columns
df.shape
df_day_avg = df

filler = MissingValuesFiller()
scaler = Scaler()
series = scaler.fit_transform(
    ...
).astype(np.float32)
series_feaures = scaler.fit_transform(
    ...
).astype(np.float32)

train, val = series.split_after(0.7)
# train_features, val_features = series_feaures.split_after(0.7)
train_features = series_feaures

import torch
print(torch.cuda.device_count())
print(torch.cuda.is_available())
print(torch.cuda.current_device())

model_nbeats = NBEATSModel(
    ...
)
model_nbeats.fit(train, verbose=True, past_covariates=train_features)

Model fit works well. After that, when predict starts (I debugged the code), the system ends the process in the second predict loop without an error code.

When I try to test in a Jupyter notebook, after defining the model (model_nbeats = NBEATSModel(...)), I get:

/home/solymr/miniconda3/envs/darts/lib/python3.9/site-packages/torch/random.py:99: UserWarning: CUDA reports that you have 2 available devices, and you have used fork_rng without explicitly specifying which devices are being used. For safety, we initialize every CUDA device by default, which can be quite slow if you have a lot of GPUs. If you know that you are only making use of a few CUDA devices, set the environment variable CUDA_VISIBLE_DEVICES or the 'devices' keyword argument of fork_rng with the set of devices you are actually using. For example, if you are using CPU only, set CUDA_VISIBLE_DEVICES= or devices=[]; if you are using GPU 0 only, set CUDA_VISIBLE_DEVICES=0 or devices=[0]. To initialize all devices and suppress this warning, set the 'devices' keyword argument to

When I ran the model fit, the run ended with a system error message.
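Side note on the UserWarning above: it comes from torch.random.fork_rng being entered without an explicit device list while two GPUs are visible. A minimal sketch of what the warning is asking for, purely illustrative and not part of the original script, with the device list assumed to match CUDA_VISIBLE_DEVICES="0,1":

import torch
from torch.random import fork_rng

# Passing an explicit device list tells fork_rng which CUDA RNG states to
# preserve, so it does not have to initialize every visible device.
with fork_rng(devices=[0, 1]):
    torch.manual_seed(42)

In this case, fork_rng is entered inside Darts' random_method decorator, so the devices argument is not directly reachable from user code; the sketch only illustrates what the warning refers to.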
I am also having problems with multi-GPU training. I have tried with Jupyter and without it (a .py file); all DDP variants give the same error:
I also tried to use DP in the .py file; the message ends as follows (apologies, the trace was cut by my ssh client):
I also tried installing DeepSpeed and using its strategies, which also errored. This is a fresh pip install of darts in a Conda environment. Any advice would be highly appreciated! (By the way: the same code runs like a charm in the same environment on a single GPU.)
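For context, the strategy in all of these attempts is selected through PyTorch Lightning's trainer arguments, which Darts forwards via pl_trainer_kwargs. A rough sketch of the kind of configuration being swapped between runs (placeholder values, not the exact settings from this comment):

from darts.models import NBEATSModel

model = NBEATSModel(
    input_chunk_length=24,
    output_chunk_length=6,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": [0, 1],
        # The string below is what varied between attempts: "ddp", "ddp_spawn",
        # "ddp_notebook", "dp", or a DeepSpeed strategy such as "deepspeed".
        "strategy": "ddp",
    },
)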
Unfortunately, we haven't yet had the chance to properly test Darts on a multi-GPU setup. There's also an ongoing discussion about this here: https://gitter.im/u8darts/darts?at=63847beabcdb0060b8408787
Hi Julien, I am from Bilkent University and I can open our server with multiple GPUs (2 × 1080 Ti) for your work. Please contact me on Skype (live:f337e4b1889436b3) or at sakaryaemre@gmail.com if you are interested.
Any news on this? Has anyone managed to train, e.g., an N-HiTS model on multiple GPUs?
I added this to our backlog so one of us can take a look when we have some time (thanks for the kind offer @gunewar!). In the meantime, any PR/fix proposal is welcome.
See also: #1385
I managed to get the multi-GPU setup working for me. Steps needed:
(Mind you, the main pattern is necessary! See this, even though it is NOT a Windows environment.) Everything else is left at its defaults, so no special trainer args, only the following pattern:
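Roughly, it looks like this (a sketch only; I am assuming "main pattern" refers to the standard if __name__ == "__main__": entry-point guard, and the model settings below are placeholders, not the exact ones used here):

import numpy as np
from darts import TimeSeries
from darts.models import NBEATSModel

def main():
    # Dummy series just to keep the sketch self-contained.
    series = TimeSeries.from_values(np.random.rand(200).astype(np.float32))
    model = NBEATSModel(
        input_chunk_length=24,
        output_chunk_length=6,
        n_epochs=1,
        pl_trainer_kwargs={"accelerator": "gpu", "devices": [0, 1]},
    )
    model.fit(series)

# DDP launches extra worker processes that import and execute the script again;
# the guard keeps the module-level training code from re-running in every worker.
if __name__ == "__main__":
    main()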
This results in the expected multi-GPU strategy being used. If you think it helps, I'm happy to do a PR for this "wonderful" two-liner in my fork @hrzn
Update: this proved to be more stable.
Nice, thanks for sharing your solution and opening a PR @solalatus!
Describe the bug
I tried to use darts with multiple GPUs but keep getting a "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method" error in a Jupyter notebook (.ipynb).
I also tried with a .py file; this time the model fits, but during the second prediction the system exits without any errors or warnings.
"pred_series = model_nbeats.historical_forecasts(
)
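For what it's worth, the RuntimeError quoted above is the generic CUDA-and-fork conflict, and its message already points at a workaround. A minimal sketch of forcing the spawn start method at the top of a script (my own guess at a workaround, not something taken from the Darts documentation):

import torch.multiprocessing as mp

if __name__ == "__main__":
    # CUDA cannot be re-initialized in a forked child process; "spawn" starts
    # workers in a fresh interpreter instead.
    mp.set_start_method("spawn", force=True)
    # ... build the model and call fit() / historical_forecasts() here ...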
To Reproduce
model_nbeats = NBEATSModel(
    input_chunk_length=1440,
    output_chunk_length=6,
    generic_architecture=True,
    num_stacks=50,
    num_blocks=1,
    num_layers=4,
    layer_widths=512,
    n_epochs=1,
    nr_epochs_val_period=1,
    batch_size=1024,
    model_name="nbeats_run",
    force_reset=True,
    random_state=None,
    pl_trainer_kwargs={
        "accelerator": "gpu",
        "devices": [0, 1],  # use all available GPUs
        # "auto_select_gpus": True,
        "strategy": "ddp_notebook_find_unused_parameters_false",
    },
)
This is my model.
model_nbeats.fit(train, verbose=True, past_covariates=train_features, num_loader_workers=2)
And this is how I fit it.
Expected behavior
I guess there is some problem with the random_method decorator in darts' utils/torch.py:

/miniconda3/envs/darts/lib/python3.9/site-packages/darts/utils/torch.py:112, in random_method.<locals>.decorator(self, *args, **kwargs)
    110 with fork_rng():
    111     manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
--> 112     return decorated(self, *args, **kwargs)
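For completeness, the fork-related RuntimeError can be reproduced outside Darts entirely; a minimal sketch (my own illustration, unrelated to the code in this issue) of why CUDA and forked subprocesses do not mix:

import torch
import torch.multiprocessing as mp

def worker():
    # In a forked child this raises:
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess ..."
    return torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.cuda.init()  # the parent process touches CUDA first
    p = mp.get_context("fork").Process(target=worker)
    p.start()
    p.join()  # the child raises; with a "spawn" context it would not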
System (please complete the following information):
Additional context
Could you publish documentation on using GPUs, data_parallel, and distributed_data_parallel?