
Processes are terminated in multi-GPU setting when using multiple models and seeds #2519

Closed
KunzstBGR opened this issue Sep 2, 2024 · 3 comments

KunzstBGR commented Sep 2, 2024

Hi,
When comparing multiple models and multiple seeds using a nested loop, all processes are terminated when the loop switches from one model class to the next. Does anyone have an idea why? Maybe I'm doing this wrong, or is this a pytorch-lightning issue?

Error message:
Child process with PID 652 terminated with code 1. Forcefully terminating all other processes to avoid zombies

Relevant code snippet:

# ...
from pytorch_lightning import seed_everything
from pytorch_lightning.strategies import DDPStrategy

def create_params(input_chunk_length,
                  output_chunk_length,
                  quantiles,
                  batch_size,
                  n_epochs,
                  dropout):

    # ...

    pl_trainer_kwargs = {
        'strategy': DDPStrategy(process_group_backend='gloo', accelerator='gpu'),
        'devices': 4,
        # ...
    }
    # ...


def dl_model_training(df, 
                      seeds, 
                      input_chunk_length,
                      output_chunk_length, 
                      quantiles,
                      batch_size,
                      n_epochs,
                      dropout):

    # Some data processing ...

    for model_arch, model_class in [('NHiTS', NHiTSModel), ('TiDE', TiDEModel), ('TFT', TFTModel)]:
        for i in seeds:
            # Set the seed
            seed_everything(i, workers=True)

            # Define the model name with seed
            model_arch_seed = f'{model_arch}_gws_{i}'

            # Train the model
            model = model_class(
                **create_params(
                    input_chunk_length,
                    output_chunk_length,
                    quantiles,
                    batch_size,
                    n_epochs,
                    dropout
                ),
                model_name=model_arch_seed,
                work_dir=os.path.join(MODEL_PATH, model_arch)
            )

            # Fit the model
            model.fit(
                series=train_gws,
                past_covariates=train_cov,
                future_covariates=train_cov if model_arch in ['TFT', 'TiDE'] else None,
                val_series=val_gws,
                val_past_covariates=val_cov,
                val_future_covariates=val_cov if model_arch in ['TFT', 'TiDE'] else None,
                verbose=True
            )

            # Clean up to prevent memory issues
            del model
            gc.collect()
            torch.cuda.empty_cache()

if __name__ == '__main__':
    torch.multiprocessing.freeze_support()
    dl_model_training(df=gws_bb_subset,
                      seeds=seeds,
                      input_chunk_length=52,
                      output_chunk_length=16,
                      quantiles=None,
                      batch_size=4096,
                      n_epochs=10,
                      dropout=0.2)
  
madtoinou added the bug and gpu labels on Sep 2, 2024
madtoinou (Collaborator) commented:

Hi @KunzstBGR,

This issue seems to come from PyTorch Lightning rather than Darts.

It might also arise from the fact that you are using multiple GPUs. Can you check if it persists when you use devices=[0]?
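
Something like this should be enough for that test (a minimal sketch; everything else in create_params stays the same):

pl_trainer_kwargs = {
    # no DDPStrategy here: Lightning falls back to a single-process, single-device run
    'accelerator': 'gpu',
    'devices': [0],
    # ... rest of the trainer kwargs unchanged
}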

Have you tried changing the num_nodes parameter of DDP (based on the PyTorch documentation)?

Also, is it intentional that you don't save checkpoints or generate any kind of forecasts in your code snippet?

KunzstBGR (Author) commented:

Hi @madtoinou,
thanks for your quick response!

  • Multi-GPU: It works with one GPU. After some testing I realized that it has to do with the nested loop. If I switch the order of the for loops and move the seed_everything call up, the processes do not terminate during the switch from one model class to the next in the multi-GPU setting. Honestly, I can't quite wrap my head around why this works, but I'm glad it does:
for i in seeds:
    # Set the seed
    seed_everything(i, workers=True)

    for model_arch, model_class in [('TiDE', TiDEModel), ('NHiTS', NHiTSModel)]:
        # ... rest of the training loop unchanged
  • Nodes: If I set num_nodes higher than 1, the whole process gets stuck (I guess my GPUs are all on one node? I'm not too familiar with these things).

  • Checkpoints: I enabled checkpointing; here's the full code for the model parameters:

def create_params(input_chunk_length,
                  output_chunk_length,
                  quantiles,
                  batch_size,
                  n_epochs,
                  dropout):

    # Add metrics for evaluation
    torch_metrics = MetricCollection(
        [MeanSquaredError(), MeanAbsoluteError(), MeanAbsolutePercentageError()]
    )

    # Early stopping
    early_stopper = EarlyStopping(
        monitor='val_loss',
        patience=10,
        min_delta=0.001,
        mode='min'
    )

    lr_scheduler_cls = torch.optim.lr_scheduler.ExponentialLR
    lr_scheduler_kwargs = {'gamma': 0.999}
    lr_logger = LearningRateMonitor(logging_interval='step')  # log the learning rate ('step' or 'epoch')

    pl_trainer_kwargs = {
        'strategy': DDPStrategy(process_group_backend='gloo', accelerator='gpu'),
        'devices': 4,
        'val_check_interval': 0.5,
        'log_every_n_steps': 10,
        'enable_model_summary': True,
        'enable_checkpointing': True,
        'callbacks': [early_stopper, lr_logger],
        'gradient_clip_val': 1,
        'num_nodes': 1
    }

    return {
        'input_chunk_length': input_chunk_length,    # lookback window
        'output_chunk_length': output_chunk_length,  # forecast/lookahead window
        'use_reversible_instance_norm': True,
        'pl_trainer_kwargs': pl_trainer_kwargs,
        'likelihood': None,
        'loss_fn': torch.nn.MSELoss(),
        'save_checkpoints': True,  # checkpoint to retrieve the best performing model state
        'force_reset': True,       # previously existing models with the same name will be reset (& checkpoints will be discarded)
        'batch_size': batch_size,
        'n_epochs': n_epochs,
        'dropout': dropout,
        'log_tensorboard': True,
        'torch_metrics': torch_metrics,
        'lr_scheduler_cls': lr_scheduler_cls,
        'lr_scheduler_kwargs': lr_scheduler_kwargs
    }
  • Forecasts: I thought it's not recommended to do training and evaluation in one script when using multiple GPUs, since uneven inputs aren't supported and the distributed sampler influences the metrics (see Lightning-AI/pytorch-lightning#8375, "Support fit with DDP then test without DDP"). One could use torch.distributed.destroy_process_group() to switch to a single GPU afterwards, though; a rough sketch of what I mean is below. For now I have a separate script for creating the model forecasts, which is suboptimal because the data preprocessing (e.g. scaling) has to be repeated. How would you do this?
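
For reference, here is a rough, untested sketch of the destroy_process_group() idea (the rank handling and the commented-out checkpoint reload are assumptions on my side, not something I have running):

import sys
import torch

# ... after model.fit(...) has run on all GPUs via DDP ...

if torch.distributed.is_initialized():
    rank = torch.distributed.get_rank()
    # Tear down the DDP process group so the rest of the script runs single-process
    torch.distributed.destroy_process_group()
    if rank != 0:
        # Let the extra worker processes exit; only rank 0 continues
        sys.exit(0)

# Single-GPU forecasting on rank 0, e.g. by reloading the best checkpoint:
# best_model = model_class.load_from_checkpoint(
#     model_name=model_arch_seed,
#     work_dir=os.path.join(MODEL_PATH, model_arch),
#     best=True,
# )
# forecast = best_model.predict(n=output_chunk_length, series=val_gws, past_covariates=val_cov)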

madtoinou (Collaborator) commented:

Nice! I wouldn't be able to tell you why swapping the order of the loops fixed it, but as long as it works, that's great!

All good if you save the checkpoints and perform the evaluation in a separate script; I was just curious since it was not visible in the code snippet. It's indeed better to do it separately.

If the issue is solved, can you please close it?
