fine tuning process stops during training without error message #508

junsukha opened this issue Oct 27, 2024 · 4 comments

junsukha commented Oct 27, 2024

This is the script I used for fine-tuning.

export HF_DATASETS_OFFLINE=1 
export TRANSFORMERS_OFFLINE=1
export PDSH_RCMD_TYPE=ssh
# NCCL setting
export GLOO_SOCKET_IFNAME=bond0
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_HCA=mlx5_10:1,mlx5_11:1,mlx5_12:1,mlx5_13:1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=162
export NCCL_IB_TIMEOUT=25
export NCCL_PXN_DISABLE=0
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_ALGO=Ring
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NCCL_IB_RETRY_CNT=32
# export NCCL_ALGO=Tree
export OUTPUT_DIR="/mnt/singularity_home/jsha/repos/Open-Sora-Plan/output"

CUDA_VISIBLE_DEVICES=0,1,2 accelerate launch \
    --config_file scripts/accelerate_configs/deepspeed_zero2_config.yaml \
    opensora/train/train_t2v_diffusers.py \
    --model OpenSoraT2V_v1_3-2B/122 \
    --text_encoder_name_1 "/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl" \
    --cache_dir "../../cache_dir/" \
    --dataset t2v \
    --data "/mnt/singularity_home/jsha/repos/Open-Sora-Plan/open_sora_plan_dummy_data/training/data.txt" \
    --ae WFVAEModel_D8_4x8x8 \
    --ae_path "/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae" \
    --sample_rate 1 \
    --num_frames 65 \
    --max_height 352 \
    --max_width 640 \
    --interpolation_scale_t 1.0 \
    --interpolation_scale_h 1.0 \
    --interpolation_scale_w 1.0 \
    --gradient_checkpointing \
    --train_batch_size=1 \
    --dataloader_num_workers 16 \
    --gradient_accumulation_steps=1 \
    --max_train_steps=100 \
    --learning_rate=1e-5 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --mixed_precision="bf16" \
    --report_to="tensorboard" \
    --checkpointing_steps=500 \
    --allow_tf32 \
    --model_max_length 512 \
    --use_ema \
    --ema_start_step 0 \
    --cfg 0.1 \
    --resume_from_checkpoint="latest" \
    --speed_factor 1.0 \
    --ema_decay 0.9999 \
    --drop_short_ratio 0.0 \
    --pretrained "" \
    --hw_stride 32 \
    --sparse1d --sparse_n 4 \
    --train_fps 16 \
    --seed 1234 \
    --trained_data_global_step 0 \
    --group_data \
    --use_decord \
    --prediction_type "v_prediction" \
    --snr_gamma 5.0 \
    --force_resolution \
    --rescale_betas_zero_snr \
    --output_dir=$OUTPUT_DIR

And below is the training log.

10/27/2024 07:56:45 - INFO - __main__ -   Num Epochs = 20
10/27/2024 07:56:45 - INFO - __main__ -   Instantaneous batch size per device = 1
10/27/2024 07:56:45 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
10/27/2024 07:56:45 - INFO - __main__ -   Gradient Accumulation steps = 1
10/27/2024 07:56:45 - INFO - __main__ -   Total optimization steps = 100
10/27/2024 07:56:45 - INFO - __main__ -   Total optimization steps (num_update_steps_per_epoch) = 5
10/27/2024 07:56:45 - INFO - __main__ -   Total training parameters = 2.7719816 B
10/27/2024 07:56:45 - INFO - __main__ -   AutoEncoder = WFVAEModel_D8_4x8x8; Dtype = torch.bfloat16; Parameters = 0.147347724 B
10/27/2024 07:56:45 - INFO - __main__ -   Text_enc_1 = /mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl; Dtype = torch.bfloat16; Parameters = 5.65517312 B
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|                                                                                                                               | 0/100 [00:00<?, ?it/s]  Args = Namespace(dataset='t2v', data='/mnt/singularity_home/jsha/repos/Open-Sora-Plan/open_sora_plan_dummy_data/training/data.txt', sample_rate=1, train_fps=16, drop_short_ratio=0.0, speed_factor=1.0, num_frames=65, max_height=352, max_width=640, max_hxw=None, min_hxw=None, ood_img_ratio=0.0, use_img_from_vid=False, model_max_length=512, cfg=0.1, dataloader_num_workers=16, train_batch_size=1, group_data=True, hw_stride=32, force_resolution=True, trained_data_global_step=0, use_decord=True, vae_fp32=False, extra_save_mem=False, model='OpenSoraT2V_v1_3-2B/122', enable_tiling=False, interpolation_scale_h=1.0, interpolation_scale_w=1.0, interpolation_scale_t=1.0, ae='WFVAEModel_D8_4x8x8', ae_path='/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae', text_encoder_name_1='/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl', text_encoder_name_2=None, cache_dir='../../cache_dir/', pretrained='', sparse1d=True, sparse_n=4, skip_connection=False, cogvideox_scheduler=False, v1_5_scheduler=False, rf_scheduler=False, weighting_scheme='logit_normal', logit_mean=0.0, logit_std=1.0, mode_scale=1.29, gradient_checkpointing=True, snr_gamma=5.0, use_ema=True, ema_decay=0.9999, ema_start_step=0, offload_ema=False, foreach_ema=False, noise_offset=0.0, prediction_type='v_prediction', rescale_betas_zero_snr=True, log_interval=10, enable_profiling=False, num_sampling_steps=20, guidance_scale=4.5, enable_tracker=False, seed=1234, output_dir='/mnt/singularity_home/jsha/repos/Open-Sora-Plan/output', checkpoints_total_limit=None, checkpointing_steps=500, resume_from_checkpoint='latest', logging_dir='logs', report_to='tensorboard', num_train_epochs=20, max_train_steps=100, gradient_accumulation_steps=1, optimizer='adamW', learning_rate=1e-05, lr_warmup_steps=0, use_8bit_adam=False, adam_beta1=0.9, adam_beta2=0.999, prodigy_decouple=True, adam_weight_decay=0.01, adam_weight_decay_text_encoder=None, adam_epsilon=1e-15, prodigy_use_bias_correction=True, prodigy_safeguard_warmup=True, max_grad_norm=1.0, prodigy_beta3=None, lr_scheduler='constant', allow_tf32=True, mixed_precision='bf16', local_rank=-1, sp_size=1, train_sp_batch_size=1, ae_stride_t=4, ae_stride_h=8, ae_stride_w=8, ae_stride=8, patch_size=2, patch_size_t=1, patch_size_h=2, patch_size_w=2, stride_t=4, stride=16, latent_size_t=17, total_batch_size=2)
  noise_scheduler = DDPMScheduler {
  "_class_name": "DDPMScheduler",
  "_diffusers_version": "0.30.2",
  "beta_end": 0.02,
  "beta_schedule": "linear",
  "beta_start": 0.0001,
  "clip_sample": true,
  "clip_sample_range": 1.0,
  "dynamic_thresholding_ratio": 0.995,
  "num_train_timesteps": 1000,
  "prediction_type": "v_prediction",
  "rescale_betas_zero_snr": true,
  "sample_max_value": 1.0,
  "steps_offset": 0,
  "thresholding": false,
  "timestep_spacing": "leading",
  "trained_betas": null,
  "variance_type": "fixed_small"
}

Skip the data of 0 step!
Skip the data of 0 step!
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Steps:   5%|████▋                                                                                        | 5/100 [00:33<10:27,  6.61s/it, lr=1e-5, step_loss=0.734]
root@jsh_open_sora_plan_v13_clean:/mnt/singularity_home/jsha/repos/Open-Sora-Plan#

As you can see at the bottom, the process stops at step 5 (sometimes 2 or 3; it's random) without any error message.

A minor change I made
Any guesses why this is happening?
Appreciate your help!

UPDATE

Perhaps I used too small a number of videos for training. I copied the same video dataset many times in data_json_1.json, and now the process seems to be working. Before, I used fewer than 20 videos.
Now the step count goes up to 45/100 (but it still crashes before completion).

So why is the training process terminated before completion when too few videos are used?


LinB203 commented Oct 27, 2024

We only train for min(steps of 1 epoch, max_step).
For example, if we have 64 videos and the global batch size is 4, one epoch is 16 steps, so training stops after 16 steps even if max_train_steps is 1000 or any other number.
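In other words, the effective number of optimizer steps is capped by a single pass over the dataloader. Below is a minimal sketch of that arithmetic; the function and variable names are illustrative, not the actual ones in train_t2v_diffusers.py:

    import math

    def effective_train_steps(num_samples, per_device_batch_size, num_processes,
                              gradient_accumulation_steps, max_train_steps):
        # Global batch size = per-device batch * number of processes * grad accumulation.
        global_batch_size = per_device_batch_size * num_processes * gradient_accumulation_steps
        # Optimizer updates produced by one pass over the data (one epoch).
        steps_per_epoch = math.ceil(num_samples / global_batch_size)
        # Training stops at whichever comes first: the end of that single epoch or max_train_steps.
        return min(steps_per_epoch, max_train_steps)

    # LinB203's example: 64 videos, global batch size 4 -> 16 steps, even with max_train_steps=1000.
    print(effective_train_steps(64, 1, 4, 1, 1000))  # 16

This is also consistent with the log above: total train batch size 2, num_update_steps_per_epoch = 5, and the run stopping at step 5/100.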

junsukha reopened this Oct 29, 2024
junsukha (Author) replied, quoting the above:

We only train for min(steps of 1 epoch, max_step). For example, if we have 64 videos and the global batch size is 4, one epoch is 16 steps, so training stops after 16 steps even if max_train_steps is 1000 or any other number.

@LinB203 Thx!
How do you increase the number of epochs to train for?

UPDATE

In v1.2's train_t2v_diffusers.py, I found:

    def train_all_epoch(prof_=None):
        for epoch in range(first_epoch, args.num_train_epochs):
            progress_info.train_loss = 0.0
            if progress_info.global_step >= args.max_train_steps:
                return True

            for step, data_item in enumerate(train_dataloader):
                if train_one_step(step, data_item, prof_):
                    break

                if step >= 2 and torch_npu is not None and npu_config is not None:
                    npu_config.free_mm()

and train_all_epoch() is called here in v1.2:

    if npu_config is not None and npu_config.on_npu and npu_config.profiling:
        experimental_config = torch_npu.profiler._ExperimentalConfig(
            profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
            aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization
        )
        profile_output_path = f"/home/image_data/npu_profiling_t2v/{os.getenv('PROJECT_NAME', 'local')}"
        os.makedirs(profile_output_path, exist_ok=True)

        with torch_npu.profiler.profile(
                activities=[torch_npu.profiler.ProfilerActivity.NPU, torch_npu.profiler.ProfilerActivity.CPU],
                with_stack=True,
                record_shapes=True,
                profile_memory=True,
                experimental_config=experimental_config,
                schedule=torch_npu.profiler.schedule(wait=npu_config.profiling_step, warmup=0, active=1, repeat=1,
                                                     skip_first=0),
                on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(f"{profile_output_path}/")
        ) as prof:
            train_all_epoch(prof)
    else:
        train_all_epoch()
    accelerator.wait_for_everyone()
    accelerator.end_training()
    if get_sequence_parallel_state():
        destroy_sequence_parallel_group()

So in v1.3, do I just replace train_one_epoch() with train_all_epoch() as below in train_t2v_diffusers.py?

   if npu_config is not None and npu_config.on_npu and npu_config.profiling:
        experimental_config = torch_npu.profiler._ExperimentalConfig(
            profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
            aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization
        )
        profile_output_path = f"/home/image_data/npu_profiling_t2v/{os.getenv('PROJECT_NAME', 'local')}"
        os.makedirs(profile_output_path, exist_ok=True)

        with torch_npu.profiler.profile(
                activities=[
                    torch_npu.profiler.ProfilerActivity.CPU, 
                    torch_npu.profiler.ProfilerActivity.NPU, 
                    ],
                with_stack=True,
                record_shapes=True,
                profile_memory=True,
                experimental_config=experimental_config,
                schedule=torch_npu.profiler.schedule(
                    wait=npu_config.profiling_step, warmup=0, active=1, repeat=1, skip_first=0
                    ),
                on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(f"{profile_output_path}/")
        ) as prof:
            train_one_epoch(prof)
    else:
        if args.enable_profiling:
            with torch.profiler.profile(
                activities=[
                    torch.profiler.ProfilerActivity.CPU, 
                    torch.profiler.ProfilerActivity.CUDA, 
                    ], 
                schedule=torch.profiler.schedule(wait=5, warmup=1, active=1, repeat=1, skip_first=0),
                on_trace_ready=torch.profiler.tensorboard_trace_handler('./gpu_profiling_active_1_delmask_delbkmask_andvaemask_curope_gpu'),
                record_shapes=True,
                profile_memory=True,
                with_stack=True
            ) as prof:
                # train_one_epoch(prof)
                train_all_epoch(prof)
        else:
            # train_one_epoch()
            train_all_epoch()
    accelerator.wait_for_everyone()
    accelerator.end_training()
    if get_sequence_parallel_state():
        destroy_sequence_parallel_group()


LinB203 commented Oct 30, 2024

For multiple epochs, setting this in the code introduces significant complexity for resuming training. Therefore, we default to training for only one epoch.

In large model training, the model often does not see the complete dataset. I understand your requirement, as fine-tuning typically involves a smaller dataset that often requires multiple epochs.

As a temporary solution, you can repeat your training data in train_data.txt to achieve the equivalent of multiple epochs within one epoch. This approach greatly simplifies the training code.
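For example, here is a minimal sketch of that workaround, assuming the annotation file is a JSON list of per-video entries like the data_json_1.json mentioned above (the file names and the repeat count are made up for illustration):

    import json

    SRC = "data_json_1.json"        # original annotation file (name taken from this thread)
    DST = "data_json_1_x10.json"    # repeated copy used to emulate multiple epochs
    REPEATS = 10                    # roughly equivalent to training for 10 epochs

    with open(SRC) as f:
        samples = json.load(f)      # assumed: a list of per-video entries

    # Duplicate the list so a single "epoch" walks the data REPEATS times.
    with open(DST, "w") as f:
        json.dump(samples * REPEATS, f)

    print(f"{len(samples)} samples -> {len(samples) * REPEATS} samples")

Then point the entry in the txt file passed via --data at the repeated JSON instead of the original (or, if the txt file itself lists the training entries, simply repeat its lines as LinB203 describes).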


junsukha commented Oct 30, 2024

For multiple epochs, setting this in the code introduces significant complexity for resuming training. Therefore, we default to training for only one epoch.

In large model training, the model often does not see the complete dataset. I understand your requirement, as fine-tuning typically involves a smaller dataset that often requires multiple epochs.

As a temporary solution, you can repeat your training data in train_data.txt to achieve the equivalent of multiple epochs within one epoch. This approach greatly simplifies the training code.

Ah. I see your point.

Just for clarification, do you mean that I shouldn't use multiple epochs? I'm already in the middle of training using train_all_epoch(), so I'm wondering if you're saying my fine-tuning process won't work eventually. It has "worked" fine so far, meaning it at least doesn't crash (I've also already tried stopping the training process and resuming from the last checkpoint).

I'm also a bit concerned about whether it's okay to stop the training process, convert back to "one epoch training" using your temporary solution, and resume fine-tuning from the last checkpoint.

Much appreciated! Thanks for the help.
