fine tuning process stops during training without error message #508

junsukha opened this issue Oct 27, 2024 · 4 comments

junsukha commented Oct 27, 2024

This is the script I used for fine-tuning.

export HF_DATASETS_OFFLINE=1 
export TRANSFORMERS_OFFLINE=1
export PDSH_RCMD_TYPE=ssh
# NCCL setting
export GLOO_SOCKET_IFNAME=bond0
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_HCA=mlx5_10:1,mlx5_11:1,mlx5_12:1,mlx5_13:1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TC=162
export NCCL_IB_TIMEOUT=25
export NCCL_PXN_DISABLE=0
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_ALGO=Ring
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NCCL_IB_RETRY_CNT=32
# export NCCL_ALGO=Tree
export OUTPUT_DIR="/mnt/singularity_home/jsha/repos/Open-Sora-Plan/output"

CUDA_VISIBLE_DEVICES=0,1,2 accelerate launch \
    --config_file scripts/accelerate_configs/deepspeed_zero2_config.yaml \
    opensora/train/train_t2v_diffusers.py \
    --model OpenSoraT2V_v1_3-2B/122 \
    --text_encoder_name_1 "/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl" \
    --cache_dir "../../cache_dir/" \
    --dataset t2v \
    --data "/mnt/singularity_home/jsha/repos/Open-Sora-Plan/open_sora_plan_dummy_data/training/data.txt" \
    --ae WFVAEModel_D8_4x8x8 \
    --ae_path "/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae" \
    --sample_rate 1 \
    --num_frames 65 \
    --max_height 352 \
    --max_width 640 \
    --interpolation_scale_t 1.0 \
    --interpolation_scale_h 1.0 \
    --interpolation_scale_w 1.0 \
    --gradient_checkpointing \
    --train_batch_size=1 \
    --dataloader_num_workers 16 \
    --gradient_accumulation_steps=1 \
    --max_train_steps=100 \
    --learning_rate=1e-5 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --mixed_precision="bf16" \
    --report_to="tensorboard" \
    --checkpointing_steps=500 \
    --allow_tf32 \
    --model_max_length 512 \
    --use_ema \
    --ema_start_step 0 \
    --cfg 0.1 \
    --resume_from_checkpoint="latest" \
    --speed_factor 1.0 \
    --ema_decay 0.9999 \
    --drop_short_ratio 0.0 \
    --pretrained "" \
    --hw_stride 32 \
    --sparse1d --sparse_n 4 \
    --train_fps 16 \
    --seed 1234 \
    --trained_data_global_step 0 \
    --group_data \
    --use_decord \
    --prediction_type "v_prediction" \
    --snr_gamma 5.0 \
    --force_resolution \
    --rescale_betas_zero_snr \
    --output_dir=$OUTPUT_DIR

And below is the training log.

10/27/2024 07:56:45 - INFO - __main__ -   Num Epochs = 20
10/27/2024 07:56:45 - INFO - __main__ -   Instantaneous batch size per device = 1
10/27/2024 07:56:45 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 2
10/27/2024 07:56:45 - INFO - __main__ -   Gradient Accumulation steps = 1
10/27/2024 07:56:45 - INFO - __main__ -   Total optimization steps = 100
10/27/2024 07:56:45 - INFO - __main__ -   Total optimization steps (num_update_steps_per_epoch) = 5
10/27/2024 07:56:45 - INFO - __main__ -   Total training parameters = 2.7719816 B
10/27/2024 07:56:45 - INFO - __main__ -   AutoEncoder = WFVAEModel_D8_4x8x8; Dtype = torch.bfloat16; Parameters = 0.147347724 B
10/27/2024 07:56:45 - INFO - __main__ -   Text_enc_1 = /mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl; Dtype = torch.bfloat16; Parameters = 5.65517312 B
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|                                                                                                                               | 0/100 [00:00<?, ?it/s]  Args = Namespace(dataset='t2v', data='/mnt/singularity_home/jsha/repos/Open-Sora-Plan/open_sora_plan_dummy_data/training/data.txt', sample_rate=1, train_fps=16, drop_short_ratio=0.0, speed_factor=1.0, num_frames=65, max_height=352, max_width=640, max_hxw=None, min_hxw=None, ood_img_ratio=0.0, use_img_from_vid=False, model_max_length=512, cfg=0.1, dataloader_num_workers=16, train_batch_size=1, group_data=True, hw_stride=32, force_resolution=True, trained_data_global_step=0, use_decord=True, vae_fp32=False, extra_save_mem=False, model='OpenSoraT2V_v1_3-2B/122', enable_tiling=False, interpolation_scale_h=1.0, interpolation_scale_w=1.0, interpolation_scale_t=1.0, ae='WFVAEModel_D8_4x8x8', ae_path='/gpfs/vision/drag_video/HF_downloads/Open-Sora-Plan-v1.3.0/vae', text_encoder_name_1='/mnt/singularity_home/jsha/repos/Open-Sora-Plan/weights/google/mt5-xxl', text_encoder_name_2=None, cache_dir='../../cache_dir/', pretrained='', sparse1d=True, sparse_n=4, skip_connection=False, cogvideox_scheduler=False, v1_5_scheduler=False, rf_scheduler=False, weighting_scheme='logit_normal', logit_mean=0.0, logit_std=1.0, mode_scale=1.29, gradient_checkpointing=True, snr_gamma=5.0, use_ema=True, ema_decay=0.9999, ema_start_step=0, offload_ema=False, foreach_ema=False, noise_offset=0.0, prediction_type='v_prediction', rescale_betas_zero_snr=True, log_interval=10, enable_profiling=False, num_sampling_steps=20, guidance_scale=4.5, enable_tracker=False, seed=1234, output_dir='/mnt/singularity_home/jsha/repos/Open-Sora-Plan/output', checkpoints_total_limit=None, checkpointing_steps=500, resume_from_checkpoint='latest', logging_dir='logs', report_to='tensorboard', num_train_epochs=20, max_train_steps=100, gradient_accumulation_steps=1, optimizer='adamW', learning_rate=1e-05, lr_warmup_steps=0, use_8bit_adam=False, adam_beta1=0.9, adam_beta2=0.999, prodigy_decouple=True, adam_weight_decay=0.01, adam_weight_decay_text_encoder=None, adam_epsilon=1e-15, prodigy_use_bias_correction=True, prodigy_safeguard_warmup=True, max_grad_norm=1.0, prodigy_beta3=None, lr_scheduler='constant', allow_tf32=True, mixed_precision='bf16', local_rank=-1, sp_size=1, train_sp_batch_size=1, ae_stride_t=4, ae_stride_h=8, ae_stride_w=8, ae_stride=8, patch_size=2, patch_size_t=1, patch_size_h=2, patch_size_w=2, stride_t=4, stride=16, latent_size_t=17, total_batch_size=2)
  noise_scheduler = DDPMScheduler {
  "_class_name": "DDPMScheduler",
  "_diffusers_version": "0.30.2",
  "beta_end": 0.02,
  "beta_schedule": "linear",
  "beta_start": 0.0001,
  "clip_sample": true,
  "clip_sample_range": 1.0,
  "dynamic_thresholding_ratio": 0.995,
  "num_train_timesteps": 1000,
  "prediction_type": "v_prediction",
  "rescale_betas_zero_snr": true,
  "sample_max_value": 1.0,
  "steps_offset": 0,
  "thresholding": false,
  "timestep_spacing": "leading",
  "trained_betas": null,
  "variance_type": "fixed_small"
}

Skip the data of 0 step!
Skip the data of 0 step!
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py:1652: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:83.)
  total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)])
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Steps:   5%|████▋                                                                                        | 5/100 [00:33<10:27,  6.61s/it, lr=1e-5, step_loss=0.734]
root@jsh_open_sora_plan_v13_clean:/mnt/singularity_home/jsha/repos/Open-Sora-Plan#

As you can see at the bottom, the process stops at step 5 (sometimes 2 or 3; it's random) without any error message.

A minor change I made
Any guesses why this is happening?
Appreciate your help!

UPDATE

Perhaps I used too small a number of videos for training. I copied the same video dataset many times in data_json_1.json, and now the process seems to be working. Before, I used fewer than 20 videos.
Now the step count goes up to 45/100 (but it still crashes before completion).

So why is the training process terminated before completion when too few videos are used?


LinB203 commented Oct 27, 2024

We only train for min(steps of 1 epoch, max_step).
For example, if we have 64 videos and the global batch size is 4, one epoch is 16 steps, so training stops after 16 steps even if max_train_steps is 1000 or any other number.
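In other words, the effective number of optimizer steps is capped by a single pass over the dataloader. Below is a minimal sketch of that arithmetic; the function and variable names are illustrative, not the actual ones in train_t2v_diffusers.py:

    import math

    def effective_train_steps(num_samples, per_device_batch_size, num_processes,
                              gradient_accumulation_steps, max_train_steps):
        # Global batch size = per-device batch * number of processes * grad accumulation.
        global_batch_size = per_device_batch_size * num_processes * gradient_accumulation_steps
        # Optimizer updates produced by one pass over the data (one epoch).
        steps_per_epoch = math.ceil(num_samples / global_batch_size)
        # Training stops at whichever comes first: the end of that single epoch or max_train_steps.
        return min(steps_per_epoch, max_train_steps)

    # LinB203's example: 64 videos, global batch size 4 -> 16 steps, even with max_train_steps=1000.
    print(effective_train_steps(64, 1, 4, 1, 1000))  # 16

This is also consistent with the log above: total train batch size 2, num_update_steps_per_epoch = 5, and the run stopping at step 5/100.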

junsukha reopened this Oct 29, 2024
junsukha (Author) replied, quoting the above:

We only train for min(steps of 1 epoch, max_step). For example, if we have 64 videos and the global batch size is 4, one epoch is 16 steps, so training stops after 16 steps even if max_train_steps is 1000 or any other number.

@LinB203 Thx!
How do you increase the number of epochs to train for?

UPDATE

In v1.2's train_t2v_diffusers.py, I found:

    def train_all_epoch(prof_=None):
        for epoch in range(first_epoch, args.num_train_epochs):
            progress_info.train_loss = 0.0
            if progress_info.global_step >= args.max_train_steps:
                return True

            for step, data_item in enumerate(train_dataloader):
                if train_one_step(step, data_item, prof_):
                    break

                if step >= 2 and torch_npu is not None and npu_config is not None:
                    npu_config.free_mm()

and train_all_epoch() is called here in v1.2:

    if npu_config is not None and npu_config.on_npu and npu_config.profiling:
        experimental_config = torch_npu.profiler._ExperimentalConfig(
            profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
            aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization
        )
        profile_output_path = f"/home/image_data/npu_profiling_t2v/{os.getenv('PROJECT_NAME', 'local')}"
        os.makedirs(profile_output_path, exist_ok=True)

        with torch_npu.profiler.profile(
                activities=[torch_npu.profiler.ProfilerActivity.NPU, torch_npu.profiler.ProfilerActivity.CPU],
                with_stack=True,
                record_shapes=True,
                profile_memory=True,
                experimental_config=experimental_config,
                schedule=torch_npu.profiler.schedule(wait=npu_config.profiling_step, warmup=0, active=1, repeat=1,
                                                     skip_first=0),
                on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(f"{profile_output_path}/")
        ) as prof:
            train_all_epoch(prof)
    else:
        train_all_epoch()
    accelerator.wait_for_everyone()
    accelerator.end_training()
    if get_sequence_parallel_state():
        destroy_sequence_parallel_group()

So in v1.3, do I just replace train_one_epoch() with train_all_epoch() as below in train_t2v_diffusers.py?

   if npu_config is not None and npu_config.on_npu and npu_config.profiling:
        experimental_config = torch_npu.profiler._ExperimentalConfig(
            profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
            aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization
        )
        profile_output_path = f"/home/image_data/npu_profiling_t2v/{os.getenv('PROJECT_NAME', 'local')}"
        os.makedirs(profile_output_path, exist_ok=True)

        with torch_npu.profiler.profile(
                activities=[
                    torch_npu.profiler.ProfilerActivity.CPU, 
                    torch_npu.profiler.ProfilerActivity.NPU, 
                    ],
                with_stack=True,
                record_shapes=True,
                profile_memory=True,
                experimental_config=experimental_config,
                schedule=torch_npu.profiler.schedule(
                    wait=npu_config.profiling_step, warmup=0, active=1, repeat=1, skip_first=0
                    ),
                on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(f"{profile_output_path}/")
        ) as prof:
            train_one_epoch(prof)
    else:
        if args.enable_profiling:
            with torch.profiler.profile(
                activities=[
                    torch.profiler.ProfilerActivity.CPU, 
                    torch.profiler.ProfilerActivity.CUDA, 
                    ], 
                schedule=torch.profiler.schedule(wait=5, warmup=1, active=1, repeat=1, skip_first=0),
                on_trace_ready=torch.profiler.tensorboard_trace_handler('./gpu_profiling_active_1_delmask_delbkmask_andvaemask_curope_gpu'),
                record_shapes=True,
                profile_memory=True,
                with_stack=True
            ) as prof:
                # train_one_epoch(prof)
                train_all_epoch(prof)
        else:
            # train_one_epoch()
            train_all_epoch()
    accelerator.wait_for_everyone()
    accelerator.end_training()
    if get_sequence_parallel_state():
        destroy_sequence_parallel_group()


LinB203 commented Oct 30, 2024

For multiple epochs, setting this in the code introduces significant complexity for resuming training. Therefore, we default to training for only one epoch.

In large model training, the model often does not see the complete dataset. I understand your requirement, as fine-tuning typically involves a smaller dataset that often requires multiple epochs.

As a temporary solution, you can repeat your training data in train_data.txt to achieve the equivalent of multiple epochs within one epoch. This approach greatly simplifies the training code.
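For example, here is a minimal sketch of that workaround, assuming the annotation file is a JSON list of per-video entries like the data_json_1.json mentioned above (the file names and the repeat count are made up for illustration):

    import json

    SRC = "data_json_1.json"        # original annotation file (name taken from this thread)
    DST = "data_json_1_x10.json"    # repeated copy used to emulate multiple epochs
    REPEATS = 10                    # roughly equivalent to training for 10 epochs

    with open(SRC) as f:
        samples = json.load(f)      # assumed: a list of per-video entries

    # Duplicate the list so a single "epoch" walks the data REPEATS times.
    with open(DST, "w") as f:
        json.dump(samples * REPEATS, f)

    print(f"{len(samples)} samples -> {len(samples) * REPEATS} samples")

Then point the entry in the txt file passed via --data at the repeated JSON instead of the original (or, if the txt file itself lists the training entries, simply repeat its lines as LinB203 describes).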


junsukha commented Oct 30, 2024

For multiple epochs, setting this in the code introduces significant complexity for resuming training. Therefore, we default to training for only one epoch.

In large model training, the model often does not see the complete dataset. I understand your requirement, as fine-tuning typically involves a smaller dataset that often requires multiple epochs.

As a temporary solution, you can repeat your training data in train_data.txt to achieve the equivalent of multiple epochs within one epoch. This approach greatly simplifies the training code.

Ah. I see your point.

Just for clarification, do you mean that I shouldn't use multiple epochs? I'm already in the middle of training using train_all_epoch(), so I'm wondering if you're saying my fine-tuning process won't work eventually. It has "worked" fine so far, meaning it at least doesn't crash (I've also already tried stopping the training process and resuming from the last checkpoint).

I'm also a bit concerned about whether it's okay to stop the training process, convert back to "one epoch training" using your temporary solution, and resume fine-tuning from the last checkpoint.

Much appreciated! Thanks for the help.
