Fine-tuning process stops during training without error message #508
We only train for one epoch.
@LinB203 Thx!

UPDATE

From v1.2:

```python
def train_all_epoch(prof_=None):
    for epoch in range(first_epoch, args.num_train_epochs):
        progress_info.train_loss = 0.0
        if progress_info.global_step >= args.max_train_steps:
            return True

        for step, data_item in enumerate(train_dataloader):
            if train_one_step(step, data_item, prof_):
                break

            if step >= 2 and torch_npu is not None and npu_config is not None:
                npu_config.free_mm()
```

and

```python
if npu_config is not None and npu_config.on_npu and npu_config.profiling:
    experimental_config = torch_npu.profiler._ExperimentalConfig(
        profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
        aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization
    )
    profile_output_path = f"/home/image_data/npu_profiling_t2v/{os.getenv('PROJECT_NAME', 'local')}"
    os.makedirs(profile_output_path, exist_ok=True)
    with torch_npu.profiler.profile(
        activities=[torch_npu.profiler.ProfilerActivity.NPU, torch_npu.profiler.ProfilerActivity.CPU],
        with_stack=True,
        record_shapes=True,
        profile_memory=True,
        experimental_config=experimental_config,
        schedule=torch_npu.profiler.schedule(wait=npu_config.profiling_step, warmup=0, active=1, repeat=1,
                                             skip_first=0),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(f"{profile_output_path}/")
    ) as prof:
        train_all_epoch(prof)
else:
    train_all_epoch()

accelerator.wait_for_everyone()
accelerator.end_training()
if get_sequence_parallel_state():
    destroy_sequence_parallel_group()
```

So in v1.3, do I just replace it like this?

```python
if npu_config is not None and npu_config.on_npu and npu_config.profiling:
    experimental_config = torch_npu.profiler._ExperimentalConfig(
        profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
        aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization
    )
    profile_output_path = f"/home/image_data/npu_profiling_t2v/{os.getenv('PROJECT_NAME', 'local')}"
    os.makedirs(profile_output_path, exist_ok=True)
    with torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU,
        ],
        with_stack=True,
        record_shapes=True,
        profile_memory=True,
        experimental_config=experimental_config,
        schedule=torch_npu.profiler.schedule(
            wait=npu_config.profiling_step, warmup=0, active=1, repeat=1, skip_first=0
        ),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(f"{profile_output_path}/")
    ) as prof:
        train_one_epoch(prof)
else:
    if args.enable_profiling:
        with torch.profiler.profile(
            activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
            schedule=torch.profiler.schedule(wait=5, warmup=1, active=1, repeat=1, skip_first=0),
            on_trace_ready=torch.profiler.tensorboard_trace_handler('./gpu_profiling_active_1_delmask_delbkmask_andvaemask_curope_gpu'),
            record_shapes=True,
            profile_memory=True,
            with_stack=True
        ) as prof:
            # train_one_epoch(prof)
            train_all_epoch(prof)
    else:
        # train_one_epoch()
        train_all_epoch()

accelerator.wait_for_everyone()
accelerator.end_training()
if get_sequence_parallel_state():
    destroy_sequence_parallel_group()
```
For multiple epochs, setting this in the code introduces significant complexity for resuming training. Therefore, we default to training for only one epoch. In large model training, the model often does not see the complete dataset. I understand your requirement, as fine-tuning typically involves a smaller dataset that often requires multiple epochs. As a temporary solution, you can repeat your training data in the data JSON file.
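In case it is useful, here is a minimal sketch of that workaround (not from the repo; the helper name and output file name are illustrative, and it assumes the annotation file, like the data_json_1.json mentioned below, is a plain JSON list of sample records). Duplicating every entry N times makes the single default epoch pass over the data roughly N times.

```python
# Minimal sketch (not from the repo): repeat the entries of a training data JSON
# so that one "epoch" of the default loop covers the data several times.
import json


def repeat_data_json(src_path, dst_path, repeats):
    """Duplicate every record in a JSON list `repeats` times and write a new file."""
    with open(src_path, "r") as f:
        samples = json.load(f)  # expected to be a list of sample records
    with open(dst_path, "w") as f:
        json.dump(samples * repeats, f)  # list repetition duplicates every entry


# Hypothetical file names: emulate roughly 10 epochs over a small fine-tuning set
repeat_data_json("data_json_1.json", "data_json_1_x10.json", repeats=10)
```

If the annotation file is structured differently (e.g., a dict whose value holds the sample list), the duplication step would need to be adjusted accordingly.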
Ah, I see your point. Just for clarification, do you mean that I shouldn't use multiple epochs? I'm already in the middle of training with the multi-epoch modification above, and I'm a bit concerned about whether it's okay to stop the training process, convert back to "one-epoch training" using your temporary solution, and resume fine-tuning from the last checkpoint. Much appreciated! Thanks for the help.
This is the script I used for fine-tuning.
And below is the training process.
As you can see at the bottom, the process stops at step 5 (sometimes at 2 or 3; it's random) without any error message.
A minor change I made
Any guesses why this is happening?
Appreciate your help!
UPDATE
Perhaps I used too small a number of videos for training. I copied the same video dataset many times in data_json_1.json, and the process now seems to be working. Before, I used fewer than 20 videos.
Now the step count goes up to 45/100 (though it still crashes before completion).
So why does the training process terminate before completion when too few videos are used?