```yaml
args:
  checkpoint_activations: True # using gradient checkpointing
  model_parallel_size: 1
  experiment_name: full_storyboard
  mode: finetune
  load: /suqinzs/jwargrave/CogVideo-293/sat/pretrained_weights/CogVideoX1.5-5B-SAT/transformer_i2v
  no_load_rng: True
  train_iters: 1000 # Suggest more than 1000 for LoRA; 500 is enough for SFT
  eval_iters: 1
  eval_interval: 100
  eval_batch_size: 1
  save: ckpts_CogVideoX1.5-5B-SAT_i2v_full
  save_interval: 500
  log_interval: 20
  train_data: [ "/suqinzs/jwargrave/CogVideo-293/sat/storyboard_data_for_cog" ] # Train data path
  valid_data: [ "/suqinzs/jwargrave/CogVideo-293/sat/storyboard_data_for_cog" ] # Validation data path, can be the same as train_data (not recommended)
  split: 1,0,0
  num_workers: 8
  force_train: True
  only_log_video_latents: True

data:
  target: data_video.SFTDataset
  params:
    video_size: [ 480, 720 ]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.

deepspeed:
  # Minimum of 16 videos per batch across all GPUs; this setting is for 8 x A100 GPUs
  train_micro_batch_size_per_gpu: 1
  gradient_accumulation_steps: 2
  steps_per_print: 50
  gradient_clipping: 0.1
  zero_optimization:
    stage: 2
    cpu_offload: false
    contiguous_gradients: false
    overlap_comm: true
    reduce_scatter: true
    reduce_bucket_size: 1000000000
    allgather_bucket_size: 1000000000
    load_from_fp32_weights: false
  zero_allow_untested_optimizer: true
  bf16:
    enabled: True # For CogVideoX-2B set to False; for CogVideoX-5B set to True
  fp16:
    enabled: False # For CogVideoX-2B set to True; for CogVideoX-5B set to False
    loss_scale: 0
    loss_scale_window: 400
    hysteresis: 2
    min_loss_scale: 1
  optimizer:
    type: sat.ops.FusedEmaAdam
    params:
      lr: 0.00001 # Between 1E-3 and 5E-4 for LoRA; 1E-5 for SFT
      betas: [ 0.9, 0.95 ]
      eps: 1e-8
      weight_decay: 1e-4
  activation_checkpointing:
    partition_activations: false
    contiguous_memory_optimization: false
  wall_clock_breakdown: false
```
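For reference, the "16 videos per batch for all GPUs" comment in the DeepSpeed section follows from the effective (global) batch size: micro batch per GPU × gradient accumulation steps × number of GPUs. A quick sanity check (a sketch; the GPU count of 8 comes from the "8 x A100" comment, not from the config itself):

```python
# Effective (global) batch size under DeepSpeed:
# micro batch per GPU x gradient accumulation steps x number of GPUs.
def effective_batch_size(micro_batch: int, grad_accum: int, num_gpus: int) -> int:
    return micro_batch * grad_accum * num_gpus

# Values from the config above; 8 GPUs per the "8 x A100" comment.
print(effective_batch_size(1, 2, 8))  # -> 16
```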
System Info / 系統信息
The output of `pip list` is as follows (everything in `requirements.txt` and `sat/requirements.txt` is satisfied):

System information is as follows (produced by `python -m torch.utils.collect_env`):

Information / 问题信息
Reproduction / 复现过程
First, I modified `sat/configs/sft.yaml` as shown above. The main changes were `load: CogVideoX1.5-5B-SAT/transformer_i2v`, `train_micro_batch_size_per_gpu: 1`, `gradient_accumulation_steps: 2`, and `video_size: [ 480, 720 ]`, since I want to train image-to-video.

The dataset at `storyboard_data_for_cog` has already been organized as described here: each `.txt` file contains a single line, the caption of the corresponding video. There are roughly 70,000+ videos of varying lengths, some with a few hundred frames and some with only a few frames.

I then modified `sat/finetune_multi_gpus.sh` as shown below for single-node 8-GPU training, and ran `bash finetune_multi_gpus.sh`, which produced the following error. How can this problem be resolved?
Expected behavior / 期待表现
I hope to resolve the KeyError and successfully run full fine-tuning of CogVideoX1.5-5B-SAT.