Add All Multimodal Source Code #7791

Merged
508 commits merged on Dec 13, 2023
Commits
a807605
Add comprehensive error messages (#7261)
PeganovAnton Sep 12, 2023
92a71ca
check NEMO_PATH (#7418)
karpnv Sep 12, 2023
603c77e
layer selection for ia3 (#7417)
arendu Sep 13, 2023
3c4bba6
Fix missing pip package 'einops' (#7397)
RobinDong Sep 14, 2023
429932a
Fix failure of pyaudio in Google Colab (#7396)
RobinDong Sep 15, 2023
e6ba5f0
Update README.md: output_path --> output_manifest_filepath (#7442)
popcornell Sep 18, 2023
4f12143
Updating FlashAttention API to match FlashAttentionV2
parthmannan Sep 19, 2023
93ffdd0
Merge branch 'pmannan/sd_fav2' into 'internal/main'
Victor49152 Sep 19, 2023
70a3602
Multiple fixes for mm
yaoyu-33 Sep 19, 2023
61569c1
Fix CI inductor issue and update to torch compile
Victor49152 Sep 20, 2023
b88f5fe
Merge branch 'mingyuanm/compile' into 'internal/main'
Victor49152 Sep 20, 2023
9fe3784
Remove suppress error
Victor49152 Sep 20, 2023
d99cbd8
Merge branch 'mingyuanm/remove_suppress_error' into 'internal/main'
Victor49152 Sep 20, 2023
8801611
Fix when conversion config uses fp16 and it complains about precision…
Victor49152 Sep 21, 2023
630e713
Merge branch 'mingyuanm/fix_ckpt_loading' into 'internal/main'
Victor49152 Sep 21, 2023
3c33193
Fixing FAv2 API usage
parthmannan Sep 29, 2023
32f592c
Merge branch 'pmannan/sd_fav2_fix' into 'internal/main'
Victor49152 Sep 29, 2023
f24a224
Initial release of content filtering model
hXl3s Oct 3, 2023
09fa8f6
Merge branch 'lukaszp/content_filtering/initial_release' into 'intern…
Oct 3, 2023
0696c11
Added synthetic dataloader for precached and online mode
Victor49152 Oct 4, 2023
4fbb42d
Merge branch 'mingyuanm/synthetic_data_sd' into 'internal/main'
Victor49152 Oct 4, 2023
2cda573
Mingyuanm/dreambooth opt
Victor49152 Oct 4, 2023
57caf44
Merge branch 'mingyuanm/dreambooth_opt' into 'internal/main'
Victor49152 Oct 4, 2023
eb252f9
Add llama2 support in neva training
Oct 3, 2023
ee82ff5
Fix sampler length
yaoyu-33 Oct 5, 2023
f39c629
Fix all precision issues in nemo multimodal
yaoyu-33 Sep 21, 2023
6a6c286
Add rope dynamic linear scaling (#7437)
hsiehjackson Sep 18, 2023
703d1ef
Fix None dataloader issue in PTL2.0 (#7455)
KunalDhawan Sep 19, 2023
88e2285
[ASR] Confidence measure -> method renames (#7434)
GNroy Sep 19, 2023
38942ee
Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)
RobinDong Sep 19, 2023
4be356a
RNN-T confidence and alignment bugfix (#7381)
GNroy Sep 19, 2023
85a8bf1
Fix resume from checkpoint in exp_manager (#7424) (#7426)
github-actions[bot] Sep 19, 2023
e6d8fa9
Fix checking of cuda/cpu device for inputs of Decoder (#7444)
RobinDong Sep 19, 2023
701befe
Fix failure of ljspeech's get_data.py (#7430)
RobinDong Sep 19, 2023
16bcf5a
[TTS] Fix audio codec type checks (#7373)
rlangman Sep 19, 2023
b9f2cfe
[TTS] Add dataset to path of logged artifacts (#7462)
rlangman Sep 20, 2023
2e0133c
Fix sft dataset truncation (#7464)
hsiehjackson Sep 20, 2023
d25cac2
Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)
burchim Sep 20, 2023
1dbf71d
HF StarCoder to NeMo conversion script (#7421)
janekl Sep 20, 2023
e16856e
fix bug when loading dist ckpt in peft (#7452)
lhb8125 Sep 21, 2023
20dc142
Fix adding positional embeddings in-place in transformer module (#7440)
The0nix Sep 21, 2023
19358af
Fix (#7478)
hsiehjackson Sep 22, 2023
30e4ca6
add sleep (#7498) (#7499)
github-actions[bot] Sep 24, 2023
96ece09
Fix exp manager check for sleep (#7503) (#7504)
github-actions[bot] Sep 25, 2023
c1d05d5
bugfix: trainer.accelerator=auto from None. (#7492) (#7493)
github-actions[bot] Sep 25, 2023
6a8dd48
[doc] fix broken link (#7481)
stas00 Sep 25, 2023
3efd327
[TTS] Read audio as int32 to avoid flac read errors (#7477)
rlangman Sep 26, 2023
2ab2ff7
Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS (#7409)
RobinDong Sep 26, 2023
a62d4e9
dllogger - log on rank 0 only (#7513)
stas00 Sep 26, 2023
6866095
Fix TTS FastPitch tutorial (#7494) (#7516)
github-actions[bot] Sep 26, 2023
89e97b4
Fix get_dist() tensor dimension (#7506) (#7515)
github-actions[bot] Sep 26, 2023
e1419a8
bugfix: specify trainer.strategy=auto when devices=1 (#7509) (#7512)
github-actions[bot] Sep 26, 2023
e5b7c26
fix (#7511)
aklife97 Sep 26, 2023
febcab0
[TTS] Fix FastPitch data prep tutorial (#7524)
rlangman Sep 27, 2023
147e7ac
add italian tokenization (#7486)
GiacomoLeoneMaria Sep 27, 2023
301a266
Replace None strategy with auto in tutorial notebooks (#7521) (#7527)
github-actions[bot] Sep 27, 2023
9bee661
unpin setuptools (#7534) (#7535)
github-actions[bot] Sep 27, 2023
b546643
remove auto generated examples (#7510)
arendu Sep 27, 2023
08e91e1
Add the `strategy` argument to `MegatronGPTModel.generate()` (#7264)
odelalleau Sep 27, 2023
677960d
Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dat…
github-actions[bot] Sep 27, 2023
3b435ed
gpus -> devices (#7542) (#7545)
github-actions[bot] Sep 28, 2023
2bb56f7
Update FFMPEG version to fix issue with torchaudio (#7551) (#7553)
github-actions[bot] Sep 28, 2023
82f547f
PEFT GPT & T5 Refactor (#7308)
meatybobby Sep 28, 2023
1689790
fix a typo (#7496)
BestJuly Sep 28, 2023
3d28306
[TTS] remove curly braces from ${BRANCH} in jupyter notebook cell. (#7…
github-actions[bot] Sep 28, 2023
b38c28a
add youtube embed url (#7570)
XuesongYang Sep 29, 2023
b9033f2
Remap speakers to continuous range of speaker_id for dataset AISHELL3…
RobinDong Sep 29, 2023
62097e5
fix validation_step_outputs initialization for multi-dataloader (#754…
github-actions[bot] Sep 29, 2023
fe50fa3
Append output of val step to self.validation_step_outputs (#7530) (#7…
github-actions[bot] Sep 29, 2023
bf88a23
[TTS] fixed trainer's accelerator and strategy. (#7569) (#7574)
github-actions[bot] Sep 29, 2023
7987c21
Append val/test output to instance variable in EncDecSpeakerLabelMode…
github-actions[bot] Sep 29, 2023
50ab483
Fix CustomProgressBar for resume (#7427) (#7522)
github-actions[bot] Sep 30, 2023
2cb9e4c
fix typos in nfa and speech enhancement tutorials (#7580) (#7583)
github-actions[bot] Sep 30, 2023
2295e44
Add strategy as ddp_find_unused_parameters_true for glue_benchmark.py…
github-actions[bot] Sep 30, 2023
1be5988
update strategy (#7577) (#7578)
github-actions[bot] Sep 30, 2023
8f36214
Fix typos (#7581)
Kipok Oct 2, 2023
f29a917
Change hifigan finetune strategy to ddp_find_unused_parameters_true (…
github-actions[bot] Oct 2, 2023
dc60a47
[BugFix] Add missing quotes for auto strategy in tutorial notebooks (…
github-actions[bot] Oct 2, 2023
879047e
add build os key (#7596) (#7599)
github-actions[bot] Oct 2, 2023
0b1ea36
StarCoder SFT test + bump PyT NGC image to 23.09 (#7540)
janekl Oct 2, 2023
703d2e8
defaults changed (#7600)
arendu Oct 3, 2023
8b77683
add ItalianPhonemesTokenizer (#7587)
GiacomoLeoneMaria Oct 3, 2023
e603cad
best ckpt fix (#7564) (#7588)
github-actions[bot] Oct 3, 2023
4d5184c
Add files via upload (#7598)
Jorjeous Oct 3, 2023
f10f93b
Fix validation in G2PModel and ThutmoseTaggerModel (#7597) (#7606)
github-actions[bot] Oct 3, 2023
a12835e
Broadcast loss only when using pipeline parallelism and within the pi…
github-actions[bot] Oct 3, 2023
5211e5b
Safeguard nemo_text_processing installation on ARM (#7485)
blisc Oct 3, 2023
9590c3c
Bound transformers version in requirements (#7620)
athitten Oct 4, 2023
fe5af22
fix llama2 70b lora tuning bug (#7622)
cuichenx Oct 4, 2023
381d84e
Fix import error no module name model_utils (#7629)
menon92 Oct 4, 2023
19f32c5
add fc large ls models (#7641)
nithinraok Oct 4, 2023
329bd3c
bugfix: trainer.gpus, trainer.strategy, trainer.accelerator (#7621) (…
github-actions[bot] Oct 5, 2023
e109c6e
fix ssl models ptl monitor val through logging (#7608) (#7614)
github-actions[bot] Oct 5, 2023
b36555b
Fix metrics for SE tutorial (#7604) (#7612)
github-actions[bot] Oct 5, 2023
a0053a6
Add ddp_find_unused_parameters=True and change accelerator to auto (#…
github-actions[bot] Oct 5, 2023
358f5c6
Fix py3.11 dataclasses issue (#7616)
github-actions[bot] Oct 5, 2023
6aff5e9
[Stable Diffusion/ControlNet] Enable O2 training for SD and Fix Contr…
Victor49152 Oct 9, 2023
37e9706
Merge branch 'mingyuanm/sd_o2' into 'internal/main'
Victor49152 Oct 9, 2023
32e4fba
Mingyuanm/dreambooth fix
Victor49152 Oct 10, 2023
3bf91c3
Merge branch 'mingyuanm/dreambooth_fix' into 'internal/main'
Victor49152 Oct 10, 2023
173c468
Fix NeMo CI Infer Issue
suiyoubi Oct 10, 2023
32af5bc
Merge branch 'aot/imagen_fix' into 'internal/main'
Oct 10, 2023
3e038fd
DreamFusion
ahmadki Oct 11, 2023
981fca5
Move neva export changes
meatybobby Oct 12, 2023
ed895d0
Add Imagen Synthetic Dataloader
suiyoubi Oct 13, 2023
ad7ef5a
Merge branch 'aot/syn_dataset_imagen' into 'internal/main'
Oct 13, 2023
fd7c1d3
Add VITWrapper and export stuff to wrapper
meatybobby Oct 13, 2023
dd67c95
Update neva with megatron-core support
Oct 13, 2023
90c0559
Merge branch 'yuya/neva_mcore2' into 'internal/main'
Oct 13, 2023
1e4c2b2
Fix issues with Dockerfile (#7650) (#7652)
github-actions[bot] Oct 6, 2023
798f6fc
[ASR] RNN-T greedy decoding max_frames fix for alignment and confiden…
GNroy Oct 6, 2023
d9861d1
[ASR] Fix type error in jasper (#7636) (#7653)
github-actions[bot] Oct 6, 2023
3e38b79
[TTS] Add STFT and SI-SDR loss to audio codec recipe (#7468)
rlangman Oct 6, 2023
2209a30
Create per.py (#7538)
ssh-meister Oct 7, 2023
70c0a37
conversion issue fix (#7648) (#7668)
github-actions[bot] Oct 10, 2023
b7bcf08
layernorm1p fix (#7523) (#7567)
github-actions[bot] Oct 10, 2023
b3da442
generalized chat sft prompt (#7655)
yidong72 Oct 10, 2023
188f0a1
Fix vad & speech command tutorial - onnx (#7671) (#7672)
github-actions[bot] Oct 10, 2023
33d04b2
Fix in the confidence ensemble test (#7682)
Kipok Oct 11, 2023
40f8256
PEFT eval fix (#7626) (#7638)
github-actions[bot] Oct 11, 2023
79c3703
propagate mp config (#7637) (#7639)
github-actions[bot] Oct 11, 2023
aba4a00
Add find_unused_parameters_true for text_classiftn and punctuation_ca…
github-actions[bot] Oct 11, 2023
503301b
Hotfix (#7501) (#7568)
github-actions[bot] Oct 11, 2023
98e6ffe
Avoid duplicated checkpoint save (#7555) (#7566)
github-actions[bot] Oct 11, 2023
b6fecc5
Cache FP8 weight and transpose only at the first micro-batch in each …
github-actions[bot] Oct 11, 2023
292d232
Add an option to disable manual GC in validation (#7467) (#7476)
github-actions[bot] Oct 11, 2023
9c48ce1
Remove PUBLICATIONS.md, point to github.io NeMo page instead (#7694) …
github-actions[bot] Oct 11, 2023
762b5ca
Fix multi rank finetune for ASR (#7684) (#7699)
github-actions[bot] Oct 11, 2023
7755c17
Update docs: readme, getting started, ASR intro (#7679)
erastorgueva-nv Oct 11, 2023
5f35a8c
fix onnx (#7703) (#7704)
github-actions[bot] Oct 12, 2023
29910cd
move core install to /workspace (#7706)
aklife97 Oct 12, 2023
aa3a977
Fix typo in audio codec config, encoder target (#7697)
anteju Oct 12, 2023
eab0f54
Replace strategy='dp'/None with 'auto' (#7681) (#7696)
github-actions[bot] Oct 13, 2023
233e62b
[ASR] Multichannel mask estimator with flex number of channels (#7317)
anteju Oct 13, 2023
3cd9fbd
fix ptl_bugs in slu_models.py (#7689) (#7712)
github-actions[bot] Oct 13, 2023
ddf546d
fix code block typo (#7717)
erastorgueva-nv Oct 13, 2023
ff7154d
Update key mapping logic
Victor49152 Oct 16, 2023
f73180d
Merge branch 'main' into internal/main
yaoyu-33 Oct 16, 2023
0087ee3
Few merge fixes
yaoyu-33 Oct 16, 2023
8bdbd47
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 16, 2023
7be8108
Fix diff for non-mm models
yaoyu-33 Oct 16, 2023
aab3c40
Fix diff for non-mm models
yaoyu-33 Oct 16, 2023
38dc290
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 16, 2023
563cadb
Remove deployment and export scripts
yaoyu-33 Oct 16, 2023
9a566be
Improve the unet ckpt loading logic.
Victor49152 Oct 16, 2023
7a0ae36
Improve the unet ckpt loading logic.
Victor49152 Oct 16, 2023
576c652
Add checkpoint_averaging script
yaoyu-33 Oct 17, 2023
d6900f9
Hide multimodal code changes
yaoyu-33 Oct 17, 2023
3b1b802
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 17, 2023
526924d
Merge branch 'main' into multimodal_merge
ericharper Oct 19, 2023
a1f7296
Fix Eric's comments
yaoyu-33 Oct 23, 2023
41632c6
Revert "Hide multimodal code changes"
yaoyu-33 Oct 23, 2023
f40b56e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 23, 2023
c032a6d
Merge branch 'multimodal/merge_mm_code' into internal/main
yaoyu-33 Oct 24, 2023
ec8256b
Fix configs
yaoyu-33 Oct 24, 2023
5dad277
Fix neva model
yaoyu-33 Oct 24, 2023
c1c5981
Fix neva casting
yaoyu-33 Oct 24, 2023
b0c5320
Fix neva LoRA non MCore version
yaoyu-33 Oct 25, 2023
14cf3bd
Merge branch 'main' into multimodal_merge
ericharper Oct 25, 2023
4e178e3
Fix neva LoRA MCore
yaoyu-33 Oct 25, 2023
cacf9a8
[SD] group norm fixes
sjmikler Oct 25, 2023
2da64db
Fix neva cfg merge
yaoyu-33 Oct 26, 2023
fba2548
remove groupnorm dependency
suiyoubi Oct 27, 2023
a2da20d
Merge branch 'main' into multimodal_merge
ericharper Oct 30, 2023
41b1b51
Fix copyright headers
yaoyu-33 Oct 30, 2023
438617e
Merge branch 'aot/apex_gn' into 'internal/main'
Oct 30, 2023
7422dbe
LLaVA 1_5 and LORA update
Oct 30, 2023
de405b9
Merge branch 'yuya/llava_1_5_update' into 'internal/main'
Oct 30, 2023
5965a5f
Fix logs
yaoyu-33 Oct 30, 2023
26ee7dc
Fix neva mcore inference
yaoyu-33 Oct 31, 2023
7356b1c
Fix ema
yaoyu-33 Oct 31, 2023
93e4f99
Fix ema
yaoyu-33 Oct 31, 2023
ca3d8f9
Address Somshubra comments
yaoyu-33 Nov 1, 2023
544e5ea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 1, 2023
8493a8a
Fix NeVA
yaoyu-33 Nov 1, 2023
ea3d4fc
Remove llama tricks since we are padding the embedding weights direct…
yaoyu-33 Nov 1, 2023
2d5f5ab
Merge branch 'multimodal/merge' into multimodal/merge_mm_code
yaoyu-33 Nov 1, 2023
6f5df3f
Update Dockerfile and mm requirements
meatybobby Nov 1, 2023
65bcec3
Merge branch 'bobchen/nemo_toolkit' into 'internal/main'
Nov 1, 2023
4dff83f
Multimodal unit and jenkins tests
Nov 1, 2023
02cc05d
Merge branch 'mm_tests' into 'internal/main'
Nov 1, 2023
724c956
Add Multimodal Docs
Nov 1, 2023
4951f4f
Merge branch 'mm_docs' into 'internal/main'
Nov 1, 2023
6beaa50
update default conv_template
yaoyu-33 Nov 1, 2023
2f4e334
Merge branch 'internal/main' into multimodal/merge_mm_code
yaoyu-33 Nov 1, 2023
c083f0f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 1, 2023
367723f
Merge branch 'main' into multimodal_merge
ericharper Nov 1, 2023
2840014
Fix neva evaluation
yaoyu-33 Nov 1, 2023
97d9bf9
Update Dockerfile
yaoyu-33 Nov 1, 2023
9173dc2
Merge branch 'internal/main' into multimodal/merge_mm_code
yaoyu-33 Nov 1, 2023
149cdde
Merge branch 'main' into multimodal_merge
ericharper Nov 2, 2023
6b84cef
Fix evaluation loading
yaoyu-33 Nov 2, 2023
ccd6cb5
Fix evaluation API
yaoyu-33 Nov 2, 2023
e0a74da
Merge branch 'internal/main' into multimodal/merge_mm_code
yaoyu-33 Nov 2, 2023
85bd797
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 2, 2023
9b4c9c2
Change quick-gelu to approx-gelu
yaoyu-33 Nov 2, 2023
e2ccc88
hide multimodal
yaoyu-33 Nov 2, 2023
1057139
Merge branch 'multimodal/merge' into multimodal/merge_mm_code
yaoyu-33 Nov 2, 2023
7ed6283
Revert "hide multimodal"
yaoyu-33 Nov 2, 2023
f6ef703
Restructure
yaoyu-33 Nov 2, 2023
9751d10
Restructure again
yaoyu-33 Nov 3, 2023
9ac6102
Update neva evaluation code
yaoyu-33 Nov 3, 2023
d4fe16c
Merge branch 'internal/main_change_structure' into multimodal/merge_m…
yaoyu-33 Nov 3, 2023
488d7e9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 3, 2023
2f29c5d
Merge branch 'main' into multimodal/merge_mm_code
yaoyu-33 Nov 3, 2023
0e9c30c
Fix neva model after merging
yaoyu-33 Nov 3, 2023
f68ba2c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 3, 2023
5df0c40
Restructure
yaoyu-33 Nov 6, 2023
b1555a6
Restructure, rename
yaoyu-33 Nov 6, 2023
71141c5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 6, 2023
87b724e
Restructure
yaoyu-33 Nov 6, 2023
e9ba432
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 6, 2023
76df1d8
Merge branch 'main' into multimodal/merge_mm_code
yaoyu-33 Nov 6, 2023
b49f12b
Remove package requirement
meatybobby Nov 3, 2023
d2c200c
hide docs and artifacts
yaoyu-33 Nov 6, 2023
8007765
Merge remote-tracking branch 'github/multimodal/merge_mm_code' into m…
yaoyu-33 Nov 6, 2023
72c683e
Rename Nerf
yaoyu-33 Nov 7, 2023
782316f
Hide Nerf and text to image
yaoyu-33 Nov 7, 2023
d24f74d
Merge branch 'main' into multimodal/merge_mm_code
ericharper Nov 10, 2023
66d42be
Merge branch 'main' into multimodal/merge_mm_code
ericharper Nov 16, 2023
c8dd7e3
Update examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_ne…
yaoyu-33 Nov 16, 2023
565e617
Update examples/multimodal/multimodal_llm/neva/convert_hf_llava_to_ne…
yaoyu-33 Nov 16, 2023
fd1ada8
Fix PR comments, clean comments, move to torch_dtype_from_precision
yaoyu-33 Nov 16, 2023
bccb0ea
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 16, 2023
d596f59
Update to torch_dtype_from_precision
yaoyu-33 Nov 16, 2023
ed9145c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 16, 2023
084ff69
Merge branch 'main' into multimodal/merge_mm_code
ericharper Nov 21, 2023
993f969
Merge branch 'main' into multimodal/merge_mm_code
ericharper Nov 27, 2023
2d3a6b7
Fix PR comments
yaoyu-33 Dec 4, 2023
00ef2b4
Fix copyright and docstrings
yaoyu-33 Dec 4, 2023
0ccf916
Update docstrings
yaoyu-33 Dec 4, 2023
3574590
Optimize imports
yaoyu-33 Dec 4, 2023
90d08a8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2023
50e8871
Optimize imports
yaoyu-33 Dec 4, 2023
2ce8e36
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 4, 2023
d6c1b47
Merge branch 'main' into multimodal/merge_mm_code
ericharper Dec 6, 2023
2b62ad3
Merge branch 'main' into multimodal/merge_mm_code
ericharper Dec 7, 2023
1377089
Clean imports
yaoyu-33 Dec 7, 2023
fab06fa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 7, 2023
9d04cb5
Clean more imports
yaoyu-33 Dec 8, 2023
0d61bfb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 8, 2023
81d35d8
Merge branch 'main' into multimodal/merge_mm_code
ericharper Dec 8, 2023
a3cc9b4
Fix jenkins
yaoyu-33 Dec 11, 2023
f3c01f7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 11, 2023
92d0fb4
Add guard to webdataset
yaoyu-33 Dec 11, 2023
4cad2fb
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 11, 2023
ffaa865
Update webdataset guard
yaoyu-33 Dec 11, 2023
b86b879
Update webdataset guard
yaoyu-33 Dec 12, 2023
f953f53
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 12, 2023
373ef57
Merge branch 'main' into multimodal/merge_mm_code
ericharper Dec 12, 2023
193 changes: 193 additions & 0 deletions examples/multimodal/convert_ckpt_to_nemo.py
@@ -0,0 +1,193 @@
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

r"""
Conversion script to convert PTL checkpoints into nemo checkpoint.
Example to run this conversion script:
python -m torch.distributed.launch --nproc_per_node=<tensor_model_parallel_size> * <pipeline_model_parallel_size> \
convert_ckpt_to_nemo.py \
--checkpoint_folder <path_to_PTL_checkpoints_folder> \
--checkpoint_name <checkpoint_name> \
--nemo_file_path <path_to_output_nemo_file> \
--tensor_model_parallel_size <tensor_model_parallel_size> \
--pipeline_model_parallel_size <pipeline_model_parallel_size>
"""

import os
from argparse import ArgumentParser

import torch
from omegaconf.omegaconf import OmegaConf, open_dict

from nemo.collections.multimodal.models.multimodal_llm.kosmos import MegatronKosmosModel
from nemo.collections.multimodal.models.multimodal_llm.neva.neva_model import MegatronNevaModel
from nemo.collections.multimodal.models.text_to_image.controlnet.controlnet import MegatronControlNet
from nemo.collections.multimodal.models.text_to_image.imagen import MegatronImagen
from nemo.collections.multimodal.models.text_to_image.instruct_pix2pix.ldm.ddpm_edit import MegatronLatentDiffusionEdit
from nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.ddpm import MegatronLatentDiffusion
from nemo.collections.multimodal.models.vision_language_foundation.clip import MegatronCLIPModel
from nemo.collections.nlp.parts.megatron_trainer_builder import MegatronTrainerBuilder
from nemo.collections.nlp.parts.nlp_overrides import NLPSaveRestoreConnector
from nemo.utils import AppState, logging
from nemo.utils.distributed import initialize_distributed
from nemo.utils.model_utils import inject_model_parallel_rank

try:
    from megatron.core import parallel_state

    HAVE_MEGATRON_CORE = True

except (ImportError, ModuleNotFoundError):

    HAVE_MEGATRON_CORE = False

Code scanning (CodeQL) notice: The global variable 'HAVE_MEGATRON_CORE' is not used.
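For context, NeMo normally consumes this kind of availability flag in a guard before constructing Megatron-based models; a minimal sketch of that pattern (an illustration, not part of this diff) would be:

if not HAVE_MEGATRON_CORE:
    # Fail fast with an actionable message when megatron-core is missing.
    raise ModuleNotFoundError(
        "megatron.core was not found. Install megatron-core to run this conversion script."
    )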


def get_args():
    parser = ArgumentParser()
    parser.add_argument(
        "--checkpoint_folder",
        type=str,
        default=None,
        required=True,
        help="Path to PTL checkpoints saved during training. Ex: /raid/nemo_experiments/multimodal/checkpoints",
    )
    parser.add_argument(
        "--checkpoint_name",
        type=str,
        default=None,
        required=True,
        help="Name of checkpoint to be used. Ex: megatron_gpt--val_loss=6.34-step=649-last.ckpt",
    )

    parser.add_argument(
        "--hparams_file",
        type=str,
        default=None,
        required=False,
        help="Path config for restoring. It's created during training and may need to be modified during restore if restore environment is different than training. Ex: /raid/nemo_experiments/megatron_gpt/hparams.yaml",
    )
    parser.add_argument("--nemo_file_path", type=str, default=None, required=True, help="Path to output .nemo file.")
    parser.add_argument("--gpus_per_node", type=int, required=False, default=1)
    parser.add_argument("--tensor_model_parallel_size", type=int, required=False, default=1)
    parser.add_argument("--pipeline_model_parallel_size", type=int, required=False, default=1)
    parser.add_argument(
        "--pipeline_model_parallel_split_rank",
        type=int,
        required=False,
        default=None,
        help="If pipeline parallel size > 1, this is the rank at which the encoder ends and the decoder begins.",
    )
    parser.add_argument("--model_type", type=str, required=False, default="megatron_clip")
    parser.add_argument("--local_rank", type=int, required=False, default=os.getenv('LOCAL_RANK', -1))
    parser.add_argument("--bcp", action="store_true", help="Whether on BCP platform")

    args = parser.parse_args()
    return args


def convert(local_rank, rank, world_size, args):
    app_state = AppState()
    app_state.data_parallel_rank = 0

    cfg = OmegaConf.load(args.hparams_file)
    with open_dict(cfg):
        cfg['model'] = cfg['cfg']
        cfg['trainer'] = {'precision': cfg['model']['precision']}
        if args.bcp:
            cfg['cluster_type'] = 'BCP'
    trainer = MegatronTrainerBuilder(cfg).create_trainer()

    app_state.pipeline_model_parallel_size = args.pipeline_model_parallel_size
    app_state.tensor_model_parallel_size = args.tensor_model_parallel_size

    # no use atm, use to split ranks in encoder/decoder models.
    if args.pipeline_model_parallel_size > 1 and args.model_type in []:
        if args.pipeline_model_parallel_split_rank is not None:
            app_state.pipeline_model_parallel_split_rank = args.pipeline_model_parallel_split_rank
        else:
            if args.pipeline_model_parallel_size % 2 != 0:
                raise ValueError(
                    f"Pipeline model parallel size {args.pipeline_model_parallel_size} must be even if split rank is not specified."
                )
            else:
                # If split rank is not set, then we set it to be pipeline_model_parallel_size // 2 - this is because in most cases we have the same number of enc/dec layers.
                app_state.pipeline_model_parallel_split_rank = args.pipeline_model_parallel_size // 2
    else:
        app_state.pipeline_model_parallel_split_rank = None

    app_state.model_parallel_size = app_state.tensor_model_parallel_size * app_state.pipeline_model_parallel_size

    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=app_state.tensor_model_parallel_size,
        pipeline_model_parallel_size=app_state.pipeline_model_parallel_size,
        pipeline_model_parallel_split_rank=app_state.pipeline_model_parallel_split_rank,
    )

    app_state.pipeline_model_parallel_rank = parallel_state.get_pipeline_model_parallel_rank()
    app_state.tensor_model_parallel_rank = parallel_state.get_tensor_model_parallel_rank()

    # inject model parallel rank
    checkpoint_path = inject_model_parallel_rank(os.path.join(args.checkpoint_folder, args.checkpoint_name))
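    # NOTE (assumption based on NeMo's usual checkpoint layout, not asserted by this
    # diff): for model-parallel checkpoints the call above typically rewrites the path
    # into a rank subdirectory such as 'mp_rank_00' or 'tp_rank_00_pp_rank_000'.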

    logging.info(
        f'rank: {rank}, local_rank: {local_rank}, is loading checkpoint: {checkpoint_path} for tp_rank: {app_state.tensor_model_parallel_rank} and pp_rank: {app_state.pipeline_model_parallel_rank}'
    )

    if args.model_type == 'megatron_clip':
        model = MegatronCLIPModel.load_from_checkpoint(
            checkpoint_path, hparams_file=args.hparams_file, trainer=trainer
        )
    elif args.model_type == 'stable_diffusion':
        model = MegatronLatentDiffusion.load_from_checkpoint(
            checkpoint_path, hparams_file=args.hparams_file, trainer=trainer
        )
    elif args.model_type == 'instruct_pix2pix':
        model = MegatronLatentDiffusionEdit.load_from_checkpoint(
            checkpoint_path, hparams_file=args.hparams_file, trainer=trainer
        )
    elif args.model_type == 'dreambooth':
        model = MegatronLatentDiffusion.load_from_checkpoint(
            checkpoint_path, hparams_file=args.hparams_file, trainer=trainer
        )
    elif args.model_type == 'imagen':
        model = MegatronImagen.load_from_checkpoint(checkpoint_path, hparams_file=args.hparams_file, trainer=trainer)
    elif args.model_type == 'controlnet':
        model = MegatronControlNet.load_from_checkpoint(
            checkpoint_path, hparams_file=args.hparams_file, trainer=trainer
        )
    elif args.model_type == 'kosmos':
        model = MegatronKosmosModel.load_from_checkpoint(
            checkpoint_path, hparams_file=args.hparams_file, trainer=trainer
        )
    elif args.model_type == 'neva':
        model = MegatronNevaModel.load_from_checkpoint(
            checkpoint_path, hparams_file=args.hparams_file, trainer=trainer
        )
    else:
        raise ValueError(f"Unrecognized model_type {args.model_type}.")

    model._save_restore_connector = NLPSaveRestoreConnector()

    if torch.distributed.is_initialized():
        torch.distributed.barrier()

    model.save_to(args.nemo_file_path)

    logging.info(f'NeMo model saved to: {args.nemo_file_path}')


if __name__ == '__main__':
    args = get_args()
    local_rank, rank, world_size = initialize_distributed(args)
    convert(local_rank, rank, world_size, args)
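Once the script completes, the resulting .nemo archive can be restored like any other NeMo checkpoint. A minimal sketch, assuming a megatron_clip conversion and a single-GPU trainer (the restore path is a hypothetical placeholder, not from this PR):

from pytorch_lightning import Trainer

from nemo.collections.multimodal.models.vision_language_foundation.clip import MegatronCLIPModel
from nemo.collections.nlp.parts.nlp_overrides import NLPSaveRestoreConnector

# Restore the converted archive; the connector mirrors the one attached in convert().
trainer = Trainer(devices=1, accelerator='gpu')
model = MegatronCLIPModel.restore_from(
    restore_path='/raid/nemo_experiments/clip/megatron_clip.nemo',  # hypothetical path
    trainer=trainer,
    save_restore_connector=NLPSaveRestoreConnector(),
)
model.eval()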