
Doesn't have the shards in expected directory. #11359

Open
wittysam opened this issue Nov 21, 2024 · 0 comments
Labels
bug Something isn't working

wittysam commented Nov 21, 2024

Describe the bug

I was trying to run the following lines:

%%bash

# Set paths to the model, train, validation and test sets.
MODEL="./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TRAIN_DS="[./curated-data/law-qa-train_preprocessed.jsonl]"
VALID_DS="[./curated-data/law-qa-val_preprocessed.jsonl]"
TEST_DS="[./curated-data/law-qa-test_preprocessed.jsonl]"
TEST_NAMES="[law]"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

rm -rf results
OUTPUT_DIR="./results/Meta-llama3.1-8B-Instruct-titlegen"

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.2 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=64 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

but keep getting the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpelprgyyt/model_weights/model.decoder.layers.self_attention.core_attention._extra_state/shard_0_32.pt'

The weights file never appears under "/tmp/tmpelprgyyt/model_weights/model.decoder.layers.self_attention.core_attention._extra_state".
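One way to narrow this down: a `.nemo` checkpoint is a plain tar archive, so you can list its members and confirm the `_extra_state` shard files are actually present in the checkpoint before NeMo extracts it to `/tmp`. A minimal sketch (the checkpoint path is taken from the script above; `list_extra_state_members` is a hypothetical helper, not a NeMo API):

```python
import tarfile

def list_extra_state_members(nemo_path):
    """Return the names of archive members containing '_extra_state'.

    .nemo checkpoints are uncompressed tar archives, so a missing shard
    here would explain the FileNotFoundError after extraction.
    """
    with tarfile.open(nemo_path, "r:") as tar:
        return [m.name for m in tar.getmembers() if "_extra_state" in m.name]

# Example, using the path from the failing run:
# for name in list_extra_state_members(
#         "./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"):
#     print(name)
```

If the listing comes back empty, the checkpoint itself is incomplete (e.g. a partial download) rather than the extraction step failing.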

Expected behavior

The shards should load successfully into the model.

Environment overview (please complete the following information)

  • Environment location: Primehub Docker web interface (JupyterLab)
  • Method of NeMo install: pull from nvcr.io/nvidia/nemo:24.09

Environment details

I'm using the NeMo 24.09 image on four A10 GPUs.

wittysam added the bug label on Nov 21, 2024