
Doesn't have the shards in expected directory. #11359

Open
wittysam opened this issue Nov 21, 2024 · 0 comments
Labels
bug Something isn't working

wittysam commented Nov 21, 2024

Describe the bug

I was trying to run the following lines:

%%bash

# Set paths to the model, train, validation and test sets.
MODEL="./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"

TRAIN_DS="[./curated-data/law-qa-train_preprocessed.jsonl]"
VALID_DS="[./curated-data/law-qa-val_preprocessed.jsonl]"
TEST_DS="[./curated-data/law-qa-test_preprocessed.jsonl]"
TEST_NAMES="[law]"

SCHEME="lora"
TP_SIZE=1
PP_SIZE=1

rm -rf results
OUTPUT_DIR="./results/Meta-llama3.1-8B-Instruct-titlegen"

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=0.2 \
    trainer.max_steps=50 \
    model.megatron_amp_O2=True \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=64 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme=${SCHEME}

but keep getting the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpelprgyyt/model_weights/model.decoder.layers.self_attention.core_attention._extra_state/shard_0_32.pt'

The weights file never appears under "/tmp/tmpelprgyyt/model_weights/model.decoder.layers.self_attention.core_attention._extra_state".
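One way to narrow this down: a `.nemo` checkpoint is a plain tar archive, so you can list its members and confirm the `_extra_state` shard files are actually present in the checkpoint before NeMo extracts it to `/tmp`. A minimal sketch (the checkpoint path is taken from the script above; `list_extra_state_members` is a hypothetical helper, not a NeMo API):

```python
import tarfile

def list_extra_state_members(nemo_path):
    """Return the names of archive members containing '_extra_state'.

    .nemo checkpoints are uncompressed tar archives, so a missing shard
    here would explain the FileNotFoundError after extraction.
    """
    with tarfile.open(nemo_path, "r:") as tar:
        return [m.name for m in tar.getmembers() if "_extra_state" in m.name]

# Example, using the path from the failing run:
# for name in list_extra_state_members(
#         "./llama-3_1-8b-instruct-nemo_v1.0/llama3_1_8b_instruct.nemo"):
#     print(name)
```

If the listing comes back empty, the checkpoint itself is incomplete (e.g. a partial download) rather than the extraction step failing.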

Expected behavior

The shards should load successfully into the model.

Environment overview (please complete the following information)

  • Environment location: Primehub Docker web interface (JupyterLab)
  • Method of NeMo install: pull from nvcr.io/nvidia/nemo:24.09

Environment details

I'm using the NeMo 24.09 image on four A10 GPUs.

wittysam added the bug label on Nov 21, 2024