
LM Training Pipeline does not work with the given README.md instructions #1

Open
MrGranddy opened this issue May 2, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@MrGranddy

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Follow the training instructions provided in README.md. By default, the provided data preparation script "prepare-data-joint-dict.sh" does not include a language model input, yet "preprocess-joint-dict.sh" does. If the preparation script is run without the additional language model parameter, there will be no LM input. So directly running:

bash prepare-data-joint-dict.sh $DATAPATH $SRC $TGT $BPESIZE $DESTDIR

Then:

bash train-waitk-model.sh $DATAPATH 32768 2 $SRC $TGT $WAITK word 1 0 $MASTER_IP $MASTER_PORT $WORDSIMT_DIR

Then:

bash train-waitk-lm-model.sh $DATAPATH 32768 2 $SRC $TGT $WAITK word 1 0 $MASTER_IP $MASTER_PORT $LM_PATH $WORDSIMT_DIR
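
For reference, the placeholders above are set to hypothetical values like the following (only to make the commands concrete):

SRC=de; TGT=en
BPESIZE=32768
WAITK=5
DATAPATH=/path/to/data
DESTDIR=/path/to/data-bin
WORDSIMT_DIR=/path/to/WordSiMT
LM_PATH=facebook/xglm-564M
MASTER_IP=127.0.0.1; MASTER_PORT=29500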

Code sample

There is this part in "encode-bpe-joint-dict.sh":

echo "preparing LM data..."
cp $DATAPATH/train.${SRC}-${TGT}.${SRC} $DATAPATH/train.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}
cp $DATAPATH/val.${SRC}-${TGT}.${SRC} $DATAPATH/val.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}
cp $DATAPATH/test.${SRC}-${TGT}.${SRC} $DATAPATH/test.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}

but as far as I understand, this does not actually do anything, because the code tries to read the LM dataset with the following snippet from "tasks/waitk_translation.py":

# infer langcode
if split_exists(split_k, src, tgt, src, data_path):
    prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, src, tgt))
    if src_lm_datasets is not None:
        lm_prefix = os.path.join(
            data_path, "{}.lm.{}-{}.".format(split_k, src, tgt)
        )
elif split_exists(split_k, tgt, src, src, data_path):
    prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, tgt, src))
    if src_lm_datasets is not None:
        lm_prefix = os.path.join(
            data_path, "{}.lm.{}-{}.".format(split_k, tgt, src)
        )
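
To make the mismatch concrete, here is a minimal sketch (hypothetical de-en pair with BPESIZE=32768; values are only for illustration) comparing the name produced by the copy step with the name the loader expects:

SPLIT=train; SRC=de; TGT=en; BPESIZE=32768
# name written by the cp lines in encode-bpe-joint-dict.sh:
produced="${SPLIT}.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}"  # train.bpe.joint.32768.de-en.lm.de
# name implied by lm_prefix in tasks/waitk_translation.py:
expected="${SPLIT}.lm.${SRC}-${TGT}.${SRC}"                       # train.lm.de-en.de
[ "$produced" = "$expected" ] || echo "no match: $produced vs $expected"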

These naming formats do not match, so the "src_lm_datasets" list ends up empty at the end of this process. As a result, the following check in "data/language_pair_waitk_dataset.py"

if self.src_lm is not None:
    example["source_lm"] = self.src_lm[index]

never runs, so the "source_lm" key is never set, which eventually leads to an error in:

if lm_pad_idx is not None:
    src_lm_tokens = merge("source_lm", left_pad=left_pad_source, lm_input=True)
    batch["net_input"]["lm_input"] = src_lm_tokens.index_select(0, sort_order)

Key "source_lm" not found. If the data preperation script is ran with a language_model parameter then it again throws an error because:

list(self.dict.max_model_input_sizes.values())[0]

because self.dict does not have max_model_input_sizes for the facebook/xglm-564M model. This can be fixed by replacing it with:

# Fall back to a default maximum length when the tokenizer does not
# define max_model_input_sizes (as with facebook/xglm-564M here)
default_max_len = 3000
sizes = getattr(self.dict, 'max_model_input_sizes', None)
max_len = list(sizes.values())[0] if sizes else default_max_len

This seemingly fixes the whole pipeline, yet at this point I do not understand why the given instructions do not lead to the expected behaviour, nor whether the workaround above is correct. Could you please clarify this?

Expected behavior

The problems described above should not occur when following the README.md instructions.

Environment

  • fairseq Version (e.g., 1.0 or main):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux):
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:
@MrGranddy MrGranddy added the bug Something isn't working label May 2, 2024
@hankyu-xl8

Hello @MrGranddy, sorry for the issue.
The LM data was not generated because the LM name was not given to prepare-data-joint-dict.sh.
The fix was committed in 1f0aff2.
Please provide the LM name as the last argument.
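
For example, with the variables from the original report and facebook/xglm-564M (the LM referenced above) as the last argument:

bash prepare-data-joint-dict.sh $DATAPATH $SRC $TGT $BPESIZE $DESTDIR facebook/xglm-564M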

The data is only copied in encode-bpe-joint-dict.sh; it is binarized in https://github.com/xl8-ai/WordSiMT/blob/main/fairseq_cli/preprocess.py#L389-L391.

The max_model_input_sizes issue is due to a recent change in transformers (huggingface/transformers#29785).
Could you try with transformers==4.30?
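
For a pip-based install that would be, e.g.:

pip install transformers==4.30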

@MrGranddy
Author

Thank you for the clarification and fast response, it helped a ton :)
I am already working with the max_len workaround, but I will try transformers==4.30 as soon as possible and share the results here. Thanks a lot :)
