
LM Training Pipeline does not work with the given README.md instructions #1

Open
MrGranddy opened this issue May 2, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@MrGranddy

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

Follow the training instructions provided in README.md. By default, the provided data preparation script "prepare-data-joint-dict.sh" does not include a language model input, yet "preprocess-joint-dict.sh" does. If the preparation script is run without the additional language model parameter, there will be no LM input. So directly running:

bash prepare-data-joint-dict.sh $DATAPATH $SRC $TGT $BPESIZE $DESTDIR

Then:

bash train-waitk-model.sh $DATAPATH 32768 2 $SRC $TGT $WAITK word 1 0 $MASTER_IP $MASTER_PORT $WORDSIMT_DIR

Then:

bash train-waitk-lm-model.sh $DATAPATH 32768 2 $SRC $TGT $WAITK word 1 0 $MASTER_IP $MASTER_PORT $LM_PATH $WORDSIMT_DIR
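
For reference, the placeholders above are set to hypothetical values like the following (only to make the commands concrete):

SRC=de; TGT=en
BPESIZE=32768
WAITK=5
DATAPATH=/path/to/data
DESTDIR=/path/to/data-bin
WORDSIMT_DIR=/path/to/WordSiMT
LM_PATH=facebook/xglm-564M
MASTER_IP=127.0.0.1; MASTER_PORT=29500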

Code sample

There is this part in "encode-bpe-joint-dict.sh":

echo "preparing LM data..."
cp $DATAPATH/train.${SRC}-${TGT}.${SRC} $DATAPATH/train.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}
cp $DATAPATH/val.${SRC}-${TGT}.${SRC} $DATAPATH/val.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}
cp $DATAPATH/test.${SRC}-${TGT}.${SRC} $DATAPATH/test.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}

but as far as I understand, this does not actually do anything, because the code tries to read the LM dataset with the following snippet from "tasks/waitk_translation.py":

# infer langcode
if split_exists(split_k, src, tgt, src, data_path):
    prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, src, tgt))
    if src_lm_datasets is not None:
        lm_prefix = os.path.join(
            data_path, "{}.lm.{}-{}.".format(split_k, src, tgt)
        )
elif split_exists(split_k, tgt, src, src, data_path):
    prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, tgt, src))
    if src_lm_datasets is not None:
        lm_prefix = os.path.join(
            data_path, "{}.lm.{}-{}.".format(split_k, tgt, src)
        )
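
To make the mismatch concrete, here is a minimal sketch (hypothetical de-en pair with BPESIZE=32768; values are only for illustration) comparing the name produced by the copy step with the name the loader expects:

SPLIT=train; SRC=de; TGT=en; BPESIZE=32768
# name written by the cp lines in encode-bpe-joint-dict.sh:
produced="${SPLIT}.bpe.joint.${BPESIZE}.${SRC}-${TGT}.lm.${SRC}"  # train.bpe.joint.32768.de-en.lm.de
# name implied by lm_prefix in tasks/waitk_translation.py:
expected="${SPLIT}.lm.${SRC}-${TGT}.${SRC}"                       # train.lm.de-en.de
[ "$produced" = "$expected" ] || echo "no match: $produced vs $expected"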

These naming formats do not match, so the "src_lm_datasets" list ends up empty at the end of this process. As a result, the following check in "data/language_pair_waitk_dataset.py"

if self.src_lm is not None:
    example["source_lm"] = self.src_lm[index]

never runs, so the "source_lm" key is never set, which eventually leads to an error in:

if lm_pad_idx is not None:
    src_lm_tokens = merge("source_lm", left_pad=left_pad_source, lm_input=True)
    batch["net_input"]["lm_input"] = src_lm_tokens.index_select(0, sort_order)

Key "source_lm" not found. If the data preperation script is ran with a language_model parameter then it again throws an error because:

list(self.dict.max_model_input_sizes.values())[0]

because self.dict does not have max_model_input_sizes for the facebook/xglm-564M model. This can be fixed by replacing it with:

# Fall back to a default maximum length when the tokenizer does not
# define max_model_input_sizes (as with facebook/xglm-564M here)
default_max_len = 3000
sizes = getattr(self.dict, 'max_model_input_sizes', None)
max_len = list(sizes.values())[0] if sizes else default_max_len

This seemingly fixes the whole pipeline, yet at this point I do not understand why the given instructions do not lead to the expected behaviour, nor whether the workaround above is correct. Could you please clarify this?

Expected behavior

The problems described above should not occur when following the README.md instructions.

Environment

  • fairseq Version (e.g., 1.0 or main):
  • PyTorch Version (e.g., 1.0)
  • OS (e.g., Linux):
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:
@MrGranddy MrGranddy added the bug Something isn't working label May 2, 2024
@hankyu-xl8

Hello @MrGranddy, sorry for the issue.
The LM data was not generated because the LM name was not given to prepare-data-joint-dict.sh.
The fix was committed in 1f0aff2.
Please provide the LM name as the last argument.
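
For example, with the variables from the original report and facebook/xglm-564M (the LM referenced above) as the last argument:

bash prepare-data-joint-dict.sh $DATAPATH $SRC $TGT $BPESIZE $DESTDIR facebook/xglm-564M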

The data is only copied in encode-bpe-joint-dict.sh; it is binarized in https://github.com/xl8-ai/WordSiMT/blob/main/fairseq_cli/preprocess.py#L389-L391.

The max_model_input_sizes issue is due to a recent change in transformers (huggingface/transformers#29785).
Could you try with transformers==4.30?
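
For a pip-based install that would be, e.g.:

pip install transformers==4.30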

@MrGranddy
Author

Thank you for the clarification and fast response, it helped a ton :)
I am already working with the max_len workaround, but I will try transformers==4.30 as soon as possible and share the results here. Thanks a lot :)
