🐛 Bug
To Reproduce
Steps to reproduce the behavior:
Follow the training instructions provided in README.md. By default, the provided data preparation script "prepare-data-joint-dict.sh" does not include a language model input, yet "preprocess-joint-dict.sh" does. If the preparation script is run without the additional language model parameter, then there will be no LM input. So directly running:
bash prepare-data-joint-dict.sh $DATAPATH $SRC $TGT $BPESIZE $DESTDIR
Then:
bash train-waitk-model.sh $DATAPATH 32768 2 $SRC $TGT $WAITK word 1 0 $MASTER_IP $MASTER_PORT $WORDSIMT_DIR
Then:
bash train-waitk-lm-model.sh $DATAPATH 32768 2 $SRC $TGT $WAITK word 1 0 $MASTER_IP $MASTER_PORT $LM_PATH $WORDSIMT_DIR
Code sample
There is this part in "encode-bpe-joint-dict.sh":
but this, as far as I understand, does not do anything, because the code tries to read the LM dataset under a different naming format in the following snippet from "tasks/waitk_translation.py":
# infer langcode
if split_exists(split_k, src, tgt, src, data_path):
    prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, src, tgt))
    if src_lm_datasets is not None:
        lm_prefix = os.path.join(
            data_path, "{}.lm.{}-{}.".format(split_k, src, tgt)
        )
elif split_exists(split_k, tgt, src, src, data_path):
    prefix = os.path.join(data_path, "{}.{}-{}.".format(split_k, tgt, src))
    if src_lm_datasets is not None:
        lm_prefix = os.path.join(
            data_path, "{}.lm.{}-{}.".format(split_k, tgt, src)
        )
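For reference, a quick way to see which files the loader is looking for (a hypothetical check, assuming the standard fairseq binarized layout where the language code and .bin/.idx extensions are appended to this prefix; the paths and language pair below are placeholders):
import os

# Hypothetical values for illustration; substitute your own paths and language pair.
data_path, split, src, tgt = "data-bin/example", "train", "en", "de"

# The snippet above builds this prefix and, presumably, appends the source
# language plus fairseq's usual .bin/.idx extensions.
lm_prefix = os.path.join(data_path, "{}.lm.{}-{}.".format(split, src, tgt))
for ext in ("bin", "idx"):
    path = lm_prefix + src + "." + ext
    print(path, "->", "found" if os.path.exists(path) else "MISSING")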
Since the preparation script does not produce files matching that naming format, the "src_lm_datasets" list ends up empty at the end of this process, so the following check in "data/language_pair_waitk_dataset.py":
if self.src_lm is not None:
    example["source_lm"] = self.src_lm[index]
never runs, which leaves the "source_lm" key unset and eventually leads to an error in:
if lm_pad_idx is not None:
    src_lm_tokens = merge("source_lm", left_pad=left_pad_source, lm_input=True)
    batch["net_input"]["lm_input"] = src_lm_tokens.index_select(0, sort_order)
Key "source_lm" not found. If the data preperation script is ran with a language_model parameter then it again throws an error because:
list(self.dict.max_model_input_sizes.values())[0]
self.dict does not have max_model_input_sizes for the facebook/xglm-564M model. This can be fixed by replacing it with:
# Fall back to a default maximum length when max_model_input_sizes is missing or empty
default_max_len = 3000
try:
    max_len = list(self.dict.max_model_input_sizes.values())[0]
except (AttributeError, IndexError):
    # e.g. facebook/xglm-564M: the attribute is absent, or the dict is empty
    max_len = default_max_len
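For what it's worth, the missing attribute can be checked outside the pipeline. A sketch, assuming self.dict wraps a HuggingFace tokenizer (max_model_input_sizes is a legacy per-checkpoint dict that many tokenizers, XGLM's included, leave empty or unset):
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/xglm-564M")

# Legacy attribute mapping checkpoint names to max lengths; empty or absent
# here, which is why list(...values())[0] fails with IndexError/AttributeError.
print(getattr(tok, "max_model_input_sizes", None))

# The newer attribute; note it can be a very large sentinel value when the
# checkpoint does not define a limit.
print(tok.model_max_length)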
This workaround seemingly fixes the whole pipeline, yet at this point I do not understand why the given instructions do not lead to the expected behaviour, nor whether the workaround is correct. Could you please clarify?
Expected behavior
The above-mentioned problems should not have happened.
Environment
fairseq Version (e.g., 1.0 or main):
PyTorch Version (e.g., 1.0)
OS (e.g., Linux):
How you installed fairseq (pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:
Hello @MrGranddy, sorry for the issue.
The reason the LM data was not generated is that the LM name was not given to prepare-data-joint-dict.sh.
The fix was committed in 1f0aff2.
Please provide the LM name as the last argument.
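For anyone else hitting this, that presumably means an invocation like the following (using the facebook/xglm-564M model named earlier in this issue; substitute your own LM):
bash prepare-data-joint-dict.sh $DATAPATH $SRC $TGT $BPESIZE $DESTDIR facebook/xglm-564M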
Thank you for the clarification and the fast response, it helped a ton :)
I am already working with the max_len workaround, but I will try with transformers==4.30 as soon as possible and share the results here. Thanks a lot :)