Pre- and post-processing text in Simuleval #18
Hey @xutaima, I kindly remind you that I'm looking for your help on this issue. The BLEU score I got on a subset of the WMT15 de-en test set is significantly worse than what is reported in the paper.
@xutaima, could you please let people know if you are interested in making this code base work (and if so, when)? I'd like to point out that we need your help because both MMA and SimulEval have been mainly developed by you, and people are having a really difficult time getting them to run ;)
@kurtisxx While I sincerely appreciate your strong interest in our work and your kind reminders, I don't think you really have to remind me this frequently. I will do my best to reply as soon as possible. First of all, I do apologize that the code and documentation for MMA, especially the text-to-text part, are out of date. In recent years we have shifted our focus more towards speech-to-text simultaneous translation. Also, the idea of SimulEval came after the MMA paper: we wanted a generic framework for evaluating simultaneous translation models, whereas in the MMA paper the evaluation was ad hoc. We haven't managed to finish all the updates but have been working on it. As to your questions:
Thank you for your reply, @xutaima.
Sure, here is my training log:
I ran the training command given for the MMA-Hard model in the [README](https://github.com/pytorch/fairseq/blob/master/examples/simultaneous_translation/docs/ende-mma.md). When I evaluated the best_checkpoint of this training on the full test set, I got the following scores:
What is more interesting is its performance on a subset of the training set. After evaluating on the test set, I also created a subset of 200 examples from the training set on which I evaluated the model:
My comments to your answers above:
I didn't quite understand how tokenization can be done in the segment_to_units function. I was thinking that a segment is only a word/token, not the full sentence? Do you mean running a tokenizer like Moses at the word level? Does that make sense? Also, this en-ja code here says for segment_to_words: "# Split a full word (segment) into subwords (units)"?
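If it helps make this concrete, word-level Moses tokenization followed by BPE inside segment_to_units might look roughly like the sketch below (illustrative only, not code from this repo; the sacremoses/subword-nmt usage, the language code, and the BPE codes path are all assumptions):

```python
# Illustrative sketch: tokenize an incoming word-level segment with Moses and
# split it into BPE units inside segment_to_units.  Library choices and paths
# are assumptions, not taken from this thread.
from sacremoses import MosesTokenizer
from subword_nmt.apply_bpe import BPE

tokenizer = MosesTokenizer(lang="de")               # hypothetical source language
bpe = BPE(open("bpe.codes", encoding="utf-8"))      # hypothetical BPE codes file

def segment_to_units(segment):
    # Moses tokenization at the word level, then BPE splitting into subword units.
    tokenized = tokenizer.tokenize(segment, return_str=True)
    return bpe.process_line(tokenized).split()
```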
As you can see above, I am merging subwords and clearing the queue after merging them in the units_to_segment function, not in update_states_write. Is this okay?
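For reference, the kind of subword merging described here might look like the following simplified sketch (illustrative only; the queue is reduced to a plain list and this is not the actual code from this thread):

```python
# Simplified sketch of merging BPE units back into a full word: wait until a
# unit arrives without the "@@" continuation marker, emit the merged word, and
# clear the buffer.  Not the actual agent code; the queue is just a list here.
def units_to_segment(units):
    if not units or units[-1].endswith("@@"):
        return None                                    # word still incomplete, keep waiting
    segment = "".join(u.replace("@@", "") for u in units)
    units.clear()                                      # clear the queue after emitting the word
    return segment

# Example: ["Toky@@"] -> None, then ["Toky@@", "o"] -> "Tokyo"
```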
What's your hardware setup? Can you share the full log including the configurations? There are only 2 epochs, which doesn't look normal to me. As to the tokenizer, yes, it makes sense at the word level; in real-time translation there is no full source sentence. In the en-ja example, we use a SentencePiece model which serves as both word splitter and tokenizer, so the two can be done in one pass.
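The one-pass idea with SentencePiece could be sketched roughly like this (illustrative; the model path is hypothetical and this is not the en-ja agent's exact code):

```python
# Sketch of the one-pass approach: a single SentencePiece model acts as both
# word splitter and tokenizer, so an incoming segment can be turned directly
# into subword units.  The model path is hypothetical.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")

def segment_to_units(segment):
    # Tokenization and subword splitting happen together in one encode call.
    return sp.encode(segment, out_type=str)
```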
Thanks for your reply, @xutaima! I appreciate that! I used a single V100 GPU for training and my operating system is Linux. I think the training stops in the middle of the second epoch because --max-iter is set to 50000, as instructed in the readme file. I used the following command to start training:
Aren't those the correct parameters to train the MMA-Hard model on the de-en data? Unfortunately, I can't find my training log at the moment, but I started another run just to double-check everything. It should be finished in a couple of hours. I'll post the full training log, inference-time log, and scores here once it finishes. Here is the link for the log of the test set evaluation I mentioned above:
Training has not finished yet, so I need to wait a bit more to share the output, but here are the training parameters:
Hi, thanks for sharing the command. Sorry, we do have some typos in the documentation. I would suggest you run the following command:
There are two modifications: 1. a bigger architecture. As to the inference log, the merging of subwords looks good to me, but you probably need detokenization. For instance, in "Prime Minister India and Japan are in Tokyo .",
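The detokenization step being suggested could look something like this (a sketch using sacremoses, which is an assumption; the thread doesn't name a specific detokenizer):

```python
# Sketch: detokenize a merged, space-tokenized prediction before scoring.
# sacremoses is an assumption here; any Moses-compatible detokenizer would do.
from sacremoses import MosesDetokenizer

detok = MosesDetokenizer(lang="en")
tokens = "Prime Minister India and Japan are in Tokyo .".split()
print(detok.detokenize(tokens))
# -> "Prime Minister India and Japan are in Tokyo."
```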
Thanks for your prompt reply! I also have access to a node with 4 GPUs. In that case, should I set update-freq to 16?
Yes
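The reasoning behind the 16 is presumably that the effective batch size scales with both the number of GPUs and --update-freq; assuming the single-GPU recipe used --update-freq 64 (an assumption, it isn't stated explicitly here), the two setups match:

```python
# Effective batch size scales as num_gpus * update_freq * per-GPU batch size.
# Assuming the single-GPU recipe used --update-freq 64 (not stated explicitly here):
single_gpu_accumulation = 1 * 64   # 1 GPU,  update-freq 64
four_gpu_accumulation   = 4 * 16   # 4 GPUs, update-freq 16
assert single_gpu_accumulation == four_gpu_accumulation  # same effective batch size
```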
I have one more question, @xutaima. I got this message:
Did you use the --fp16 or --amp flags for faster training? I mean, does the MMA/fairseq code support these flags, or would you suggest not using them?
Not very sure about
@kurtisxx It seems that your model is still converging. I don't have a model on hand at the moment, and we used to have an ad hoc method for inference rather than SimulEval. Sure, I can retrain the model on my side and debug what bugs there are.
@xutaima do you remember what the model's validation set perplexity was after training completed? (Also, how many epochs did it take?)
That would be great! Thank you! I'd also really appreciate it if you could share the training/inference code that you're going to use as well.
@xutaima In addition to training a model on your side, could you please also look at this issue: facebookresearch/fairseq#3894. As mentioned in the PR, the MMA code in fairseq's master branch doesn't work right now (there are runtime errors). I am wondering whether there are more bugs beyond this one.
@xutaima, how did the training go on your side? Could you also please let me know what you think about my previous questions (let me copy them here just to make things easier):
```python
# encode previous predicted target tokens
tgt_indices = self.to_device(
    torch.LongTensor(
        [self.model.decoder.dictionary.eos()]
        + [
            self.dict['tgt'].index(x)
            for x in states.units.target.value
            if x is not None
        ]
    ).unsqueeze(0)
)
x, outputs = self.model.decoder.forward(
    prev_output_tokens=tgt_indices,
    encoder_out=states.encoder_states,
    incremental_state=states.incremental_states,
)
```

Basically, I've copied this part from the EN-JA inference-time code, as you suggested. However, I didn't understand why the EOS token from the decoder's dictionary is prefixed to the tgt_indices list?
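For context, fairseq's decoder is trained with prev_output_tokens being the target shifted right and prefixed with EOS (EOS doubles as the beginning-of-sentence symbol), so prefixing dictionary.eos() to the partial hypothesis mirrors the training-time convention. A toy illustration (the tokens are made up):

```python
# Toy illustration of fairseq's prev_output_tokens convention: the decoder
# consumes the target shifted right, with EOS standing in for BOS.
target             = ["Wie", "geht", "es", "dir", "</s>"]   # made-up target tokens
prev_output_tokens = ["</s>", "Wie", "geht", "es", "dir"]   # decoder input at training time
# At inference time the partial hypothesis is therefore prefixed with EOS as well.
```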
I've already spent a good amount of time and also money on this. I'm really looking forward to the issues being resolved as quickly as possible so that we can replicate the results reported in the paper.
@xutaima, kindly reminding you that I've been waiting for an answer from you for 4 days. If you're not interested in fixing the bugs in the MMA code and also not interested in providing people with code they can actually use to replicate your results, please let me know. In that case, I can get in touch with the co-authors and/or other people on fairseq and ask if there is a way to make this project run.
@kurtisxx thank you for your patience. We're working towards a deadline, so we don't have much bandwidth at the moment. In general, this kind of OSS project is provided as-is with no guarantee of support. However, we will try to address issues on a best-effort basis. Thank you for your understanding.
@jmp84 Thank you for your prompt reply, and I wish you good luck with your deadlines. Nevertheless, I'd just like to make something clear. I am not asking you to update the MMA repository to make it compatible with the current fairseq; I am okay with any working combination of fairseq & MMA. Unfortunately, I doubt there is such a combination, as I think the MMA code was not working even when it was first released, because of some functions (e.g. Link) that are called but not defined. I don't think that aligns well with footnote 1 in the paper, which says "The code is available at https://github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation".
@kurtisxx, I am also a user of FairseqMMA and I would like to ask whether you have found a feasible way to reproduce its results.
Hello @kurtisxx, can you share the mma-dummy/mmaAgent.py file? Thank you very much.
What is the ad hoc method?
I am playing with the MMA-hard model to replicate the WMT15 DE-EN experiments reported in the paper, and my question is about preprocessing and postprocessing the data. The paper says that:
Following what is said above, I applied the Moses scripts to tokenize the raw files and applied BPE to the tokenized files. Then, the tokenized and BPE-applied train, val, and test files were binarized using the following fairseq preprocess command:
After that, I trained an MMA-hard model using the binarized data. Now, I would like to evaluate a checkpoint (w.r.t. latency and BLEU) using SimulEval. My first question is about the file format: in which format should I provide the test files as --source and --target to the simuleval command? There are three options as far as I can see:
I am following the EN-JA wait-k model's agent file to understand what should be done. However, the difference between the experiment I'd like to replicate and the EN-JA experiment is that in EN-JA a SentencePiece model is used for tokenization, whereas in my case Moses is used and BPE is applied.
So, I tried the following:
I provided the paths of the TOKENIZED files as --source and --target to simuleval. Also, I've implemented the segment_to_units and build_word_splitter functions as follows, but I couldn't figure out how I should implement units_to_segment.
I tried to test this implementation as follows:
So, my questions are:
1. Do my segment_to_units and build_word_splitter functions seem correct?
2. How should units_to_segment and update_states_write be implemented?

Edit: When I evaluate the best checkpoint on a subset of the test set using the above code, I get the following output: