Preprocessing #3

Open
QAQ-v opened this issue Sep 18, 2019 · 13 comments

QAQ-v commented Sep 18, 2019

Hi,

Could you please release the preprocessing code for generating the structural sequences and the commands for applying BPE, i.e., how to get the files in corpus_sample/all_path_corpus and corpus_sample/five_path_corpus?

Thanks.

@Amazing-J (Owner)

Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.
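
For example, a minimal sketch with anytree (illustrative only, not the repository's actual preprocessing script; the toy tree and the path linearization below are assumptions):

```python
# Illustrative sketch only: build a toy tree with anytree and print every
# root-to-leaf path as a space-separated token sequence, which is roughly
# the kind of path-based structural sequence discussed here.
from anytree import Node

# Toy tree standing in for a linearized AMR graph (structure is made up).
root = Node("want-01")
boy = Node("boy", parent=root)
go = Node("go-02", parent=root)
Node("boy", parent=go)

# One sequence per root-to-leaf path.
for leaf in root.leaves:
    print(" ".join(node.name for node in leaf.path))
```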

QAQ-v commented Sep 26, 2019

> Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.

Thanks for your reply! I am still confused about how to get the structural sequences; releasing the preprocessing code or the preprocessed data might be a better way to help people run your model.

Meanwhile, there is another question. I trained the Transformer baseline implemented in OpenNMT with the same hyperparameter settings as yours on LDC2015E86. When I compute the BLEU score on the BPE-segmented predictions I get a result comparable to Table 3 of your paper (25.5), but after I remove the "@@" from the predictions the BLEU drops a lot. So I am wondering whether the BLEU results you reported in Table 3 were computed on the BPE-segmented predictions. Did you remove the "@@" from the final predictions of the model?

@Amazing-J (Owner)

After deleting "@@ " the BLEU score should not drop; it should rise considerably. Are you sure you are doing the BPE process correctly? Note that not only the "@@" but also the following space has to be deleted ("@@ ").
The target side needs nothing but tokenization (use the PTB tokenizer).
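
For concreteness, a minimal sketch of that post-processing step (not the exact script used here), assuming the predictions arrive one sentence per line on stdin:

```python
# Remove the BPE marker "@@ " (including the following space) from each
# predicted line before scoring BLEU. Mirrors the common one-liner:
#   sed -r 's/(@@ )|(@@ ?$)//g'
import re
import sys

for line in sys.stdin:
    print(re.sub(r"(@@ )|(@@ ?$)", "", line.rstrip("\n")))
```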

QAQ-v commented Sep 26, 2019

> After deleting "@@ " the BLEU score should not drop; it should rise considerably. Are you sure you are doing the BPE process correctly? Note that not only the "@@" but also the following space has to be deleted ("@@ ").
> The target side needs nothing but tokenization (use the PTB tokenizer).

Thanks for your reply!

I followed the instructions to delete "@@ " (sed -r 's/(@@ )|(@@ ?$)//g'), so there shouldn't be any mistakes there. So you mean you only apply BPE on the source side and not on the target side? But in that case the source and target sides do not share the same subword vocabulary; do you still share the vocabulary in the model? Could you please release the BPE code? That might be more efficient and clearer.

@Amazing-J (Owner)

What I mean is that both the source and target sides need BPE during training, while the target side does not need BPE during testing.
BPE is a commonly used method in machine translation; there is no special code.

QAQ-v commented Sep 26, 2019

> What I mean is that both the source and target sides need BPE during training, while the target side does not need BPE during testing.
> BPE is a commonly used method in machine translation; there is no special code.

Thanks for your patient reply!

I am still a little confused. Do you apply BPE only on the training set and not on the test set at all? Or do you apply BPE on the source side of the test set but not on its target side?

@Amazing-J (Owner)

Yes. During testing, only the source side needs BPE; then compute BLEU after deleting "@@ ".

QAQ-v commented Sep 26, 2019

> Yes. During testing, only the source side needs BPE; then compute BLEU after deleting "@@ ".

Got it :) I will give it a try, thanks!

QAQ-v commented Sep 26, 2019

> Yes. During testing, only the source side needs BPE; then compute BLEU after deleting "@@ ".

Sorry to bother you again: what should {num_operations} be set to in the following command? The default value of 10000?

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}

@Amazing-J (Owner)

On LDC2015E86: 10000
On LDC2017T10: 20000
train_file: cat train_source+train_target (i.e., the concatenation of the training source and target files)
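
For concreteness, the setup above could be scripted roughly as follows (a sketch with placeholder file names, not the exact commands used for the paper):

```python
# Learn joint BPE codes on the concatenation of train_source and train_target,
# then apply them to the training files and to the test source only.
# File names are placeholders.
import subprocess

NUM_OPERATIONS = 10000  # 10000 for LDC2015E86, 20000 for LDC2017T10

# train_file = cat train_source train_target
with open("train_all", "w") as out:
    for name in ("train_source", "train_target"):
        with open(name) as f:
            out.write(f.read())

# subword-nmt learn-bpe -s {num_operations} < train_all > codes_file
with open("train_all") as fin, open("codes_file", "w") as fout:
    subprocess.run(["subword-nmt", "learn-bpe", "-s", str(NUM_OPERATIONS)],
                   stdin=fin, stdout=fout, check=True)

# Apply BPE to the training source/target and the test source; the test
# target stays as plain tokenized text for BLEU.
for name in ("train_source", "train_target", "test_source"):
    with open(name) as fin, open(name + ".bpe", "w") as fout:
        subprocess.run(["subword-nmt", "apply-bpe", "-c", "codes_file"],
                       stdin=fin, stdout=fout, check=True)
```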

QAQ-v commented Sep 26, 2019

> train_file: cat train_source+train_target

So you follow the instructions in "Best practice advice for Byte Pair Encoding in NMT", right?

If so, do you still keep --vocabulary-threshold at 50?
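
For reference, the recipe in that best-practice section looks roughly like this (a sketch with placeholder file names; whether this exact variant was used for the paper is not confirmed here):

```python
# Learn joint BPE codes plus per-side vocabularies, then apply BPE with a
# vocabulary filter so that subwords below the threshold are split further.
# File names and the merge count are placeholders.
import subprocess

subprocess.run(
    ["subword-nmt", "learn-joint-bpe-and-vocab",
     "--input", "train_source", "train_target",
     "-s", "10000",
     "-o", "codes_file",
     "--write-vocabulary", "vocab.source", "vocab.target"],
    check=True)

for side in ("source", "target"):
    with open(f"train_{side}") as fin, open(f"train_{side}.bpe", "w") as fout:
        subprocess.run(
            ["subword-nmt", "apply-bpe", "-c", "codes_file",
             "--vocabulary", f"vocab.{side}",
             "--vocabulary-threshold", "50"],
            stdin=fin, stdout=fout, check=True)
```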

Amazing-J commented Sep 26, 2019 via email

@Bobby-Hua

@Amazing-J Hi! I have the same question regarding generating the structural sequences. Can you provide more insight on how to use [anytree](https://pypi.org/project/anytree/2.1.4/) to get corpus_sample/all_path_corpus and corpus_sample/five_path_corpus? Any example preprocessing code would be much appreciated!
