Preprocessing #3

Open
QAQ-v opened this issue Sep 18, 2019 · 13 comments

QAQ-v commented Sep 18, 2019

Hi,

Could you please release the preprocessing code for generating the structural sequences and the commands for applying BPE, i.e., how to get the files in corpus_sample/all_path_corpus and corpus_sample/five_path_corpus?

Thanks.

@Amazing-J (Owner)

Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.
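
For example, a minimal sketch with anytree (illustrative only, not the repository's actual preprocessing script; the toy tree and the path linearization below are assumptions):

```python
# Illustrative sketch only: build a toy tree with anytree and print every
# root-to-leaf path as a space-separated token sequence, which is roughly
# the kind of path-based structural sequence discussed here.
from anytree import Node

# Toy tree standing in for a linearized AMR graph (structure is made up).
root = Node("want-01")
boy = Node("boy", parent=root)
go = Node("go-02", parent=root)
Node("boy", parent=go)

# One sequence per root-to-leaf path.
for leaf in root.leaves:
    print(" ".join(node.name for node in leaf.path))
```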

QAQ-v commented Sep 26, 2019

> Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.

Thanks for your reply! I am still confused about how to get the structural sequences; releasing the preprocessing code or the preprocessed data might be a better way to help people run your model.

Meanwhile, there is another question. I trained the Transformer baseline implemented in OpenNMT with the same hyperparameter settings as yours on LDC2015E86. When I compute the BLEU score on the BPE-segmented predictions I get a result comparable to Table 3 of your paper (25.5), but after I remove the "@@" from the predictions the BLEU drops a lot. So I am wondering whether the BLEU results you reported in Table 3 were computed on the BPE-segmented predictions. Did you remove the "@@" from the final predictions of the model?

@Amazing-J (Owner)

After deleting "@@ " the BLEU score should not drop; it should rise considerably. Are you sure you are doing the BPE process correctly? Note that not only the "@@" but also the following space has to be deleted ("@@ ").
The target side needs nothing but tokenization (use the PTB tokenizer).
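
For concreteness, a minimal sketch of that post-processing step (not the exact script used here), assuming the predictions arrive one sentence per line on stdin:

```python
# Remove the BPE marker "@@ " (including the following space) from each
# predicted line before scoring BLEU. Mirrors the common one-liner:
#   sed -r 's/(@@ )|(@@ ?$)//g'
import re
import sys

for line in sys.stdin:
    print(re.sub(r"(@@ )|(@@ ?$)", "", line.rstrip("\n")))
```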

QAQ-v commented Sep 26, 2019

> After deleting "@@ " the BLEU score should not drop; it should rise considerably. Are you sure you are doing the BPE process correctly? Note that not only the "@@" but also the following space has to be deleted ("@@ ").
> The target side needs nothing but tokenization (use the PTB tokenizer).

Thanks for your reply!

I followed the instructions to delete "@@ " (sed -r 's/(@@ )|(@@ ?$)//g'), so there shouldn't be any mistakes there. So you mean you only apply BPE on the source side and not on the target side? But in that case the source and target sides do not share the same subword vocabulary; do you still share the vocabulary in the model? Could you please release the BPE code? That might be more efficient and clearer.

@Amazing-J (Owner)

What I mean is that both the source and target sides need BPE during training, while the target side does not need BPE during testing.
BPE is a commonly used method in machine translation; there is no special code.

QAQ-v commented Sep 26, 2019

> What I mean is that both the source and target sides need BPE during training, while the target side does not need BPE during testing.
> BPE is a commonly used method in machine translation; there is no special code.

Thanks for your patient reply!

I am still a little confused. Do you apply BPE only on the training set and not on the test set at all? Or do you apply BPE on the source side of the test set but not on its target side?

@Amazing-J (Owner)

Yes. During testing, only the source side needs BPE; then compute BLEU after deleting "@@ ".

QAQ-v commented Sep 26, 2019

> Yes. During testing, only the source side needs BPE; then compute BLEU after deleting "@@ ".

Got it :) I will give it a try, thanks!

QAQ-v commented Sep 26, 2019

> Yes. During testing, only the source side needs BPE; then compute BLEU after deleting "@@ ".

Sorry to bother you again: what should {num_operations} be set to in the following command? The default value of 10000?

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}

@Amazing-J (Owner)

On LDC2015E86: 10000
On LDC2017T10: 20000
train_file: cat train_source+train_target (i.e., the concatenation of the training source and target files)
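
For concreteness, the setup above could be scripted roughly as follows (a sketch with placeholder file names, not the exact commands used for the paper):

```python
# Learn joint BPE codes on the concatenation of train_source and train_target,
# then apply them to the training files and to the test source only.
# File names are placeholders.
import subprocess

NUM_OPERATIONS = 10000  # 10000 for LDC2015E86, 20000 for LDC2017T10

# train_file = cat train_source train_target
with open("train_all", "w") as out:
    for name in ("train_source", "train_target"):
        with open(name) as f:
            out.write(f.read())

# subword-nmt learn-bpe -s {num_operations} < train_all > codes_file
with open("train_all") as fin, open("codes_file", "w") as fout:
    subprocess.run(["subword-nmt", "learn-bpe", "-s", str(NUM_OPERATIONS)],
                   stdin=fin, stdout=fout, check=True)

# Apply BPE to the training source/target and the test source; the test
# target stays as plain tokenized text for BLEU.
for name in ("train_source", "train_target", "test_source"):
    with open(name) as fin, open(name + ".bpe", "w") as fout:
        subprocess.run(["subword-nmt", "apply-bpe", "-c", "codes_file"],
                       stdin=fin, stdout=fout, check=True)
```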

QAQ-v commented Sep 26, 2019

> train_file: cat train_source+train_target

So you follow the instructions in "Best practice advice for Byte Pair Encoding in NMT", right?

If so, do you still keep --vocabulary-threshold at 50?
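
For reference, the recipe in that best-practice section looks roughly like this (a sketch with placeholder file names; whether this exact variant was used for the paper is not confirmed here):

```python
# Learn joint BPE codes plus per-side vocabularies, then apply BPE with a
# vocabulary filter so that subwords below the threshold are split further.
# File names and the merge count are placeholders.
import subprocess

subprocess.run(
    ["subword-nmt", "learn-joint-bpe-and-vocab",
     "--input", "train_source", "train_target",
     "-s", "10000",
     "-o", "codes_file",
     "--write-vocabulary", "vocab.source", "vocab.target"],
    check=True)

for side in ("source", "target"):
    with open(f"train_{side}") as fin, open(f"train_{side}.bpe", "w") as fout:
        subprocess.run(
            ["subword-nmt", "apply-bpe", "-c", "codes_file",
             "--vocabulary", f"vocab.{side}",
             "--vocabulary-threshold", "50"],
            stdin=fin, stdout=fout, check=True)
```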

Amazing-J commented Sep 26, 2019 via email

@Bobby-Hua

@Amazing-J Hi! I have the same question regarding generating the structural sequences. Can you provide more insight on how to use [anytree](https://pypi.org/project/anytree/2.1.4/) to get corpus_sample/all_path_corpus and corpus_sample/five_path_corpus? Any example preprocessing code would be much appreciated!
