
Data Preprocessing #1

Open · Cartus opened this issue Sep 8, 2019 · 5 comments


Cartus commented Sep 8, 2019

Hi, thanks for the great work!

I'm trying to run the code, but I don't know how to preprocess the AMR corpus. Could you explain the data preprocessing steps?

@Amazing-J (Owner)

Our baseline input can be the same linearized AMR graph as in Konstas et al.
Only the concept nodes are retained as input to the Transformer model.
-train_src          # concept node sequence
-train_structure1   # first token of the path from Xi to Xj
-train_structure2   # second token of the path from Xi to Xj
........ (and so on for the remaining path positions)
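
A minimal sketch of the idea behind these structure files (not the authors' preprocessing code): for every ordered pair of concepts (Xi, Xj), take the shortest path between them in the AMR graph and write the k-th label of that path into the k-th structure file. The toy graph, the "self"/"None" tokens, the "-of" suffix for edges walked in reverse, the truncation length, and the flattened line layout are all illustrative assumptions, not the repo's exact format.

# Sketch only -- assumptions as noted above.
import networkx as nx

# Toy AMR for "The boy wants to go":
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
g = nx.DiGraph()
g.add_edge("want-01", "boy", label=":ARG0")
g.add_edge("want-01", "go-01", label=":ARG1")
g.add_edge("go-01", "boy", label=":ARG0")

concepts = list(g.nodes())          # the concept node sequence for -train_src
undirected = g.to_undirected()

def path_labels(src, dst):
    """Edge labels along the shortest path; '-of' marks an edge walked backwards."""
    if src == dst:
        return ["self"]
    nodes = nx.shortest_path(undirected, src, dst)
    labels = []
    for u, v in zip(nodes, nodes[1:]):
        if g.has_edge(u, v):
            labels.append(g[u][v]["label"])
        else:
            labels.append(g[v][u]["label"] + "-of")
    return labels

# One path per ordered concept pair, row-major over the pair matrix.
paths = [path_labels(xi, xj) for xi in concepts for xj in concepts]

max_len = 4                         # assumed number of structure files / path truncation
structure_lines = [
    " ".join(p[k] if k < len(p) else "None" for p in paths)
    for k in range(max_len)
]

print(" ".join(concepts))           # -> one line of -train_src
print(structure_lines[0])           # -> one line of -train_structure1
print(structure_lines[1])           # -> one line of -train_structure2

Presumably the real files are line-aligned, i.e. each AMR contributes one line to -train_src and one line to each -train_structureK.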


Cartus commented Sep 9, 2019

Hi @Amazing-J,

Thank you for your prompt reply!

For the concept node sequence, I can use NeuralAmr (https://github.com/sinantie/NeuralAmr) to get the linearized sequence.

I also have two questions. The first is how to construct the structural sequences. The second is: since the model requires sub-word units produced by BPE, how should the concept node sequence be generated under this setting?


dungtn commented Sep 23, 2019

Hi @Amazing-J,

Thank you for releasing the code! As @Cartus pointed out, could you provide the code for running BPE over the source, i.e., the linearized AMRs?

Best!


dungtn commented Sep 24, 2019

Assuming that I've done the right thing for BPE by running

# learn a 10k-merge BPE model on the training source, then apply it to the dev source
subword-nmt learn-bpe -s 10000 < ...LDC2015E86/training_source > codes.bpe
subword-nmt apply-bpe -c codes.bpe < ...LDC2015E86/dev_source > dev_source_bpe

then I still got this error:

FileNotFoundError: [Errno 2] No such file or directory: ...LDC2015E86/data_vocab.pt

How can I generate this file?


dungtn commented Sep 24, 2019

Alright, I found out that I also have to run preprocess.sh. Thanks!
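
For anyone hitting the same error: once preprocess.sh has produced data_vocab.pt, a quick sanity check is to load it with torch. This assumes the repo follows OpenNMT-py conventions, where the vocabulary file is a torch-serialised container of fields; the exact layout may differ between versions.

# Sanity check only; assumes an OpenNMT-py-style torch-serialised vocab file.
import torch

obj = torch.load("data_vocab.pt")
print(type(obj))
# Depending on the OpenNMT-py version this is a dict of fields
# or a list of (name, field) pairs.
pairs = obj.items() if isinstance(obj, dict) else obj
for name, field in pairs:
    print(name, type(field).__name__)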
