Can you explain how you pre-processed your Java dataset from raw data? #8

Open
hungkien05 opened this issue Apr 13, 2023 · 0 comments

Hi,

I am trying to run the ADAMO model (https://arxiv.org/pdf/2201.05222.pdf) on my own datasets. The ADAMO authors use your datasets and your preprocessing, but it seems that ADAMO only needs your *.token.code and *.token.nl files.

I tried to pre-process my dataset the way you did, but I ran into some confusion. You mentioned in #2 that you use the tokenizer from NeuralCodeSum; however, when I run that tokenizer, the structure of its output does not match the processed dataset in the Google Drive link you provided.
In short, I don't understand which pre-processing steps you used to produce the dataset in the Google Drive.

Could you explain how you pre-processed the raw Java code dataset to get the final *.token.code files?

Thank you very much!
