Generating code documentation with code2seq #120

Closed
balysMorkunas opened this issue May 5, 2022 · 8 comments

Comments

@balysMorkunas

balysMorkunas commented May 5, 2022

Hi,

I am currently working on a bachelor's project in which I aim to test how inline code comments in the training dataset affect the performance of code documentation generation with the code2seq model. Could you briefly tell me what this model can do in terms of automatic code comment generation, and how one would set that up? I am very thankful for your time.

Edit: I can see that issue #34 has related information. I will of course follow it for now, but any additional tips would be much appreciated!

@urialon
Contributor

urialon commented May 5, 2022

Hi @balysMorkunas,
Thank you for your interest in code2seq!

I believe that the paper may give a better intuition than what I can describe in brief: https://openreview.net/pdf?id=H1gKYo09tX

Let me know if you have any specific questions!
Uri

@balysMorkunas
Author

balysMorkunas commented May 10, 2022

Thank you, I'll get in touch again if any serious questions arise!

@balysMorkunas
Author

Hi again,

I started looking at how to retrain the model and preprocess the dataset for documentation generation. I followed your suggestion in issue #34, where you suggest changing JavaExtractor to output documentation instead of method names.

Could you please elaborate or give an example of how to do that? Do you by chance mean to use node.getJavaDoc() instead of node.getName()? What other changes should I be aware of?

Thank you very much for your time and effort, I really appreciate it,
Balys.

@bacevicius

bacevicius commented May 19, 2022

Hi @urialon, I am in a very similar situation to @balysMorkunas and would also like to hear your input about this question. Thank you for your time!

@urialon
Contributor

urialon commented May 19, 2022

Hi @balysMorkunas and @bacevicius,
Thank you for your interest in code2seq!

> Do you by chance mean to use node.getJavaDoc() instead of node.getName()?

Basically yes!

Another option, if you wish to train on an existing dataset, is to have the extractor output a unique ID in place of the method name, and then replace each unique ID with the corresponding documentation later. See also #45 for additional scripts and hyperparameters.

Best,
Uri
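
To make the unique-ID approach concrete, here is a minimal Python sketch of the replacement step. It is not code from the code2seq repository. It assumes that each line of the extractor output has the form `<target> <context> <context> ...`, that the target's subtokens are joined with `|`, and that a mapping from ID to documentation text has already been collected separately. The function names `doc_to_target` and `replace_ids`, the IDs, and the 30-token cap are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator


def doc_to_target(doc: str, max_tokens: int = 30) -> str:
    """Lowercase a documentation string and join its word tokens with '|'."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return "|".join(tokens[:max_tokens])


def replace_ids(lines: Iterable[str], docs: Dict[str, str]) -> Iterator[str]:
    """Swap the placeholder ID at the start of each line for its documentation.

    Assumed line format: '<id> <context> <context> ...'.
    Lines whose ID has no usable documentation are dropped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        placeholder, _, contexts = line.partition(" ")
        doc = docs.get(placeholder)
        if doc is None:
            continue  # no documentation recorded for this method
        target = doc_to_target(doc)
        if target:
            yield f"{target} {contexts}".rstrip()
```

For example, with `docs = {"id42": "Returns the sum of two numbers."}`, the line `id42 a,P1,b c,P2,d` becomes `returns|the|sum|of|two|numbers a,P1,b c,P2,d`, and lines with unknown IDs are skipped. How the documentation should be tokenized and truncated is a design choice that depends on the dataset.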

@balysMorkunas
Author

balysMorkunas commented May 20, 2022

Thanks for your answer, @urialon!

Do you think that the hyperparameters config.SUBTOKENS_VOCAB_MAX_SIZE = 190000 and config.TARGET_VOCAB_MAX_SIZE = 27000 are enough for documentation generation, or should they be increased?
Is there anything else to watch out for regarding the hyperparameters, such as max_code_len and min_code_len in JavaExtractor?

Thank you very much for your time,
Balys.

@urialon
Contributor

urialon commented May 27, 2022

Hi @balysMorkunas,
Sorry for the delayed response.

These hyperparameters look OK to me, but the right values depend on the exact dataset and can never really be known in advance.
max_code_len and min_code_len refer to the size of the functions that you consider, so they also depend on the dataset you are working with.

Best,
Uri
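
Since the right vocabulary size depends on the dataset, one way to sanity-check a value such as TARGET_VOCAB_MAX_SIZE is to measure how much of the training targets the most frequent tokens cover. The Python sketch below is an illustration, not part of code2seq. It assumes the same line format as the extractor output (`<target> <context> ...`, with target subtokens joined by `|`), and the function names `target_token_counts` and `coverage` are hypothetical.

```python
from collections import Counter
from typing import Iterable


def target_token_counts(lines: Iterable[str]) -> Counter:
    """Count how often each target subtoken occurs across all examples."""
    counts: Counter = Counter()
    for line in lines:
        target = line.split(" ", 1)[0].strip()
        if target:
            counts.update(target.split("|"))
    return counts


def coverage(counts: Counter, vocab_size: int) -> float:
    """Fraction of target token occurrences covered by the top-N tokens."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(freq for _, freq in counts.most_common(vocab_size))
    return kept / total
```

Running `coverage(target_token_counts(open(path)), 27000)` on a training file, where `path` is wherever the preprocessed data lives, gives the share of target tokens that would not be mapped to an unknown token. If that share is low for documentation targets, a larger vocabulary may be worth trying.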

@balysMorkunas
Author

Thanks for your help!
