[representation] iS2S: Structure 2 Sequence Code Embeddings #94

danaderp opened this issue Feb 2, 2021 · 6 comments


danaderp commented Feb 2, 2021

Description
Code embeddings are abstract representations of source code employed in multiple software engineering automation tasks such as clone detection, traceability, and code generation. This abstract representation is a mathematical entity known as a tensor. Code tensors allow us to manipulate snippets of code in semantic vector spaces instead of complex data structures such as call graphs. Initial attempts focused on identifying deep learning strategies to compress code into lower-dimensional vectors (code2vec). Unfortunately, these approaches do not consider autoencoder architectures for representing code. The purpose of this project is to combine a structural language model of code with autoencoder architectures to compress source code snippets into lower-dimensional tensors. The lower-dimensional tensors must be evaluated in terms of semantics (clone detection).

**Disentanglement of Source Code Data with Variational Autoencoders**
The performance of deep learning approaches for software engineering generally depends on the representation of source code data. Bengio et al. show that different representations can entangle the explanatory factors of variation behind the data. We hypothesize that source code data contains explanatory factors useful for automating many software engineering tasks (e.g., clone detection, traceability, feature location, and code generation). Although some deep learning architectures in SE are able to extract abstract representations for downstream tasks, we are not able to verify such features since the underlying data is entangled. The objective of code generative models is to capture the underlying generative factors of the data. A disentangled representation, however, would allow us to manipulate a single latent unit that is sensitive to a single generative factor. Separate representational units help explain why deep learning models are able to classify or generate source code without posterior knowledge (or labels). This project aims to identify single representational units from source code data. We will use the CodeSearchNet datasets and Variational Autoencoders to implement the approach.
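
To make the VAE direction concrete, here is a minimal sketch of the reparameterization trick at the core of a variational autoencoder, assuming TensorFlow/Keras; the input width, `latent_dim`, and layer sizes are illustrative placeholders, not values from this project:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 16  # one unit per (hoped-for) disentangled generative factor

class Sampling(layers.Layer):
    """z = mu + sigma * eps keeps the sampling step differentiable."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

inputs = keras.Input(shape=(256,))              # e.g., a flattened snippet encoding
h = layers.Dense(64, activation="relu")(inputs)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(inputs, [z_mean, z_log_var, z])

# Training would add the KL term -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
# to the reconstruction loss (e.g., via model.add_loss) to pressure the latent
# units toward independent, disentangled factors.
```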

Project Goals

  • Review and analyze the literature on structural models in deep learning
  • Implement a vanilla version of an autoencoder where the encoder is a structural language model and the decoder a sequence-based architecture
  • Evaluate the lower-dimensional tensors in terms of semantics (the clone detection problem; see the sketch after this list)
  • Implement an interpretability module to test edge cases of the autoencoder
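
For the clone-detection evaluation, one plausible setup is to score pairs of snippets by the cosine similarity of their latent vectors. A minimal sketch, with placeholder vectors standing in for real `encoder.predict(...)` outputs:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholders: in practice these would be the encoder's latent vectors
# for the two snippets under comparison.
z_a = np.random.rand(64)
z_b = np.random.rand(64)

# Pairs above a (tunable) threshold are flagged as clone candidates.
print("clone candidate:", cosine_similarity(z_a, z_b) > 0.9)
```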

Project Requirements

  • Required Knowledge: Python, Git, and Statistics
  • Preferred Knowledge: Deep Learning, TensorFlow, and DVC

Recommended Readings


danaderp commented Feb 16, 2021

For Sam:

  • Run Code2Vec and check out the architecture
  • Read the three papers and discuss the results
  • Set up the ds4se environment


m13253 commented Mar 16, 2021

Updated work plan:

  • Review and analyze the literature on structural models in deep learning
  • Set up the DS4SE environment
  • Test out the original code2vec
  • Use code2vec as a library to generate embedding vectors from CodeSearchNet-Java
  • Use RNNs to build an autoencoder to compress the dimension of the embedding vectors (see the sketch after this list)
  • Move our autoencoder to use the code2vec encoder
  • Later, switch to a Transformer decoder
  • Evaluate our model for clone detection
  • Test our autoencoder for interpretability by actively transforming the data and evaluating the output
  • If time permits, extend the work to traceability and code generation
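
A minimal sketch of the compression step, assuming the code2vec embedding vectors have already been exported as a NumPy array; a dense bottleneck stands in here for the RNN variant mentioned above, and the array below is random placeholder data:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vectors = np.random.rand(1000, 384).astype("float32")  # placeholder for exported code2vec vectors

inputs = keras.Input(shape=(384,))
z = layers.Dense(32, activation="relu")(inputs)  # compressed representation
outputs = layers.Dense(384)(z)                   # reconstruct the original vector

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(vectors, vectors, epochs=5, batch_size=64)
```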


m13253 commented Mar 19, 2021

Meeting note 2021-03-18:

  • Use the Keras autoencoder as a template to train against CodeSearchNet-Java (sketched below).
  • Use google/sentencepiece for input tokenization.
  • Add an embedding layer after it.
  • Use multiclass cross entropy as the loss function (binary doesn't work).
  • Don't deploy code2vec yet; it doesn't fit.
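
A rough sketch of that pipeline, assuming a sentencepiece model pre-trained on CodeSearchNet-Java (the model path, `max_len`, and layer sizes are hypothetical):

```python
import sentencepiece as spm
from tensorflow import keras
from tensorflow.keras import layers

sp = spm.SentencePieceProcessor(model_file="codesearchnet_java.model")  # hypothetical path
ids = sp.encode("public int add(int a, int b) { return a + b; }", out_type=int)

vocab_size, max_len, latent_dim = sp.vocab_size(), 128, 64

inputs = keras.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, 128, mask_zero=True)(inputs)  # embedding layer after the tokenizer
state = layers.GRU(latent_dim)(x)                     # encoder: final state is the code vector
x = layers.RepeatVector(max_len)(state)               # feed the latent vector at every step
x = layers.GRU(latent_dim, return_sequences=True)(x)  # sequence decoder
outputs = layers.Dense(vocab_size, activation="softmax")(x)  # per-token distribution

autoencoder = keras.Model(inputs, outputs)
# Multiclass (sparse categorical) cross entropy over token ids, per the note above:
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```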

References:


m13253 commented Apr 1, 2021

Meeting note 2021-03-25:

  • Ignore excessively long snippets (see the filter sketch after this list)
  • Prepare for the "design of the case studies"
  • Refine previous sections
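
A minimal sketch of the length filter, using placeholder token-id lists; `max_len` is an assumed cutoff:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 128
corpus = [[5, 9, 12, 7], list(range(300))]             # placeholder tokenized snippets

kept = [ids for ids in corpus if len(ids) <= max_len]  # ignore excessively long snippets
batch = pad_sequences(kept, maxlen=max_len, padding="post")
```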


m13253 commented Apr 1, 2021

Update 2021-04-01:

  • Finally got the autoencoder training running.
  • The network's dimensions were shrunk to deal with OOM situations.
  • Not seeing significant loss reduction.
  • Need help connecting to the GPU inside Docker. (Someone has eaten the VRAM.)
  • Need some help with the "design of the case studies"; if possible, a guideline in text form.


m13253 commented Apr 2, 2021

Meeting note 2021-04-02:

Tasks:

  1. Complete the sampling
  2. Figure out how to incorporate code2vec

Sampling (2 types; see the sketch at the end of this note):

  1. Focus on the encoder and obtain the middle (latent) vectors (← current focus)
  2. Feed random noise into the decoder and see how it generates code

Case studies:

The experiments we are going to run:

  • Check for clones (a clone library is provided)
  • Test the GRU encoder & decoder
  • The other case: the encoder is code2vec, the decoder is a GRU
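
A sketch of both sampling types, reusing the hypothetical `autoencoder` and sentencepiece processor `sp` from the 2021-03-18 note (layer order there: Input, Embedding, GRU encoder, RepeatVector, GRU decoder, Dense):

```python
import numpy as np
from tensorflow import keras

# Type 1: focus on the encoder and collect the middle (latent) vectors.
encoder = keras.Model(autoencoder.input, autoencoder.layers[2].output)
batch = np.random.randint(0, 100, size=(4, 128))  # placeholder token ids
latents = encoder.predict(batch)                  # shape (4, latent_dim)

# Type 2: push random noise through the decoder half and see what it generates.
noise = np.random.normal(size=(1, latents.shape[1])).astype("float32")
x = autoencoder.layers[3](noise)      # RepeatVector
x = autoencoder.layers[4](x)          # decoder GRU
probs = autoencoder.layers[5](x)      # per-token softmax
tokens = np.argmax(probs.numpy(), axis=-1)
print(sp.decode(tokens[0].tolist()))  # back to text via sentencepiece
```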
