How is AST path converted to vector? #159
Hi @abitrolly , Every unique token and unique path in the training data is assigned a randomly initialized vector, and we remember the mapping from every token/path to its ID (an integer), and from every ID to the allocated vector. This is a common practice in neural networks, also known as "word embeddings". Best,
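A minimal sketch of that token/path → ID → vector bookkeeping in plain Python (the names, sizes and the example path string below are illustrative, not taken from the code2vec repo):

```python
import numpy as np

EMBEDDING_DIM = 128      # illustrative; in the real model this is a hyperparameter
rng = np.random.default_rng(0)

vocab = {}               # maps each unique token/path string to an integer ID
vectors = []             # row i is the randomly initialized vector for ID i

def embed(symbol):
    """Return the (initially random) vector associated with a token or path."""
    if symbol not in vocab:
        vocab[symbol] = len(vocab)
        vectors.append(rng.normal(size=EMBEDDING_DIM))
    return vectors[vocab[symbol]]

v = embed("x")                                      # a token
p = embed("(NameExpr)^(Greater)_(IntegerLiteral)")  # a path, treated as one symbol
```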
Hi Uri. Thanks for the fast reply! I am not really familiar with embeddings, so while reading the paper right now, the reference to "continuous vectors for representing snippets of code" doesn't make it any clearer. I need some low-level example, like the one requested in #102, to get a picture of what this "code embedding" is composed of. And a 3D movie of the process. :D It is also interesting to see what "a parallel vocabulary of vectors of the labels" looks like, and how this vector arithmetic with words actually works. That's already a lot of questions, and I haven't even started with chapter "1.1 Applications" :D
Right now I understand the process as follows:

```mermaid
graph LR;
  .java --> AST --> L["leaves and paths"] --> E["vector AKA code embedding"];
```
Chapter 3 gives some hard mathematical abstractions. Thankfully I played a bit with https://docs.python.org/3/library/ast.html to modify Python sources with Python, so I won't break my head on the formal AST definition and can skip over it. :D At the end there is a useful example of "leaves and paths". This expression
Is transformed into leaves
So this thing ^^^ is called a Path-Context. The paper then gives some limitations.
This needs an example. The length of an AST path is apparently the depth of nesting instructions, but the width, whose "values are determined empirically as hyperparameters of our model", breaks my mind. Is it about that? Finally, chapter "4 MODEL" explains the high-level view and jumps directly to "a Bag of Path-Contexts",
How does it look in Python code? And then, if I understand it right, everything in this tuple needs to be a number, because Python does not understand math symbols like these, and because training a model requires arithmetic operations on them. That tuple of numbers is what I meant by "vector" when writing the title for this issue (a rough sketch follows below). There are a lot of questions about the bag, about the size of the every-to-every dict of leaves, and about what "node" and "itself" are in "pairs that contain a node and itself", but those fall out of scope for this question. Originally I was going to ask if the vector is this.
Where
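For concreteness, one way such a tuple of numbers could look in Python (purely illustrative; these dictionaries and IDs are made up, not the repo's actual format):

```python
# a path-context is a triple: (source token, AST path, target token)
path_context = ("x", "(NameExpr)^(Greater)_(IntegerLiteral)", "7")

# vocabularies built over the training data map each string to an integer ID
token_to_id = {"x": 3, "7": 9}
path_to_id = {"(NameExpr)^(Greater)_(IntegerLiteral)": 42}

source, path, target = path_context
ids = (token_to_id[source], path_to_id[path], token_to_id[target])
print(ids)  # (3, 42, 9) -- the "tuple with digits"; each ID then indexes a learned vector
```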
Hi @abitrolly , I am not sure where to begin. I also recommend basic neural-NLP lectures such as Stanford's course. Best,
I am still trying to find time to read the paper till the end and figure out what embeddings are, and how the text fields of a path-context are converted into numerical values for calculations.
Yes. I came here after watching it.
The main idea is to assign a random vector to every kind of symbol, whether that symbol is a path or a token. Then, during training, these vectors are modified such that the loss term that we define is minimized. That's basically how neural networks are trained, in a nutshell.
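In other words, the mechanics look roughly like this (a toy numpy sketch of the idea, not the actual model or loss):

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.1, size=(1000, 128))  # one random vector per symbol ID

def toy_loss_and_grad(vecs):
    # stand-in for the real objective; returns a loss and its gradient w.r.t. vecs
    loss = (vecs ** 2).sum()
    return loss, 2 * vecs

used_ids = np.array([3, 42, 9])                  # symbols seen in one training example
loss, grad = toy_loss_and_grad(embeddings[used_ids])
embeddings[used_ids] -= 0.01 * grad              # a gradient step moves only those vectors
```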
Aha. That makes it clearer, thanks. Now I need to find the place in the code where the assignment of random vectors takes place and see how these symbols are represented in Python. Then the question would be: what algorithm is used to adjust the vector weights during training? But that's probably already covered by the paper.
This is where the random vectors are initialized: https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L206-L220 The reader converts string inputs into integer indices, and these integer indices allow looking up the specific random vector. The algorithm that is used to adjust the vector weights during training is stochastic gradient descent + backpropagation, but that's a very common idea in training neural networks and is not unique to our work. It is covered in almost any neural networks tutorial / online course. Best,
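For reference, the pattern those lines point to looks roughly like this in TensorFlow 1.x style (a simplified sketch; the variable names, sizes and initializer here are illustrative, not the file's exact code):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

PATH_VOCAB_SIZE, EMBED_DIM = 500_000, 128   # illustrative sizes

# a trainable matrix of randomly initialized vectors, one row per path ID
paths_vocab = tf.compat.v1.get_variable(
    'PATHS_VOCAB',
    shape=(PATH_VOCAB_SIZE, EMBED_DIM),
    dtype=tf.float32,
    initializer=tf.compat.v1.random_normal_initializer(stddev=0.1))

# the reader has already turned every path string into an integer index,
# so getting its vector is a single lookup into the matrix
path_ids = tf.compat.v1.placeholder(tf.int32, shape=(None,))
path_vectors = tf.nn.embedding_lookup(paths_vocab, path_ids)
```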
Writing down permalinks to avoid reference drift: tensorflow_model.py, lines 206 to 220 in e7547de.
Found the docs: https://www.tensorflow.org/api_docs/python/tf/compat/v1/get_variable. So, for example, lines 183 to 184 in e7547de:
128 is the width in floats for each path-context.
The puzzling thing here is that it seems there is no 1:1 mapping between AST path components and floats. Even if an AST path is shorter, it still gets a 128-wide vector. So the network doesn't really have granularity at the AST level; paths are treated just as whole strings. Am I right? code2vec/path_context_reader.py, lines 205 to 207 in e7547de
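If that is so, then two paths that share most of their structure would still be completely separate vocabulary entries, something like this (made-up strings and IDs, just to illustrate the point):

```python
import numpy as np

rng = np.random.default_rng(0)
path_vectors = rng.normal(size=(200_000, 128))   # one 128-float vector per path ID

# two paths that differ in a single node are still separate vocabulary entries,
# so they index unrelated rows regardless of how similar the strings are
path_to_id = {
    "(NameExpr)^(Greater)_(IntegerLiteral)": 42,
    "(NameExpr)^(Less)_(IntegerLiteral)": 7081,
}
va = path_vectors[path_to_id["(NameExpr)^(Greater)_(IntegerLiteral)"]]
vb = path_vectors[path_to_id["(NameExpr)^(Less)_(IntegerLiteral)"]]
```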
With gradient descent, the goal is to find some minimum output value for a given set of inputs. So here I guess the output value is the name of a function. During training, symbols and paths that are close to each other receive similar weights, which can serve as coordinates in 3D (or whatever-D) space, and that allows querying the space for different properties. This is my understanding of it so far.
This is correct, and was addressed in a follow-up paper, Code2seq.
Yes, but the minimum value that we're trying to find is the negative (log-) probability of predicting the right method name.
This sounds almost correct, but I cannot completely confirm it since I don't exactly understand what you mean by "query the space for different properties".
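The "negative (log-) probability" objective mentioned above can be written down concretely; a minimal sketch with made-up numbers (not the model's actual code):

```python
import numpy as np

def neg_log_prob_of_correct_name(logits, correct_index):
    # softmax over all candidate method names, then -log of the right one's probability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return -np.log(probs[correct_index])

logits = np.array([1.2, 0.3, 2.5, -0.7])   # scores for 4 candidate method names
loss = neg_log_prob_of_correct_name(logits, correct_index=2)
# training adjusts the vectors so that this loss shrinks,
# i.e. the right name gets a higher probability
```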
Came here from https://youtu.be/EJ8okcxL2Iw?t=426
The talk says there are token vectors and path vectors.
I know what an AST is, but I am not proficient enough in AI to answer what the token vector is. Is it a list of all tokens encountered? Or a list of all possible combinations of token pairs?
It is also not clear what a path vector is. If I understand it right, a vector should contain numbers, not symbols from an AST path. So how is an AST path converted to a path vector? Does the path vector include the leaves, which are tokens/symbols?