
How is AST path converted to vector? #159

Open · abitrolly opened this issue Jul 28, 2022 · 10 comments

@abitrolly commented Jul 28, 2022

Came here from https://youtu.be/EJ8okcxL2Iw?t=426

The talk says there are token vectors and path vectors.

I know what an AST is, but I am not proficient enough in AI to say what a token vector is. Is it a list of all tokens encountered, or a list of all possible token pairs?

It is also not clear what a path vector is. If I understand it right, a vector should contain numbers, not symbols from an AST path. So how is an AST path converted to a path vector? Does the path vector include the leaves, which are tokens/symbols?

@urialon (Collaborator) commented Jul 28, 2022

Hi @abitrolly ,
Thank you for your interest in our work!

Every unique token and every unique path in the training data is allocated a randomly initialized vector, and we remember the mapping from every token/path to its ID (an integer), and from every ID to the allocated vector.
Initially these vectors are meaningless (since they are random), but they are then trained as part of the neural network.

This is a common practice in neural networks, also known as "word embeddings".
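
Here is a minimal sketch of that mapping in plain NumPy (the names and sizes are illustrative, not our actual code):

import numpy as np

# Hypothetical vocabularies built once from the training data:
# every unique token / path string gets an integer ID.
token_to_id = {"x": 0, "7": 1}
path_to_id = {"NameExpr^AssignExpr_IntegerLiteralExpr": 0}

EMBEDDING_SIZE = 128

# One randomly initialized row per token and per path.
token_vectors = np.random.uniform(-1.0, 1.0, (len(token_to_id), EMBEDDING_SIZE))
path_vectors = np.random.uniform(-1.0, 1.0, (len(path_to_id), EMBEDDING_SIZE))

# Lookup is just: string -> ID -> row of the matrix.
x_vector = token_vectors[token_to_id["x"]]  # shape (128,)

During training, these rows are updated by gradient descent like any other weight of the network.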
I recommend also reading the paper: https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf

Best,
Uri

@abitrolly (Author)

Hi Uri. Thanks for the fast reply!

I am not really familiar with embeddings, so while reading the paper right now, the reference to "continuous vectors for representing snippets of code" doesn't make it clearer. I need a low-level example, like the one requested in #102, to get a picture of what this "code embedding" is composed of. And a 3D movie of the process. :D

It would also be interesting to see what "a parallel vocabulary of vectors of the labels" looks like, and how this vector arithmetic with words actually works.
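
My current mental model of that word arithmetic, as a toy NumPy sketch with made-up vectors:

import numpy as np

# Made-up 3-dimensional word vectors; real embeddings have hundreds of dims.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.0]),
    "queen": np.array([0.9, 0.0, 0.1]),
}

query = vecs["king"] - vecs["man"] + vecs["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The vocabulary word nearest to the query vector should be "queen".
print(max(vecs, key=lambda w: cosine(vecs[w], query)))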

That's already a lot of questions, and I haven't even started with chapter "1.1 Applications" :D
So I'd better skip directly to "2.1 Motivating Example", which contains the following explanation.

Path extraction. First, each query method in the training corpus is parsed to construct an AST.
Then, the AST is traversed and syntactic paths between AST leaves are extracted. Each path is
represented as a sequence of AST nodes, linked by up and down arrows, which symbolize the up or
down link between adjacent nodes in the tree. The path composition is kept with the values of the
AST leaves it is connecting, as a tuple we refer to as a path-context.

Right now I understand the process as follows:

graph LR;
.java --> AST --> L["leaves and paths"] --> E["vector AKA code embedding"];

Chapter 3 gives some hard mathematical abstractions. Thankfully, I have played a bit with https://docs.python.org/3/library/ast.html to modify Python sources with Python, so I won't break my head on the formal AST definition and can skip over it. :D At the end there is a useful example of "leaves and paths".

This expression:

x = 7

is transformed into the leaves x and 7 and the path (NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr) between them:

x, (NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr), 7

So this thing ^^^ is called a path-context.
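
As a plain Python value, I imagine a path-context as just a tuple of three strings (my guess at a representation, not the repo's actual format):

# My guess: one path-context as a literal Python tuple, with the arrows
# encoded as plain characters, e.g. ^ for up and _ for down.
path_context = (
    "x",                                       # source leaf token
    "NameExpr^AssignExpr_IntegerLiteralExpr",  # whole path between the leaves
    "7",                                       # target leaf token
)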

The paper then gives some limitations.

To limit the size of the training data and reduce sparsity [...] we limit the paths by maximum length
[...] and limit the maximum width - the maximal difference in child index between two child
nodes of the same intermediate node. These values are determined empirically as hyperparameters
of our model.

This needs an example. My guess is that the length of a path is the depth of nesting of instructions, and the width is the difference between line numbers in the same method without nesting. But I may be wrong here.
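
Or, reading the definitions literally for the x = 7 example (possibly wrong, and assuming x is child 0 and 7 is child 1 of AssignExpr):

# Path for x = 7: NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr
path_nodes = ["NameExpr", "AssignExpr", "IntegerLiteralExpr"]

# Length: number of nodes in the path; the extractor drops paths longer
# than a MAX_PATH_LENGTH hyperparameter.
length = len(path_nodes)  # 3

# Width: NameExpr is child 0 of AssignExpr and IntegerLiteralExpr is child 1,
# so this path has width 1; wider paths are dropped by MAX_PATH_WIDTH.
width = abs(1 - 0)

MAX_PATH_LENGTH, MAX_PATH_WIDTH = 8, 2  # made-up example values
keep = length <= MAX_PATH_LENGTH and width <= MAX_PATH_WIDTH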

"values are determined empirically as hyperparameters of our model" breaks my mind. Is it about that
parameters are calculated during dataset preprocessing to keep dataset small?

Finally, chapter "4 MODEL" explains the high-level view and jumps directly to "a Bag of Path-Contexts", while I am still puzzled about how a single path-context consisting of mathematical symbols is represented in data at a low level. For example, this one:

x, (NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr), 7

How does it look in Python code?

And then, if I understand it right, everything in this tuple needs to be a number, because Python does not understand math symbols like these, and because training the model requires arithmetic operations on them. That tuple with numbers is what I meant by "vector" when writing the title of this issue.

There are a lot of questions about the bag, about the size of the every-to-every dict of leaves, and about what "node" and "itself" mean in "pairs that contain a node and itself", but those fall out of scope for this question.

Originally, I was going to ask if the vector is this:

id, token1, token2, vector
---------------------------------
1,  1,      2,      [1,2,3]

Where token1 is a leaf, and the number is its index in the leaves table, and the vector numbers are indexes in a table that lists all syntax constructs. So for x = 7 the tables would be:

id, token
---------
1,  x
2,  7

id, syntax
----------
1,  NameExpr_UP
2,  AssignExpr_DOWN
3,  IntegerLiteralExpr
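
If that's the idea, one extracted training sample might then be serialized roughly like this (a schematic guess, not the repo's exact on-disk format):

# Schematic: one training sample = a method-name label plus its path-contexts.
sample = {
    "label": "assign|value",  # hypothetical method name
    "contexts": [
        # (source token, whole path as one string, target token)
        ("x", "NameExpr^AssignExpr_IntegerLiteralExpr", "7"),
    ],
}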

@urialon (Collaborator) commented Aug 11, 2022

Hi @abitrolly ,
Sorry for the delayed response.

I am not sure where to begin.
Do you have any concrete questions?
Did you have a chance to look at the video?

I also recommend basic neural-NLP lectures such as Stanford's course.

Best,
Uri

@abitrolly (Author)

I am still trying to find time to read the paper to the end and figure out what embeddings are, and how the text fields of a path-context are converted into numerical values for calculations.

@abitrolly (Author)

Did you have a chance to look at the video?

Yes. I came here after watching it.

@urialon (Collaborator) commented Aug 12, 2022

The main idea is to assign a random vector to every kind of symbol, whether that symbol is a path or a token.

Then, during training, these vectors are modified such that the loss term that we define is minimized. That's basically how neural networks are trained, in a nutshell.

@abitrolly (Author) commented Aug 12, 2022

Aha. That makes it clearer, thanks. Now I need to find the place in the code where the assignment of random vectors takes place, and see how these symbols are represented in Python.

Then the question would be: what algorithm is used to adjust the vector weights during training? But that's probably already covered by the paper.

@urialon (Collaborator) commented Aug 12, 2022

This is where the random vectors are initialized:

https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L206-L220

The reader converts string inputs into integer indices, and these integer indices allow looking up the specific random vector:
https://github.com/tech-srl/code2vec/blob/master/path_context_reader.py#L205-L207

The algorithm used to adjust the vector weights during training is stochastic gradient descent + backpropagation, but that's a very common idea in training neural networks and is not unique to our work. It is covered in almost any neural-networks tutorial or online course.
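
A toy TensorFlow 2 sketch of one such update step, with a made-up loss (the real model minimizes a cross-entropy loss over method names):

import tensorflow as tf

# Toy embedding table standing in for tokens_vocab / paths_vocab above.
embeddings = tf.Variable(tf.random.uniform((10, 4)))  # 10 symbols, 4 dims each
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

symbol_id = tf.constant([3])   # integer ID of one symbol (token or path)
target = tf.ones((1, 4))       # made-up training target for the sketch

with tf.GradientTape() as tape:
    vec = tf.nn.embedding_lookup(embeddings, symbol_id)  # ID -> current vector
    loss = tf.reduce_mean((vec - target) ** 2)           # made-up loss

# Backpropagation computes the gradient, SGD nudges the looked-up row.
grads = tape.gradient(loss, [embeddings])
optimizer.apply_gradients(zip(grads, [embeddings]))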

Best,
Uri

@abitrolly (Author) commented Aug 12, 2022

Writing down permalinks to avoid reference drift.

tokens_vocab = tf.compat.v1.get_variable(
    self.vocab_type_to_tf_variable_name_mapping[VocabType.Token],
    shape=(self.vocabs.token_vocab.size, self.config.TOKEN_EMBEDDINGS_SIZE), dtype=tf.float32,
    initializer=tf.compat.v1.initializers.variance_scaling(scale=1.0, mode='fan_out', distribution="uniform"))
targets_vocab = tf.compat.v1.get_variable(
    self.vocab_type_to_tf_variable_name_mapping[VocabType.Target],
    shape=(self.vocabs.target_vocab.size, self.config.TARGET_EMBEDDINGS_SIZE), dtype=tf.float32,
    initializer=tf.compat.v1.initializers.variance_scaling(scale=1.0, mode='fan_out', distribution="uniform"))
attention_param = tf.compat.v1.get_variable(
    'ATTENTION',
    shape=(self.config.CODE_VECTOR_SIZE, 1), dtype=tf.float32)
paths_vocab = tf.compat.v1.get_variable(
    self.vocab_type_to_tf_variable_name_mapping[VocabType.Path],
    shape=(self.vocabs.path_vocab.size, self.config.PATH_EMBEDDINGS_SIZE), dtype=tf.float32,
    initializer=tf.compat.v1.initializers.variance_scaling(scale=1.0, mode='fan_out', distribution="uniform"))

Found the docs: https://www.tensorflow.org/api_docs/python/tf/compat/v1/get_variable. So, for example, paths_vocab is a TensorFlow variable named PATHS_VOCAB with shape (???, 128), where ??? is probably the length of the vocabulary in lines. The vocabulary is loaded here:

code2vec/vocabularies.py, lines 183 to 184 in e7547de:

self.path_vocab = Vocab.load_from_file(
    VocabType.Path, file, self._get_special_words_by_vocab_type(VocabType.Path))
And 128 is the width in floats allocated for each path.

The puzzling thing here is that there seems to be no 1:1 mapping between AST path components and floats. Even if an AST path is shorter, it still gets a 128-float-wide vector. So the network doesn't really have granularity at the AST level; paths are treated just as whole strings. Am I right?

path_source_token_indices = self.vocabs.token_vocab.lookup_index(path_source_token_strings) # (max_contexts, )
path_indices = self.vocabs.path_vocab.lookup_index(path_strings) # (max_contexts, )
path_target_token_indices = self.vocabs.token_vocab.lookup_index(path_target_token_strings) # (max_contexts, )
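
If I read it right, the lookup then works roughly like this (toy shapes of my own, not the repo's code):

import tensorflow as tf

# Each path string was already mapped to an integer index by the reader.
path_indices = tf.constant([0, 5, 2])  # (max_contexts,) example IDs

# Stand-in for PATHS_VOCAB: (path_vocab_size, PATH_EMBEDDINGS_SIZE).
paths_vocab = tf.random.uniform((1000, 128))

# Every path ID selects one 128-float row, no matter how long the original
# AST path was - hence the fixed width per path.
path_embeddings = tf.nn.embedding_lookup(paths_vocab, path_indices)  # (3, 128)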


With gradient descent, the goal is to find some minimum output value for a given set of inputs. So here I guess the output value is the name of a function. During training, symbols and paths that are close to each other receive similar weights, which can serve as coordinates in 3D (or whatever-D) space, and that allows one to query the space for different properties. This is my understanding of it so far.

@urialon (Collaborator) commented Aug 13, 2022

So the network doesn't really have granularity at the AST level. Paths are treated just as whole strings

This is correct, and it was addressed in a follow-up paper, code2seq.

the goal is to find some minimum output value for a given set of inputs. So here I guess the output value is the name of a function.

Yes, but the minimum value that we're trying to find is the negative (log-) probability of predicting the right method name.
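
In TensorFlow terms, that loss looks roughly like this (toy shapes, not the exact code):

import tensorflow as tf

# Scores over a toy vocabulary of 3 candidate method names for one example.
logits = tf.constant([[2.0, 0.5, -1.0]])
true_name_index = tf.constant([0])  # index of the correct method name

# Cross-entropy = negative log-probability of the correct name.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=true_name_index, logits=logits)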

During training, symbols and paths that are close to each other receive similar weights, which can serve as coordinates in 3D (or whatever-D) space, and that allows one to query the space for different properties.

This sounds almost correct, but I cannot completely confirm it, since I don't exactly understand what you mean by "query the space for different properties".

