Graph Convolutional Layers - Builtin #1941
About the performance: On my machine, SystemDS was noticeably slower than PyTorch for very large inputs. Stress test: SystemDS 340 seconds, PyTorch 32 seconds. The stress test consisted of about 300 forward passes with matrices of about 10,000 x 10,000. This is likely a problem with my setup rather than my implementation, since the affine layer alone took 220 seconds with the same inputs. The GCL consists of a simple affine part and a convolutional part, with the convolutional part being a lot more complex. So the implementation is likely quite fast, because the complex convolution part accounts for only about a third of the runtime (roughly 120 of 340 seconds).
About caching: It would be possible to cache the normalized weights, since they are quite expensive to compute (counting every node's degree and the spectral convolution). In the stress test, this only improved the runtime by 10 seconds, from 340 seconds. Also, when I tested the caching feature on inputs smaller than the huge stress-test inputs, it was always slower.
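To make the caching idea concrete, here is a minimal Python sketch of computing the normalized edge weights once per edge list and reusing them across forward passes. The class name `NormalizedEdgeWeights`, the `(src, dst)` edge convention, and the symmetric `1 / sqrt(deg[src] * deg[dst])` weighting are assumptions for illustration and are not taken from the PR's DML code.

```python
import math


class NormalizedEdgeWeights:
    """Hypothetical helper: compute GCN edge weights once per edge list."""

    def __init__(self):
        self._cache = {}
        self.computations = 0  # how often the weights were actually computed

    def get(self, edges, n):
        # Building the key walks the whole edge list, so a lookup is not free.
        key = (tuple(edges), n)
        if key not in self._cache:
            self._cache[key] = self._compute(edges, n)
        return self._cache[key]

    def _compute(self, edges, n):
        self.computations += 1
        deg = [0] * n
        for _, dst in edges:  # in-degree: count incoming edges per node
            deg[dst] += 1
        weights = []
        for src, dst in edges:
            if deg[src] > 0 and deg[dst] > 0:
                weights.append(1.0 / math.sqrt(deg[src] * deg[dst]))
            else:
                weights.append(0.0)  # node without in-bound edges (assumed handling)
        return weights


edges = [(0, 0), (1, 1), (0, 1)]  # (src, dst) pairs, self-loops included
cache = NormalizedEdgeWeights()
w_first = cache.get(edges, n=2)
w_second = cache.get(edges, n=2)  # served from the cache
print(w_first, cache.computations)
```

Note that the lookup itself costs a pass over the edge list to build the key. This sketch cannot show whether that kind of bookkeeping explains the slowdown on small inputs reported above; it only illustrates that a cache hit is not free.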
Oops, this was the graph conv PR; I thought I was commenting on the ResNet one.
Can you show the '-stats' output of a run, to indicate where the time is spent? If you are unsure how to use it, I can show you in the office.
I already profiled it. The graph convolutional layer consists of a linear part and a convolutional part. The linear part takes up more than 70% of the runtime. This means there are probably some issues in my configuration that keep SystemDS from being faster, since the linear part is the baseline of this layer.
With linear layer, do you mean a simple fully connected affine layer (i.e., a matrix multiplication), or something else? Can you maybe give me an example in code?
Yes, exactly. It is a matrix multiplication followed by adding a bias. This is what takes up 70% of the runtime. As for the rest of the forward pass, I already took numerous steps to optimize it, from initially 780 seconds down to 340 seconds for the whole layer. This includes:
I quickly looked through it, and I am unsure which operation you are referring to.
I assume the initial matrix multiplication is expensive, but that depends only on the input size, and it will always be expensive. Some of the other code contains many as.integer or as.scalar calls. Most of them should not be needed, but I do not think they have much impact on performance.
{
  edge_weight[j, 1] = 0.0
}
X_out[as.integer(as.scalar(edge_index[j, 2]))] += as.scalar(edge_weight[j, 1]) * X_hat[as.integer(as.scalar(edge_index[j, 1]))]
all these as.integer(as.scalar(...)) should not be needed.
I think I got some issues when not using them. I will take a look at them.
m = nrow(edge_index)

# transform
X_hat = X %*% W
This is the slow operation?
Yes, this is the slow operation. But adding the bias at the end takes just as much time.
This matrix multiplication takes around 35% of the runtime while adding the bias in the end also takes around 35%.
In practice, adding the bias should be very fast compared to the matrix multiplication.
Looking at the code, it seems to me that you access individual indices rather than adding whole vectors.
This might be the issue.
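The expectation that the bias should be cheap can be made concrete with a rough operation count. The Python sketch below is illustrative only: the helper names are hypothetical, and the sizes passed to `affine_op_counts` are assumed from the "about 10,000 x 10,000" figure mentioned in the thread, not measured values.

```python
def matmul(X, W):
    """Dense product of an n x f_in matrix and an f_in x f_out matrix."""
    f_in, f_out = len(W), len(W[0])
    return [[sum(row[k] * W[k][j] for k in range(f_in)) for j in range(f_out)]
            for row in X]


def add_bias(Z, b):
    """Add the 1 x f_out bias row to every row of Z, one whole row at a time."""
    return [[z + bj for z, bj in zip(row, b)] for row in Z]


def affine_op_counts(n, f_in, f_out):
    """Return (multiply-adds in X W, additions in the bias step)."""
    return n * f_in * f_out, n * f_out


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # n = 3, f_in = 2
W = [[1.0, 0.0], [0.0, 1.0]]              # f_in = 2, f_out = 2
b = [0.5, -0.5]

out = add_bias(matmul(X, W), b)

# Assumed sizes, loosely based on the stress test description.
matmul_ops, bias_ops = affine_op_counts(n=10_000, f_in=10_000, f_out=10_000)
print(out)
print(matmul_ops // bias_ops)
```

By this count the matrix multiplication does `f_in` times as much arithmetic as the bias addition, so a bias step that costs as much as the multiplication suggests overhead somewhere other than the arithmetic itself.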
Everything related to indices belongs to the convolution part, not the linear part. It is actually possible to do the convolution part without any indices at all, since the formula for a graph convolutional layer is OUT = D^-1 * A * D^-1 * X * W + b (A: adjacency matrix, n x n; D: degree matrix, n x n; X: input, n x f_in; W: weights, f_in x f_out). As you can see, doing the normalization and message passing without indices requires 3 extra matrix multiplications. In normal use cases (n >> features), these matrix multiplications are even bigger than the linear part (XW + b), because D and A are bigger matrices than X and W.
So, since XW + b takes 220 seconds (the matrix multiplication alone takes around 110 seconds), doing the normalization and convolution without indices would take far longer than 340 seconds, likely around 600 seconds.
This is also why other well-known libraries (PyTorch, TF) mainly use an edge list for the convolution part, either through index access or through a sparse matrix datatype (which is basically also an edge list), in their GCL implementations.
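The trade-off described here can be illustrated with a small Python sketch that aggregates the same features once through a dense normalized adjacency matrix and once through an edge list, and checks that both agree. Several things are assumptions for illustration: the function names, the `(src, dst)` edge convention, the use of in-degrees, and the symmetric `D^-1/2` normalization from the Kipf and Welling paper (the comment above writes `D^-1`). This is not the PR's DML implementation.

```python
import math


def inv_sqrt_degrees(edges, n):
    """1/sqrt(in-degree) per node; 0.0 for nodes without in-bound edges."""
    deg = [0] * n
    for _, dst in edges:
        deg[dst] += 1
    return [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]


def gcn_dense(edges, n, X_hat):
    """Aggregate via a dense n x n normalized adjacency matrix."""
    A = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        A[dst][src] += 1.0
    s = inv_sqrt_degrees(edges, n)
    A_hat = [[s[i] * A[i][j] * s[j] for j in range(n)] for i in range(n)]
    f = len(X_hat[0])
    return [[sum(A_hat[i][j] * X_hat[j][k] for j in range(n)) for k in range(f)]
            for i in range(n)]


def gcn_edge_list(edges, n, X_hat):
    """Aggregate by scatter-adding one weighted message per edge."""
    s = inv_sqrt_degrees(edges, n)
    out = [[0.0] * len(X_hat[0]) for _ in range(n)]
    for src, dst in edges:
        w = s[src] * s[dst]
        for k, x in enumerate(X_hat[src]):
            out[dst][k] += w * x
    return out


# (src, dst) pairs with self-loops; X_hat stands for the already transformed X W.
edges = [(0, 0), (1, 1), (2, 2), (0, 1), (1, 0), (1, 2)]
X_hat = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

dense = gcn_dense(edges, 3, X_hat)
sparse = gcn_edge_list(edges, 3, X_hat)
same = all(math.isclose(a, b, abs_tol=1e-12)
           for rd, rs in zip(dense, sparse) for a, b in zip(rd, rs))
print(same)
```

The dense variant touches all n x n adjacency entries for every feature column, while the edge-list variant does work proportional to the number of edges, which is the reason given above for preferring index access on sparse graphs.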
My -Xms -Xmx arguments were 16g, I think.
Also, I called the tests multiple times. They were always very consistent, only changing by -3 to +3 seconds.
What I mean is not from the outside, but inside your script.
Or maybe that is what you do already?
Sorry, the stress test file itself would probably help. The stress test is a big forward pass over 3 layers, repeated 100 times.
So, from the outside and the inside.
Ah, okay. When will you be in the office next? Then maybe we can talk about it.
Sure! I will be in the office tomorrow and on Friday.
dX = dOut_agg_rev %*% t(W)

# calculate gradient w.r.t. W (Formula: X^T * A_hat^T * dOut)
dW = t(X) %*% dOut_agg_rev
Or is it the gradient that takes the time?
In the stress test, I only used the forward pass.
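Although the backward pass was not part of the stress test, the gradient formulas shown in the diff (dX = dOut_agg_rev %*% t(W), dW = t(X) %*% dOut_agg_rev) can be sanity-checked numerically. The Python sketch below assumes that `dOut_agg_rev` stands for `A_hat^T * dOut`, as the code comment's formula suggests, and that the forward pass is `A_hat * X * W + b`. All helper names and the small test matrices are hypothetical.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def forward(A_hat, X, W, b):
    """Assumed forward pass: A_hat X W + b (bias added to every row)."""
    Z = matmul(A_hat, matmul(X, W))
    return [[z + bj for z, bj in zip(row, b)] for row in Z]


def backward(A_hat, X, W, dOut):
    """Gradients w.r.t. X and W, following the formulas in the diff."""
    dOut_agg_rev = matmul(transpose(A_hat), dOut)  # A_hat^T dOut
    dX = matmul(dOut_agg_rev, transpose(W))
    dW = matmul(transpose(X), dOut_agg_rev)
    return dX, dW


def numeric_grad(f, M, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. each entry of M."""
    grad = [[0.0] * len(M[0]) for _ in M]
    for i in range(len(M)):
        for j in range(len(M[0])):
            old = M[i][j]
            M[i][j] = old + eps
            plus = f()
            M[i][j] = old - eps
            minus = f()
            M[i][j] = old
            grad[i][j] = (plus - minus) / (2 * eps)
    return grad


A_hat = [[0.5, 0.2, 0.0], [0.1, 0.5, 0.3], [0.0, 0.4, 0.5]]
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[0.5, -1.0], [1.5, 2.0]]
b = [0.1, -0.2]
dOut = [[1.0, 0.5], [-1.0, 2.0], [0.25, 1.0]]  # upstream gradient


def loss():
    # Scalar whose gradient w.r.t. the layer output equals dOut.
    out = forward(A_hat, X, W, b)
    return sum(o * g for ro, rg in zip(out, dOut) for o, g in zip(ro, rg))


dX, dW = backward(A_hat, X, W, dOut)
num_dX, num_dW = numeric_grad(loss, X), numeric_grad(loss, W)
max_err = max(abs(a - n)
              for G, N in ((dX, num_dX), (dW, num_dW))
              for rg, rn in zip(G, N) for a, n in zip(rg, rn))
print(max_err < 1e-6)
```

A check of this kind is independent of how the aggregation is implemented, so the same idea applies whether `A_hat` is dense or the messages are scattered through an edge list.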
This PR adds a built-in graph convolutional layer to our nn layer directory. The graph convolutional layer follows the paper "Semi-Supervised Classification with Graph Convolutional Networks" by Kipf and Welling. It includes the forward and backward pass of the layer and an example network showing how to use the layers.
We tested the implementation against PyTorch: it computes exactly the same values as PyTorch and handles nodes without in-bound edges the same way in the spectral convolution. This is also covered by a component test in NNComponentTest, where we hard-coded the initial weights from PyTorch and the expected result.