This repository has been archived by the owner on Feb 26, 2024. It is now read-only.

GNN Encoder #1

Closed
legohyl opened this issue Mar 29, 2021 · 11 comments

Comments

@legohyl

legohyl commented Mar 29, 2021

Hello, I was looking through the code and comparing it with what you shared in the paper, and it appears there are some discrepancies. For example, I was interested in the GNN Encoder module. In the paper, it was presented as:
[image: GNN encoder update equations from the paper]

But in the code, in gnn_encoder.py, it is modelled differently. Firstly, in the GNNEncoder class, you have an embedding (or dictionary) of size 2 to map each edge in the adjacency matrix (I presume that, since it is of size 2, the adjacency matrix should be a binary matrix indicating connections rather than weights). Why is this so? In the paper, it was mentioned that you initialized the edge features using the Euclidean distance between points, not just the connectivity. We can also see that this embedding is passed on to the GNNLayer, which takes it as is. This means the initialization of the network will only ever have 2 different edge embeddings, and hence doesn't take into account the Euclidean distance between two nodes.

```python
import torch
import torch.nn as nn

# (GNNLayer is defined elsewhere in the repo)


class GNNEncoder(nn.Module):
    """Configurable GNN Encoder
    """

    def __init__(self, n_layers, hidden_dim, aggregation="sum", norm="layer",
                 learn_norm=True, track_norm=False, gated=True, *args, **kwargs):
        super(GNNEncoder, self).__init__()

        self.init_embed_edges = nn.Embedding(2, hidden_dim)

        self.layers = nn.ModuleList([
            GNNLayer(hidden_dim, aggregation, norm, learn_norm, track_norm, gated)
                for _ in range(n_layers)
        ])

    def forward(self, x, graph):
        """
        Args:
            x: Input node features (B x V x H)
            graph: Graph adjacency matrices (B x V x V)
        Returns:
            Updated node features (B x V x H)
        """
        # Embed edge features
        e = self.init_embed_edges(graph.type(torch.long))

        for layer in self.layers:
            x, e = layer(x, e, graph)

        return x
```

In the GNNLayer, the edge embedding gets updated differently from the equation. Here's the code snippet:

```python
# Inside GNNLayer.forward(self, h, e, graph):
batch_size, num_nodes, hidden_dim = h.shape
h_in = h
e_in = e

# Linear transformations for node update
Uh = self.U(h)  # B x V x H
Vh = self.V(h).unsqueeze(1).expand(-1, num_nodes, -1, -1)  # B x V x V x H

# Linear transformations for edge update and gating
Ah = self.A(h)  # B x V x H
Bh = self.B(h)  # B x V x H
Ce = self.C(e)  # B x V x V x H

# Update edge features and compute edge gates
e = Ah.unsqueeze(1) + Bh.unsqueeze(2) + Ce  # B x V x V x H
gates = torch.sigmoid(e)  # B x V x V x H

# Update node features
h = Uh + self.aggregate(Vh, graph, gates)  # B x V x H
```

As you can see, you first update the edge features by transforming them with the current layer's representations, then create a gating mechanism with the sigmoid function, and this gate is then used to update the node information. However, in the paper, you wrote that the node features get updated using e_{ij}^{(l)} and not this newly updated e_{ij}^{(l+1)}.
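
To make the two orderings concrete, here is my reading of the snippet in the notation above (a rough sketch reconstructed from the code, omitting whatever normalization, activation, and residual connections follow later in the layer):

$$e_{ij}^{(l+1)} = A^{(l)} h_i^{(l)} + B^{(l)} h_j^{(l)} + C^{(l)} e_{ij}^{(l)}$$

$$\text{code:} \quad h_i^{(l+1)} = U^{(l)} h_i^{(l)} + \operatorname{Aggr}_{j \in \mathcal{N}(i)}\Big(\sigma\big(e_{ij}^{(l+1)}\big) \odot V^{(l)} h_j^{(l)}\Big)$$

$$\text{paper:} \quad h_i^{(l+1)} = U^{(l)} h_i^{(l)} + \operatorname{Aggr}_{j \in \mathcal{N}(i)}\Big(\sigma\big(e_{ij}^{(l)}\big) \odot V^{(l)} h_j^{(l)}\Big)$$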

@chaitjo
Owner

chaitjo commented Mar 31, 2021

Hi @legohyl, thank you for your interest!

> In the paper, it was mentioned that you initialized the edge features using the Euclidean distance between points, not just the connectivity. We can also see that this embedding is passed on to the GNNLayer, which takes it as is. This means the initialization of the network will only ever have 2 different edge embeddings, and hence doesn't take into account the Euclidean distance between two nodes.

It is possible to additionally use the Euclidean distances among the coordinates to instantiate the edge features, following this work by the same team, via an nn.Linear() projection. I used this formulation during my experiments.

You can compute the pairwise distances either during data loading or within the model's forward pass via something like this (off the top of my head):

```python
(x.unsqueeze(1) - x.unsqueeze(2)).norm(dim=-1, keepdim=True)
```

...and then pass this to nn.Linear() to make use of the distance along with the connectivity when encoding via the GNN.
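
Concretely, a minimal sketch of what I mean (off the top of my head and untested; `init_embed_dists` is just an illustrative name, not an attribute in the repo):

```python
import torch
import torch.nn as nn

hidden_dim = 128                                  # illustrative value
init_embed_edges = nn.Embedding(2, hidden_dim)    # connectivity embedding, as in GNNEncoder
init_embed_dists = nn.Linear(1, hidden_dim)       # hypothetical distance projection

x = torch.rand(4, 20, 2)                          # B x V x 2 node coordinates
graph = torch.randint(0, 2, (4, 20, 20))          # B x V x V binary adjacency

# Pairwise Euclidean distances: B x V x V x 1
dists = (x.unsqueeze(1) - x.unsqueeze(2)).norm(dim=-1, keepdim=True)

# Combine connectivity and distance information into the initial edge features: B x V x V x H
e = init_embed_edges(graph.long()) + init_embed_dists(dists)
```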

I believe that the GNNs, when receiving the coordinates as node features, can also implicitly build an approximation of the pairwise distances during message-passing.

> As you can see, you first update the edge features by transforming them with the current layer's representations, then create a gating mechanism with the sigmoid function, and this gate is then used to update the node information. However, in the paper, you wrote that the node features get updated using e_{ij}^{(l)} and not this newly updated e_{ij}^{(l+1)}.

Great catch! I believe this must have slipped through the editing iterations among the paper's multiple authors. I will make the discrepancy clear in the next version.

I don't think the choice between using e_{ij}^{(l)} or e_{ij}^{(l+1)} changes the layer's semantics, so I don't expect performance to be impacted by this.

Hope it helps!

@legohyl
Author

legohyl commented Apr 5, 2021

Thanks for the response! I looked through the original paper and got a better idea of the Gated GCN you were using. However, I ran into this issue when trying to run your code: it seems like your BatchBeam class has some clashes. It appears your return super(BatchBeam, ...) is causing the issue, where the class object gets re-defined. This happens in the CachedLookup, BatchBeam, AttentionModel, and StateTSP classes.

[image: screenshot of the error traceback]

@legohyl
Author

legohyl commented Apr 5, 2021

@chaitjo I've also found this rather perplexing error in the code. It appears that a few of your gather calls use an index that is not of type torch.long; looking into these indices, they are float-valued. How did you manage to get your code to work previously? I checked torch 1.2.0, and gather also requires torch.long data types there. Here's a sample in nar_beam_search.py; I checked the perm_mask matrix and its values are all in the range [0, 1].

[image: screenshot of the gather error in nar_beam_search.py]
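
A minimal toy example of the dtype requirement I'm hitting (not repo code, just an illustration):

```python
import torch

src = torch.arange(10).view(2, 5)
idx = torch.tensor([[0, 1], [2, 3]])     # torch.int64 index

print(torch.gather(src, 1, idx))         # works with a long index
# torch.gather(src, 1, idx.float())      # raises a RuntimeError: gather expects a long (int64) index
```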

@chaitjo
Owner

chaitjo commented Apr 5, 2021

Hi @legohyl, I'll look into these issues and get back asap.

Meanwhile, it would be helpful if you could provide your hardware and system information: GPUs, CUDA version, Python version, PyTorch version, and anything else worth mentioning.

@legohyl
Author

legohyl commented Apr 5, 2021

Thanks! I noticed that this is the same implementation as in your graph-convnet-tsp repo.

Sure, here's the relevant information:
GPU: NVIDIA V100 (AWS p3 instance) / K80 (AWS p2 instance)
CUDA: 10.1
Python: 3.8.3
Torch: 1.7.0
OS: Ubuntu 18.04

@chaitjo
Owner

chaitjo commented Apr 5, 2021

Is it possible to run with torch==1.2.0 and Python 3.6 for consistency?

@legohyl
Author

legohyl commented Apr 5, 2021

Hmm, it's not very convenient for me to revert to older versions, and from looking at the errors they don't seem to be package-related. But I could give it a shot.

@legohyl
Author

legohyl commented Apr 5, 2021

@chaitjo I installed using conda and it seems to work in that version. I'm really puzzled by this though.

@chaitjo
Owner

chaitjo commented Apr 5, 2021

@legohyl Do you mean it works upon downgrading? I will check your issues from my side as soon as possible.

Off the top of my head, there may be a discrepancy across package versions in how int and float divisions are handled. In the NAR beam search method, the perm_mask variable should always be long (and you can see this yourself if you trace back the logic). It is computed from prev_k, which should again be long and is divided by the integer value self.num_nodes. Now, I imagine that the issue arises because one of these tensors gets initialized as a float instead of long, or because the type gets converted to float during the division. I can check further.
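
A quick toy illustration of what I suspect is happening (the names mirror the discussion, but this is not the repo's code):

```python
import torch

bestScoresId = torch.tensor([7, 12, 25])  # long tensor of flat beam indices
num_nodes = 10

prev_k = bestScoresId / num_nodes  # true division in torch >= 1.5 produces a float tensor
print(prev_k.dtype)                # torch.float32 -- gather() rejects this as an index
```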

For the AR beam search methods, I am fairly confident the issue is due to a Python version mismatch. Is it critical for you to use Python 3.8?

@legohyl
Author

legohyl commented Apr 5, 2021

Yes, it works with your intended set of versions.

I do think it's because of the float/long tensors. I found myself having to convert things back to torch.long accordingly, and there might be some information loss in doing so. Isn't prev_k < 1? It seems like we take bestScoresId and divide it by num_nodes, which makes it a float between 0 and 1?

Well, as I'm using an EC2 instance from my company, it's kind of the default. Though I think the issue can be avoided by doing the tensor checks as asserts and returning accordingly, instead of re-declaring with super, because either way we should not be receiving non-tensor inputs.

@legohyl
Author

legohyl commented Apr 7, 2021

@chaitjo it seems to be resolved by calculating prev_k as bestScoresId // self.num_nodes to force integer division.
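
Continuing the toy example from above, floor division keeps the index tensor long, which gather then accepts:

```python
import torch

bestScoresId = torch.tensor([7, 12, 25])
num_nodes = 10

prev_k = bestScoresId // num_nodes  # floor division keeps the tensor long
print(prev_k.dtype)                 # torch.int64

scores = torch.rand(3, 5)
print(torch.gather(scores, 1, prev_k.unsqueeze(1)))  # works: index is long
```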
