This repository has been archived by the owner on Feb 26, 2024. It is now read-only.

GNN Encoder #1

Closed
legohyl opened this issue Mar 29, 2021 · 11 comments

Comments

@legohyl

legohyl commented Mar 29, 2021

Hello, I was looking through the code and comparing it with what you shared in the paper, and it appears there are some discrepancies. For example, I was interested in the GNN Encoder module. In the paper, it was presented as:
[image: GNN encoder update equations from the paper]

But in the code, in gnn_encoder.py, it is modelled differently. Firstly, in the GNNEncoder class, you have an embedding (or dictionary) of size 2 to map each edge in the adjacency matrix (I presume that, since it is of size 2, the adjacency matrix should be a binary matrix indicating connections rather than weights). Why is this so? In the paper, it was mentioned that you initialized the edge features using the Euclidean distance between points, not just the connectivity. We can also see that this embedding is passed on to the GNNLayer, which takes it as is. This means the initialization of the network will only ever have 2 different edge embeddings, and hence doesn't take into account the Euclidean distance between two nodes.

```python
import torch
import torch.nn as nn

# (GNNLayer is defined elsewhere in the repo)


class GNNEncoder(nn.Module):
    """Configurable GNN Encoder
    """

    def __init__(self, n_layers, hidden_dim, aggregation="sum", norm="layer",
                 learn_norm=True, track_norm=False, gated=True, *args, **kwargs):
        super(GNNEncoder, self).__init__()

        self.init_embed_edges = nn.Embedding(2, hidden_dim)

        self.layers = nn.ModuleList([
            GNNLayer(hidden_dim, aggregation, norm, learn_norm, track_norm, gated)
                for _ in range(n_layers)
        ])

    def forward(self, x, graph):
        """
        Args:
            x: Input node features (B x V x H)
            graph: Graph adjacency matrices (B x V x V)
        Returns:
            Updated node features (B x V x H)
        """
        # Embed edge features
        e = self.init_embed_edges(graph.type(torch.long))

        for layer in self.layers:
            x, e = layer(x, e, graph)

        return x
```

In the GNNLayer, the edge embedding gets updated differently from the equation. Here's the code snippet:

```python
# Inside GNNLayer.forward(self, h, e, graph):
batch_size, num_nodes, hidden_dim = h.shape
h_in = h
e_in = e

# Linear transformations for node update
Uh = self.U(h)  # B x V x H
Vh = self.V(h).unsqueeze(1).expand(-1, num_nodes, -1, -1)  # B x V x V x H

# Linear transformations for edge update and gating
Ah = self.A(h)  # B x V x H
Bh = self.B(h)  # B x V x H
Ce = self.C(e)  # B x V x V x H

# Update edge features and compute edge gates
e = Ah.unsqueeze(1) + Bh.unsqueeze(2) + Ce  # B x V x V x H
gates = torch.sigmoid(e)  # B x V x V x H

# Update node features
h = Uh + self.aggregate(Vh, graph, gates)  # B x V x H
```

As you can see, you first update the edge features by transforming them with the current layer's representations, then create a gating mechanism with the sigmoid function, and this gate is then used to update the node information. However, in the paper, you wrote that the node features get updated using e_{ij}^{(l)} and not this newly updated e_{ij}^{(l+1)}.
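
To make the two orderings concrete, here is my reading of the snippet in the notation above (a rough sketch reconstructed from the code, omitting whatever normalization, activation, and residual connections follow later in the layer):

$$e_{ij}^{(l+1)} = A^{(l)} h_i^{(l)} + B^{(l)} h_j^{(l)} + C^{(l)} e_{ij}^{(l)}$$

$$\text{code:} \quad h_i^{(l+1)} = U^{(l)} h_i^{(l)} + \operatorname{Aggr}_{j \in \mathcal{N}(i)}\Big(\sigma\big(e_{ij}^{(l+1)}\big) \odot V^{(l)} h_j^{(l)}\Big)$$

$$\text{paper:} \quad h_i^{(l+1)} = U^{(l)} h_i^{(l)} + \operatorname{Aggr}_{j \in \mathcal{N}(i)}\Big(\sigma\big(e_{ij}^{(l)}\big) \odot V^{(l)} h_j^{(l)}\Big)$$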

@chaitjo
Owner

chaitjo commented Mar 31, 2021

Hi @legohyl, thank you for your interest!

> In the paper, it was mentioned that you initialized the edge features using the Euclidean distance between points, not just the connectivity. We can also see that this embedding is passed on to the GNNLayer, which takes it as is. This means the initialization of the network will only ever have 2 different edge embeddings, and hence doesn't take into account the Euclidean distance between two nodes.

It is possible to additionally use the Euclidean distances among the coordinates to instantiate the edge features, following this work by the same team, via an nn.Linear() projection. I used this formulation during my experiments.

You can compute the pairwise distances either during data loading or within the model's forward pass via something like this (off the top of my head):

```python
(x.unsqueeze(1) - x.unsqueeze(2)).norm(dim=-1, keepdim=True)
```

...and then pass this to nn.Linear() to make use of the distance along with the connectivity when encoding via the GNN.
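
Concretely, a minimal sketch of what I mean (off the top of my head and untested; `init_embed_dists` is just an illustrative name, not an attribute in the repo):

```python
import torch
import torch.nn as nn

hidden_dim = 128                                  # illustrative value
init_embed_edges = nn.Embedding(2, hidden_dim)    # connectivity embedding, as in GNNEncoder
init_embed_dists = nn.Linear(1, hidden_dim)       # hypothetical distance projection

x = torch.rand(4, 20, 2)                          # B x V x 2 node coordinates
graph = torch.randint(0, 2, (4, 20, 20))          # B x V x V binary adjacency

# Pairwise Euclidean distances: B x V x V x 1
dists = (x.unsqueeze(1) - x.unsqueeze(2)).norm(dim=-1, keepdim=True)

# Combine connectivity and distance information into the initial edge features: B x V x V x H
e = init_embed_edges(graph.long()) + init_embed_dists(dists)
```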

I believe that the GNNs, when receiving the coordinates as node features, can also implicitly build an approximation of the pairwise distances during message-passing.

> As you can see, you first update the edge features by transforming them with the current layer's representations, then create a gating mechanism with the sigmoid function, and this gate is then used to update the node information. However, in the paper, you wrote that the node features get updated using e_{ij}^{(l)} and not this newly updated e_{ij}^{(l+1)}.

Great catch! I believe this must have slipped through the editing iterations among the paper's multiple authors. I will make the discrepancy clear in the next version.

I don't think the choice between using e_{ij}^{(l)} or e_{ij}^{(l+1)} changes the layer's semantics, so I don't expect performance to be impacted by this.

Hope it helps!

@legohyl
Author

legohyl commented Apr 5, 2021

Thanks for the response! I looked through the original paper and got a better idea of the Gated GCN you were using. However, I ran into this issue when trying to run your code: it seems like your BatchBeam class has some clashes. It appears your return super(BatchBeam, ...) is causing the issue, where the class object gets re-defined. This happens in the CachedLookup, BatchBeam, AttentionModel, and StateTSP classes.

[image: screenshot of the error traceback]

@legohyl
Author

legohyl commented Apr 5, 2021

@chaitjo I've also found this rather perplexing error in the code. It appears that a few of your gather calls use an index that is not of type torch.long; looking into these indices, they are float-valued. How did you manage to get your code to work previously? I checked torch 1.2.0, and gather also requires torch.long data types there. Here's a sample in nar_beam_search.py; I checked the perm_mask matrix and its values are all in the range [0, 1].

[image: screenshot of the gather error in nar_beam_search.py]
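
A minimal toy example of the dtype requirement I'm hitting (not repo code, just an illustration):

```python
import torch

src = torch.arange(10).view(2, 5)
idx = torch.tensor([[0, 1], [2, 3]])     # torch.int64 index

print(torch.gather(src, 1, idx))         # works with a long index
# torch.gather(src, 1, idx.float())      # raises a RuntimeError: gather expects a long (int64) index
```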

@chaitjo
Owner

chaitjo commented Apr 5, 2021

Hi @legohyl, I'll look into these issues and get back asap.

Meanwhile, it would be helpful if you could provide your hardware and system information: GPUs, CUDA version, Python version, PyTorch version, and anything else worth mentioning.

@legohyl
Author

legohyl commented Apr 5, 2021

Thanks! I noticed that this is the same implementation as in your graph-convnet-tsp repo.

Sure, here's the relevant information:
GPU: NVIDIA V100 (AWS p3 instance) / K80 (AWS p2 instance)
CUDA: 10.1
Python: 3.8.3
Torch: 1.7.0
OS: Ubuntu 18.04

@chaitjo
Owner

chaitjo commented Apr 5, 2021

Is it possible to run with torch==1.2.0 and Python 3.6 for consistency?

@legohyl
Author

legohyl commented Apr 5, 2021

Hmm, it's not very convenient for me to revert to older versions, and from looking at the errors they don't seem to be package-related. But I could give it a shot.

@legohyl
Author

legohyl commented Apr 5, 2021

@chaitjo I installed using conda and it seems to work in that version. I'm really puzzled by this though.

@chaitjo
Owner

chaitjo commented Apr 5, 2021

@legohyl Do you mean it works upon downgrading? I will check your issues from my side as soon as possible.

Off the top of my head, there may be a discrepancy across package versions in how int and float divisions are handled. In the NAR beam search method, the perm_mask variable should always be long (and you can see this yourself if you trace back the logic). It is computed from prev_k, which should again be long and is divided by the integer value self.num_nodes. Now, I imagine that the issue arises because one of these tensors gets initialized as a float instead of long, or because the type gets converted to float during the division. I can check further.
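
A quick toy illustration of what I suspect is happening (the names mirror the discussion, but this is not the repo's code):

```python
import torch

bestScoresId = torch.tensor([7, 12, 25])  # long tensor of flat beam indices
num_nodes = 10

prev_k = bestScoresId / num_nodes  # true division in torch >= 1.5 produces a float tensor
print(prev_k.dtype)                # torch.float32 -- gather() rejects this as an index
```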

For the AR beam search methods, I am fairly confident the issue is due to a Python version mismatch. Is it critical for you to use Python 3.8?

@legohyl
Author

legohyl commented Apr 5, 2021

Yes, it works with your intended set of versions.

I do think it's because of the float/long tensors. I found myself having to convert things back to torch.long accordingly, and there might be some information loss in doing so. Isn't prev_k < 1? It seems like we take bestScoresId and divide it by num_nodes, which makes it a float between 0 and 1?

Well, as I'm using an EC2 instance from my company, it's kind of the default. Though I think the issue can be avoided by doing the tensor checks as asserts and returning accordingly, instead of re-declaring with super, because either way we should not be receiving non-tensor inputs.

@legohyl
Author

legohyl commented Apr 7, 2021

@chaitjo it seems to be resolved by calculating prev_k as bestScoresId // self.num_nodes to force integer division.
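
Continuing the toy example from above, floor division keeps the index tensor long, which gather then accepts:

```python
import torch

bestScoresId = torch.tensor([7, 12, 25])
num_nodes = 10

prev_k = bestScoresId // num_nodes  # floor division keeps the tensor long
print(prev_k.dtype)                 # torch.int64

scores = torch.rand(3, 5)
print(torch.gather(scores, 1, prev_k.unsqueeze(1)))  # works: index is long
```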
