
Saving Model and Predicting classes #38

Open
Kunjal1999 opened this issue Mar 26, 2021 · 37 comments

@Kunjal1999

How can I save the trained model? What line of code needs to be modified?

Moreover, how can I load the saved model in order to predict the label of a new NetworkX graph?

@muhanzhang
Owner

After training, in this line, you can add a torch.save(classifier.state_dict(), your_model_save_path) to save the trained model. Then, next time, you can skip the training and load the saved model by classifier.load_state_dict(torch.load(your_model_save_path)).
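A minimal sketch of that advice (assuming the training script is main.py, that 'saved_model/' already exists, and that MODEL_PATH is a placeholder name you choose):

import torch

MODEL_PATH = 'saved_model/dgcnn.pth'  # placeholder path

# after the training loop finishes:
torch.save(classifier.state_dict(), MODEL_PATH)

# in a later run, skip training and restore the weights:
classifier = Classifier()  # must be built with the same cmd_args as at training time
classifier.load_state_dict(torch.load(MODEL_PATH))
classifier.eval()  # disable dropout before predicting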

@Kunjal1999
Author

> After training, in this line, you can add a torch.save(classifier.state_dict(), your_model_save_path) to save the trained model. Then, next time, you can skip the training and load the saved model by classifier.load_state_dict(torch.load(your_model_save_path)).

Thanks. How do I predict the label of a new graph after loading the saved model?

@muhanzhang
Owner

You can use loop_dataset():

test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))

@Kunjal1999
Author

> You can use loop_dataset():
>
> test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))

Loss, accuracy and logits are generated inside the loop_dataset() function. Where are the predicted values?

@muhanzhang
Owner

pred, mae, loss = classifier(batch_graph)

pred are the predictions.

@Kunjal1999
Author

> pred, mae, loss = classifier(batch_graph)
>
> pred are the predictions.

Thanks, those are returned when classifier.regression is True (line 150).
What if classifier.regression is False (line 153)?

@muhanzhang
Owner

logits, loss, acc = classifier(batch_graph)

logits are the predictions.
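To make that concrete (a sketch; per the owner's comment further down, these logits are log-softmax scores):

logits, loss, acc = classifier(batch_graph)
probs = logits.exp()                # exponential of log-softmax gives class probabilities
pred_labels = logits.argmax(dim=1)  # predicted class index for each graph in the batch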

@asirico

asirico commented Jul 14, 2021

> After training, in this line, you can add a torch.save(classifier.state_dict(), your_model_save_path) to save the trained model. Then, next time, you can skip the training and load the saved model by classifier.load_state_dict(torch.load(your_model_save_path)).

After I save the model, where is the second line of code above executed? I am running the shell script, and once training is complete, the program terminates.

@asirico

asirico commented Jul 15, 2021

> How can I save the trained model? What line of code needs to be modified?
>
> Moreover, how can I load the saved model in order to predict the label of a new NetworkX graph?

How are you running this program? Are you running it in the terminal or in a Python console? Right now I have been running it in a Linux terminal: it saves the model, then the program terminates. How do I get the variables out after running a shell script?

@muhanzhang
Owner

I use the Linux terminal. After saving the model, you need to rerun the script. In the script, you need to write an "if else" to skip the training part and directly load the saved model for prediction. You may refer to this code for how to implement it.
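The skeleton could look like this (a sketch; cmd_args.predict is a flag you would add yourself, and your_model_save_path is a placeholder as above):

if cmd_args.predict:
    # skip training: rebuild the model and load the saved weights
    classifier = Classifier()
    classifier.load_state_dict(torch.load(your_model_save_path))
    classifier.eval()
    test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))
else:
    # the existing training loop, followed by:
    torch.save(classifier.state_dict(), your_model_save_path)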

@huizhang1032

How can I output the probability that each subgraph is class 0 or class 1, i.e. (P(g=0), P(g=1))? If I input 50 subgraphs, how can I output the 50×2 probability matrix and the final labels?

@muhanzhang
Owner

@huizhang1032 You can uncomment this line to save the test graphs' raw logit scores (for binary classification). Take the exponential of the logits to get the predicted probabilities of being class 1 instead of class 0.

You may refer to this line for how the raw logits are computed.
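In code, that amounts to the following sketch (all_scores is the array of class-1 log-softmax scores that loop_dataset() collects):

import numpy as np

np.savetxt('test_scores.txt', all_scores)  # the line to uncomment: raw logit scores
probs_class1 = np.exp(all_scores)          # P(g = 1); for binary classification, P(g = 0) = 1 - probs_class1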

@huizhang1032

The output prediction for each subgraph should be a 2-dimensional tuple, and I want the normalized probability. If taking the exponential of the logits gives the predicted probability of class 1, then how do I use the softmax function to get probability values between 0 and 1? For example, if the model's raw outputs for a binary classification problem are x1 = -3 and x2 = 1.5:
Convert the outputs into non-negative numbers:
y1 = exp(x1) = exp(-3) = 0.05
y2 = exp(x2) = exp(1.5) = 4.48
Normalize so the probabilities sum to 1:
z1 = y1/(y1+y2) = 0.05/(0.05+4.48) = 0.011
z2 = y2/(y1+y2) = 4.48/(0.05+4.48) = 0.989

@muhanzhang
Owner

@huizhang1032 Check this line. The logits are computed by log_softmax, so taking the exponential of them directly recovers the softmax, which is a probability distribution.
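A quick numeric check of this, using the x1 = -3, x2 = 1.5 example above (a sketch with torch.nn.functional):

import torch
import torch.nn.functional as F

raw = torch.tensor([[-3.0, 1.5]])
log_probs = F.log_softmax(raw, dim=1)  # what the classifier returns as "logits"
print(log_probs.exp())                 # tensor([[0.0110, 0.9890]]) -- exactly the softmax probabilities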

@huizhang1032

OK. Thanks.

@asirico

asirico commented Aug 25, 2021

> I use the Linux terminal. After saving the model, you need to rerun the script. In the script, you need to write an "if else" to skip the training part and directly load the saved model for prediction. You may refer to this code for how to implement it.

This is what I have for the main file:

import sys
import os
import torch
import random
import numpy as np
from tqdm import tqdm
from torch.autograd import Variable
from torch.nn.parameter import Parameter
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import math
import pdb
from DGCNN_embedding import DGCNN
from mlp_dropout import MLPClassifier, MLPRegression
from sklearn import metrics
from util import cmd_args, load_data
import matplotlib.pyplot as plt
import pickle

class Classifier(nn.Module):
    def __init__(self, regression=False):
        super(Classifier, self).__init__()
        self.regression = regression
        if cmd_args.gm == 'DGCNN':
            model = DGCNN
        else:
            print('unknown gm %s' % cmd_args.gm)
            sys.exit()

        if cmd_args.gm == 'DGCNN':
            self.gnn = model(latent_dim=cmd_args.latent_dim,
                            output_dim=cmd_args.out_dim,
                            num_node_feats=cmd_args.feat_dim+cmd_args.attr_dim,
                            num_edge_feats=cmd_args.edge_feat_dim,
                            k=cmd_args.sortpooling_k, 
                            conv1d_activation=cmd_args.conv1d_activation)
        out_dim = cmd_args.out_dim
        if out_dim == 0:
            if cmd_args.gm == 'DGCNN':
                out_dim = self.gnn.dense_dim
            else:
                out_dim = cmd_args.latent_dim
        self.mlp = MLPClassifier(input_size=out_dim, hidden_size=cmd_args.hidden, num_class=cmd_args.num_class, with_dropout=cmd_args.dropout)
        if regression:
            self.mlp = MLPRegression(input_size=out_dim, hidden_size=cmd_args.hidden, with_dropout=cmd_args.dropout)

    def PrepareFeatureLabel(self, batch_graph):
        if self.regression:
            labels = torch.FloatTensor(len(batch_graph))
        else:
            labels = torch.LongTensor(len(batch_graph))
        n_nodes = 0

        if batch_graph[0].node_tags is not None:
            node_tag_flag = True
            concat_tag = []
        else:
            node_tag_flag = False

        if batch_graph[0].node_features is not None:
            node_feat_flag = True
            concat_feat = []
        else:
            node_feat_flag = False

        if cmd_args.edge_feat_dim > 0:
            edge_feat_flag = True
            concat_edge_feat = []
        else:
            edge_feat_flag = False

        for i in range(len(batch_graph)):
            labels[i] = batch_graph[i].label
            n_nodes += batch_graph[i].num_nodes
            if node_tag_flag == True:
                concat_tag += batch_graph[i].node_tags
            if node_feat_flag == True:
                tmp = torch.from_numpy(batch_graph[i].node_features).type('torch.FloatTensor')
                concat_feat.append(tmp)
            if edge_feat_flag == True:
                if batch_graph[i].edge_features is not None:  # in case no edge in graph[i]
                    tmp = torch.from_numpy(batch_graph[i].edge_features).type('torch.FloatTensor')
                    concat_edge_feat.append(tmp)

        if node_tag_flag == True:
            concat_tag = torch.LongTensor(concat_tag).view(-1, 1)
            node_tag = torch.zeros(n_nodes, cmd_args.feat_dim)
            node_tag.scatter_(1, concat_tag, 1)

        if node_feat_flag == True:
            node_feat = torch.cat(concat_feat, 0)

        if node_feat_flag and node_tag_flag:
            # concatenate one-hot embedding of node tags (node labels) with continuous node features
            node_feat = torch.cat([node_tag.type_as(node_feat), node_feat], 1)
        elif node_feat_flag == False and node_tag_flag == True:
            node_feat = node_tag
        elif node_feat_flag == True and node_tag_flag == False:
            pass
        else:
            node_feat = torch.ones(n_nodes, 1)  # use all-one vector as node features
        
        if edge_feat_flag == True:
            edge_feat = torch.cat(concat_edge_feat, 0)

        if cmd_args.mode == 'gpu':
            node_feat = node_feat.cuda()
            labels = labels.cuda()
            if edge_feat_flag == True:
                edge_feat = edge_feat.cuda()

        if edge_feat_flag == True:
            return node_feat, edge_feat, labels
        return node_feat, labels

    def forward(self, batch_graph):
        feature_label = self.PrepareFeatureLabel(batch_graph)
        if len(feature_label) == 2:
            node_feat, labels = feature_label
            edge_feat = None
        elif len(feature_label) == 3:
            node_feat, edge_feat, labels = feature_label
        embed = self.gnn(batch_graph, node_feat, edge_feat)
        return self.mlp(embed, labels)

    def output_features(self, batch_graph):
        feature_label = self.PrepareFeatureLabel(batch_graph)
        if len(feature_label) == 2:
            node_feat, labels = feature_label
            edge_feat = None
        elif len(feature_label) == 3:
            node_feat, edge_feat, labels = feature_label
        embed = self.gnn(batch_graph, node_feat, edge_feat)
        return embed, labels
        

def loop_dataset(g_list, classifier, sample_idxes, optimizer=None, bsize=cmd_args.batch_size):
    total_loss = []
    total_iters = (len(sample_idxes) + (bsize - 1) * (optimizer is None)) // bsize
    pbar = tqdm(range(total_iters), unit='batch')
    all_targets = []
    all_scores = []

    n_samples = 0
    for pos in pbar:
        selected_idx = sample_idxes[pos * bsize : (pos + 1) * bsize]

        batch_graph = [g_list[idx] for idx in selected_idx]
        targets = [g_list[idx].label for idx in selected_idx]
        all_targets += targets
        if classifier.regression:
            pred, mae, loss = classifier(batch_graph)
            all_scores.append(pred.cpu().detach())  # for binary classification
        else:
            logits, loss, acc = classifier(batch_graph)
            all_scores.append(logits[:, 1].cpu().detach())  # for binary classification

        if optimizer is not None:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        loss = loss.data.cpu().detach().numpy()
        if classifier.regression:
            pbar.set_description('MSE_loss: %0.5f MAE_loss: %0.5f' % (loss, mae) )
            total_loss.append( np.array([loss, mae]) * len(selected_idx))
        else:
            pbar.set_description('loss: %0.5f acc: %0.5f' % (loss, acc) )
            total_loss.append( np.array([loss, acc]) * len(selected_idx))


        n_samples += len(selected_idx)
    if optimizer is None:
        assert n_samples == len(sample_idxes)
    total_loss = np.array(total_loss)
    avg_loss = np.sum(total_loss, 0) / n_samples
    all_scores = torch.cat(all_scores).cpu().numpy()
    
    # np.savetxt('test_scores.txt', all_scores)  # output test predictions
    
    if not classifier.regression and cmd_args.printAUC:
        all_targets = np.array(all_targets)
        fpr, tpr, _ = metrics.roc_curve(all_targets, all_scores, pos_label=1)
        auc = metrics.auc(fpr, tpr)
        avg_loss = np.concatenate((avg_loss, [auc]))
    else:
        avg_loss = np.concatenate((avg_loss, [0.0]))
    
    return avg_loss


if __name__ == '__main__':
    print(cmd_args)
    random.seed(cmd_args.seed)
    np.random.seed(cmd_args.seed)
    torch.manual_seed(cmd_args.seed)

    train_graphs, test_graphs = load_data()
    print('# train: %d, # test: %d' % (len(train_graphs), len(test_graphs)))

    if cmd_args.sortpooling_k <= 1:
        num_nodes_list = sorted([g.num_nodes for g in train_graphs + test_graphs])
        cmd_args.sortpooling_k = num_nodes_list[int(math.ceil(cmd_args.sortpooling_k * len(num_nodes_list))) - 1]
        cmd_args.sortpooling_k = max(10, cmd_args.sortpooling_k)
        print('k used in SortPooling is: ' + str(cmd_args.sortpooling_k))

    classifier = Classifier()
    if cmd_args.mode == 'gpu':
        classifier = classifier.cuda()

    optimizer = optim.Adam(classifier.parameters(), lr=cmd_args.learning_rate)

    train_idxes = list(range(len(train_graphs)))
    best_loss = None
    for epoch in range(cmd_args.num_epochs):
        random.shuffle(train_idxes)
        classifier.train()
        avg_loss = loop_dataset(train_graphs, classifier, train_idxes, optimizer=optimizer)
        if not cmd_args.printAUC:
            avg_loss[2] = 0.0
        print('\033[92maverage training of epoch %d: loss %.5f acc %.5f auc %.5f\033[0m' % (epoch, avg_loss[0], avg_loss[1], avg_loss[2]))

        classifier.eval()
        test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))
        if not cmd_args.printAUC:
            test_loss[2] = 0.0
        print('\033[93maverage test of epoch %d: loss %.5f acc %.5f auc %.5f\033[0m' % (epoch, test_loss[0], test_loss[1], test_loss[2]))

    with open(cmd_args.data + '_acc_results.txt', 'a+') as f:
        f.write(str(test_loss[1]) + '\n')

    torch.save(classifier.state_dict(), 'saved_model/test2.bin')

    if cmd_args.printAUC:
        with open(cmd_args.data + '_auc_results.txt', 'a+') as f:
            f.write(str(test_loss[2]) + '\n')



    if cmd_args.extract_features:
        features, labels = classifier.output_features(train_graphs)
        labels = labels.type('torch.FloatTensor')
        np.savetxt('extracted_features_train.txt', torch.cat([labels.unsqueeze(1), features.cpu()], dim=1).detach().numpy(), '%.4f')
        features, labels = classifier.output_features(test_graphs)
        labels = labels.type('torch.FloatTensor')
        np.savetxt('extracted_features_test.txt', torch.cat([labels.unsqueeze(1), features.cpu()], dim=1).detach().numpy(), '%.4f')

    elif cmd_args.predict:
        with open('saved_model/test2/archive/data.pkl', 'rb') as parameters:
            saved_cmd_args = pickle.load(parameters)
        for key, value in vars(saved_cmd_args).items():
            vars(cmd_args)[key] = value
        classifier = Classifier()
        if cmd_args.mode == 'gpu':
            classifier = classifier.cuda()
        model_name = 'data/my_data2.pth'
        classifier.load_state_dict(torch.load(model_name))
        classifier.eval()
        predictions = []
        batch_graph = []
        for i, graph in enumerate(test_graphs):
            batch_graph.append(graph)
            if len(batch_graph) == cmd_args.batch_size or i == (len(test_graphs) - 1):
                predictions.append(classifier(batch_graph)[0][:, 1].exp().cpu().detach())
                batch_graph = []
        predictions = torch.cat(predictions, 0).unsqueeze(1).numpy()
        test_idx_and_pred = np.concatenate([test_idx, predictions], 1)
        pred_name = 'data/' + cmd_args.test_name.split('.')[0] + '_pred.txt'
        np.savetxt(pred_name, test_idx_and_pred, fmt=['%d', '%d', '%1.2f'])
        print('Predictions for {} are saved in {}'.format(cmd_args.test_name, pred_name))
        exit()

Be advised, I am an engineer inexperienced in computer science. Did I implement the code correctly to make predictions at the bottom? Also, in the util file I added this as the first line of the argparse arguments: `cmd_opt.add_argument('-predict', action='store_true', default=False, help='...')`

Please help! Thanks.

@muhanzhang
Owner

Didn't check completely but it seems basically correct. You don't need the following lines:

with open('saved_model/test2/archive/data.pkl', 'rb') as parameters:
    saved_cmd_args = pickle.load(parameters)
for key, value in vars(saved_cmd_args).items():
    vars(cmd_args)[key] = value

And you need to make sure your loaded model name 'data/my_data2.pth' is the same as your saved model name 'saved_model/test2.bin'.
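One way to avoid such a mismatch (a sketch) is to define the path once and reuse it in both places:

MODEL_PATH = 'saved_model/test2.bin'  # single source of truth near the top of main.py

# when training:
torch.save(classifier.state_dict(), MODEL_PATH)

# when predicting:
classifier.load_state_dict(torch.load(MODEL_PATH))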

@asirico

asirico commented Aug 25, 2021

Thanks so much in advance for your help; I'm very new to this.

So when I run ./run_DGCNN.sh -predict in the terminal, what comes after -predict? Because this is what I'm getting:


./run_DGCNN.sh -predict
====== begin of gnn configuration ======
| msg_average = 0
======   end of gnn configuration ======
usage: main.py [-h] [-predict] [-mode MODE] [-gm GM] [-data DATA] [-batch_size BATCH_SIZE] [-seed SEED]
               [-feat_dim FEAT_DIM] [-edge_feat_dim EDGE_FEAT_DIM] [-num_class NUM_CLASS] [-fold FOLD]
               [-test_number TEST_NUMBER] [-num_epochs NUM_EPOCHS] [-latent_dim LATENT_DIM]
               [-sortpooling_k SORTPOOLING_K] [-conv1d_activation CONV1D_ACTIVATION] [-out_dim OUT_DIM] [-hidden HIDDEN]
               [-max_lv MAX_LV] [-learning_rate LEARNING_RATE] [-dropout DROPOUT] [-printAUC PRINTAUC]
               [-extract_features EXTRACT_FEATURES]
main.py: error: argument -data: expected one argument

@asirico

asirico commented Aug 25, 2021

OK, so is it another dataset in the same format as the one I trained on, run as ./run_DGCNN.sh my_predict_data 1 0 -predict? But in the dataset to be predicted, does the graph label have to be missing? If this is the data it trained on:

6 0
1 1 5
2 1 4
3 1 3
4 3 2 4 5
5 2 1 3
5 2 0 3

Would the prediction data be:

6 
1 1 5
2 1 4
3 1 3
4 3 2 4 5
5 2 1 3
5 2 0 3

@muhanzhang
Owner

The label cannot be missing. You can use a dummy label 0 for all test graphs. For your case, use ./run_DGCNN.sh my_train_data 1 0 first to train and save the model, then use ./run_DGCNN.sh my_predict_data 1 100 (suppose your my_predict_data contains 100 test graphs, and you need to modify "run_DGCNN.sh" to include the "-predict" after line 88) to predict.

I think reading the code more carefully can help understand its different functions better.

@asirico

asirico commented Aug 25, 2021

In the CUDA_VISIBLE_DEVICES list?

@asirico

asirico commented Aug 25, 2021

main.py:177: RuntimeWarning: invalid value encountered in double_scalars
  avg_loss = np.sum(total_loss, 0) / n_samples
Traceback (most recent call last):
  File "main.py", line 219, in <module>
    avg_loss = loop_dataset(train_graphs, classifier, train_idxes, optimizer=optimizer)
  File "main.py", line 178, in loop_dataset
    all_scores = torch.cat(all_scores).cpu().numpy()

@huizhang1032

> logits, loss, acc = classifier(batch_graph)
>
> logits are the predictions.

> You can use loop_dataset():
>
> test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))

> Loss, accuracy and logits are generated inside the loop_dataset() function. Where are the predicted values?
> What if classifier.regression is False? (line 153)

How do I output the predicted labels?

@andreitam11

Did anyone find a solution to this issue? I am also interested in outputting the predicted labels.

@andreitam11

> The label cannot be missing. You can use a dummy label 0 for all test graphs. For your case, use ./run_DGCNN.sh my_train_data 1 0 first to train and save the model, then use ./run_DGCNN.sh my_predict_data 1 100 (suppose your my_predict_data contains 100 test graphs, and you need to modify "run_DGCNN.sh" to include the "-predict" after line 88) to predict.
>
> I think reading the code more carefully can help understand its different functions better.

When I try to run the code without test graphs I get this error:

main.py:177: RuntimeWarning: invalid value encountered in double_scalars
  avg_loss = np.sum(total_loss, 0) / n_samples
Traceback (most recent call last):
  File "main.py", line 225, in <module>
    test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))
  File "main.py", line 178, in loop_dataset
    all_scores = torch.cat(all_scores).cpu().numpy()
RuntimeError: There were no tensor arguments to this function (e.g., you passed an empty list of Tensors), but no fallback function is registered for schema aten::_cat. This usually means that this function requires a non-empty list of Tensors. Available functions are [CPU, QuantizedCPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

Has someone successfully saved the model to predict new values? Thank you for your help.

@muhanzhang
Owner

Then a workaround would be appending a dummy test graph to my_train_data and using ./run_DGCNN.sh my_train_data 1 1 to save the model. @andreitam11

@andreitam11

Hi, that works and I am able to train and predict the data. However, when I want to retrain the model and update the 'run_DGCNN.sh' file to set predict to False, it never updates: it still only predicts and loads the last saved model. I have tried terminating the terminal and recompiling with 'make clean' then 'make -j4' in the lib directory, and these do not work.

I have copied the modified 'main.py' file if someone wishes to use it.

import sys
import os
import torch
import random
import numpy as np
from tqdm import tqdm
from torch.autograd import Variable
from torch.nn.parameter import Parameter
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import math
import pdb
from DGCNN_embedding import DGCNN
from mlp_dropout import MLPClassifier, MLPRegression
from sklearn import metrics
from util import cmd_args, load_data
import matplotlib.pyplot as plt
import pickle

class Classifier(nn.Module):
    def __init__(self, regression=False):
        super(Classifier, self).__init__()
        self.regression = regression
        if cmd_args.gm == 'DGCNN':
            model = DGCNN
        else:
            print('unknown gm %s' % cmd_args.gm)
            sys.exit()

        if cmd_args.gm == 'DGCNN':
            self.gnn = model(latent_dim=cmd_args.latent_dim,
                            output_dim=cmd_args.out_dim,
                            num_node_feats=cmd_args.feat_dim+cmd_args.attr_dim,
                            num_edge_feats=cmd_args.edge_feat_dim,
                            k=cmd_args.sortpooling_k, 
                            conv1d_activation=cmd_args.conv1d_activation)
        out_dim = cmd_args.out_dim
        if out_dim == 0:
            if cmd_args.gm == 'DGCNN':
                out_dim = self.gnn.dense_dim
            else:
                out_dim = cmd_args.latent_dim
        self.mlp = MLPClassifier(input_size=out_dim, hidden_size=cmd_args.hidden, num_class=cmd_args.num_class, with_dropout=cmd_args.dropout)
        if regression:
            self.mlp = MLPRegression(input_size=out_dim, hidden_size=cmd_args.hidden, with_dropout=cmd_args.dropout)

    def PrepareFeatureLabel(self, batch_graph):
        if self.regression:
            labels = torch.FloatTensor(len(batch_graph))
        else:
            labels = torch.LongTensor(len(batch_graph))
        n_nodes = 0

        if batch_graph[0].node_tags is not None:
            node_tag_flag = True
            concat_tag = []
        else:
            node_tag_flag = False

        if batch_graph[0].node_features is not None:
            node_feat_flag = True
            concat_feat = []
        else:
            node_feat_flag = False

        if cmd_args.edge_feat_dim > 0:
            edge_feat_flag = True
            concat_edge_feat = []
        else:
            edge_feat_flag = False

        for i in range(len(batch_graph)):
            labels[i] = batch_graph[i].label
            n_nodes += batch_graph[i].num_nodes
            if node_tag_flag == True:
                concat_tag += batch_graph[i].node_tags
            if node_feat_flag == True:
                tmp = torch.from_numpy(batch_graph[i].node_features).type('torch.FloatTensor')
                concat_feat.append(tmp)
            if edge_feat_flag == True:
                if batch_graph[i].edge_features is not None:  # in case no edge in graph[i]
                    tmp = torch.from_numpy(batch_graph[i].edge_features).type('torch.FloatTensor')
                    concat_edge_feat.append(tmp)

        if node_tag_flag == True:
            concat_tag = torch.LongTensor(concat_tag).view(-1, 1)
            node_tag = torch.zeros(n_nodes, cmd_args.feat_dim)
            node_tag.scatter_(1, concat_tag, 1)

        if node_feat_flag == True:
            node_feat = torch.cat(concat_feat, 0)

        if node_feat_flag and node_tag_flag:
            # concatenate one-hot embedding of node tags (node labels) with continuous node features
            node_feat = torch.cat([node_tag.type_as(node_feat), node_feat], 1)
        elif node_feat_flag == False and node_tag_flag == True:
            node_feat = node_tag
        elif node_feat_flag == True and node_tag_flag == False:
            pass
        else:
            node_feat = torch.ones(n_nodes, 1)  # use all-one vector as node features
        
        if edge_feat_flag == True:
            edge_feat = torch.cat(concat_edge_feat, 0)

        if cmd_args.mode == 'gpu':
            node_feat = node_feat.cuda()
            labels = labels.cuda()
            if edge_feat_flag == True:
                edge_feat = edge_feat.cuda()

        if edge_feat_flag == True:
            return node_feat, edge_feat, labels
        return node_feat, labels

    def forward(self, batch_graph):
        feature_label = self.PrepareFeatureLabel(batch_graph)
        if len(feature_label) == 2:
            node_feat, labels = feature_label
            edge_feat = None
        elif len(feature_label) == 3:
            node_feat, edge_feat, labels = feature_label
        embed = self.gnn(batch_graph, node_feat, edge_feat)
        return self.mlp(embed, labels)

    def output_features(self, batch_graph):
        feature_label = self.PrepareFeatureLabel(batch_graph)
        if len(feature_label) == 2:
            node_feat, labels = feature_label
            edge_feat = None
        elif len(feature_label) == 3:
            node_feat, edge_feat, labels = feature_label
        embed = self.gnn(batch_graph, node_feat, edge_feat)
        return embed, labels
        

def loop_dataset(g_list, classifier, sample_idxes, optimizer=None, bsize=cmd_args.batch_size):
    total_loss = []
    total_iters = (len(sample_idxes) + (bsize - 1) * (optimizer is None)) // bsize
    pbar = tqdm(range(total_iters), unit='batch')
    all_targets = []
    all_scores = []
    output_scores = []
    n_samples = 0
    for pos in pbar:
        selected_idx = sample_idxes[pos * bsize : (pos + 1) * bsize]

        batch_graph = [g_list[idx] for idx in selected_idx]
        targets = [g_list[idx].label for idx in selected_idx]
        all_targets += targets
        if classifier.regression:
            pred, mae, loss = classifier(batch_graph)
            all_scores.append(pred.cpu().detach())  # for binary classification
        else:
            logits, loss, acc = classifier(batch_graph)
            all_scores.append(logits[:, 1].cpu().detach())  # for binary classification
            output_scores.append(logits.exp().cpu().detach())
        if optimizer is not None:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        loss = loss.data.cpu().detach().numpy()
        if classifier.regression:
            pbar.set_description('MSE_loss: %0.5f MAE_loss: %0.5f' % (loss, mae) )
            total_loss.append( np.array([loss, mae]) * len(selected_idx))
        else:
            pbar.set_description('loss: %0.5f acc: %0.5f' % (loss, acc) )
            total_loss.append( np.array([loss, acc]) * len(selected_idx))


        n_samples += len(selected_idx)
    if optimizer is None:
        assert n_samples == len(sample_idxes)
    total_loss = np.array(total_loss)
    avg_loss = np.sum(total_loss, 0) / n_samples
    all_scores = torch.cat(all_scores).cpu().numpy()
    output_scores = torch.cat(output_scores).cpu().numpy()
    np.savetxt(cmd_args.data+'_test_scores.txt', output_scores)  # output test predictions
    
    if not classifier.regression and cmd_args.printAUC:
        all_targets = np.array(all_targets)
        fpr, tpr, _ = metrics.roc_curve(all_targets, all_scores, pos_label=1)
        auc = metrics.auc(fpr, tpr)
        avg_loss = np.concatenate((avg_loss, [auc]))
    else:
        avg_loss = np.concatenate((avg_loss, [0.0]))
    
    return avg_loss


if __name__ == '__main__':
    print(cmd_args)
    random.seed(cmd_args.seed)
    np.random.seed(cmd_args.seed)
    torch.manual_seed(cmd_args.seed)

    train_graphs, test_graphs = load_data()
    print('# train: %d, # test: %d' % (len(train_graphs), len(test_graphs)))

    if cmd_args.sortpooling_k <= 1:
        num_nodes_list = sorted([g.num_nodes for g in train_graphs + test_graphs])
        cmd_args.sortpooling_k = num_nodes_list[int(math.ceil(cmd_args.sortpooling_k * len(num_nodes_list))) - 1]
        cmd_args.sortpooling_k = max(10, cmd_args.sortpooling_k)
        print('k used in SortPooling is: ' + str(cmd_args.sortpooling_k))
    if cmd_args.predict == False:
        classifier = Classifier()
        if cmd_args.mode == 'gpu':
            classifier = classifier.cuda()

        optimizer = optim.Adam(classifier.parameters(), lr=cmd_args.learning_rate)

        train_idxes = list(range(len(train_graphs)))
        best_loss = None
        output_string = ''
        for epoch in range(cmd_args.num_epochs):
            random.shuffle(train_idxes)
            classifier.train()
            avg_loss = loop_dataset(train_graphs, classifier, train_idxes, optimizer=optimizer)
            if not cmd_args.printAUC:
                avg_loss[2] = 0.0
            print('\033[92maverage training of epoch %d: loss %.5f acc %.5f auc %.5f\033[0m' % (epoch, avg_loss[0], avg_loss[1], avg_loss[2]))
            output_string += '[training, %d, %.5f ,%.5f ,%.5f] \n' % (epoch, avg_loss[0], avg_loss[1], avg_loss[2])
            classifier.eval()
            test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))
            if not cmd_args.printAUC:
                test_loss[2] = 0.0
            print('\033[93maverage test of epoch %d: loss %.5f acc %.5f auc %.5f\033[0m' % (epoch, test_loss[0], test_loss[1], test_loss[2]))
            output_string += '[test, %d, %.5f ,%.5f ,%.5f] \n' % (epoch, test_loss[0], test_loss[1], test_loss[2])
        with open(cmd_args.data + '_acc_results.txt', 'a+') as f:
            f.write(output_string)

        torch.save(classifier.state_dict(), 'saved_model/test2.bin')

        if cmd_args.printAUC:
            with open(cmd_args.data + '_auc_results.txt', 'a+') as f:
                f.write(output_string)



        if cmd_args.extract_features:
            features, labels = classifier.output_features(train_graphs)
            labels = labels.type('torch.FloatTensor')
            np.savetxt('extracted_features_train.txt', torch.cat([labels.unsqueeze(1), features.cpu()], dim=1).detach().numpy(), '%.4f')
            features, labels = classifier.output_features(test_graphs)
            labels = labels.type('torch.FloatTensor')
            np.savetxt('extracted_features_test.txt', torch.cat([labels.unsqueeze(1), features.cpu()], dim=1).detach().numpy(), '%.4f')

    elif cmd_args.predict:
        classifier = Classifier()
        if cmd_args.mode == 'gpu':
            classifier = classifier.cuda()
        model_name = 'saved_model/test2.bin'
        classifier.load_state_dict(torch.load(model_name))
        classifier.eval()
        predictions = []
        batch_graph = []
        for i, graph in enumerate(test_graphs):
            batch_graph.append(graph)
            if len(batch_graph) == cmd_args.batch_size or i == (len(test_graphs) - 1):
                predictions.append(classifier(batch_graph)[0].exp().cpu().detach())
                batch_graph = []
        predictions = torch.cat(predictions).cpu().numpy()
        pred_name = 'data/' +cmd_args.data+'_pred.txt'
        np.savetxt(pred_name,predictions)
        print('Predictions for {} are saved in {}'.format(cmd_args.data, pred_name))
        exit()

@muhanzhang
Owner

@andreitam11 Did you modify the saved file name in torch.save(classifier.state_dict(), 'saved_model/test2.bin') and classifier.load_state_dict(torch.load(model_name)) when you retrain and reload the model?

@andreitam11

> @andreitam11 Did you modify the saved file name in torch.save(classifier.state_dict(), 'saved_model/test2.bin') and classifier.load_state_dict(torch.load(model_name)) when you retrain and reload the model?

I cannot retrain the model. Even if I modify the name in torch.save(classifier.state_dict(), 'saved_model/test2.bin'), when I run run_DGCNN.sh it reads the previous setting, so it keeps on predicting. I have changed several parameters in the run_DGCNN.sh file and nothing changes.

@andreitam11

I read the file in the terminal as a text file and it is correct, but when I run it, it does not read the correct information.

(Pytorch) hectormujica@Mitzis-iMac pytorch_DGCNN % cat run_DGCNN.txt
#!/bin/bash

# input arguments
DATA="${1-MUTAG}"  # MUTAG, ENZYMES, NCI1, NCI109, DD, PTC, PROTEINS, COLLAB, IMDBBINARY, IMDBMULTI
fold=${2-1}  # which fold as testing data
test_number=${3-0}  # if specified, use the last test_number graphs as test data

# general settings
gm=DGCNN  # model
gpu_or_cpu=cpu
GPU=0  # select the GPU number
CONV_SIZE="32-32-32-1"
sortpooling_k=0.6  # If k <= 1, then k is set to an integer so that k% of graphs have nodes less than this integer
FP_LEN=0  # final dense layer's input dimension, decided by data
n_hidden=128  # final dense layer's hidden size
bsize=1  # batch size, set to 50 or 100 to accelerate training
dropout=True
predict=False

# dataset-specific settings
case ${DATA} in
MUTAG)
  num_epochs=300
  learning_rate=0.0001
  ;;
ANDREA)
  num_epochs=10
  learning_rate=0.0001
  ;;
*)
  num_epochs=500
  learning_rate=0.00001
  ;;
esac

if [ ${fold} == 0 ]; then
  echo "Running 10-fold cross validation"
  start=`date +%s`
  for i in $(seq 1 10)
  do
    CUDA_VISIBLE_DEVICES=${GPU} python main.py \
        -seed 1 \
        -data $DATA \
        -fold $i \
        -learning_rate $learning_rate \
        -num_epochs $num_epochs \
        -hidden $n_hidden \
        -latent_dim $CONV_SIZE \
        -sortpooling_k $sortpooling_k \
        -out_dim $FP_LEN \
        -batch_size $bsize \
        -gm $gm \
        -mode $gpu_or_cpu \
        -dropout $dropout \
        -predict $predict

  done
  stop=`date +%s`
  echo "End of cross-validation"
  echo "The total running time is $[stop - start] seconds."
  echo "The accuracy results for ${DATA} are as follows:"
  tail -10 ${DATA}_acc_results.txt
  echo "Average accuracy and std are"
  tail -10 ${DATA}_acc_results.txt | awk '{ sum += $1; sum2 += $1*$1; n++ } END { if (n > 0) print sum / n; print sqrt(sum2 / n - (sum/n) * (sum/n)); }'
else
  CUDA_VISIBLE_DEVICES=${GPU} python main.py \
      -seed 1 \
      -data $DATA \
      -fold $fold \
      -learning_rate $learning_rate \
      -num_epochs $num_epochs \
      -hidden $n_hidden \
      -latent_dim $CONV_SIZE \
      -sortpooling_k $sortpooling_k \
      -out_dim $FP_LEN \
      -batch_size $bsize \
      -gm $gm \
      -mode $gpu_or_cpu \
      -dropout $dropout \
      -predict $predict \
      -test_number ${test_number}
fi
(Pytorch) hectormujica@Mitzis-iMac pytorch_DGCNN % 
(Pytorch) hectormujica@Mitzis-iMac pytorch_DGCNN % ./run_DGCNN.sh ANDREA 1 1
====== begin of gnn configuration ======
| msg_average = 0
======   end of gnn configuration ======
Namespace(batch_size=1, conv1d_activation='ReLU', data='ANDREA', dropout=True, edge_feat_dim=0, extract_features=False, feat_dim=0, fold=1, gm='DGCNN', hidden=128, latent_dim=[32, 32, 32, 1], learning_rate=0.0001, max_lv=4, mode='cpu', num_class=0, num_epochs=10, out_dim=0, predict=True, printAUC=False, seed=1, sortpooling_k=0.6, test_number=1)
loading data
# classes: 3
# maximum node tag: 3
# train: 499, # test: 1
k used in SortPooling is: 564
Initializing DGCNN
Traceback (most recent call last):
  File "main.py", line 255, in <module>
    classifier.load_state_dict(torch.load(model_name))
  File "/Users/hectormujica/opt/anaconda3/envs/Pytorch/lib/python3.8/site-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/Users/hectormujica/opt/anaconda3/envs/Pytorch/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/Users/hectormujica/opt/anaconda3/envs/Pytorch/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'saved_model/test3.bin'
(Pytorch) hectormujica@Mitzis-iMac pytorch_DGCNN % 

@muhanzhang
Owner

@andreitam11 Seems that when you modify the file, you modify run_DGCNN.txt, but when you execute the program, you execute run_DGCNN.sh.

@andreitam11

I only converted it to a .txt to see if it was reading it and to echo it in the terminal, but I run run_DGCNN.sh.

@muhanzhang
Owner

I cannot guess what happened, but there might be something wrong with your implementation of the -predict option, so that the correct argument isn't passed to the program. Can you set breakpoints with pdb.set_trace() before and after where -predict should take effect and see what happens?
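For example (a sketch of the kind of breakpoint meant here):

import pdb

pdb.set_trace()  # pauses execution; at the (Pdb) prompt, type `p cmd_args.predict` to inspect the flag
if cmd_args.predict:
    pass  # prediction branch
else:
    pass  # training branch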

@andreitam11

====== begin of gnn configuration ======
| msg_average = 0
======   end of gnn configuration ======
Namespace(batch_size=1, conv1d_activation='ReLU', data='ANDREA', dropout=True, edge_feat_dim=0, extract_features=False, feat_dim=0, fold=1, gm='DGCNN', hidden=128, latent_dim=[32, 32, 32, 1], learning_rate=0.001, max_lv=4, mode='cpu', num_class=0, num_epochs=10, out_dim=0, predict=True, printAUC=False, seed=1, sortpooling_k=0.6, test_number=10)
loading data
# classes: 3
# maximum node tag: 3
# train: 490, # test: 10
k used in SortPooling is: 564
> /Users/hectormujica/Documents/Andrea/pytorch_DGCNN/main.py(209)<module>()
-> if cmd_args.predict:
(Pdb)

@andreitam11

Hello, I finally made it work, and I also changed util.py to print out the label dictionary. What I needed to do was erase the -predict $predict \ line from run_DGCNN.sh whenever I wanted to train a new model. I hope it helps someone, and I hope it is implemented correctly.
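One plausible explanation for why -predict $predict never turned off, consistent with the Namespace output above (predict=True despite predict=False in the script): argparse's type=bool converts any non-empty string, including "False", to True. asirico's earlier action='store_true' version avoids this. A small sketch:

import argparse

p = argparse.ArgumentParser()
p.add_argument('-predict', type=bool, default=False)
print(p.parse_args(['-predict', 'False']).predict)  # True -- bool('False') is True

p2 = argparse.ArgumentParser()
p2.add_argument('-predict', action='store_true', default=False)
print(p2.parse_args([]).predict)            # False
print(p2.parse_args(['-predict']).predict)  # True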

Here are the modifications

main.py

import sys
import os
import torch
import random
import numpy as np
from tqdm import tqdm
from torch.autograd import Variable
from torch.nn.parameter import Parameter
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import math
from DGCNN_embedding import DGCNN
from mlp_dropout import MLPClassifier, MLPRegression
from sklearn import metrics
from util import cmd_args, load_data
import matplotlib.pyplot as plt

class Classifier(nn.Module):
    def __init__(self, regression=False):
        super(Classifier, self).__init__()
        self.regression = regression
        if cmd_args.gm == 'DGCNN':
            model = DGCNN
        else:
            print('unknown gm %s' % cmd_args.gm)
            sys.exit()

        if cmd_args.gm == 'DGCNN':
            self.gnn = model(latent_dim=cmd_args.latent_dim,
                            output_dim=cmd_args.out_dim,
                            num_node_feats=cmd_args.feat_dim+cmd_args.attr_dim,
                            num_edge_feats=cmd_args.edge_feat_dim,
                            k=cmd_args.sortpooling_k, 
                            conv1d_activation=cmd_args.conv1d_activation)
        out_dim = cmd_args.out_dim
        if out_dim == 0:
            if cmd_args.gm == 'DGCNN':
                out_dim = self.gnn.dense_dim
            else:
                out_dim = cmd_args.latent_dim
        self.mlp = MLPClassifier(input_size=out_dim, hidden_size=cmd_args.hidden, num_class=cmd_args.num_class, with_dropout=cmd_args.dropout)
        if regression:
            self.mlp = MLPRegression(input_size=out_dim, hidden_size=cmd_args.hidden, with_dropout=cmd_args.dropout)

    def PrepareFeatureLabel(self, batch_graph):
        if self.regression:
            labels = torch.FloatTensor(len(batch_graph))
        else:
            labels = torch.LongTensor(len(batch_graph))
        n_nodes = 0

        if batch_graph[0].node_tags is not None:
            node_tag_flag = True
            concat_tag = []
        else:
            node_tag_flag = False

        if batch_graph[0].node_features is not None:
            node_feat_flag = True
            concat_feat = []
        else:
            node_feat_flag = False

        if cmd_args.edge_feat_dim > 0:
            edge_feat_flag = True
            concat_edge_feat = []
        else:
            edge_feat_flag = False

        for i in range(len(batch_graph)):
            labels[i] = batch_graph[i].label
            n_nodes += batch_graph[i].num_nodes
            if node_tag_flag == True:
                concat_tag += batch_graph[i].node_tags
            if node_feat_flag == True:
                tmp = torch.from_numpy(batch_graph[i].node_features).type('torch.FloatTensor')
                concat_feat.append(tmp)
            if edge_feat_flag == True:
                if batch_graph[i].edge_features is not None:  # in case no edge in graph[i]
                    tmp = torch.from_numpy(batch_graph[i].edge_features).type('torch.FloatTensor')
                    concat_edge_feat.append(tmp)

        if node_tag_flag == True:
            concat_tag = torch.LongTensor(concat_tag).view(-1, 1)
            node_tag = torch.zeros(n_nodes, cmd_args.feat_dim)
            node_tag.scatter_(1, concat_tag, 1)

        if node_feat_flag == True:
            node_feat = torch.cat(concat_feat, 0)

        if node_feat_flag and node_tag_flag:
            # concatenate one-hot embedding of node tags (node labels) with continuous node features
            node_feat = torch.cat([node_tag.type_as(node_feat), node_feat], 1)
        elif node_feat_flag == False and node_tag_flag == True:
            node_feat = node_tag
        elif node_feat_flag == True and node_tag_flag == False:
            pass
        else:
            node_feat = torch.ones(n_nodes, 1)  # use all-one vector as node features
        
        if edge_feat_flag == True:
            edge_feat = torch.cat(concat_edge_feat, 0)

        if cmd_args.mode == 'gpu':
            node_feat = node_feat.cuda()
            labels = labels.cuda()
            if edge_feat_flag == True:
                edge_feat = edge_feat.cuda()

        if edge_feat_flag == True:
            return node_feat, edge_feat, labels
        return node_feat, labels

    def forward(self, batch_graph):
        feature_label = self.PrepareFeatureLabel(batch_graph)
        if len(feature_label) == 2:
            node_feat, labels = feature_label
            edge_feat = None
        elif len(feature_label) == 3:
            node_feat, edge_feat, labels = feature_label
        embed = self.gnn(batch_graph, node_feat, edge_feat)
        return self.mlp(embed, labels)

    def output_features(self, batch_graph):
        feature_label = self.PrepareFeatureLabel(batch_graph)
        if len(feature_label) == 2:
            node_feat, labels = feature_label
            edge_feat = None
        elif len(feature_label) == 3:
            node_feat, edge_feat, labels = feature_label
        embed = self.gnn(batch_graph, node_feat, edge_feat)
        return embed, labels
        

def loop_dataset(g_list, classifier, sample_idxes, optimizer=None, bsize=cmd_args.batch_size):
    total_loss = []
    total_iters = (len(sample_idxes) + (bsize - 1) * (optimizer is None)) // bsize
    pbar = tqdm(range(total_iters), unit='batch')
    all_targets = []
    all_scores = []
    output_scores = []
    n_samples = 0
    for pos in pbar:
        selected_idx = sample_idxes[pos * bsize : (pos + 1) * bsize]

        batch_graph = [g_list[idx] for idx in selected_idx]
        targets = [g_list[idx].label for idx in selected_idx]
        all_targets += targets
        if classifier.regression:
            pred, mae, loss = classifier(batch_graph)
            all_scores.append(pred.cpu().detach())  # for binary classification
        else:
            logits, loss, acc = classifier(batch_graph)
            all_scores.append(logits[:, 1].cpu().detach())  # for binary classification
            output_scores.append(logits.exp().cpu().detach())
        if optimizer is not None:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        loss = loss.data.cpu().detach().numpy()
        if classifier.regression:
            pbar.set_description('MSE_loss: %0.5f MAE_loss: %0.5f' % (loss, mae) )
            total_loss.append( np.array([loss, mae]) * len(selected_idx))
        else:
            pbar.set_description('loss: %0.5f acc: %0.5f' % (loss, acc) )
            total_loss.append( np.array([loss, acc]) * len(selected_idx))


        n_samples += len(selected_idx)
    if optimizer is None:
        assert n_samples == len(sample_idxes)
    total_loss = np.array(total_loss)
    avg_loss = np.sum(total_loss, 0) / n_samples
    all_scores = torch.cat(all_scores).cpu().numpy()
    output_scores = torch.cat(output_scores).cpu().numpy()
    np.savetxt(cmd_args.data+'_test_scores.txt', output_scores)  # output test predictions
    
    if not classifier.regression and cmd_args.printAUC:
        all_targets = np.array(all_targets)
        fpr, tpr, _ = metrics.roc_curve(all_targets, all_scores, pos_label=1)
        auc = metrics.auc(fpr, tpr)
        avg_loss = np.concatenate((avg_loss, [auc]))
    else:
        avg_loss = np.concatenate((avg_loss, [0.0]))
    
    return avg_loss


if __name__ == '__main__':
    print(cmd_args)
    random.seed(cmd_args.seed)
    np.random.seed(cmd_args.seed)
    torch.manual_seed(cmd_args.seed)

    train_graphs, test_graphs,label_dict = load_data()
    print('# train: %d, # test: %d' % (len(train_graphs), len(test_graphs)))

    if cmd_args.sortpooling_k <= 1:
        num_nodes_list = sorted([g.num_nodes for g in train_graphs + test_graphs])
        cmd_args.sortpooling_k = num_nodes_list[int(math.ceil(cmd_args.sortpooling_k * len(num_nodes_list))) - 1]
        cmd_args.sortpooling_k = max(10, cmd_args.sortpooling_k)
        print('k used in SortPooling is: ' + str(cmd_args.sortpooling_k))

    if cmd_args.predict:
        classifier = Classifier()
        if cmd_args.mode == 'gpu':
            classifier = classifier.cuda()
        model_name = 'saved_model/test1.bin'
        classifier.load_state_dict(torch.load(model_name))
        classifier.eval()
        predictions = []
        batch_graph = []
        for i, graph in enumerate(test_graphs):
            batch_graph.append(graph)
            if len(batch_graph) == cmd_args.batch_size or i == (len(test_graphs) - 1):
                predictions.append(classifier(batch_graph)[0].exp().cpu().detach())
                batch_graph = []
        predictions = torch.cat(predictions).cpu().numpy()
        pred_name = 'data/' +cmd_args.data+'_pred.txt'
        np.savetxt(pred_name,predictions,fmt = '%.4f',header = str(label_dict))
        np.set_printoptions(precision =4,suppress=True)
        print(str(label_dict))
        print(predictions)
        print('Predictions for {} are saved in {}'.format(cmd_args.data, pred_name))
        exit()

    classifier = Classifier()
    if cmd_args.mode == 'gpu':
        classifier = classifier.cuda()

    optimizer = optim.Adam(classifier.parameters(), lr=cmd_args.learning_rate)
    train_idxes = list(range(len(train_graphs)))
    best_loss = None
    output_string = ''
    for epoch in range(cmd_args.num_epochs):
        random.shuffle(train_idxes)
        classifier.train()
        avg_loss = loop_dataset(train_graphs, classifier, train_idxes, optimizer=optimizer)
        if not cmd_args.printAUC:
            avg_loss[2] = 0.0
        print('\033[92maverage training of epoch %d: loss %.5f acc %.5f auc %.5f\033[0m' % (epoch, avg_loss[0], avg_loss[1], avg_loss[2]))
        output_string += '[training, %d, %.5f ,%.5f ,%.5f] \n' % (epoch, avg_loss[0], avg_loss[1], avg_loss[2])
        classifier.eval()
        test_loss = loop_dataset(test_graphs, classifier, list(range(len(test_graphs))))
        if not cmd_args.printAUC:
            test_loss[2] = 0.0
        print('\033[93maverage test of epoch %d: loss %.5f acc %.5f auc %.5f\033[0m' % (epoch, test_loss[0], test_loss[1], test_loss[2]))
        output_string += '[test, %d, %.5f ,%.5f ,%.5f] \n' % (epoch, test_loss[0], test_loss[1], test_loss[2])
    with open(cmd_args.data + '_acc_results.txt', 'w') as f:
        f.write(output_string)

    torch.save(classifier.state_dict(), 'saved_model/test1.bin')

    if cmd_args.printAUC:
        with open(cmd_args.data + '_auc_results.txt', 'a+') as f:
            f.write(output_string)



    if cmd_args.extract_features:
        features, labels = classifier.output_features(train_graphs)
        labels = labels.type('torch.FloatTensor')
        np.savetxt('extracted_features_train.txt', torch.cat([labels.unsqueeze(1), features.cpu()], dim=1).detach().numpy(), '%.4f')
        features, labels = classifier.output_features(test_graphs)
        labels = labels.type('torch.FloatTensor')
        np.savetxt('extracted_features_test.txt', torch.cat([labels.unsqueeze(1), features.cpu()], dim=1).detach().numpy(), '%.4f')

util.py

from __future__ import print_function
import numpy as np
import random
from tqdm import tqdm
import os
import networkx as nx
import argparse

cmd_opt = argparse.ArgumentParser(description='Argparser for graph_classification')
cmd_opt.add_argument('-mode', default='cpu', help='cpu/gpu')
cmd_opt.add_argument('-gm', default='DGCNN', help='gnn model to use')
cmd_opt.add_argument('-data', default=None, help='data folder name')
cmd_opt.add_argument('-batch_size', type=int, default=50, help='minibatch size')
cmd_opt.add_argument('-seed', type=int, default=1, help='seed')
cmd_opt.add_argument('-feat_dim', type=int, default=0, help='dimension of discrete node feature (maximum node tag)')
cmd_opt.add_argument('-edge_feat_dim', type=int, default=0, help='dimension of edge features')
cmd_opt.add_argument('-num_class', type=int, default=0, help='#classes')
cmd_opt.add_argument('-fold', type=int, default=1, help='fold (1..10)')
cmd_opt.add_argument('-test_number', type=int, default=0, help='if specified, will overwrite -fold and use the last -test_number graphs as testing data')
cmd_opt.add_argument('-num_epochs', type=int, default=1000, help='number of epochs')
cmd_opt.add_argument('-latent_dim', type=str, default='64', help='dimension(s) of latent layers')
cmd_opt.add_argument('-sortpooling_k', type=float, default=30, help='number of nodes kept after SortPooling')
cmd_opt.add_argument('-conv1d_activation', type=str, default='ReLU', help='which nn activation layer to use')
cmd_opt.add_argument('-out_dim', type=int, default=1024, help='graph embedding output size')
cmd_opt.add_argument('-hidden', type=int, default=100, help='dimension of mlp hidden layer')
cmd_opt.add_argument('-max_lv', type=int, default=4, help='max rounds of message passing')
cmd_opt.add_argument('-learning_rate', type=float, default=0.0001, help='init learning_rate')
cmd_opt.add_argument('-dropout', type=bool, default=False, help='whether add dropout after dense layer')
cmd_opt.add_argument('-printAUC', type=bool, default=False, help='whether to print AUC (for binary classification only)')
cmd_opt.add_argument('-extract_features', type=bool, default=False, help='whether to extract final graph features')
cmd_opt.add_argument('-predict', type=bool, default=False, help='whether to train or load a saved model')

cmd_args, _ = cmd_opt.parse_known_args()

cmd_args.latent_dim = [int(x) for x in cmd_args.latent_dim.split('-')]
if len(cmd_args.latent_dim) == 1:
    cmd_args.latent_dim = cmd_args.latent_dim[0]

class GNNGraph(object):
    def __init__(self, g, label, node_tags=None, node_features=None):
        '''
            g: a networkx graph
            label: an integer graph label
            node_tags: a list of integer node tags
            node_features: a numpy array of continuous node features
        '''
        self.num_nodes = len(node_tags)
        self.node_tags = node_tags
        self.label = label
        self.node_features = node_features  # numpy array (node_num * feature_dim)
        self.degs = list(dict(g.degree).values())

        if len(g.edges()) != 0:
            x, y = zip(*g.edges())
            self.num_edges = len(x)        
            self.edge_pairs = np.ndarray(shape=(self.num_edges, 2), dtype=np.int32)
            self.edge_pairs[:, 0] = x
            self.edge_pairs[:, 1] = y
            self.edge_pairs = self.edge_pairs.flatten()
        else:
            self.num_edges = 0
            self.edge_pairs = np.array([])
        
        # see if there are edge features
        self.edge_features = None
        if nx.get_edge_attributes(g, 'features'):  
            # make sure edges have an attribute 'features' (1 * feature_dim numpy array)
            edge_features = nx.get_edge_attributes(g, 'features')
            assert(type(list(edge_features.values())[0]) == np.ndarray)  # dict.values() is not subscriptable in Python 3
            # need to rearrange edge_features using the e2n edge order
            edge_features = {(min(x, y), max(x, y)): z for (x, y), z in edge_features.items()}
            keys = sorted(edge_features)
            self.edge_features = []
            for edge in keys:
                self.edge_features.append(edge_features[edge])
                self.edge_features.append(edge_features[edge])  # add reversed edges
            self.edge_features = np.concatenate(self.edge_features, 0)


def load_data():

    print('loading data')
    g_list = []
    label_dict = {}
    feat_dict = {}

    with open('data/%s/%s.txt' % (cmd_args.data, cmd_args.data), 'r') as f:
        n_g = int(f.readline().strip())
        for i in range(n_g):
            row = f.readline().strip().split()
            n, l = [int(w) for w in row]
            if not l in label_dict:
                mapped = len(label_dict)
                label_dict[l] = mapped
            g = nx.Graph()
            node_tags = []
            node_features = []
            n_edges = 0
            for j in range(n):
                g.add_node(j)
                row = f.readline().strip().split()
                tmp = int(row[1]) + 2
                if tmp == len(row):
                    # no node attributes
                    row = [int(w) for w in row]
                    attr = None
                else:
                    row, attr = [int(w) for w in row[:tmp]], np.array([float(w) for w in row[tmp:]])
                if not row[0] in feat_dict:
                    mapped = len(feat_dict)
                    feat_dict[row[0]] = mapped
                node_tags.append(feat_dict[row[0]])

                if attr is not None:
                    node_features.append(attr)

                n_edges += row[1]
                for k in range(2, len(row)):
                    g.add_edge(j, row[k])

            if node_features != []:
                node_features = np.stack(node_features)
                node_feature_flag = True
            else:
                node_features = None
                node_feature_flag = False

            #assert len(g.edges()) * 2 == n_edges  (some graphs in COLLAB have self-loops, ignored here)
            assert len(g) == n
            g_list.append(GNNGraph(g, l, node_tags, node_features))
    for g in g_list:
        g.label = label_dict[g.label]
    cmd_args.num_class = len(label_dict)
    cmd_args.feat_dim = len(feat_dict) # maximum node label (tag)
    cmd_args.edge_feat_dim = 0
    if node_feature_flag == True:
        cmd_args.attr_dim = node_features.shape[1] # dim of node features (attributes)
    else:
        cmd_args.attr_dim = 0

    print('# classes: %d' % cmd_args.num_class)
    print('# maximum node tag: %d' % cmd_args.feat_dim)
    print(label_dict)
    if cmd_args.test_number == 0:
        train_idxes = np.loadtxt('data/%s/10fold_idx/train_idx-%d.txt' % (cmd_args.data, cmd_args.fold), dtype=np.int32).tolist()
        test_idxes = np.loadtxt('data/%s/10fold_idx/test_idx-%d.txt' % (cmd_args.data, cmd_args.fold), dtype=np.int32).tolist()
        # also return label_dict here, since main.py unpacks three values from load_data()
        return [g_list[i] for i in train_idxes], [g_list[i] for i in test_idxes], label_dict
    else:
        return g_list[: n_g - cmd_args.test_number], g_list[n_g - cmd_args.test_number :], label_dict

@xianggai

xianggai commented Sep 9, 2022

Hello, I would like to ask: when some graph labels are missing, is the model you proposed able to predict the classes of the unlabeled graphs after training on the labeled ones?

@muhanzhang
Owner

@xianggai Yes. This is exactly the graph classification setting which DGCNN is proposed to address.
