Wyatt Hamabe
Mar 9, 2024

Findings Regarding Homophily

Ideas:

Week 6

What happens if all edges in a class are connected to a single node?
  • On the CORA dataset, there was a change of 0.0020 from the ground truth.
  • How was this implemented?
    for i in range(0, num_classes):
        hub = homophilic_set[i][0]  # the first node in each class acts as the hub
        for j in range(1, len(homophilic_set[i])):
            node = homophilic_set[i][j]  # use the node id, not the loop index
            if not modified_graph.has_edges_between(hub, node):
                modified_graph.add_edges(node, hub)
                modified_graph.add_edges(hub, node)
  • The first node in each class was selected and all other nodes were connected.
  • How could we modify this?
    • Are there specific nodes that would result in no change (those with higher degree, or those with lower degree)?
    • What happens if all nodes are connected to each other? This is relatively computationally expensive, but can be done on a smaller graph like CORA.
What happens if only two edges in a single class are connected?
  • On the CORA dataset, there was a change anywhere from 0.0 to 0.0030 relative to the ground truth.

  • How was this implemented?

    for i in range(0, num_classes):
        r_node = random.choice(homophilic_set[i])
        n_node = random.choice(homophilic_set[i])
    
        while r_node == n_node or modified_graph.has_edges_between(n_node, r_node):
            n_node = random.choice(homophilic_set[i])
    
        modified_graph.add_edges(r_node, n_node)
        modified_graph.add_edges(n_node, r_node)
        print("added: ", r_node, " - ", n_node)
  • Two random nodes (in the same class) were connected if they weren't previously.

  • How could we modify this?

    • Is there a number of random edges we can add? At what threshold does the change begin to grow past 0.0030?
What happens if all edges in a single class are connected?
  • As expected, this took a little while to compute. There are roughly 1,000 nodes in the test mask of the CORA dataset, so in the worst case there were on the order of 1,000,000 edge additions.

  • How was this implemented?

    for i in range(0, num_classes):
        for j in range(0, len(homophilic_set[i])):
            for k in range(j + 1, len(homophilic_set[i])):
                j_node = homophilic_set[i][j]
                k_node = homophilic_set[i][k]
                if j_node == k_node or modified_graph.has_edges_between(j_node, k_node):
                    continue
    
                modified_graph.add_edges(j_node, k_node)
                modified_graph.add_edges(k_node, j_node)
                print("added: ", j_node, " - ", k_node)
  • It may come as no surprise that there was a large change in results: accuracy changed by 0.2310. Evidently, this would not be a valid minimal set. It wasn't expected to be; rather, this was a test of the runtime of these processes, but the result was interesting to look at.

Interesting Results! What happens when edges are added relative to the number of homophilic edges a node already has?

I developed this according to the formula |added(n)| = ⌊c × h(n)⌋, where c is some constant and h(n) is the number of homophilic edges of node n; added(n) is the set of edges added from that node. Essentially, this caps the number of edges that can be added according to a node's pre-existing homophilic degree. For example, given c = 0.1, if a node u has 23 homophilic edges, then ⌊0.1 × 23⌋ = 2 randomly selected homophilic edges (u, v1) and (u, v2) will be added from it.
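A quick numeric check of this cap (the helper name here is mine, for illustration only):

```python
import math

def edge_budget(h, c):
    # |added(n)| = floor(c * h(n)): number of edges added for a node
    # with h pre-existing homophilic edges
    return math.floor(c * h)

print(edge_budget(23, 0.10))  # 2, matching the example above
print(edge_budget(23, 0.15))  # 3
print(edge_budget(5, 0.10))   # 0: low-degree nodes gain no edges
```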

  • Initially, I tried using ceiling, but that fails for an obvious reason: ⌈c × h(n)⌉ is at least 1 for any node with a single homophilic edge, so edges get added at nearly every node regardless of c.

  • The implementation is as follows:

    for i in range(0, num_classes):
        for j in range(0, len(homophilic_set[i])):
            edges = modified_graph.out_edges(homophilic_set[i][j])
    
            same_class_edges = set()
            for e in edges[1]:
                if (e.item() in homophilic_set[i]):
                    same_class_edges.add(e.item())
    
            for k in range(0, math.floor(len(same_class_edges) * c)):
                a_node = homophilic_set[i][j]
                b_node = random.choice(homophilic_set[i])
    
                while a_node == b_node or modified_graph.has_edges_between(a_node, b_node):
                    b_node = random.choice(homophilic_set[i])
    
                modified_graph.add_edges(a_node, b_node)
                modified_graph.add_edges(b_node, a_node)
                class_and_added[i].append(len(same_class_edges))
  • The results are as follows:

    • c = 0.10: 4 added edges, no change in accuracy
    • c = 0.15: 11 added edges, accuracy change = 0.0010
    • c = 0.20: 28 added edges, accuracy change = 0.0040
    • c = 0.25: 47 added edges, accuracy change = 0.0030
    • c = 0.30: 55 added edges, accuracy change = 0.0070
    • c = 1/3: 128 added edges, accuracy change = 0.0100
    • c = 0.35: 138 added edges, accuracy change = 0.0250
    • c = 0.50: 419 added edges, accuracy change = 0.0460
  • Assessment:

    • As expected, the more edges you add, the more the accuracy changes (and likely decreases).
    • However, it's notable that you can multiply a node's homophilic degree by 1.25 (c = 0.25) without seeing a large change.
      • It seems possible there's an intelligent choice of edges (rather than random!) that would raise this threshold while maintaining accuracy.
  • Interesting questions this poses:

    • Is this a trend that extends to degree-only heuristics? Obviously, this approach combines degree analysis with homophily, and it leans on the definition of homophily. However, if the trend holds regardless of homophily, then homophily itself would be unimportant here.
    • Can we selectively pair nodes with higher (or lower) homophilic degrees? What threshold would emerge to cause no change?
    • This seems to show a general trend of edges added increasing change in accuracy. However, it's possible that this isn't always true.
Very interesting Results! [MISTAKENLY DROPPED HOMOPHILIC ASPECT]
  • I did an entire experiment before realizing that I'd dropped the homophilic aspect of the experiment. I will pass these results to Niyati for her findings.
  • Will add basic results to research log.

Possible Ideas:

  • Is there some pattern which would not change the value? Some geometric pattern?
  • Can we increase the number of similar nodes by some threshold?
    • Yes! See "Interesting Results"
  • Can we selectively pair nodes with higher (or lower) homophilic degrees? What threshold would emerge to cause no change?

Overarching questions:

  • Is there an intelligent way we could pick edges such that there would be no change?
  • To some degree, the change seems to relate most to the degree of nodes (no pun intended).

Week 8

Here are my findings so far in Week 8.

  • For context, the ground truth accuracy for this model is 0.749.
    • I had a few new driving questions/ideas that I wanted to explore:
    • What happens if we turn a connected component, limited to a certain class, into a clique?
      • Clarification: Using the original graph, what happens if we turn a connected component, where all of the nodes are from the same class, into a clique?
      • Due to the small size of the CORA dataset, there are few connected components with size > 1 that reside in a single class: only 2, to be exact. While the accuracy was not changed, only 3 edges were added, leaving this experiment relatively uninsightful.
        data = dataset.get_data()[0]
        modified_graph = data
        
        init_edges = len(modified_graph.edge_index[1])
        
        G, x, y, train_mask, test_mask = convert_to_networkx(modified_graph)
        
        cc = sorted(nx.strongly_connected_components(G), key=len, reverse=True)
        
        for i in cc:
            s = set(i)
            for j in range(0, num_classes):
                if s == s.intersection(homophilic_set[j]):
                    make_clique(G, s)
        
        modified_graph = convert_to_pyg(G, x, y, train_mask, test_mask)
        final_edges = len(modified_graph.edge_index[1])
        
        output_accuracy_change(ground_truth, test_model(model, modified_graph)) 
        number_added_edges(init_edges, final_edges, is_undirected=True)
        
    • What happens if we turn a connected component, limited to a certain class, with a certain density into a clique?
      • Clarification: Using the original graph, what happens if we turn a connected component with a certain density, where all of the nodes are from the same class, into a clique?
      • Similar to the above example, the CORA dataset is too limited. No nodes were changed.
        data = dataset.get_data()[0]
        modified_graph = data
        
        init_edges = len(modified_graph.edge_index[1])
        
        G, x, y, train_mask, test_mask = convert_to_networkx(modified_graph)
        
        cc = sorted(nx.strongly_connected_components(G), key=len, reverse=True)
        
        for i in cc:
            s = set(i)
            for j in range(0, num_classes):
                if (s == s.intersection(homophilic_set[j])) and (nx.density(G.subgraph(list(s))) > 0.2):
                    make_clique(G, s)
        
        modified_graph = convert_to_pyg(G, x, y, train_mask, test_mask)
        final_edges = len(modified_graph.edge_index[1])
        
        output_accuracy_change(ground_truth, test_model(model, modified_graph)) 
        number_added_edges(init_edges, final_edges, is_undirected=True)
        
    • What happens if we turn a connected component, when considering ONLY edges in a certain class, into a clique?
      • Clarification: After building a subgraph for each class, what happens if we turn a connected component, in that subgraph, into a clique?
      • This is where things start to get interesting.
      • First, I made a subgraph specific to the class, and then ran a connected components algorithm on it.
      • Upon cliquing an entire connected component, 4211 edges are added, an increase of 79.78%.
      • The accuracy rose by 0.0230.
      • While not zero, this is an incredibly small change for increasing the edge set by 4211 edges.
        data = dataset.get_data()[0]
        modified_graph = data
        
        init_edges = len(modified_graph.edge_index[1])
        
        G, x, y, train_mask, test_mask = convert_to_networkx(modified_graph)
        
        for j in range(0, len(homophilic_set)):
            new_G = G.subgraph(homophilic_set[j])
            cc = sorted(nx.strongly_connected_components(new_G), key=len, reverse=True)
            for i in cc:
                s = set(i)
                if len(s) > 1:
                    make_clique(G, s)
        
        modified_graph = convert_to_pyg(G, x, y, train_mask, test_mask)
        final_edges = len(modified_graph.edge_index[1])
        
        output_accuracy_change(ground_truth, test_model(model, modified_graph)) 
        number_added_edges(init_edges, final_edges, is_undirected=True)
        
    • What happens if we turn a connected component with a certain density c, when considering ONLY edges in a certain class, into a clique?
      • Clarification: After building a subgraph for each class, what happens if we turn a connected component with a certain density, in that subgraph, into a clique?

      • Again, I made subgraphs respective to the class.

      • For the density threshold c, I tested all multiples of 0.05 from 0.05 to 0.75. The results are as follows:

        Value of c | Change in Edges | Change in Accuracy
        0.05       | 2076 (39.33%)   | +0.0090
        0.1        | 816 (15.46%)    | +0.0080
        0.15       | 414 (7.84%)     | +0.0050
        0.2        | 338 (6.4%)      | +0.0040
        0.25       | 161 (3.05%)     | +0.0020
        0.3        | 140 (3.05%)     | +0.0020
        0.35       | 120 (2.65%)     | +0.0020
        0.4        | 120 (2.27%)     | +0.0020
        0.45       | 99 (1.88%)      | +0.0010
        0.5        | 83 (1.57%)      | +0.0010
        0.55       | 20 (0.38%)      | +0.0010
        0.6        | 20 (0.38%)      | +0.0010
        0.65       | 20 (0.38%)      | +0.0010
        0.7        | 0               | 0
        0.75       | 0               | 0
      • We can see that an extremely large number of edges can be added without much change in classification accuracy; however, in no instance is the accuracy change exactly 0.00.

        c_values = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.70, 0.75, 0.80, 0.85]
        
        for c in c_values:
            data = dataset.get_data()[0]
            modified_graph = data
            
            init_edges = len(modified_graph.edge_index[1])
            
            G, x, y, train_mask, test_mask = convert_to_networkx(modified_graph)
        
            for j in range(0, len(homophilic_set)):
                new_G = G.subgraph(homophilic_set[j])
                cc = sorted(nx.strongly_connected_components(new_G), key=len, reverse=True)
                for i in cc:
                    s = set(i)
                    if len(s) > 1 and nx.density(G.subgraph(i)) >= c:
                        make_clique(G, s)
            
            modified_graph = convert_to_pyg(G, x, y, train_mask, test_mask)
            final_edges = len(modified_graph.edge_index[1])
            
            output_accuracy_change(ground_truth, test_model(model, modified_graph)) 
            number_added_edges(init_edges, final_edges, is_undirected=True)
            print("For c value:", c)
        
    • What happens if we increase the density of a connected component, when considering ONLY edges in a certain class, by a certain threshold?
      • Clarification: After building a subgraph for each class, what happens if we increase the density of a connected component, in that subgraph, by some constant c ?
      • For the density-increase constant c, I tested: 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, 1.75, 1.8, 1.85, 1.9, 1.95, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100
      • The results are as follows:
        Value of c | Change in Edges | Change in Accuracy
        1.05       | 62 (1.17%)      | +0.0010
        1.1        | 78 (1.48%)      | +0.0020
        1.15       | 99 (1.88%)      | +0.0050
        1.2        | 114 (2.16%)     | +0.0040
        1.25       | 135 (2.56%)     | +0.0030
        1.3        | 159 (3.01%)     | +0.0040
        1.35       | 186 (3.52%)     | +0.0100
        1.4        | 202 (3.83%)     | +0.0080
        1.45       | 227 (4.30%)     | +0.0110
        1.5        | 243 (4.60%)     | +0.0100
        1.55       | 264 (5.00%)     | +0.0100
        1.6        | 281 (5.32%)     | +0.0070
        1.65       | 305 (5.78%)     | +0.0140
        1.7        | 330 (6.25%)     | +0.0110
        1.75       | 350 (6.63%)     | +0.0120
        1.8        | 367 (6.95%)     | +0.0160
        1.85       | 390 (7.39%)     | +0.0130
        1.9        | 408 (7.73%)     | +0.0160
        1.95       | 424 (8.03%)     | +0.0150
        2          | 433 (8.20%)     | +0.0140
        3          | 767 (14.53%)    | +0.0200
        4          | 1076 (20.39%)   | +0.0200
        5          | 1363 (25.82%)   | +0.0220
        6          | 1620 (30.69%)   | +0.0240
        7          | 1862 (35.28%)   | +0.0240
        8          | 2086 (39.52%)   | +0.0230
        9          | 2281 (43.22%)   | +0.0230
        10         | 2464 (46.68%)   | +0.0230
        100        | 4211 (79.78%)   | +0.0230
      • Again, we are able to add a large number of edges with little change, but no setting results in exactly 0 change.
      • However, the most interesting result is that the accuracy change appears to converge (around +0.0230) as c grows. I have no explanation as to why yet, and it's something I'm going to be thinking about/looking into over the next few weeks.
      c_values = [1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.70, 1.75, 1.80, 1.85, 1.9, 1.95, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
      # strange...it starts to plateau after 5-6
      # definitely something, I'll have to think on what...
      
      
      def increase_density(G, s, threshold):
          # Add random intra-component edges until the target density is reached
          while nx.density(G.subgraph(s)) < min(threshold, 1):
              random_pair = random.sample(list(s), 2)  # random.sample needs a sequence, not a set
              add_edge(G, random_pair[0], random_pair[1], undirected=True)
      
      for c in c_values:
          data = dataset.get_data()[0]
          modified_graph = data
          
          init_edges = len(modified_graph.edge_index[1])
          
          G, x, y, train_mask, test_mask = convert_to_networkx(modified_graph)
      
          for j in range(0, len(homophilic_set)):
              new_G = G.subgraph(homophilic_set[j])
              cc = sorted(nx.strongly_connected_components(new_G), key=len, reverse=True)
              for i in cc:
                  s = set(i)
                  if len(s) > 1:
                      threshold = nx.density(G.subgraph(s))
                      increase_density(G, s, c*threshold)
          
          modified_graph = convert_to_pyg(G, x, y, train_mask, test_mask)
          final_edges = len(modified_graph.edge_index[1])
          
          output_accuracy_change(ground_truth, test_model(model, modified_graph)) 
          number_added_edges(init_edges, final_edges, is_undirected=True)
          print("For c value:", c)
      
    • What happens if we clique the graph irrespective to classes? Meaning, all connected components become cliques?
      • The model fails, expectedly: over 3 million edges were added and accuracy decreased by 0.5520.
      • This was anticipated; mostly, I was just interested to see what would happen.
        data = dataset.get_data()[0]
        modified_graph = data
        
        init_edges = len(modified_graph.edge_index[1])
        
        G, x, y, train_mask, test_mask = convert_to_networkx(modified_graph)
        
        cc = sorted(nx.strongly_connected_components(G), key=len, reverse=True)
        
        for i in cc:
            s = set(i)
            make_clique(G, s)
        
        modified_graph = convert_to_pyg(G, x, y, train_mask, test_mask)
        final_edges = len(modified_graph.edge_index[1])
        
        output_accuracy_change(ground_truth, test_model(model, modified_graph)) 
        number_added_edges(init_edges, final_edges, is_undirected=True)
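The helpers used throughout these experiments (make_clique, number_added_edges, and the convert_to_networkx/convert_to_pyg conversions) aren't shown in this log. A minimal sketch of the two graph-editing helpers, under the assumption that the graph is a networkx DiGraph storing each undirected edge as a reciprocal pair of directed edges; the actual implementations may differ:

```python
import itertools

import networkx as nx

def make_clique(G, nodes):
    # Connect every pair of nodes in both directions, since each
    # undirected citation edge is stored as two directed edges.
    for u, v in itertools.combinations(nodes, 2):
        G.add_edge(u, v)
        G.add_edge(v, u)

def number_added_edges(init_edges, final_edges, is_undirected=True):
    added = final_edges - init_edges
    if is_undirected:
        added //= 2  # two directed entries per undirected edge
    print("edges added:", added)
    return added

# Tiny demo: clique a 3-node set in an empty graph
G = nx.DiGraph()
make_clique(G, [0, 1, 2])
print(G.number_of_edges())  # 6 directed edges, i.e. 3 undirected
```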
        

Other Datasets:

Dataset  | Homophily
PubMed   | 0.802
CORA     | 0.810
CiteSeer | 0.734
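For reference, these values are edge homophily: the fraction of edges whose endpoints share a class label. A minimal sketch with toy data (the function name and graph are mine; the numbers above come from the citation datasets themselves):

```python
def edge_homophily(edges, labels):
    # Fraction of (directed) edges connecting same-label endpoints
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

# Toy graph: 4 nodes in two classes, stored with reciprocal directed edges
labels = [0, 0, 1, 1]
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
print(round(edge_homophily(edges, labels), 3))  # 0.667
```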

Was there any data that didn't corroborate the previous experiments on the CORA dataset?

PubMed:

  • The problem with PubMed was that its nodes generally seemed to have fewer edges; so while the results were the same, it took higher thresholds to achieve them.
    • In other words: the results were similar when compared against the number of edges added, rather than against any constant c.
  • Again, most connected components were of size 1 or 2, which led to little change. However, the results matched once higher constants added more edges to PubMed.

CiteSeer:

  • Similar to PubMed, the results seemed comparable with respect to the number of edges added, rather than to any constant.
  • CiteSeer did exhibit slightly more extreme results, but this could be because the model didn't train as thoroughly. The results weren't alarmingly worse (less than 0.01 difference for each). I'd like to train the model to a higher accuracy (currently, it's at 54%) and see if that affects anything; however, I have trained the current model we have been using to convergence.