This repository has been archived by the owner on Dec 19, 2023. It is now read-only.

error in results due to negative sampling data leakage #28

Open
hshahbazi opened this issue Jul 26, 2020 · 13 comments

@hshahbazi

hshahbazi commented Jul 26, 2020

Hi,
I've noticed an issue in this work, in the module create_batch.py: negative samples are filtered against the test data. valid_triples_dict is defined as:

self.valid_triples_dict = {j: i for i, j in enumerate(
    self.train_triples + self.validation_triples + self.test_triples)}

and later used in:

while (random_entities[current_index], self.batch_indices[last_index + current_index, 1],
       self.batch_indices[last_index + current_index, 2]) in self.valid_triples_dict.keys():

Because negative samples are never drawn from validation or test triples, the model is taught during training that the test data are valid triples. We are not allowed to use test data during training. This leakage makes the performance go higher and higher with more epochs for no good reason.
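For illustration, here is a minimal, self-contained sketch of a fix (my own code, not the repository's actual API): build the rejection filter for negative sampling from the training split only.

import numpy as np

def corrupt_head(triple, train_triples, num_entities, rng=None):
    # Sample a corrupted head for (h, r, t). The rejection filter is built
    # from the *training* triples only; filtering against validation/test
    # triples, as valid_triples_dict does, leaks those splits into training.
    rng = rng or np.random.default_rng()
    train_set = set(train_triples)            # filter from training data only
    _, r, t = triple
    candidate = int(rng.integers(0, num_entities))
    while (candidate, r, t) in train_set:     # reject known training triples only
        candidate = int(rng.integers(0, num_entities))
    return candidate

# Example usage with toy ids:
train = [(0, 0, 1), (1, 0, 2)]
print(corrupt_head((0, 0, 1), train, num_entities=5))

The full train+validation+test dictionary is still fine for filtered evaluation, where excluding all known true triples from the ranking is legitimate; it just must not drive negative sampling during training.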

@378egg

378egg commented Sep 10, 2020

I noticed this too. So the results in the paper are in error?

@hshahbazi
Author

Yes.

@378egg

378egg commented Sep 13, 2020

Sorry, have you tried training for more epochs in your experiments?

@hshahbazi
Author

I experimented with more epochs and with other models such as TuckER, and confirmed that the learning approach is not right.

@wbaz4321

> I experimented with more epochs and with other models such as TuckER, and confirmed that the learning approach is not right.

Sorry, I noticed this too, and I found that another paper, 'A Re-evaluation of Knowledge Graph Completion Methods', also mentions this issue. But I suppose the model structure, i.e. the attention mechanism and the encoder-decoder architecture, is still valid, and the problem is in the training and evaluation protocol, right? I found the idea in this paper interesting and plan to modify and apply it in my work.

@hshahbazi
Author

Not sure what you mean by valid. Experimentally, the model gives Hits@10 = 33% on FB15k-237, which is very low for this task. So, experimentally, we cannot reach any conclusion about the validity of any part.

@wbaz4321

> Not sure what you mean by valid. Experimentally, the model gives Hits@10 = 33% on FB15k-237, which is very low for this task. So, experimentally, we cannot reach any conclusion about the validity of any part.

Many thanks for your reply. Could you please tell me how you ran the experiment? In this code:

while (random_entities[current_index], self.batch_indices[last_index + current_index, 1],
       self.batch_indices[last_index + current_index, 2]) in self.valid_triples_dict.keys():

do you change valid_triples_dict so that it only includes training data?

@chauhanjatin10
Collaborator

chauhanjatin10 commented Sep 25, 2020

Hi @wbaz4321 and @hshahbazi ,
Thanks for showing interest in the work. As you guys have pointed out the issues regarding leakage in the negative sampling, we have been doing experiments to rectify this. After extensive experiments we have empirically concluded that the issue was the use of ConvKB as decoder. When we replace ConvKB with ConvE, the Hits@10 on Freebase reach upto 53% with correct training protocol as proposed in "A Re-evaluation of Knowledge Graph Completion Methods" using the ConvE code provided here - https://github.com/svjan5/kg-reeval . Thus, we have evidence supporting the claims of our encoder model which will be updated soon by my co-author @deepakn97 . Also, as shown in the works - https://arxiv.org/pdf/2009.08694.pdf and https://www.aclweb.org/anthology/2020.acl-main.526.pdf , KBGAT's proposed encoder model indeed works well provided that the decoder is strong enough. I hope this clarifies some of your doubts @hshahbazi and @wbaz4321 .
Thanks and regards
Jatin

@jinjinjin9

jinjinjin9 commented Sep 25, 2020

So, just to be sure, can we conclude the following by now?

  1. With test leakage, the full system provides very good scores.
  2. Without test leakage (i.e., after removing the test triples from valid_triples_dict), the full system provides much worse scores.

I don't see how this should be "rectified" by not using ConvKB as the decoder. Is there a bug in ConvKB, then? Can you describe the experiments that "rectify the results" more closely? The two points above just suggest that the bug was in fact in the training, caused by the data leakage.

I am sorry for the inconvenience, but I am working with your code, and as of now I am very confused.

@deepakn97
Owner

As you can see in this paper (https://www.aclweb.org/anthology/2020.acl-main.489.pdf), we are dealing with two different issues here:

  1. Test data leakage.
  2. A wrong evaluation strategy for the decoder part.

The above-mentioned paper removes the test leakage and uses the random evaluation strategy to get updated results for our paper; however, they don't experiment with the hyperparameters. After re-evaluating our model with updated hyperparameters, we were able to get ~44 Hits@10 using ConvKB (these results are after removing the test data leakage and using the correct evaluation protocol).
ConvKB gives significantly worse results under the correct evaluation strategy: only 42.1 Hits@10 compared to the reported 51.7 Hits@10.

The paper also mentions that ConvE does not suffer from this problem, and that the results of ConvE are independent of the evaluation protocol. More specifically, many of the neurons in ConvKB become zero after the ReLU activation, so ConvKB predicts the same score for numerous triples, which causes the stark difference in results across evaluation protocols. Following this reasoning, we decided to experiment with ConvE instead of ConvKB.

I hope this is helpful to you.
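To make the tie issue concrete, here is a small self-contained sketch (my own illustration, not code from the repo or the paper) of how the common tie-breaking protocols diverge when a decoder assigns identical scores to many candidates:

import numpy as np

def rank_with_ties(scores, target_idx, policy="random", seed=0):
    # Rank of the target among all candidates when scores can tie.
    # 'top' places the target above every tied candidate (optimistic,
    # inflates metrics), 'bottom' places it below them (pessimistic),
    # 'random' breaks ties uniformly at random, as recommended in
    # "A Re-evaluation of Knowledge Graph Completion Methods".
    rng = np.random.default_rng(seed)
    target = scores[target_idx]
    better = int(np.sum(scores > target))       # strictly better candidates
    tied = int(np.sum(scores == target)) - 1    # other candidates with equal score
    if policy == "top":
        return better + 1
    if policy == "bottom":
        return better + tied + 1
    return better + int(rng.integers(0, tied + 1)) + 1

# If ReLU zeroes out most scores, the target ties with everyone:
scores = np.zeros(100)
print(rank_with_ties(scores, 0, "top"))      # 1   -> looks great
print(rank_with_ties(scores, 0, "bottom"))   # 100 -> looks terrible
print(rank_with_ties(scores, 0, "random"))   # somewhere in between

A decoder that rarely produces exact ties gives nearly the same rank under all three policies, which is why ConvE's numbers are protocol-independent.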

@jinjinjin9

jinjinjin9 commented Sep 25, 2020

Thanks for the quick response!

But I wonder... if it's now 53 with the random evaluation strategy, after you freshly tuned the hyperparameters, it would still fall behind TuckER (53.6) and be the same as RotatE (53.0), in contrast to what's reported in the original results table in https://www.aclweb.org/anthology/2020.acl-main.489.pdf.

Can you also report new results for the original task formulation (no randomized evaluation strategy), where the paper reported 62.2 Hits@10?

@deepakn97
Owner

The reported results (with ConvE as decoder) are with the randomized evaluation strategy, and they are independent of the evaluation strategy. The task formulation is still the same, but the evaluation strategy used previously was wrong, so we don't think you should compare against results obtained with the wrong evaluation strategy.
We also understand your concern that, in the current scenario, this is not a big leap in the results. However, we still feel that the contribution of the paper is the attention model used to learn graph embeddings. As the research progresses, different decoders can be used in combination with KBGAT.
We will try to provide a way to use different decoders (TuckER, RotatE, etc.) with our attention model. For now, we would appreciate it if you could save the embeddings generated by the attention model and use them to train the decoder models.
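For anyone who wants to try this in the meantime, a rough sketch of the suggested workflow; the class, attribute, and file names below are my assumptions, not the repository's exact interface:

import torch
import torch.nn as nn

# Stand-in for the trained encoder; in practice this would be the repo's
# GAT model after training (the embedding attribute names are assumptions).
class EncoderStub(nn.Module):
    def __init__(self, n_ent=14541, n_rel=237, dim=200):  # FB15k-237 sizes
        super().__init__()
        self.final_entity_embeddings = nn.Parameter(torch.randn(n_ent, dim))
        self.final_relation_embeddings = nn.Parameter(torch.randn(n_rel, dim))

model = EncoderStub()

# 1) After encoder training, save the learned embeddings to disk.
torch.save({"entity_emb": model.final_entity_embeddings.detach().cpu(),
            "relation_emb": model.final_relation_embeddings.detach().cpu()},
           "kbgat_encoder_embeddings.pt")

# 2) In the decoder's training script (TuckER, RotatE, ConvE, ...), load them
#    as the initial (frozen or fine-tunable) entity/relation representations.
ckpt = torch.load("kbgat_encoder_embeddings.pt")
entity_emb = nn.Parameter(ckpt["entity_emb"])
relation_emb = nn.Parameter(ckpt["relation_emb"])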

@wbaz4321

Thanks for your efforts to adjust the model. I also personally think the encoding process is useful, and I am currently trying to modify it for my work; thanks very much for providing a good reference. I have another question about the classes SpecialSpmmFinal() and SpecialSpmmFunctionFinal(): could you please tell me what these functions do? In the paper, once we have obtained the exp values of the absolute attention values, we can simply sum them up to get e_rowsum, so why do we feed the values into the special function? Any suggestions are very appreciated.

Also, I found that someone else mentions this issue as well, so if you can help us understand it, we would be very grateful.
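(Not an answer from the authors, but for context, a small sketch of what such a special sparse matmul accomplishes; this is my own illustration, not the repository's SpecialSpmmFinal code.) The exp attention values live only on the graph's edges, so summing them per source node is a sparse-matrix times ones-vector product rather than a plain dense row sum:

import torch

N = 4                                        # number of nodes
edge_index = torch.tensor([[0, 0, 1, 2, 3],  # source node of each edge (row)
                           [1, 2, 0, 3, 3]]) # target node of each edge (col)
exp_att = torch.rand(edge_index.size(1))     # exp of the edge attention scores

# Sparse N x N attention matrix times a ones vector = per-node row sums,
# i.e. the e_rowsum normalizer, computed without materializing N x N densely.
att_sparse = torch.sparse_coo_tensor(edge_index, exp_att, (N, N))
e_rowsum = torch.sparse.mm(att_sparse, torch.ones(N, 1))

A dense N x N attention matrix would cost O(N^2) memory; the sparse product touches only the existing edges, and a custom autograd Function (as in the repo) is presumably there to keep the backward pass equally memory-efficient.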

@chauhanjatin10 changed the title from "test data is used during training --> results are not valid" to "error in results due to negative sampling data leakage" on Mar 20, 2021