
Not use public split on citation networks #1

Closed
hengruizhang98 opened this issue Mar 30, 2021 · 5 comments

Comments

@hengruizhang98

Hi, thanks for your nice work. I noticed that in the original paper you state that you use the public split on the citation networks, but in this repo it seems that you use a random split. Could you explain this?

@zekarias-tilahun
Owner

zekarias-tilahun commented Mar 30, 2021

Hi, thank you for your interest.

On line 28 of data.py you can see that we invoke utils.create_masks(data=dataset.data) to create the train/val/test masks/splits. If you navigate to the create_masks function inside the utils.py module, on line 200 you can find that we first check whether the data already contains a validation mask (if not hasattr(data, "val_mask")). Since the citation-network datasets have a val_mask attribute, we do not create a new one.
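The logic described above can be sketched roughly as follows. This is a simplified stand-in for the actual create_masks in utils.py, not a copy of it; the SimpleNamespace data object, the 60/40 ratio, and the list-of-bools mask representation are assumptions for illustration only:

```python
import random
from types import SimpleNamespace

def create_masks(data, num_nodes, train_ratio=0.6, seed=0):
    """Sketch of the described behaviour: only build new random
    masks when the dataset does not already ship with a val_mask."""
    if not hasattr(data, "val_mask"):
        idx = list(range(num_nodes))
        random.Random(seed).shuffle(idx)
        cut = int(train_ratio * num_nodes)
        train_set = set(idx[:cut])
        # Boolean masks over all nodes, as in PyTorch Geometric datasets.
        data.train_mask = [i in train_set for i in range(num_nodes)]
        data.val_mask = [i not in train_set for i in range(num_nodes)]
    return data

# A citation-network dataset already carries val_mask, so it is left alone.
cora_like = SimpleNamespace(val_mask=[True, False, False])
create_masks(cora_like, num_nodes=3)
print(cora_like.val_mask)  # unchanged: [True, False, False]
```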

@hengruizhang98
Author

hengruizhang98 commented Mar 30, 2021

Thanks for your response. As far as I know, in the self-supervised setting (using the 'Cora' dataset as an example), all 2708 nodes are used in the pretraining step. In the linear evaluation step, only the 140 training nodes are used to train the linear classifier, and the 1000 testing nodes are used only for evaluation. However, it seems that you split the testing nodes into train/test sets with a 0.6/0.4 ratio (600 for train and 400 for test).
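Concretely, the two protocols differ as follows. The node counts are the standard public Planetoid split for Cora; the 60/40 figures are the ones described in this thread:

```python
# Public Planetoid split for Cora.
num_nodes, num_train, num_test = 2708, 140, 1000

# Standard linear-evaluation protocol:
# pretrain on all 2708 nodes, fit the classifier on 140, evaluate on 1000.
standard = {"fit": num_train, "eval": num_test}

# Protocol described in this thread:
# the 1000 test nodes are themselves split 60/40 for the classifier.
repo = {"fit": int(0.6 * num_test), "eval": num_test - int(0.6 * num_test)}

print(standard)  # {'fit': 140, 'eval': 1000}
print(repo)      # {'fit': 600, 'eval': 400}
```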

@zekarias-tilahun
Owner

Oh! I misunderstood your question. In that case, you're right. We use a random (60/40) split of the test set for the LogisticRegression classifier.
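A minimal sketch of such a 60/40 random split over the test-node indices (a pure-Python illustration of the idea, assuming nothing about the repo's actual implementation; the function name and fixed seed are made up):

```python
import random

def split_indices(indices, train_ratio=0.6, seed=0):
    """Randomly partition a list of node indices into two disjoint parts."""
    rng = random.Random(seed)
    shuffled = rng.sample(indices, len(indices))
    cut = int(train_ratio * len(indices))
    return shuffled[:cut], shuffled[cut:]

test_nodes = list(range(1000))  # e.g. the 1000 Cora test nodes
clf_train, clf_test = split_indices(test_nodes)
print(len(clf_train), len(clf_test))  # 600 400
```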

@hengruizhang98
Author

Yes. So I guess you need to update your code and manuscript to compare fairly with other models.

@zekarias-tilahun
Owner

BTW, a study discusses how using different splits can lead to significantly different outcomes. Thus, we mention the split to indicate which of the publicly available splits we used for the three citation datasets. However, I agree that we need to state this clearly in the manuscript, and I'll update it! Thank you for bringing this to light.
