Creating a custom dataset #84

Open
YanivDorGalron opened this issue Apr 28, 2024 · 2 comments

@YanivDorGalron

Hey.

How should a custom dataset be implemented in order to integrate well with the overall pipeline?

image

Should I create those files? If so, what format should they use and what information must they include?

Thanks in advance,
Yaniv

@shenyangHuang shenyangHuang self-assigned this Apr 28, 2024
@shenyangHuang shenyangHuang added the question Further information is requested label Apr 28, 2024
@shenyangHuang
Owner

shenyangHuang commented Apr 28, 2024

Hi Yaniv,

Thanks for your interest in our work. Feel free to add your custom dataset in your local fork if you prefer.

Here are a few tips:

  • Add a new folder in tgb/datasets named after your dataset, say tgbl_cool. Put your edgelist file there, following the format of the other *_edgelist.csv files (for example, tgbl-review_edgelist_v2.csv), and make sure you name it tgbl-cool_edgelist.csv.
  • Provide an entry for tgbl-cool in DATA_URL_DICT, DATA_VERSION_DICT, and DATA_EVAL_METRIC_DICT in tgb/utils/info.py.
  • In tgb/utils/pre_process.py, add a new function that specifies how to process your data from the edgelist, along with node features etc.
  • In tgb/linkproppred/dataset.py, add the dataset processing logic to the generate_processed_files() function (see line 216).
  • (Optional) For link property prediction datasets, generate negative samples for tgbl_cool: copy an example negative sample generation script such as tgbl-wiki_neg_generator.py, modify it as needed, put it into the tgb/datasets/tgbl_cool folder, and run it to generate the negative samples.
  • (Optional) Test your script by adding a TGN example to a new folder (tgbl-cool) under examples/linkproppred. You can reuse the TGN examples from other folders.
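
As a rough illustration of the pre-processing tip, here is a minimal sketch of a function that reads an edgelist and maps raw node names to integer IDs. The function name `load_edgelist` and the column names `ts`, `src`, `dst`, `w` are assumptions made for this example, not the actual TGB layout or API; check an existing *_edgelist.csv and tgb/utils/pre_process.py for the real conventions.

```python
import csv
import io


def load_edgelist(handle):
    """Read (ts, src, dst, w) rows and map raw node names to integer IDs.

    Hypothetical helper: the column names are assumed for illustration only.
    """
    node_ids = {}  # raw node name -> contiguous integer ID
    edges = []
    for row in csv.DictReader(handle):
        # setdefault assigns the next free ID the first time a node is seen
        src = node_ids.setdefault(row["src"], len(node_ids))
        dst = node_ids.setdefault(row["dst"], len(node_ids))
        edges.append((int(row["ts"]), src, dst, float(row["w"])))
    edges.sort(key=lambda e: e[0])  # keep edges in chronological order
    return edges, node_ids


# Tiny in-memory stand-in for a tgbl-cool_edgelist.csv file
sample = io.StringIO("ts,src,dst,w\n3,alice,bob,1.0\n1,bob,carol,0.5\n")
edges, node_ids = load_edgelist(sample)
```

With a real file, the same function could be called on `open(path, newline="")` instead of the in-memory sample.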

With the above steps, you should be able to add any custom local dataset to your locally installed TGB version.
Hope that answers your questions.

Best,
Andy

@YanivDorGalron
Author

YanivDorGalron commented May 1, 2024

Hi Andy,

Thanks for the detailed response!

Just to make sure: when dealing with a node classification dataset, the changes are almost the same, but with the addition of:

In addition, could you please explain more about the format of the CSV files? More specifically:

  1. If we are dealing with a non-bipartite, undirected graph, should I reverse the columns in order to fully capture all edges and label all nodes? (It seems to me that only the 'user' list is labeled and the 'items' list is not.)
  2. Is there an example of a node classification CSV file without node features?
  3. It also seems that every dataset CSV file has a different format, and some datasets have a different number of nodes in the node CSV than in the edge CSV. For example, tgbn-genre_node_labels.csv contains 974 unique users, while tgbn-genre_edgelist.csv contains 992 unique users.
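
To make points 1 and 3 concrete, here is a small sketch of the two operations I have in mind: duplicating each edge in the reverse direction, and comparing the set of users in a label file against the set in an edgelist. The helper names and the column names (`ts`, `user`, `item`, `label`) are made up for this example and may not match the actual TGB files; whether TGB expects both directions to be stored is exactly what I am asking.

```python
import csv
import io


def unique_values(handle, column):
    """Collect the distinct values found in one column of a CSV file."""
    return {row[column] for row in csv.DictReader(handle)}


def add_reverse_edges(edges):
    """Return the edges plus a (ts, dst, src) copy of each, sorted by time."""
    both = list(edges) + [(ts, dst, src) for ts, src, dst in edges]
    return sorted(both)


# Tiny in-memory stand-ins for a node label file and an edgelist file
# (hypothetical columns, not the real tgbn-genre layout).
labels_csv = io.StringIO("ts,user,label\n1,u1,rock\n2,u2,jazz\n")
edges_csv = io.StringIO("ts,user,item\n1,u1,rock\n2,u2,jazz\n3,u3,pop\n")

labeled = unique_values(labels_csv, "user")
in_edges = unique_values(edges_csv, "user")
missing = sorted(in_edges - labeled)  # users that have edges but no label rows

undirected = add_reverse_edges([(1, "a", "b"), (2, "b", "c")])
```

Running the same set difference on the two tgbn-genre files would list which users appear in the edgelist without any label rows.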

Thanks in advance!
Yaniv
