Better Train, test, val split for graphs #158

Dsantra92 · 2022-07-15T09:16:21Z

Problems

Splitting graphs is a bit more complicated than normal data. Graphs can be split based on node, edge or whole graph itself. We should be able to support all of that with minimal alterations. Currently, graph splits in MLDatases.jl have no specification.
The API for accessing train, test and val split is inconsistent.

Proposed solution

I propose the following API:

Splits should be present in the metadata. We can access splits using splits = dataset.metadata.split
The split is a named tuple for train, test and val. Is empty if no predefined split is present.
Each split is a named tuple/struct which contains type of split $\in$ [node, edge, graph] and the value of the split.
There can be two or more splits depending on the feature you are splitting on. There are such examples in OGBDatasets. So it makes sense to make the value of the split a dict or named tuple.

This should ideally handle most of the cases. We only support non-dynamic graphs for now, I have not worked with dynamic graphs. Most likely we can support splits with little to no changes.

This recursive structure of splits can be confusing for the end-user. We can support APIs in MLUtils.jl like:

train_data, test_data, val_data = split(dataset)
# If no split is present in the dataset, all data will training data or error out.

train_data = split(dataset, :train)

# If we want to choose edge split
train_data = split(dataset, :train, :edge)

# Get training split for a feature
train_data = split(dataset, :train, :feature_name)

Keeping this a discussion for now.

cc: @CarloLucibello

The text was updated successfully, but these errors were encountered:

CarloLucibello · 2022-07-17T07:02:00Z

Right now for graph level tasks the dataset.metadata contains the train/test mask.

For node and edge tasks we store the masks in the single graph metadata. Notice that for node/edge level tasks we are not really splitting a graph in two, we just have masks that we apply on the model outputs.

The same dataset can define multiple splits, we just have to define entries in the datasets or graph metadata, e.g. "edge_train_mask", "node_train_mask".

So it seems to me we basically have already what you ask for, maybe we just have to write down a naming convention scheme for the masks, and also specify in the docs the mask types (I think we have boolean masks for most graph dataset, the alternative being a list of indices).

To make graph datasets consistent with non-graph datasets, for graph-level tasks (where the datasets contains multiple graphs) we should allow the selection of the split at construction time, as we do with MNIST(split=:train).

Dsantra92 · 2022-07-18T12:19:35Z

Yes, we already have a lot of things done. We just need to document it and solidify it.

On, another note is MNIST(split=:train) the best way to go? As you mentioned, train, test and val split are just masks. So if we just do something like MovieLens("100K", split=:train), we will compute the whole dataset and then apply the training mask. So for loading 3 different splits we will basically do 3 sets of the same computation. Instead split(dataset, mask) makes more sense here.

MNIST(split=:train) will make sense when we are storing pre-processed splits separately in parquet files and loading them individually.

CarloLucibello · 2022-07-18T13:09:19Z

For MNIST the splits are actually separate files, so no shared computation, but for other datasets this is not the case and that's not ideal. Storing processed datasets would be the path forward. The thing is we already went through some breaking changes recently so I really want to avoid others in the near future, so let's see in which case the split=:train API is really a huge a inconvenience and eventually mitigate that.

Dsantra92 added the gsoc label Jul 25, 2022

Dsantra92 mentioned this issue Sep 1, 2022

Train test and validation for graph datasets. #168

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better Train, test, val split for graphs #158

Better Train, test, val split for graphs #158

Dsantra92 commented Jul 15, 2022

CarloLucibello commented Jul 17, 2022

Dsantra92 commented Jul 18, 2022

CarloLucibello commented Jul 18, 2022

Better Train, test, val split for graphs #158

Better Train, test, val split for graphs #158

Comments

Dsantra92 commented Jul 15, 2022

Problems

Proposed solution

CarloLucibello commented Jul 17, 2022

Dsantra92 commented Jul 18, 2022

CarloLucibello commented Jul 18, 2022