Hi CEBRA team. I'm using the CEBRA-Behavior model to evaluate the relationship between our Neuropixels recording data and behavior stages. To demonstrate that the separation between behavioral labels in the real model does not depend solely on the label identity, I visualized the embeddings from the shuffled model with the shuffled labels and expected no apparent pattern in this visualization. However, I noticed two things that can generate a clear separation between the behavioral labels, even in the shuffled models.

However, the one trained with `batch_size = 512` is similar to what the demo shows.

I really appreciate your patience and time in answering these questions. Thanks!
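For reference, the shuffled-label control I describe above is conceptually along the lines of the sketch below (toy data and illustrative names, not my actual analysis code): fit one model on the real behavior-stage labels and one on labels permuted across time, then compare the two embeddings.

```python
import numpy as np
import cebra
from cebra import CEBRA

# Toy stand-ins for the Neuropixels activity and the behavior-stage label.
rng = np.random.default_rng(0)
neural = rng.normal(size=(2000, 50)).astype('float32')
stage = (np.arange(2000) % 200 < 100).astype(float)   # alternating behavior stages

model_kwargs = dict(model_architecture='offset10-model',
                    batch_size=512,
                    output_dimension=3,
                    max_iterations=500,      # short run, just to illustrate
                    conditional='time_delta',
                    time_offsets=10,
                    verbose=True)

# Real model: fit on the true labels.
real_model = CEBRA(**model_kwargs)
real_model.fit(neural, stage)
real_emb = real_model.transform(neural)

# Control: fit on labels permuted across time; ideally no structure should appear.
shuffled_stage = rng.permutation(stage)
ctrl_model = CEBRA(**model_kwargs)
ctrl_model.fit(neural, shuffled_stage)
ctrl_emb = ctrl_model.transform(neural)

# Visualize both embeddings, each colored by the labels the model was trained on.
cebra.plot_embedding_interactive(real_emb, embedding_labels=stage, title="real labels").show()
cebra.plot_embedding_interactive(ctrl_emb, embedding_labels=shuffled_stage, title="shuffled labels").show()
```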
Hi @Owenxz, thanks for flagging. I focused on checking example 1, and can repro the behavior you report.

However, you are right now training the model on binary variables (0 or 1) but using a continuous solver. Just checking, is this intended? This breaks a few assumptions about how the data is sampled; namely, the continuous solver with the `time_delta` distribution expects the label to change approximately according to a Normal distribution, which is not the case in this example. (Other clarification: is the data splitting function you used this one? That is additionally problematic when using an `offset-10` model; a split that preserves temporal information is more suitable.) It would be interesting to see if you can repro the behavior also if you use […].
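For comparison, here is a minimal sketch (untested, and assuming the sklearn-style API infers a discrete sampling scheme from an integer-typed label) of what I mean by treating the binary label as a discrete variable rather than feeding it to the continuous `time_delta` solver. The variable names mirror the quick test below:

```python
from cebra import CEBRA

# Sketch only: fit the binary direction label as a *discrete* index.
# Passing a 1D integer array lets CEBRA treat it as a discrete auxiliary
# variable; the conditional distribution is left at its default instead of
# forcing 'time_delta'. `neural_train` / `label_train` are the same arrays
# as in the quick test below.
cebra_discrete_model = CEBRA(model_architecture='offset10-model',
                             batch_size=512,
                             output_dimension=3,
                             max_iterations=5000,
                             device='cuda',
                             verbose=True,
                             time_offsets=10)

discrete_label = label_train[:, 1].astype(int)   # binary direction label (0/1)
cebra_discrete_model.fit(neural_train, discrete_label)
embedding_discrete = cebra_discrete_model.transform(neural_train)
```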
I ran a quick test on the following setup:

```python
import cebra
import numpy as np
from sklearn.model_selection import train_test_split
from cebra import CEBRA

# Load the demo hippocampus dataset and split it without shuffling, so the
# train/test portions stay temporally contiguous.
hippocampus_pos = cebra.datasets.init('rat-hippocampus-single-achilles')
max_iterations = 10000
neural_train, neural_test, label_train, label_test = train_test_split(
    hippocampus_pos.neural,
    hippocampus_pos.continuous_index.numpy(),
    test_size=0.2,
    random_state=2042,
    shuffle=False)

cebra_dir3_model_shuffled = CEBRA(model_architecture='offset10-model',
                                  batch_size=1024,
                                  learning_rate=3e-4,
                                  temperature=1,
                                  output_dimension=3,
                                  max_iterations=max_iterations,
                                  distance='cosine',
                                  conditional='time_delta',
                                  device='cuda',
                                  verbose=True,
                                  time_offsets=10)

# Shuffle the direction labels across time and train on the shuffled labels.
np.random.seed(999)
shuffled_label = np.random.permutation(label_train[:, 1:])
cebra_dir3_model_shuffled.fit(neural_train, shuffled_label[:, 1].astype(int))

# Embed the training data and visualize it colored by the shuffled label.
cebra_dir3_shuffled = cebra_dir3_model_shuffled.transform(neural_train)
fig = cebra.plot_embedding_interactive(cebra_dir3_shuffled,
                                       embedding_labels=shuffled_label[:, 1],
                                       title="CEBRA-Behavior",
                                       cmap="rainbow")
fig.show()
```
With this setup I can repro something similar. However, when looking at the loss, we see that it stays stable at chance level for a long time, but then gets very noisy and drops to the failure mode we see in the embedding space.

This behavior is fine and expected to a certain level for discrete labels, because we have a limited number of samples to train on. When we heavily overtrain the model, it is possible to get these overfitting effects. The solution is to (1) observe the loss for irregularities (see the short sketch at the end of this reply), (2) not overtrain the model, i.e. pick a training time that is reasonable for the amount of data you have available, and (3) train on longer data sequences. To point (3): if you trained on a longer dataset, this collapse behavior you see in the loss would probably happen much later in training.

Does this help clarify your question? Happy to discuss further!
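To point (1), here is a rough sketch of how the training loss of the fitted model can be inspected after training. It assumes the `cebra_dir3_model_shuffled` model from the snippet above; `cebra.plot_loss` is the matplotlib helper shipped with CEBRA, and the dashed line at ln(batch size) is only an approximation of the InfoNCE chance level:

```python
import numpy as np
import matplotlib.pyplot as plt
import cebra

# Plot the InfoNCE training loss to spot the late, noisy collapse
# described above.
ax = cebra.plot_loss(cebra_dir3_model_shuffled)

# Approximate chance level for InfoNCE with batch_size=1024 is ln(1024).
ax.axhline(np.log(1024), color='gray', linestyle='--', label='approx. chance level')
ax.legend()
plt.show()
```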