Unwindowed datasets: some clarification required #908

Open

bellerofonte opened this issue May 31, 2024 · 0 comments

Hi @oguiza

I am trying to solve a binary classification problem using tsai.
I have a fairly large dataset, and I cannot use apply_sliding_window on it directly because I run into OOM.
That is why I am now trying the TSUnwindowedDataset[s] routines, and I have doubts about several points of whether I'm doing the right thing.

For the following example I took just a part of the full dataset, so the shape of this slice does not really matter; it is just FYI.

df.shape
# (2358720, 7)

Now I extract features and target from the slice

X = df.drop(columns=['time', 'target']).values
y = df['target'].values
type(X), type(y), X.shape, y.shape
# (numpy.ndarray, numpy.ndarray, (2358720, 5), (2358720,))
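
Incidentally, these shapes also quantify why the fully windowed approach runs out of memory: materializing every window up front multiplies the data size by the window length. A rough back-of-the-envelope estimate (my own sketch, assuming float64 arrays and the window size of 50 used further below):

# assumptions: float64 (8 bytes), window size 50 as used further below
n_samples, n_vars, window_size = 2_358_720, 5, 50
n_windows = n_samples - window_size + 1
n_windows * n_vars * window_size * 8 / 1e9
# 4.717342   <- ~4.7 GB for this slice alone, before the full dataset enters the picture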

Checking that the target is binary:

pd.Series(y).value_counts()
# 0    1829729
# 1     528991
# Name: count, dtype: int64

Now the tsai library kicks in:

computer_setup()
# os              : Linux-5.9.16-050916-lowlatency-x86_64-with-glibc2.31
# python          : 3.12.3
# tsai            : 0.3.9
# fastai          : 2.7.15
# fastcore        : 1.5.38
# torch           : 2.2.2+cu121
# device          : 1 gpu (['NVIDIA GeForce GTX 1080 Ti'])
# cpu cores       : 16
# threads per cpu : 2
# RAM             : 31.31 GB
# GPU memory      : [11.0] GB
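
For completeness, the splits were created along these lines (a sketch from memory rather than my exact call; the sizes below are placeholders, and get_splits is tsai's split helper):

splits = get_splits(y,
                    valid_size=0.2,  # placeholder value
                    test_size=0.1,   # placeholder value, produces the third (test) split
                    shuffle=False)   # keep the time order intact
len(splits)
# 3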

After I've created splits, I create instances of the TSUnwindowedDataset and TSUnwindowedDatasets classes:

WINDOW_SIZE = 50

def my_y_func(y_):
    return y_[:,-1] # I need only the last item from the window of targets

ds = TSUnwindowedDataset(X=X, y=y, y_func=my_y_func, window_size=WINDOW_SIZE, seq_first=True)

dsets = TSUnwindowedDatasets(ds, splits=splits)

dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, dsets[2], # including the test part of the dataset
                               bs=256,
                               shuffle_train=False,
                               batch_tfms=TSStandardize(by_sample=True)
                               )
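
As a sanity check on my understanding (the expected shapes below are my assumption about how the windowing works, not something I verified in the source):

xb, yb = ds[0]
xb.shape, yb
# expected: (5, 50), i.e. (n_vars, window_size), plus a single 0/1 target,
# since my_y_func keeps only the last item of each window of targets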

And here is the first point:

dls.vars, dls.c
# (5, 1)
#     ^
#     expected 2 for binary classification

The class count is 1 instead of the expected 2 for binary classification. If I try to create a model and train it

model = TST(dls.vars, dls.c, dls.len, dropout=0.3, fc_dropout=0.3)

cbs = [
    # does not matter
]

learn = Learner(dls, model, metrics=[RocAucBinary(), accuracy], cbs=cbs)
learn.lr_find()

I get the following error (presumably because the model was built with a single output class, so the target value 1 falls outside the valid range [0, n_classes)):

../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.

This approach differs from the sample notebook, where a transformation is used for the target:

# .....
tfms  = [None, [Categorize()]] # <---------- makes `dls.c` equal to `2`
dsets = TSDatasets(X, y, tfms=tfms, splits=splits)
# .....

However, the TSUnwindowedDataset does not have such functionality.
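
My understanding of what Categorize contributes there (an assumption based on its documented behavior, not on reading the tsai internals): it builds a vocab from the target values, and dls.c is then derived from the vocab length. Conceptually:

vocab = sorted(set(y.tolist()))
len(vocab)
# 2   <- what I would expect dls.c to report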

How do I properly introduce the target to the data loader in this case?

As a temporary solution, I have tried to train the model like this (with two output classes the target value 1 becomes a valid class index, so the CUDA assertion goes away):

model = TST(dls.vars, max(2, dls.c), dls.len, dropout=0.3, fc_dropout=0.3)
#                            ^
cbs = [
    # does not matter
]

learn = Learner(dls, model, metrics=[RocAucBinary(), accuracy], cbs=cbs)
learn.fit_one_cycle(100, 1e-4)

This code trains the model, and I even get pretty good-looking charts at the end:
[screenshot: training charts; the validation ROC AUC shown is ~0.75]

But here is the second point:
I don't know how to properly interpret the predictions.

probas, *_, labels = learn.get_preds(dl=dls.valid, with_decoded=True)
labels_ = probas.argmax(dim=1)
test_eq(labels_, labels)     # OK

As for my target: y[i] == 1 is good and 0 is bad.
But what does labels[i] == 1 mean? It could mean the same as my target, but since the predictions come back as probabilities of shape (N, 2), I suspect it might mean the opposite.
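
To make the question concrete, this is the check I have in mind (it assumes fastai's convention that column j of probas corresponds to class index j, which is exactly the part I'm unsure about):

(probas[:, 1] > 0.5).long()
# should match labels if column 1 really corresponds to my target class 1
# (with two softmax columns summing to 1, this is equivalent to the argmax above)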

So, to check it against actual metrics, I've created a helper function:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

def check_predictions(dls, learner, idx=None, invert: bool = False):
    # use the validation dataloader by default, or select a split by index
    dl = dls.valid if idx is None else dls[idx]
    probas, *_, labels = learner.get_preds(dl=dl, with_decoded=True)

    # ground truth for this split: the raw target indexed by the split's indices
    y_true = dl.y[dl.split]
    y_pred = labels.cpu().numpy()
    if invert:
        y_pred = 1 - y_pred  # try the opposite label convention

    print('ROC_AUC:   ', roc_auc_score(y_true, y_pred))
    print('F1:        ', f1_score(y_true, y_pred))
    print('Accuracy:  ', accuracy_score(y_true, y_pred))
    print('Precision: ', precision_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

...and tried it both ways:

check_predictions(dls, learn, invert=False) # use predicted labels as they are
# ROC_AUC:    0.5123604793431208
# F1:         0.48825168804096614
# Accuracy:   0.41867109378016293
# Precision:  0.34949275421082937
# [[21967 80216]
#  [10126 43097]]
check_predictions(dls, learn, invert=True)  # use inverted labels
# ROC_AUC:    0.4876395206568792
# F1:         0.23737634206948285
# Accuracy:   0.5813289062198371
# Precision:  0.31552051849312934
# [[80216 21967]
#  [43097 10126]]

And here is the third point: I cannot reproduce a validation ROC AUC score anywhere near the one displayed on the chart.
Whichever way I compare the predicted labels to my target on the validation subset, I get ROC AUC ~0.5, while the chart shows ~0.75.
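
One diagnostic I could still run (a sketch; it assumes the in-training RocAucBinary is computed from the predicted probabilities rather than from hard labels, which I have not verified):

# ROC AUC from the class-1 probability column instead of hard 0/1 labels;
# AUC of thresholded predictions collapses to (TPR + TNR) / 2 and gravitates
# toward 0.5, so this difference alone might explain the gap with the chart
probas, *_, labels = learn.get_preds(dl=dls.valid, with_decoded=True)
y_true = dls.valid.y[dls.valid.split]
roc_auc_score(y_true, probas[:, 1].cpu().numpy())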
Why does that happen? What am I missing?
