Where can I download the dataset? #2

Closed
datong-new opened this issue Jan 17, 2020 · 24 comments
Labels
dataset Issue with downloading or loading the dataset

Comments

@datong-new

datong-new commented Jan 17, 2020

Hello, thanks for the wonderful work!

Can you give more details about the dataset? And where can I download the dataset?

Thank you!

@maxjcohen
Owner

Hi,

You can find a sample of the dataset, along with a brief description, published as an open data challenge in csv format. To use the notebooks, you will have to convert it to npz format or use a custom PyTorch dataset (see the challenge demo repo).

Best

@maxjcohen maxjcohen added the dataset Issue with downloading or loading the dataset label Jan 17, 2020
@HuskyLens

Hi,
Could you share the npz file? The data structure from the Open Data Challenge seems different from yours.
See the difference:
[Two screenshots, labeled "Yours" and "Origin"; images not preserved.]

@maxjcohen
Owner

Hi, I can't share an npz file containing any data other than what was uploaded to the data challenge, as that would go against the rules of the challenge.
The structure of the labels is different, but that shouldn't be an issue if you just want to convert the csv dataset to npz, as the code was written with these possible modifications in mind. Just load the csv with the OzeDataset class, and export R, Z and X using np.savez. You're aiming at this kind of data structure.
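
A minimal sketch of such an export, written without OzeDataset so that it runs on its own. It assumes (this is not confirmed in the thread) that each time-series variable is stored in the csv as K consecutive columns named `<variable>_<t>`, and that the labels map "R", "Z" and "X" to lists of variable names. The toy frames, the column names and the value of K are made up for illustration; for the real files, rely on OzeDataset and the labels.json from the benchmark repo.

```python
import io

import numpy as np
import pandas as pd

K = 3  # hypothetical number of time steps (toy value)
labels = {"R": ["surface"], "Z": ["temp_out"], "X": ["temp_in"]}  # hypothetical names

# Toy frames standing in for the challenge's x_train / y_train csv files.
x = pd.DataFrame({"surface": [10.0, 20.0],
                  "temp_out_0": [1.0, 4.0], "temp_out_1": [2.0, 5.0], "temp_out_2": [3.0, 6.0]})
y = pd.DataFrame({"temp_in_0": [7.0, 10.0], "temp_in_1": [8.0, 11.0], "temp_in_2": [9.0, 12.0]})


def to_series(frame, names, k):
    """Stack `<name>_<t>` columns into an array of shape (samples, k, len(names))."""
    cols = [f"{name}_{t}" for name in names for t in range(k)]
    values = frame[cols].to_numpy(dtype=np.float32)
    # Columns are variable-major, so reshape to (samples, variables, k) first,
    # then swap the last two axes to get (samples, k, variables).
    return values.reshape(len(frame), len(names), k).transpose(0, 2, 1)


R = x[labels["R"]].to_numpy(dtype=np.float32)  # (samples, n_characteristics)
Z = to_series(x, labels["Z"], K)               # (samples, K, n_inputs)
X = to_series(y, labels["X"], K)               # (samples, K, n_outputs)

buffer = io.BytesIO()  # in practice, pass a file path such as "dataset.npz"
np.savez(buffer, R=R, Z=Z, X=X)
buffer.seek(0)
loaded = np.load(buffer)
print({name: loaded[name].shape for name in ("R", "Z", "X")})
```

The in-memory buffer only keeps the example free of side effects; np.savez accepts a file name in the same position.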

@francisduan

Hi, do you have any code that could convert the csv to npz? I am not sure what we should include in the npz.

@maxjcohen
Owner

Once again, all the information you need is available in the challenge benchmark repo, but to prevent further questions about the dataset I have drafted a function to convert csv to npz.

@maxjcohen maxjcohen mentioned this issue Apr 3, 2020
@DanielAtKrypton

DanielAtKrypton commented May 1, 2020

Dear @maxjcohen, I joined challenge 28 and downloaded the following files:

  • x_train_LsAZgHU.csv
  • y_train_EFo1WyE.csv
  • x_test_QK7dVsy.csv

Then I copied the csv2npz script to the utils folder within the project.
Then I created and ran the following Python script in the project's root folder:

from src.utils.csv2npz import csv2npz

csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')

Unfortunately, it failed with the error below.

Traceback (most recent call last):
  File "/home/<username>/Workspaces/Python/transformer/generateNpz.py", line 3, in <module>
    csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')
  File "/home/<username>/Workspaces/Python/transformer/src/utils/csv2npz.py", line 21, in csv2npz
    R = x[labels["R"]].values
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/frame.py", line 2806, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/indexing.py", line 1553, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/indexing.py", line 1646, in _validate_read_indexer
    raise KeyError(f"{not_found} not in index")
KeyError: "['initial_temperature', 'roof_1_thickness_3'] not in index"

@maxjcohen
Owner

Hi, this error means that the columns "initial_temperature" and "roof_1_thickness_3" are not present in the challenge dataset. Indeed, the original labels.json does not list these values, because they were not intended to be used in the challenge.

In order to solve your error, I recommend using the original labels file from the benchmark repo.
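
As an illustration of how this kind of KeyError can be diagnosed before indexing, here is a small sketch that compares the label names against the columns actually present in the DataFrame. The column names are toy values and `split_labels` is a hypothetical helper, not part of the repo.

```python
import pandas as pd

# Toy stand-ins: `x` plays the role of the challenge csv, `labels` of labels.json.
x = pd.DataFrame({"window_area": [1.2], "wall_thickness": [0.3]})
labels = {"R": ["window_area", "initial_temperature", "wall_thickness"]}


def split_labels(frame, names):
    """Return (present, missing) label names with respect to the frame's columns."""
    present = [name for name in names if name in frame.columns]
    missing = [name for name in names if name not in frame.columns]
    return present, missing


present, missing = split_labels(x, labels["R"])
if missing:
    print(f"Labels not found in the csv: {missing}")

# Indexing with the filtered list no longer raises a KeyError.
R = x[present].to_numpy()
```

Silently dropping labels changes the input dimension, so using the labels file that matches the dataset remains the safer fix.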

@DanielAtKrypton

I created pull request #6 with the improvements I have come up with so far. It might be useful to merge, @maxjcohen; please advise.

@jshyou

jshyou commented Jan 7, 2021

I am looking at your project and trying to process a different dataset. If convenient, please describe the data format so that I can process data beyond the challenge dataset. Thanks.

@maxjcohen
Owner

Hi, there is no particular data format to use with the Transformer besides the input shape specified in the documentation.

We currently handle our data using the OzeDataset class, which inherits from PyTorch's Dataset class. As the format here is a bit specific, I encourage you to write your own Dataset subclass fitting your data, and feed it to the Transformer.
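
A minimal sketch of such a Dataset subclass, assuming the data is already available as arrays shaped (samples, time steps, features). `ArrayDataset` and the array sizes are hypothetical and only illustrate the pattern; check the documentation for the exact input shape the model expects.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayDataset(Dataset):
    """Hypothetical dataset wrapping arrays shaped (samples, time steps, features)."""

    def __init__(self, inputs, targets):
        if len(inputs) != len(targets):
            raise ValueError("inputs and targets must have the same number of samples")
        self._x = torch.as_tensor(inputs, dtype=torch.float32)
        self._y = torch.as_tensor(targets, dtype=torch.float32)

    def __len__(self):
        return self._x.shape[0]

    def __getitem__(self, idx):
        return self._x[idx], self._y[idx]


# Random placeholder data: 10 samples, 24 time steps, 5 input and 2 output features.
dataset = ArrayDataset(np.random.rand(10, 24, 5), np.random.rand(10, 24, 2))
loader = DataLoader(dataset, batch_size=4, shuffle=True)
x_batch, y_batch = next(iter(loader))
print(x_batch.shape, y_batch.shape)
```

Each batch drawn from the loader can then be passed to the model in the training loop.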

@jiange91

Hi, thanks for the helpful data loading function. Just one minor tip here.

The original data loader uses X.values.reshape((m, -1, K)), where m is the number of observations and K is the length of the time series. However, a typical LSTM or Transformer model accepts an input tensor of shape (batch, time series length, num_features), so reshaping to (m, K, -1) is recommended. The same applies to the variable "Z" (I have to point out that the naming is quite confusing at first glance).
X = X.values.reshape((m, K, -1))
Z = Z.values.reshape((m, K, -1))

For labels.json, I deleted "week" and "light_blabla_mask" (I can't remember the exact name, but the error message alerted me that this index was not found). You can also refer to the data specification on the challenge website https://challengedata.ens.fr/participants/challenges/28/ to modify your labels.json.

My final input tensor has shape (8, 672, 18): a batch of 8 samples, 672 time steps, and 18 features (ignoring the room parameters). (2021-02-25)
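
A note on the reshape suggestion above: if the flattened columns are ordered variable by variable (an assumption about the csv layout, suggested by the original reshape((m, -1, K))), then reshaping directly to (m, K, -1) yields the expected shape but mixes values from different variables. Reshaping to (m, -1, K) and then transposing the last two axes keeps each series intact. A toy comparison:

```python
import numpy as np

m, n_vars, K = 2, 2, 3  # toy sizes: samples, variables, time steps

# One row per sample; columns are variable-major: a_0, a_1, a_2, b_0, b_1, b_2.
flat = np.array([[1, 2, 3, 10, 20, 30],
                 [4, 5, 6, 40, 50, 60]])

by_variable = flat.reshape((m, -1, K))         # (m, n_vars, K), series kept intact
time_major = by_variable.transpose((0, 2, 1))  # (m, K, n_vars), series kept intact
reshaped_only = flat.reshape((m, K, -1))       # (m, K, n_vars), but values are mixed

print(time_major[0, :, 0])     # [1 2 3]: variable a of sample 0 over time
print(reshaped_only[0, :, 0])  # [ 1  3 20]: no longer a single variable's series
```

If the columns were instead ordered time step by time step, the direct reshape to (m, K, -1) would be the correct one, so it is worth checking the column order of your own file.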

@maxjcohen
Owner

By default, LSTM in PyTorch accepts a tensor of shape (time series length, batch, num_features), see the docs.
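
For reference, a short sketch of the two layouts supported by torch.nn.LSTM: the default one, and the batch-first one enabled with batch_first=True. The sizes reuse the (8, 672, 18) shape mentioned above; the hidden size is arbitrary.

```python
import torch
from torch import nn

batch, steps, features, hidden = 8, 672, 18, 4

# Default layout: (time steps, batch, features).
lstm_default = nn.LSTM(input_size=features, hidden_size=hidden)
out_default, _ = lstm_default(torch.randn(steps, batch, features))

# With batch_first=True the module expects (batch, time steps, features).
lstm_batch_first = nn.LSTM(input_size=features, hidden_size=hidden, batch_first=True)
out_batch_first, _ = lstm_batch_first(torch.randn(batch, steps, features))

print(out_default.shape)      # torch.Size([672, 8, 4])
print(out_batch_first.shape)  # torch.Size([8, 672, 4])
```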

@diegoquintanav

diegoquintanav commented Apr 19, 2021

I managed to get a .npz file using the labels.json from https://raw.githubusercontent.com/maxjcohen/ozechallenge_benchmark/master/labels.json and the code from https://gist.github.com/diegoquintanav/050765be2ff3f4cfcf7c25da645cfcc2

However, in the notebook at https://timeseriestransformer.readthedocs.io/en/latest/notebooks/trainings/training_2020_06_27__164648.html#Load-dataset the dataset used has (I think) 25k rows, whereas the one downloaded from the ozechallenge has 7500:

$ wc -l dataset/x_train_LsAZgHU.csv 
7501 dataset/x_train_LsAZgHU.csv

If I change the splits to dataset_train, dataset_val, dataset_test = random_split(ozeDataset, (5500, 1000, 1000)), I hit an error in the cell that does the training:

[Epoch   1/30]:   0%|          | 0/5500 [00:00<?, ?it/s]

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-4b3396332a6c> in <module>
     12 
     13             # Propagate input
---> 14             netout = net(x.to(device))
     15 
     16             # Comupte loss

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/code/notebooks/transformers/transformer/tst/transformer.py in forward(self, x)
    123 
    124         # Embeddin module
--> 125         encoding = self._embedding(x)
    126 
    127         # Add position encoding

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    887             result = self._slow_forward(*input, **kwargs)
    888         else:
--> 889             result = self.forward(*input, **kwargs)
    890         for hook in itertools.chain(
    891                 _global_forward_hooks.values(),

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/linear.py in forward(self, input)
     92 
     93     def forward(self, input: Tensor) -> Tensor:
---> 94         return F.linear(input, self.weight, self.bias)
     95 
     96     def extra_repr(self) -> str:

~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/functional.py in linear(input, weight, bias)
   1751     if has_torch_function_variadic(input, weight):
   1752         return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1753     return torch._C._nn.linear(input, weight, bias)
   1754 
   1755 

RuntimeError: mat1 dim 1 must match mat2 dim 0

What is this 'datasets/dataset_57M.npz', and what are X, R and Z? Thanks!

@maxjcohen
Owner

Hi, the dataset from the challenge and the one I'm using in this repo are quite different, which is why the dimensions don't match. If you want to use this Transformer for the challenge, you'll have to make a few adjustments.
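
To illustrate the kind of mismatch behind the traceback above, which ends in the linear layer of the embedding module: a linear layer sized for one number of features rejects a batch with another. The sketch below uses a bare nn.Linear as a stand-in for the embedding; the sizes 38 and 16 are arbitrary, and the adjustments actually needed in the notebook may go beyond this single parameter.

```python
import torch
from torch import nn

d_model = 16
x = torch.randn(4, 672, 18)  # placeholder batch: (batch, time steps, features)

# A layer sized for a different number of features fails on this batch.
mismatched = nn.Linear(38, d_model)  # 38 is an arbitrary "wrong" size
try:
    mismatched(x)
    failed = False
except RuntimeError as err:
    failed = True
    print(f"Shape mismatch: {err}")

# Reading the feature count from the data keeps the layer and the batch consistent.
d_input = x.shape[-1]
embedding = nn.Linear(d_input, d_model)
encoding = embedding(x)
print(encoding.shape)
```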

As for your question about X, R and Z, you can check #28 .

@diegoquintanav

Hi!, thanks for answering.

Can you tell me more about the differences? For example, what are the shapes of X, R, and Z in dataset_57M.npz? Also, I'm lost when you say that

If you want to use this Transformer for the challenge, you'll have to make a few adjustments.

Is this not what is going on in this repo? In the readme, you say that the dataset used to train this Transformer is the one from the challenge, but that does not seem to be the case. Can you tell me more about which adjustments are needed?

@maxjcohen
Owner

The variables X, R and Z are specific to the challenge dataset, and completely independent of the Transformer model. They simply describe the dataset, which has two inputs instead of the usual one:

  • R contains the characteristics of the building, which don't change with time, and are concatenated with Z to serve as input. Shape should be (n_samples, n_characteristics).
  • Z contains the input time series. Shape should be (n_samples, time_steps, n_input_variables).
  • X contains the output time series. Shape should be (n_samples, time_steps, n_output_variables).
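
The thread does not show how R is concatenated with Z. One plausible way, given the shapes listed above, is to repeat the static characteristics at every time step and append them along the feature axis. The sketch below does that with random placeholder arrays; it is an assumption, not the repo's confirmed implementation.

```python
import numpy as np

n_samples, time_steps = 4, 6
n_characteristics, n_input_variables = 3, 2

rng = np.random.default_rng(0)
R = rng.random((n_samples, n_characteristics))
Z = rng.random((n_samples, time_steps, n_input_variables))

# Repeat each building's static characteristics at every time step ...
R_tiled = np.repeat(R[:, np.newaxis, :], time_steps, axis=1)
# ... then append them to the time-varying inputs along the feature axis.
model_input = np.concatenate([Z, R_tiled], axis=-1)

print(model_input.shape)  # (4, 6, 5)
```

The resulting feature dimension is n_input_variables + n_characteristics, which is the value the model's input size would have to match.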

The original dataset from the challenge has been modified: for instance, some variables were removed from R, some were added to Z, etc. But the content is roughly the same, and should be sufficient for trying out the Transformer. All changes can be found in the labels.json files.

Please keep in mind that the dataset dataset_57M.npz is not available for download.

@inkyusa

inkyusa commented May 12, 2021

Thanks to the author for the great insights and effort.

For those who have issues with the dataset, you can try this version, which I slightly modified according to the author's suggestions.
https://github.com/afters-cool/transformer

and dataset
https://github.com/afters-cool/transformer/releases/tag/v0.0.1

You can check some plots resulting from the code above (I don't know whether they are correct).
https://github.com/afters-cool/transformer/tree/master/assets

Hope this helped someone.

@sarraAyed

sarraAyed commented Oct 13, 2021

The challenge dataset contains files named x_train and y_train. Do they complement each other, or is one of them enough?
Also, if my data is already in a csv file, can't I just divide it into train, test and validation sets directly and use them?

@maxjcohen
Owner

Hi, yes, they complement each other: x_train contains the commands (input vectors), while y_train contains the observations (output vectors). You are, of course, free to split your data however you like.
In the future, please keep discussions about the challenge in the challenge repo.
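
For readers splitting their own data, a small sketch of a train/validation/test split that keeps each input row paired with its output row. The arrays are random placeholders for the contents of x_train and y_train, and the split sizes are arbitrary.

```python
import numpy as np

n_samples = 20
rng = np.random.default_rng(0)
x_all = rng.random((n_samples, 5))  # placeholder for the rows of x_train
y_all = rng.random((n_samples, 2))  # placeholder for the matching rows of y_train

# Shuffle the indices once and reuse them for both arrays, so that each
# input row stays paired with its output row.
order = rng.permutation(n_samples)
n_train, n_val = 14, 3  # the remaining 3 samples go to the test set
train_idx = order[:n_train]
val_idx = order[n_train:n_train + n_val]
test_idx = order[n_train + n_val:]

x_train, y_train = x_all[train_idx], y_all[train_idx]
x_val, y_val = x_all[val_idx], y_all[val_idx]
x_test, y_test = x_all[test_idx], y_all[test_idx]

print(len(x_train), len(x_val), len(x_test))  # 14 3 3
```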

@gaoyanfei1

Thank you for your work!

@chrismen

I am new to Transformer methods. Can the package accept csv files directly instead of .npz files?

@maxjcohen
Owner

In this repo, we define a Transformer model that takes as inputs Tensors, see the documentation. We present examples loading data as .npz files, but you can load data however you want.

@yyldtc

yyldtc commented Apr 9, 2024

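Could you give a detailed explanation of the dataset part? I have already downloaded the two files, dataset.npz and lable.json, and placed them in the directory, but I still cannot run the code.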

@maxjcohen
Owner

Hi @yyldtc, from what I was able to translate from your message, something is still not working with the dataset. Could you describe the error you got in a new issue? I'll take a look.
