Where can I download the dataset? #2
Comments
Hi, You can find a sample of the dataset, as well as a brief description, as an open data challenge, in Bests |
Hi, I can't share an npz file containing any data other than what was uploaded to the data challenge, as it would go against the very rules of the challenge. |
Hi, do you have any code that could transform the CSV files to npz? I am not sure what we should include in the npz file. |
Once again, all the needed information is present in the challenge benchmark repo, but to prevent further questions on the dataset I have drafted a function to convert |
Dear @maxjcohen , I joined the challenge 28, downloaded the following files:
Then I copied the csv2npz script to the utils folder within the project.
from src.utils.csv2npz import csv2npz
csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')
But unfortunately it raised the error shown below.
Traceback (most recent call last):
File "/home/<username>/Workspaces/Python/transformer/generateNpz.py", line 3, in <module>
csv2npz('datasets/x_train_LsAZgHU.csv', 'datasets/y_train_EFo1WyE.csv')
File "/home/<username>/Workspaces/Python/transformer/src/utils/csv2npz.py", line 21, in csv2npz
R = x[labels["R"]].values
File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/frame.py", line 2806, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/indexing.py", line 1553, in _get_listlike_indexer
keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
File "/home/<username>/.virtualenvs/.env/lib/python3.7/site-packages/pandas/core/indexing.py", line 1646, in _validate_read_indexer
raise KeyError(f"{not_found} not in index")
KeyError: "['initial_temperature', 'roof_1_thickness_3'] not in index" |
Hi, this error means that the index In order to solve your error, I recommend using the original labels file from the benchmark repo. |
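A KeyError like the one above can be diagnosed before calling csv2npz by comparing the labels file against the CSV header. The following is only a minimal sketch using the standard library: the helper name `missing_columns` and the toy file contents are hypothetical, and it assumes the labels file is a JSON object mapping group names such as "R" to lists of column names, as the traceback suggests.

```python
import csv
import io
import json


def missing_columns(csv_file, labels):
    """Return, per label group, the column names absent from the CSV header."""
    header = set(next(csv.reader(csv_file)))
    report = {}
    for group, names in labels.items():
        absent = [name for name in names if name not in header]
        if absent:
            report[group] = absent
    return report


# Toy stand-ins for x_train and the labels file (hypothetical contents).
toy_csv = io.StringIO("index,some_column\n0,0.5\n")
toy_labels = json.loads('{"R": ["initial_temperature", "some_column"]}')

print(missing_columns(toy_csv, toy_labels))
```

Any name reported here is either missing from the CSV or should be dropped from the labels file in use.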
I created pull request #6 with the improvements I have come up with so far. It might be useful to merge, @maxjcohen; please advise. |
I am looking at your project and trying to process a different dataset. If convenient, please describe the data format so that I can process data beyond the challenge dataset. Thanks. |
Hi, there is no particular data format to use with the Transformer besides the input shape specified in the documentation. We currently handle our data using the |
Hi, thanks for the reference to the helpful data loading function. Just one minor tip here. The original data loader uses X.values.reshape((m,-1,k)), where m is the number of observations and k is the length of the time series. However, a typical LSTM or Transformer model accepts an input of shape (batch, time series length, num_feature), so reshaping to (m, k, -1) is recommended. The same goes for the variable "Z" (I have to point out that the naming is quite confusing at first glance).
For labels.json, I deleted "week" and "light_blabla_mask" (I can't remember the exact name, but the error message alerted me that this index was not found). You can also refer to the data specification on the challenge website https://challengedata.ens.fr/participants/challenges/28/ to modify your labels.json.
My final input size is (8, 672, 18): a batch of 8, 672 time steps, and 18 features, ignoring the room parameters. - 2021 / 2 / 25 |
By default, an LSTM in PyTorch accepts an input of shape (time series length, batch, num_features), see the docs. |
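The two comments above concern axis order, and the difference is easy to check on a toy array. The sketch below uses NumPy and assumes, purely for illustration, that the flat CSV columns are grouped by feature (all time steps of feature 0, then all of feature 1, and so on); the real column order should be checked against the challenge files. Under that assumption, a direct reshape to (m, k, -1) yields the right shape but mixes features and time steps, whereas reshaping per feature and then swapping axes keeps each series intact.

```python
import numpy as np

m, d, k = 2, 3, 4  # observations, features, time steps (toy sizes)

# Assumed layout: columns grouped by feature, f0_t0..f0_t3, f1_t0..f1_t3, ...
flat = np.arange(m * d * k).reshape(m, d * k)

# Split per feature first, then move time to axis 1: (batch, time, features)
batch_first = flat.reshape(m, d, k).transpose(0, 2, 1)

# Direct reshape: same shape, but values of different features get interleaved
naive = flat.reshape(m, k, d)

# (time, batch, features), the default layout expected by torch.nn.LSTM
time_first = batch_first.transpose(1, 0, 2)

print(batch_first.shape, naive.shape, time_first.shape)
print(batch_first[0, :, 0])  # feature 0 of observation 0 over time
print(naive[0, :, 0])        # not a single feature's series under this layout
```

If the columns were instead grouped by time step, the direct reshape would be the correct one, so the layout of the CSV decides which form to use.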
I managed to get a .npz file using the labels.json from https://raw.githubusercontent.com/maxjcohen/ozechallenge_benchmark/master/labels.json and the code from https://gist.github.com/diegoquintanav/050765be2ff3f4cfcf7c25da645cfcc2
However, in the notebook at https://timeseriestransformer.readthedocs.io/en/latest/notebooks/trainings/training_2020_06_27__164648.html#Load-dataset the dataset used has (I think) 25k rows, while the one downloaded from the ozechallenge has 7500:
$ wc -l dataset/x_train_LsAZgHU.csv
7501 dataset/x_train_LsAZgHU.csv
If I change the splits to
[Epoch 1/30]: 0%| | 0/5500 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-20-4b3396332a6c> in <module>
12
13 # Propagate input
---> 14 netout = net(x.to(device))
15
16 # Comupte loss
~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~/code/notebooks/transformers/transformer/tst/transformer.py in forward(self, x)
123
124 # Embeddin module
--> 125 encoding = self._embedding(x)
126
127 # Add position encoding
~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
887 result = self._slow_forward(*input, **kwargs)
888 else:
--> 889 result = self.forward(*input, **kwargs)
890 for hook in itertools.chain(
891 _global_forward_hooks.values(),
~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/modules/linear.py in forward(self, input)
92
93 def forward(self, input: Tensor) -> Tensor:
---> 94 return F.linear(input, self.weight, self.bias)
95
96 def extra_repr(self) -> str:
~/.pyenv/versions/anaconda3-5.3.1/envs/tfm/lib/python3.8/site-packages/torch/nn/functional.py in linear(input, weight, bias)
1751 if has_torch_function_variadic(input, weight):
1752 return handle_torch_function(linear, (input, weight), input, weight, bias=bias)
-> 1753 return torch._C._nn.linear(input, weight, bias)
1754
1755
RuntimeError: mat1 dim 1 must match mat2 dim 0
What is this 'datasets/dataset_57M.npz'? And what are X, R and Z? Thanks! |
Hi, the dataset from the challenge and the one I'm using in this repo are quite different, which is why the dimensions don't match. If you want to use this Transformer for the challenge, you'll have to make a few adjustments. As for your question about X, R and Z, you can check #28 . |
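The RuntimeError in the traceback is raised by the linear embedding layer, which requires the last axis of the input to equal the layer's input size. A small guard can turn that into a readable message before the forward pass. This is only a sketch: `check_feature_dim` is a hypothetical helper, and `d_input` is assumed here to be the number of input features the model was built with.

```python
def check_feature_dim(x_shape, d_input):
    """Raise a readable error if the last axis of x does not equal d_input."""
    n_features = x_shape[-1]
    if n_features != d_input:
        raise ValueError(
            f"input has {n_features} features but the model expects d_input={d_input}"
        )


# Shape reported earlier in the thread; passes silently when the sizes agree.
check_feature_dim((8, 672, 18), d_input=18)
```

Calling it with `tuple(x.shape)` right before `net(x.to(device))` would show which side needs adjusting, the data or the model configuration.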
Hi!, thanks for answering. Can you tell me more about the differences? For example, what are the shapes of X, R, and Z in
Is this not what is going on in this repo? In the readme, you say that the dataset used to train this transformer is the one from the challenge, but that does not seem to be the case. Can you tell me more about what are the adjustments needed? |
The variables
The original dataset from the challenge has been modified, for instance some variables were removed from Please keep in mind that the dataset |
Thanks to the author for the great intuitions and efforts. For those who may have issues related to the dataset, you might be able to try this that I slightly modified according to the author's suggestions. and dataset You can check some plots resulting from the code above (I don't know whether they are correct or not). Hope this helps someone. |
The dataset of the challenge contains files named x_train and y_train. Do they complement each other, or is one of them enough? |
Hi, yes they complement each other, |
Thank you for your work! |
I am new to Transformer methods. Can the package accept csv files directly instead of .npz files? |
In this repo, we define a Transformer model that takes Tensors as input, see the documentation. We present examples loading data as |
Could you give a detailed explanation of the dataset part? I have already downloaded the two files, dataset.npz and lable.json, and placed them in the directory, but I still cannot run the code. |
Hi @yyldtc , from what I was able to translate from your message, something is still not working with the dataset. Could you detail the error that you got in a new issue? I'll take a look. |
Hello, thanks for the wonderful work!
Can you give more details about the dataset? And where can I download the dataset?
Thank you!