Transparency of Transformations #16

Closed
7 tasks done
kalebphipps opened this issue Oct 26, 2020 · 3 comments

kalebphipps (Collaborator) commented Oct 26, 2020

The Problem
Currently, different modules deal with data transformations in different ways:

  • Some modules access data variables via an index, others via a name.
  • There is no clear structure or system for how new data variables are named (sometimes the name is based on the target, sometimes on the module, etc.).
  • Some modules create a new data array while others add an extra dimension; there is no consistency here (illustrated in the sketch after this list).
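
To make these inconsistencies concrete, here is a small hypothetical xarray snippet (not taken from any pyWATTS module) showing both access styles and both ways of extending the data:

# Hypothetical illustration (not pyWATTS code) of the inconsistencies above.
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=24, freq="H")
ds = xr.Dataset({"load": ("time", np.random.rand(24))}, coords={"time": time})

# Access by name vs. by positional index over the data variables:
by_name = ds["load"]
by_index = ds[list(ds.data_vars)[0]]

# Variant 1: a module writes its result as a new data variable ...
ds_new_var = ds.assign(load_scaled=ds["load"] / ds["load"].max())

# Variant 2: ... while another module stacks its result along an extra dimension.
da_new_dim = xr.concat(
    [ds["load"], ds["load"] / ds["load"].max()],
    dim=pd.Index(["raw", "scaled"], name="variant"),
)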

Solution
It is important that we clearly define:

  • What the inputs and outputs of a module are and how a transformation is handled - define a clear convention (see the sketch after this list).
  • How data variables are named, what the conventions are, and when they are renamed.
  • When we allow new dimensions and when we create a new data array.
  • How data sets are named, what the conventions are, and when they are renamed.
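
As a concrete starting point for this discussion, here is a minimal sketch of one possible convention (the module and the naming scheme are made up for illustration, not an existing pyWATTS interface): each module receives named xr.DataArray inputs as keyword arguments and returns a single DataArray whose name combines the input name and the module name.

# Sketch of one possible convention (hypothetical, for discussion only).
import xarray as xr

class ScalerModule:
    """Example module: scales an input series to [0, 1]."""

    name = "scaler"

    def transform(self, **inputs: xr.DataArray) -> xr.DataArray:
        # Convention: one named input, one output named "<input name>_<module name>".
        (key, data), = inputs.items()
        result = (data - data.min()) / (data.max() - data.min())
        return result.rename(f"{key}_{self.name}")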

Furthermore, once these decisions have been made, it is important that we document them clearly:

  • Documentation of module definition and behaviour.
  • Documentation regarding the handling of new dimensions/data arrays.
  • Documentation of the naming conventions.

Possible extensions:

  • Dynamic identification of the column names: depending on the decisions made above, it would be nice to have an automated approach to identify the column names. This could be possible by using the step information in each module and returning a list of column names (see the sketch below).
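
A rough sketch of what such an automated lookup could look like (the step attribute used here is an assumption for illustration, not the actual pyWATTS step API):

# Hypothetical sketch: collect output column names from the pipeline's steps.
from typing import List

def collect_column_names(pipeline) -> List[str]:
    columns = []
    for step in pipeline.id_to_step.values():
        # Assumption: each step exposes the name of the data variable it produces.
        if hasattr(step, "name"):
            columns.append(step.name)
    return columns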
kalebphipps added the documentation, core, and high labels on Oct 26, 2020
benHeid added the module label on Oct 26, 2020
olineumann (Collaborator) commented

I just reviewed the changes in the feature/16 branch. Currently, the examples 'example.py' and 'example_day_and_night.py' are not working for me because they crash with:

Traceback (most recent call last):
  File "example.py", line 109, in <module>
    pipeline.train(data=train)
  File "/home/oliver/git/pyWATTS/pywatts/core/pipeline.py", line 186, in train
    return self._run(data, ComputationMode.FitTransform)
  File "/home/oliver/git/pyWATTS/pywatts/core/pipeline.py", line 198, in _run
    return self.transform(**{key: data[key] for key in data.data_vars})
  File "/home/oliver/git/pyWATTS/pywatts/core/pipeline.py", line 72, in transform
    self.counter = list(x.values())[0].indexes[time_index[0]][0]  # The start date of the input time series.
IndexError: list index out of range

After looking at the code, the reason is that the time_index variable is an empty list: x is a dict of DataArrays, but the _get_time_indeces(x) method only collects DatetimeIndex keys when x itself is a DataArray, so it returns an empty list for a dict.

# pipeline.py
def transform(self, **x: xr.DataArray) -> xr.Dataset:
    """
    Transform the input into output, by performing all the step in this pipeline.
    Moreover, this method collects the results of the last steps in this pipeline.

    Note, this method is necessary for enabling subpipelining.

    :param x: The input data
    :type x: xr.Dataset
    :return:The transformed data
    :rtype: xr.Dataset
    """
    for key, (start_step, _) in self.start_steps.items():
        start_step.buffer = x[key].copy()
        start_step.finished = True

    time_index = _get_time_indeces(x)
    self.counter = list(x.values())[0].indexes[time_index[0]][0]  # The start date of the input time series.

    last_steps = list(filter(lambda x: x.last, self.id_to_step.values()))

    if not self.batch:
        return self._collect_results(last_steps)
    return self._collect_batches(last_steps, time_index)

# _xarray_time_series_utils.py
def _get_time_indeces(x: Dict[str, xr.DataArray]) -> List[str]:
    indexes = []
    if isinstance(x, xr.DataArray):
        for k, v in x.indexes.items():
            if isinstance(v, pd.DatetimeIndex):
                indexes.append(k)
        return indexes
    # TODO check that all inputs have the same time dimension?
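    # Note: when x is a dict of DataArrays (as passed from Pipeline.transform),
    # the isinstance check above is False and an empty list is returned.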
    return indexes

So does the _get_time_indeces method have to be updated? Or did I do something wrong?

benHeid (Collaborator) commented Dec 29, 2020

Thanks, you are right. Now, it should work.
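
For reference, a minimal sketch of how the dict case could be handled (an illustration only, not necessarily the actual change in the repository):

# Sketch of a dict-aware _get_time_indeces (illustration, may differ from the actual fix).
from typing import Dict, List, Union

import pandas as pd
import xarray as xr

def _get_time_indeces(x: Union[xr.DataArray, Dict[str, xr.DataArray]]) -> List[str]:
    indexes = []
    arrays = [x] if isinstance(x, xr.DataArray) else list(x.values())
    for array in arrays:
        for k, v in array.indexes.items():
            if isinstance(v, pd.DatetimeIndex) and k not in indexes:
                indexes.append(k)
    # TODO check that all inputs have the same time dimension?
    return indexes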

benHeid (Collaborator) commented Mar 24, 2021

Merged with #35.
