Transparency of Transformations #16

Closed
7 tasks done
kalebphipps opened this issue Oct 26, 2020 · 3 comments

kalebphipps (Collaborator) commented Oct 26, 2020

The Problem
Currently, different modules deal with data transformations in different ways:

  • Some modules access data variables via an index, others via a name.
  • There is no clear structure or system for how new data variables are named (sometimes the name is based on the target, sometimes on the module, etc.).
  • Some modules create a new data array while others add an extra dimension; there is no consistency here (illustrated in the sketch after this list).
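
To make these inconsistencies concrete, here is a small hypothetical xarray snippet (not taken from any pyWATTS module) showing both access styles and both ways of extending the data:

# Hypothetical illustration (not pyWATTS code) of the inconsistencies above.
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2020-01-01", periods=24, freq="H")
ds = xr.Dataset({"load": ("time", np.random.rand(24))}, coords={"time": time})

# Access by name vs. by positional index over the data variables:
by_name = ds["load"]
by_index = ds[list(ds.data_vars)[0]]

# Variant 1: a module writes its result as a new data variable ...
ds_new_var = ds.assign(load_scaled=ds["load"] / ds["load"].max())

# Variant 2: ... while another module stacks its result along an extra dimension.
da_new_dim = xr.concat(
    [ds["load"], ds["load"] / ds["load"].max()],
    dim=pd.Index(["raw", "scaled"], name="variant"),
)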

Solution
It is important that we clearly define:

  • What the inputs and outputs of a module are and how a transformation is handled - define a clear convention (see the sketch after this list).
  • How data variables are named, what the conventions are, and when they are renamed.
  • When we allow new dimensions and when we create a new data array.
  • How data sets are named, what the conventions are, and when they are renamed.
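
As a concrete starting point for this discussion, here is a minimal sketch of one possible convention (the module and the naming scheme are made up for illustration, not an existing pyWATTS interface): each module receives named xr.DataArray inputs as keyword arguments and returns a single DataArray whose name combines the input name and the module name.

# Sketch of one possible convention (hypothetical, for discussion only).
import xarray as xr

class ScalerModule:
    """Example module: scales an input series to [0, 1]."""

    name = "scaler"

    def transform(self, **inputs: xr.DataArray) -> xr.DataArray:
        # Convention: one named input, one output named "<input name>_<module name>".
        (key, data), = inputs.items()
        result = (data - data.min()) / (data.max() - data.min())
        return result.rename(f"{key}_{self.name}")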

Furthermore, once these decisions have been made, it is important that we document them clearly:

  • Documentation of module definition and behaviour.
  • Documentation regarding the handling of new dimensions/data arrays.
  • Documentation of the naming conventions.

Possible extensions:

  • Dynamic identification of the column names: depending on the decisions made above, it would be nice to have an automated approach to identify the column names. This could be possible by using the step information in each module and returning a list of column names (see the sketch below).
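
A rough sketch of what such an automated lookup could look like (the step attribute used here is an assumption for illustration, not the actual pyWATTS step API):

# Hypothetical sketch: collect output column names from the pipeline's steps.
from typing import List

def collect_column_names(pipeline) -> List[str]:
    columns = []
    for step in pipeline.id_to_step.values():
        # Assumption: each step exposes the name of the data variable it produces.
        if hasattr(step, "name"):
            columns.append(step.name)
    return columns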
kalebphipps added the documentation, core, and high labels on Oct 26, 2020
benHeid added the module label on Oct 26, 2020
olineumann (Collaborator) commented

I just reviewed the changes in the feature/16 branch. Currently, the examples 'example.py' and 'example_day_and_night.py' are not working for me because they crash with:

Traceback (most recent call last):
  File "example.py", line 109, in <module>
    pipeline.train(data=train)
  File "/home/oliver/git/pyWATTS/pywatts/core/pipeline.py", line 186, in train
    return self._run(data, ComputationMode.FitTransform)
  File "/home/oliver/git/pyWATTS/pywatts/core/pipeline.py", line 198, in _run
    return self.transform(**{key: data[key] for key in data.data_vars})
  File "/home/oliver/git/pyWATTS/pywatts/core/pipeline.py", line 72, in transform
    self.counter = list(x.values())[0].indexes[time_index[0]][0]  # The start date of the input time series.
IndexError: list index out of range

After looking at the code, the reason is that the time_index variable is an empty list: x is a dict of DataArrays, but the _get_time_indeces(x) method only collects DatetimeIndex keys when x itself is a DataArray, so it returns an empty list for a dict.

# pipeline.py
def transform(self, **x: xr.DataArray) -> xr.Dataset:
    """
    Transform the input into output, by performing all the step in this pipeline.
    Moreover, this method collects the results of the last steps in this pipeline.

    Note, this method is necessary for enabling subpipelining.

    :param x: The input data
    :type x: xr.Dataset
    :return:The transformed data
    :rtype: xr.Dataset
    """
    for key, (start_step, _) in self.start_steps.items():
        start_step.buffer = x[key].copy()
        start_step.finished = True

    time_index = _get_time_indeces(x)
    self.counter = list(x.values())[0].indexes[time_index[0]][0]  # The start date of the input time series.

    last_steps = list(filter(lambda x: x.last, self.id_to_step.values()))

    if not self.batch:
        return self._collect_results(last_steps)
    return self._collect_batches(last_steps, time_index)

# _xarray_time_series_utils.py
def _get_time_indeces(x: Dict[str, xr.DataArray]) -> List[str]:
    indexes = []
    if isinstance(x, xr.DataArray):
        for k, v in x.indexes.items():
            if isinstance(v, pd.DatetimeIndex):
                indexes.append(k)
        return indexes
    # TODO check that all inputs have the same time dimension?
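    # Note: when x is a dict of DataArrays (as passed from Pipeline.transform),
    # the isinstance check above is False and an empty list is returned.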
    return indexes

So does the _get_time_indeces method have to be updated? Or did I do something wrong?

benHeid (Collaborator) commented Dec 29, 2020

Thanks, you are right. Now, it should work.
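
For reference, a minimal sketch of how the dict case could be handled (an illustration only, not necessarily the actual change in the repository):

# Sketch of a dict-aware _get_time_indeces (illustration, may differ from the actual fix).
from typing import Dict, List, Union

import pandas as pd
import xarray as xr

def _get_time_indeces(x: Union[xr.DataArray, Dict[str, xr.DataArray]]) -> List[str]:
    indexes = []
    arrays = [x] if isinstance(x, xr.DataArray) else list(x.values())
    for array in arrays:
        for k, v in array.indexes.items():
            if isinstance(v, pd.DatetimeIndex) and k not in indexes:
                indexes.append(k)
    # TODO check that all inputs have the same time dimension?
    return indexes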

benHeid (Collaborator) commented Mar 24, 2021

Merged with #35.
