Modularize default argument handling for datasets #14

deepyaman · 2019-06-09T21:33:29Z

Description

Near-identical code to handle default arguments is replicated in almost every dataset implementation. Worse still, functionality across said datasets is the same, but implementation is inconsistent.

Context

When I want to implement a new dataset, I look at existing datasets as a baseline for implementing my own. However, there are inconsistencies between these datasets, from the more minor (save_args handled after load_args for some datasets), to the slightly more significant (special casing where there are no default arguments on some datasets but not others) and worse (one case where arguments are evaluated for truthiness instead of is not None) (see https://github.com/quantumblacklabs/kedro/blob/0.14.1/kedro/contrib/io/azure/csv_blob.py#L109-L113 as an example representing several of the above). I don't know which one to follow to maintain consistency across the codebase.

Possible Implementation

#15

By having DEFAULT_LOAD_ARGS/DEFAULT_SAVE_ARGS attributes, users can also see the defaults programmatically (with the caveat that this is a drawback if you consider the few cases where such arguments don't apply, like no save on SqlQueryDataSet or in general on LambdaDataSet/MemoryDataSet).

Possible Alternatives

Create an intermediate abstract dataset class (or mixin?) so as to not modify AbstractDataSet and thereby only apply to those with load_args/save_args
Move default argument handling into a utility function and call it from each individual __init__ method (not preferred)

Checklist

Include labels so that we can categorise your issue:

Add a "Component" label to the issue
Add a "Priority" label to the issue

The text was updated successfully, but these errors were encountered:

tsanikgr · 2019-06-10T10:03:22Z

Hi @deepyaman , thanks a lot for pointing this out. It's something I 've thought of proposing to fix quite a few times but was never a priority.

Another solution to consider would be to have default_save_args and default_load_args as class variables (as this is what they really are): (edit: This is actually what you did as well, sorry, somehow I missed that 🤦‍♂ )

class Base:
      default_save_args = {}
      default_load_args = {}

      @property
      def _load_args(self):
           return ({**self.default_load_args, **self._load_args_} 
                   if hasattr(self, '_load_args_') else {**self.default_load_args})

      @property
      def _save_args(self):
           return ({**self.default_save_args, **self._save_args_} 
                   if hasattr(self, '_save_args_') else {**self.default_save_args})

      @_load_args.setter
      def _load_args(self, load_args):
            self._load_args_ = load_args if load_args is not None else {}

      @_save_args.setter
      def _save_args(self, save_args):
            self._save_args_ = save_args if save_args is not None else {}

class Child(Base):
     default_save_args = {'index': False}

     def __init__(self, load_args=None, save_args=None):
          self._load_args = load_args
          self._save_args = save_args

So that:

In [7]: c = Child({'hi': 'there'}, {'extra': 1})

In [8]: c._load_args
Out[8]: {'hi': 'there'}

In [9]: c._save_args
Out[9]: {'index': False, 'extra': 1}

In [10]: c.default_save_args
Out[10]: {'index': False}

This would avoid the __init__ on the parent class, remove the pylint: disable=super-init-not-called or simplify the code for classes that don't make use of defaults.

I also like you proposition of making the default_*_args a public attribute, its kind of hidden in the constructor atm, so less "magic" for our users

Btw, I m not sure if what I suggested is the right way
Maybe wait and see what others also say? @idanov @tolomea

deepyaman · 2019-06-10T12:58:24Z

@tsanikgr Thanks for the feedback! I believe the property implementations can be further simplified as:

@property
def _load_args(self):
    return {**self.default_load_args, **getattr(self, '_load_args_', {})}

(and the analogous method for _save_args)

tsanikgr · 2019-06-10T13:02:13Z

👏 Indeed! Unfortunately, we can't avoid the setters, otherwise self._loads_args = load_args in the constructor of children won't work

deepyaman · 2019-06-10T13:41:55Z

I'm fine with the above approach and happy to make the changes if others agree. @idanov @tolomea

yetudada · 2019-06-10T14:44:49Z

Thank you so much for this @deepyaman! We'll await feedback from @idanov on this and will get back to you.

idanov · 2019-06-18T15:40:14Z

Thanks @deepyaman for raising this. We've had discussions internally and this has come up a few times already. However I don't think this is something which should be added - we should leave each implementation to deal with default arguments as they see fit - they might not even have default arguments if they want to.

The dataset abstraction is meant to be not very prescriptive in order to allow for all heterogenous and very different in their own way datasets. Adding more structure to the abstraction will make that functionality usefull for only some datasets but not others. That is going to undermine the abstraction - abstract classes are not meant to remove code repetition, but to create useful and general abstractions. Also in this case we are talking about very small code repetition, which does not justify paying the price of having a more complex and bad abstraction at the end.

Having heterogenous classes means that the code will remain to be federated by nature, which will inevitably result in some code repetition here and there, and that's not such a bad thing.

Duplication is far cheaper than the wrong abstraction

See https://www.sandimetz.com/blog/2016/1/20/the-wrong-abstraction

deepyaman · 2019-06-19T00:38:48Z

@idanov I understand your point. That being said, there are options (from the Possible Alternatives section in the original issue) that leave the ABC as is. Copied from above:

Create a mixin so as to not modify AbstractDataSet and thereby only apply to those with load_args/save_args (e.g. DefaultArgsMixin)
Move default argument handling into a utility function and call it from each individual init method (not preferred)

Finally, from the Context provided in the original issue:

When I want to implement a new dataset, I look at existing datasets as a baseline for implementing my own. However, there are inconsistencies between these datasets [...]. I don't know which one to follow to maintain consistency across the codebase.

At the very minimum, I feel like it would be an improvement to standardize this functionality across existing datasets, even if the code is identical/copy-pasted in each.

Thoughts?

idanov · 2019-07-02T14:30:48Z

@deepyaman I agree with your last point, so I suppose we can do at least that. I like your idea for DefaultArgsMixin, however we need to find more benefits to it to justify paying the price in terms of complexity. For the time being, we can do that for contrib datasets only, where the built-in will just have identical copy of the defaults setting code.

deepyaman · 2019-07-03T23:05:44Z

@deepyaman I agree with your last point, so I suppose we can do at least that. I like your idea for DefaultArgsMixin, however we need to find more benefits to it to justify paying the price in terms of complexity. For the time being, we can do that for contrib datasets only, where the built-in will just have identical copy of the defaults setting code.

@idanov Sounds great! I've updated #15 with the above changes.

deepyaman added the Issue: Feature Request New feature or improvement to existing feature label Jun 9, 2019

deepyaman mentioned this issue Jun 9, 2019

Modularize default load and save argument handling #15

Merged

8 tasks

tsanikgr added the Component: I/O label Jun 10, 2019

yetudada assigned idanov and tolomea Jun 10, 2019

yetudada added the Priority: Low label Jun 10, 2019

idanov closed this as completed in #15 Jul 23, 2019

deepyaman mentioned this issue Oct 31, 2022

Easier CustomDataset Creation #1936

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modularize default argument handling for datasets #14

Modularize default argument handling for datasets #14

deepyaman commented Jun 9, 2019 •

edited

Loading

tsanikgr commented Jun 10, 2019 •

edited

Loading

deepyaman commented Jun 10, 2019 •

edited

Loading

tsanikgr commented Jun 10, 2019 •

edited

Loading

deepyaman commented Jun 10, 2019

yetudada commented Jun 10, 2019

idanov commented Jun 18, 2019 •

edited

Loading

deepyaman commented Jun 19, 2019

idanov commented Jul 2, 2019

deepyaman commented Jul 3, 2019

Modularize default argument handling for datasets #14

Modularize default argument handling for datasets #14

Comments

deepyaman commented Jun 9, 2019 • edited Loading

Description

Context

Possible Implementation

Possible Alternatives

Checklist

tsanikgr commented Jun 10, 2019 • edited Loading

deepyaman commented Jun 10, 2019 • edited Loading

tsanikgr commented Jun 10, 2019 • edited Loading

deepyaman commented Jun 10, 2019

yetudada commented Jun 10, 2019

idanov commented Jun 18, 2019 • edited Loading

deepyaman commented Jun 19, 2019

idanov commented Jul 2, 2019

deepyaman commented Jul 3, 2019

deepyaman commented Jun 9, 2019 •

edited

Loading

tsanikgr commented Jun 10, 2019 •

edited

Loading

deepyaman commented Jun 10, 2019 •

edited

Loading

tsanikgr commented Jun 10, 2019 •

edited

Loading

idanov commented Jun 18, 2019 •

edited

Loading