
Refactoring? #541

Closed
mpariente opened this issue Nov 22, 2019 · 14 comments
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@mpariente
Contributor

Thanks for Lightning, it's a nice tool.

Trainer already has around 30 arguments and open PRs keep adding more (see #516 and #539 for example). The more features, the more arguments, and that's not really scalable IMHO.

Some of these could be grouped into objects instead. Are you considering refactoring Trainer?
That would help the community see where the project is going and in which cases to use it.

mpariente added the feature and help wanted labels on Nov 22, 2019
@Borda
Member

Borda commented Nov 22, 2019

Good point, the API parameters should not grow infinitely... @mpariente do you have an idea or a specific recommendation?

@DrClick

DrClick commented Nov 22, 2019

I took a stab at it.

:param max_nb_epochs: int.
:param min_nb_epochs: int.


logging
        :param logger: Logger for experiment tracking
        :param log_save_interval: int. Writes logs to disk this often
        :param row_log_interval: int. How often to add logging rows
        :param add_row_log_interval: int. How often to add logging rows. Deprecated.
        :param log_gpu_memory: str. None, 'min_max', 'all'
        
callbacks
        :param checkpoint_callback: Callback for checkpointing
        :param early_stop_callback: Callback for early stopping

model_io
        :param default_save_path: Default path for logs+weights if no logger/ckpt_callback passed
        :param weights_save_path: str. Where to save weights if on cluster


modeling
        :param gradient_clip_val: int. 0 means don't clip.
        :param gradient_clip: int. 0 means don't clip. Deprecated.
        :param accumulate_grad_batches: int. Accumulates grads every k batches

distributed
        :param nb_gpu_nodes: number of GPU nodes
        :param gpus: int. (ie: 2 gpus) OR list to specify which GPUs [0, 1] OR '0,1'
            OR '-1' / -1 to use all available gpus

        :param distributed_backend: str. Options: 'dp', 'ddp', 'ddp2'.
        :param use_amp: Bool. If true uses apex for 16bit precision
        :param amp_level: str. Check nvidia docs for level

training
        :param process_position: shown in the tqdm bar
        :param show_progress_bar: Bool. If true shows tqdm bar
        :param overfit_pct: float. uses this much of all datasets
        :param track_grad_norm: int. -1 no tracking. Otherwise tracks that norm


development
        :param check_val_every_n_epoch: int. check val every n train epochs
        :param fast_dev_run: Bool. runs full iteration over everything to find bugs
        :param train_percent_check: int. How much of train set to check
        :param val_percent_check: int. How much of val set to check
        :param test_percent_check: int. How much of test set to check
        :param val_check_interval: float/int. If float, % of tng epoch. If int, check every n batch
        :param no_validation: bool. If true, skips all validation steps even if validation is defined.
        :param nb_sanity_val_steps: int. How many val steps before a full train loop.
        :param truncated_bptt_steps: int. Enables multiple backward passes for each batch.

debugging
        :param print_nan_grads: Bool. Prints nan gradients
        :param weights_summary: str. Options: 'full', 'top', None to not print.
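
A minimal sketch of how a couple of these groups could become configuration objects, assuming hypothetical dataclass names derived from the headings above; none of this is existing Lightning API.

# Hypothetical grouping of two of the sections above into dataclasses;
# class and field names are illustrative only.
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class LoggingConfig:
    logger: Union[bool, object] = True              # experiment-tracking logger
    log_save_interval: int = 100                    # write logs to disk this often
    row_log_interval: int = 10                      # how often to add logging rows
    log_gpu_memory: Optional[str] = None            # None, 'min_max', or 'all'

@dataclass
class DistributedConfig:
    nb_gpu_nodes: int = 1                           # number of GPU nodes
    gpus: Union[int, List[int], str, None] = None   # e.g. 2, [0, 1], '0,1', or -1
    distributed_backend: Optional[str] = None       # 'dp', 'ddp', or 'ddp2'
    use_amp: bool = False                           # use apex for 16-bit precision
    amp_level: str = 'O1'                           # see NVIDIA apex docs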


@Borda
Member

Borda commented Nov 22, 2019

@DrClick very nice work! Do you have an idea of how to pass them as "groups"?
Something like a dictionary per group? :)

@mpariente
Contributor Author

Thanks for the quick answer :-)
Not sure I'd group them this way, but that's one way!

I would avoid grouping them into dictionaries; it gives the impression that there are fewer parameters but doesn't bring much IMO.

There is the Callback abstract class here in lightning, which is more or less taken from Keras 1.x and could be pretty useful for some of these options.
Instead of calling self.early_stopping_callback.on_epoch_end(), a list of callbacks could be called, very much like in Keras. That would also allow users to create their own callbacks, which would be great!
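
A minimal sketch of the Keras-style dispatch being suggested, assuming a bare-bones Callback base class and trainer loop; this is not how Lightning is implemented today.

# Hypothetical Keras-style callback dispatch; class names are illustrative only.
class Callback:
    def on_epoch_begin(self, trainer):
        pass

    def on_epoch_end(self, trainer):
        pass

class Trainer:
    def __init__(self, callbacks=None):
        self.callbacks = callbacks or []

    def fit(self, n_epochs):
        for epoch in range(n_epochs):
            for callback in self.callbacks:
                callback.on_epoch_begin(self)
            # ... run one training epoch ...
            for callback in self.callbacks:
                callback.on_epoch_end(self)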

@jeffling
Contributor

jeffling commented Nov 22, 2019

@mpariente +1 for creating callbacks, it's something we're jury-rigging Trainer to do at my work :) I'm really swamped lately, but I'd love to push our callback implementation upstream.

Re: grouping, I'm uncertain about the proposal if we're just passing the objects into the same constructor anyway, as we may just be painting over the complexity. I.e., what's the difference between using dicts vs. better namespacing?

A potential suggestion would be a further refactor that separates functionality out into public-facing callbacks, where each callback takes its own parameters. Having all 'distributed' functionality live in a callback could be an example (see the sketch below).
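
A rough sketch of what a public-facing callback owning its own parameters might look like, using a hypothetical DistributedCallback; none of these names exist in Lightning today.

# Hypothetical public-facing callback that owns its own configuration;
# DistributedCallback and its parameters are illustrative only.
class Callback:
    def on_fit_begin(self, trainer):
        pass

class DistributedCallback(Callback):
    def __init__(self, gpus=None, backend='ddp', use_amp=False, amp_level='O1'):
        self.gpus = gpus
        self.backend = backend
        self.use_amp = use_amp
        self.amp_level = amp_level

    def on_fit_begin(self, trainer):
        # set up process groups, wrap the model, configure AMP, etc.
        pass

# usage sketch: trainer = Trainer(callbacks=[DistributedCallback(gpus=[0, 1])])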

Proposal:

  1. For now, don't make any big changes to these arguments, but enforce strong namespacing for all arguments going forward.
  2. Think about ways to actually reduce the number of arguments, to tackle the problem at its core.

@Borda
Member

Borda commented Nov 22, 2019

I like the idea of callbacks...

@mpariente
Contributor Author

@jeffling I agree, grouping into dictionaries just gives the illusion that there are fewer parameters; it's a bad idea.
Would something like this make sense?

# Sketch of a proposed grouped-config API; GradientChecker, TrainingLogger
# and DistributedConfig are hypothetical names, not existing classes.
gradient_checker = GradientChecker(
    gradient_clip_val=0,
    gradient_clip=None,  # backward compatible
    track_grad_norm=-1,
    accumulate_grad_batches=1,
    print_nan_grads=False)

training_logger = TrainingLogger(
    logger=True,
    log_gpu_memory=None,
    show_progress_bar=True,
    row_log_interval=10,
    add_row_log_interval=None)

distributed_config = DistributedConfig(...)
callbacks = [EarlyStopping(...), ModelCheckpoint(...), ...]
trainer = Trainer(model, gradient_checker=gradient_checker,
                  logger=training_logger, dd_config=distributed_config,
                  callbacks=callbacks)
trainer.fit()

@williamFalcon
Contributor

@tullie @neggert any thoughts?

This is definitely a good conversation to have. One benefit of having everything flattened out is that it's easy to see what's possible. The downside is a long list of args.

If we break things up into namespaces/callbacks (dicts are bad because the IDE can't help with autocomplete), then we get logical groupings of functionality. The downside is that we add a bit of friction for the user.

@mpariente
Contributor Author

Some other points:

  • If I may, having all the arguments only shows what is possible to a certain extent.
    If you don't read the source code here, you cannot really know that you can pass a customized instance of EarlyStopping (see the sketch after this list). This is not really written in the function docs and I didn't see it in the examples either. Same for ModelCheckpoint.

  • If the docs for DistributedConfig(...), for example, are well written, that is even clearer than the intertwined docs in Trainer. For instance, use_amp and amp_level are not grouped together in Trainer's docstring.

  • Regarding a kind of GradientChecker: currently, the gradients are inspected at least twice, here and here, and this is done for every batch. Looping through all the parameters is not free, and having one class do all the gradient checks would make it possible to do it only once.
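
For context, passing customized instances looks roughly like this; the exact import paths, parameter names and Trainer arguments may differ between Lightning versions, so treat this as a sketch rather than documentation.

# Rough sketch of passing customized callbacks to Trainer; argument names
# may vary across Lightning versions.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(monitor='val_loss', patience=5, mode='min')
checkpoint = ModelCheckpoint(filepath='checkpoints/', monitor='val_loss')

trainer = Trainer(early_stop_callback=early_stop,
                  checkpoint_callback=checkpoint)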

@Borda
Member

Borda commented Nov 23, 2019

I believe that the documentation should be gradually upgraded/improved in follow-up PRs after #521

@dreamgonfly
Contributor

dreamgonfly commented Nov 25, 2019

We need to seriously consider the downside of grouping. For example, pandas' read_csv function ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html ) has ~50 arguments, but it works like a charm. That's because its arguments are orthogonal to each other and it's well documented. IMHO, callbacks are for allowing users to define their own methods, not just parameter values. I might be wrong, but callbacks used just for grouping arguments don't seem very useful to me.
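
As an illustration of that orthogonality (the file path below is just a placeholder), each keyword controls one independent aspect of parsing:

import pandas as pd

# Each keyword controls an independent aspect of parsing, which is what
# keeps the long signature manageable.
df = pd.read_csv(
    'data.csv',                # placeholder path
    sep=',',                   # delimiter
    header=0,                  # row that holds the column names
    usecols=['a', 'b'],        # subset of columns to load
    dtype={'a': float},        # per-column types
    parse_dates=['b'],         # columns to parse as datetimes
    nrows=1000,                # limit the number of rows read
)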

@neggert
Contributor

neggert commented Nov 25, 2019

I agree that it's getting to be a lot, but I think grouping/namespacing/similar just hides the problem. IMO we just need to be cognizant of the problem and be a little more careful about approving PRs that add more args. #536 is a good example of a use case that we can handle without adding a new arg.

One other thing to think about with this huge number of arguments is that we keep seeing bugs because someone tried a combination of args that we haven't tested. This is something I've been thinking about, but I've yet to come up with a good solution that doesn't blow up our test time.

@williamFalcon
Contributor

@neggert yeah... same thought. Other than randomly sampling combinations, not sure what else we could do.
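
A minimal sketch of what randomly sampling arg combinations could look like in a pytest suite; the argument grid and sample size below are made up for illustration and are not the real test matrix.

# Illustrative random sampling of Trainer-arg combinations for tests.
import itertools
import random

import pytest

ARG_GRID = {
    "accumulate_grad_batches": [1, 2],
    "gradient_clip_val": [0, 0.5],
    "overfit_pct": [0.0, 0.1],
}

def sample_combinations(grid, k, seed=0):
    """Pick k combinations at random from the full cartesian product."""
    keys = sorted(grid)
    combos = list(itertools.product(*(grid[key] for key in keys)))
    random.Random(seed).shuffle(combos)
    return [dict(zip(keys, values)) for values in combos[:k]]

@pytest.mark.parametrize("trainer_kwargs", sample_combinations(ARG_GRID, k=4))
def test_trainer_arg_combination(trainer_kwargs):
    # A real test would build a Trainer(**trainer_kwargs) and run a fast dev
    # run on a small model; here we only check the sampled kwargs are well-formed.
    assert set(trainer_kwargs) == set(ARG_GRID)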

@williamFalcon
Contributor

williamFalcon commented Nov 26, 2019

Resolution to this:
Let's keep the list of args:

  • maybe adjust the order so they are logically grouped.
  • set a very, very high bar for accepting a new arg.
  • maybe think of a way to collapse some args into each other.
