
Refactoring? #541

Closed
mpariente opened this issue Nov 22, 2019 · 14 comments
Labels: feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@mpariente
Contributor

Thanks for Lightning, it's a nice tool.

Trainer already has around 30 arguments and open PRs keep adding more (see #516 and #539 for example). The more features, the more arguments, and that's not really scalable IMHO.

Some of these could be grouped into objects instead. Are you considering refactoring Trainer?
That would help the community see where the project is going and in which cases to use it.

mpariente added the feature and help wanted labels on Nov 22, 2019
@Borda
Member

Borda commented Nov 22, 2019

Good point, the API parameters should not grow infinitely... @mpariente do you have an idea or a specific recommendation?

@DrClick

DrClick commented Nov 22, 2019

I took a stab at it.

:param max_nb_epochs: int.
:param min_nb_epochs: int.


logging
        :param logger: Logger for experiment tracking
        :param log_save_interval: int. Writes logs to disk this often
        :param row_log_interval: int. How often to add logging rows
        :param add_row_log_interval: int. How often to add logging rows. Deprecated.
        :param log_gpu_memory: str. None, 'min_max', 'all'
        
callbacks
        :param checkpoint_callback: Callback for checkpointing
        :param early_stop_callback: Callback for early stopping

model_io
        :param default_save_path: Default path for logs+weights if no logger/ckpt_callback passed
        :param weights_save_path: str. Where to save weights if on cluster


modeling
        :param gradient_clip_val: int. 0 means don't clip.
        :param gradient_clip: int. 0 means don't clip. Deprecated.
        :param accumulate_grad_batches: int. Accumulates grads every k batches

distributed
        :param nb_gpu_nodes: number of GPU nodes
        :param gpus: int. (ie: 2 gpus) OR list to specify which GPUs [0, 1] OR '0,1'
            OR '-1' / -1 to use all available gpus

        :param distributed_backend: str. Options: 'dp', 'ddp', 'ddp2'.
        :param use_amp: Bool. If true uses apex for 16bit precision
        :param amp_level: str. Check nvidia docs for level

training
        :param process_position: shown in the tqdm bar
        :param show_progress_bar: Bool. If true shows tqdm bar
        :param overfit_pct: float. uses this much of all datasets
        :param track_grad_norm: int. -1 no tracking. Otherwise tracks that norm


development
        :param check_val_every_n_epoch: int. check val every n train epochs
        :param fast_dev_run: Bool. runs full iteration over everything to find bugs
        :param train_percent_check: int. How much of train set to check
        :param val_percent_check: int. How much of val set to check
        :param test_percent_check: int. How much of test set to check
        :param val_check_interval: float/int. If float, % of tng epoch. If int, check every n batch
        :param no_validation: bool. If true, skips all validation steps even if validation is defined.
        :param nb_sanity_val_steps: int. How many val steps before a full train loop.
        :param truncated_bptt_steps: int. Enables multiple backward passes for each batch.

debugging
        :param print_nan_grads: Bool. Prints nan gradients
        :param weights_summary: str. Options: 'full', 'top', None to not print.
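
A minimal sketch of how a couple of these groups could become configuration objects, assuming hypothetical dataclass names derived from the headings above; none of this is existing Lightning API.

# Hypothetical grouping of two of the sections above into dataclasses;
# class and field names are illustrative only.
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class LoggingConfig:
    logger: Union[bool, object] = True              # experiment-tracking logger
    log_save_interval: int = 100                    # write logs to disk this often
    row_log_interval: int = 10                      # how often to add logging rows
    log_gpu_memory: Optional[str] = None            # None, 'min_max', or 'all'

@dataclass
class DistributedConfig:
    nb_gpu_nodes: int = 1                           # number of GPU nodes
    gpus: Union[int, List[int], str, None] = None   # e.g. 2, [0, 1], '0,1', or -1
    distributed_backend: Optional[str] = None       # 'dp', 'ddp', or 'ddp2'
    use_amp: bool = False                           # use apex for 16-bit precision
    amp_level: str = 'O1'                           # see NVIDIA apex docs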


@Borda
Member

Borda commented Nov 22, 2019

@DrClick very nice work! Do you have an idea of how to pass them as "groups"?
Something like a dictionary per group? :)

@mpariente
Contributor Author

Thanks for the quick answer :-)
Not sure I'd group them this way, but that's one way!

I would avoid grouping them into dictionaries; it gives the impression that there are fewer parameters but doesn't bring much IMO.

There is the Callback abstract class here in lightning, which is more or less taken from Keras 1.x and could be pretty useful for some of these options.
Instead of calling self.early_stopping_callback.on_epoch_end(), a list of callbacks could be called, very much like in Keras. That would also allow users to create their own callbacks, which would be great!
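
A minimal sketch of the Keras-style dispatch being suggested, assuming a bare-bones Callback base class and trainer loop; this is not how Lightning is implemented today.

# Hypothetical Keras-style callback dispatch; class names are illustrative only.
class Callback:
    def on_epoch_begin(self, trainer):
        pass

    def on_epoch_end(self, trainer):
        pass

class Trainer:
    def __init__(self, callbacks=None):
        self.callbacks = callbacks or []

    def fit(self, n_epochs):
        for epoch in range(n_epochs):
            for callback in self.callbacks:
                callback.on_epoch_begin(self)
            # ... run one training epoch ...
            for callback in self.callbacks:
                callback.on_epoch_end(self)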

@jeffling
Contributor

jeffling commented Nov 22, 2019

@mpariente +1 for creating callbacks, it's something we're jury-rigging Trainer to do at my work :) I'm really swamped lately, but I'd love to push our callback implementation upstream.

Re: grouping, I'm uncertain about the proposal if we're just passing the objects into the same constructor anyway, as we may just be painting over the complexity. I.e., what's the difference between using dicts vs. better namespacing?

A potential suggestion would be a further refactor that separates functionality out into public-facing callbacks, where each callback takes its own parameters. Having all 'distributed' functionality live in a callback could be an example (see the sketch below).
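
A rough sketch of what a public-facing callback owning its own parameters might look like, using a hypothetical DistributedCallback; none of these names exist in Lightning today.

# Hypothetical public-facing callback that owns its own configuration;
# DistributedCallback and its parameters are illustrative only.
class Callback:
    def on_fit_begin(self, trainer):
        pass

class DistributedCallback(Callback):
    def __init__(self, gpus=None, backend='ddp', use_amp=False, amp_level='O1'):
        self.gpus = gpus
        self.backend = backend
        self.use_amp = use_amp
        self.amp_level = amp_level

    def on_fit_begin(self, trainer):
        # set up process groups, wrap the model, configure AMP, etc.
        pass

# usage sketch: trainer = Trainer(callbacks=[DistributedCallback(gpus=[0, 1])])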

Proposal:

  1. For now, don't make any big changes to these arguments, but enforce strong namespacing for all arguments going forward.
  2. Think about ways to actually reduce the number of arguments, to tackle the problem at its core.

@Borda
Member

Borda commented Nov 22, 2019

I like the idea of callbacks...

@mpariente
Contributor Author

@jeffling I agree, grouping into dictionaries just gives the illusion that there are fewer parameters; it's a bad idea.
Would something like this make sense?

# Sketch of a proposed grouped-config API; GradientChecker, TrainingLogger
# and DistributedConfig are hypothetical names, not existing classes.
gradient_checker = GradientChecker(
    gradient_clip_val=0,
    gradient_clip=None,  # backward compatible
    track_grad_norm=-1,
    accumulate_grad_batches=1,
    print_nan_grads=False)

training_logger = TrainingLogger(
    logger=True,
    log_gpu_memory=None,
    show_progress_bar=True,
    row_log_interval=10,
    add_row_log_interval=None)

distributed_config = DistributedConfig(...)
callbacks = [EarlyStopping(...), ModelCheckpoint(...), ...]
trainer = Trainer(model, gradient_checker=gradient_checker,
                  logger=training_logger, dd_config=distributed_config,
                  callbacks=callbacks)
trainer.fit()

@williamFalcon
Contributor

@tullie @neggert any thoughts?

This is definitely a good conversation to have. One benefit of having everything flattened out is that it's easy to see what's possible. The downside is a long list of args.

If we break things up into namespaces/callbacks (dicts are bad because the IDE can't help with autocomplete), then we get logical groupings of functionality. The downside is that we add a bit of friction for the user.

@mpariente
Contributor Author

Some other points:

  • If I may, having all the arguments only shows what is possible to a certain extent.
    If you don't read the source code here, you cannot really know that you can pass a customized instance of EarlyStopping (see the sketch after this list). This is not really written in the function docs and I didn't see it in the examples either. Same for ModelCheckpoint.

  • If the docs for DistributedConfig(...), for example, are well written, that is even clearer than the intertwined docs in Trainer. For instance, use_amp and amp_level are not grouped together in Trainer's docstring.

  • Regarding a kind of GradientChecker: currently, the gradients are inspected at least twice, here and here, and this is done for every batch. Looping through all the parameters is not free, and having one class do all the gradient checks would make it possible to do it only once.
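
For context, passing customized instances looks roughly like this; the exact import paths, parameter names and Trainer arguments may differ between Lightning versions, so treat this as a sketch rather than documentation.

# Rough sketch of passing customized callbacks to Trainer; argument names
# may vary across Lightning versions.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(monitor='val_loss', patience=5, mode='min')
checkpoint = ModelCheckpoint(filepath='checkpoints/', monitor='val_loss')

trainer = Trainer(early_stop_callback=early_stop,
                  checkpoint_callback=checkpoint)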

@Borda
Member

Borda commented Nov 23, 2019

I believe that the documentation should be gradually upgraded/improved in follow-up PRs after #521

@dreamgonfly
Contributor

dreamgonfly commented Nov 25, 2019

We need to seriously consider the downside of grouping. For example, pandas' read_csv function ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html ) has ~50 arguments, but it works like a charm. That's because its arguments are orthogonal to each other and it's well documented. IMHO, callbacks are for allowing users to define their own methods, not just parameter values. I might be wrong, but callbacks used just for grouping arguments don't seem very useful to me.
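
As an illustration of that orthogonality (the file path below is just a placeholder), each keyword controls one independent aspect of parsing:

import pandas as pd

# Each keyword controls an independent aspect of parsing, which is what
# keeps the long signature manageable.
df = pd.read_csv(
    'data.csv',                # placeholder path
    sep=',',                   # delimiter
    header=0,                  # row that holds the column names
    usecols=['a', 'b'],        # subset of columns to load
    dtype={'a': float},        # per-column types
    parse_dates=['b'],         # columns to parse as datetimes
    nrows=1000,                # limit the number of rows read
)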

@neggert
Contributor

neggert commented Nov 25, 2019

I agree that it's getting to be a lot, but I think grouping/namespacing/similar just hides the problem. IMO we just need to be cognizant of the problem and be a little more careful about approving PRs that add more args. #536 is a good example of a use case that we can handle without adding a new arg.

One other thing to think about with this huge number of arguments is that we keep seeing bugs because someone tried a combination of args that we haven't tested. This is something I've been thinking about, but I've yet to come up with a good solution that doesn't blow up our test time.

@williamFalcon
Contributor

@neggert yeah... same thought. Other than randomly sampling combinations, not sure what else we could do.
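
A minimal sketch of what randomly sampling arg combinations could look like in a pytest suite; the argument grid and sample size below are made up for illustration and are not the real test matrix.

# Illustrative random sampling of Trainer-arg combinations for tests.
import itertools
import random

import pytest

ARG_GRID = {
    "accumulate_grad_batches": [1, 2],
    "gradient_clip_val": [0, 0.5],
    "overfit_pct": [0.0, 0.1],
}

def sample_combinations(grid, k, seed=0):
    """Pick k combinations at random from the full cartesian product."""
    keys = sorted(grid)
    combos = list(itertools.product(*(grid[key] for key in keys)))
    random.Random(seed).shuffle(combos)
    return [dict(zip(keys, values)) for values in combos[:k]]

@pytest.mark.parametrize("trainer_kwargs", sample_combinations(ARG_GRID, k=4))
def test_trainer_arg_combination(trainer_kwargs):
    # A real test would build a Trainer(**trainer_kwargs) and run a fast dev
    # run on a small model; here we only check the sampled kwargs are well-formed.
    assert set(trainer_kwargs) == set(ARG_GRID)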

@williamFalcon
Contributor

williamFalcon commented Nov 26, 2019

Resolution to this:
Let's keep the list of args:

  • maybe adjust the order so they are logically grouped.
  • set a very, very high bar for accepting a new arg.
  • maybe think of a way to collapse some args into each other.
