Unify usage of multiple callbacks #896
@ethanwharris let's split the discussion and talk about callbacks here... |
I don't plan to incorporate […]. Same for the loggers logic: they should use the callback system IMO (all the custom callbacks I use are loggers...). |
interesting idea, and true that a logger is a special case of a callback, just not sure if it won't be too confusing for a user... |
For the user, it's easy to make this change invisible. Just make the |
sure, changing the parent class of loggers does not disturb any user; I was more or less thinking about passing callbacks and loggers in the same argument (a list) instead of having separate arguments for callbacks and loggers |
But loggers and callbacks are the same thing, so I don't see any issue, and new users should not be disturbed IMO. As for existing users and backward compatibility, if the documentation is clear about it, everything should be fine. |
I'm not convinced that there is an advantage to making Loggers into Callbacks. Currently there is a lot of inter-dependency between checkpointing and logging (the checkpointer directory depends on logger name / version, for example); I can see that getting quite messy if everything is of the same class and invoked with the same command. There's also the problem of callback hell, where everything can become a callback if the API is flexible enough, which we should probably avoid. That said, I'm not against the idea if it doesn't disturb the user, just not sure what value it would bring in that case. |
@PyTorchLightning/core-contributors what do you think? is it reasonable/beneficial? 🤖 |
my two cents:
I think it will be more user friendly to have the trainer keep these separate (eg.
+1. As I mentioned in #595, I think we should allow user-defined callbacks to be passed in as a list and execute them in the order in which they're provided.
one thing that we'll want to consider is what information we'll expose to the callbacks. do we want to expose all model attributes? |
@Borda i'm not sure i love this idea. this means you wouldn't get early stopping for free or checkpointing for free. most of the time users don't care about the details of those and a simple True is enough. This means now you'd have to specify all the details. i think this clutters the API and makes it unnecessarily complex. We moved the callbacks to default arguments after a ton of feedback on people being annoyed they had to init them EVERY time. and frankly, the keras callbacks are super unintuitive and confusing. So, unless there's a clear API win, i'm going to push back on this. |
I think keeping a […]. Both things are not exclusive. |
but then we have multiple ways of doing something which is confusing. ie: do i pass in the callback to early stopping or the callbacks list? having many choices for something is bad and against our principles. We want to force the user to make a single decision rooted in best practices and sensible defaults. |
I was thinking of not breaking the API as it is now, so there will still be an option to set checkpointing to True, but inside the trainer it would create a callback instance which would be added to the list with the others passed via the callback argument... |
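A minimal sketch of what that could look like inside the trainer (hypothetical class and argument names, not the actual implementation): the boolean flags stay in the public API and are simply normalized into the same internal callback list as user-provided callbacks.

class DefaultCheckpoint:
    """Stand-in for the built-in checkpointing callback."""

class DefaultEarlyStopping:
    """Stand-in for the built-in early-stopping callback."""

class Trainer:
    def __init__(self, callbacks=None, checkpoint_callback=True, early_stop_callback=True):
        self.callbacks = list(callbacks or [])
        if checkpoint_callback is True:          # simple flag: create the default for the user
            self.callbacks.append(DefaultCheckpoint())
        elif checkpoint_callback:                # user passed their own instance
            self.callbacks.append(checkpoint_callback)
        if early_stop_callback is True:
            self.callbacks.append(DefaultEarlyStopping())
        elif early_stop_callback:
            self.callbacks.append(early_stop_callback)

trainer = Trainer(callbacks=[], checkpoint_callback=True)  # user still just passes `True`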
@Borda I was thinking something similar, either appending the early stopping and model checkpointing callbacks at the beginning or end of the list.
I agree that losing early stopping and model checkpointing "for free" would be an undesirable regression. Moreover, as @williamFalcon pointed out, there are still some confusing cases that need to be ironed out before considering moving forward with this. As an example, right now you can construct your own ModelCheckpoint object and pass it in to the […]. To quote the Zen of Python: "there should be one — and preferably only one — obvious way to do it." |
What if we started by simplifying how the internal callbacks are executed, making it easier for contributors to add callbacks to the official library, and revisited user-defined callbacks at a later date? Right now, the callbacks must be manually invoked across the Trainer code. Here's a proposal:
This way, if a new user wants to contribute a Callback to the library, they simply have to write the Callback class and determine where in the list of callbacks it should be placed (order of execution). |
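A rough sketch of that dispatch idea, with illustrative names rather than the real Trainer internals: the trainer keeps a single ordered list and fires a named hook on each entry, so contributing a new built-in callback only means writing the class and choosing its position in the list.

class PrintEpoch:
    def on_epoch_end(self, trainer):
        print(f"finished epoch {trainer.current_epoch}")

class Trainer:
    def __init__(self, callbacks):
        self.callbacks = callbacks                  # list order == execution order
        self.current_epoch = 0

    def _call_hook(self, hook_name):
        for cb in self.callbacks:                   # one dispatch point instead of manual calls
            hook = getattr(cb, hook_name, None)
            if hook is not None:
                hook(self)

    def fit(self, max_epochs=2):
        for self.current_epoch in range(max_epochs):
            # ... training and validation steps would run here ...
            self._call_hook("on_epoch_end")

Trainer(callbacks=[PrintEpoch()]).fit()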
I like that, although it would be incredibly useful to go all the way and expose the default callback stack in a way that could be overridden (ie defined in LightningModule). Anecdotally, I noticed some users are already hacking their own callbacks in by editing Trainer code. Ideally this solution could get us to a point where the number of users overriding Trainer is at a minimum. |
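Purely as an illustration of exposing the default stack (the configure_callbacks name below is hypothetical, not an existing API): the module could declare extra callbacks and the trainer would merge them with its defaults, removing the need to edit Trainer code.

class PlotCallback:
    """Placeholder for a user-defined callback."""

class MyModule:
    def configure_callbacks(self):                  # hypothetical hook on the LightningModule
        return [PlotCallback()]

def build_callbacks(trainer_defaults, module):
    extra = getattr(module, "configure_callbacks", lambda: [])()
    return list(trainer_defaults) + list(extra)     # defaults run first, user callbacks after

callbacks = build_callbacks(trainer_defaults=[], module=MyModule())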
@jeffling could u give an example of such a hack? |
I think there are some issues with the idea of a
I would suggest that there should be well-defined and easy-to-implement abstractions for standard operations (logging, checkpointing, ...), which are called separately in the codebase via their own […]. My concern is that trying to be too flexible is what led to an unusable callback or event handler system in torchbearer, ignite, catalyst, keras, or a host of other libraries. We should be aware that all of those libraries got it wrong, and try to avoid making the same mistakes. |
Such use cases definitely exist; mine, for example, is when one wants to add a test which checks something at runtime. Here I had to hack the module instead of passing a callback. Also, if you look into those libraries you'll find some examples that you cannot use in Lightning now as a callback. I believe many users have their own cases. My proposal is to add a new class, named for example |
Yeah, I can see that there are good use cases for custom callbacks; I just don't think we should bundle them all into one class along with loggers and checkpoints and the rest. My proposal would be that we have a |
Personally, I like your idea; it should also be easier to implement and maintain than a new […], which does not possess such ambiguity. |
The limitation of the current system is that it does not give control over what should be logged and when, as in a callback system where a hook is called during the different steps of the loop, such as […]. Here the only method is […]. For example, I currently have a callback system embedded in my module (which is bad but hopefully temporary...) with the following loggers:
I insist on the fact that the defaults should be kept as they are now, because they are great and make the training loop setup easy and quick, but a certain level of flexibility should be given in order to:
Here is a humble proposal: #889. As for the callback hell, I don't really understand what it means; could you give a specific example? IMO, it's just a matter of designing a good API; then you obviously don't have control over how the user uses it... |
About the confusion between |
I think we may be talking about different things here. If we want to expose customization for logging and checkpointing, then I agree it's absolutely simpler to have users just implement […]. @hadim does provide a good example, though, with the desire to have control over multiple loggers and their respective behavior. It's a good example in the sense that he's not calling each logger uniformly (ie. doing the same action at the same time across all the loggers), which doesn't look like it'd be supported by a
I would imagine that the desire for a Callback would be a situation where we want to have more distinct separation of code logic. Is this what other people would expect as well?
Could you elaborate on what those libraries got wrong? Where were the pain points? I've used callbacks in Keras (sparingly) and found it to be useful. I think the most common way to enter callback hell is to have a rigid framework which leads to callbacks being the only place for flexibility. Fortunately, PyTorch Lightning is quite flexible so I don't worry about this case so much. In my opinion, we should collect more information on the following two points:
|
i like the brainstorm for different approaches.
then any brainstorm solution can be sanity checked to see if it solves the 3 cases. I’m sure we’ll have many solutions that do, but then at that point we can focus on making those solutions more usable. |
I'm a user of PL and I felt the need for callbacks for my use cases. Since it seems you need some use cases, here is mine: during training, I save intermediate plots to better understand how the training is going, I compute metrics, I save embeddings, etc. Currently I have edited my LightningModule, but I don't much like that way of doing it, because it adds a lot of logic that is not related to the model (saving plots and defining a forward step are quite different things), so my code becomes harder and longer to read; I'd prefer to have the logic separated. I also sometimes change my mind and want to remove one of those added functionalities, and to do that I need to remove the code. I did find a slightly better way: I created a class "instructions" which stores how much I want to do this additional stuff (how often in terms of epochs, for example), and this object is used to decide when to do things or not. It adds more code, though. I feel like callbacks could help me better separate the logic, and I could also easily remove a callback instead of removing code or using an object to deactivate it.
thanks for the use case! I’m wondering why you can’t do the logging and embedding plotting from the on_epoch_end? |
I feel like I have accidentally opened a Pandora's box, even though at the beginning I just wanted to merge the two parameters in the Trainer init... 🤖 |
I definitely can. But I still have the problem that if I want to disable this code, I have to delete the related code in on_epoch_end in the LightningModule instead of "just" removing the callback from the callbacks list in the trainer. This kind of extended functionality, which is what callbacks are, should not appear in the LightningModule (in my opinion). It also helps for sharing the model: I have some colleagues who don't care about my callbacks but are interested in the model. They could remove the on_epoch_end code, but it's not good practice to activate/deactivate something by adding/removing code, I think. |
For example, suppose that the checkpointing logic were not a parameter of the Trainer but some code in on_epoch_end. It would be a pain to remove or add this code whenever I want to activate/deactivate checkpointing. Instead, it's in the Trainer call, so it doesn't disturb the readability of the LightningModule. |
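A small sketch of that separation, with an assumed hook signature rather than anything taken from the library: the plotting/metrics logic lives in its own class, and disabling it means dropping one entry from a list instead of deleting code inside the LightningModule.

class EmbeddingPlotCallback:
    def __init__(self, every_n_epochs=5):
        self.every_n_epochs = every_n_epochs

    def on_epoch_end(self, trainer, pl_module):
        if trainer.current_epoch % self.every_n_epochs == 0:
            print("saving plots and embeddings")    # plotting/saving would go here

# enable or disable by editing the list; the LightningModule stays untouched
callbacks = [EmbeddingPlotCallback(every_n_epochs=5)]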
got it. i see what you mean now. main takeaways from your example:
I guess the key is point 1. It sounds like ideally we want to make sure that your model can still run with or without these callbacks. So we want to factor DL code into three components?
A. hardware/engineering code (Trainer)
B. research/model code (LightningModule)
C. non-essential code (callbacks)
This means your model should work with or without C? |
My use cases look exactly the same as @Rodolphe2005's, and @williamFalcon you just summarized my main point. C is non-essential. Ideally, you would prevent callbacks from modifying essential things in A and B, but I am not sure this is easily feasible. |
awesome. So, I think flavors of this have been proposed in this thread by @jeremyjordan, @ethanwharris, and @hadim. Here's another proposal.

class MyCallback(pl.Callback):

    def __init__(self, hook_name):
        self.hook_name = hook_name

    def __call__(self, trainer, pl_module):
        # my custom code I want to execute
        ...

callback = MyCallback('on_epoch_end')
Trainer(callbacks=[callback])

Something like this? |
What if a callback needs to do things on multiple events such as |
Here is a more concrete example:

import datetime
import pathlib

import matplotlib.pyplot as plt
import pandas as pd

from pytorch_lightning.callbacks import Callback


class HistoryLogger(Callback):
    def __init__(self, dir_path, plot_history_interval=None):
        super().__init__()

        self.dir_path = pathlib.Path(dir_path)
        self.history_path = self.dir_path / "history.csv"
        self.test_path = self.dir_path / "test.csv"

        self.plot_history_interval = plot_history_interval
        self.history_plot_path = self.dir_path / "history.png"

    def on_epoch_end(self):
        logs = self._trainer.model.logs

        if self.history_path.is_file():
            history = pd.read_csv(self.history_path)
        else:
            history = pd.DataFrame()

        history = history.append(pd.DataFrame([logs]), ignore_index=True)
        history.to_csv(self.history_path, index=None)

        epoch = logs["epoch"]
        if self.plot_history_interval and (epoch % self.plot_history_interval) == 0:
            self._plot_history(history, self.history_plot_path)

    def on_test_end(self):
        metrics = {}
        metrics["date"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        metrics["lr"] = self._trainer.model.get_lr()
        metrics.update(self._trainer.model.epoch_metrics["test"])
        metrics = pd.DataFrame([metrics])
        metrics.to_csv(self.test_path, index=None)

    def _plot_history(self, history, plot_history_path):
        # Guess metric names
        metric_names = history.columns
        filter_names_fn = lambda x: x.startswith("val_") and x not in [
            "date",
            "epoch",
        ]
        metric_names = list(filter(filter_names_fn, metric_names))
        metric_names = [name.replace("val_", "") for name in metric_names]

        # Plot parameters
        basesize = 6
        ratio = 0.7
        ncols = 3

        n_plots = len(metric_names)
        nrows = n_plots // ncols
        if n_plots % ncols > 0:
            nrows += 1

        figsize = int(basesize * ncols), int(basesize * ratio * nrows)
        fig, axs = plt.subplots(ncols=ncols, nrows=nrows, figsize=figsize, constrained_layout=True)  # type: ignore
        axs = axs.flatten()

        for metric_name, ax in zip(metric_names, axs):
            train_metric_name = f"train_{metric_name}"
            if train_metric_name in history.columns:
                ax.plot(history["epoch"], history[train_metric_name], label="train")

            val_metric_name = f"val_{metric_name}"
            if val_metric_name in history.columns:
                ax.plot(history["epoch"], history[val_metric_name], label="val")

            ax.set_xlabel("Epoch")
            ax.set_ylabel(metric_name)
            title = f"Metric: {metric_name}"
            ax.set_title(title, fontsize=18)
            ax.legend()

        fig.savefig(plot_history_path, dpi=200)
        plt.close(fig)
Yeah, good point. What if a callback has all the hooks as its interface and, for each callback, you can override the hooks you need?
|
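One way to read that proposal, with illustrative hook names: a base class defines every hook as a no-op, and a concrete callback overrides only the events it cares about.

class Callback:
    def on_train_begin(self, trainer, pl_module): pass
    def on_epoch_end(self, trainer, pl_module): pass
    def on_test_end(self, trainer, pl_module): pass

class HistoryCallback(Callback):
    def on_epoch_end(self, trainer, pl_module):     # override only the hooks you need
        print("append epoch metrics to history")

    def on_test_end(self, trainer, pl_module):
        print("write test metrics")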
Good for me. The question is: do we want to pass arguments to those methods, or init the callback with something like […]? I don't have strong feelings about this. |
if you derive it from |
Haven't figured out if it's possible yet but what about something like this
This would solve some readability issues since the functions can have informative names. Also potentially more flexible if we can support multiple functions binding to the same callback hook. Although I'm not sure how possible it is haha |
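A guess at what such decorator-based binding might look like (entirely hypothetical, not an existing API): functions with informative names register themselves against a hook, and several functions can bind to the same hook.

_hooks = {}

def bind(hook_name):
    """Register the decorated function to run on the given hook."""
    def register(fn):
        _hooks.setdefault(hook_name, []).append(fn)
        return fn
    return register

@bind("on_epoch_end")
def plot_embeddings(trainer):
    print("plotting embeddings")

@bind("on_epoch_end")
def save_history(trainer):          # a second function bound to the same hook
    print("saving history")

def fire(hook_name, trainer):
    for fn in _hooks.get(hook_name, []):
        fn(trainer)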
it’s equivalent to what i wrote. the decorators are unnecessary.

def on_epoch_end:
Good point! I hadn't thought of that :) |
Ok, anyone want to take a stab at doing this? |
The discussion has reached its apex and I like its conclusion. @williamFalcon's way of thinking about callbacks resonates with me:
Callbacks are extras (i.e.
From a community perspective, I like the idea of callbacks because they can create a sharing ecosystem that doesn't live within the Lightning repository. I like that because it reduces barriers for Lightning users to collaborate with each other. |
I completely agree with this schema. From my perspective, callbacks shall be independent of the model (more general) so users can use them as training extensions. To offer an illustration: you are baking a cake (the DL model), for which you need a kitchen and standard equipment (the Trainer), and on top of that you may use some special tools (the callbacks). 🤖 |
I'm happy to look into this, although we should build on #889 - as suggested by @hadim, the biggest question is can we ensure that callbacks are only for non-essential stuff by making the pl_module and trainer seen by the callbacks immutable? @hadim - I'm happy to help out with #889, just let me know if anything needs to be done :) |
For those interested, discussion about the implementation happens on the PR: #889 (comment) |
It seems like we landed back on the originally proposed design of a Keras-like callback interface for pieces of code logic that we want to easily enable/disable during training.
(see @hadim's PR for a concrete implementation: #889) Did we reach consensus to move forward with this approach? |
I guess I see it as non-essential code that needs to be grouped together for easy enable/disable (same as a mixin really), but easier for non-Python power users? I don't see another clean way of doing it? The benefit of this is that the code can also be shared across projects? |
Yes, that is my main usage of such callbacks... |
just for the record, I have also found this thought: #556 |
I suppose that this is done for now... 🤖 |
🚀 Feature
Simplified API with callbacks... as e.g. Keras did: pass just a list of callbacks to be executed, and the Trainer will call them when needed, instead of having each of them specified via its own argument
https://github.com/PyTorchLightning/pytorch-lightning/blob/b1040523b2180300574d961444b00abfa3c84195/pytorch_lightning/trainer/trainer.py#L65-L66
mentioned also in #825 (comment)