## 🚀 Feature
## Motivation
For large-scale Lightning deployments, it is very useful to collect certain types of trainer runtime events in an analytics system, so engineers can monitor the overall health of the deployment and obtain useful information for troubleshooting failed or stuck jobs.
Currently, Lightning relies mainly on logging to communicate trainer runtime information to users. However, logging is not ideal for large-scale deployments for the following reasons:
- Many training jobs in large-scale deployments are automated, long-running batch jobs. It is impractical to expect users to notice certain warnings (e.g. API deprecation warnings) in a timely fashion.
- The amount of events desirable for debuggability may be too verbose for logging.
- Logging is generally less structured.
## Pitch
- Introduce the concept of a runtime event in Lightning.
- Emit appropriate runtime events at the appropriate places (e.g. on deprecated API usage).
- Provide a mechanism to register custom handlers for runtime events.
- Ideally, we'd like a mechanism to transparently register default handlers (unlike `Plugin`, which requires users to explicitly pass it to `Trainer`) that don't belong in the core trainer directly. This could make Lightning more attractive for larger organizations that rely on shared tooling.
  - The backend registration mechanism of `fsspec` is an ideal candidate solution (see the sketch after this list): https://filesystem-spec.readthedocs.io/en/latest/developer.html#implementing-a-backend
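To make the pitch concrete, here is a minimal sketch of what such an event/handler mechanism could look like. Every name in it (`RuntimeEvent`, `register_event_handler`, `emit`, the entry-point group `"lightning.event_handlers"`) is a hypothetical illustration, not an existing Lightning API; the entry-point lookup only mirrors the `fsspec`-style registration mentioned above.

```python
# A minimal sketch, assuming a hypothetical runtime-events module inside Lightning.
# None of these names exist today; they only illustrate the shape of the proposal.
from __future__ import annotations

import dataclasses
from datetime import datetime
from importlib.metadata import entry_points  # stdlib; the group= keyword needs Python 3.10+
from typing import Callable, Dict, List


@dataclasses.dataclass
class RuntimeEvent:
    """A structured trainer runtime event (e.g. a deprecated-API usage)."""

    name: str  # e.g. "api.deprecation"
    message: str
    timestamp: datetime
    metadata: Dict[str, str] = dataclasses.field(default_factory=dict)


# Registered handlers receive every emitted event.
_HANDLERS: List[Callable[[RuntimeEvent], None]] = []


def register_event_handler(handler: Callable[[RuntimeEvent], None]) -> None:
    """Explicitly register a custom handler (analogous to passing a Plugin to Trainer)."""
    _HANDLERS.append(handler)


def load_default_handlers() -> None:
    """Transparently register handlers advertised by installed packages.

    Mirrors fsspec's backend registration: a shared-tooling package would declare

        [project.entry-points."lightning.event_handlers"]
        my_org = "my_org.telemetry:handle_event"

    in its pyproject.toml, and Lightning would pick it up without any Trainer argument.
    """
    for ep in entry_points(group="lightning.event_handlers"):
        register_event_handler(ep.load())


def emit(name: str, message: str, **metadata: str) -> None:
    """Called by the trainer at the appropriate places to fan an event out to all handlers."""
    event = RuntimeEvent(name=name, message=message, timestamp=datetime.now(), metadata=metadata)
    for handler in _HANDLERS:
        handler(event)


# Example: ship deprecation warnings to an analytics backend instead of relying on logs.
register_event_handler(lambda e: print(f"[{e.timestamp}] {e.name}: {e.message}"))
emit("api.deprecation", "some deprecated argument was used")  # placeholder message
```

With something along these lines, an organization could install a single internal package that exposes an entry point, and every training job would report structured events to its analytics system without any per-job `Trainer` configuration.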
## Alternatives
## Additional context
A similar prior feature request: #8186
If you enjoy Lightning, check out our other projects! ⚡
- Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
- Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
- Bolts: Pretrained SOTA deep learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
- Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.