Runtime events with support for custom handlers #8895

@yifuwang

🚀 Feature

Motivation

For large-scale Lightning deployments, it is very useful to collect certain types of trainer runtime events into an analytics system, so engineers can monitor the overall health of the deployment and obtain useful information for troubleshooting failed or stuck jobs.

Currently, Lightning mainly relies on logging to communicate trainer runtime information to users. However, logging is not ideal for large-scale deployments for the following reasons:

  • Many training jobs in large-scale deployments are automated, long-running batch jobs. It is impractical to expect users to notice certain warnings (e.g. an API deprecation warning) in a timely fashion.
  • The volume of events that is desirable for debugging may be too verbose for logging.
  • Log messages are generally less structured than events, which makes them harder to collect and analyze.

Pitch

  • Introduce the concept of runtime events in Lightning
  • Emit appropriate runtime events at appropriate places (e.g. deprecated API usage)
  • Provide a mechanism to register custom handlers for runtime events
    • Ideally, we'd like a mechanism to transparently register default handlers that don't belong in the core trainer (unlike Plugin, which users must explicitly pass to Trainer). This could make Lightning more attractive to larger organizations that rely on shared tooling.
    • The backend registration mechanism of fsspec is an ideal candidate solution: https://filesystem-spec.readthedocs.io/en/latest/developer.html#implementing-a-backend
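To make the pitch more concrete, here is a minimal sketch of what a runtime event and a name-keyed handler registry could look like. None of these names exist in Lightning today: `RuntimeEvent`, `register_handler`, `emit`, and the `"deprecated_api_usage"` event name are all hypothetical and only illustrate the idea of a process-wide registry, loosely in the spirit of fsspec's backend registration.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

log = logging.getLogger(__name__)


@dataclass(frozen=True)
class RuntimeEvent:
    """Hypothetical structured event emitted by the trainer."""

    name: str
    payload: Dict[str, Any] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


Handler = Callable[[RuntimeEvent], None]

# Process-wide registry keyed by handler name (an assumption of this sketch).
_HANDLERS: Dict[str, Handler] = {}


def register_handler(name: str, handler: Handler, clobber: bool = False) -> None:
    """Register a handler under a unique name; refuse silent overwrites."""
    if name in _HANDLERS and not clobber:
        raise ValueError(f"A handler named {name!r} is already registered")
    _HANDLERS[name] = handler


def emit(event_name: str, **payload: Any) -> RuntimeEvent:
    """Build an event and pass it to every registered handler."""
    event = RuntimeEvent(name=event_name, payload=payload)
    for handler_name, handler in list(_HANDLERS.items()):
        try:
            handler(event)
        except Exception:
            # A failing handler should not take down a training job.
            log.exception("Runtime event handler %r failed", handler_name)
    return event


# Usage: an organization-wide handler that collects events in memory.
collected: List[RuntimeEvent] = []
register_handler("in_memory", collected.append)

# Hypothetical call site, e.g. where a deprecated API is detected.
emit("deprecated_api_usage", api="some_deprecated_api")
```

Because handlers are registered globally rather than passed to `Trainer`, shared tooling could install them at import time. Discovering handlers from installed packages (for example through package entry points) would be one way to make that registration fully transparent, but that part is left out of this sketch.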

Alternatives

Additional context

A similar prior feature request: #8186



Labels

  • feature: Is an improvement or enhancement
  • help wanted: Open to be worked on
  • let's do it!: Approved to implement
  • logging: Related to the `LoggerConnector` and `log()`
