Description
Is your feature request related to a problem? Please describe.
The App SDK currently supports inference within the application process itself. This is simple and efficient for some use cases, but when multiple applications/models are hosted in a "production" environment, a remote inference service, e.g. Triton, may be needed so that the resource-heavy inference workload can be centrally managed.
Describe the solution you'd like
Add remote inference support to the built-in inference operators in the App SDK, with runtime options, e.g. using the strategy pattern, to choose between in-proc and remote inference.
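A minimal sketch of what such a strategy pattern could look like is below. The class and function names (`InferenceStrategy`, `InProcInference`, `RemoteTritonInference`, `make_strategy`) and the config keys are hypothetical illustrations, not existing App SDK API; the remote path assumes a Triton server reachable over HTTP via the standard `tritonclient` package.

```python
from abc import ABC, abstractmethod

import numpy as np


class InferenceStrategy(ABC):
    """Hypothetical strategy interface an inference operator would call at runtime."""

    @abstractmethod
    def infer(self, image: np.ndarray) -> np.ndarray:
        ...


class InProcInference(InferenceStrategy):
    """Runs the model inside the application process (current App SDK behavior)."""

    def __init__(self, model_path: str):
        import torch

        self._torch = torch
        self._model = torch.jit.load(model_path).eval()

    def infer(self, image: np.ndarray) -> np.ndarray:
        with self._torch.no_grad():
            output = self._model(self._torch.from_numpy(image))
        return output.numpy()


class RemoteTritonInference(InferenceStrategy):
    """Delegates inference to a remote Triton server via its HTTP client."""

    def __init__(self, url: str, model_name: str,
                 input_name: str = "INPUT__0", output_name: str = "OUTPUT__0"):
        import tritonclient.http as triton_http

        self._triton_http = triton_http
        self._client = triton_http.InferenceServerClient(url=url)
        self._model_name = model_name
        self._input_name = input_name
        self._output_name = output_name

    def infer(self, image: np.ndarray) -> np.ndarray:
        # Wrap the numpy array as a Triton input and request the named output tensor.
        infer_input = self._triton_http.InferInput(
            self._input_name, list(image.shape), "FP32")
        infer_input.set_data_from_numpy(image.astype(np.float32))
        response = self._client.infer(
            model_name=self._model_name,
            inputs=[infer_input],
            outputs=[self._triton_http.InferRequestedOutput(self._output_name)],
        )
        return response.as_numpy(self._output_name)


def make_strategy(config: dict) -> InferenceStrategy:
    """Picks a strategy from a runtime option, e.g. an app config setting or env var."""
    if config.get("mode") == "remote":
        return RemoteTritonInference(config["url"], config["model_name"])
    return InProcInference(config["model_path"])
```

With this shape, the built-in inference operator would only depend on `InferenceStrategy.infer()`, and the in-proc vs. remote choice becomes a deployment-time option rather than a code change in the application.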
Describe alternatives you've considered
One of the main reasons to use a remote inference server, e.g. Triton, is to have dedicated model and runtime resource management (scheduling and queuing inference requests), so that the application does not need to directly request local GPU and/or system memory. Without the remote service, the whole application with in-proc inference then needs to be scheduled on servers running multiple applications, or instances thereof, to ensure resources are available when the app requests them. This is simpler when only system memory is requested, but GPU memory requests have to be properly managed (e.g. K8s fractional request on a visible GPU).
Additional context