Hi there,
I'm new to the FasterTransformer backend, and I'm curious why we need to set max_batch_size to 1 when interactive mode is enabled.
The documentation says this is to guarantee that requests belonging to the same session are directed to the same model instance exclusively. I understand that the requests must go to the same model instance, but why exclusively? If we used the Direct mode of the sequence batcher, requests from a given session would be routed to a dedicated batch slot within that instance (see the sketch below). Wouldn't that be sufficient to guarantee the correctness of the inference?
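To make the question concrete, here is a rough sketch of the kind of config.pbtxt I have in mind for Direct-mode sequence batching. The max_batch_size value and the control input names are illustrative only; I am not claiming the FasterTransformer backend actually consumes these control tensors.

```
# Hypothetical sketch, not a verified FasterTransformer backend config.
# Direct scheduling: each sequence (session) is pinned to one batch slot
# of one model instance for its whole lifetime.
max_batch_size: 4
sequence_batching {
  direct { }
  max_sequence_idle_microseconds: 5000000
  control_input [
    {
      # Signals the first request of a sequence (name is illustrative).
      name: "START"
      control [
        { kind: CONTROL_SEQUENCE_START, int32_false_true: [ 0, 1 ] }
      ]
    },
    {
      # Signals whether the slot holds an active request this step.
      name: "READY"
      control [
        { kind: CONTROL_SEQUENCE_READY, int32_false_true: [ 0, 1 ] }
      ]
    }
  ]
}
```

With a config like this, my understanding is that up to four independent sessions could share one model instance, each keeping its own slot, which is why I'm unsure what additional guarantee max_batch_size = 1 provides.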
I would appreciate it if someone could give me a clue :)