Posted On: Sep 1, 2023

You can now continuously stream inference responses back to the client when using SageMaker real-time inference, which helps you build interactive experiences for generative AI applications such as chatbots, virtual assistants, and music generators.

With interactive generative AI applications such as chatbots, users can read the response word by word as it is generated instead of waiting for the full response. For such applications, minimizing the time to the first inference response is especially important for creating experiences that feel interactive. Previously, SageMaker endpoints waited until the full inference response was complete before responding to the client. With response streaming, partial inference results are returned continuously until the full response is complete.
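
As a rough illustration of consuming a streamed response from the client side, the sketch below uses the boto3 `sagemaker-runtime` client's `invoke_endpoint_with_response_stream` call and reads each `PayloadPart` event as it arrives. The helper names `iter_text_chunks` and `stream_endpoint`, the JSON content type, and the assumption that the container emits UTF-8 text are illustrative choices, not part of this announcement; the actual payload and chunk format depend on your model container.

```python
import codecs
from typing import Iterable, Iterator


def iter_text_chunks(event_stream: Iterable[dict]) -> Iterator[str]:
    """Yield decoded text from each PayloadPart event as it arrives.

    Assumes the container streams UTF-8 text. An incremental decoder is used
    because a multi-byte character may be split across two payload parts.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            # This sketch simply skips events that carry no payload bytes.
            continue
        text = decoder.decode(part["Bytes"])
        if text:
            yield text


def stream_endpoint(endpoint_name: str, payload: bytes) -> Iterator[str]:
    """Invoke an endpoint and yield partial responses (hypothetical helper)."""
    import boto3  # imported here so the decoding helper has no AWS dependency

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=payload,
        ContentType="application/json",  # assumption: the container accepts JSON
    )
    # response["Body"] is an event stream; each event may hold a PayloadPart.
    yield from iter_text_chunks(response["Body"])
```

A caller might then print text as it arrives, for example `for chunk in stream_endpoint("my-endpoint", b'{"inputs": "Hello"}'): print(chunk, end="", flush=True)`, where the endpoint name and request body are placeholders.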

This feature is available in all commercial regions where SageMaker is available.

For more details on how to use response streaming, along with examples, see our documentation on the API reference, getting streaming responses, and how containers should respond, as well as the accompanying blog post.