[`create_model_bundle_from_triton_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_triton_enhanced_runnable_image_v2), and [`create_model_bundle_from_streaming_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_streaming_enhanced_runnable_image_v2).

The first directly pickles a user-specified `load_predict_fn`, a function which
loads the model and returns a `predict_fn`, a function which takes in a request.

…requests at port 5005 using HTTP and exposes `POST /predict` and
`GET /readyz` endpoints.

The fourth is a variant of the third that also starts an instance of the NVidia
Triton framework for efficient model serving.

The fifth is a variant of the third that responds with a stream of SSEs at `POST /stream` (the user can decide whether `POST /predict` is also exposed).

Each of these modes of creating a model bundle is called a "Flavor".
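Not from the Launch codebase, but a minimal stdlib sketch of the server-sent-event wire format that a `POST /stream` response uses: each event is a `data:` field terminated by a blank line, and a streaming bundle can emit one event per generated token (`sse_event` and `stream_tokens` are illustrative names):

```python
def sse_event(data: str) -> str:
    # Format one server-sent event: a "data:" field terminated by a blank line.
    return f"data: {data}\n\n"


def stream_tokens(tokens):
    # Hypothetical generator: yields one SSE event per token as the model
    # produces them, so the client sees output before generation finishes.
    for token in tokens:
        yield sse_event(token)


body = "".join(stream_tokens(["Hello", " world"]))
# body == "data: Hello\n\ndata:  world\n\n"
```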
* You want to use a `RunnableImageFlavor`
* You also want to use [NVidia's `tritonserver`](https://developer.nvidia.com/nvidia-triton-inference-server) to accelerate model inference

A `StreamingEnhancedRunnableImageFlavor` (a runnable image variant) is good if:

* You want to use a `RunnableImageFlavor`
* You also want to support token streaming while the model is generating
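Once a flavor is chosen, bundle creation follows the same general shape. A hedged sketch of calling `create_model_bundle_from_streaming_enhanced_runnable_image_v2` (the schema fields, repository, and tag below are placeholders, and the client call is commented out because it requires a live API key):

```python
from pydantic import BaseModel


class MyRequestSchema(BaseModel):
    # Illustrative request schema; use fields matching your model's input.
    prompt: str


# Placeholder parameters: point "repository" and "tag" at your own image.
bundle_params = {
    "model_bundle_name": "my-streaming-bundle",
    "request_schema": MyRequestSchema,
    "repository": "my-registry/my-image",
    "tag": "latest",
}

# from launch import LaunchClient
# client = LaunchClient(api_key="...")  # needs a real API key
# client.create_model_bundle_from_streaming_enhanced_runnable_image_v2(**bundle_params)
```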
=== "Creating From Callables"
57
62
```py
@@ -248,6 +253,44 @@ Each of these modes of creating a model bundle is called a "Flavor".
248
253
```
249
254
250
255
=== "Creating From a Streaming Enhanced Runnable Image"

    ```py
    import os
    from pydantic import BaseModel
    from launch import LaunchClient


    class MyRequestSchema(BaseModel):
        x: int
        y: str

    class MyResponseSchema(BaseModel):
        __root__: int


    BUNDLE_PARAMS = {
        "model_bundle_name": "test-streaming-bundle",
        "request_schema": MyRequestSchema,
        "response_schema": MyResponseSchema,
        "repository": "...",
        "tag": "...",
        "command": [  # optional; if provided, will also expose the /predict endpoint
            ...
    ```
---

**`docs/concepts/model_endpoints.md`**
…specifies deployment parameters, such as the minimum and maximum number of
workers, as well as the requested resources for each worker, such as the number
of CPUs, amount of memory, GPU count, and type of GPU.

Endpoints can be asynchronous, synchronous, or streaming. Asynchronous endpoints return
a future immediately after receiving a request, and the future can be used to
retrieve the prediction once it is ready. Synchronous endpoints return the
prediction directly after receiving a request. Streaming endpoints are variants of synchronous endpoints that return a stream of SSEs instead of a single HTTP response.

!!! info
    # Choosing the right inference mode

    Here are some tips for how to choose between SyncEndpoint, StreamingEndpoint, AsyncEndpoint, and BatchJob for deploying your ModelBundle:

    A SyncEndpoint is good if:

    * You have strict latency requirements (e.g. on the order of seconds or less).
    * You are willing to have resources continually allocated.

    A StreamingEndpoint is good if:

    * You have stricter requirements on perceived latency than SyncEndpoint can support (e.g. you want tokens generated by the model to start being returned almost immediately rather than waiting for the model generation to finish).
    * You are willing to have resources continually allocated.
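On the client side, a streaming response arrives as a sequence of server-sent events. A stdlib sketch (not Launch client code) of pulling the token payloads out of an SSE body, where each event is a `data:` line and events are separated by blank lines:

```python
def parse_sse(chunk: str) -> list[str]:
    # Extract the data payloads from a block of server-sent events.
    return [
        line[len("data: "):]
        for line in chunk.splitlines()
        if line.startswith("data: ")
    ]


# A streaming endpoint yields tokens as they are generated, e.g.:
tokens = parse_sse("data: Hel\n\ndata: lo\n\n")
# tokens == ["Hel", "lo"]
```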
Streaming model endpoints are variants of sync model endpoints that are useful for tasks with strict requirements on perceived latency. Streaming endpoints are more expensive than async endpoints.
!!! Note
    Streaming model endpoints require at least 1 `min_worker`.
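As a hedged illustration of that constraint (the keyword names below are assumptions for illustration, not a verified Launch API signature):

```python
# Illustrative endpoint configuration; treat the parameter names as placeholders.
endpoint_params = {
    "endpoint_name": "my-streaming-endpoint",
    "endpoint_type": "streaming",
    "min_workers": 1,  # streaming endpoints cannot scale to zero workers
    "max_workers": 4,
}

assert endpoint_params["min_workers"] >= 1
```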