[`create_model_bundle_from_triton_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_triton_enhanced_runnable_image_v2), and [`create_model_bundle_from_streaming_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_streaming_enhanced_runnable_image_v2).
The first directly pickles a user-specified `load_predict_fn`, a function which
loads the model and returns a `predict_fn`, a function which takes in a request.
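As a minimal sketch of that load-once / predict-many shape (the model, request, and response below are stand-ins, not Launch's exact signature):

```python
# Illustrative sketch only: a `load_predict_fn` loads the model a single
# time, then returns a `predict_fn` closure that handles each request.
def load_predict_fn():
    model = {"bias": 1}  # stand-in for loading real model weights

    def predict_fn(request: dict) -> dict:
        # called once per request, reusing the already-loaded model
        return {"result": request["x"] + model["bias"]}

    return predict_fn

predict = load_predict_fn()
predict({"x": 2})  # -> {'result': 3}
```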
requests at port 5005 using HTTP and exposes `POST /predict` and
`GET /readyz` endpoints.
The fourth is a variant of the third that also starts an instance of the NVidia
Triton framework for efficient model serving.
The fifth is a variant of the third that responds with a stream of SSEs at `POST /stream` (the user
can decide whether `POST /predict` is also exposed).
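For context, each server-sent event on such a stream is a `data:` line followed by a blank line. A sketch of what a token stream looks like on the wire (the token payloads here are made up):

```python
# Illustrative only: formatting tokens as server-sent events (SSEs),
# the wire format a streaming response body carries.
def to_sse(tokens):
    for token in tokens:
        yield f"data: {token}\n\n"  # one event per token

body = "".join(to_sse(["Hel", "lo"]))
# body is "data: Hel\n\ndata: lo\n\n"
```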
Each of these modes of creating a model bundle is called a "Flavor".
* You want to use a `RunnableImageFlavor`
* You also want to support token streaming while the model is generating
=== "Creating From Callables"
```py
import os
# ...
```

`docs/concepts/model_endpoints.md`
of CPUs, amount of memory, GPU count, and type of GPU.
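For illustration, those resource settings can be pictured as a config fragment like the following (the key names and values here are assumptions, not verbatim API parameters; consult the `LaunchClient` API reference for the exact names):

```python
# Hypothetical config fragment; keys are illustrative, not verbatim API names.
endpoint_resources = {
    "cpus": 3,
    "memory": "12Gi",
    "gpus": 1,
    "gpu_type": "nvidia-ampere-a10",  # example GPU type string
}
```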
Endpoints can be asynchronous, synchronous, or streaming. Asynchronous endpoints return
a future immediately after receiving a request, and the future can be used to
retrieve the prediction once it is ready. Synchronous endpoints return the
prediction directly after receiving a request. Streaming endpoints are variants of synchronous
endpoints that return a stream of SSEs instead of a single HTTP response.
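The asynchronous pattern can be sketched with Python's own futures (a real async endpoint hands back a task handle you poll over HTTP, not a local `Future` object, so this is an analogy rather than the actual API):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def slow_predict(x):
    time.sleep(0.05)  # stand-in for model inference latency
    return x * 2

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_predict, 21)  # returns immediately
    # ... the caller is free to do other work here ...
    result = future.result()  # blocks until the prediction is ready

result  # -> 42
```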
Streaming model endpoints are variants of sync model endpoints that are useful for tasks with strict
requirements on perceived latency. Streaming endpoints are more expensive than async endpoints.
!!! Note
    Streaming model endpoints require at least 1 `min_worker`.