
Commit 50c8365

Add documentation for streaming bundles/endpoints (#110)
1 parent 7bb40dc commit 50c8365

File tree

6 files changed: +94 -7 lines changed

docs/api/endpoint_predictions.md

Lines changed: 1 addition & 0 deletions
@@ -3,3 +3,4 @@
 ::: launch.model_endpoint.EndpointRequest
 ::: launch.model_endpoint.EndpointResponse
 ::: launch.model_endpoint.EndpointResponseFuture
+::: launch.model_endpoint.EndpointResponseStream

docs/api/model_endpoints.md

Lines changed: 1 addition & 0 deletions
@@ -6,3 +6,4 @@ method and provide a `predict` function.
 
 ::: launch.model_endpoint.AsyncEndpoint
 ::: launch.model_endpoint.SyncEndpoint
+::: launch.model_endpoint.StreamingEndpoint

docs/concepts/endpoint_predictions.md

Lines changed: 12 additions & 0 deletions
@@ -27,6 +27,18 @@ predictions. The following code snippet shows how to send tasks to endpoints.
     print(response)
     ```
 
+=== "Sending a Task to a Streaming Endpoint"
+    ```py
+    import os
+    from launch import EndpointRequest, LaunchClient
+
+    client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
+    endpoint = client.get_model_endpoint("demo-endpoint-streaming")
+    response = endpoint.predict(request=EndpointRequest(args={"x": 2, "y": "hello"}))
+    print(response)
+    ```
+
 ::: launch.model_endpoint.EndpointRequest
 ::: launch.model_endpoint.EndpointResponseFuture
 ::: launch.model_endpoint.EndpointResponse
+::: launch.model_endpoint.EndpointResponseStream
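Since the snippet above only prints the raw `response`, here is a hedged sketch of how the server-sent events of a streaming response might be accumulated into text. The `data: {"result": ...}` payload shape and the iterable-of-lines input are assumptions for illustration, not the documented `EndpointResponseStream` API.

```python
import json


def collect_stream_tokens(events):
    """Concatenate token payloads from an iterable of SSE lines.

    `events` stands in for the body of a streaming endpoint's response;
    the {"result": ...} payload shape is a hypothetical example.
    """
    tokens = []
    for line in events:
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments/keep-alives
        payload = json.loads(line[len("data:"):].strip())
        tokens.append(payload["result"])
    return "".join(tokens)


# Stubbed SSE lines standing in for a real response stream:
fake_stream = [
    'data: {"result": "Hel"}',
    "",
    'data: {"result": "lo"}',
]
print(collect_stream_tokens(fake_stream))  # -> Hello
```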

docs/concepts/model_bundles.md

Lines changed: 46 additions & 3 deletions
@@ -5,12 +5,11 @@ are created by packaging a model up into a deployable format.
 
 ## Creating Model Bundles
 
-There are four methods for creating model bundles:
+There are five methods for creating model bundles:
 [`create_model_bundle_from_callable_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_callable_v2),
 [`create_model_bundle_from_dirs_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_dirs_v2),
 [`create_model_bundle_from_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_runnable_image_v2),
-and
-[`create_model_bundle_from_triton_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_triton_enhanced_runnable_image_v2).
+[`create_model_bundle_from_triton_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_triton_enhanced_runnable_image_v2), and [`create_model_bundle_from_streaming_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_streaming_enhanced_runnable_image_v2).
 
 The first directly pickles a user-specified `load_predict_fn`, a function which
 loads the model and returns a `predict_fn`, a function which takes in a request.
@@ -21,6 +20,7 @@ requests at port 5005 using HTTP and exposes `POST /predict` and
 `GET /readyz` endpoints.
 The fourth is a variant of the third that also starts an instance of the NVidia
 Triton framework for efficient model serving.
+The fifth is a variant of the third that responds with a stream of SSEs at `POST /stream` (the user can decide whether `POST /predict` is also exposed).
 
 Each of these modes of creating a model bundle is called a "Flavor".
 
@@ -52,6 +52,11 @@ Each of these modes of creating a model bundle is called a "Flavor".
 * You want to use a `RunnableImageFlavor`
 * You also want to use [NVidia's `tritonserver`](https://developer.nvidia.com/nvidia-triton-inference-server) to accelerate model inference
 
+A `StreamingEnhancedRunnableImageFlavor` (a runnable image variant) is good if:
+
+* You want to use a `RunnableImageFlavor`
+* You also want to support token streaming while the model is generating
+
 
 === "Creating From Callables"
     ```py
@@ -248,6 +253,44 @@ Each of these modes of creating a model bundle is called a "Flavor".
     ```
 
 
+=== "Creating From a Streaming Enhanced Runnable Image"
+    ```py
+    import os
+    from pydantic import BaseModel
+    from launch import LaunchClient
+
+
+    class MyRequestSchema(BaseModel):
+        x: int
+        y: str
+
+    class MyResponseSchema(BaseModel):
+        __root__: int
+
+
+    BUNDLE_PARAMS = {
+        "model_bundle_name": "test-streaming-bundle",
+        "request_schema": MyRequestSchema,
+        "response_schema": MyResponseSchema,
+        "repository": "...",
+        "tag": "...",
+        "command": [  # optional; if provided, will also expose the /predict endpoint
+            ...
+        ],
+        "streaming_command": [  # required
+            ...
+        ],
+        "env": {
+            "TEST_KEY": "test_value",
+        },
+        "readiness_initial_delay_seconds": 30,
+    }
+
+    client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
+    client.create_model_bundle_from_streaming_enhanced_runnable_image_v2(**BUNDLE_PARAMS)
+    ```
+
+
 ## Configuring Model Bundles
 
 The `app_config` field of a model bundle is a dictionary that can be used to
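The streaming flavor's contract described above (a stream of SSEs from `POST /stream`) comes down to the server-sent-event wire format. Below is a minimal sketch of the framing a `streaming_command` server might emit per token, assuming plain `data:`-only events (no `event:` or `id:` fields); it is not taken from the Launch codebase.

```python
def sse_frame(data: str) -> bytes:
    """Encode one server-sent event: a `data:` line plus the
    blank-line terminator that ends the event."""
    return f"data: {data}\n\n".encode("utf-8")


# A /stream handler would yield one frame per generated token,
# with Content-Type: text/event-stream on the response:
tokens = ["Hel", "lo", "!"]
body = b"".join(sse_frame(t) for t in tokens)
```

Clients reassemble the text by concatenating the `data:` payloads as each event arrives, which is what makes perceived latency low: the first token is usable before generation finishes.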

docs/concepts/model_endpoints.md

Lines changed: 33 additions & 3 deletions
@@ -7,21 +7,26 @@ specifies deployment parameters, such as the minimum and maximum number of
 workers, as well as the requested resources for each worker, such as the number
 of CPUs, amount of memory, GPU count, and type of GPU.
 
-Endpoints can be asynchronous or synchronous. Asynchronous endpoints return
+Endpoints can be asynchronous, synchronous, or streaming. Asynchronous endpoints return
 a future immediately after receiving a request, and the future can be used to
 retrieve the prediction once it is ready. Synchronous endpoints return the
-prediction directly after receiving a request.
+prediction directly after receiving a request. Streaming endpoints are variants of synchronous endpoints that return a stream of SSEs instead of a single HTTP response.
 
 !!! info
     # Choosing the right inference mode
 
-    Here are some tips for how to choose between SyncEndpoint, AsyncEndpoint, and BatchJob for deploying your ModelBundle:
+    Here are some tips for how to choose between SyncEndpoint, StreamingEndpoint, AsyncEndpoint, and BatchJob for deploying your ModelBundle:
 
     A SyncEndpoint is good if:
 
     * You have strict latency requirements (e.g. on the order of seconds or less).
     * You are willing to have resources continually allocated.
 
+    A StreamingEndpoint is good if:
+
+    * You have stricter requirements on perceived latency than SyncEndpoint can support (e.g. you want tokens generated by the model to start being returned almost immediately rather than waiting for the model generation to finish).
+    * You are willing to have resources continually allocated.
+
     An AsyncEndpoint is good if:
 
     * You want to save on compute costs.
@@ -83,6 +88,31 @@ endpoint = client.create_model_endpoint(
 )
 ```
 
+## Creating Streaming Model Endpoints
+
+Streaming model endpoints are variants of sync model endpoints that are useful for tasks with strict requirements on perceived latency. Streaming endpoints are more expensive than async endpoints.
+!!! Note
+    Streaming model endpoints require at least 1 `min_worker`.
+
+```py title="Creating a Streaming Model Endpoint"
+import os
+from launch import LaunchClient
+
+client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
+endpoint = client.create_model_endpoint(
+    endpoint_name="demo-endpoint-streaming",
+    model_bundle="test-streaming-bundle",
+    cpus=1,
+    min_workers=1,
+    endpoint_type="streaming",
+    update_if_exists=True,
+    labels={
+        "team": "MY_TEAM",
+        "product": "MY_PRODUCT",
+    },
+)
+```
+
 ## Managing Model Endpoints
 
 Model endpoints can be listed, updated, and deleted using the Launch API.

launch/model_endpoint.py

Lines changed: 1 addition & 1 deletion
@@ -99,7 +99,7 @@ def __repr__(self):
 
 class EndpointRequest:
     """
-    Represents a single request to either a ``SyncEndpoint`` or ``AsyncEndpoint``.
+    Represents a single request to either a ``SyncEndpoint``, ``StreamingEndpoint``, or ``AsyncEndpoint``.
 
     Parameters:
         url: A url to some file that can be read in to a ModelBundle's predict function. Can be an image, raw text, etc.
