[`create_model_bundle_from_triton_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_triton_enhanced_runnable_image_v2), and [`create_model_bundle_from_streaming_enhanced_runnable_image_v2`](/api/client/#launch.client.LaunchClient.create_model_bundle_from_streaming_enhanced_runnable_image_v2).

The first directly pickles a user-specified `load_predict_fn`, a function which
loads the model and returns a `predict_fn`, a function which takes in a request.

…requests at port 5005 using HTTP and exposes `POST /predict` and
`GET /readyz` endpoints.

The fourth is a variant of the third that also starts an instance of the NVidia
Triton framework for efficient model serving.

The fifth is a variant of the third that responds with a stream of SSEs at `POST /stream` (the user can decide whether `POST /predict` is also exposed).

Each of these modes of creating a model bundle is called a "Flavor".
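Not from the Launch codebase, but a minimal stdlib sketch of the server-sent-event wire format that a `POST /stream` response uses: each event is a `data:` field terminated by a blank line, and a streaming bundle can emit one event per generated token (`sse_event` and `stream_tokens` are illustrative names):

```python
def sse_event(data: str) -> str:
    # Format one server-sent event: a "data:" field terminated by a blank line.
    return f"data: {data}\n\n"


def stream_tokens(tokens):
    # Hypothetical generator: yields one SSE event per token as the model
    # produces them, so the client sees output before generation finishes.
    for token in tokens:
        yield sse_event(token)


body = "".join(stream_tokens(["Hello", " world"]))
# body == "data: Hello\n\ndata:  world\n\n"
```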
* You want to use a `RunnableImageFlavor`
* You also want to use [NVidia's `tritonserver`](https://developer.nvidia.com/nvidia-triton-inference-server) to accelerate model inference

A `StreamingEnhancedRunnableImageFlavor` (a runnable image variant) is good if:

* You want to use a `RunnableImageFlavor`
* You also want to support token streaming while the model is generating
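Once a flavor is chosen, bundle creation follows the same general shape. A hedged sketch of calling `create_model_bundle_from_streaming_enhanced_runnable_image_v2` (the schema fields, repository, and tag below are placeholders, and the client call is commented out because it requires a live API key):

```python
from pydantic import BaseModel


class MyRequestSchema(BaseModel):
    # Illustrative request schema; use fields matching your model's input.
    prompt: str


# Placeholder parameters: point "repository" and "tag" at your own image.
bundle_params = {
    "model_bundle_name": "my-streaming-bundle",
    "request_schema": MyRequestSchema,
    "repository": "my-registry/my-image",
    "tag": "latest",
}

# from launch import LaunchClient
# client = LaunchClient(api_key="...")  # needs a real API key
# client.create_model_bundle_from_streaming_enhanced_runnable_image_v2(**bundle_params)
```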
=== "Creating From Callables"
57
62
```py
@@ -248,6 +253,44 @@ Each of these modes of creating a model bundle is called a "Flavor".
248
253
```
249
254
250
255
=== "Creating From a Streaming Enhanced Runnable Image"

    ```py
    import os
    from pydantic import BaseModel
    from launch import LaunchClient


    class MyRequestSchema(BaseModel):
        x: int
        y: str

    class MyResponseSchema(BaseModel):
        __root__: int


    BUNDLE_PARAMS = {
        "model_bundle_name": "test-streaming-bundle",
        "request_schema": MyRequestSchema,
        "response_schema": MyResponseSchema,
        "repository": "...",
        "tag": "...",
        "command": [  # optional; if provided, will also expose the /predict endpoint
            ...
    ```
---

**`docs/concepts/model_endpoints.md`**
…specifies deployment parameters, such as the minimum and maximum number of
workers, as well as the requested resources for each worker, such as the number
of CPUs, amount of memory, GPU count, and type of GPU.

Endpoints can be asynchronous, synchronous, or streaming. Asynchronous endpoints return
a future immediately after receiving a request, and the future can be used to
retrieve the prediction once it is ready. Synchronous endpoints return the
prediction directly after receiving a request. Streaming endpoints are variants of synchronous endpoints that return a stream of SSEs instead of a single HTTP response.

!!! info
    # Choosing the right inference mode

    Here are some tips for how to choose between SyncEndpoint, StreamingEndpoint, AsyncEndpoint, and BatchJob for deploying your ModelBundle:

    A SyncEndpoint is good if:

    * You have strict latency requirements (e.g. on the order of seconds or less).
    * You are willing to have resources continually allocated.

    A StreamingEndpoint is good if:

    * You have stricter requirements on perceived latency than SyncEndpoint can support (e.g. you want tokens generated by the model to start being returned almost immediately rather than waiting for the model generation to finish).
    * You are willing to have resources continually allocated.
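On the client side, a streaming response arrives as a sequence of server-sent events. A stdlib sketch (not Launch client code) of pulling the token payloads out of an SSE body, where each event is a `data:` line and events are separated by blank lines:

```python
def parse_sse(chunk: str) -> list[str]:
    # Extract the data payloads from a block of server-sent events.
    return [
        line[len("data: "):]
        for line in chunk.splitlines()
        if line.startswith("data: ")
    ]


# A streaming endpoint yields tokens as they are generated, e.g.:
tokens = parse_sse("data: Hel\n\ndata: lo\n\n")
# tokens == ["Hel", "lo"]
```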
Streaming model endpoints are variants of sync model endpoints that are useful for tasks with strict requirements on perceived latency. Streaming endpoints are more expensive than async endpoints.
!!! Note
    Streaming model endpoints require at least 1 `min_worker`.
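As a hedged illustration of that constraint (the keyword names below are assumptions for illustration, not a verified Launch API signature):

```python
# Illustrative endpoint configuration; treat the parameter names as placeholders.
endpoint_params = {
    "endpoint_name": "my-streaming-endpoint",
    "endpoint_type": "streaming",
    "min_workers": 1,  # streaming endpoints cannot scale to zero workers
    "max_workers": 4,
}

assert endpoint_params["min_workers"] >= 1
```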