AceCoder0
diff --git a/benchmarks/inference/deepspeedometer/README.md b/benchmarks/inference/deepspeedometer/README.md
Lines changed: 85 additions & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/configs/128k-120.yaml b/benchmarks/inference/deepspeedometer/configs/128k-120.yaml
Lines changed: 5 additions & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/configs/1300-120.yaml b/benchmarks/inference/deepspeedometer/configs/1300-120.yaml
Lines changed: 4 additions & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/configs/2600-60.yaml b/benchmarks/inference/deepspeedometer/configs/2600-60.yaml
Lines changed: 4 additions & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/configs/500-500.yaml b/benchmarks/inference/deepspeedometer/configs/500-500.yaml
Lines changed: 4 additions & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/pyproject.toml b/benchmarks/inference/deepspeedometer/pyproject.toml
Lines changed: 32 additions & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/run_example.sh b/benchmarks/inference/deepspeedometer/run_example.sh
Lines changed: 1 addition & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/src/deepspeedometer/__init__.py b/benchmarks/inference/deepspeedometer/src/deepspeedometer/__init__.py
Lines changed: 2 additions & 0 deletions
diff --git a/benchmarks/inference/deepspeedometer/src/deepspeedometer/arg_parsing.py b/benchmarks/inference/deepspeedometer/src/deepspeedometer/arg_parsing.py
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,85 @@
+# DeepSpeedometer
+
+This benchmark measures the performance of LLM serving solutions. By varying the number of parallel clients that send requests to an inference server, we gather data to plot throughput-latency curves and find the server's saturation point, where it delivers its maximum performance.
+
+## Installation
+
+To install the benchmark, clone this repository and install using `pip`:
+```shell
+git clone https://github.com/Microsoft/DeepSpeedExamples
+cd ./DeepSpeedExamples/benchmarks/inference/deepspeedometer
+pip install .
+```
+
+## Usage
+
+To quickly test the benchmark code without creating an inference server, run the following:
+```shell
+python3 -m deepspeedometer.benchmark_runner --model facebook/opt-125m --api dummy
+```
+
+### Supported APIs
+
+The benchmark supports several APIs, each with its own client type. Depending on the client, you may need to run the benchmark against a locally hosted inference server or a remote one. To add support for a new serving solution, create a new client class that defines a few basic methods. See the section below on adding new clients for more information.
+
+The clients (i.e., APIs) currently supported, along with the configuration options for each, are listed below. For more information about the configuration options, see the `*ClientConfig` classes located in `clients/*.py`:
+
+1. `fastgen`: Runs a local model inference server with DeepSpeed's FastGen. Config options include:
+    - `model`: Which model to use for serving (required)
+    - `deployment_name`: Name of the deployment server
+    - `tp_size`: Tensor parallel size for each model replica
+    - `num_replicas`: Number of model replicas
+    - `max_ragged_batch_size`: Max number of requests running per model replica
+    - `quantization_mode`: Type of quantization to use
+2. `vllm`: Runs a local model inference server with vLLM. Config options include:
+    - `model`: Which model to use for serving (required)
+    - `tp_size`: Tensor parallel size for model
+    - `port`: Which port to use for REST API
+3. `azureml`: Interfaces with a remote AzureML online endpoint/deployment. Config options include:
+    - `api_url`: AzureML endpoint API URL (required)
+    - `api_key`: AzureML token key for connecting to endpoint (required)
+    - `deployment_name`: Name of deployment hosted in given endpoint (required)
+
+### Benchmark Configuration
+
+The benchmark has many options for tailoring performance measurements to specific use cases. For additional information and default values, see the `BenchmarkConfig` class defined in `benchmark_runner.py`.
+
+- `api`: Which API to use
+- `warmup_requests`: Number of warm-up requests to run before measuring performance
+- `result_dir`: Directory where results will be written out (as JSON files)
+- `use_threading`: Whether to use threading for the benchmark clients. Default is to use multi-processing
+- `config_file`: One or more YAML config files that contain values for any of the prompt configuration options (see the Prompt Configuration section below)
+- `num_clients`: One or more integer values for the number of parallel clients to run
+- `num_requests_per_client`: Number of requests that will be run by each of the parallel clients
+- `min_requests`: Minimum number of requests to send over the course of the benchmark. Useful for ensuring a good measurement when the number of clients is low
+- `prompt_text_source`: Text file or string that will be sampled to generate request prompts
+- `early_stop_latency`: When running multiple values for `num_clients`, if the average latency per request exceeds this value (in seconds) the benchmark will not test a larger number of parallel clients
+- `force`: Force the overwrite of result files. By default, if a result file exists, the benchmark is skipped
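The interaction between `num_clients` and `early_stop_latency` described above can be sketched as a simple sweep. This is an illustration only, not the code in `benchmark_runner.py`: `sweep_clients`, `measure_mean_latency`, and `fake_latency` are hypothetical names, and whether the real benchmark records the run that exceeds the threshold is an assumption.

```python
from typing import Callable, Dict, List


def sweep_clients(
    num_clients: List[int],
    measure_mean_latency: Callable[[int], float],
    early_stop_latency: float,
) -> Dict[int, float]:
    """Run each client count in ascending order until latency is too high.

    measure_mean_latency is a hypothetical stand-in for one benchmark run;
    it returns the average latency per request in seconds.
    """
    results: Dict[int, float] = {}
    for n in sorted(num_clients):
        latency = measure_mean_latency(n)
        results[n] = latency
        if latency > early_stop_latency:
            # Assumption: larger client counts would only be slower, so stop.
            break
    return results


def fake_latency(n: int) -> float:
    # Toy model: latency grows linearly with the number of clients.
    return 0.5 * n


results = sweep_clients([1, 2, 4, 8, 16], fake_latency, early_stop_latency=1.5)
print(results)
```

With the toy latency model, the sweep stops after the first client count whose latency exceeds the threshold, so the two largest values are never run.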
+
+### Prompt Configuration
+
+These options allow users to modify the prompt input and generation behavior of the served models. Note that you can run multiple prompt configurations in a single command by using the `config_file` input as described in the Benchmark Configuration section.
+
+- `model`: Which model to use for tokenizing prompts (required)
+- `prompt_generator_seed`: Seed value for random number generation
+- `max_prompt_length`: The maximum prompt length allowed
+- `prompt_length`: Target mean prompt length
+- `prompt_length_var`: Variance of generated prompt lengths
+- `max_new_tokens`: Target mean number of generated tokens
+- `max_new_tokens_var`: Variance of generated tokens
+- `streaming`: Whether to enable streaming output for generated tokens
+
+#### About Prompt Generation
+
+To mimic real-world serving scenarios, this benchmark samples prompt length and generated token length values from a normal distribution. This distribution can be manipulated with the `prompt_length*` and `max_new_tokens*` values in the prompt configuration. To get all prompt lengths and generation lengths to match exactly, set the `*_var` values to 0.
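A minimal sketch of this sampling, using only the standard library, is shown here. It is not the benchmark's own generator: `sample_lengths` is a hypothetical helper, the `max_len` value is made up, and treating the `*_var` value as a fraction of the mean (so the standard deviation is `mean * var`) is an assumption about how the option is interpreted.

```python
import random
from typing import List


def sample_lengths(mean: int, var: float, max_len: int, n: int, seed: int = 0) -> List[int]:
    """Draw n lengths from a normal distribution centred on mean.

    ASSUMPTION: var is treated as a fraction of the mean, so the standard
    deviation is mean * var. The real benchmark may define it differently.
    """
    rng = random.Random(seed)
    lengths = []
    for _ in range(n):
        value = int(rng.gauss(mean, mean * var))
        # Clamp to at least one token and at most the allowed maximum.
        lengths.append(max(1, min(value, max_len)))
    return lengths


# Mean and spread taken from configs/1300-120.yaml; max_len is hypothetical.
varied = sample_lengths(mean=1300, var=0.3, max_len=4000, n=5)
fixed = sample_lengths(mean=1300, var=0.0, max_len=4000, n=5)
print(varied)
print(fixed)
```

With `var=0.0` every sampled length equals the mean, which matches the note above about setting the `*_var` values to 0.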
+
+## Adding New Client APIs
+
+The DeepSpeedometer benchmark is designed to make it easy to add support for new inference server solutions. To do so:
+
+1. Create a new `*_client.py` file in the `clients/` directory.
+2. Define a `*Client` class that inherits from the `BaseClient` class in `clients/base.py`. This class should define 5 methods: `start_service`, `stop_service`, `prepare_request`, `send_request`, and `process_response`. Take a look at the type hints for these methods in the `BaseClient` class to understand the expected inputs and outputs for each method.
+3. Define a `*ClientConfig` class that inherits from the `BaseConfigModel` class. Place any configuration options (i.e., user-passed command line arguments) necessary for your defined `*Client` class in here.
+4. Import the newly added `*Client` and `*ClientConfig` into `clients/__init__.py` and add them to the `client_config_classes` and `client_classes` dictionaries.
+
+For the simplest example of adding a new client, take a look at the `clients/dummy_client.py` file, which defines a client that does not stand up a server and only returns a sample of the input prompt after a short sleep cycle. We use it as a lightweight class for unit testing.
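The steps above can be sketched roughly as follows. Because `clients/base.py` is not shown in this diff, the `BaseClient` below is a stand-in with assumed method signatures, and `EchoClient` is a hypothetical client. Only the five method names come from the README, so check the real type hints before copying this shape.

```python
import time
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseClient(ABC):
    """Stand-in for the real BaseClient in clients/base.py (signatures assumed)."""

    @abstractmethod
    def start_service(self) -> None: ...

    @abstractmethod
    def stop_service(self) -> None: ...

    @abstractmethod
    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]: ...

    @abstractmethod
    def send_request(self, request_kwargs: Dict[str, Any]) -> Any: ...

    @abstractmethod
    def process_response(self, raw_response: Any) -> str: ...


class EchoClient(BaseClient):
    """Hypothetical client that needs no server and echoes part of the prompt."""

    def start_service(self) -> None:
        pass  # Nothing to launch for this toy client.

    def stop_service(self) -> None:
        pass  # Nothing to shut down either.

    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]:
        return {"prompt": prompt, "max_new_tokens": max_new_tokens}

    def send_request(self, request_kwargs: Dict[str, Any]) -> Any:
        time.sleep(0.01)  # Simulate a short round trip.
        words = request_kwargs["prompt"].split()
        return {"text": " ".join(words[: request_kwargs["max_new_tokens"]])}

    def process_response(self, raw_response: Any) -> str:
        return raw_response["text"]


client = EchoClient()
client.start_service()
request = client.prepare_request("the quick brown fox jumps", max_new_tokens=3)
output = client.process_response(client.send_request(request))
client.stop_service()
print(output)
```

In the real package, the new client and its config class would also need to be registered in `clients/__init__.py` as described in step 4.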
@@ -0,0 +1,5 @@
+prompt_length: 128000
+prompt_length_var: 0.1
+max_prompt_length: 131072
+max_new_tokens: 120
+max_new_tokens_var: 0.3
@@ -0,0 +1,4 @@
+prompt_length: 1300
+prompt_length_var: 0.3
+max_new_tokens: 120
+max_new_tokens_var: 0.3
@@ -0,0 +1,4 @@
+prompt_length: 2600
+prompt_length_var: 0.3
+max_new_tokens: 60
+max_new_tokens_var: 0.3
@@ -0,0 +1,4 @@
+prompt_length: 500
+prompt_length_var: 0.3
+max_new_tokens: 500
+max_new_tokens_var: 0.3
@@ -0,0 +1,32 @@
+[build-system]
+requires = ["setuptools>=61.0"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "deepspeedometer"
+version = "0.0.1"
+authors = [
+  { name="Ammar Ahmad Awan", email="ammar.awan@microsoft.com" },
+  { name="Arash Bakhtiari", email="abakhtiari@microsoft.com" },
+  { name="Connor Holmes"},
+  { name="Lev Kurilenko", email="lev.kurilenko@microsoft.com" },
+  { name="Heyang Qin", email="heyangqin@microsoft.com" },
+  { name="Masahiro Tanaka", email="mtanaka@microsoft.com" },
+  { name="Michael Wyatt", email="michaelwyatt@microsoft.com" },
+]
+description = "LLM benchmarking tool"
+readme = "README.md"
+requires-python = ">=3.8"
+classifiers = [
+    "Programming Language :: Python :: 3",
+]
+dependencies = [
+    "loguru",
+    "pydantic>=2.0.0",
+    "torch",
+    "tqdm",
+    "transformers",
+]
+
+[project.urls]
+Homepage = "https://github.com/Microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/deepspeedometer"
+Issues = "https://github.com/Microsoft/DeepSpeedExamples/issues"
@@ -0,0 +1 @@
+python -m src.deepspeedometer.benchmark_runner --model "facebook/opt-125m" --api dummy --config_file ./configs/1300-120.yaml
@@ -0,0 +1,2 @@
+from .arg_parsing import parse_args_to_configs
+from .benchmark_runner import BenchmarkRunner
@@ -0,0 +1,51 @@
+import argparse
+from typing import List, Tuple
+
+from .benchmark_runner import BenchmarkConfig
+from .clients import client_config_classes
+from .config import BaseConfigModel
+
+
+def parse_args_to_configs(args: List[str]) -> Tuple[BenchmarkConfig, BaseConfigModel]:
+    def add_model(parser: argparse.ArgumentParser, model: BaseConfigModel):
+        """Adds fields from pydantic model to the parser."""
+        for name, field in model.model_fields.items():
+            field_type = field.annotation
+
+            # Get information about number of arguments expected
+            nargs = None
+            if getattr(field.annotation, "_name", "") == "List":
+                nargs = "+"
+                field_type = field.annotation.__args__[0]
+
+            # Add field to parser
+            parser.add_argument(
+                f"--{name}",
+                dest=name,
+                nargs=nargs,
+                type=field_type,
+                # pydantic v2 FieldInfo has no `required` attribute; use is_required()
+                required=field.is_required(),
+                default=getattr(field, "default", None),
+                help=getattr(field, "description", ""),
+            )
+
+    # Parse benchmark config fields
+    parser = argparse.ArgumentParser(allow_abbrev=False)
+    add_model(parser, BenchmarkConfig)
+    benchmark_args, remaining_args = parser.parse_known_args(args)
+    benchmark_config = BenchmarkConfig(**vars(benchmark_args))
+    unused_args = set(remaining_args)
+
+    # Parse client config fields
+    client_config_class = client_config_classes[benchmark_config.api]
+    parser = argparse.ArgumentParser(allow_abbrev=False)
+    add_model(parser, client_config_class)
+    client_args, remaining_args = parser.parse_known_args(args)
+    client_config = client_config_class(**vars(client_args))
+
+    # Check for unused arguments
+    unused_args = unused_args.intersection(remaining_args)
+    if unused_args:
+        raise ValueError(f"Unused arguments: {unused_args}")
+
+    return benchmark_config, client_config
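The two-pass parsing in `parse_args_to_configs` relies on `parse_known_args` returning the tokens each parser did not recognize; an argument is reported as unused only when both parsers leave it behind. The standalone sketch below reproduces that idea with plain `argparse` and no pydantic models. `find_unused_args` is a hypothetical helper, and the flags `--api` and `--port` stand in for the benchmark and client config fields.

```python
import argparse
from typing import List, Set


def find_unused_args(args: List[str]) -> Set[str]:
    # First pass: a parser that only knows the stand-in benchmark flag.
    benchmark_parser = argparse.ArgumentParser(allow_abbrev=False)
    benchmark_parser.add_argument("--api", type=str)
    _, benchmark_leftover = benchmark_parser.parse_known_args(args)

    # Second pass: a parser that only knows the stand-in client flag.
    client_parser = argparse.ArgumentParser(allow_abbrev=False)
    client_parser.add_argument("--port", type=int)
    _, client_leftover = client_parser.parse_known_args(args)

    # A token is unused only if both parsers left it behind.
    return set(benchmark_leftover).intersection(client_leftover)


ok = find_unused_args(["--api", "dummy", "--port", "8000"])
bad = find_unused_args(["--api", "dummy", "--port", "8000", "--bogus", "1"])
print(ok)
print(bad)
```

Because the check works on a set of raw tokens, the value that follows an unknown flag (here `1`) is reported along with the flag itself.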