Commit bbab278

Refactored LLM benchmark code (deepspeedai#899)
* add refactored LLM benchmark code, initial commit
* move prompt processing outside benchmark loop
* formatting and improvements
* slight refactor of benchmark runner, cleaning up, adding type hints
* add tests, small refactors to improve code, make installable package
* clean up code, add TODO notes for intended changes
* update author list
* add early stopping of benchmarks
* add support for longer prompts and cleanup
* address PR feedback, fix bugs, small updates
* small change
* fix small bugs around prompt length and max token size
* remove debug prints
* Update 128k-120.yaml
* add min_requests override and print out for result summary
* add README, rename benchmark
* update unit tests
1 parent 75df1d7 commit bbab278

27 files changed

Lines changed: 1486 additions & 0 deletions
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# DeepSpeedometer

This benchmark measures the performance of LLM serving solutions. By varying the number of parallel clients that send requests to an inference server, we gather data to plot throughput-latency curves and find the saturation point at which the server delivers its maximum performance.

## Installation

To install the benchmark, clone this repository and install using `pip`:
```shell
git clone https://github.com/Microsoft/DeepSpeedExamples
cd ./DeepSpeedExamples/benchmarks/inference/deepspeedometer
pip install .
```

## Usage

To quickly test the benchmark code without creating an inference server, run the following:
```
python3 -m deepspeedometer.benchmark_runner --model facebook/opt-125m --api dummy
```

### Supported APIs

The benchmark supports several APIs, each with its own client type. Depending on the client, you may need to run the benchmark against a locally hosted inference server or a remote inference server. To add support for a new serving solution, create a new client class that defines a few basic methods. See the "Adding New Client APIs" section below for more information.

The currently supported clients (i.e., APIs) and their configuration options are listed below. For more information about the configuration options, see the `*ClientConfig` classes located in `clients/*.py`:

1. `fastgen`: Runs a local model inference server with DeepSpeed's FastGen. Config options include:
    - `model`: Which model to use for serving (required)
    - `deployment_name`: Name of the deployment server
    - `tp_size`: Tensor parallel size for each model replica
    - `num_replicas`: Number of model replicas
    - `max_ragged_batch_size`: Max number of requests running per model replica
    - `quantization_mode`: Type of quantization to use
2. `vllm`: Runs a local model inference server with vLLM. Config options include:
    - `model`: Which model to use for serving (required)
    - `tp_size`: Tensor parallel size for model
    - `port`: Which port to use for REST API
3. `azureml`: Interfaces with a remote AzureML online endpoint/deployment. Config options include:
    - `api_url`: AzureML endpoint API URL (required)
    - `api_key`: AzureML token key for connecting to endpoint (required)
    - `deployment_name`: Name of deployment hosted in given endpoint (required)

### Benchmark Configuration

The benchmark has many options for tailoring performance measurements to a specific use case. For additional information and default values, see the `BenchmarkConfig` class defined in `benchmark_runner.py`.

- `api`: Which API to use
- `warmup_requests`: Number of warm-up requests to run before measuring performance
- `result_dir`: Directory where results will be written out (as JSON files)
- `use_threading`: Whether to use threading for the benchmark clients. The default is multi-processing
- `config_file`: One or more YAML config files that contain values for any of the prompt configuration options (see the Prompt Configuration section below)
- `num_clients`: One or more integer values for the number of parallel clients to run
- `num_requests_per_client`: Number of requests that will be run by each of the parallel clients
- `min_requests`: Minimum number of requests to send over the duration of the benchmark. Useful for ensuring a good measurement when the number of clients is low
- `prompt_text_source`: Text file or string that will be sampled to generate request prompts
- `early_stop_latency`: When running multiple values for `num_clients`, if the average latency per request exceeds this value (in seconds), the benchmark will not test a larger number of parallel clients
- `force`: Force the overwrite of result files. By default, if a result file exists, the benchmark is skipped

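The interaction between `num_clients` and `early_stop_latency` can be sketched as follows. This is a minimal illustration and not the benchmark's actual loop: `sweep_clients` and `fake_latency` are hypothetical names, and the latency model is made up so that the example runs without a server.

```python
from typing import Callable, Dict, List


def sweep_clients(
    num_clients: List[int],
    measure_mean_latency: Callable[[int], float],
    early_stop_latency: float,
) -> Dict[int, float]:
    """Run one measurement per client count, stopping once latency is too high.

    Hypothetical helper: `measure_mean_latency` stands in for a full
    benchmark run and returns the average latency per request in seconds.
    """
    results: Dict[int, float] = {}
    for n in sorted(num_clients):
        latency = measure_mean_latency(n)
        results[n] = latency
        if latency > early_stop_latency:
            # Assumption: larger client counts only raise latency, so skip them.
            break
    return results


def fake_latency(n: int) -> float:
    """Made-up latency model: 0.5 s base plus 0.25 s per parallel client."""
    return 0.5 + 0.25 * n


results = sweep_clients([1, 2, 4, 8, 16], fake_latency, early_stop_latency=2.0)
print(results)
```

With this made-up model the run with 8 clients exceeds the 2-second threshold, so the sweep never tests 16 clients.
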
### Prompt Configuration

These options allow users to modify the prompt input and generation behavior of the served models. Note that you can run multiple prompt configurations in a single command by using the `config_file` input as described in the Benchmark Configuration section.

- `model`: Which model to use for tokenizing prompts (required)
- `prompt_generator_seed`: Seed value for random number generation
- `max_prompt_length`: The maximum prompt length allowed
- `prompt_length`: Target mean prompt length
- `prompt_length_var`: Variance of generated prompt lengths
- `max_new_tokens`: Target mean number of generated tokens
- `max_new_tokens_var`: Variance of generated tokens
- `streaming`: Whether to enable streaming output for generated tokens

#### About Prompt Generation

To mimic real-world serving scenarios, this benchmark samples prompt length and generated token length values from a normal distribution. This distribution can be manipulated with the `prompt_length*` and `max_new_tokens*` values in the prompt configuration. To get all prompt lengths and generation lengths to match exactly, set the `*_var` values to 0.

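The sampling described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the benchmark's own generator: `sample_lengths` is a hypothetical helper, the `max_length=2048` limit is an arbitrary example value, and the `*_var` value is assumed to act as a fraction of the mean that sets the standard deviation (the real scaling may differ).

```python
import random
from typing import List


def sample_lengths(
    mean: int,
    var: float,
    max_length: int,
    count: int,
    seed: int = 0,
) -> List[int]:
    """Draw `count` lengths from a normal distribution centered on `mean`.

    Assumption: `var` is a fraction of the mean and is used as the relative
    standard deviation. The benchmark's real scaling may differ.
    """
    rng = random.Random(seed)
    std = mean * var
    lengths = []
    for _ in range(count):
        value = round(rng.gauss(mean, std))
        # Clamp to a valid range so no request is empty or over the limit.
        lengths.append(max(1, min(max_length, value)))
    return lengths


varied = sample_lengths(mean=1300, var=0.3, max_length=2048, count=5)
fixed = sample_lengths(mean=1300, var=0.0, max_length=2048, count=5)
print(varied)
print(fixed)
```

With `var=0.0` every sampled length equals the mean, which mirrors the advice above to set the `*_var` values to 0 for identical requests.
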
## Adding New Client APIs

The DeepSpeedometer benchmark is designed to make it easy to add support for new inference server solutions. To do so:

1. Create a new `*_client.py` file in the `clients/` directory.
2. Define a `*Client` class that inherits from the `BaseClient` class in `clients/base.py`. This class should define five methods: `start_service`, `stop_service`, `prepare_request`, `send_request`, and `process_response`. Take a look at the type hints for these methods in the `BaseClient` class to understand the expected inputs and outputs for each method.
3. Define a `*ClientConfig` class that inherits from the `BaseConfigModel` class. Place any configuration options (i.e., user-passed command line arguments) that your `*Client` class needs in this class.
4. Import the newly added `*Client` and `*ClientConfig` into `clients/__init__.py` and add them to the `client_config_classes` and `client_classes` dictionaries.

For the simplest example of adding a new client, take a look at the `clients/dummy_client.py` file, where we have defined a client that does not stand up a server and only returns a sample of the input prompt after a short sleep cycle. We use this as a lightweight class for unit testing.
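
The steps above can be sketched as a self-contained example. The `BaseClient` below is only a stand-in so that the snippet runs on its own: its method signatures are assumptions, and the real type hints in `clients/base.py` should be treated as the reference. `EchoClient` is a hypothetical client, loosely modeled on the description of the dummy client.

```python
import time
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseClient(ABC):
    """Stand-in for the interface in `clients/base.py` (signatures assumed)."""

    @abstractmethod
    def start_service(self) -> None: ...

    @abstractmethod
    def stop_service(self) -> None: ...

    @abstractmethod
    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]: ...

    @abstractmethod
    def send_request(self, request_kwargs: Dict[str, Any]) -> Any: ...

    @abstractmethod
    def process_response(self, raw_response: Any) -> str: ...


class EchoClient(BaseClient):
    """Hypothetical client that echoes part of the prompt, with no server."""

    def start_service(self) -> None:
        pass  # Nothing to launch for an in-process client.

    def stop_service(self) -> None:
        pass  # Nothing to shut down.

    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]:
        return {"prompt": prompt, "max_new_tokens": max_new_tokens}

    def send_request(self, request_kwargs: Dict[str, Any]) -> Any:
        time.sleep(0.01)  # Simulate a short round trip.
        words = request_kwargs["prompt"].split()
        return {"text": " ".join(words[: request_kwargs["max_new_tokens"]])}

    def process_response(self, raw_response: Any) -> str:
        return raw_response["text"]


client = EchoClient()
client.start_service()
request = client.prepare_request("the quick brown fox jumps", max_new_tokens=3)
output = client.process_response(client.send_request(request))
client.stop_service()
print(output)
```

A real client would also need the matching `*ClientConfig` class and the registration in `clients/__init__.py` described in steps 3 and 4.
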
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
prompt_length: 128000
prompt_length_var: 0.1
max_prompt_length: 131072
max_new_tokens: 120
max_new_tokens_var: 0.3
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
prompt_length: 1300
prompt_length_var: 0.3
max_new_tokens: 120
max_new_tokens_var: 0.3
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
prompt_length: 2600
prompt_length_var: 0.3
max_new_tokens: 60
max_new_tokens_var: 0.3
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
prompt_length: 500
prompt_length_var: 0.3
max_new_tokens: 500
max_new_tokens_var: 0.3
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "deepspeedometer"
version = "0.0.1"
authors = [
    { name="Ammar Ahmad Awan", email="ammar.awan@microsoft.com" },
    { name="Arash Bakhtiari", email="abakhtiari@microsoft.com" },
    { name="Connor Holmes"},
    { name="Lev Kurilenko", email="lev.kurilenko@microsoft.com" },
    { name="Heyang Qin", email="heyangqin@microsoft.com" },
    { name="Masahiro Tanaka", email="mtanaka@microsoft.com" },
    { name="Michael Wyatt", email="michaelwyatt@microsoft.com" },
]
description = "LLM benchmarking tool"
readme = "README.md"
requires-python = ">=3.8"
classifiers = [
    "Programming Language :: Python :: 3",
]
dependencies = [
    "loguru",
    "pydantic>=2.0.0",
    "torch",
    "tqdm",
    "transformers",
]

[project.urls]
Homepage = "https://github.com/Microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/deepspeedometer"
Issues = "https://github.com/Microsoft/DeepSpeedExamples/issues"
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
python -m src.deepspeedometer.benchmark_runner --model "facebook/opt-125m" --api dummy --config_file ./configs/1300-120.yaml
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
from .arg_parsing import parse_args_to_configs
from .benchmark_runner import BenchmarkRunner
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
import argparse
from typing import List, Tuple

from .benchmark_runner import BenchmarkConfig
from .clients import client_config_classes
from .config import BaseConfigModel


def parse_args_to_configs(args: List[str]) -> Tuple[BenchmarkConfig, BaseConfigModel]:
    def add_model(parser: argparse.ArgumentParser, model: BaseConfigModel):
        """Adds fields from pydantic model to the parser."""
        for name, field in model.model_fields.items():
            field_type = field.annotation

            # Get information about number of arguments expected
            nargs = None
            if getattr(field.annotation, "_name", "") == "List":
                nargs = "+"
                field_type = field.annotation.__args__[0]

            # Add field to parser
            parser.add_argument(
                f"--{name}",
                dest=name,
                nargs=nargs,
                type=field_type,
                required=getattr(field, "required", False),
                default=getattr(field, "default", None),
                help=getattr(field, "description", ""),
            )

    # Parse benchmark config fields
    parser = argparse.ArgumentParser(allow_abbrev=False)
    add_model(parser, BenchmarkConfig)
    benchmark_args, remaining_args = parser.parse_known_args(args)
    benchmark_config = BenchmarkConfig(**vars(benchmark_args))
    unused_args = set(remaining_args)

    # Parse client config fields
    client_config_class = client_config_classes[benchmark_config.api]
    parser = argparse.ArgumentParser(allow_abbrev=False)
    add_model(parser, client_config_class)
    client_args, remaining_args = parser.parse_known_args(args)
    client_config = client_config_class(**vars(client_args))

    # Check for unused arguments
    unused_args = unused_args.intersection(remaining_args)
    if unused_args:
        raise ValueError(f"Unused arguments: {unused_args}")

    return benchmark_config, client_config
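
The two-pass parsing used above can be tried in isolation with plain `argparse`. The sketch below is a simplified illustration, not part of the commit: `find_unused_args` and `--typo_flag` are hypothetical, and the parsers are hard-coded with a few option names taken from the README rather than generated from pydantic models.

```python
import argparse
from typing import List, Set


def find_unused_args(args: List[str]) -> Set[str]:
    """Return the tokens that neither of two independent parsers recognizes."""
    # First parser: stands in for the benchmark-level options.
    benchmark_parser = argparse.ArgumentParser(allow_abbrev=False)
    benchmark_parser.add_argument("--api", type=str, default="dummy")
    benchmark_parser.add_argument("--num_clients", type=int, nargs="+", default=[1])
    _, benchmark_rest = benchmark_parser.parse_known_args(args)

    # Second parser: stands in for the client-level options.
    client_parser = argparse.ArgumentParser(allow_abbrev=False)
    client_parser.add_argument("--model", type=str, required=True)
    _, client_rest = client_parser.parse_known_args(args)

    # A token is unused only if both parsers left it over.
    return set(benchmark_rest) & set(client_rest)


ok = find_unused_args(["--model", "facebook/opt-125m", "--num_clients", "1", "2"])
bad = find_unused_args(["--model", "facebook/opt-125m", "--typo_flag", "3"])
print(ok)
print(bad)
```

Each parser sees the full argument list and ignores what it does not know, so only tokens rejected by both end up in the intersection. One consequence of comparing raw tokens is that a value shared by an unknown flag and a known one (for example the same number) could be reported even though one parser consumed it.
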
