| 1 | +# DeepSpeedometer |
| 2 | + |
| 3 | +This benchmark measures the performance of LLM serving solutions. By varying the number of parallel clients that send requests to an inference server, we gather data to plot throughput-latency curves and find the server's saturation point, which indicates its maximum performance. |
| 4 | + |
| 5 | +## Installation |
| 6 | + |
| 7 | +To install the benchmark, clone this repository and install using `pip`: |
| 8 | +```shell |
| 9 | +git clone https://github.com/Microsoft/DeepSpeedExamples |
| 10 | +cd ./DeepSpeedExamples/benchmarks/deepspeedometer |
| 11 | +pip install . |
| 12 | +``` |
| 13 | + |
| 14 | +## Usage |
| 15 | + |
| 16 | +To quickly test the benchmark code without creating an inference server, run the following: |
| 17 | +```shell |
| 18 | +python3 -m deepspeedometer.benchmark_runner --model facebook/opt-125m --api dummy |
| 19 | +``` |
| 20 | + |
| 21 | +### Supported APIs |
| 22 | + |
| 23 | +The benchmark supports different APIs, each with its own client type. Depending on the client, you may need to run the benchmark against a locally hosted inference server or a remote inference server. To add support for a new serving solution, create a new client class that defines a few basic methods. See the Adding New Client APIs section below for more information. |
| 24 | + |
| 25 | +The clients (i.e., APIs) currently supported, and the configuration options for each, are listed below. You can find more information about the configuration options in the `*ClientConfig` classes located in `clients/*.py`: |
| 26 | + |
| 27 | +1. `fastgen`: Runs a local model inference server with DeepSpeed's FastGen. Config options include: |
| 28 | + - `model`: Which model to use for serving (required) |
| 29 | + - `deployment_name`: Name of the deployment server |
| 30 | + - `tp_size`: Tensor parallel size for each model replica |
| 31 | + - `num_replicas`: Number of model replicas |
| 32 | + - `max_ragged_batch_size`: Max number of requests running per model replica |
| 33 | + - `quantization_mode`: Type of quantization to use |
| 34 | +2. `vllm`: Runs a local model inference server with vLLM. Config options include: |
| 35 | + - `model`: Which model to use for serving (required) |
| 36 | + - `tp_size`: Tensor parallel size for model |
| 37 | + - `port`: Which port to use for REST API |
| 38 | +3. `azureml`: Interfaces with a remote AzureML online endpoint/deployment. Config options include: |
| 39 | + - `api_url`: AzureML endpoint API URL (required) |
| 40 | + - `api_key`: AzureML token key for connecting to endpoint (required) |
| 41 | + - `deployment_name`: Name of deployment hosted in given endpoint (required) |
| 42 | + |
| 43 | +### Benchmark Configuration |
| 44 | + |
| 45 | +The benchmark has many options for tailoring performance measurements to specific use cases. For additional information and default values, see the `BenchmarkConfig` class defined in `benchmark_runner.py`. |
| 46 | + |
| 47 | +- `api`: Which API to use |
| 48 | +- `warmup_requests`: Number of warm-up requests to run before measuring performance |
| 49 | +- `result_dir`: Directory where results will be written out (as JSON files) |
| 50 | +- `use_threading`: Whether to use threading for the benchmark clients. Default is to use multi-processing |
| 51 | +- `config_file`: One or more YAML config files that contain values for any of the prompt configuration options (see the Prompt Configuration section below) |
| 52 | +- `num_clients`: One or more integer values for the number of parallel clients to run |
| 53 | +- `num_requests_per_client`: Number of requests that will be run by each of the parallel clients |
| 54 | +- `min_requests`: Minimum number of requests to send over the course of the benchmark. Useful for ensuring a good measurement when the number of clients is low. |
| 55 | +- `prompt_text_source`: Text file or string that will be sampled to generate request prompts |
| 56 | +- `early_stop_latency`: When running multiple values for `num_clients`, if the average latency per request exceeds this value (in seconds), the benchmark will not test a larger number of parallel clients |
| 57 | +- `force`: Force the overwrite of result files. By default, if a result file exists, the benchmark is skipped |
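The interaction between `num_clients` and `early_stop_latency` can be sketched in a few lines of Python. This is a minimal illustration of the behavior described above, not the benchmark's actual implementation: `sweep_num_clients`, `run_once`, and `fake_run` are hypothetical names, and `fake_run` stands in for a real benchmark run.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def sweep_num_clients(
    num_clients_values: Sequence[int],
    run_once: Callable[[int], List[float]],
    early_stop_latency: float,
) -> Dict[int, float]:
    """Run one measurement per client count, stopping once latency is too high.

    `run_once` is a hypothetical callable that returns the per-request
    latencies (in seconds) observed with the given number of clients.
    """
    results: Dict[int, float] = {}
    for num_clients in sorted(num_clients_values):
        avg_latency = mean(run_once(num_clients))
        results[num_clients] = avg_latency
        # Do not test larger client counts once the threshold is exceeded.
        if avg_latency > early_stop_latency:
            break
    return results


def fake_run(num_clients: int) -> List[float]:
    # Stand-in for a real run: latency grows linearly with the load.
    return [0.5 * num_clients] * 4


if __name__ == "__main__":
    print(sweep_num_clients([1, 2, 4, 8, 16], fake_run, early_stop_latency=1.5))
```

With the stand-in above, the sweep records 1, 2, and 4 clients and skips 8 and 16, because the average latency at 4 clients already exceeds the threshold.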
| 58 | + |
| 59 | +### Prompt Configuration |
| 60 | + |
| 61 | +These options allow users to modify the prompt input and generation behavior of the served models. Note that you can run multiple prompt configurations in a single command by using the `config_file` input as described in the Benchmark Configuration section. |
| 62 | + |
| 63 | +- `model`: Which model to use for tokenizing prompts (required) |
| 64 | +- `prompt_generator_seed`: Seed value for random number generation |
| 65 | +- `max_prompt_length`: The maximum prompt length allowed |
| 66 | +- `prompt_length`: Target mean prompt length |
| 67 | +- `prompt_length_var`: Variance of generated prompt lengths |
| 68 | +- `max_new_tokens`: Target mean number of generated tokens |
| 69 | +- `max_new_tokens_var`: Variance of generated tokens |
| 70 | +- `streaming`: Whether to enable streaming output for generated tokens |
| 71 | + |
| 72 | +#### About Prompt Generation |
| 73 | + |
| 74 | +To mimic real-world serving scenarios, this benchmark samples prompt length and generated token length values from a normal distribution. This distribution can be manipulated with the `prompt_length*` and `max_new_tokens*` values in the prompt configuration. To get all prompt lengths and generation lengths to match exactly, set the `*_var` values to 0. |
| 75 | + |
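As a rough illustration of the sampling described above, the sketch below draws lengths from a normal distribution and clamps them to a maximum. It is an assumption-laden example rather than the benchmark's own code: `sample_lengths` is a hypothetical helper, and it treats the `*_var` value as an absolute variance, which may differ from how the benchmark scales it.

```python
import math
import random
from typing import List


def sample_lengths(
    mean_length: int,
    variance: float,
    max_length: int,
    count: int,
    seed: int = 42,
) -> List[int]:
    """Sample `count` token lengths from a normal distribution.

    Assumption: `variance` is an absolute variance, so the standard
    deviation passed to the sampler is its square root.
    """
    rng = random.Random(seed)  # seeded for reproducible prompts
    std_dev = math.sqrt(variance)
    lengths = []
    for _ in range(count):
        value = round(rng.gauss(mean_length, std_dev))
        # Keep every length within [1, max_length].
        lengths.append(min(max(value, 1), max_length))
    return lengths


if __name__ == "__main__":
    print(sample_lengths(mean_length=512, variance=0, max_length=2048, count=3))
    print(sample_lengths(mean_length=512, variance=400, max_length=2048, count=3))
```

With a variance of 0, every sampled length equals the mean, which mirrors the behavior of setting the `*_var` values to 0.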
| 76 | +## Adding New Client APIs |
| 77 | + |
| 78 | +The DeepSpeedometer benchmark is designed to make it easy to add support for new inference server solutions. To do so: |
| 79 | + |
| 80 | +1. Create a new `*_client.py` file in the `clients/` directory. |
| 81 | +2. Define a `*Client` class that inherits from the `BaseClient` class in `clients/base.py`. This class should define five methods: `start_service`, `stop_service`, `prepare_request`, `send_request`, and `process_response`. See the type hints for these methods in the `BaseClient` class to understand the expected inputs and outputs of each method. |
| 82 | +3. Define a `*ClientConfig` class that inherits from the `BaseConfigModel` class. Place any configuration options (i.e., user-passed command line arguments) necessary for your defined `*Client` class in here. |
| 83 | +4. Import the newly added `*Client` and `*ClientConfig` classes into `clients/__init__.py` and add them to the `client_classes` and `client_config_classes` dictionaries, respectively. |
| 84 | + |
| 85 | +For the simplest example of adding a new client, see `clients/dummy_client.py`, which defines a client that does not stand up a server and only returns a sample of the input prompt after a short sleep cycle. We use it as a lightweight class for unit testing. |
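The overall shape of a client can be sketched as follows. The five method names come from the steps above, but everything else is an assumption made for illustration: `BaseClientSketch` is a hypothetical stand-in for `BaseClient`, `EchoClient` is an invented example, and the argument and return types shown here are guesses. The authoritative signatures are the type hints in `clients/base.py`.

```python
import time
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseClientSketch(ABC):
    """Hypothetical stand-in for `BaseClient`; signatures are assumptions."""

    @abstractmethod
    def start_service(self) -> None: ...

    @abstractmethod
    def stop_service(self) -> None: ...

    @abstractmethod
    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]: ...

    @abstractmethod
    def send_request(self, request_kwargs: Dict[str, Any]) -> Any: ...

    @abstractmethod
    def process_response(self, raw_response: Any) -> str: ...


class EchoClient(BaseClientSketch):
    """Invented example: no server, echoes part of the prompt after a sleep."""

    def __init__(self, sleep_seconds: float = 0.0) -> None:
        self.sleep_seconds = sleep_seconds
        self.running = False

    def start_service(self) -> None:
        self.running = True  # nothing to launch in this sketch

    def stop_service(self) -> None:
        self.running = False

    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]:
        # Package everything send_request needs into one dictionary.
        return {"prompt": prompt, "max_new_tokens": max_new_tokens}

    def send_request(self, request_kwargs: Dict[str, Any]) -> Any:
        time.sleep(self.sleep_seconds)  # simulate server latency
        words = request_kwargs["prompt"].split()
        return " ".join(words[: request_kwargs["max_new_tokens"]])

    def process_response(self, raw_response: Any) -> str:
        # A real client would extract the generated text from the server reply.
        return str(raw_response)


if __name__ == "__main__":
    client = EchoClient()
    client.start_service()
    request = client.prepare_request("the quick brown fox", max_new_tokens=2)
    print(client.process_response(client.send_request(request)))
    client.stop_service()
```

Splitting request preparation, sending, and response processing into separate methods keeps serialization and parsing apart from the network call itself, though whether the benchmark relies on that separation is not stated here.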