NOTE: This is an experimental tool and is not currently being supported since it's not fully functional. Please use the MII benchmark which can be found here: https://github.com/deepspeedai/DeepSpeedExamples/tree/master/benchmarks/inference/mii
This benchmark is designed to measure the performance of LLM serving solutions. By varying the number of parallel clients that send requests to an inference server, we gather data to plot throughput-latency curves and find the server's saturation point, where it reaches its maximum performance.
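As a rough illustration of what "saturation point" means here, the sketch below picks the last run before throughput stops improving as clients are added. The `Measurement` type, the `find_saturation_point` helper, the 5% gain threshold, and the numbers are all hypothetical and are not part of the benchmark's code or output format.

```python
from typing import List, NamedTuple


class Measurement(NamedTuple):
    num_clients: int
    throughput: float  # requests per second
    latency: float     # mean seconds per request


def find_saturation_point(measurements: List[Measurement], min_gain: float = 0.05) -> Measurement:
    """Return the last run before throughput stops improving.

    A sweep is treated as saturated once adding clients raises throughput
    by less than `min_gain` (a fraction) over the previous run.
    """
    ordered = sorted(measurements, key=lambda m: m.num_clients)
    for prev, curr in zip(ordered, ordered[1:]):
        if curr.throughput < prev.throughput * (1 + min_gain):
            return prev
    return ordered[-1]  # no saturation seen in the measured range


# Made-up numbers, for illustration only.
runs = [
    Measurement(1, 2.0, 0.50),
    Measurement(2, 3.9, 0.51),
    Measurement(4, 7.4, 0.54),
    Measurement(8, 7.6, 1.05),
    Measurement(16, 7.5, 2.13),
]
saturation = find_saturation_point(runs)
print(saturation)
```

In this made-up sweep, going from 4 to 8 clients roughly doubles latency while throughput barely moves, so the 4-client run is reported.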
To install the benchmark, clone this repository and install using pip:
git clone https://github.com/deepspeedai/DeepSpeedExamples
cd ./DeepSpeedExamples/benchmarks/deepspeedometer
pip install .
To quickly test the benchmark code without creating an inference server, run the following:
python3 -m deepspeedometer.benchmark_runner --model facebook/opt-125m --api dummy
The benchmark supports different APIs, each with their own client type. Depending on the client, you may need to run the benchmark against a locally hosted inference server or a remote inference server. Adding support for new serving solutions can be achieved by creating a new client class that defines a few basic methods. See the section below on adding new clients for more information.
The clients (i.e., APIs) currently supported, along with the configuration options for each, are listed below. You can find more information about the configuration options in the *ClientConfig classes located in clients/*.py:
- fastgen: Runs a local model inference server with DeepSpeed's FastGen. Config options include:
  - model: Which model to use for serving (required)
  - deployment_name: Name of the deployment server
  - tp_size: Tensor parallel size for each model replica
  - num_replicas: Number of model replicas
  - max_ragged_batch_size: Max number of requests running per model replica
  - quantization_mode: Type of quantization to use
- vllm: Runs a local model inference server with vLLM. Config options include:
  - model: Which model to use for serving (required)
  - tp_size: Tensor parallel size for model
  - port: Which port to use for REST API
- azureml: Interfaces with a remote AzureML online endpoint/deployment. Config options include:
  - api_url: AzureML endpoint API URL (required)
  - api_key: AzureML token key for connecting to endpoint (required)
  - deployment_name: Name of deployment hosted in given endpoint (required)
The benchmark has many options for tailoring performance measurements to specific use cases. For additional information and default values, see the BenchmarkConfig class defined in benchmark_runner.py.
- api: Which API to use
- warmup_requests: Number of warm-up requests to run before measuring performance
- result_dir: Directory where results will be written out (as JSON files)
- use_threading: Whether to use threading for the benchmark clients. Default is to use multi-processing
- config_file: One or more config YAML files that contain values for any of the Prompt configuration options (see below section on prompt configuration)
- num_clients: One or more integer values for the number of parallel clients to run
- num_requests_per_client: Number of requests that will be run by each of the parallel clients
- min_requests: Minimum number of requests to be sent during duration of benchmark. Useful when there is a low number of clients to ensure good measurement.
- prompt_text_source: Text file or string that will be sampled to generate request prompts
- early_stop_latency: When running multiple values for num_clients, if the average latency per request exceeds this value (in seconds) the benchmark will not test a larger number of parallel clients
- force: Force the overwrite of result files. By default, if a result file exists, the benchmark is skipped
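The interaction between num_clients and early_stop_latency can be pictured with the sketch below. It is only an illustration of the behavior described above, not the code in benchmark_runner.py; `run_sweep`, `run_one`, and `fake_run` are hypothetical names.

```python
import statistics
from typing import Callable, Dict, Iterable, List


def run_sweep(
    num_clients_values: Iterable[int],
    run_one: Callable[[int], List[float]],
    early_stop_latency: float,
) -> Dict[int, float]:
    """Run `run_one(n)` for increasing client counts until latency is too high.

    `run_one` is a hypothetical callable that returns the per-request
    latencies (in seconds) measured with `n` parallel clients.
    """
    results: Dict[int, float] = {}
    for n in sorted(num_clients_values):
        mean_latency = statistics.mean(run_one(n))
        results[n] = mean_latency
        if mean_latency > early_stop_latency:
            break  # larger client counts are not tested
    return results


def fake_run(n: int) -> List[float]:
    # Stand-in for a real benchmark run: latency grows with the client count.
    return [0.1 * n] * 4


sweep = run_sweep([1, 2, 4, 8, 16], fake_run, early_stop_latency=0.5)
print(sweep)
```

Here the 8-client run is still recorded, but because its average latency is above the threshold, the 16-client run is skipped.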
These options allow users to modify the prompt input and generation behavior of the served models. Note that you can run multiple prompt configurations in a single command by using the config_file input as described in the Benchmark Configuration section.
- model: Which model to use for tokenizing prompts (required)
- prompt_generator_seed: Seed value for random number generation
- max_prompt_length: The maximum prompt length allowed
- prompt_length: Target mean prompt length
- prompt_length_var: Variance of generated prompt lengths
- max_new_tokens: Target mean number of generated tokens
- max_new_tokens_var: Variance of generated tokens
- streaming: Whether to enable streaming output for generated tokens
To mimic real-world serving scenarios, this benchmark samples prompt length and generated token length values from a normal distribution. This distribution can be manipulated with the prompt_length* and max_new_tokens* values in the prompt configuration. To get all prompt lengths and generation lengths to match exactly, set the *_var values to 0.
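A minimal sketch of this kind of sampling is shown below. It assumes the *_var value is an absolute variance (so the standard deviation is its square root) and that samples are clamped to the allowed range; the benchmark's own prompt generator may interpret these values differently. `sample_length` is a hypothetical helper, not a function from this repository.

```python
import math
import random


def sample_length(rng: random.Random, mean: float, variance: float, max_length: int) -> int:
    """Draw an integer length from N(mean, variance), clamped to [1, max_length]."""
    # Assumption: `variance` is an absolute variance, so sigma = sqrt(variance).
    value = rng.gauss(mean, math.sqrt(variance))
    return max(1, min(max_length, round(value)))


rng = random.Random(42)  # plays the role of prompt_generator_seed

# Varying lengths centered on 256 tokens, never longer than 300.
prompt_lengths = [sample_length(rng, mean=256, variance=900, max_length=300) for _ in range(5)]

# With zero variance every sampled length equals the mean.
fixed_lengths = [sample_length(rng, mean=256, variance=0, max_length=300) for _ in range(5)]

print(prompt_lengths)
print(fixed_lengths)
```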
The DeepSpeedometer benchmark was designed to make it easy to add support for new inference server solutions. To do so:
- Create a new *_client.py file in the clients/ directory.
- Define a *Client class that inherits from the BaseClient class in clients/base.py. This class should define 5 methods: start_service, stop_service, prepare_request, send_request, and process_response. Take a look at the type hints for these methods in the BaseClient class to understand the expected inputs and outputs for each method.
- Define a *ClientConfig class that inherits from the BaseConfigModel class. Place any configuration options (i.e., user-passed command line arguments) necessary for your defined *Client class in here.
- Import the newly added *Client and *ClientConfig into clients/__init__.py and add them to the client_config_classes and client_classes dictionaries.
For the simplest example of adding a new client, take a look at the clients/dummy_client.py file where we have defined a client that does not stand up a server and only returns a sample of the input prompt after a short sleep cycle. We use this as a light-weight class for unit testing.
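To give a feel for the shape of such a client, here is a self-contained sketch in the spirit of the dummy client. The `BaseClient` defined here is a stand-in written for this example, and every method signature, the `EchoClient` name, and the `sleep_seconds` option are assumptions; check the type hints in clients/base.py for the real interface before writing an actual client.

```python
import time
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseClient(ABC):
    """Stand-in for the class in clients/base.py; real signatures may differ."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def start_service(self) -> None: ...

    @abstractmethod
    def stop_service(self) -> None: ...

    @abstractmethod
    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]: ...

    @abstractmethod
    def send_request(self, request_kwargs: Dict[str, Any]) -> Any: ...

    @abstractmethod
    def process_response(self, raw_response: Any) -> str: ...


class EchoClient(BaseClient):
    """Hypothetical client that starts no server and echoes part of the prompt."""

    def start_service(self) -> None:
        pass  # nothing to launch

    def stop_service(self) -> None:
        pass  # nothing to tear down

    def prepare_request(self, prompt: str, max_new_tokens: int) -> Dict[str, Any]:
        return {"prompt": prompt, "max_new_tokens": max_new_tokens}

    def send_request(self, request_kwargs: Dict[str, Any]) -> str:
        # Simulate server latency, then return the first few words of the prompt.
        time.sleep(self.config.get("sleep_seconds", 0.01))
        words = request_kwargs["prompt"].split()
        return " ".join(words[: request_kwargs["max_new_tokens"]])

    def process_response(self, raw_response: str) -> str:
        return raw_response


client = EchoClient({"sleep_seconds": 0.0})
client.start_service()
request = client.prepare_request("the quick brown fox jumps", max_new_tokens=3)
output = client.process_response(client.send_request(request))
client.stop_service()
print(output)
```

A real client would launch or connect to a server in start_service and issue a network call in send_request; the five-method split stays the same.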