This is an example to demonstrate how to compress a request's KV cache externally.
Your server should have at least 1 GPU.
This example uses port 8000 for vLLM and port 8001 for the LMCache worker. The controller itself occupies ports 9000 and 9001.
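Before starting anything, it can help to confirm that these four ports are free. The following is a minimal sketch using only the Python standard library. `port_is_free` is a hypothetical helper (not part of vLLM or LMCache), and a successful bind only shows that the port was free at the moment of the check.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, most likely because another process holds the port.
            return False
        return True


# Ports used in this example: vLLM, LMCache worker, controller, monitor.
for port in (8000, 8001, 9000, 9001):
    print(port, "free" if port_is_free(port) else "in use")
```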
- Start the vLLM engine at port 8000:
CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=example.yaml vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.8 --port 8000 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
- Start the lmcache controller at port 9000 and the monitor at port 9001:
lmcache_controller --host localhost --port 9000 --monitor-port 9001
- Send a request to the vLLM engine:
curl -X POST http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models.",
"max_tokens": 10
}'
LMCache will automatically offload the KV cache to CPU.
- Tokenize the prompt:
curl -X POST http://localhost:8000/tokenize \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain the significance of KV cache in language models."
}'
You should see the returned token IDs:
{"count":12,"max_model_len":4096,"tokens":[128000,849,21435,279,26431,315,85748,6636,304,4221,4211,13],"token_strs":null}
- Use CacheGen to compress the request's KV cache:
curl -X POST http://localhost:9000/compress \
-H "Content-Type: application/json" \
-d '{
"instance_id": "lmcache_default_instance",
"method": "cachegen",
"location": "LocalCPUBackend",
"tokens": [128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13]
}'
You should see a response indicating that compression of the KV cache has started:
{"num_tokens": 12, "event_id": "xxx"}
num_tokens: 12 means that the KV cache of 12 tokens is being compressed. The returned event_id can be used to check the status of the compress operation (this functionality is coming soon).
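The compress request can also be built and sent from a script. The sketch below builds the request body from a list of token IDs, reusing the field values from the curl example above. `build_compress_request` and `post_json` are hypothetical helper names, and the URL assumes the controller is listening on localhost:9000 as in this example. The network call is left commented out so the block runs without a live controller.

```python
import json
import urllib.request

# Assumes the controller from this example is listening on localhost:9000.
CONTROLLER_URL = "http://localhost:9000/compress"


def build_compress_request(tokens, instance_id="lmcache_default_instance",
                           method="cachegen", location="LocalCPUBackend"):
    """Build the JSON body for the compress request shown above."""
    return {
        "instance_id": instance_id,
        "method": method,
        "location": location,
        "tokens": list(tokens),
    }


def post_json(url, payload, timeout=10.0):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Token IDs returned by the tokenize step.
tokens = [128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13]
payload = build_compress_request(tokens)
print(json.dumps(payload))

# With the controller running, send the request and read the reply:
# reply = post_json(CONTROLLER_URL, payload)
# print(reply["num_tokens"], reply["event_id"])
```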