Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

README.md

LMCache Pin/Persistence

This is an example to demonstrate how to pin/persist a request's KV cache in an LMCacheEngine externally.

Prerequisites

Your server should have at least 1 GPU.

This will use port 8000 for 1 vllm and port 8001 for LMCache controller.

Steps

  1. Start the vllm engine at port 8000:
CUDA_VISIBLE_DEVICES=0 LMCACHE_CONFIG_FILE=example.yaml vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --max-model-len 4096  --gpu-memory-utilization 0.8 --port 8000 --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'
  1. Start the lmcache controller at port 9000 and the monitor at port 9001:
lmcache_controller --host localhost --port 9000 --monitor-port 9001
  1. Send a request to vllm engine:
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Explain the significance of KV cache in language models.",
    "max_tokens": 10
  }'
  1. Pin a request's KV cache in the system:
curl -X POST http://localhost:9000/pin \
  -H "Content-Type: application/json" \
  -d '{
    "tokens": [128000, 849, 21435, 279, 26431, 315, 85748, 6636, 304, 4221, 4211, 13]
  }'

You should be able to see a return message indicating the KV cache has been successfully pinned in the system:

{"success": true, "event_id": "xxx"}