User-Controllable Caching

This example demonstrates user-controllable caching, e.g., specifying per request whether its KV cache should be stored.

Prerequisites

Your server should have at least 1 GPU.

The example runs one vLLM engine on port 8000.

Steps

  1. Start the vLLM engine on port 8000:
CUDA_VISIBLE_DEVICES=0 LMCACHE_USE_EXPERIMENTAL=True LMCACHE_CONFIG_FILE=example.yaml \
  vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.8 \
  --port 8000 \
  --kv-transfer-config '{"kv_connector":"LMCacheConnector", "kv_role":"kv_both"}'
  2. Send a request to the vLLM engine with store_cache set to true (JSON booleans are lowercase, and a trailing comma is not allowed):
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Explain the significance of KV cache in language models.",
    "max_tokens": 10,
    "store_cache": true
  }'

The engine logs should indicate that the KV cache was stored:

DEBUG LMCache: Store skips 0 tokens and then stores 13 tokens [2025-03-02 21:58:55,147]
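  3. Send a second request to the same vLLM engine, this time with store_cache set to false. The prompt avoids an apostrophe, which would end the single-quoted shell string early:
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "What is the weather today in Chicago?",
    "max_tokens": 10,
    "store_cache": false
  }'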

The engine logs should indicate that the KV cache was NOT stored:

DEBUG LMCache: User has specified not to store the cache [2025-03-02 21:54:58,380]

Note that the KV cache is stored by default, so store_cache only needs to be set when a request should not be cached.
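
The requests in the steps above differ only in the prompt and the store_cache flag, so they can be wrapped in a small helper. The following is a minimal shell sketch, not part of LMCache or vLLM: build_payload and send_completion are hypothetical names, and the payload builder assumes the prompt contains no double quotes or backslashes because it performs no JSON escaping. Passing the prompt as a double-quoted shell argument means an apostrophe in the prompt no longer clashes with the quoting of the JSON body.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the JSON body for a /v1/completions request.
# $1: prompt text (assumed to contain no double quotes or backslashes;
#     this sketch performs no JSON escaping)
# $2: optional store_cache value, the JSON literal true or false;
#     when omitted, the field is left out of the request
build_payload() {
  local prompt="$1"
  local store_cache="${2:-}"
  local model="meta-llama/Meta-Llama-3.1-8B-Instruct"

  if [ -n "$store_cache" ]; then
    printf '{"model": "%s", "prompt": "%s", "max_tokens": 10, "store_cache": %s}\n' \
      "$model" "$prompt" "$store_cache"
  else
    printf '{"model": "%s", "prompt": "%s", "max_tokens": 10}\n' \
      "$model" "$prompt"
  fi
}

# Hypothetical helper: POST the payload to the engine listening on port 8000.
send_completion() {
  curl -sS -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$@")"
}

# Example usage (requires the engine to be running):
#   send_completion "Explain the significance of KV cache in language models." true
#   send_completion "What's the weather today in Chicago?" false
#   send_completion "A request that relies on the default behavior."
```

Keeping payload construction separate from the curl call makes it easy to print and inspect the JSON body before sending it.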