DeepConf is a confidence-aware reasoning approach for large language models that uses model-internal confidence signals to dynamically filter out low-quality reasoning traces during generation, improving both efficiency and accuracy.
Based on the paper "Deep Think with Confidence" by Fu et al. (2025), DeepConf implements:
- Token-level confidence scoring using entropy and log-probability metrics
- Online mode with early termination to save computational resources
- Warmup phase for threshold calibration
- Consensus-based stopping when high agreement is reached
- Weighted majority voting for final answer selection
- ✅ Local models only - Works with OptiLLM's local inference engine
- ✅ Two variants: `low` (aggressive, top 10%) and `high` (conservative, top 90%)
- ✅ Configurable parameters for different use cases
- ✅ Early termination to reduce token usage by 50-70%
- ✅ Automatic quality control without external evaluation
Set up OptiLLM for local inference:
```bash
export OPTILLM_API_KEY=optillm
python optillm.py --model your-local-model
```
Then make a request with DeepConf decoding:
```python
import openai

client = openai.OpenAI(
    api_key="optillm",
    base_url="http://localhost:8000/v1"
)

response = client.chat.completions.create(
    model="your-model",
    messages=[
        {"role": "user", "content": "Solve this math problem: What is the derivative of x^3 + 2x^2 - 5x + 1?"}
    ],
    extra_body={
        "decoding": "deepconf",
        "variant": "low",            # "low" or "high"
        "warmup_samples": 16,        # Number of calibration traces
        "max_traces": 64,            # Maximum total traces
        "consensus_threshold": 0.95  # Stop when consensus reached
    }
)

print(response.choices[0].message.content)
```
| Parameter | Default | Description |
|---|---|---|
| `variant` | `"low"` | Filtering strategy: `"low"` (top 10%, aggressive) or `"high"` (top 90%, conservative) |
| `warmup_samples` | `16` | Number of initial traces for threshold calibration |
| `consensus_threshold` | `0.95` | Stop generation when this level of agreement is reached |
| `max_traces` | `128` | Maximum number of traces to generate |
| `window_size` | `2048` | Sliding window size for group confidence calculation |
| `top_k` | `5` | Number of top tokens for confidence calculation |
| `min_trace_length` | `100` | Minimum tokens before allowing early termination |
| `max_tokens_per_trace` | `4096` | Maximum tokens per individual trace |
| `confidence_metric` | `"average_confidence"` | Metric used for threshold calculation |
| `include_stats` | `false` | Include processing statistics in response |
Include statistics in the response for debugging:
```python
response = client.chat.completions.create(
    model="your-model",
    messages=[...],
    extra_body={
        "decoding": "deepconf",
        "variant": "high",
        "include_stats": True,
        "warmup_samples": 8,
        "max_traces": 32
    }
)
```
1. Warmup Phase: Generate initial traces to calibrate the confidence threshold
2. Online Generation: Generate traces with early termination based on confidence
3. Consensus Check: Stop when sufficient agreement is reached
4. Final Selection: Use weighted majority voting to select the best answer
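The four stages above can be sketched in plain Python. This is an illustrative outline, not OptiLLM's internal implementation: `generate_trace` is a stand-in for sampling one reasoning trace and is assumed to return an `(answer, trace_confidence)` pair.

```python
from collections import Counter

def deepconf_pipeline(generate_trace, warmup_samples=16, max_traces=64,
                      consensus_threshold=0.95, variant="low"):
    """Illustrative sketch of the DeepConf loop."""
    # 1. Warmup: calibrate the confidence threshold from initial traces.
    warmup = [generate_trace() for _ in range(warmup_samples)]
    confs = sorted(c for _, c in warmup)
    pct = 0.90 if variant == "low" else 0.10  # low keeps top 10%, high keeps top 90%
    threshold = confs[min(int(pct * len(confs)), len(confs) - 1)]

    kept = [(a, c) for a, c in warmup if c >= threshold]

    # 2. Online generation: keep sampling, discarding low-confidence traces.
    for _ in range(max_traces - warmup_samples):
        answer, conf = generate_trace()
        if conf < threshold:
            continue  # drop the low-quality trace
        kept.append((answer, conf))

        # 3. Consensus check: stop once one answer dominates the confidence mass.
        weights = Counter()
        for a, c in kept:
            weights[a] += c
        if weights.most_common(1)[0][1] / sum(weights.values()) >= consensus_threshold:
            break

    # 4. Final selection: confidence-weighted majority voting.
    weights = Counter()
    for a, c in kept:
        weights[a] += c
    return weights.most_common(1)[0][0]
```

In the real online mode, low-confidence traces are cut off mid-generation (saving their remaining tokens) rather than discarded after completion as in this simplified sketch.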
- Token Entropy: `H = -∑ P(j) log P(j)`
- Token Confidence: `C = -(1/k) ∑ log P(j)` over the top-k tokens
- Group Confidence: sliding-window averages over token confidences
- Trace Confidence: average confidence across all tokens in a trace
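As a minimal sketch (function names are illustrative, not OptiLLM's API), these metrics can be computed directly from per-token probabilities and top-k log-probabilities:

```python
import math

def token_entropy(probs):
    """H = -sum_j P(j) * log P(j) over the token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def token_confidence(top_k_logprobs):
    """C = -(1/k) * sum of log P(j) over the top-k tokens."""
    return -sum(top_k_logprobs) / len(top_k_logprobs)

def group_confidence(token_confidences, window_size=2048):
    """Sliding-window averages of per-token confidences."""
    n = len(token_confidences)
    return [
        sum(token_confidences[i:i + window_size]) / len(token_confidences[i:i + window_size])
        for i in range(max(n - window_size + 1, 1))
    ]

def trace_confidence(token_confidences):
    """Average confidence across all tokens in a trace."""
    return sum(token_confidences) / len(token_confidences)
```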
- DeepConf-low: uses the 90th-percentile threshold (keeps the top 10% of traces) - more aggressive filtering
- DeepConf-high: uses the 10th-percentile threshold (keeps the top 90% of traces) - more conservative filtering
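A hedged sketch of how the two variants map onto percentile cutoffs over the warmup traces (the function name is illustrative):

```python
def confidence_threshold(warmup_confidences, variant="low"):
    """DeepConf-low: 90th-percentile cutoff (keeps top 10%).
    DeepConf-high: 10th-percentile cutoff (keeps top 90%)."""
    sorted_conf = sorted(warmup_confidences)
    pct = 0.90 if variant == "low" else 0.10
    idx = min(int(pct * len(sorted_conf)), len(sorted_conf) - 1)
    return sorted_conf[idx]
```

Traces whose confidence falls below the returned threshold are filtered out (or terminated early) during online generation.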
DeepConf typically achieves:
- 50-70% reduction in token usage compared to standard sampling
- Maintained or improved accuracy through confidence-based filtering
- Automatic quality control without requiring external evaluation models
- Local model inference (PyTorch)
- OptiLLM with `OPTILLM_API_KEY=optillm`
- Compatible with transformer models that provide logits access
- Paper: "Deep Think with Confidence" by Fu et al. (2025)
- arXiv: https://arxiv.org/abs/2508.15260
- Authors: Yichao Fu (UCSD), Xuewei Wang (Meta AI), Yuandong Tian (Meta AI), Jiawei Zhao (Meta AI)