Commit e114dcf

Authored by chandrasekharan-zipstack, claude, and muhammad-ali-e
UN-3344 [FIX] Unified retry for LLM and embedding providers (#1886)
* [FIX] Unified retry for LLM and embedding providers

  litellm's retry only works for SDK-based providers (OpenAI/Azure). httpx-based providers (Anthropic, Vertex, Bedrock, Mistral) and ALL embedding calls silently ignore max_retries. This adds a self-managed retry with exponential backoff at the SDK layer, disabling litellm's own retry entirely for consistency.

* [REFACTOR] DRY retry logic into reusable call_with_retry utilities

  Move the retry loops out of the LLM/Embedding classes into generic call_with_retry, acall_with_retry, and iter_with_retry functions in retry_utils.py. Both classes now call these directly instead of maintaining their own retry helper methods.

* [FIX] Consolidate retry logic, expose max_retries for all adapters

  - Extract a shared _get_retry_delay() helper to eliminate duplicated retry decision logic across call_with_retry, acall_with_retry, iter_with_retry, and retry_with_exponential_backoff
  - Add num_retries=0 to embedding._pop_retry_params() to fully disable litellm's internal retry for embedding calls
  - Expose max_retries in the UI JSON schemas for the embedding adapters (OpenAI, Azure, VertexAI, Ollama) and the Ollama LLM — previously the field existed in the Pydantic models but wasn't shown to users, silently defaulting to 0 retries
  - Add debug logging to LLM and Embedding retry parameter extraction
  - Clarify docstrings distinguishing is_retryable_litellm_error() from is_retryable_error() (different exception hierarchies)
  - Remove the stale noqa: C901 from the simplified retry_with_exponential_backoff

* [FIX] Set the max_retries default to 3 for all embedding and Ollama LLM adapters

* [FIX] Address greptile review: fix shadowed ConnectionError, use MRO check

  - Fix `requests.ConnectionError` shadowing Python's builtin `ConnectionError` in `is_retryable_litellm_error()` — rename the import to `RequestsConnectionError` and use `builtins.ConnectionError` / `builtins.TimeoutError` explicitly
  - Use an `__mro__`-based class-name check instead of `type(error).__name__` so that subclasses of retryable error types are also caught
  - P1 (num_retries not zeroed) was already fixed in a prior commit

* [FIX] Address CodeRabbit review: add APITimeoutError, validate max_retries

  - Add APITimeoutError to _RETRYABLE_ERROR_NAMES for explicit OpenAI SDK timeout coverage
  - Add a _validate_max_retries() guard to call_with_retry, acall_with_retry, and iter_with_retry to fail fast on negative values instead of silently returning None

* UN-3344 [FIX] Reduce cognitive complexity and remove useless except clause

  Address SonarCloud findings on PR #1886:
  - S3776: Flatten retry_with_exponential_backoff.wrapper by moving the success logging and return out of the try block and using `continue` in the retry path, so the except branch only handles the give-up case.
  - S2737: Drop the `except Exception: raise` clause — it was a no-op that added complexity without changing behavior (non-matching exceptions propagate naturally).

* UN-3344 [FIX] Extract retry loop to a top-level helper to drop cognitive complexity

  Sonar still flagged retry_with_exponential_backoff at complexity 16 after the previous flatten: the nested `def decorator` / `def wrapper` counted against the outer function's score. Move the retry body to a module-level _invoke_with_retries helper so the decorator factory just delegates, bringing the outer function well under the 15 threshold. Behavior is unchanged — all paths (success, retry, give-up, non-retryable propagation) are preserved and covered by the existing SDK1 tests.

* UN-3344 [FIX] Honor Retry-After, close stream generator on retry, share give-up log

  Address review comments on PR #1886:
  - #10 (resource leak): close the generator returned by fn() before retrying in iter_with_retry — otherwise streaming providers leak an in-flight HTTP socket until GC.
  - #12 (behavioral regression): zeroing out the SDK/wrapper retries also lost the OpenAI SDK's native Retry-After handling on 429/503. _get_retry_delay now checks error.response.headers["retry-after"] and uses that value ahead of exponential backoff. The HTTP-date form is not parsed; those fall back to backoff.
  - #8 (observability gap): move the "Giving up ... after N attempt(s)" log into _get_retry_delay so all four retry helpers (call_with_retry, acall_with_retry, iter_with_retry, and the decorator) share the same exhaustion signal. Previously only the decorator path logged it.

* UN-3344 [REFACTOR] Share retry-kwargs helper and add TypeVar to retry wrappers

  Address review comments on PR #1886:
  - #9 (typing): call_with_retry / acall_with_retry / iter_with_retry previously returned `object`, erasing caller type info. Add PEP 695 generics so the return type flows from the wrapped callable: acall_with_retry now takes Callable[[], Awaitable[T]] and iter_with_retry takes Callable[[], Iterable[T]] -> Generator[T, ...].
  - #11 / #13 (DRY): `_pop_retry_params` in embedding.py and `_disable_litellm_retry` in llm.py were identical logic. Lift them into a shared `pop_litellm_retry_kwargs` helper in retry_utils.py and delete both methods.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ali <117142933+muhammad-ali-e@users.noreply.github.com>
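The retry helpers themselves live in retry_utils.py, which this commit page does not display. As a rough sketch of the behavior the messages above describe (exponential backoff, preferring a numeric Retry-After header, failing fast on negative max_retries) — names, defaults, and the isinstance-based predicate standing in for is_retryable_litellm_error are illustrative, not the actual implementation:

```python
import time

# Stand-in for the retryable-error check; the real code inspects
# litellm/provider exception hierarchies via __mro__ class names.
RETRYABLE_ERRORS = (ConnectionError, TimeoutError)

def _get_retry_delay(error, attempt, base_delay=1.0):
    # Prefer a numeric Retry-After header when the provider sent one
    # (the HTTP-date form is not parsed and falls back to backoff).
    headers = getattr(getattr(error, "response", None), "headers", None) or {}
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        try:
            return float(retry_after)
        except (TypeError, ValueError):
            pass
    return base_delay * (2 ** attempt)  # exponential backoff

def call_with_retry(fn, max_retries=3, retry_predicate=None, base_delay=1.0):
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")  # fail fast, don't return None
    predicate = retry_predicate or (lambda e: isinstance(e, RETRYABLE_ERRORS))
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as err:
            if attempt >= max_retries or not predicate(err):
                raise  # exhausted or non-retryable: propagate to the caller
            time.sleep(_get_retry_delay(err, attempt, base_delay))
```

acall_with_retry and iter_with_retry follow the same loop shape for awaitables and streams respectively.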
1 parent: fd1bc63 · commit: e114dcf

9 files changed: 452 additions & 107 deletions

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/azure.json

Lines changed: 8 additions & 0 deletions
@@ -61,6 +61,14 @@
       "title": "Embedding Batch Size",
       "default": 5
     },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
+    },
     "timeout": {
       "type": "number",
       "minimum": 0,

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/bedrock.json

Lines changed: 2 additions & 2 deletions
@@ -43,8 +43,8 @@
       "minimum": 0,
       "multipleOf": 1,
       "title": "Max Retries",
-      "default": 5,
-      "description": "Maximum number of retries to attempt when a request fails."
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
     },
     "timeout": {
       "type": "number",

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/ollama.json

Lines changed: 8 additions & 0 deletions
@@ -31,6 +31,14 @@
       "multipleOf": 1,
       "title": "Embed Batch Size",
       "default": 10
+    },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
     }
   }
 }

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/openai.json

Lines changed: 8 additions & 0 deletions
@@ -44,6 +44,14 @@
       "title": "Embed Batch Size",
       "default": 10
     },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
+    },
     "timeout": {
       "type": "number",
       "minimum": 0,

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/vertexai.json

Lines changed: 8 additions & 0 deletions
@@ -57,6 +57,14 @@
         "retrieval"
       ],
       "default": "default"
+    },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
     }
   }
 }

unstract/sdk1/src/unstract/sdk1/adapters/llm1/static/ollama.json

Lines changed: 8 additions & 0 deletions
@@ -48,6 +48,14 @@
       "default": 3900,
       "description": "The maximum number of context tokens for the model."
     },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
+    },
     "request_timeout": {
       "type": "number",
       "minimum": 0,

unstract/sdk1/src/unstract/sdk1/embedding.py

Lines changed: 39 additions & 12 deletions
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import logging
 import os
 from typing import TYPE_CHECKING

@@ -14,10 +15,18 @@
 from unstract.sdk1.exceptions import SdkError, parse_litellm_err
 from unstract.sdk1.platform import PlatformHelper
 from unstract.sdk1.utils.callback_manager import CallbackManager
+from unstract.sdk1.utils.retry_utils import (
+    acall_with_retry,
+    call_with_retry,
+    is_retryable_litellm_error,
+    pop_litellm_retry_kwargs,
+)

 if TYPE_CHECKING:
     from unstract.sdk1.tool.base import BaseTool

+logger = logging.getLogger(__name__)
+
 litellm.drop_params = True

@@ -115,9 +124,14 @@ def get_embedding(self, text: str) -> list[float]:
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = litellm.embedding(model=model, input=[text], **kwargs)
-
+            resp = call_with_retry(
+                lambda: litellm.embedding(model=model, input=[text], **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return resp["data"][0]["embedding"]
         except Exception as e:
             raise parse_litellm_err(e, self._get_adapter_info()) from e
@@ -127,9 +141,14 @@ def get_embeddings(self, texts: list[str]) -> list[list[float]]:
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = litellm.embedding(model=model, input=texts, **kwargs)
-
+            resp = call_with_retry(
+                lambda: litellm.embedding(model=model, input=texts, **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return [data["embedding"] for data in resp["data"]]
         except Exception as e:
             raise parse_litellm_err(e, self._get_adapter_info()) from e
@@ -139,26 +158,34 @@ async def get_aembedding(self, text: str) -> list[float]:
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = await litellm.aembedding(model=model, input=[text], **kwargs)
-
+            resp = await acall_with_retry(
+                lambda: litellm.aembedding(model=model, input=[text], **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return resp["data"][0]["embedding"]
         except Exception as e:
-            provider_name = f"{self.adapter.get_name()}"
-            raise parse_litellm_err(e, provider_name) from e
+            raise parse_litellm_err(e, self._get_adapter_info()) from e

     async def get_aembeddings(self, texts: list[str]) -> list[list[float]]:
         """Return async embedding vectors for list of query strings."""
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = await litellm.aembedding(model=model, input=texts, **kwargs)
-
+            resp = await acall_with_retry(
+                lambda: litellm.aembedding(model=model, input=texts, **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return [data["embedding"] for data in resp["data"]]
         except Exception as e:
-            provider_name = f"{self.adapter.get_name()}"
-            raise parse_litellm_err(e, provider_name) from e
+            raise parse_litellm_err(e, self._get_adapter_info()) from e

     def test_connection(self) -> bool:
         """Test connection to the embedding provider."""

unstract/sdk1/src/unstract/sdk1/llm.py

Lines changed: 36 additions & 13 deletions
@@ -24,6 +24,13 @@
     TokenCounterCompat,
     capture_metrics,
 )
+from unstract.sdk1.utils.retry_utils import (
+    acall_with_retry,
+    call_with_retry,
+    is_retryable_litellm_error,
+    iter_with_retry,
+    pop_litellm_retry_kwargs,
+)

 logger = logging.getLogger(__name__)

@@ -285,9 +292,14 @@ def complete(self, prompt: str, **kwargs: object) -> dict[str, object]:
         # if hasattr(self, "thinking_dict") and self.thinking_dict is not None:
         #     completion_kwargs["temperature"] = 1

-        response: dict[str, object] = litellm.completion(
-            messages=messages,
-            **completion_kwargs,
+        max_retries = pop_litellm_retry_kwargs(
+            completion_kwargs, self._get_adapter_info()
+        )
+        response: dict[str, object] = call_with_retry(
+            lambda: litellm.completion(messages=messages, **completion_kwargs),
+            max_retries=max_retries,
+            retry_predicate=is_retryable_litellm_error,
+            description=self._get_adapter_info(),
         )

         response_text = response["choices"][0]["message"]["content"]
@@ -373,14 +385,20 @@ def stream_complete(
         completion_kwargs = self.adapter.validate({**self.kwargs, **kwargs})
         completion_kwargs.pop("cost_model", None)

+        max_retries = pop_litellm_retry_kwargs(
+            completion_kwargs, self._get_adapter_info()
+        )
         has_yielded_content = False
-        for chunk in litellm.completion(
-            messages=messages,
-            stream=True,
-            stream_options={
-                "include_usage": True,
-            },
-            **completion_kwargs,
+        for chunk in iter_with_retry(
+            lambda: litellm.completion(
+                messages=messages,
+                stream=True,
+                stream_options={"include_usage": True},
+                **completion_kwargs,
+            ),
+            max_retries=max_retries,
+            retry_predicate=is_retryable_litellm_error,
+            description=self._get_adapter_info(),
         ):
             if chunk.get("usage"):
                 self._record_usage(
@@ -437,9 +455,14 @@ async def acomplete(self, prompt: str, **kwargs: object) -> dict[str, object]:
         completion_kwargs = self.adapter.validate({**self.kwargs, **kwargs})
         completion_kwargs.pop("cost_model", None)

-        response = await litellm.acompletion(
-            messages=messages,
-            **completion_kwargs,
+        max_retries = pop_litellm_retry_kwargs(
+            completion_kwargs, self._get_adapter_info()
+        )
+        response = await acall_with_retry(
+            lambda: litellm.acompletion(messages=messages, **completion_kwargs),
+            max_retries=max_retries,
+            retry_predicate=is_retryable_litellm_error,
+            description=self._get_adapter_info(),
         )
         response_text = response["choices"][0]["message"]["content"]
         finish_reason = response["choices"][0].get("finish_reason")
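The stream_complete hunk is the interesting one: iter_with_retry receives a factory (the lambda) rather than a live stream, so a failed attempt can be discarded and a fresh stream opened. Per review comment #10, the abandoned generator must be closed before retrying, or the in-flight HTTP socket leaks until GC. A sketch of that shape — the real helper in retry_utils.py is not shown in this diff, and whether it retries after the first chunk has been yielded is not stated, so this sketch gives up in that case to avoid emitting duplicate chunks:

```python
import time

def iter_with_retry(fn, max_retries=3, retry_predicate=lambda e: True, base_delay=1.0):
    # Illustrative sketch: fn() must return a fresh iterator/generator on
    # each call. On failure, the abandoned generator is closed so the
    # underlying connection is released before the retry (review #10).
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        stream = fn()
        yielded = False
        try:
            for item in stream:
                yielded = True
                yield item
            return
        except Exception as err:
            close = getattr(stream, "close", None)
            if close is not None:
                close()  # avoid leaking an in-flight HTTP socket until GC
            if yielded or attempt >= max_retries or not retry_predicate(err):
                raise  # mid-stream failure, exhausted, or non-retryable
            time.sleep(base_delay * (2 ** attempt))
```

Passing a zero-argument factory is what makes retrying a stream possible at all: a generator, once it has raised, cannot be restarted.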
