Commit e114dcf

Authored by chandrasekharan-zipstack, claude, and muhammad-ali-e
UN-3344 [FIX] Unified retry for LLM and embedding providers (#1886)
* [FIX] Unified retry for LLM and embedding providers

  litellm's retry only works for SDK-based providers (OpenAI/Azure). httpx-based providers (Anthropic, Vertex, Bedrock, Mistral) and ALL embedding calls silently ignore max_retries. This adds a self-managed retry with exponential backoff at the SDK layer, disabling litellm's own retry entirely for consistency.

* [REFACTOR] DRY retry logic into reusable call_with_retry utilities

  Move the retry loops out of the LLM/Embedding classes into generic call_with_retry, acall_with_retry, and iter_with_retry functions in retry_utils.py. Both classes now call these directly instead of maintaining their own retry helper methods.

* [FIX] Consolidate retry logic, expose max_retries for all adapters

  - Extract a shared _get_retry_delay() helper to eliminate duplicated retry decision logic across call_with_retry, acall_with_retry, iter_with_retry, and retry_with_exponential_backoff
  - Add num_retries=0 to embedding._pop_retry_params() to fully disable litellm's internal retry for embedding calls
  - Expose max_retries in the UI JSON schemas for the embedding adapters (OpenAI, Azure, VertexAI, Ollama) and the Ollama LLM — previously the field existed in the Pydantic models but wasn't shown to users, silently defaulting to 0 retries
  - Add debug logging to LLM and Embedding retry parameter extraction
  - Clarify docstrings distinguishing is_retryable_litellm_error() from is_retryable_error() (different exception hierarchies)
  - Remove the stale noqa: C901 from the simplified retry_with_exponential_backoff

* [FIX] Set the max_retries default to 3 for all embedding and Ollama LLM adapters

* [FIX] Address greptile review: fix shadowed ConnectionError, use MRO check

  - Fix `requests.ConnectionError` shadowing Python's builtin `ConnectionError` in `is_retryable_litellm_error()` — rename the import to `RequestsConnectionError` and use `builtins.ConnectionError` / `builtins.TimeoutError` explicitly
  - Use an `__mro__`-based class-name check instead of `type(error).__name__` so that subclasses of retryable error types are also caught
  - P1 (num_retries not zeroed) was already fixed in a prior commit

* [FIX] Address CodeRabbit review: add APITimeoutError, validate max_retries

  - Add APITimeoutError to _RETRYABLE_ERROR_NAMES for explicit OpenAI SDK timeout coverage
  - Add a _validate_max_retries() guard to call_with_retry, acall_with_retry, and iter_with_retry to fail fast on negative values instead of silently returning None

* UN-3344 [FIX] Reduce cognitive complexity and remove useless except clause

  Address SonarCloud findings on PR #1886:
  - S3776: Flatten retry_with_exponential_backoff.wrapper by moving the success logging and return out of the try block and using `continue` in the retry path, so the except branch only handles the give-up case.
  - S2737: Drop the `except Exception: raise` clause — it was a no-op that added complexity without changing behavior (non-matching exceptions propagate naturally).

* UN-3344 [FIX] Extract retry loop to a top-level helper to drop cognitive complexity

  Sonar still flagged retry_with_exponential_backoff at complexity 16 after the previous flatten: the nested `def decorator` / `def wrapper` counted against the outer function's score. Move the retry body to a module-level _invoke_with_retries helper so the decorator factory just delegates, bringing the outer function well under the 15 threshold. Behavior is unchanged — all paths (success, retry, give-up, non-retryable propagation) are preserved and covered by the existing SDK1 tests.

* UN-3344 [FIX] Honor Retry-After, close stream generator on retry, share give-up log

  Address review comments on PR #1886:
  - #10 (resource leak): close the generator returned by fn() before retrying in iter_with_retry — otherwise streaming providers leak an in-flight HTTP socket until GC.
  - #12 (behavioral regression): zeroing out the SDK/wrapper retries also lost the OpenAI SDK's native Retry-After handling on 429/503. _get_retry_delay now checks error.response.headers["retry-after"] and uses that value ahead of exponential backoff. The HTTP-date form is not parsed; those fall back to backoff.
  - #8 (observability gap): move the "Giving up ... after N attempt(s)" log into _get_retry_delay so all four retry helpers (call_with_retry, acall_with_retry, iter_with_retry, and the decorator) share the same exhaustion signal. Previously only the decorator path logged it.

* UN-3344 [REFACTOR] Share retry-kwargs helper and add TypeVar to retry wrappers

  Address review comments on PR #1886:
  - #9 (typing): call_with_retry / acall_with_retry / iter_with_retry previously returned `object`, erasing caller type info. Add PEP 695 generics so the return type flows from the wrapped callable: acall_with_retry now takes Callable[[], Awaitable[T]] and iter_with_retry takes Callable[[], Iterable[T]] -> Generator[T, ...].
  - #11 / #13 (DRY): `_pop_retry_params` in embedding.py and `_disable_litellm_retry` in llm.py were identical logic. Lift them into a shared `pop_litellm_retry_kwargs` helper in retry_utils.py and delete both methods.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ali <117142933+muhammad-ali-e@users.noreply.github.com>
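The retry helpers themselves live in retry_utils.py, which this commit page does not display. As a rough sketch of the behavior the messages above describe (exponential backoff, preferring a numeric Retry-After header, failing fast on negative max_retries) — names, defaults, and the isinstance-based predicate standing in for is_retryable_litellm_error are illustrative, not the actual implementation:

```python
import time

# Stand-in for the retryable-error check; the real code inspects
# litellm/provider exception hierarchies via __mro__ class names.
RETRYABLE_ERRORS = (ConnectionError, TimeoutError)

def _get_retry_delay(error, attempt, base_delay=1.0):
    # Prefer a numeric Retry-After header when the provider sent one
    # (the HTTP-date form is not parsed and falls back to backoff).
    headers = getattr(getattr(error, "response", None), "headers", None) or {}
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        try:
            return float(retry_after)
        except (TypeError, ValueError):
            pass
    return base_delay * (2 ** attempt)  # exponential backoff

def call_with_retry(fn, max_retries=3, retry_predicate=None, base_delay=1.0):
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")  # fail fast, don't return None
    predicate = retry_predicate or (lambda e: isinstance(e, RETRYABLE_ERRORS))
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as err:
            if attempt >= max_retries or not predicate(err):
                raise  # exhausted or non-retryable: propagate to the caller
            time.sleep(_get_retry_delay(err, attempt, base_delay))
```

acall_with_retry and iter_with_retry follow the same loop shape for awaitables and streams respectively.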
1 parent: fd1bc63 · commit: e114dcf

9 files changed: 452 additions & 107 deletions

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/azure.json

Lines changed: 8 additions & 0 deletions
@@ -61,6 +61,14 @@
       "title": "Embedding Batch Size",
       "default": 5
     },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
+    },
     "timeout": {
       "type": "number",
       "minimum": 0,

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/bedrock.json

Lines changed: 2 additions & 2 deletions
@@ -43,8 +43,8 @@
       "minimum": 0,
       "multipleOf": 1,
       "title": "Max Retries",
-      "default": 5,
-      "description": "Maximum number of retries to attempt when a request fails."
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
     },
     "timeout": {
       "type": "number",

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/ollama.json

Lines changed: 8 additions & 0 deletions
@@ -31,6 +31,14 @@
       "multipleOf": 1,
       "title": "Embed Batch Size",
       "default": 10
+    },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
     }
   }
 }

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/openai.json

Lines changed: 8 additions & 0 deletions
@@ -44,6 +44,14 @@
       "title": "Embed Batch Size",
       "default": 10
     },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
+    },
     "timeout": {
       "type": "number",
       "minimum": 0,

unstract/sdk1/src/unstract/sdk1/adapters/embedding1/static/vertexai.json

Lines changed: 8 additions & 0 deletions
@@ -57,6 +57,14 @@
         "retrieval"
       ],
       "default": "default"
+    },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
     }
   }
 }

unstract/sdk1/src/unstract/sdk1/adapters/llm1/static/ollama.json

Lines changed: 8 additions & 0 deletions
@@ -48,6 +48,14 @@
       "default": 3900,
       "description": "The maximum number of context tokens for the model."
     },
+    "max_retries": {
+      "type": "number",
+      "minimum": 0,
+      "multipleOf": 1,
+      "title": "Max Retries",
+      "default": 3,
+      "description": "The maximum number of times to retry a request if it fails."
+    },
     "request_timeout": {
       "type": "number",
       "minimum": 0,

unstract/sdk1/src/unstract/sdk1/embedding.py

Lines changed: 39 additions & 12 deletions
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import logging
 import os
 from typing import TYPE_CHECKING

@@ -14,10 +15,18 @@
 from unstract.sdk1.exceptions import SdkError, parse_litellm_err
 from unstract.sdk1.platform import PlatformHelper
 from unstract.sdk1.utils.callback_manager import CallbackManager
+from unstract.sdk1.utils.retry_utils import (
+    acall_with_retry,
+    call_with_retry,
+    is_retryable_litellm_error,
+    pop_litellm_retry_kwargs,
+)

 if TYPE_CHECKING:
     from unstract.sdk1.tool.base import BaseTool

+logger = logging.getLogger(__name__)
+
 litellm.drop_params = True

@@ -115,9 +124,14 @@ def get_embedding(self, text: str) -> list[float]:
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = litellm.embedding(model=model, input=[text], **kwargs)
-
+            resp = call_with_retry(
+                lambda: litellm.embedding(model=model, input=[text], **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return resp["data"][0]["embedding"]
         except Exception as e:
             raise parse_litellm_err(e, self._get_adapter_info()) from e
@@ -127,9 +141,14 @@ def get_embeddings(self, texts: list[str]) -> list[list[float]]:
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = litellm.embedding(model=model, input=texts, **kwargs)
-
+            resp = call_with_retry(
+                lambda: litellm.embedding(model=model, input=texts, **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return [data["embedding"] for data in resp["data"]]
         except Exception as e:
             raise parse_litellm_err(e, self._get_adapter_info()) from e
@@ -139,26 +158,34 @@ async def get_aembedding(self, text: str) -> list[float]:
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = await litellm.aembedding(model=model, input=[text], **kwargs)
-
+            resp = await acall_with_retry(
+                lambda: litellm.aembedding(model=model, input=[text], **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return resp["data"][0]["embedding"]
         except Exception as e:
-            provider_name = f"{self.adapter.get_name()}"
-            raise parse_litellm_err(e, provider_name) from e
+            raise parse_litellm_err(e, self._get_adapter_info()) from e

     async def get_aembeddings(self, texts: list[str]) -> list[list[float]]:
         """Return async embedding vectors for list of query strings."""
         try:
             kwargs = self.kwargs.copy()
             model = kwargs.pop("model")
+            max_retries = pop_litellm_retry_kwargs(kwargs, self._get_adapter_info())

-            resp = await litellm.aembedding(model=model, input=texts, **kwargs)
-
+            resp = await acall_with_retry(
+                lambda: litellm.aembedding(model=model, input=texts, **kwargs),
+                max_retries=max_retries,
+                retry_predicate=is_retryable_litellm_error,
+                description=self._get_adapter_info(),
+            )
             return [data["embedding"] for data in resp["data"]]
         except Exception as e:
-            provider_name = f"{self.adapter.get_name()}"
-            raise parse_litellm_err(e, provider_name) from e
+            raise parse_litellm_err(e, self._get_adapter_info()) from e

     def test_connection(self) -> bool:
         """Test connection to the embedding provider."""

unstract/sdk1/src/unstract/sdk1/llm.py

Lines changed: 36 additions & 13 deletions
@@ -24,6 +24,13 @@
     TokenCounterCompat,
     capture_metrics,
 )
+from unstract.sdk1.utils.retry_utils import (
+    acall_with_retry,
+    call_with_retry,
+    is_retryable_litellm_error,
+    iter_with_retry,
+    pop_litellm_retry_kwargs,
+)

 logger = logging.getLogger(__name__)

@@ -285,9 +292,14 @@ def complete(self, prompt: str, **kwargs: object) -> dict[str, object]:
         # if hasattr(self, "thinking_dict") and self.thinking_dict is not None:
         #     completion_kwargs["temperature"] = 1

-        response: dict[str, object] = litellm.completion(
-            messages=messages,
-            **completion_kwargs,
+        max_retries = pop_litellm_retry_kwargs(
+            completion_kwargs, self._get_adapter_info()
+        )
+        response: dict[str, object] = call_with_retry(
+            lambda: litellm.completion(messages=messages, **completion_kwargs),
+            max_retries=max_retries,
+            retry_predicate=is_retryable_litellm_error,
+            description=self._get_adapter_info(),
         )

         response_text = response["choices"][0]["message"]["content"]
@@ -373,14 +385,20 @@ def stream_complete(
         completion_kwargs = self.adapter.validate({**self.kwargs, **kwargs})
         completion_kwargs.pop("cost_model", None)

+        max_retries = pop_litellm_retry_kwargs(
+            completion_kwargs, self._get_adapter_info()
+        )
         has_yielded_content = False
-        for chunk in litellm.completion(
-            messages=messages,
-            stream=True,
-            stream_options={
-                "include_usage": True,
-            },
-            **completion_kwargs,
+        for chunk in iter_with_retry(
+            lambda: litellm.completion(
+                messages=messages,
+                stream=True,
+                stream_options={"include_usage": True},
+                **completion_kwargs,
+            ),
+            max_retries=max_retries,
+            retry_predicate=is_retryable_litellm_error,
+            description=self._get_adapter_info(),
         ):
             if chunk.get("usage"):
                 self._record_usage(
@@ -437,9 +455,14 @@ async def acomplete(self, prompt: str, **kwargs: object) -> dict[str, object]:
         completion_kwargs = self.adapter.validate({**self.kwargs, **kwargs})
         completion_kwargs.pop("cost_model", None)

-        response = await litellm.acompletion(
-            messages=messages,
-            **completion_kwargs,
+        max_retries = pop_litellm_retry_kwargs(
+            completion_kwargs, self._get_adapter_info()
+        )
+        response = await acall_with_retry(
+            lambda: litellm.acompletion(messages=messages, **completion_kwargs),
+            max_retries=max_retries,
+            retry_predicate=is_retryable_litellm_error,
+            description=self._get_adapter_info(),
         )
         response_text = response["choices"][0]["message"]["content"]
         finish_reason = response["choices"][0].get("finish_reason")
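The stream_complete hunk is the interesting one: iter_with_retry receives a factory (the lambda) rather than a live stream, so a failed attempt can be discarded and a fresh stream opened. Per review comment #10, the abandoned generator must be closed before retrying, or the in-flight HTTP socket leaks until GC. A sketch of that shape — the real helper in retry_utils.py is not shown in this diff, and whether it retries after the first chunk has been yielded is not stated, so this sketch gives up in that case to avoid emitting duplicate chunks:

```python
import time

def iter_with_retry(fn, max_retries=3, retry_predicate=lambda e: True, base_delay=1.0):
    # Illustrative sketch: fn() must return a fresh iterator/generator on
    # each call. On failure, the abandoned generator is closed so the
    # underlying connection is released before the retry (review #10).
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        stream = fn()
        yielded = False
        try:
            for item in stream:
                yielded = True
                yield item
            return
        except Exception as err:
            close = getattr(stream, "close", None)
            if close is not None:
                close()  # avoid leaking an in-flight HTTP socket until GC
            if yielded or attempt >= max_retries or not retry_predicate(err):
                raise  # mid-stream failure, exhausted, or non-retryable
            time.sleep(base_delay * (2 ** attempt))
```

Passing a zero-argument factory is what makes retrying a stream possible at all: a generator, once it has raised, cannot be restarted.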
