fix: retry on transient AWS credential resolution failures #19558
Merged
S3 segment pushes that use the AWS SDK v2 transfer manager can resolve credentials on the async upload path. If a file-session credential refresh, container credential lookup, or IMDS lookup is temporarily unavailable, the SDK reports an SdkClientException such as 'Unable to load credentials from any of the providers in the chain'. Druid's S3 push path already wraps uploads in retryS3Operation, but these credential-provider failures were not classified as recoverable after the SDK v2 migration. That made an intermittent credential miss fail the task immediately instead of using the existing retry budget.
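The classification described above can be sketched as a message-based predicate. This is a minimal illustration, not the code merged in this PR: the class and method names are hypothetical, the message fragments are the ones quoted in the review diff, and walking the cause chain is an assumption about how the async upload path might wrap the original exception.

```java
import java.util.List;

// Hypothetical helper; the names here are illustrative and not taken from Druid.
final class CredentialFailureClassifier
{
  // Message fragments quoted in this PR's diff.
  private static final List<String> TRANSIENT_CREDENTIAL_MESSAGES = List.of(
      "Unable to load credentials from any of the providers in the chain",
      "Failed to load credentials from IMDS",
      "cannot refresh AWS credentials"
  );

  // Assumed bound on cause-chain depth, to guard against cyclic causes.
  private static final int MAX_CAUSE_DEPTH = 16;

  private CredentialFailureClassifier()
  {
  }

  // Returns true if the throwable, or any of its causes, carries one of the known messages.
  static boolean isTransientCredentialFailure(Throwable t)
  {
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_CAUSE_DEPTH; cur = cur.getCause()) {
      final String message = cur.getMessage();
      if (message != null) {
        for (String fragment : TRANSIENT_CREDENTIAL_MESSAGES) {
          if (message.contains(fragment)) {
            return true;
          }
        }
      }
      depth++;
    }
    return false;
  }
}
```

Matching on substrings is brittle if the SDK rewords its messages, but as noted in the review below, the failures do not expose a more specific exception type to catch.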
jtuglu1
commented
Jun 5, 2026
    "Throttling"
);

private static final String UNABLE_TO_LOAD_CREDENTIALS_FROM_PROVIDER_CHAIN =
Ideally we'd catch a specific expired-credentials exception type, but these failures all surface as the generic SdkClientException class, so we need to match on the message.
abhishekrb19
approved these changes
Jun 5, 2026
One comment, lgtm otherwise
private static final String UNABLE_TO_LOAD_CREDENTIALS_FROM_PROVIDER_CHAIN =
    "Unable to load credentials from any of the providers in the chain";
private static final String FAILED_TO_LOAD_CREDENTIALS_FROM_IMDS = "Failed to load credentials from IMDS";
private static final String CANNOT_REFRESH_AWS_CREDENTIALS = "cannot refresh AWS credentials";
It seems that the message "cannot refresh AWS credentials" does not appear in the v2 SDK: https://github.com/aws/aws-sdk-java-v2
Maybe this is a legacy message from the v1 SDK that is no longer relevant and can be removed?
Possibly, but I think it's fine to keep. I'd rather retry on a false positive than risk killing a task in case the message surfaces somehow. I can push a follow-up to delete it if we need to.
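The trade-off in this thread can be seen in a bounded retry loop. The sketch below is a hypothetical stand-in, not Druid's retryS3Operation: the class name, method signature, and the omission of backoff are all assumptions made for illustration. Because the number of attempts is capped, a false-positive match costs at most a few extra attempts before the original exception is rethrown, whereas a missed match fails on the first attempt.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical retry helper; not the Druid implementation.
final class BoundedRetry
{
  private BoundedRetry()
  {
  }

  // Calls op up to maxTries times, retrying only while shouldRetry accepts the failure.
  static <T> T retry(Callable<T> op, Predicate<Throwable> shouldRetry, int maxTries) throws Exception
  {
    if (maxTries < 1) {
      throw new IllegalArgumentException("maxTries must be at least 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return op.call();
      }
      catch (Exception e) {
        // Give up once the budget is spent or the failure is not classified as recoverable.
        if (attempt >= maxTries || !shouldRetry.test(e)) {
          throw e;
        }
        // A real implementation would normally sleep with backoff here; omitted in this sketch.
      }
    }
  }
}
```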
Release note
Retry on transient AWS credential resolution failures
This PR has: