
fix: retry on transient AWS credential resolution failures #19558

Merged
jtuglu1 merged 1 commit into apache:master from jtuglu1:fix-creds-refresh-retry
Jun 5, 2026

Conversation

Contributor

@jtuglu1 jtuglu1 commented Jun 4, 2026

Description

S3 segment pushes that use the AWS SDK v2 transfer manager can resolve credentials on the async upload path. If a file-session credential refresh, container credential lookup, or IMDS lookup is temporarily unavailable, the SDK reports an SdkClientException such as 'Unable to load credentials from any of the providers in the chain'.

Druid's S3 push path already wraps uploads in retryS3Operation, but these credential-provider failures were not classified as recoverable after the SDK v2 migration. That made an intermittent credential miss fail the task immediately instead of using the existing retry budget.
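For illustration, the interaction between a retry budget and a recoverability check can be sketched as below. This is a minimal sketch, not Druid's actual retryS3Operation: the class name `RetrySketch`, the method `retry`, and its parameters are hypothetical, and backoff between attempts is omitted. It shows why an exception that the predicate does not classify as recoverable fails on the first attempt even when retries remain.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical illustration only; not Druid's retryS3Operation. */
final class RetrySketch
{
  private RetrySketch() {}

  /**
   * Runs the task up to maxTries times. A failure is retried only when
   * shouldRetry accepts it and attempts remain; otherwise it is rethrown.
   */
  static <T> T retry(Callable<T> task, Predicate<Throwable> shouldRetry, int maxTries) throws Exception
  {
    if (maxTries < 1) {
      throw new IllegalArgumentException("maxTries must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return task.call();
      }
      catch (Exception e) {
        // An unclassified failure ends the loop here, regardless of the remaining budget.
        if (attempt >= maxTries || !shouldRetry.test(e)) {
          throw e;
        }
        // A real implementation would sleep with backoff before the next attempt.
      }
    }
  }
}
```

With a predicate that rejects credential-provider failures, the first intermittent miss propagates immediately; with one that accepts them, the same failure consumes one attempt and the upload is tried again.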

Release note

Retry on transient AWS credential resolution failures


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 added the Bug label Jun 4, 2026
@jtuglu1 jtuglu1 requested review from clintropolis and gianm June 4, 2026 22:09
@jtuglu1 jtuglu1 changed the title from "fix: retry transient AWS credential resolution failures" to "fix: retry on transient AWS credential resolution failures" Jun 5, 2026
"Throttling"
);

private static final String UNABLE_TO_LOAD_CREDENTIALS_FROM_PROVIDER_CHAIN =
Contributor Author

@jtuglu1 jtuglu1 Jun 5, 2026


Ideally we would catch a specific expired-credentials exception type, but these failures all surface as the same generic AwsSdkException class, so parsing the message is needed.
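A minimal sketch of the message-based check described here, using the three message strings quoted in this review. The class name `CredentialFailureSketch` and the method `isTransientCredentialFailure` are hypothetical, and walking the cause chain with a depth limit is an assumption for illustration; the merged change may match differently.

```java
import java.util.List;

/** Hypothetical illustration only; not the code merged in this PR. */
final class CredentialFailureSketch
{
  // Message fragments taken from the constants quoted in this review thread.
  private static final List<String> TRANSIENT_CREDENTIAL_MESSAGES = List.of(
      "Unable to load credentials from any of the providers in the chain",
      "Failed to load credentials from IMDS",
      "cannot refresh AWS credentials"
  );

  // Assumed bound so a cyclic cause chain cannot loop forever.
  private static final int MAX_CAUSE_DEPTH = 20;

  private CredentialFailureSketch() {}

  /** Returns true if the throwable or any of its causes carries one of the known messages. */
  static boolean isTransientCredentialFailure(Throwable t)
  {
    int depth = 0;
    for (Throwable cur = t; cur != null && depth < MAX_CAUSE_DEPTH; cur = cur.getCause(), depth++) {
      String message = cur.getMessage();
      if (message == null) {
        continue;
      }
      for (String fragment : TRANSIENT_CREDENTIAL_MESSAGES) {
        if (message.contains(fragment)) {
          return true;
        }
      }
    }
    return false;
  }
}
```

Substring matching is brittle if the SDK rewords its messages, which is the trade-off of not having a dedicated exception type to catch.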

Contributor

@abhishekrb19 abhishekrb19 left a comment


One comment, lgtm otherwise

private static final String UNABLE_TO_LOAD_CREDENTIALS_FROM_PROVIDER_CHAIN =
    "Unable to load credentials from any of the providers in the chain";
private static final String FAILED_TO_LOAD_CREDENTIALS_FROM_IMDS = "Failed to load credentials from IMDS";
private static final String CANNOT_REFRESH_AWS_CREDENTIALS = "cannot refresh AWS credentials";
Contributor

@abhishekrb19 abhishekrb19 Jun 5, 2026


It seems like the message "cannot refresh AWS credentials" is not present in the v2 SDK: https://github.com/aws/aws-sdk-java-v2

Maybe this was a legacy one from v1 SDK that isn't relevant anymore and can be removed?

Contributor Author

@jtuglu1 jtuglu1 Jun 5, 2026


Possibly, but I think it's fine to keep; I'd rather retry on a false positive than risk killing a task in case it surfaces somehow. I can push a follow-up to delete it if we need to.

@jtuglu1 jtuglu1 merged commit d06aa83 into apache:master Jun 5, 2026
40 checks passed
@github-actions github-actions Bot added this to the 38.0.0 milestone Jun 5, 2026

2 participants