Retry KMS requests on transient errors#1953
Conversation
Add libmongocrypt CAPI bindings for KMS retry support and wire retry logic through the sync and reactive driver stacks. Transient KMS HTTP and network errors are retried with backoff delays managed by libmongocrypt; retry is enabled unconditionally. - Add native bindings: mongocrypt_setopt_retry_kms, mongocrypt_kms_ctx_usleep, mongocrypt_kms_ctx_feed_with_retry, mongocrypt_kms_ctx_fail - Add sleepMicroseconds(), feedAndRetry(), fail() to MongoKeyDecryptor - Enable KMS retry unconditionally in MongoCryptImpl - Rewrite sync Crypt.decryptKey() with retry loop, timeout-aware - Add retry logic to reactive KeyManagementService.decryptKey() - Fix TlsChannelImpl.read() to preserve bytes delivered alongside close_notify (already fixed upstream in marianobarrios/tls-channel) - Add spec Section 24 KMS retry integration tests (sync + reactive) - Add Evergreen CI task for KMS retry tests JAVA-5391
There was a problem hiding this comment.
Pull request overview
Adds libmongocrypt-backed retry support for KMS requests and wires it through the sync and reactive driver encryption flows, including a small TLS-channel fix needed for correct KMS response handling and new CI coverage for the retry prose tests.
Changes:
- Introduce new libmongocrypt CAPI/JNA bindings and surface them via
MongoKeyDecryptorto drive sleep/backoff and retry decisions. - Implement retry loops in sync
Crypt.decryptKey()and reactiveKeyManagementService.decryptKey()using libmongocrypt’s retry signals and operation-timeout awareness. - Add KMS retry prose tests (sync + reactive) and an Evergreen task/script to run them.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptorImpl.java | Implements new retry-related native calls (usleep, feed_with_retry, fail) on the KMS ctx wrapper. |
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoKeyDecryptor.java | Extends decryptor API with retry hooks and a default initial KMS read size constant. |
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/MongoCryptImpl.java | Enables KMS retry option in libmongocrypt during initialization. |
| mongodb-crypt/src/main/com/mongodb/internal/crypt/capi/CAPI.java | Adds JNA native bindings for KMS retry APIs. |
| driver-sync/src/main/com/mongodb/client/internal/Crypt.java | Reworks KMS decryption to loop retries with backoff and timeout checks. |
| driver-reactive-streams/src/main/com/mongodb/reactivestreams/client/internal/crypt/KeyManagementService.java | Adds reactive retry flow with libmongocrypt-provided delay and retry decisions. |
| driver-core/src/main/com/mongodb/internal/connection/tlschannel/impl/TlsChannelImpl.java | Preserves bytes produced alongside TLS close_notify instead of immediately returning -1. |
| driver-sync/src/test/functional/com/mongodb/client/AbstractClientSideEncryptionKmsRetryProseTest.java | Adds shared prose tests for KMS retry behaviors (TCP/HTTP retry, exhausted retries, timeout mid-retry). |
| driver-sync/src/test/functional/com/mongodb/client/ClientSideEncryptionKmsRetryProseTest.java | Sync concrete test wiring for the shared retry prose tests. |
| driver-reactive-streams/src/test/functional/com/mongodb/reactivestreams/client/ClientSideEncryptionKmsRetryProseTest.java | Reactive concrete test wiring (via sync adapter) for the shared retry prose tests. |
| .evergreen/run-kms-retry-tests.sh | Adds CI script to run sync + reactive KMS retry prose tests with required trust material. |
| .evergreen/.evg.yml | Adds Evergreen function/task/buildvariant wiring to execute the KMS retry test script. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Assigned |
stIncMale
left a comment
There was a problem hiding this comment.
Some questions and suggestions after a partial review (the smaller part of it).
| * @param crypt The @ref mongocrypt_t object to update | ||
| * @param enable Whether to enable KMS retry | ||
| * @return A boolean indicating success. If false, an error status is set. | ||
| * @since 5.8 |
There was a problem hiding this comment.
Why do we specify @since on an internal API? (the same applies to MongoKeyDecryptor) I fail to see an answer myself.
I see that some existing program elements in CAPI have @since specified, while some other do not (like mongocrypt_setopt_enable_multiple_collinfo, which is relatively new).
There was a problem hiding this comment.
Just convention I suppose and I suppose its slightly informative - we don't publish the javadocs either yet write them.
There was a problem hiding this comment.
Resolving as its a Nit / question and not worth changing
There was a problem hiding this comment.
Ended up removing all @since tags here - as you mentioned the usage was mixed. Also its really tied to the libmongocrypt version.
| * | ||
| * @param crypt The @ref mongocrypt_t object to update | ||
| * @param enable Whether to enable KMS retry | ||
| * @return A boolean indicating success. If false, an error status is set. |
There was a problem hiding this comment.
Let's add the
Retrieve it with @ref mongocrypt_ctx_status
sentence from the documentation in mongocrypt.h to the @return text. It is useful, and we seem to have it in the documentation of other methods. Though it should be formatted properly according to the Java documentation comment rules.
There was a problem hiding this comment.
Updated all Javadocs to bring inline with mongocrypt.h as of 1.18.1
| * Opt-into handling the MONGOCRYPT_CTX_NEED_KMS state with retry logic. | ||
| * | ||
| * <p>If opted in, KMS requests will include retry information accessible via | ||
| * {@link #mongocrypt_kms_ctx_usleep}, {@link #mongocrypt_kms_ctx_feed_with_retry}, |
There was a problem hiding this comment.
mongocrypt_kms_ctx_feed_with_retrywas addded later in MONGOCRYPT-763 add in-place retry API libmongocrypt#1011 (https://jira.mongodb.org/browse/MONGOCRYPT-763).- The original prose, which JAVA-5391 and this PR are supposed to implement, instructed us to use
mongocrypt_kms_ctx_feed. The current prose, added in the above libmongocrypt PR, instructs us to use eithermongocrypt_kms_ctx_feedormongocrypt_kms_ctx_feed_with_retry.
Thus, we do not have to use mongocrypt_kms_ctx_feed_with_retry. If we decide to introduce this method and use it, then we should
also
- fully implement https://jira.mongodb.org/browse/MONGOCRYPT-763 in this PR (it seems, it mostly was implemented?);
- example: the documentation of
mongocrypt_kms_ctx_failwas updated in MONGOCRYPT-763 add in-place retry API libmongocrypt#1011, and that change is not reflected in the current PR.
- example: the documentation of
- specify JAVA-5772 in addition to JAVA-5391 in the description of the current PR
- resolve https://jira.mongodb.org/browse/JAVA-5772 (split from https://jira.mongodb.org/browse/MONGOCRYPT-763) once this PR is merged.
I wonder how the PR ended up using a method from a different change, why nothing was explicitly said about that to help orienting a reviewer, and what was the plan for JAVA-5772 given that this PR implements most of what should be done in JAVA-5772.
There was a problem hiding this comment.
Thanks - Let me review JAVA-5772 and ensure all is covered there. I followed integrating.md but that most likely was the latest version and effectively have done both tickets.
There was a problem hiding this comment.
Confirmed this includes JAVA-5772. I didn't see a drivers ticket for it and missed it completely. Not that its very clear either.
| /** | ||
| * Feed bytes from the HTTP response, with retry support. | ||
| * | ||
| * <p>Requires {@link #mongocrypt_setopt_retry_kms} to be enabled. | ||
| * | ||
| * @param kms The @ref mongocrypt_kms_ctx_t. | ||
| * @param bytes The bytes to feed. | ||
| * @param should_retry Receives whether the driver should retry the KMS request. | ||
| * @return A boolean indicating success. | ||
| * @since 5.8 | ||
| */ | ||
| public static native boolean | ||
| mongocrypt_kms_ctx_feed_with_retry(mongocrypt_kms_ctx_t kms, | ||
| mongocrypt_binary_t bytes, | ||
| ByteByReference should_retry); |
There was a problem hiding this comment.
For some reason the documentation of this method in the PR omits parts of the documentation from mongocrypt.h. Let's bring them all here, unless there is a good reason to omit any specific part.
| bytesToReturn = res.bytesProduced; | ||
| if (res.wasClosed) { | ||
| return -1; | ||
| // JAVA-5391: return any bytes produced alongside close_notify; the next read | ||
| // sees shutdownReceived and returns -1. Fixed in upstream marianobarrios/tls-channel; | ||
| // this is the minimal patch until the vendored snapshot is refreshed. | ||
| return bytesToReturn > 0 ? bytesToReturn : -1; |
There was a problem hiding this comment.
Could you please
- Either explain the problem in this GitHub thread, or link to an existing explanation / bug report.
- Also explain why the fix is required for the current PR.
- Share a link to the upstream PR/commit with the fix.
I think, given that this change is in a vendored code, we should do it in a separate PR, to avoid complicating the future work of updating the vendored code.
There was a problem hiding this comment.
Yes it was causing the tests to fail (I can't recall which one) but it was not returning bytes that needed to be read. Most likely the mock kms server.
There was a problem hiding this comment.
Ah yes it was:
ClientSideEncryptionKmsRetryProseTest > testCreateDataKeyAndEncryptWithHttpRetry(String) > Case 2: HTTP retry with aws FAILED
If causes: MongoClientException: Exception in encryption library: Allocation size must be greater than zero.
It doesn't happen on sync because it uses SSLSocket/InputStream, which correctly delivers data before signalling EOF on close_notify.
It isn't triggered by gcp or azure because the failpoint fires on the OAuth token request. So the actual key operation uses another connection and succeeds. As AWS doesn't use a token, and fails with data unconsumed on the wire.
According to Claude:
The TLS spec allows close_notify to arrive in the same flight as application data. > Any real-world TLS server can legitimately:
- Send Connection: close and the response in one write (common for HTTP/1.0 or error responses)
- Send a response and immediately initiate TLS shutdown
- Have the OS coalesce the response and close_notify into a single TCP segment due to Nagle's algorithm or buffering
TLSChannelImpl fixed this behaviour in 0.5.0
| } | ||
|
|
||
| @Override | ||
| public boolean feedAndRetry(final ByteBuffer bytes) { |
There was a problem hiding this comment.
The current name feedAndRetry suggests that the method also retries, which is not true. Let's find a name that is less misleading. Maybe feedWithRetry (as is in CAPI), or feedWithRetrySupport?
| /** | ||
| * Initial read size after re-sending a KMS request. Matches libmongocrypt's DEFAULT_MAX_KMS_BYTE_REQUEST | ||
| * and is used when {@link #bytesNeeded()} still returns 0 because libmongocrypt's should_retry flag has | ||
| * not yet been cleared by {@link #feedAndRetry}. | ||
| * | ||
| * @since 5.8 | ||
| */ | ||
| int DEFAULT_KMS_READ_SIZE = 1024; |
There was a problem hiding this comment.
Where does this constant and the rules accompanying it come from? It seem to be in conflict with both the current integration.md instructions and the API documentation of the methods mongocrypt_kms_ctx_bytes_needed, mongocrypt_kms_ctx_feed_with_retry.
There was a problem hiding this comment.
As per the comment above - it matches DEFAULT_MAX_KMS_BYTE_REQUEST.
The fail() function resets the parser and sets should_retry = true, which deliberately makes bytes_needed return 0. This creates an awkward contract: the driver must somehow know to do a read even though bytes_needed says "I don't need anything."
The integrating guide doesn't document this gap — it just says "loop while bytes_needed > 0" and "call fail() on network error, retry if it returns true," without acknowledging that those two instructions are incompatible without a fallback.
There was a problem hiding this comment.
I asked Claude to compare my implementation to that of the python binding. Seems I over complicated it (or shortcutted it) and I can just revisit the State machine loop. Will update.
| mongocrypt_setopt_bypass_query_analysis (mongocrypt_t crypt); | ||
|
|
||
| /** | ||
| * Opt-into handling the MONGOCRYPT_CTX_NEED_KMS state with retry logic. |
There was a problem hiding this comment.
Let's make MONGOCRYPT_CTX_NEED_KMS a @link to the corresponding program element.
There was a problem hiding this comment.
- Let's update the documentation of the
mongocrypt_kms_ctx_bytes_neededmethod that existed before this PR, such that it mentions not only to themongocrypt_kms_ctx_feedmethod, but also to themongocrypt_kms_ctx_feed_with_retrymethod. - Let's express those notions using
@link.
| /** | ||
| * Signal to libmongocrypt that a network error occurred on this KMS request. | ||
| * | ||
| * <p>Requires {@link #mongocrypt_setopt_retry_kms} to be enabled. | ||
| * | ||
| * @param kms The @ref mongocrypt_kms_ctx_t. | ||
| * @return True if the request should be retried, false if retries are exhausted. | ||
| * @since 5.8 | ||
| */ | ||
| public static native boolean | ||
| mongocrypt_kms_ctx_fail(mongocrypt_kms_ctx_t kms); |
There was a problem hiding this comment.
True if the request should be retried
The documentation of the native method says "may be retried", not "should", which is different. Let's make our documentation consistent with that of the native API, and update the documentation of MongoKeyDecryptor.fail accordingly.
There was a problem hiding this comment.
I've got Claude to update them all - as most are very out of date.
I've also added JAVA-6225 to remove any unused statics / methods that either we never or no longer implement
There was a problem hiding this comment.
I am pausing the review until we address the somewhat structurally important concerns:
- #1953 (comment)
- #1953 (comment)
- #1953 (comment)
- #1953 (comment) - the tests should pass. Update: they pass.
| /** | ||
| * Note: Not thread-safe: methods mutate the underlying native {@code mongocrypt_kms_ctx_t} and must be invoked serially. | ||
| * Callers perform retries sequentially — the sync driver in a {@code while} loop and the reactive driver via | ||
| * {@code Mono.flatMap} — so no external synchronization is required. | ||
| */ | ||
| class MongoKeyDecryptorImpl implements MongoKeyDecryptor { |
There was a problem hiding this comment.
- We have
@NotThreadSafefor this, let's use it.- Also, given that for some non-obvious reason (do you know what it is?) we use
MongoKeyDecryptorImplvia theMongoKeyDecryptorinterface, we should document the interface as non-thread-safe.
- Also, given that for some non-obvious reason (do you know what it is?) we use
- If you want to document explicitly that the wrapped
mongocrypt_kms_ctx_tis not thread-safe, then let's do that in the documentation of that type, not here. - The "methods mutate the underlying..." text is not needed - it does not add anything useful to the "not thread-safe" information, nor does it on its own imply non-thread-safety.
- "Callers perform retries sequentially ..." - all that information may be a comment on the calling code, if it's really needed there. I think, the documentation of
MongoKeyDecryptorImplis a wrong place for the comments for the code that uses the type.
There was a problem hiding this comment.
Opted to remove it. Unfortunately, @NotThreadSafe isn't available to use here.
…nd added a rule to AGENTS.md
Add libmongocrypt CAPI bindings for KMS retry support and wire retry logic through the sync and reactive driver stacks. Transient KMS HTTP and network errors are retried with backoff delays managed by libmongocrypt; retry is enabled unconditionally.
JAVA-5391