Recover OpenFeature provider after initialization timeout #11474

Open
leoromanovsky wants to merge 5 commits into master from leo.romanovsky/ffl-2339-java-provider-nonblocking-defaults

@leoromanovsky leoromanovsky commented May 27, 2026

Motivation

A Java application can start while its Datadog Agent is already running but does not yet have an FFE_FLAGS payload cached for that tracer. That is the customer-reported shape we have been investigating: the tracer starts, asks for feature flag configuration, and the OpenFeature provider waits for usable configuration during initialization.

That blocking behavior is intentional. setProviderAndWait() should wait until the provider receives usable flag configuration, or fail once the configured timeout expires. The bug is in what happens after the timeout: the OpenFeature provider transitions to ERROR and does not recover. When FFE_FLAGS arrives later, the provider must transition back to READY without requiring an application restart.

There are two edge cases this PR needs to handle correctly. First, a real config can arrive right at the timeout boundary, after the evaluator has received it but before provider initialization has finished deciding whether it timed out. Second, usable config can disappear after the provider is already READY; in that case the provider state should not remain READY while evaluations return PROVIDER_NOT_READY.

Changes

This keeps provider initialization blocking. DDEvaluator.initialize() registers for Feature Flagging Gateway updates and waits on the initialization latch until a non-null ServerConfiguration arrives or the configured timeout expires. A null config update means there is still no usable FFE product for the provider, so it does not satisfy initialization.
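
As a rough illustration of the blocking behavior described above, here is a minimal sketch using a CountDownLatch. It is not the DDEvaluator code: the class name, the method names, and the use of a String in place of ServerConfiguration are all assumptions made for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only: class and method names are hypothetical,
// and a String stands in for ServerConfiguration.
final class BlockingInitSketch {
  private final CountDownLatch initialConfig = new CountDownLatch(1);
  private final AtomicReference<String> current = new AtomicReference<>();

  // Invoked on every config update; null means "no usable config".
  void onConfigUpdate(String config) {
    current.set(config);
    if (config != null) {
      initialConfig.countDown(); // only a real config releases the waiter
    }
  }

  // Blocks until a non-null config arrives or the timeout expires.
  // Returns true if config arrived in time, false on timeout or interruption.
  boolean awaitInitialConfig(long timeoutMillis) {
    try {
      return initialConfig.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  boolean hasConfiguration() {
    return current.get() != null;
  }
}
```

In this sketch a null update stores the absence of config but never counts the latch down, so a caller blocked in awaitInitialConfig keeps waiting until a real config arrives or the timeout expires.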

The provider tracks initialization state explicitly. If real configuration arrives while initialization is still blocked, the provider records that initial config was received and lets the OpenFeature SDK publish PROVIDER_READY after initialize() returns. If initialization has already timed out and the provider is in ERROR, the next real config update moves the provider to READY and emits PROVIDER_READY. After the provider is ready, later real config updates emit PROVIDER_CONFIGURATION_CHANGED.

Null config updates after readiness are handled as a real lifecycle transition. If the provider was READY and FFE config becomes unavailable, it emits PROVIDER_ERROR; evaluations then return the caller default with PROVIDER_NOT_READY. A later real config moves the provider back to READY.
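
The transitions described in this section can be sketched as a small state machine. This is a simplified model under stated assumptions, not the provider implementation: the class and method names are hypothetical, events are recorded in a list rather than emitted through the OpenFeature SDK, and methods are synchronized so the sketch sidesteps the concurrency questions raised in review.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the lifecycle; all names here are hypothetical.
final class ProviderLifecycleSketch {
  enum State { NOT_STARTED, INITIALIZING, INITIAL_CONFIG_RECEIVED, READY, ERROR }

  private State state = State.NOT_STARTED;
  // Stand-in for events the provider itself would emit.
  final List<String> emitted = new ArrayList<>();

  synchronized State state() { return state; }

  synchronized void beginInitialize() { state = State.INITIALIZING; }

  // initialize() gave up waiting: only meaningful while still INITIALIZING.
  synchronized void onInitializeTimeout() {
    if (state == State.INITIALIZING) state = State.ERROR;
  }

  // initialize() returned normally; the SDK (not the provider) publishes PROVIDER_READY.
  synchronized void onInitializeReturned() {
    if (state == State.INITIAL_CONFIG_RECEIVED) state = State.READY;
  }

  synchronized void onRealConfig() {
    switch (state) {
      case INITIALIZING:
        state = State.INITIAL_CONFIG_RECEIVED;
        break;
      case ERROR:
        state = State.READY;
        emitted.add("PROVIDER_READY"); // recovery after timeout or config loss
        break;
      case READY:
        emitted.add("PROVIDER_CONFIGURATION_CHANGED");
        break;
      default:
        break;
    }
  }

  synchronized void onNullConfig() {
    if (state == State.READY) {
      state = State.ERROR;
      emitted.add("PROVIDER_ERROR"); // usable config disappeared
    }
  }
}
```

Walking the timeout path through this model: beginInitialize(), onInitializeTimeout(), then onRealConfig() leaves the state at READY with a single PROVIDER_READY recorded, which is the recovery this PR is after.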

Decisions

Blocking is not the problem we are fixing here. Blocking until config or timeout is the Java behavior we want because the provider should not advertise readiness before it has usable flag configuration. The missing piece was recovery after timeout and after losing usable configuration. This PR fixes those recovery paths without changing the public provider API or the caller-facing timeout option.

This also pairs with the first-RC subscription work that landed separately: that work makes the tracer request FFE_FLAGS as early as possible from an already-running Agent, while this change makes the Java OpenFeature provider recover correctly if that first attempt still times out. The related Go tracer reference is SubscribeRC, which subscribes FFE_FLAGS during tracer startup so it is included in the first RC request: https://github.com/DataDog/dd-trace-go/blob/3ded6653e44aeb0d27bd5944e1e8033775473768/internal/openfeature/rc_subscription.go#L40-L44

Evidence

Dogfooding validation is captured in DataDog/ffe-dogfooding#71. That PR adds the local Java build path used to run this PR against the full ffe-dogfooding compose stack before the Java artifacts are published.

The local validation built dd-java-agent-1.63.0-SNAPSHOT.jar and dd-openfeature-1.63.0-SNAPSHOT.jar from this dd-trace-java branch, staged the local provider JAR into the Java dogfooding app image, and mounted the local dd-trace-java checkout so the Java container started with the local agent JAR.

Commands used for the dogfooding smoke test:

DD_TRACE_JAVA_PATH=/path/to/dd-trace-java scripts/prepare-local-java.sh
DD_API_KEY=local_dummy DD_TRACE_JAVA_PATH=/path/to/dd-trace-java docker-compose -f docker-compose.yml -f local/docker-compose.java.yml up --build -d

The full compose stack started successfully with app-go, app-python, app-nodejs, app-java, app-ruby, app-dotnet, datadog-agent, mock-intake, otlp-intake, and evaluator all running. Health checks passed on app ports 8081 through 8086, and the evaluator /stats endpoint responded. The Java container log confirmed the local Java tracer was used:

Using local Java agent: /opt/dd-trace-java/dd-java-agent/build/libs/dd-java-agent-1.63.0-SNAPSHOT.jar
App built with local dd-openfeature JAR

State Transitions

stateDiagram-v2
    [*] --> NOT_STARTED
    NOT_STARTED --> INITIALIZING: initialize()

    INITIALIZING --> INITIALIZING: null config / keep blocking
    INITIALIZING --> INITIAL_CONFIG_RECEIVED: real config arrives before initialize returns
    INITIALIZING --> ERROR: timeout without real config / throw ProviderNotReadyError

    INITIAL_CONFIG_RECEIVED --> READY: initialize returns / OpenFeature SDK emits PROVIDER_READY

    ERROR --> READY: later real config / emit PROVIDER_READY
    ERROR --> ERROR: null config / remain unavailable

    READY --> READY: later real config / emit PROVIDER_CONFIGURATION_CHANGED
    READY --> ERROR: null config / emit PROVIDER_ERROR

@leoromanovsky leoromanovsky added the comp: openfeature, type: bug, and tag: ai generated labels May 27, 2026

dd-octo-sts Bot commented May 27, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

| Scenario | This PR | master | Change |
|---|---|---|---|
| insecure-bank / iast | 14,003 ms | 13,935 ms | +0.5% |
| insecure-bank / tracing | 12,947 ms | 13,016 ms | -0.5% |
| petclinic / appsec | 16,504 ms | 16,285 ms | +1.3% |
| petclinic / iast | 16,510 ms | 16,607 ms | -0.6% |
| petclinic / profiling | 16,426 ms | 16,515 ms | -0.5% |
| petclinic / tracing | 15,818 ms | 16,012 ms | -1.2% |

Commit: fcfa34c9 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

@leoromanovsky leoromanovsky changed the title Avoid blocking OpenFeature provider initialization Recover OpenFeature provider after config timeout May 27, 2026
@leoromanovsky leoromanovsky changed the title Recover OpenFeature provider after config timeout Recover OpenFeature provider after initialization timeout May 27, 2026
@leoromanovsky leoromanovsky removed the tag: ai generated label May 28, 2026
@leoromanovsky leoromanovsky marked this pull request as ready for review May 28, 2026 02:04
@leoromanovsky leoromanovsky requested a review from a team as a code owner May 28, 2026 02:04
@leoromanovsky leoromanovsky requested review from dd-oleksii and typotter and removed request for a team May 28, 2026 02:04

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fcfa34c9cd

}
initializationState.set(InitializationState.READY);
} catch (final OpenFeatureError e) {
initializationState.set(InitializationState.ERROR);

P2: Avoid reverting READY recovery after a timeout race

If the initialization timeout wins the compareAndSet(INITIALIZING, ERROR) path and a real config arrives before this catch runs, onConfigurationChange() can already transition the provider back to READY and emit PROVIDER_READY; this unconditional assignment then puts the internal state back to ERROR while configuration is present. In that state later config-loss updates are ignored by onConfigurationUnavailable(), so the OpenFeature client can remain ready while evaluations return PROVIDER_NOT_READY. Preserve an already-recovered READY state instead of always overwriting it here.
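
One way to picture the suggestion is to compare an unconditional write with a conditional one. The sketch below is hypothetical and is not taken from the PR: the class and method names are invented, and a compare-and-set from INITIALIZING is only one possible way to avoid overwriting a READY state that a concurrent config update has already reached.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of the race described in the review comment.
final class TimeoutRaceSketch {
  enum State { INITIALIZING, READY, ERROR }

  // Unconditional write: clobbers READY if recovery already happened.
  static State recordErrorUnconditionally(AtomicReference<State> state) {
    state.set(State.ERROR);
    return state.get();
  }

  // Conditional write: records ERROR only if initialization is still pending,
  // so a state already moved to READY (or already ERROR) is left alone.
  static State recordErrorIfStillInitializing(AtomicReference<State> state) {
    state.compareAndSet(State.INITIALIZING, State.ERROR);
    return state.get();
  }
}
```

With the state already at READY, the conditional variant leaves it at READY, while the unconditional variant moves it to ERROR even though configuration is present.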

Comment on lines +132 to +133
if (state != InitializationState.READY) {
return;

P2: Handle null config while initial readiness is pending

When a real config releases the initialization latch, the provider is in INITIAL_CONFIG_RECEIVED until initialize() finishes; if a null config update arrives in that window, this branch drops it because the state is not yet READY, and initialize() can subsequently set the provider to READY even though evaluator.hasConfiguration() is already false. That leaves the client advertising readiness while evaluations return PROVIDER_NOT_READY until another RC update arrives, so the unavailable transition needs to account for INITIAL_CONFIG_RECEIVED too.
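
A possible shape for that handling is sketched below. It assumes, purely for illustration, that a null update during INITIAL_CONFIG_RECEIVED should move the state to ERROR so that initialize() no longer reports success; the PR may resolve this differently, and every name in the block is hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: treat config loss before initialize() returns as a failure.
final class PendingReadinessSketch {
  enum State { INITIALIZING, INITIAL_CONFIG_RECEIVED, READY, ERROR }

  private final AtomicReference<State> state = new AtomicReference<>(State.INITIALIZING);

  State state() { return state.get(); }

  void onRealConfig() {
    state.compareAndSet(State.INITIALIZING, State.INITIAL_CONFIG_RECEIVED);
  }

  // Returns true when the provider itself should emit PROVIDER_ERROR.
  boolean onConfigurationUnavailable() {
    if (state.compareAndSet(State.READY, State.ERROR)) {
      return true; // was advertising readiness, so listeners must be told
    }
    // Readiness was never published; just make sure initialize() cannot succeed.
    state.compareAndSet(State.INITIAL_CONFIG_RECEIVED, State.ERROR);
    return false;
  }

  // Returns true only if config is still present when initialize() completes.
  boolean finishInitialize() {
    return state.compareAndSet(State.INITIAL_CONFIG_RECEIVED, State.READY);
  }
}
```

In this sketch, a real config followed by a null update before finishInitialize() leaves the state at ERROR and finishInitialize() returns false, so the provider never advertises readiness without configuration.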
