Add leader election configuration to operator deployment by yuyue9284 · Pull Request #2185 · NVIDIA/gpu-operator

yuyue9284 · 2026-03-02T21:28:18Z

Description

Add configuration for the operator's leader election. For a single-replica deployment, disabling leader election can prevent unwanted crashes of the gpu-operator caused by transient errors.

#772
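As a rough illustration of the kind of switch described above, here is a minimal sketch in Go using only the standard library. The `--leader-elect` flag name, its default of `true`, the `managerOptions` type, and the lock ID are all hypothetical and are not taken from this PR or from the gpu-operator source; the sketch only shows how a boolean setting could be parsed and carried into manager options.

```go
package main

import (
	"flag"
	"fmt"
)

// managerOptions is a hypothetical stand-in for the options a controller
// manager accepts. It is not the gpu-operator's actual type.
type managerOptions struct {
	LeaderElection   bool
	LeaderElectionID string
}

// parseOptions reads a hypothetical --leader-elect flag (assumed to default
// to true) and returns the resulting options.
func parseOptions(args []string) (managerOptions, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	enable := fs.Bool("leader-elect", true,
		"enable leader election; may be set to false for single-replica deployments")
	if err := fs.Parse(args); err != nil {
		return managerOptions{}, err
	}
	return managerOptions{
		LeaderElection:   *enable,
		LeaderElectionID: "example-operator-lock", // illustrative name only
	}, nil
}

func main() {
	// Simulate a deployment that passes --leader-elect=false.
	opts, err := parseOptions([]string{"--leader-elect=false"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("leader election enabled: %v\n", opts.LeaderElection)
}
```

In a Helm-based deployment, a chart value would typically be rendered into an argument like this on the operator container; how this PR actually wires it is not shown here.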

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint)
Generated assets in-sync (make validate-generated-assets)
Go mod artifacts in-sync (make validate-modules)
Test cases are added for new code paths

Testing

copy-pr-bot · 2026-03-02T21:28:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rajathagasthya · 2026-04-01T17:10:14Z

/ok-to-test 1d0bdfe

rajathagasthya · 2026-04-02T15:58:40Z

@yuyue9284 We've changed our CI workflows, so could you rebase your PR? Thanks.

Signed-off-by: Yue Yu <yuyue9284@outlook.com>

yuyue9284 · 2026-04-07T21:30:02Z

> @yuyue9284 We've changed our CI workflows, so could you rebase your PR? Thanks.

Thanks, updated.

tariq1890 · 2026-04-07T22:17:55Z

Is there a good reason to actually disable leader election here? @yuyue9284 In the issue you've linked, the failed leader election doesn't seem to be the problem, but rather a symptom of a failing API server. I am open to changing my view on this, but I am not convinced that we need a setting for this right now.

yuyue9284 · 2026-04-07T22:31:33Z

> Is there a good reason to actually disable leader election here? @yuyue9284 In the issue you've linked, the failed leader election doesn't seem to be the problem, but rather a symptom of a failing API server. I am open to changing my view on this, but I am not convinced that we need a setting for this right now.

Hi @tariq1890, we run gpu-operator on a large cluster. While other system components are being updated, the API server may suffer temporary performance degradation. As a result, the gpu-operator might fail to renew its lease and then crash and re-list, which makes API server performance worse. Since we run only one instance of the operator, we want to disable leader election to avoid restarts on that code path, but the chart does not support this.

yuyue9284 requested review from cdesiniotis, karthikvetrivel, rahulait, rajathagasthya, shivamerla and tariq1890 as code owners March 2, 2026 21:28

yuyue9284 force-pushed the support-disable-leader-election branch from 5922776 to 1d0bdfe Compare March 2, 2026 21:34

Add leader election configuration to operator deployment

7b186af

Signed-off-by: Yue Yu <yuyue9284@outlook.com>

yuyue9284 force-pushed the support-disable-leader-election branch from 1d0bdfe to 7b186af Compare April 7, 2026 21:29

yuyue9284 wants to merge 1 commit into NVIDIA:main from yuyue9284:support-disable-leader-election
3 participants