Add leader election configuration to operator deployment #2185

Open
yuyue9284 wants to merge 1 commit into NVIDIA:main from yuyue9284:support-disable-leader-election

Conversation

@yuyue9284 yuyue9284 commented Mar 2, 2026

Description

Add configuration for the operator's leader election. For a single-replica deployment, disabling leader election can prevent unwanted crashes of gpu-operator caused by transient errors.

#772
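
For readers following the thread, here is a minimal sketch of how an operator binary could expose such a toggle. It is not the code in this PR: the `options` type, the `parseOptions` helper, the flag name `leader-elect`, and its default are all assumptions made for illustration, and only the Go standard library is used.

```go
package main

import (
	"flag"
	"fmt"
)

// options holds the hypothetical setting discussed in this PR.
type options struct {
	leaderElect bool
}

// parseOptions parses command-line arguments into options.
// The flag name "leader-elect" and its default (true) are assumptions
// for illustration, not taken from the gpu-operator source.
func parseOptions(args []string) (options, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	var o options
	fs.BoolVar(&o.leaderElect, "leader-elect", true, "enable leader election")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	return o, nil
}

func main() {
	// A single-replica deployment could pass the flag explicitly.
	o, err := parseOptions([]string{"--leader-elect=false"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("leader election enabled:", o.leaderElect)
}
```

A chart value could then be rendered into this argument on the operator container, so single-replica installs can opt out while multi-replica installs keep the default.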

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

copy-pr-bot bot commented Mar 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yuyue9284 yuyue9284 force-pushed the support-disable-leader-election branch from 5922776 to 1d0bdfe Compare March 2, 2026 21:34
@rajathagasthya (Contributor) commented

/ok-to-test 1d0bdfe

@rajathagasthya (Contributor) commented

@yuyue9284 We've changed our CI workflows, so could you rebase your PR? Thanks.

Signed-off-by: Yue Yu <yuyue9284@outlook.com>
@yuyue9284 yuyue9284 force-pushed the support-disable-leader-election branch from 1d0bdfe to 7b186af Compare April 7, 2026 21:29
@yuyue9284 (Author) commented

> @yuyue9284 We've changed our CI workflows, so could you rebase your PR? Thanks.

Thanks, updated.

@tariq1890 (Contributor) commented

Is there a good reason to actually disable leader election here, @yuyue9284? In the issue you've linked, the failed leader election doesn't seem to be the problem, but rather a symptom of a failing API server. I am open to changing my view on this, but I am not convinced that we need a setting for this right now.

@yuyue9284 (Author) commented

> Is there a good reason to actually disable leader election here, @yuyue9284? In the issue you've linked, the failed leader election doesn't seem to be the problem, but rather a symptom of a failing API server. I am open to changing my view on this, but I am not convinced that we need a setting for this right now.

Hi @tariq1890, we run gpu-operator on a large cluster. While other system components are being updated, the API server can suffer a temporary performance degradation. During that window the gpu-operator may fail to renew its lease, then crash and re-list, which makes API server performance even worse. Since we run only one instance of the operator, we want to disable leader election to avoid restarts on that code path, but the chart does not currently support this.
