Add leader election configuration to operator deployment #2185
yuyue9284 wants to merge 1 commit into NVIDIA:main
5922776 to 1d0bdfe
/ok-to-test 1d0bdfe
@yuyue9284 We've changed our CI workflows, so could you rebase your PR? Thanks.
Signed-off-by: Yue Yu <yuyue9284@outlook.com>
1d0bdfe to 7b186af
Thanks, updated.
Is there a good reason to actually disable leader election here? @yuyue9284 In the issue you've linked, the failed leader election doesn't seem to be the problem, but rather a symptom of a failing API server. I am open to changing my view on this, but I am not convinced that we need a setting for this right now.
Hi @tariq1890, we run gpu-operator on a large cluster. While other system components are being updated, the API server can suffer temporary performance degradation. During that time the gpu-operator may fail to renew its lease and then crash and re-list, which makes API server performance even worse. Since we run only one instance of the operator, we want to disable leader election to avoid restarts on that code path, but the chart does not support this.
Description
Add configuration for the operator's leader election. For a single replica, disabling leader election can prevent unwanted crashes of gpu-operator caused by transient errors.
#772
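To illustrate the idea in the description, here is a minimal Go sketch of how an operator binary might expose a leader election toggle. It is not taken from this PR: the `operatorOptions` type, the `parseOperatorFlags` helper, the `--replicas` flag, and the default values are all hypothetical, and `--leader-elect` is assumed to follow the common controller-runtime scaffold naming. In a controller-runtime based operator the parsed value would typically be passed to `ctrl.Options.LeaderElection`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// operatorOptions is a hypothetical stand-in for the settings the
// deployment would pass to the operator binary.
type operatorOptions struct {
	LeaderElect bool // whether the manager should acquire a lease before reconciling
	Replicas    int  // hypothetical: replica count, used only for the safety check below
}

// parseOperatorFlags parses the (assumed) command-line flags and rejects
// the unsafe combination of several replicas without leader election.
func parseOperatorFlags(args []string) (operatorOptions, error) {
	var opts operatorOptions
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.BoolVar(&opts.LeaderElect, "leader-elect", true, "enable leader election")
	fs.IntVar(&opts.Replicas, "replicas", 1, "number of operator replicas (illustrative)")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	// Without a lease, multiple replicas would all reconcile at once.
	if !opts.LeaderElect && opts.Replicas > 1 {
		return opts, fmt.Errorf("leader election disabled with %d replicas", opts.Replicas)
	}
	return opts, nil
}

func main() {
	opts, err := parseOperatorFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("leader election enabled: %t\n", opts.LeaderElect)
}
```

The guard reflects the trade-off discussed in the conversation: skipping the lease removes the renew-failure restart path for a single replica, but it is only safe when exactly one instance is running.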
Checklist
- make lint
- make validate-generated-assets
- make validate-modules
Testing