server: make HikariCP leak detection configurable #13407
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 4.22 #13407 +/- ##
============================================
- Coverage 17.68% 17.67% -0.01%
+ Complexity 15793 15791 -2
============================================
Files 5922 5922
Lines 533123 533182 +59
Branches 65201 65210 +9
============================================
- Hits 94268 94251 -17
- Misses 428212 428284 +72
- Partials 10643 10647 +4
@blueorangutan package kvm
@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with kvm SystemVM template(s). I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18233
@blueorangutan help
@andrijapanicsb [SL] I understand these words: "help", "hello", "thanks", "package", "test" Blessed contributors for kicking Trillian test jobs: ['rohityadavcloud', 'shwstppr', 'Damans227', 'vishesh92', 'Pearl1594', 'harikrishna-patnala', 'nvazquez', 'DaanHoogland', 'weizhouapache', 'borisstoyanov', 'vladimirpetrov', 'kiranchavala', 'andrijapanicsb', 'NuxRo', 'rajujith', 'sureshanaparti', 'abh1sar', 'sudo87', 'RosiKyu']
Description
This PR makes the HikariCP leak-detection threshold and JMX MBean registration configurable per database pool via `db.properties`, instead of relying on HikariCP defaults that cannot be changed without a code change.

CloudStack already maps a subset of `db.properties` values onto `HikariConfig` in `framework/db/src/main/java/com/cloud/utils/db/TransactionLegacy.java` (e.g. `maxActive`, `maxIdle`, `maxWait`, `minIdleConnections`, `connectionTimeout`, `keepAliveTime`). This PR adds two more, following the exact same parsing/threading pattern:

| Property | HikariConfig setter | Default |
|---|---|---|
| `db.<pool>.leakDetectionThreshold` | `HikariConfig#setLeakDetectionThreshold(long)` | `0` (disabled) |
| `db.<pool>.registerMbeans` | `HikariConfig#setRegisterMbeans(boolean)` | `false` (disabled) |

Supported for all three pools that use the shared datasource factory: `cloud`, `usage`, `simulator`.

Behaviour:
- `leakDetectionThreshold` absent or `0` → leak detection disabled (unchanged default behaviour). Only applied when set to a value `> 0`. (HikariCP itself ignores values below 2000 ms with a warning.)
- `registerMbeans` absent or `false` → MBeans disabled (unchanged default). `true` → Hikari JMX MBeans registered for live pool-counter observation.

Motivation / context: in production we saw the management server become unstable, and eventually crash, on clusters exercising Host-HA. Watching MySQL with `SHOW PROCESSLIST` during the incident showed the number of sessions owned by the `cloud` DB user climbing steadily over a couple of hours, all of them in the `Sleep` state, until the HikariCP pool (`db.cloud.maxActive`, default `250`) was exhausted and the server could no longer borrow a connection. That signature (monotonically growing, never-reaped, all idle, all owned by the `cloud` user) is a classic DB connection leak in a periodic code path (suspected Host-HA host checks) that borrows a pooled connection and never returns it.

The problem is that these symptoms tell you that connections leak, not where. HikariCP already has the exact tool for that, `leakDetectionThreshold`, but CloudStack hard-wires it off with no way to turn it on. This PR exposes it (and `registerMbeans`) through `db.properties` so an operator can enable leak detection on a live server; HikariCP then logs an `Apparent connection leak detected` stack trace identifying the precise code path that borrowed the connection and failed to return it, and the MBeans give live pool-counter visibility. The actual leak fix is a separate change; this PR is the diagnostic enabler.

Everything is disabled by default, so there is no behavioural change for existing deployments that don't set the new properties.
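The per-pool lookup described above can be sketched roughly as follows. This is an illustration only, not the code in `TransactionLegacy.java`: the class `PoolDebugProps` and its `read` method are hypothetical names, and the sketch uses plain `java.util.Properties` rather than CloudStack's own `parseNumber(...)` helper. The property keys and defaults are the ones listed in the table.

```java
import java.util.Properties;

// Hypothetical helper (not CloudStack code): resolves the two per-pool
// debug settings from db.properties-style keys, using the documented defaults.
final class PoolDebugProps {
    final long leakDetectionThreshold; // milliseconds; 0 means disabled
    final boolean registerMbeans;      // false means no JMX MBeans

    PoolDebugProps(long leakDetectionThreshold, boolean registerMbeans) {
        this.leakDetectionThreshold = leakDetectionThreshold;
        this.registerMbeans = registerMbeans;
    }

    // pool is expected to be one of "cloud", "usage", "simulator".
    static PoolDebugProps read(Properties props, String pool) {
        String prefix = "db." + pool + ".";
        String rawThreshold = props.getProperty(prefix + "leakDetectionThreshold");
        String rawMbeans = props.getProperty(prefix + "registerMbeans");

        long threshold = 0L; // absent or blank -> disabled
        if (rawThreshold != null && !rawThreshold.trim().isEmpty()) {
            threshold = Long.parseLong(rawThreshold.trim());
        }
        // Anything other than "true" (case-insensitive) is treated as false.
        boolean mbeans = rawMbeans != null && Boolean.parseBoolean(rawMbeans.trim());
        return new PoolDebugProps(threshold, mbeans);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("db.cloud.leakDetectionThreshold", "60000");
        props.setProperty("db.cloud.registerMbeans", "true");
        for (String pool : new String[] {"cloud", "usage", "simulator"}) {
            PoolDebugProps p = read(props, pool);
            System.out.println(pool + ": leakDetectionThreshold=" + p.leakDetectionThreshold
                    + " registerMbeans=" + p.registerMbeans);
        }
    }
}
```

In this sketch only the `cloud` pool has the properties set, so `usage` and `simulator` fall back to the disabled defaults, which matches the "no behavioural change unless set" intent.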
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
N/A
How Has This Been Tested?
Build: compiled the affected module and its dependencies off tag `4.22.1.0`:

Result: BUILD SUCCESS (checkstyle passed). The change is confined to property parsing and threading through the existing `createDataSource` → `createHikaricpDataSource` chain, reusing the existing `parseNumber(...)` helper.

Unit tests: the apply logic is factored into a package-private `applyHikariDebugSettings(HikariConfig, Long, Boolean, String)` and covered by 4 new `TransactionLegacyTest` cases (defaults-disabled, `0`-keeps-disabled, leak-detection-enabled (`60000`), and register-MBeans-enabled):

Result: Tests run: 4, Failures: 0, Errors: 0, BUILD SUCCESS.
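For readers without the diff at hand, the apply logic and the four cases can be approximated as below. This is a sketch under assumptions, not the PR's `applyHikariDebugSettings`: `DebugConfigurable` and `RecordingConfig` are hypothetical stand-ins for the two `HikariConfig` setters so the block runs without HikariCP on the classpath, and the fourth `String` argument of the real method is omitted because the description does not say what it is used for.

```java
// Stand-in for the two HikariConfig setters involved (assumption: the real
// code calls HikariConfig#setLeakDetectionThreshold and #setRegisterMbeans).
interface DebugConfigurable {
    void setLeakDetectionThreshold(long millis);
    void setRegisterMbeans(boolean register);
}

// Records what was applied; starts with HikariCP's defaults (0 and false).
final class RecordingConfig implements DebugConfigurable {
    long leakDetectionThreshold = 0L;
    boolean registerMbeans = false;

    @Override
    public void setLeakDetectionThreshold(long millis) {
        this.leakDetectionThreshold = millis;
    }

    @Override
    public void setRegisterMbeans(boolean register) {
        this.registerMbeans = register;
    }
}

final class HikariDebugSettingsSketch {
    // Applies a setting only when it is explicitly enabled, so absent (null),
    // 0, and false all leave the pool configuration untouched.
    static void apply(DebugConfigurable config, Long leakDetectionThreshold, Boolean registerMbeans) {
        if (leakDetectionThreshold != null && leakDetectionThreshold > 0) {
            config.setLeakDetectionThreshold(leakDetectionThreshold);
        }
        if (Boolean.TRUE.equals(registerMbeans)) {
            config.setRegisterMbeans(true);
        }
    }

    public static void main(String[] args) {
        RecordingConfig config = new RecordingConfig();
        apply(config, 60000L, Boolean.TRUE);
        System.out.println("leakDetectionThreshold=" + config.leakDetectionThreshold
                + " registerMbeans=" + config.registerMbeans);
    }
}
```

Guarding on `> 0` and `Boolean.TRUE` keeps the default path free of any setter calls, which is one way to guarantee that deployments without the new properties see no change.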
Runtime validation plan (on a patched management server):
- Set the new properties in `/etc/cloudstack/management/db.properties`:
- Restart the management server: `systemctl restart cloudstack-management`
- Expect `java.lang.Exception: Apparent connection leak detected` in the log, with a stack trace through `com.zaxxer.hikari.HikariDataSource.getConnection(...)` → `com.cloud.utils.db.TransactionLegacy...` identifying the borrowing path.
- With `registerMbeans=true`, the `com.zaxxer.hikari:type=Pool (cloud)` MBean is visible via JMX for live pool counters.

How did you try to break this feature and the system with this change?
Edge cases considered:
- Properties absent → leak detection off, `registerMbeans=false` (existing behaviour preserved).
- `leakDetectionThreshold=0` → not applied (disabled).
- `leakDetectionThreshold` between 1–1999 ms → passed to Hikari, which warns and ignores it (documented Hikari behaviour; noted in the code comment and the sample config).
- `registerMbeans=false` explicitly → MBeans off.
- The default datasource path (`getDefaultHikaricpDataSource`) → untouched.

These cases (defaults, `0`, enabled threshold, enabled MBeans) are locked down by the new `applyHikariDebugSettings` unit tests.