Tweak probes #102

Merged
aaron-steinfeld merged 4 commits into main from tweak-probes on Aug 26, 2021

@aaron-steinfeld aaron-steinfeld commented Aug 26, 2021

Description

After deploying the previous probe changes in a more complex environment with a service mesh, instrumentation, and constrained resources (a limit of around 400m instead of the default 2 CPUs), I found the default values were a little ambitious. The grpc-based check is significantly more sensitive to resource constraints than the previous checks, which from my perspective is a good thing and reflects its increased fidelity: the health check reflects the response time that service clients would see.

To account for that, I made the following changes:

  • Over-provisioned the check timeout in code and instead allowed constraining it via configuration, which now defaults to 3s.
  • Increased the default liveness failure threshold to 3, so in its default config it now takes 15s to detect a failure and restart.
  • Catch health check deadline exceptions and convert them to a plain false (unhealthy) response. The exception log statement is still available at debug level. The deadline errors cause confusion in the logs and reflect only one class of health check failure (since the checks are also constrained on the probe config side).
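The third change can be illustrated with a minimal sketch. It is not the code from this PR: it stands in for the grpc deadline with a plain `CompletableFuture` timeout so that it runs on the JDK alone, and the names `HealthCheckSketch`, `isHealthy`, and `DEFAULT_TIMEOUT` are hypothetical. Only the 3s default and the "deadline exceeded becomes false" behavior come from the description above.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch; names are illustrative and not taken from the PR.
final class HealthCheckSketch {

  // Mirrors the 3s configurable default mentioned in the description.
  static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(3);

  // Runs the check with a deadline and reports unhealthy (false)
  // instead of propagating a deadline or execution failure.
  static boolean isHealthy(Supplier<Boolean> check, Duration timeout) {
    CompletableFuture<Boolean> result = CompletableFuture.supplyAsync(check);
    try {
      return Boolean.TRUE.equals(result.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      // Deadline exceeded: treat as unhealthy; a real service might log this at debug level.
      result.cancel(true);
      return false;
    } catch (ExecutionException e) {
      // The check itself threw: also unhealthy.
      return false;
    } catch (InterruptedException e) {
      // Preserve the interrupt flag for the caller and report unhealthy.
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

A caller would use it as `HealthCheckSketch.isHealthy(() -> pingBackend(), HealthCheckSketch.DEFAULT_TIMEOUT)`, where `pingBackend` is a placeholder for whatever call the real check makes. With a grpc-java stub, the analogous failure would surface as a `StatusRuntimeException` whose status code is `DEADLINE_EXCEEDED`, and the same "catch and return false" shape would apply.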

As a separate exercise, it might be worth exploring why the CPU usage of the attribute service is so bursty, given that it does really minimal compute. The bursts appear to be correlated with a synchronized burst of traffic, and the service takes quite a bit of time to stabilize to its baseline (significantly longer than any call should take). The log below, with some temporary messaging enabled for illustration, shows this behavior.

2021-08-26 19:05:10.605 [qtp560041895-31] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [6ms]
2021-08-26 19:05:15.604 [qtp560041895-30] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [5ms]
2021-08-26 19:05:20.605 [qtp560041895-31] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [5ms]
2021-08-26 19:05:25.118 [grpc-default-executor-4] INFO  o.m.d.connection - Opened connection [connectionId{localValue:15, serverValue:348458}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-1] INFO  o.m.d.connection - Opened connection [connectionId{localValue:14, serverValue:348455}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-3] INFO  o.m.d.connection - Opened connection [connectionId{localValue:12, serverValue:348459}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-0] INFO  o.m.d.connection - Opened connection [connectionId{localValue:11, serverValue:348457}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-2] INFO  o.m.d.connection - Opened connection [connectionId{localValue:13, serverValue:348456}] to mongo
2021-08-26 19:05:25.706 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [106ms]
2021-08-26 19:05:33.199 [qtp560041895-30] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [2498ms]
2021-08-26 19:05:38.001 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [2401ms]
2021-08-26 19:05:41.902 [qtp560041895-30] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [1202ms]
2021-08-26 19:05:45.943 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [344ms]
2021-08-26 19:05:50.823 [qtp560041895-31] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [224ms]
2021-08-26 19:05:55.700 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [98ms]
2021-08-26 19:06:01.010 [qtp560041895-29] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [407ms]
2021-08-26 19:06:05.901 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [202ms]
2021-08-26 19:06:10.698 [qtp560041895-29] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [99ms]
2021-08-26 19:06:15.798 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [199ms]
2021-08-26 19:06:20.703 [qtp560041895-29] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [104ms]
2021-08-26 19:06:25.801 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [201ms]
2021-08-26 19:06:30.798 [qtp560041895-32] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [198ms]
2021-08-26 19:06:35.798 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [198ms]
2021-08-26 19:06:40.603 [qtp560041895-32] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [3ms]
2021-08-26 19:06:45.602 [qtp560041895-26] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [2ms]
2021-08-26 19:06:50.604 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [4ms]
2021-08-26 19:06:55.601 [qtp560041895-26] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [1ms]

Testing

Deployed and tuned this a bit in a test environment running with a mesh and constrained resources.

codecov bot commented Aug 26, 2021

Codecov Report

Merging #102 (8c250e3) into main (494351f) will decrease coverage by 0.30%.
The diff coverage is 62.50%.

@@             Coverage Diff              @@
##               main     #102      +/-   ##
============================================
- Coverage     81.59%   81.28%   -0.31%     
  Complexity      232      232              
============================================
  Files            28       28              
  Lines           788      791       +3     
  Branches         57       57              
============================================
  Hits            643      643              
- Misses           97      100       +3     
  Partials         48       48              

Flag         Coverage Δ
integration  81.28% <62.50%> (-0.31%) ⬇️
unit         68.25% <ø> (ø)

Impacted Files                                         Coverage Δ
.../core/attribute/service/AttributeServiceEntry.java  77.50% <62.50%> (-6.29%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@aaron-steinfeld aaron-steinfeld marked this pull request as ready for review August 26, 2021 19:37
@aaron-steinfeld aaron-steinfeld requested a review from a team August 26, 2021 19:37
@aaron-steinfeld aaron-steinfeld merged commit 00abb99 into main Aug 26, 2021
@aaron-steinfeld aaron-steinfeld deleted the tweak-probes branch August 26, 2021 23:32
@github-actions
Unit Test Results

21 files  ±0  21 suites  ±0   12s ⏱️ ±0s
77 tests ±0  77 ✔️ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 00abb99. ± Comparison against base commit 494351f.


3 participants