Tweak probes by aaron-steinfeld · Pull Request #102 · hypertrace/attribute-service

aaron-steinfeld · 2021-08-26T19:34:52Z

Description

After deploying the previous probe changes in a more complex environment with a service mesh, instrumentation, and constrained resources (a CPU limit of around 400m instead of the default 2 CPUs), I found the default values were a little ambitious. The gRPC-based check is significantly more sensitive to resource constraints than the previous checks, which from my perspective is a good thing and reflects its increased fidelity: the health check now reflects the response time the service's clients would see.

To account for that, I made the following changes:

Over-provisioned the check timeout in code and instead allowed constraining it via configuration, which now defaults to 3s.
Increased the default liveness failure threshold to 3, so in its default config it now takes 15s to detect a failure and restart.
Catch health check deadline exceptions and convert them to a plain false (unhealthy) response. The exception log statement is still available at debug level. The deadline errors cause confusion in the logs and reflect only one class of health check failure (since the checks are also constrained on the probe config side).
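
A minimal sketch of the third change, for illustration only. The PR text does not show the actual code in AttributeServiceEntry, so the class and method names here are hypothetical, and a `CompletableFuture` timeout stands in for the gRPC call deadline. It shows a missed deadline being turned into a plain unhealthy result, with the exception kept at a debug-equivalent log level.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; not the actual AttributeServiceEntry code. */
class HealthCheckSketch {
  private static final Logger LOGGER = Logger.getLogger(HealthCheckSketch.class.getName());

  /** Assumed default, mirroring the 3s configured timeout mentioned above. */
  static final long DEFAULT_TIMEOUT_MILLIS = 3000;

  /**
   * Runs the check with a deadline. Any failure, including a missed deadline,
   * is reported as "unhealthy" (false) rather than propagated as an exception.
   */
  static boolean isHealthy(Supplier<Boolean> check, long timeoutMillis) {
    CompletableFuture<Boolean> future = CompletableFuture.supplyAsync(check);
    try {
      return Boolean.TRUE.equals(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
    } catch (TimeoutException | ExecutionException e) {
      // Keep the detail available for debugging without cluttering normal logs.
      LOGGER.log(Level.FINE, "health check failed", e);
      future.cancel(true);
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Test helper: a check that succeeds only after sleeping for the given time. */
  static Supplier<Boolean> slowCheck(long millis) {
    return () -> {
      try {
        Thread.sleep(millis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return true;
    };
  }

  public static void main(String[] args) {
    System.out.println(isHealthy(() -> true, DEFAULT_TIMEOUT_MILLIS));
    System.out.println(isHealthy(slowCheck(500), 50));
  }
}
```

Returning false keeps the probe endpoint's contract simple: the probe sees a failed check either way, and the stack trace only appears when debug logging is enabled.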

As a separate exercise, it might be worth exploring why CPU usage on attribute service is so bursty, given that it does minimal compute. The bursts appear to be correlated with a synchronized burst of traffic, and the service takes quite a while to stabilize back to its baseline (significantly longer than any call should take). The log below, with some temporary messaging enabled for illustration, shows this behavior.

2021-08-26 19:05:10.605 [qtp560041895-31] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [6ms]
2021-08-26 19:05:15.604 [qtp560041895-30] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [5ms]
2021-08-26 19:05:20.605 [qtp560041895-31] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [5ms]
2021-08-26 19:05:25.118 [grpc-default-executor-4] INFO  o.m.d.connection - Opened connection [connectionId{localValue:15, serverValue:348458}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-1] INFO  o.m.d.connection - Opened connection [connectionId{localValue:14, serverValue:348455}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-3] INFO  o.m.d.connection - Opened connection [connectionId{localValue:12, serverValue:348459}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-0] INFO  o.m.d.connection - Opened connection [connectionId{localValue:11, serverValue:348457}] to mongo
2021-08-26 19:05:25.120 [grpc-default-executor-2] INFO  o.m.d.connection - Opened connection [connectionId{localValue:13, serverValue:348456}] to mongo
2021-08-26 19:05:25.706 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [106ms]
2021-08-26 19:05:33.199 [qtp560041895-30] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [2498ms]
2021-08-26 19:05:38.001 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [2401ms]
2021-08-26 19:05:41.902 [qtp560041895-30] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [1202ms]
2021-08-26 19:05:45.943 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [344ms]
2021-08-26 19:05:50.823 [qtp560041895-31] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [224ms]
2021-08-26 19:05:55.700 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [98ms]
2021-08-26 19:06:01.010 [qtp560041895-29] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [407ms]
2021-08-26 19:06:05.901 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [202ms]
2021-08-26 19:06:10.698 [qtp560041895-29] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [99ms]
2021-08-26 19:06:15.798 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [199ms]
2021-08-26 19:06:20.703 [qtp560041895-29] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [104ms]
2021-08-26 19:06:25.801 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [201ms]
2021-08-26 19:06:30.798 [qtp560041895-32] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [198ms]
2021-08-26 19:06:35.798 [qtp560041895-28] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [198ms]
2021-08-26 19:06:40.603 [qtp560041895-32] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [3ms]
2021-08-26 19:06:45.602 [qtp560041895-26] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [2ms]
2021-08-26 19:06:50.604 [qtp560041895-33] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [4ms]
2021-08-26 19:06:55.601 [qtp560041895-26] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [1ms]
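
To quantify a burst like the one above, the bracketed latencies can be pulled out of the temporary log messages. The following is a rough sketch under the assumption that the message format stays exactly as shown in the log; the class and method names are hypothetical and not part of this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for summarizing the temporary health check log lines. */
class HealthCheckLogSketch {
  // Matches e.g. "health check result: true [2498ms]" and captures the latency.
  private static final Pattern RESULT =
      Pattern.compile("health check result: (true|false) \\[(\\d+)ms\\]");

  /** Extracts health check latencies in milliseconds, skipping unrelated lines. */
  static List<Long> latenciesMillis(List<String> lines) {
    List<Long> latencies = new ArrayList<>();
    for (String line : lines) {
      Matcher matcher = RESULT.matcher(line);
      if (matcher.find()) {
        latencies.add(Long.parseLong(matcher.group(2)));
      }
    }
    return latencies;
  }

  /** Counts checks that took strictly longer than the given threshold. */
  static long countAbove(List<Long> latencies, long thresholdMillis) {
    return latencies.stream().filter(latency -> latency > thresholdMillis).count();
  }

  public static void main(String[] args) {
    List<String> sample = List.of(
        "2021-08-26 19:05:20.605 [qtp560041895-31] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [5ms]",
        "2021-08-26 19:05:25.118 [grpc-default-executor-4] INFO  o.m.d.connection - Opened connection [connectionId{localValue:15, serverValue:348458}] to mongo",
        "2021-08-26 19:05:33.199 [qtp560041895-30] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [2498ms]",
        "2021-08-26 19:06:40.603 [qtp560041895-32] INFO  o.h.c.a.s.AttributeServiceEntry - health check result: true [3ms]");
    List<Long> latencies = latenciesMillis(sample);
    System.out.println("latencies (ms): " + latencies);
    System.out.println("checks over 1000ms: " + countAbove(latencies, 1000));
  }
}
```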

Testing

Deployed and tuned this a bit in a test environment running with a mesh and constrained resources.

codecov · 2021-08-26T19:36:01Z

Codecov Report

Merging #102 (8c250e3) into main (494351f) will decrease coverage by 0.30%.
The diff coverage is 62.50%.

@@             Coverage Diff              @@
##               main     #102      +/-   ##
============================================
- Coverage     81.59%   81.28%   -0.31%     
  Complexity      232      232              
============================================
  Files            28       28              
  Lines           788      791       +3     
  Branches         57       57              
============================================
  Hits            643      643              
- Misses           97      100       +3     
  Partials         48       48

Flag	Coverage Δ
integration	`81.28% <62.50%> (-0.31%)`	⬇️
unit	`68.25% <ø> (ø)`

Impacted Files	Coverage Δ
.../core/attribute/service/AttributeServiceEntry.java	`77.50% <62.50%> (-6.29%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 494351f...8c250e3. Read the comment docs.

github-actions · 2021-08-26T23:33:26Z

Unit Test Results

21 files ±0 21 suites ±0 12s ⏱️ ±0s
77 tests ±0 77 ✔️ ±0 0 💤 ±0 0 ❌ ±0

Results for commit 00abb99. ± Comparison against base commit 494351f.

aaron-steinfeld added 4 commits August 26, 2021 14:14

chore: tweak default probe behavior

bff6165

chore: move timeout to config

e0b04d6

chore: tweak defaults

a678322

chore: more tweak

8c250e3

aaron-steinfeld requested review from avinashkolluru, laxmanchekka, ravisingal, skjindal93, surajpuvvada and tim-mwangi August 26, 2021 19:36

aaron-steinfeld marked this pull request as ready for review August 26, 2021 19:37

aaron-steinfeld requested a review from a team August 26, 2021 19:37

surajpuvvada approved these changes Aug 26, 2021

tim-mwangi approved these changes Aug 26, 2021

aaron-steinfeld merged commit 00abb99 into main Aug 26, 2021

aaron-steinfeld deleted the tweak-probes branch August 26, 2021 23:32

aaron-steinfeld mentioned this pull request Nov 5, 2021

Merge helm charts hypertrace/hypertrace-alert-engine#80

Merged
