Skip to content

JAVA-6155 add thread dumps to evergreen logs on failed test cases#1932

Open
strogiyotec wants to merge 2 commits into
mongodb:mainfrom
strogiyotec:JAVA-6155
Open

JAVA-6155 add thread dumps to evergreen logs on failed test cases#1932
strogiyotec wants to merge 2 commits into
mongodb:mainfrom
strogiyotec:JAVA-6155

Conversation

@strogiyotec
Copy link
Copy Markdown
Contributor

@strogiyotec strogiyotec commented Apr 1, 2026

JAVA-6155
This PR adds thread dumps whenever test fails
The motivation behind this PR was a failing test case poolClearedExceptionMustBeRetryable
Claude identified a deadlock that I wasn't able to reproduce locally
With this PR develop could easily see thread dumps of all the threads directly in evergreen and locally

Here is an example of a patch that deliberately makes a test case fail and it's output

@strogiyotec strogiyotec requested a review from a team as a code owner April 1, 2026 18:21
@strogiyotec strogiyotec requested review from Copilot and nhachicha and removed request for nhachicha April 1, 2026 18:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a JUnit 5 TestWatcher extension that emits a full JVM thread dump to the test logs whenever a test fails, making deadlocks and hung-thread scenarios easier to diagnose in Evergreen and locally.

Changes:

  • Add ThreadDumpOnFailureExtension (JUnit Jupiter TestWatcher) that logs a thread dump on test failure.
  • Register the extension via META-INF/services/... and enable JUnit Jupiter extension autodetection via junit-platform.properties.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
driver-core/src/test/unit/com/mongodb/internal/diagnostics/ThreadDumpOnFailureExtension.java Adds the failure hook and thread-dump collection logic.
driver-core/src/test/resources/META-INF/services/org.junit.jupiter.api.extension.Extension Registers the extension for ServiceLoader-based autodetection.
driver-core/src/test/resources/junit-platform.properties Enables JUnit Jupiter extension autodetection so the extension is picked up automatically.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

public void testFailed(final ExtensionContext context, final Throwable cause) {
String testName = context.getDisplayName();
String threadDump = getAllThreadsDump();
LOGGER.error("Test failed: " + testName + "\nThread dump:\n" + threadDump);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testFailed ignores the provided cause, so the log entry won’t include the exception/stack trace that triggered the failure. Logging the throwable (e.g., using the error(String, Throwable) overload) would keep the failure reason adjacent to the thread dump in Evergreen/local logs.

Suggested change
LOGGER.error("Test failed: " + testName + "\nThread dump:\n" + threadDump);
LOGGER.error("Test failed: " + testName + "\nThread dump:\n" + threadDump, cause);

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

@strogiyotec strogiyotec Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cause will already be provided on assertion fail I don't think duplicating it for thread dumps makes sense

Comment on lines +39 to +50
public void testFailed(final ExtensionContext context, final Throwable cause) {
String testName = context.getDisplayName();
String threadDump = getAllThreadsDump();
LOGGER.error("Test failed: " + testName + "\nThread dump:\n" + threadDump);
}

private static String getAllThreadsDump() {
final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
ThreadInfo[] threadInfos = threadMXBean.dumpAllThreads(
threadMXBean.isObjectMonitorUsageSupported(),
threadMXBean.isSynchronizerUsageSupported()
);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getAllThreadsDump() (via ThreadMXBean.dumpAllThreads) can throw at runtime (e.g., SecurityException). If this happens inside testFailed, the extension itself may error and you’ll lose the diagnostic output. Consider wrapping thread-dump generation in a try/catch and logging a clear fallback message so the original test failure reporting isn’t impacted.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, wrapped it with try catch

*/
public final class ThreadDumpOnFailureExtension implements TestWatcher {

private static final Logger LOGGER = Loggers.getLogger(ThreadDumpOnFailureExtension.class.getName());
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logger naming is inconsistent with other driver-core tests: most use a short suffix like ClassName.class.getSimpleName() (e.g., ConnectionPoolTest / ConnectionPoolAsyncTest). Passing class.getName() here yields a very long logger name (org.mongodb.driver.com.mongodb...) that’s harder to filter/read in logs; consider switching to getSimpleName() or a conventional test logger suffix.

Suggested change
private static final Logger LOGGER = Loggers.getLogger(ThreadDumpOnFailureExtension.class.getName());
private static final Logger LOGGER = Loggers.getLogger(ThreadDumpOnFailureExtension.class.getSimpleName());

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced to use simple name

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +19 to +30
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;


import com.mongodb.internal.diagnostics.logging.Logger;
import com.mongodb.internal.diagnostics.logging.Loggers;

import org.junit.jupiter.api.extension.ExtensionContext;
import org.junit.jupiter.api.extension.TestWatcher;


Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import order/grouping here is inconsistent with the repo’s established convention (typically project com.mongodb.* imports first, then org.*, then java.*, e.g. driver-core/src/test/unit/com/mongodb/internal/connection/ConnectionPoolTest.java:19-28). Reorder the imports to match the surrounding codebase style to keep diffs consistent and reduce merge noise.

Suggested change
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import com.mongodb.internal.diagnostics.logging.Logger;
import com.mongodb.internal.diagnostics.logging.Loggers;
import org.junit.jupiter.api.extension.ExtensionContext;
import org.junit.jupiter.api.extension.TestWatcher;
import com.mongodb.internal.diagnostics.logging.Logger;
import com.mongodb.internal.diagnostics.logging.Loggers;
import org.junit.jupiter.api.extension.ExtensionContext;
import org.junit.jupiter.api.extension.TestWatcher;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

Copilot uses AI. Check for mistakes.
Comment on lines +39 to +43
public void testFailed(final ExtensionContext context, final Throwable cause) {
String testName = context.getDisplayName();
String threadDump = getAllThreadsDump();
LOGGER.error("Test failed: " + testName + "\nThread dump:\n" + threadDump);
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logs a full JVM thread dump on every failed test. When failures cascade (e.g., environment outage causing many tests to fail), this can produce very large Evergreen logs and make the original failures harder to find. Consider throttling (e.g., only dump once per JVM via an AtomicBoolean, or once per test class) and/or adding a clear header/footer delimiter so each dump is easy to locate in the log stream.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1 @@
junit.jupiter.extensions.autodetection.enabled=true
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

junit-platform.properties is included in driver-core’s testArtifacts JAR (see buildSrc/src/main/kotlin/conventions/test-artifacts.gradle.kts:26-29), and several other modules depend on :driver-core:testArtifacts for their tests (e.g., driver-sync/build.gradle.kts:37-39). That means this setting may unintentionally enable JUnit Jupiter extension autodetection beyond driver-core depending on classpath resource resolution. If the intent is to scope this behavior to a specific module/test task, consider enabling it via the Gradle Test task/system property instead, or placing a shared properties file in a deliberately shared test-resources location.

Suggested change
junit.jupiter.extensions.autodetection.enabled=true
# NOTE:
# This file is included in driver-core's testArtifacts JAR and is therefore
# visible on the test classpath of other modules that depend on :driver-core:testArtifacts.
# Enabling JUnit Jupiter extension autodetection here would unintentionally
# turn it on for those modules as well.
#
# If you need autodetection, configure it in the specific module or Gradle
# Test task instead (e.g. via system property junit.jupiter.extensions.autodetection.enabled=true),
# or in a module-local junit-platform.properties that is not shared via testArtifacts.
#
# Example (do NOT uncomment here unless you really want it shared globally):
# junit.jupiter.extensions.autodetection.enabled=true

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

@nhachicha nhachicha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI review (Opus 3.8 1M)

driver-core/src/test/resources/junit-platform.properties

  • 🟡 [important] L1: Autodetection leaks to downstream modules via testArtifacts.
    Verified: buildSrc/.../test-artifacts.gradle.kts builds the test JAR with from(sourceSets.test.get().output), which bundles processed test resources (including this file and the META-INF/services
    registration) alongside compiled test classes. driver-sync/build.gradle.kts:38 (and other modules) declare testImplementation(project(path = ":driver-core", configuration = "testArtifacts")). Net effect:
    junit.jupiter.extensions.autodetection.enabled=true and the ThreadDumpOnFailureExtension get activated for every consuming module's test run, not just driver-core. Two risks: (a) unintended global
    activation the author may not have planned for; (b) if any consuming module ships its own junit-platform.properties, JUnit Platform's classpath resolution of a single junit-platform.properties becomes
    order-dependent/non-deterministic. If global activation is actually desired, that's defensible — but it should be a deliberate, documented decision. Otherwise scope it via the Gradle Test task system
    property (systemProperty("junit.jupiter.extensions.autodetection.enabled", true)) in the relevant module(s).

driver-core/src/test/unit/com/mongodb/internal/diagnostics/ThreadDumpOnFailureExtension.java

  • 🟡 [important] L44-54: ThreadInfo.toString() truncates each stack to 8 frames.
    The JDK's ThreadInfo.toString() is hardcoded to emit at most MAX_FRAMES = 8 stack frames per thread, then prints .... Given the stated motivation (diagnosing a deadlock), 8 frames is often too shallow
    to see where threads are actually blocked. For a complete dump, iterate frames yourself, e.g. build the header from ThreadInfo and append the full info.getStackTrace(). The lock/monitor info you correctly
    enabled (below) also only renders for frames within that 8-frame window when using toString().

  • 🟢 [nit] L46-49: Captures dumps only for tests that fail, not tests that hang.
    TestWatcher.testFailed fires only when a test produces a failure. A true deadlock that hangs until the Gradle/JVM test timeout kills the process never reaches testFailed, so no dump is produced for that
    case. Worth being aware of relative to the deadlock-diagnosis goal; a JVM-level/@Timeout-driven approach would be needed to cover hangs. Not blocking for the failure-path use case.

  • 🟢 [nit] L20-31: Import grouping and double blank lines.
    Imports are grouped java.* first then com.mongodb.*/org.*, with doubled blank lines (L21-22, L29-30). I checked config/checkstyle/checkstyle.xml — only RedundantImport/UnusedImports are enforced,
    so this will not fail CI. Purely cosmetic; tidy up the consecutive blank lines if touching the file.

  • 🟢 [nit] L48-52: No delimiter between concatenated thread entries.
    sb.append(info) relies on each ThreadInfo.toString() trailing newline. A clear header/footer (e.g. ===== THREAD DUMP (test: X) =====) would make dumps far easier to locate and bound in large Evergreen
    logs.

  • 🎉 [praise] L46-49: Good defensive dumpAllThreads call — passing isObjectMonitorUsageSupported() / isSynchronizerUsageSupported() captures lock-ownership info, which is exactly what's useful for
    deadlock analysis. Nicely done, and the SecurityException fallback keeps the extension from masking the real test failure.

Prior Review Comments

  • 🎉 [praise] Extension.java:50Wrap thread-dump generation in try/catch. Addressed — current code catches SecurityException and returns a fallback message.
  • 🎉 [praise] Extension.java:36Logger naming should use a short suffix. Addressed — now Loggers.getLogger(ThreadDumpOnFailureExtension.class.getSimpleName()). Confirmed against
    Loggers.getLogger (prepends org.mongodb.driver.), so getSimpleName() is the right convention.
  • 💡 [suggestion] Extension.java:42Log the cause too. Disputed by the author ("the cause will already be provided on assertion fail"). Reasonable: JUnit already reports the failure cause
    independently, so this is largely redundant. Minor counterpoint — logging the cause keeps it adjacent to the dump in the same log line for correlation — but not worth blocking.
  • 🟡 [important] junit-platform.properties:1testArtifacts JAR propagates this setting to other modules. Unaddressed. Merged with the L1 finding above; I independently verified the build wiring
    confirms the concern.
  • 🟢 [nit] Extension.java:43Full dump on every failed test can flood logs on cascading failures. Unaddressed. Valid for failure storms; consider throttling (once per JVM via AtomicBoolean)
    and/or the delimiter suggested above. Low priority for a diagnostic util.
  • 🟢 [nit] Extension.java:30Import order inconsistent with repo convention. Unaddressed. Cosmetic only (not Checkstyle-enforced) — see nit above.

Decision:

  1. junit-platform.properties propagation — confirm whether enabling autodetection + the extension across all :driver-core testArtifacts consumers (driver-sync, etc.) is intended. If yes, add a
    comment in the properties file documenting it. If no, scope it via a per-module Gradle Test system property instead.
  2. 8-frame truncation — replace ThreadInfo.toString() with explicit iteration over info.getStackTrace() so deep deadlock stacks aren't cut off at 8 frames.
AI review copilot (GPT 5.5) Medium — driver-core/src/test/resources/junit-platform.properties:1

Placing junit-platform.properties under driver-core/src/test/resources means it is packaged into driver-core:testArtifacts, which is consumed by driver-sync, driver-reactive-streams, driver-scala, anddriver-legacy. Those modules will unintentionally inherit JUnit extension autodetection and load ThreadDumpOnFailureExtension for all their test failures.

That changes behavior outside driver-core and could produce very large logs during broad/cascading failures. Suggested fix: avoid classpath-wide autodetection from shared test artifacts, or configure it per module via Gradle/system property if global behavior is intended.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants