feat: native socket I/O tracking via PLT hooks (PROF-10637) #488

Draft
jbachorik wants to merge 2 commits into main from muse/impl-20260416-133933

@jbachorik (Collaborator) commented Apr 16, 2026

What does this PR do?:
Intercepts libc send/recv calls via PLT hooking and records
datadog.NativeSocketEvent JFR events. Events are selected by byte-weighted
inverse-transform sampling, with a PID controller holding the rate near the
target of ~5000 events/min.
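
One way to read "byte-weighted inverse-transform sampling" is sketched below. This is an assumption about the technique, not the PR's NativeSocketSampler: the class name, the exponential interval, and the weight formula are all illustrative. The gap to the next sample point is drawn as -mean * ln(1 - U) (the inverse transform of the exponential distribution), each send/recv consumes its byte count from that gap, and a sampled operation of n bytes carries weight 1 / (1 - exp(-n / mean)) so that the weights sum to roughly the true operation count.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Illustrative only: ByteWeightedSampler is a hypothetical name, not the PR's class.
class ByteWeightedSampler {
public:
    ByteWeightedSampler(double mean_interval_bytes, uint64_t seed)
        : _mean(mean_interval_bytes), _rng(seed), _uniform(0.0, 1.0) {
        assert(mean_interval_bytes > 0.0);
        _remaining = nextInterval();
    }

    // Returns 0.0 when the operation is not sampled, otherwise the weight
    // (the estimated number of operations this sample stands for).
    double record(uint64_t bytes) {
        if (bytes == 0) return 0.0;            // nothing transferred, never sampled
        double n = static_cast<double>(bytes);
        if (n < _remaining) {                  // sample point lies beyond this operation
            _remaining -= n;
            return 0.0;
        }
        // The exponential distribution is memoryless, so a fresh interval is
        // drawn from the end of the sampled operation.
        _remaining = nextInterval();
        // P(sampled) = 1 - exp(-n / mean); the weight is its inverse.
        return 1.0 / -std::expm1(-n / _mean);
    }

private:
    // Inverse-transform sampling of an exponential interval with the given mean.
    double nextInterval() {
        double u = _uniform(_rng);             // u in [0, 1)
        return -_mean * std::log1p(-u);
    }

    double _mean;
    std::mt19937_64 _rng;
    std::uniform_real_distribution<double> _uniform;
    double _remaining;
};
```

With this scheme large transfers are almost always sampled with weight close to 1, while small transfers are sampled rarely and carry a large weight. A rate controller would then adjust the mean interval to keep the event rate near the target.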

Motivation:
Track blocking TCP socket I/O at the libc function level to surface socket
latency and throughput in the Datadog profiler. Netty with the Java NIO
transport (the primary use case) goes through libc send/recv, making PLT
patching the right interception point. The feature is opt-in and is enabled
only by the nativesocket profiler argument. Implements PROF-10637.

Additional Notes:

Key design decisions:

  • PLT-hooks send/recv only (TCP blocking I/O); UDP sendto/recvfrom
    and Netty native epoll/io_uring are explicitly out of scope
  • Hook bodies run on the calling Java thread, not in a signal handler, so
    malloc and locking are safe inside hooks
  • fd-to-remote-address cache is not cleared on JFR chunk boundaries:
    the same fd may remain valid across chunks and clearing would race with
    in-flight recordings
  • _orig_send/_orig_recv are intentionally not nulled in
    unpatch_socket_functions to avoid a memory-ordering race with in-flight
    hook invocations on aarch64
  • NativeSocketEvent struct lives in nativeSocketSampler.h (not
    event.h) because it is only used by NativeSocketSampler
  • macOS: entire implementation compiled out as no-op stubs (#ifdef __linux__)
  • Known limitation: libraries loaded via dlopen after start() are not
    patched, so their send/recv calls are not intercepted (documented in
    libraryPatcher.h)
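
The delegation and "never null the original pointer" decisions above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: g_orig_send, g_enabled, g_bytes_seen, install and uninstall are hypothetical stand-ins for _orig_send and the patch/unpatch functions, whose real signatures are not shown in this PR description.

```cpp
#include <atomic>
#include <cstddef>
#include <sys/types.h>  // ssize_t

// Signature of libc send(2).
using send_fn = ssize_t (*)(int, const void*, size_t, int);

// Hypothetical stand-ins for the PR's _orig_send and sampler state.
static std::atomic<send_fn> g_orig_send{nullptr};
static std::atomic<bool> g_enabled{false};
static std::atomic<long> g_bytes_seen{0};

// Hook body: runs on the calling thread, always delegates to the original.
// It must only be reachable after install() has stored a valid pointer.
ssize_t send_hook(int fd, const void* buf, size_t len, int flags) {
    send_fn orig = g_orig_send.load(std::memory_order_acquire);
    ssize_t ret = orig(fd, buf, len, flags);
    if (ret > 0 && g_enabled.load(std::memory_order_relaxed)) {
        g_bytes_seen.fetch_add(ret, std::memory_order_relaxed);
    }
    return ret;
}

// Publish the original pointer before the hook becomes reachable.
void install(send_fn real_send) {
    g_orig_send.store(real_send, std::memory_order_release);
    g_enabled.store(true, std::memory_order_relaxed);
}

// Stop recording but deliberately keep g_orig_send set: a thread that is
// already inside send_hook (or enters it late through a stale PLT slot)
// can still load a valid pointer and complete its call.
void uninstall() {
    g_enabled.store(false, std::memory_order_relaxed);
}
```

Because the pointer is written once and never cleared, a late hook invocation degrades to a plain pass-through call rather than a null dereference.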

How to test the change?:

  • GTest unit tests: NativeSocketSamplerHookTest in
    ddprof-lib/src/test/cpp/nativeSocketSampler_ut.cpp — verify that
    send_hook/recv_hook delegate to the installed _orig_send/_orig_recv
    function pointers
  • 12 JUnit integration tests in
    ddprof-test/src/test/java/com/datadoghq/profiler/nativesocket/:
    • NativeSocketEnabledTest — events produced when feature is enabled
    • NativeSocketEventFieldsTest — all 8 required JFR fields present and valid
    • NativeSocketSendRecvSeparateTest — SEND and RECV events tracked independently
    • NativeSocketDisabledTest — no events when feature is not enabled
    • NativeSocketRateLimitTest — event count is substantially less than operation count (subsampling active), weight > 1 on sampled events
    • NativeSocketRemoteAddressTest — remoteAddress field is in ip:port format
    • NativeSocketMacOsNoOpTest — no events on macOS (no-op stub)
    • NativeSocketStackTraceTest — stack trace captured on events
    • NativeSocketBytesAccuracyTest — bytesTransferred field matches actual bytes
    • NativeSocketUdpExcludedTest — UDP sends do not produce events
    • NativeSocketEventThreadTest — eventThread field populated with calling thread
    • NativeSocketNettyNioTest — Netty 4.x with NioEventLoopGroup produces events
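
For the ip:port format checked by NativeSocketRemoteAddressTest, a formatter along these lines would produce the expected shape. It is a sketch only: formatRemoteAddress is a hypothetical helper, and the bracketed form for IPv6 is an assumption, since the PR description only mentions ip:port. In a real hook the sockaddr would typically come from getpeername() on the fd.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Hypothetical helper: renders a peer address as "ip:port".
// Returns an empty string for unsupported address families.
std::string formatRemoteAddress(const sockaddr* addr) {
    char ip[INET6_ADDRSTRLEN] = {0};
    if (addr->sa_family == AF_INET) {
        const sockaddr_in* in4 = reinterpret_cast<const sockaddr_in*>(addr);
        if (inet_ntop(AF_INET, &in4->sin_addr, ip, sizeof(ip)) == nullptr) {
            return "";
        }
        return std::string(ip) + ":" + std::to_string(ntohs(in4->sin_port));
    }
    if (addr->sa_family == AF_INET6) {
        const sockaddr_in6* in6 = reinterpret_cast<const sockaddr_in6*>(addr);
        if (inet_ntop(AF_INET6, &in6->sin6_addr, ip, sizeof(ip)) == nullptr) {
            return "";
        }
        // Assumption: IPv6 literals are bracketed so the port stays unambiguous.
        return "[" + std::string(ip) + "]:" + std::to_string(ntohs(in6->sin6_port));
    }
    return "";
}
```

Caching the resulting string per fd, as the design notes describe, would avoid repeating the lookup on every sampled call.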

Spec: #486

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-10637

Track libc send/recv calls with inverse-transform sampling (PID rate
control, ~5000 events/min) and emit NativeSocketEvent JFR events.

- PLT-hook send/recv in all loaded native libraries via LibraryPatcher
- NativeSocketSampler: byte-weighted sampling, fd-to-addr cache, PID controller
- New JFR type: datadog.NativeSocketEvent (8 fields)
- Activated by 'nativesocket' profiler argument; Linux only, macOS no-op
- 12 integration tests (Netty NIO) + GTest unit tests for hook invocation

Resolves: PROF-10637

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
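
The PID rate control named in the commit message could look roughly like the textbook controller below. This is a hedged sketch, not the PR's controller: the class name, the gains, and the adjustInterval mapping from control signal to sampling interval are all hypothetical.

```cpp
#include <algorithm>

// Textbook PID controller; names and gains are illustrative assumptions.
class PidController {
public:
    PidController(double target, double kp, double ki, double kd)
        : _target(target), _kp(kp), _ki(ki), _kd(kd),
          _integral(0.0), _prev_error(0.0) {}

    // observed: events counted in the last window; dt: window length (> 0).
    // A positive result means too few events were recorded.
    double compute(double observed, double dt) {
        double error = _target - observed;
        _integral += error * dt;
        double derivative = (error - _prev_error) / dt;
        _prev_error = error;
        return _kp * error + _ki * _integral + _kd * derivative;
    }

private:
    double _target, _kp, _ki, _kd;
    double _integral, _prev_error;
};

// Hypothetical mapping: a positive signal shrinks the sampling interval
// (more samples), a negative one grows it, clamped to [min, max].
double adjustInterval(double interval, double signal,
                      double min_interval, double max_interval) {
    return std::min(max_interval, std::max(min_interval, interval - signal));
}
```

Calling compute once per observation window and feeding the result into the sampler's mean interval would pull the event rate toward the target, here ~5000 events/min.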
@dd-octo-sts (bot, Contributor) commented Apr 16, 2026

CI Test Results

Run: #24522147526 | Commit: 0656b90 | Duration: 30m 36s (longest job)

9 of 32 test jobs failed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Failed Tests

No detailed failure information is available for any of the failed jobs; check the job logs.

  • musl-amd64/debug / 25-librca
  • musl-aarch64/debug / 21-librca
  • musl-amd64/debug / 11-librca
  • musl-aarch64/debug / 17-librca
  • musl-aarch64/debug / 11-librca
  • musl-amd64/debug / 21-librca
  • musl-amd64/debug / 17-librca
  • musl-aarch64/debug / 25-librca
  • glibc-aarch64/debug / 8-j9

Summary: Total: 32 | Passed: 23 | Failed: 9


Updated: 2026-04-16 17:11:59 UTC

Remove the !Platform.isAarch64() guard: JDK 17/21/25 on aarch64 are already
skipped via Platform.isJavaVersion(8). For 8-j9 on aarch64, J9's
libnet.so calls send/recv via the PLT just as on amd64, so PLT hooking
works the same way on both architectures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>