[None][infra] Reduce Docker image layer count in release stage#14972
[None][infra] Reduce Docker image layer count in release stage#14972tburt-nv wants to merge 2 commits into
Conversation
Consolidate all COPY and RUN steps in the release stage into a single RUN using bind mounts and cp. This replaces 7 COPY layers and 2 RUN layers with a single RUN layer. Also use a bind mount for the OSS attribution script instead of COPY+RUN+rm. Reduces the release stage from ~13 layers to 3 (pip install, copy+setup, OSS attribution), saving ~10 layers in the published image. Signed-off-by: jbernloehr <jbernloehr@nvidia.com>
|
/bot run --stage-list "Build-Docker-Images" |
|
PR_Github #52147 [ run ] triggered by Bot. Commit: |
|
PR_Github #52147 [ run ] completed with state
|
Signed-off-by: Tyler Burt <195370667+tburt-nv@users.noreply.github.com>
|
/bot run |
|
PR_Github #52374 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughThe PR optimizes the Docker release stage by replacing sequential COPY statements with BuildKit bind mounts for more efficient artifact assembly. It removes benchmark source files, validates library presence, creates package symlinks, and refactors OSS attribution generation. Jenkins CI image tag properties are updated to reference the newly built images. ChangesDocker Build and CI Image Updates
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker/Dockerfile.multi`:
- Line 165: The ldd predicate can silently pass when there are no tensorrt_llm
lines; change the check to first assert the dependency exists and then assert it
is not "not found" — e.g. run ldd -v bin/executorWorker and test with grep -q
'tensorrt_llm' (fail if missing) and only then check that the matched lines do
NOT contain "not found" (e.g. ! grep tensorrt_llm | grep -q "not found");
reference the binary name bin/executorWorker and the dependency token
tensorrt_llm when updating the Dockerfile multi-line gate.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 14825cce-cb85-4f12-b17e-5176b75b12ae
📒 Files selected for processing (2)
docker/Dockerfile.multijenkins/current_image_tags.properties
| test -f lib/libnvinfer_plugin_tensorrt_llm.so && \ | ||
| echo "/app/tensorrt_llm/lib" > /etc/ld.so.conf.d/tensorrt_llm.conf && \ | ||
| ldconfig && \ | ||
| ! ( ldd -v bin/executorWorker | grep tensorrt_llm | grep -q "not found" ) && \ |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Demonstrate why the current predicate can pass when tensorrt_llm is absent.
set -euo pipefail
if ! ( printf 'libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6\n' | grep tensorrt_llm | grep -q "not found" ); then
echo "Current check passes even with no tensorrt_llm line (false negative)."
fiRepository: NVIDIA/TensorRT-LLM
Length of output: 133
🏁 Script executed:
#!/bin/bash
set -euo pipefail
nl -ba docker/Dockerfile.multi | sed -n '145,190p'Repository: NVIDIA/TensorRT-LLM
Length of output: 106
🏁 Script executed:
#!/bin/bash
set -euo pipefail
awk 'NR>=145 && NR<=190 { printf "%d:%s\n", NR, $0 }' docker/Dockerfile.multiRepository: NVIDIA/TensorRT-LLM
Length of output: 2347
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "SHELL|pipefail" docker/Dockerfile.multi || trueRepository: NVIDIA/TensorRT-LLM
Length of output: 92
Strengthen the ldd gate to fail when executorWorker has no tensorrt_llm dependency lines. (docker/Dockerfile.multi:165)
The current predicate can pass when ldd output contains no tensorrt_llm matches at all, since grep tensorrt_llm producing no input also makes grep -q "not found" return non-match, which is then inverted by !.
Suggested change
- ! ( ldd -v bin/executorWorker | grep tensorrt_llm | grep -q "not found" ) && \
+ ldd -v bin/executorWorker | grep -q "tensorrt_llm" && \
+ ! ( ldd -v bin/executorWorker | grep "tensorrt_llm" | grep -q "not found" ) && \📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ! ( ldd -v bin/executorWorker | grep tensorrt_llm | grep -q "not found" ) && \ | |
| ldd -v bin/executorWorker | grep -q "tensorrt_llm" && \ | |
| ! ( ldd -v bin/executorWorker | grep "tensorrt_llm" | grep -q "not found" ) && \ |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docker/Dockerfile.multi` at line 165, The ldd predicate can silently pass
when there are no tensorrt_llm lines; change the check to first assert the
dependency exists and then assert it is not "not found" — e.g. run ldd -v
bin/executorWorker and test with grep -q 'tensorrt_llm' (fail if missing) and
only then check that the matched lines do NOT contain "not found" (e.g. ! grep
tensorrt_llm | grep -q "not found"); reference the binary name
bin/executorWorker and the dependency token tensorrt_llm when updating the
Dockerfile multi-line gate.
|
/bot run --disable-fail-fast |
|
PR_Github #52374 [ run ] completed with state
|
|
PR_Github #52388 [ run ] triggered by Bot. Commit: |
|
PR_Github #52388 [ run ] completed with state |
|
/bot run --disable-fail-fast |
Description
This PR is intended to supersede @janbernloehr's PR #11987, since it needs some minor tweaks.
Here is Jan's original summary:
Follow-up to #11720 which already consolidated the devel stage layers using bind mounts.
This PR applies the same pattern to the release stage:
Replace 7 COPY layers + 2 RUN layers with a single RUN using --mount=type=bind and cp, consolidating all file copies, symlink creation, and cleanup into one layer
Replace the OSS attribution COPY + RUN + rm with a single RUN using a bind mount
Reduces the release stage from ~13 layers to 3 (pip install, copy+setup+cleanup, OSS attribution), saving ~10 layers in the published image and leaving more headroom under Docker's 128-layer overlay2 limit for downstream consumers.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.Summary by CodeRabbit