ci: validate PR #150 fix against stress harness (x86_64)#2
Closed
neilsh wants to merge 6 commits into
Closed
Conversation
- noxfile.py: smoke runs on host Python (fast build/import feedback); stress runs in python:X.Y-slim Docker so the same nox -s stress-X.Y command works identically on macOS and Linux CI - pyproject.toml: minimal PEP 517 build-system block (no cibuildwheel) - .github/workflows/test.yml: matrix across Python 3.7–3.13 on ubuntu-latest, running both nox sessions per version - tests/smoke.py: install + start() check; tolerates missing GCP creds - scripts/_stress.py: asyncio + SIGPROF reproducer for issue GoogleCloudPlatform#142; always runs inside Linux container via nox - .gitignore: exclude .nox/, build artifacts, core dumps Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove 3.7 from matrix: ubuntu-24.04 (now ubuntu-latest) dropped support for it in actions/setup-python; 3.7 is also EOL since 2023 - Copy source to /tmp inside the stress container before building, so the C extension is always compiled fresh against the container's glibc. Without this, host-built .so artifacts from the smoke step (compiled on ubuntu-24.04 / glibc 2.39) were being picked up by pip inside the python:3.8-slim container, causing a GLIBC_2.38 mismatch at import time. - Mount the source volume :ro to make the isolation requirement explicit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sharing a workspace between smoke and stress let smoke's host-built _profiler*.so leak into the stress container via cp -r, causing stress-3.8 to fail with a glibc 2.38 mismatch in python:3.8-slim (bookworm, glibc 2.36) on the latest run. The host-side build artifacts are gone now that stress runs in a fresh job. Side effects: a smoke failure no longer skips stress on the same cell (e.g., the known smoke-3.13 break no longer hides stress-3.13 data). Stress's host setup-python is pinned to 3.12 since stress runs entirely inside python:X.Y-slim Docker; the host interpreter only needs to drive nox. Also adds a concurrency block to cancel superseded runs and drops the dead 'master' branch trigger (repo only has main).
Reviewer found three classes of issue in the harness: 1. _stress.py daemon thread silently swallowed Python exceptions from cpu_profiler.profile() — only true segfaults brought down the process. A C-extension regression that raised instead of segfaulting would print STAGE:stress:OK and exit 0. Capture exceptions via a module-level slot and SystemExit before the OK marker. 2. tests/smoke.py kept a 2-second sleep with the comment "give the agent thread a moment to crash if it's going to" — but without GCP creds, start() raises DefaultCredentialsError from setup_auth before any agent thread is created, so the sleep was dead weight and the docstring overstated coverage. Drop the sleep, narrow the catch from bare Exception to (DefaultCredentialsError, RefreshError) so future ValueError/TypeError regressions surface instead of being swallowed, and rewrite the docstring to match what's actually validated (import + start() preflight). 3. nox.options.error_on_missing_interpreters defaulted on, so bare `nox` failed for any contributor without all six pyenv versions. Set it to False; smoke skips missing interpreters and stress is Docker-based regardless. Also: loop the stress run STRESS_RUNS times (default 3 × 20s = 60s, matching the previous wall budget) for higher per-cell catch probability; add STAGE:install:OK marker to disambiguate install failures from SIGSEGVs in the matrix view; drop >/dev/null on apt-get to keep error context; reword two ambiguous docstrings; and update the cp -r comment to reflect the read-only-mount rationale (the smoke-artifact concern is gone now that smoke and stress are in separate jobs).
Owner
Author
|
Validation complete. PR GoogleCloudPlatform#150 still segfaults on x86_64 CI (stress-3.11 and stress-3.12, exit 139 on run 1). Results documented in PR #1. Closing. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Validation-only branch — do not merge.
Cherry-picks
b33c941(upstream PR GoogleCloudPlatform#150 — "Add defensive checks for guarding against null frames") on top of the hardened stress harness from PR #1.Purpose: confirm that
stress-3.11andstress-3.12turn green on x86_64 CI when the null-frame defensive checks are in place.mainOnce CI results are captured this branch will be deleted and the test-plan item in PR #1 will be checked off.
🤖 Generated with Claude Code