ci(fuzzer/postgres): build postgres-fuzzer from go-fuzzers branch for fix validation#4205

Open
ayush3160 wants to merge 5 commits into main from ci/postgres-fuzzer-build-from-source

@ayush3160
Collaborator

Describe the changes that are made

  • Swap the postgres-fuzzer source from the S3-released tarball (releases/postgres-fuzzer/latest/postgres-fuzzer-linux-amd64.tar.gz) to an inline git clone --depth 1 --branch fix/postgres-fuzzer-determinism keploy/go-fuzzers + go build ./postgres step in .github/workflows/fuzzer_linux.yml.
  • Uses the existing PRO_ACCESS_TOKEN secret (already plumbed via prepare_and_run.yml:713 and consumed by the mongo fuzzer script the same way) — no new secrets required.
  • Other postgres-fuzzer steps (AWS setup, the actual Run Postgres Fuzzer Test step, artifact upload) are untouched so reverting back to the S3 download is a single-step change once the go-fuzzers PR is merged and the release pipeline re-uploads the binary.

Why this change is wanted

The postgres-fuzzer matrix on fuzzer_linux.yml has been failing for cross-version configs on recent PRs (most notably #4203): record_latest_replay_build and record_build_replay_latest consistently take ~18 min and fail with Post "http://localhost:8080/fuzz": context deadline exceeded, triggered by keploy test's --api-timeout=1000. Local reproduction traced the root cause to the postgres fuzzer itself, not to keploy:

  1. pickRandomTable iterates s.tables with for k := range — Go randomizes map iteration, so even with seed=42 record and replay pick different tables once len(s.tables) >= 2.
  2. generateValueForType returns time.Now().Add(...) for TIMESTAMP / DATE columns. Wall-clock time changes between record and replay, so the bind values never match, which produces the "bind values diverged from every recorded invocation" flavour of error.
  3. activeQueries.Add(1) / Done() is not paired by defer. When pgx panics inside Rows.Next (runtime.goPanicIndex(0x5, 0x5) when a Keploy mock's row shape doesn't match the live query), Done() never runs, the WaitGroup stays at +1, and the deferred cleanupSessions.Wait() blocks until --api-timeout=1000 fires ~16.94 min later.

The fix sits on keploy/go-fuzzers#fix/postgres-fuzzer-determinism. Local validation against the same record_latest_replay_build config that has been failing went from a 16.94 min hang to a 20.23 s pass (test-set-0: PASSED, 2/2 testcases).

This PR is the minimal-blast-radius way to verify that fix on real CI runners (same eBPF capture, same matrix, same --api-timeout) before promoting the binary through the release pipeline.

Links & References

Closes: NA

🔗 Related PRs

🐞 Related Issues

  • NA

📄 Related Documents

  • NA

What type of PR is this? (check all applicable)

  • 📦 Chore
  • 🍕 Feature
  • 🐞 Bug Fix
  • 📝 Documentation Update
  • 🎨 Style
  • 🧑‍💻 Code Refactor
  • 🔥 Performance Improvements
  • ✅ Test
  • 🔁 CI
  • ⏩ Revert

Added e2e test pipeline?

  • 👍 yes: this PR itself runs the postgres-fuzzer matrix as the e2e signal; the success criterion is "all three configs pass" instead of two failing at ~18 min

Added comments for hard-to-understand areas?

  • 👍 yes — the new step has an inline comment explaining the temporary nature of the build-from-source detour and how to revert

Added to documentation?

  • 🙅 no documentation needed

Are there any sample code or steps to test the changes?

  • 👍 yes, mentioned below

The PR's own CI run is the test. Compare these jobs on this PR vs the last main run / PR #4203:

| Config | main (stock fuzzer) | This PR (patched fuzzer) | Expected |
| --- | --- | --- | --- |
| Postgres Fuzzer (record_latest_replay_build) | mixed | should PASS in <5 min | green |
| Postgres Fuzzer (record_build_replay_latest) | failure ~18 min | should PASS in <5 min | green |
| Postgres Fuzzer (record_build_replay_build) | failure ~18 min | should PASS in <5 min | green |

If any of the three still fails, the residual signal is purely in the parser/integrations layer (no longer in the fuzzer's determinism / panic-safety).

Self Review done?

  • ✅ yes

Any relevant screenshots, recordings or logs?

Local run with the patched fuzzer against the same record_latest_replay_build script CI uses:

Status for test-set-0: PASSED
✅ All tests completed successfully.
    Total tests:        2
    Total test passed:  2
    Total test failed:  0
    Total time taken: 20.23 s

For comparison, the stock fuzzer on the same script:

Post "http://localhost:8080/fuzz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Total tests:        2
Total test passed:  1
Total test failed:  1
Total time taken: 16.94 min

The postgres-fuzzer matrix on the fuzzer_linux workflow has been failing
on `record_latest_replay_build` and `record_build_replay_latest` for
several PRs (notably #4203). Local reproduction showed the root cause
lives in the fuzzer itself — not in keploy — and the fix sits on
keploy/go-fuzzers#fix/postgres-fuzzer-determinism (deterministic table
pick + fixed TIMESTAMP/DATE base + panic-safe Add/Done so a pgx panic
no longer strands the WaitGroup for 17 minutes until --api-timeout
fires).

Temporarily swap the postgres-fuzzer download to a clone+build from
that go-fuzzers branch so CI can validate the fix end-to-end before
the go-fuzzers PR merges and re-uploads `releases/postgres-fuzzer/
latest/postgres-fuzzer-linux-amd64.tar.gz` to S3. Once that release
runs, this step should be reverted back to the S3 download in a
follow-up commit; the rest of the workflow (AWS setup, run step,
artifact upload) is left untouched so the revert is one-step.

The clone uses the existing PRO_ACCESS_TOKEN secret which is already
plumbed through `prepare_and_run.yml` and used by the mongo fuzzer
script the same way.

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>
Copilot AI review requested due to automatic review settings May 19, 2026 13:16
@ayush3160 ayush3160 requested a review from gouravkrosx as a code owner May 19, 2026 13:16
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Linux fuzzer CI workflow to build the postgres-fuzzer binary from the keploy/go-fuzzers repo (branch fix/postgres-fuzzer-determinism) instead of downloading the “latest” tarball from S3, enabling end-to-end CI validation of fuzzer-side determinism/panic-safety fixes before promoting a new released binary.

Changes:

  • Replace the Postgres fuzzer S3 download/extract step with a git clone + go build ./postgres build-from-source step.
  • Wire PRO_ACCESS_TOKEN and FUZZER_BRANCH into the build step and tighten shell safety (set -euo pipefail).
  • Keep downstream Postgres fuzzer test execution and artifact upload unchanged.

Comment on lines +238 to +247
+      - name: Build Postgres Fuzzer from go-fuzzers branch
         id: postgres_fuzzer
+        env:
+          PRO_ACCESS_TOKEN: ${{ secrets.PRO_ACCESS_TOKEN }}
+          FUZZER_BRANCH: fix/postgres-fuzzer-determinism
         run: |
-          set -e
-          KEY="releases/postgres-fuzzer/latest/postgres-fuzzer-linux-amd64.tar.gz"
-          aws s3 cp "s3://${{ vars.AWS_S3_BUCKET }}/${KEY}" .
-          tar -xzf postgres-fuzzer-linux-amd64.tar.gz
+          set -euo pipefail
+          if [[ -z "${PRO_ACCESS_TOKEN}" ]]; then
+            echo "::error::PRO_ACCESS_TOKEN secret is required to clone keploy/go-fuzzers"
+            exit 1

Comment on lines +241 to +254
+          PRO_ACCESS_TOKEN: ${{ secrets.PRO_ACCESS_TOKEN }}
+          FUZZER_BRANCH: fix/postgres-fuzzer-determinism
         run: |
-          set -e
-          KEY="releases/postgres-fuzzer/latest/postgres-fuzzer-linux-amd64.tar.gz"
-          aws s3 cp "s3://${{ vars.AWS_S3_BUCKET }}/${KEY}" .
-          tar -xzf postgres-fuzzer-linux-amd64.tar.gz
+          set -euo pipefail
+          if [[ -z "${PRO_ACCESS_TOKEN}" ]]; then
+            echo "::error::PRO_ACCESS_TOKEN secret is required to clone keploy/go-fuzzers"
+            exit 1
+          fi
+          git clone --depth 1 --branch "${FUZZER_BRANCH}" \
+            "https://${PRO_ACCESS_TOKEN}@github.com/keploy/go-fuzzers.git" \
+            /tmp/go-fuzzers
+          cd /tmp/go-fuzzers
+          go build -trimpath -ldflags "-s -w" \
+            -o "$GITHUB_WORKSPACE/postgres-fuzzer" ./postgres

Comment on lines +249 to +251
+          git clone --depth 1 --branch "${FUZZER_BRANCH}" \
+            "https://${PRO_ACCESS_TOKEN}@github.com/keploy/go-fuzzers.git" \
+            /tmp/go-fuzzers

@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

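| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.66ms | 3.3ms | 4.64ms | 100.00 | 0.00% | ✅ PASS |
| 2 | 2.6ms | 3.21ms | 4.41ms | 100.02 | 0.00% | ✅ PASS |
| 3 | 2.65ms | 3.43ms | 4.88ms | 100.03 | 0.00% | ✅ PASS |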

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers
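
As a reading aid for the thresholds and the 2-of-3 rule reported above, the following sketch shows one plausible way such a gate could be evaluated. It is not the bot's implementation: the `Run` type and both function names are hypothetical, and reading "±1% tolerance" as "RPS may fall up to 1% below 100" is an assumption.

```go
package main

import "fmt"

// Run holds the metrics reported for one performance run.
type Run struct {
	P50, P90, P99 float64 // latency percentiles in milliseconds
	RPS           float64 // requests per second
	ErrorRate     float64 // percent
}

// runPasses applies the thresholds from the report. The RPS floor of
// 100 * (1 - 0.01) is this sketch's interpretation of the tolerance.
func runPasses(r Run) bool {
	const targetRPS, tolerance = 100.0, 0.01
	return r.P50 < 5 && r.P90 < 15 && r.P99 < 70 &&
		r.RPS >= targetRPS*(1-tolerance) && r.ErrorRate < 1
}

// pipelinePasses fails the pipeline only when two or more runs regress.
func pipelinePasses(runs []Run) bool {
	failed := 0
	for _, r := range runs {
		if !runPasses(r) {
			failed++
		}
	}
	return failed < 2
}

func main() {
	// Values copied from the three runs in the table above.
	runs := []Run{
		{2.66, 3.3, 4.64, 100.00, 0},
		{2.6, 3.21, 4.41, 100.02, 0},
		{2.65, 3.43, 4.88, 100.03, 0},
	}
	fmt.Println(pipelinePasses(runs))
}
```

With the three runs from the table this prints `true`; a single regressing run would still pass the gate, while two would fail it.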

ayush3160 added 2 commits May 19, 2026 14:55
Empty commit to spawn a fresh workflow run so we get an independent
postgres-fuzzer sample. Will squash before merge.

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>
Empty commit for the final independent postgres-fuzzer sample.
Will squash before merge.

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>
@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

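| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.6ms | 3.19ms | 4.64ms | 100.00 | 0.00% | ✅ PASS |
| 2 | 2.54ms | 3.14ms | 4.65ms | 100.02 | 0.00% | ✅ PASS |
| 3 | 2.58ms | 3.26ms | 4.61ms | 100.02 | 0.00% | ✅ PASS |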

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

Last empty commit for the flake-check series. Will squash before merge.

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>
@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

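| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.91ms | 3.77ms | 5.34ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.78ms | 3.58ms | 5.05ms | 100.02 | 0.00% | ✅ PASS |
| 3 | 2.8ms | 3.84ms | 5.37ms | 100.00 | 0.00% | ✅ PASS |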

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers

@github-actions

🚀 Keploy Performance Test Results

Multi-Run Validation: Tests run 3 times, pipeline fails only if 2+ runs show regression.

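| Run | P50 | P90 | P99 | RPS | Error Rate | Status |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 2.65ms | 3.32ms | 4.84ms | 100.02 | 0.00% | ✅ PASS |
| 2 | 2.56ms | 3.18ms | 4.79ms | 100.02 | 0.00% | ✅ PASS |
| 3 | 2.58ms | 3.3ms | 5.06ms | 100.01 | 0.00% | ✅ PASS |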

Thresholds: P50 < 5ms, P90 < 15ms, P99 < 70ms, RPS >= 100 (±1% tolerance), Error Rate < 1%

Result: PASSED - Only 0 out of 3 runs failed (threshold: 2)

P50, P90, and P99 percentiles naturally filter out outliers
