where CPU cycles went. `IFC_CULL_THREADS=0` forces single-threaded mode
for comparison.

#### 3E. GPU compute culling — experiments, results, and current state

The original plan was to push the cull loop to a compute shader reading the
per-instance SSBO + frustum planes + HiZ pyramid, emitting the visible list
and indirect commands with atomic counters. HiZ would move to a GPU depth
texture sampled directly in the shader, eliminating the Phase 3C readback,
and culling would scale to millions of instances and to single-model scenes
where Phase 3D can't parallelise. The experiments below record what was
actually built and measured.

##### What we tried

**Attempt 1: Full GPU-driven rendering (reverted).** Five commits
(`4fe32b54`..`d5b7b87b`) moved the entire cull-to-draw pipeline onto
the GPU: a compute shader performed frustum + contribution + HiZ
culling, selected LOD0/LOD1, handled fwd/rev winding bucketing, wrote
indirect draw commands via `glMultiDrawElementsIndirectCount`, and
drove rendering without CPU readback. This was architecturally clean
but complex — the GPU built per-model indirect command buffers with
atomic counters, prefix sums, and per-bucket compaction. It worked
correctly but introduced code smells (extension loaders for
`glMultiDrawElementsIndirectCount` not exposed by Qt6's
`QOpenGLFunctions_4_5_Core`, ad-hoc GPU readbacks for validation).
All five commits were reverted as a single block to keep the codebase
clean while preserving the AABB SSBO upload (`b2044737`) and the
frustum-only validation shader (`b17860fc`).

**Attempt 2: GPU frustum-only validation shader.** A minimal compute
shader (64 threads/workgroup) testing each instance's AABB against 6
frustum planes. Used as a measurement baseline — no contribution,
HiZ, LOD, or winding. Results on a 1.06 M-instance / 111-model scene
(GTX 1650):

| Metric | GPU frustum-only | CPU BVH (parallel) |
|--------|------------------|--------------------|
| Cull time | **0.82 ms** (GPU timestamp) | 9.6–15.2 ms wall |
| Survivors | 279 k (frustum only) | 130 k (frustum + contribution + HiZ) |

The GPU brute-force scan of 1.06 M instances in 0.82 ms was 12–18×
faster than the CPU BVH walk despite testing every instance.

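The shape of that baseline pass is simple enough to sketch. Below is a minimal frustum-only cull kernel of the same kind, written as a GLSL string embedded in C++; the binding points, struct layout, and names (`InstanceAABB`, `Survivors`, `survivorCount`) are illustrative assumptions, not the viewer's actual SSBO format.

```cpp
// Hedged sketch of a frustum-only cull pass like the Attempt 2 baseline.
// Dispatch with glDispatchCompute((instanceTotal + 63) / 64, 1, 1).
static const char* kFrustumCullCS = R"GLSL(
#version 450
layout(local_size_x = 64) in;

struct InstanceAABB { vec4 bmin; vec4 bmax; };   // xyz = corner, w unused

layout(std430, binding = 0) readonly  buffer Instances { InstanceAABB aabbs[]; };
layout(std430, binding = 1) writeonly buffer Survivors { uint visible[]; };
layout(std430, binding = 2) buffer Counter { uint survivorCount; };   // zeroed each frame

uniform vec4 planes[6];      // frustum planes: xyz = normal, w = distance
uniform uint instanceTotal;

void main() {
    uint i = gl_GlobalInvocationID.x;
    if (i >= instanceTotal) return;

    vec3 bmin = aabbs[i].bmin.xyz;
    vec3 bmax = aabbs[i].bmax.xyz;

    // AABB vs plane: take the corner furthest along the plane normal
    // (the "positive vertex"); if even that corner is behind, cull.
    for (int p = 0; p < 6; ++p) {
        vec3 n = planes[p].xyz;
        vec3 v = vec3(n.x > 0.0 ? bmax.x : bmin.x,
                      n.y > 0.0 ? bmax.y : bmin.y,
                      n.z > 0.0 ? bmax.z : bmin.z);
        if (dot(n, v) + planes[p].w < 0.0) return;
    }

    // Survivor: append the instance index to the compact visible list.
    visible[atomicAdd(survivorCount, 1u)] = i;
}
)GLSL";
```
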
**Attempt 3: Hybrid GPU cull with synchronous readback.** Added
contribution culling to the GPU shader (bounding-sphere screen-space
radius test), then read back the compact survivor list to the CPU with
`glGetNamedBufferSubData`. CPU retains HiZ, LOD selection, winding
bucketing, indirect command building, and all GL draw calls.

| Phase | Time |
|-------|------|
| GPU dispatch (frustum + contribution) | 0.92 ms |
| Synchronous readback (`glGetNamedBufferSubData`) | **4.2–7.4 ms** |
| CPU consume (HiZ + LOD + winding + emit) | 6.4–9.8 ms |
| **Total wall** | **~15 ms** |

The synchronous readback pipeline-stalled the GPU, adding 4–7 ms of
idle wait. Total wall time was roughly equal to the CPU-only path,
negating the GPU cull's speed advantage.
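
The contribution test itself is only a few lines of shader code. A hedged sketch of the bounding-sphere screen-coverage heuristic follows; the uniform names and the pixel threshold are assumptions, not the committed shader's exact form.

```cpp
// Hedged sketch of the screen-space contribution test added in Attempt 3.
static const char* kContributionTestGLSL = R"GLSL(
uniform vec3  camPos;          // camera position in world space
uniform float viewportH;       // viewport height in pixels
uniform float tanHalfFovY;     // tan(vertical field of view / 2)
uniform float minPixelRadius;  // cull spheres projecting smaller than this, e.g. 1-2 px

// centre/radius come from the instance's bounding sphere (derivable from its AABB).
bool contributes(vec3 centre, float radius) {
    float dist = max(length(centre - camPos), 1e-3);
    // Approximate projected radius in pixels for a perspective camera.
    float pixels = (radius * viewportH) / (2.0 * dist * tanHalfFovY);
    return pixels >= minPixelRadius;
}
)GLSL";
```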

**Attempt 4: Async one-frame-late readback (committed, `30e43ffe`).**
Replaced synchronous readback with a persistent-mapped buffer
(`GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT`) and a `glFenceSync`/
`glClientWaitSync` fence. The GPU writes survivors this frame; the
CPU reads them next frame. One frame of latency, but zero stalls.

| Phase | Time |
|-------|------|
| GPU dispatch | 0.69–0.78 ms |
| Async readback (fence poll) | **0.00 ms** |
| CPU consume | 5.0–6.2 ms |
| **Total wall** | **~5.5 ms** |

For comparison, the CPU-only path runs at 5.2–6.4 ms wall on the same
scene. The GPU cull + async readback matches or slightly beats the
parallel CPU BVH path, with headroom for scenes where the CPU path
can't parallelise (single large model).
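
The readback pattern is worth spelling out, since it is the piece that removed the stall. A hedged sketch of the persistent-map plus fence approach, assuming the GL 4.5 DSA entry points and types are available through the viewer's existing function loader; names other than the GL calls are illustrative.

```cpp
// Hedged sketch of the one-frame-late readback from Attempt 4.
struct GpuCullReadback {
    GLuint buffer = 0;
    void*  ptr    = nullptr;   // persistently mapped for the buffer's whole lifetime
    GLsync fence  = nullptr;
};

void createReadback(GpuCullReadback& rb, GLsizeiptr bytes) {
    const GLbitfield flags = GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glCreateBuffers(1, &rb.buffer);
    glNamedBufferStorage(rb.buffer, bytes, nullptr, flags);
    rb.ptr = glMapNamedBufferRange(rb.buffer, 0, bytes, flags);
}

// Issued right after the cull dispatch: copy the survivor SSBO into the
// mapped buffer on the GPU timeline, then drop a fence behind the copy.
void queueReadback(GpuCullReadback& rb, GLuint survivorSSBO, GLsizeiptr bytes) {
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_BUFFER_UPDATE_BARRIER_BIT);
    glCopyNamedBufferSubData(survivorSSBO, rb.buffer, 0, 0, bytes);
    if (rb.fence) glDeleteSync(rb.fence);
    rb.fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Polled at the start of the next frame; never blocks (timeout 0).
// When it returns true, last frame's survivor list is safe to read from rb.ptr.
bool readbackReady(const GpuCullReadback& rb) {
    if (!rb.fence) return false;
    GLenum r = glClientWaitSync(rb.fence, 0, 0);
    return r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED;
}
```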

**Attempt 5: Dirty-mesh tracking (committed, `01dd8d57`).** Profiling
the CPU consume phase revealed that `clr` (clearing per-mesh visibility
buckets) and `emit` (building indirect commands) were O(total_meshes)
= O(462 k), not O(survivors). Added a dirty-mesh list so only mesh
buckets that received survivors are cleared and iterated.

Consume sub-phase breakdown (summed across parallel threads,
~128 k survivors):

| Sub-phase | Before | After | Scales with |
|-----------|--------|-------|-------------|
| bin (model binning) | 0.11 ms | 0.18 ms | O(survivors) |
| clr (bucket clear) | 2.0 ms | **1.6 ms** | O(dirty meshes) |
| class (HiZ + LOD + winding) | 5.1 ms | 5.3 ms | O(survivors) |
| emit (indirect cmd build) | 4.2 ms | **2.2 ms** | O(dirty meshes) |

Emit improved ~48%, clr ~20%. The dominant cost shifted to `class`
(per-survivor HiZ + LOD + winding classification).
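
A minimal sketch of the bookkeeping behind the dirty-mesh list; container and field names are illustrative, and the real per-mesh buckets carry more state (winding, LOD) than shown here.

```cpp
// Hedged sketch of dirty-mesh tracking as in Attempt 5.
#include <cstdint>
#include <vector>

struct MeshBucket {
    std::vector<uint32_t> instances;   // survivor indices that landed in this mesh
    bool                  dirty = false;
};

struct CullBuckets {
    std::vector<MeshBucket> meshes;        // sized to total mesh count (~462 k)
    std::vector<uint32_t>   dirtyMeshes;   // only meshes touched this frame (~104 k)

    void add(uint32_t meshId, uint32_t survivor) {
        MeshBucket& b = meshes[meshId];
        if (!b.dirty) {                    // first survivor for this mesh this frame
            b.dirty = true;
            dirtyMeshes.push_back(meshId);
        }
        b.instances.push_back(survivor);
    }

    // clr + emit now walk dirtyMeshes, O(dirty meshes), instead of all meshes.
    template <typename EmitFn>
    void emitAndClear(EmitFn&& emit) {
        for (uint32_t meshId : dirtyMeshes) {
            MeshBucket& b = meshes[meshId];
            emit(meshId, b.instances);     // build the indirect command(s) for this mesh
            b.instances.clear();
            b.dirty = false;
        }
        dirtyMeshes.clear();
    }
};
```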

##### What we learned

1. **GPU brute-force beats CPU BVH for frustum + contribution.**
   0.82 ms for 1.06 M instances vs 10–15 ms for the CPU BVH walk.
   The BVH's hierarchical skip advantage is overwhelmed by the GPU's
   raw parallelism — 1 M independent AABB-vs-frustum tests is a
   perfect compute workload.

2. **Synchronous readback kills the advantage.** The 4–7 ms stall from
   `glGetNamedBufferSubData` on ~1 MB of data negated all GPU savings.
   A pipeline stall is worse than just doing the work on the CPU.

3. **Async one-frame-late readback works well.** Persistent mapping +
   fence polling adds zero measurable overhead. The one-frame latency
   is imperceptible for culling — worst case, a few objects at the
   frustum edge pop in one frame late during fast camera motion.

4. **CPU consume is now the bottleneck.** With GPU dispatch at <1 ms
   and readback at 0 ms, the 5–6 ms consume phase (HiZ test, LOD
   selection, winding classification, indirect command building)
   dominates. The `class` sub-phase alone is 5+ ms, scaling linearly
   with survivor count.

5. **Dirty-mesh tracking helps but doesn't transform performance.**
   The 462 k total meshes → ~104 k active meshes reduction cut emit
   in half, but the per-survivor classification work is the true
   bottleneck.

##### What remains

The hybrid path (`IFC_GPU_CULL=1`) is functional and committed. It
matches the CPU path's performance today and provides the foundation
for further GPU offload. Remaining opportunities:

- Move HiZ + LOD + winding classification to the GPU (eliminates the
  5 ms `class` sub-phase entirely — the GPU already has the AABBs and
  can sample the HiZ pyramid directly; see the sketch after this list).
- GPU BVH traversal to reduce dispatch from O(total) to O(visible +
  tree overhead) — matters when the survivor ratio is low.
- GPU-driven indirect command building (eliminates CPU emit entirely).
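
For the first bullet, a hedged sketch of what an in-shader HiZ test could look like, assuming a max-reduce depth pyramid with conventional 0-to-1 depth and nearest-filtered sampling; the texture binding and function names are illustrative, not existing code.

```cpp
// Hedged sketch of an in-shader HiZ visibility test (not implemented yet).
static const char* kHizTestGLSL = R"GLSL(
layout(binding = 3) uniform sampler2D hizPyramid;   // max-depth mip chain, nearest filtering

// rectMin/rectMax: the instance AABB's screen-space bounds in [0,1] UV space;
// nearestDepth: the AABB's closest projected depth.
bool hizVisible(vec2 rectMin, vec2 rectMax, float nearestDepth) {
    vec2 sizePx = (rectMax - rectMin) * vec2(textureSize(hizPyramid, 0));
    // Pick the mip where the rect spans at most ~2 texels, so 4 taps cover it.
    float lod = ceil(log2(max(max(sizePx.x, sizePx.y), 1.0)));
    vec4 d = vec4(textureLod(hizPyramid, rectMin,                    lod).r,
                  textureLod(hizPyramid, vec2(rectMax.x, rectMin.y), lod).r,
                  textureLod(hizPyramid, vec2(rectMin.x, rectMax.y), lod).r,
                  textureLod(hizPyramid, rectMax,                    lod).r);
    float farthestOccluder = max(max(d.x, d.y), max(d.z, d.w));
    // Visible unless every covered pixel already holds something nearer.
    return nearestDepth <= farthestOccluder;
}
)GLSL";
```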

Each of these would chip away at the consume phase, but the sub_draw
analysis below reveals a more fundamental bottleneck.

#### 3F. Sub-draw fragmentation analysis

##### The problem

With GPU cull solving the *culling* bottleneck, the dominant cost
shifts to the *drawing* side. On the 1.06 M-instance / 111-model
scene, frame times are 48–63 ms despite only 24–47 M visible
triangles — well within the GTX 1650's throughput. The culprit is
the number of indirect sub-draws (individual `DrawElementsIndirectCommand`
entries inside each `glMultiDrawElementsIndirect` call).
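
For reference, the struct below is the standard GL layout of one such sub-draw; the per-field comments describe how this pipeline uses the fields, and the call shown is the one that consumes the whole command array.

```cpp
// One indirect sub-draw as consumed by glMultiDrawElementsIndirect.
#include <cstdint>

struct DrawElementsIndirectCommand {
    uint32_t count;         // index count for the mesh (at the chosen LOD)
    uint32_t instanceCount; // visible placements sharing that mesh
    uint32_t firstIndex;    // offset into the shared index buffer
    int32_t  baseVertex;    // offset into the shared vertex buffer
    uint32_t baseInstance;  // start of this bucket's per-instance data
};

// One call per model walks the command array bound to GL_DRAW_INDIRECT_BUFFER;
// sub_draws below is the number of commands in that array:
//   glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr,
//                               subDrawCount, sizeof(DrawElementsIndirectCommand));
```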

##### Measurement

Diagnostic instrumentation (`IFC_SUBDRAW_DIAG=1`) revealed the
following (a sketch of the counting logic appears after the two
tables):

**Mixed scene (111 models, 1.06 M instances):**

| instanceCount | sub_draws | % of total | instances | triangles |
|---------------|-----------|------------|-----------|-----------|
| 1 | 114,624 | **95.7%** | 114,624 | 16.9 M |
| 2 | 2,269 | 1.9% | 4,538 | 1.3 M |
| 3–4 | 1,127 | 0.9% | 3,873 | 1.6 M |
| 5–8 | 1,106 | 0.9% | 6,407 | 1.9 M |
| 9–16 | 376 | 0.3% | 4,315 | 0.8 M |
| 17–64 | 264 | 0.2% | 7,766 | 8.0 M |
| 65–256 | 29 | <0.1% | 3,331 | 2.0 M |
| 257+ | 8 | <0.1% | 9,732 | 0.4 M |

**Steel-only scene (18 models, 570 k instances):**

| instanceCount | sub_draws | % of total | instances | triangles |
|---------------|-----------|------------|-----------|-----------|
| 1 | 68,616 | **85.9%** | 68,616 | 12.5 M |
| 2 | 5,385 | 6.7% | 10,770 | 2.7 M |
| 3–4 | 2,581 | 3.2% | 9,100 | 1.3 M |
| 5+ | 3,324 | 4.2% | 66,407 | 7.0 M |
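
The counting logic is a simple histogram over the frame's command array, bucketed by `instanceCount`. The sketch below reuses the `DrawElementsIndirectCommand` struct shown earlier; the function name and bucket edges are illustrative, and the real `IFC_SUBDRAW_DIAG=1` output also accumulates triangle counts.

```cpp
// Hedged sketch of a sub-draw histogram over one frame's indirect commands.
#include <cstdint>
#include <cstdio>
#include <vector>

void dumpSubDrawHistogram(const std::vector<DrawElementsIndirectCommand>& cmds)
{
    // Upper bucket edges chosen to match the tables: 1, 2, 3-4, 5-8, ..., 257+.
    const uint32_t edges[8] = {1, 2, 4, 8, 16, 64, 256, UINT32_MAX};
    uint64_t subDraws[8] = {}, instances[8] = {};

    for (const auto& c : cmds) {
        int b = 0;
        while (c.instanceCount > edges[b]) ++b;   // first bucket that fits
        subDraws[b]  += 1;
        instances[b] += c.instanceCount;
    }
    for (int b = 0; b < 8; ++b)
        std::printf("instanceCount <= %u: %llu sub_draws, %llu instances\n",
                    edges[b],
                    (unsigned long long)subDraws[b],
                    (unsigned long long)instances[b]);
}
```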

##### Consolidation potential

The mesh-level consolidation analysis found:

- **119,803 unique visible mesh IDs = 119,803 sub_draws** (perfect 1:1)
- **0 meshes split by winding or LOD buckets** — no mesh_id appears in
  more than one (fwd/rev × lod0/lod1) bucket
- **0% reduction** available from merging across winding/LOD
- **114,624 meshes (95.7%)** are genuinely unique geometry placed
  exactly once — instancing provides zero benefit for these

This is a fundamental property of the IFC data, not a pipeline
inefficiency. BIM models contain thousands of unique parametric
shapes (custom brackets, unique beam profiles, one-off fittings) each
placed at a single location. Only a minority of elements (standard
doors, windows, pipe fittings) share geometry across placements.

##### Conclusions

1. **Instancing is maxed out.** The pipeline already groups all
   instances of each mesh into a single sub_draw. With 96% of meshes
   having exactly one visible instance, there is nothing more to
   group.

2. **Per-draw overhead dominates frame time.** 95–120 k sub_draws at
   ~20 fps = 48–50 ms/frame, but only 24–33 M triangles. A GTX 1650
   can shade 1+ billion triangles/sec; the GPU is starving on
   per-command overhead (command fetch, baseInstance lookup, draw
   setup), not vertex/fragment throughput.

3. **The path forward is static batching.** Merge the vertex and
   index data of multiple distinct single-instance meshes into
   combined VBO/EBO ranges, each issued as one sub_draw. Batches of
   256–1024 spatially-coherent meshes would collapse 91–115 k
   sub_draws into 100–450, a 200–1000× reduction (see the sketch
   after this list).

4. **Trade-offs of static batching:**
   - Culling granularity degrades from per-mesh to per-batch. Batches
     must be spatially coherent (e.g., BVH subtree leaves) or invisible
     geometry gets drawn.
   - Per-instance attributes (object_id, colour_override) must move
     into the vertex stream or a per-vertex SSBO lookup, since
     instancing no longer applies to merged meshes.
   - The VBO/EBO layout changes at finalize time; existing instancing
     stays for multi-instance meshes (the 4% that benefit from it).
   - The sidecar format needs a version bump to cache batch membership.

5. **The steel scene validates the hypothesis.** It has better
   instancing reuse (86% single-instance vs 96%) and correspondingly
   better fps (49 vs 20). The ~2.5× fps ratio tracks the sub_draw
   ratio (~80 k vs ~120 k), confirming per-draw overhead as the
   dominant cost.
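
One way to realise point 3, assuming all meshes already live in one large shared VBO/EBO (so batching reduces to rebasing and concatenating index ranges). Every name here is illustrative; the real builder would run at finalize time, batch by spatial locality, and persist its output in the sidecar.

```cpp
// Hedged sketch of static batching: pack spatially coherent single-instance
// meshes into merged index ranges so each batch is one indirect sub-draw.
#include <cstdint>
#include <vector>

struct MeshRange {                 // existing per-mesh geometry in the shared VBO/EBO
    uint32_t firstIndex, indexCount;
    int32_t  baseVertex;
};

struct StaticBatch {               // one future sub-draw covering many meshes
    std::vector<uint32_t> indices; // indices rebased so baseVertex becomes 0
    uint32_t              meshCount = 0;
};

// Meshes are assumed pre-sorted for spatial coherence (e.g. by BVH leaf order).
std::vector<StaticBatch> buildBatches(const std::vector<MeshRange>& meshes,
                                      const std::vector<uint32_t>& ebo,
                                      size_t maxMeshesPerBatch = 512)
{
    std::vector<StaticBatch> batches;
    for (const MeshRange& m : meshes) {
        if (batches.empty() || batches.back().meshCount == maxMeshesPerBatch)
            batches.emplace_back();
        StaticBatch& b = batches.back();
        for (uint32_t k = 0; k < m.indexCount; ++k)
            b.indices.push_back(uint32_t(int64_t(ebo[m.firstIndex + k]) + m.baseVertex));
        ++b.meshCount;
    }
    return batches;   // each batch later becomes one DrawElementsIndirectCommand
}
```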

### Planned follow-ups (post-Phase-3)

                                                            + Phase 3B LOD (done)
multi-million + occluders        redundant rasterisation    Phase 3C HiZ (done, CPU readback)
many models, serial cull         single-thread BVH trv      Phase 3D parallel cull (done)
single giant model / <18 cores   CPU BVH trv                Phase 3E GPU cull (hybrid, done)
90k+ unique visible meshes       per-draw GPU overhead      Phase 3F static batching (next)
```

## Roadmap
- [x] Phase 3D — Parallel per-model CPU cull (`std::async` fan-out)
- [x] Quantized VBO (16 B/vert, sidecar v6)
- [x] Event-driven rendering (zero idle CPU/GPU, cull skipped on still frames)
- [x] Phase 3E — GPU compute-shader culling (hybrid: GPU frustum+contribution, async readback, CPU HiZ+LOD+emit)
- [ ] **Phase 3F — Static batching of single-instance meshes** (next; reduces 90k+ sub_draws to hundreds)
- [ ] Vulkan/MoltenVK backend for macOS
- [ ] Embedded Python scripting console