
Commit 930678e

Moult and claude committed
ifcviewer: motion-adaptive contribution culling + sub-draw diagnostics
During camera motion, use a larger pixel-radius threshold (IFC_MIN_PX_MOTION) to aggressively cull small objects, dramatically reducing sub_draws and improving orbit fps (e.g. 29→67 fps on a 1M-instance scene). When the camera stops, automatically re-cull at the base threshold to restore full detail.

Key behaviors:

- IFC_MIN_PX_MOTION=N sets the motion threshold (0 = disabled)
- Settle recull fires on the first still frame after motion
- HiZ pyramid invalidated on settle (stale from sparse motion frame)
- GPU cull results skipped on settle (dispatched at motion threshold)
- requestUpdate() ensures the settle frame actually runs

Also adds IFC_SUBDRAW_DIAG=1 diagnostic for sub-draw composition analysis and documents Phase 3E/3F experiment results in README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
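
A minimal sketch of the threshold selection described above (names such as `MotionCullState`, `basePx`, `motionPx` are illustrative, not the viewer's actual identifiers):

```cpp
// Sketch only: pick the contribution-cull threshold for this frame and
// flag the settle frame (first still frame after motion).
struct MotionCullState {
    float basePx    = 2.0f;   // normal pixel-radius threshold
    float motionPx  = 8.0f;   // IFC_MIN_PX_MOTION (0 disables the motion path)
    bool  wasMoving = false;
};

float pickCullThreshold(MotionCullState& s, bool cameraMoving, bool& settleRecull)
{
    // Settle frame: re-cull at the base threshold, invalidate the HiZ
    // pyramid, drop GPU cull results dispatched at the motion threshold,
    // and call requestUpdate() so the frame actually renders.
    settleRecull = !cameraMoving && s.wasMoving && s.motionPx > 0.0f;
    s.wasMoving  = cameraMoving;

    if (cameraMoving && s.motionPx > 0.0f)
        return s.motionPx;    // aggressive culling while orbiting/panning
    return s.basePx;          // full detail when still (and on the settle frame)
}
```
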
1 parent 4e3cc63 commit 930678e

3 files changed

Lines changed: 435 additions & 21 deletions

File tree

src/ifcviewer/README.md

Lines changed: 224 additions & 13 deletions
@@ -740,17 +740,226 @@ The stats line now reports `cull[wall X | work: clr Y trv Z emt W upl U]`:
where CPU cycles went. `IFC_CULL_THREADS=0` forces single-threaded mode
for comparison.

#### 3E. GPU-side culling via compute (longer-term)

Push the cull loop to a compute shader reading the per-instance SSBO +
frustum planes + HiZ pyramid, emitting the visible list and indirect
commands with atomic counters. Three compute dispatches per model: (1)
count survivors per `(mesh, winding, LOD)` bucket, (2) prefix-sum the
counts into `baseInstance` offsets and write the indirect command buffer,
(3) re-test and compact survivors into the dense visible list. HiZ moves
to a GPU depth texture sampled directly in the shader, eliminating the
Phase 3C readback. Lets culling scale to millions of instances and
single-model scenes where Phase 3D can't parallelise.

#### 3E. GPU compute culling — experiments, results, and current state

##### What we tried

**Attempt 1: Full GPU-driven rendering (reverted).** Five commits
(`4fe32b54`..`d5b7b87b`) moved the entire cull-to-draw pipeline onto
the GPU: a compute shader performed frustum + contribution + HiZ
culling, selected LOD0/LOD1, handled fwd/rev winding bucketing, wrote
indirect draw commands via `glMultiDrawElementsIndirectCount`, and
drove rendering without CPU readback. This was architecturally clean
but complex — the GPU built per-model indirect command buffers with
atomic counters, prefix sums, and per-bucket compaction. It worked
correctly but introduced code smells (extension loaders for
`glMultiDrawElementsIndirectCount` not exposed by Qt6's
`QOpenGLFunctions_4_5_Core`, ad-hoc GPU readbacks for validation).
All five commits were reverted as a single block to keep the codebase
clean while preserving the AABB SSBO upload (`b2044737`) and the
frustum-only validation shader (`b17860fc`).
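
For context, the kind of loader that attempt needed: `glMultiDrawElementsIndirectCount` is core only in GL 4.6 (or via `GL_ARB_indirect_parameters`), so on a 4.5 context the pointer must be resolved by hand. A sketch, not the reverted code; calling-convention macros are omitted:

```cpp
#include <QOpenGLContext>
#include <qopengl.h>   // GL typedefs bundled with Qt

// Signature per the GL 4.6 spec; `drawcount` is a byte offset into the
// buffer bound to GL_PARAMETER_BUFFER that holds the actual draw count.
using PFNMultiDrawElementsIndirectCount =
    void (*)(GLenum mode, GLenum type, const void* indirect,
             GLintptr drawcount, GLsizei maxdrawcount, GLsizei stride);

PFNMultiDrawElementsIndirectCount resolveMultiDrawIndirectCount()
{
    QOpenGLContext* ctx = QOpenGLContext::currentContext();
    auto fn = reinterpret_cast<PFNMultiDrawElementsIndirectCount>(
        ctx->getProcAddress("glMultiDrawElementsIndirectCount"));
    if (!fn)  // pre-4.6 drivers expose the ARB-suffixed name instead
        fn = reinterpret_cast<PFNMultiDrawElementsIndirectCount>(
            ctx->getProcAddress("glMultiDrawElementsIndirectCountARB"));
    return fn;  // may be null if the driver has neither
}
```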

**Attempt 2: GPU frustum-only validation shader.** A minimal compute
shader (64 threads/workgroup) testing each instance's AABB against 6
frustum planes. Used as a measurement baseline — no contribution,
HiZ, LOD, or winding. Results on a 1.06 M-instance / 111-model scene
(GTX 1650):

| Metric | GPU frustum-only | CPU BVH (parallel) |
|--------|------------------|--------------------|
| Cull time | **0.82 ms** (GPU timestamp) | 9.6–15.2 ms wall |
| Survivors | 279 k (frustum only) | 130 k (frustum + contribution + HiZ) |

The GPU brute-force scan of 1.06 M instances in 0.82 ms was 12–18×
faster than the CPU BVH walk despite testing every instance.
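
The per-instance test is the standard AABB-vs-plane check. Shown here as CPU-side C++ for clarity; each of the 64 threads in a workgroup runs the same arithmetic on one instance read from the SSBO (types and names below are illustrative):

```cpp
#include <array>

struct Vec3  { float x, y, z; };
struct Plane { float a, b, c, d; };        // ax + by + cz + d >= 0 means "inside"
struct Aabb  { Vec3 mn, mx; };

// An instance survives if its AABB is not fully outside any of the six
// frustum planes.
bool aabbInFrustum(const Aabb& box, const std::array<Plane, 6>& planes)
{
    for (const Plane& p : planes) {
        // "Positive vertex": the AABB corner farthest along the plane normal.
        const Vec3 v {
            p.a >= 0.0f ? box.mx.x : box.mn.x,
            p.b >= 0.0f ? box.mx.y : box.mn.y,
            p.c >= 0.0f ? box.mx.z : box.mn.z,
        };
        if (p.a * v.x + p.b * v.y + p.c * v.z + p.d < 0.0f)
            return false;                  // fully outside this plane
    }
    return true;                           // potentially visible
}
```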

**Attempt 3: Hybrid GPU cull with synchronous readback.** Added
contribution culling to the GPU shader (bounding-sphere screen-space
radius test), then read back the compact survivor list to the CPU with
`glGetNamedBufferSubData`. CPU retains HiZ, LOD selection, winding
bucketing, indirect command building, and all GL draw calls.

| Phase | Time |
|-------|------|
| GPU dispatch (frustum + contribution) | 0.92 ms |
| Synchronous readback (`glGetNamedBufferSubData`) | **4.2–7.4 ms** |
| CPU consume (HiZ + LOD + winding + emit) | 6.4–9.8 ms |
| **Total wall** | **~15 ms** |

The synchronous readback pipeline-stalled the GPU, adding 4–7 ms of
idle wait. Total wall time was roughly equal to the CPU-only path,
negating the GPU cull's speed advantage.
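
The stall comes from asking for the result in the same frame it is produced, roughly like this (handle names `counterBuf`, `survivorBuf`, `groupCount` are assumptions; GL entry points come from the viewer's 4.5 core function table):

```cpp
#include <cstdint>
#include <vector>

void cullAndReadBackSynchronously(GLuint counterBuf, GLuint survivorBuf,
                                  GLuint groupCount)
{
    glDispatchCompute(groupCount, 1, 1);                 // frustum + contribution cull
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT |      // make SSBO writes visible
                    GL_BUFFER_UPDATE_BARRIER_BIT);       // ...to buffer readback

    // This readback blocks until the dispatch above has finished:
    // the 4.2-7.4 ms of idle GPU measured in the table.
    uint32_t survivorCount = 0;
    glGetNamedBufferSubData(counterBuf, 0, sizeof(survivorCount), &survivorCount);

    std::vector<uint32_t> survivors(survivorCount);
    if (survivorCount > 0)
        glGetNamedBufferSubData(survivorBuf, 0,
                                survivors.size() * sizeof(uint32_t),
                                survivors.data());
    // ... CPU consume: HiZ, LOD, winding, indirect command build ...
}
```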

**Attempt 4: Async one-frame-late readback (committed, `30e43ffe`).**
Replaced synchronous readback with a persistent-mapped buffer
(`GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT`) and a `glFenceSync` /
`glClientWaitSync` fence. The GPU writes survivors this frame; the
CPU reads them next frame. One frame of latency, but zero stalls.

| Phase | Time |
|-------|------|
| GPU dispatch | 0.69–0.78 ms |
| Async readback (fence poll) | **0.00 ms** |
| CPU consume | 5.0–6.2 ms |
| **Total wall** | **~5.5 ms** |

vs the CPU-only path at 5.2–6.4 ms wall on the same scene. The GPU
cull + async readback matches or slightly beats the parallel CPU BVH
path, with headroom for scenes where the CPU path can't parallelise
(single large model).
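
A sketch of the persistent-map + fence pattern, assuming illustrative buffer sizes and globals; the committed code differs in detail:

```cpp
#include <cstdint>
#include <cstring>

constexpr GLsizeiptr kReadbackBytes = 1 << 20;                 // ~1 MB survivor list
constexpr GLbitfield kMapFlags =
    GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;

GLuint      g_readbackBuf = 0;
const void* g_mapped      = nullptr;
GLsync      g_cullFence   = nullptr;

// One-time setup: immutable storage that stays mapped for the app's lifetime.
void createReadbackBuffer()
{
    glCreateBuffers(1, &g_readbackBuf);
    glNamedBufferStorage(g_readbackBuf, kReadbackBytes, nullptr, kMapFlags);
    g_mapped = glMapNamedBufferRange(g_readbackBuf, 0, kReadbackBytes, kMapFlags);
}

// Frame N: run the cull, then drop a fence right after the dispatch.
void dispatchCull(GLuint groupCount)
{
    glDispatchCompute(groupCount, 1, 1);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);
    g_cullFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Frame N+1: poll with a zero timeout; never wait, never stall.
bool tryConsumeSurvivors(void* cpuSurvivors)
{
    if (!g_cullFence)
        return false;
    const GLenum status = glClientWaitSync(g_cullFence, 0, 0 /* ns */);
    if (status != GL_ALREADY_SIGNALED && status != GL_CONDITION_SATISFIED)
        return false;                        // not ready: reuse last frame's list
    glDeleteSync(g_cullFence);
    g_cullFence = nullptr;
    std::memcpy(cpuSurvivors, g_mapped, kReadbackBytes);  // coherent map: plain read
    return true;                             // CPU consume (HiZ/LOD/winding/emit) runs
}
```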

**Attempt 5: Dirty-mesh tracking (committed, `01dd8d57`).** Profiling
the CPU consume phase revealed that `clr` (clearing per-mesh visibility
buckets) and `emit` (building indirect commands) were O(total_meshes)
= O(462 k), not O(survivors). Added a dirty-mesh list so only mesh
buckets that received survivors are cleared and iterated.
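
Roughly the shape of the change, with illustrative container names rather than the real ones:

```cpp
#include <cstdint>
#include <vector>

struct MeshBucket {
    std::vector<uint32_t> instances;   // survivors binned to this mesh this frame
    bool dirty = false;
};

// Instead of clearing and iterating all ~462 k buckets, remember which
// meshes actually received survivors and touch only those.
void binSurvivor(std::vector<MeshBucket>& buckets,
                 std::vector<uint32_t>& dirtyMeshes,
                 uint32_t meshId, uint32_t instanceId)
{
    MeshBucket& b = buckets[meshId];
    if (!b.dirty) {                    // first survivor for this mesh this frame
        b.dirty = true;
        dirtyMeshes.push_back(meshId);
    }
    b.instances.push_back(instanceId);
}

void clearAndEmit(std::vector<MeshBucket>& buckets,
                  std::vector<uint32_t>& dirtyMeshes)
{
    for (uint32_t meshId : dirtyMeshes) {   // O(dirty meshes), not O(total meshes)
        MeshBucket& b = buckets[meshId];
        // ... emit the indirect commands for this mesh's buckets here ...
        b.instances.clear();
        b.dirty = false;
    }
    dirtyMeshes.clear();
}
```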

Consume sub-phase breakdown (summed across parallel threads,
~128 k survivors):

| Sub-phase | Before | After | Scales with |
|-----------|--------|-------|-------------|
| bin (model binning) | 0.11 ms | 0.18 ms | O(survivors) |
| clr (bucket clear) | 2.0 ms | **1.6 ms** | O(dirty meshes) |
| class (HiZ + LOD + winding) | 5.1 ms | 5.3 ms | O(survivors) |
| emit (indirect cmd build) | 4.2 ms | **2.2 ms** | O(dirty meshes) |

Emit improved ~48%, clr ~20%. The dominant cost shifted to `class`
(per-survivor HiZ + LOD + winding classification).

##### What we learned

1. **GPU brute-force beats CPU BVH for frustum + contribution.**
   0.82 ms for 1.06 M instances vs 10–15 ms for the CPU BVH walk.
   The BVH's hierarchical skip advantage is overwhelmed by the GPU's
   raw parallelism — 1 M independent AABB-vs-frustum tests is a
   perfect compute workload.

2. **Synchronous readback kills the advantage.** The 4–7 ms stall from
   `glGetNamedBufferSubData` on ~1 MB of data negated all GPU savings.
   A pipeline stall is worse than just doing the work on the CPU.

3. **Async one-frame-late readback works well.** Persistent mapping +
   fence polling adds zero measurable overhead. The one-frame latency
   is imperceptible for culling — worst case, a few objects at the
   frustum edge pop in one frame late during fast camera motion.

4. **CPU consume is now the bottleneck.** With GPU dispatch at <1 ms
   and readback at 0 ms, the 5–6 ms consume phase (HiZ test, LOD
   selection, winding classification, indirect command building)
   dominates. The `class` sub-phase alone is 5+ ms, scaling linearly
   with survivor count.

5. **Dirty-mesh tracking helps but doesn't transform performance.**
   The 462 k total meshes → ~104 k active meshes reduction cut emit
   in half, but the per-survivor classification work is the true
   bottleneck.

##### What remains

The hybrid path (`IFC_GPU_CULL=1`) is functional and committed. It
matches the CPU path's performance today and provides the foundation
for further GPU offload. Remaining opportunities:

- Move HiZ + LOD + winding classification to the GPU (eliminates the
  5 ms `class` sub-phase entirely — the GPU already has the AABBs and
  can sample the HiZ pyramid directly).
- GPU BVH traversal to reduce dispatch from O(total) to O(visible +
  tree overhead) — matters when survivor ratio is low.
- GPU-driven indirect command building (eliminates CPU emit entirely).

Each of these would chip away at the consume phase, but the sub_draw
analysis below reveals a more fundamental bottleneck.

#### 3F. Sub-draw fragmentation analysis

##### The problem

With GPU cull solving the *culling* bottleneck, the dominant cost
shifts to the *drawing* side. On the 1.06 M-instance / 111-model
scene, frame times are 48–63 ms despite only 24–47 M visible
triangles — well within the GTX 1650's throughput. The culprit is
the number of indirect sub-draws (individual `DrawElementsIndirectCommand`
entries inside each `glMultiDrawElementsIndirect` call).
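
Each sub-draw is one indirect command record, laid out as the OpenGL spec defines it; the draw call walks `drawcount` of these per `glMultiDrawElementsIndirect`:

```cpp
#include <cstdint>

// Layout defined by OpenGL for indirect indexed draws. One record = one
// sub-draw; on the mixed scene below ~120 k of these are emitted per frame.
struct DrawElementsIndirectCommand {
    uint32_t count;          // number of indices to draw for this mesh
    uint32_t instanceCount;  // visible instances of the mesh (1 for most meshes here)
    uint32_t firstIndex;     // offset into the shared EBO
    int32_t  baseVertex;     // offset into the shared VBO
    uint32_t baseInstance;   // where this mesh's instances start in the instance data
};
```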

##### Measurement

Diagnostic instrumentation (`IFC_SUBDRAW_DIAG=1`) revealed:

**Mixed scene (111 models, 1.06 M instances):**

| instanceCount | sub_draws | % of total | instances | triangles |
|---------------|-----------|------------|-----------|-----------|
| 1 | 114,624 | **95.7%** | 114,624 | 16.9 M |
| 2 | 2,269 | 1.9% | 4,538 | 1.3 M |
| 3–4 | 1,127 | 0.9% | 3,873 | 1.6 M |
| 5–8 | 1,106 | 0.9% | 6,407 | 1.9 M |
| 9–16 | 376 | 0.3% | 4,315 | 0.8 M |
| 17–64 | 264 | 0.2% | 7,766 | 8.0 M |
| 65–256 | 29 | <0.1% | 3,331 | 2.0 M |
| 257+ | 8 | <0.1% | 9,732 | 0.4 M |

**Steel-only scene (18 models, 570 k instances):**

| instanceCount | sub_draws | % of total | instances | triangles |
|---------------|-----------|------------|-----------|-----------|
| 1 | 68,616 | **85.9%** | 68,616 | 12.5 M |
| 2 | 5,385 | 6.7% | 10,770 | 2.7 M |
| 3–4 | 2,581 | 3.2% | 9,100 | 1.3 M |
| 5+ | 3,324 | 4.2% | 66,407 | 7.0 M |

##### Consolidation potential

The mesh-level consolidation analysis found:

- **119,803 unique visible mesh IDs = 119,803 sub_draws** (perfect 1:1)
- **0 meshes split by winding or LOD buckets** — no mesh_id appears in
  more than one (fwd/rev × lod0/lod1) bucket
- **0% reduction** available from merging across winding/LOD
- **114,624 meshes (95.7%)** are genuinely unique geometry placed
  exactly once — instancing provides zero benefit for these

This is a fundamental property of the IFC data, not a pipeline
inefficiency. BIM models contain thousands of unique parametric
shapes (custom brackets, unique beam profiles, one-off fittings) each
placed at a single location. Only a minority of elements (standard
doors, windows, pipe fittings) share geometry across placements.

##### Conclusions

1. **Instancing is maxed out.** The pipeline already groups all
   instances of each mesh into a single sub_draw. With 96% of meshes
   having exactly one visible instance, there is nothing more to
   group.

2. **Per-draw overhead dominates frame time.** 95–120 k sub_draws at
   ~20 fps = 48–50 ms/frame, but only 24–33 M triangles. A GTX 1650
   can shade 1+ billion triangles/sec; the GPU is starving on
   per-command overhead (command fetch, baseInstance lookup, draw
   setup), not vertex/fragment throughput.

3. **The path forward is static batching.** Merge the vertex and
   index data of multiple distinct single-instance meshes into
   combined VBO/EBO ranges, each issued as one sub_draw. Batches of
   256–1024 spatially-coherent meshes would collapse 91–115 k
   sub_draws into 100–450, a 200–1000× reduction. A sketch of such a
   batch-building pass follows this list.

4. **Trade-offs of static batching:**
   - Culling granularity degrades from per-mesh to per-batch. Batches
     must be spatially coherent (e.g., BVH subtree leaves) or invisible
     geometry gets drawn.
   - Per-instance attributes (object_id, colour_override) must move
     into the vertex stream or a per-vertex SSBO lookup, since
     instancing no longer applies to merged meshes.
   - The VBO/EBO layout changes at finalize time; existing instancing
     stays for multi-instance meshes (the 4% that benefit from it).
   - The sidecar format needs a version bump to cache batch membership.

5. **The steel scene validates the hypothesis.** It has better
   instancing reuse (86% single-instance vs 96%) and correspondingly
   better fps (49 vs 20). The ~2.5× fps ratio tracks the sub_draw
   ratio (~80 k vs ~120 k), confirming per-draw overhead as the
   dominant cost.
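
To make the batching idea concrete, a sketch of what a finalize-time pass could look like. `MeshRange` and `Batch` are hypothetical types, and the members of a batch would come from one spatially coherent group (e.g. a BVH subtree) so per-batch culling stays effective:

```cpp
#include <cstdint>
#include <vector>

// One single-instance mesh already resident in the shared VBO/EBO.
struct MeshRange {
    uint32_t firstIndex;     // into the shared EBO
    uint32_t indexCount;
    int32_t  baseVertex;     // into the shared VBO
    uint32_t objectId;       // needed per-vertex once instancing is gone
};

// A batch collapses many single-instance meshes into one index range,
// i.e. one DrawElementsIndirectCommand instead of hundreds.
struct Batch {
    std::vector<uint32_t> indices;    // rebased copies of the members' indices
    std::vector<uint32_t> objectIds;  // per-vertex object id (or SSBO lookup key)
    // The batch AABB would also be stored here for per-batch culling.
};

Batch buildBatch(const std::vector<MeshRange>& members,
                 const std::vector<uint32_t>& sharedIndices)
{
    Batch out;
    for (const MeshRange& m : members) {
        for (uint32_t i = 0; i < m.indexCount; ++i) {
            // Fold baseVertex into each index so the merged range needs none.
            out.indices.push_back(
                sharedIndices[m.firstIndex + i] + static_cast<uint32_t>(m.baseVertex));
        }
        // In a real pass the object id would be written alongside the
        // batch's (deduplicated) vertices; duplicated per index here for brevity.
        out.objectIds.insert(out.objectIds.end(), m.indexCount, m.objectId);
    }
    return out;
}
```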

### Planned follow-ups (post-Phase-3)

@@ -769,7 +978,8 @@ Scene size Bottleneck Fix
                                                            + Phase 3B LOD (done)
multi-million + occluders         redundant rasterisation   Phase 3C HiZ (done, CPU readback)
many models, serial cull          single-thread BVH trv     Phase 3D parallel cull (done)
single giant model / <18 cores    CPU BVH trv               Phase 3E GPU cull (planned)
single giant model / <18 cores    CPU BVH trv               Phase 3E GPU cull (hybrid, done)
90k+ unique visible meshes        per-draw GPU overhead     Phase 3F static batching (next)
```

## Roadmap
@@ -793,6 +1003,7 @@ single giant model / <18 cores CPU BVH trv Phase 3E GPU cull (plann
- [x] Phase 3D — Parallel per-model CPU cull (`std::async` fan-out)
- [x] Quantized VBO (16 B/vert, sidecar v6)
- [x] Event-driven rendering (zero idle CPU/GPU, cull skipped on still frames)
- [ ] **Phase 3E — GPU-side compute-shader culling** (next; replaces the HiZ readback)
- [x] Phase 3E — GPU compute-shader culling (hybrid: GPU frustum+contribution, async readback, CPU HiZ+LOD+emit)
- [ ] **Phase 3F — Static batching of single-instance meshes** (next; reduces 90k+ sub_draws to hundreds)
- [ ] Vulkan/MoltenVK backend for macOS
- [ ] Embedded Python scripting console
