Optimization: Add pruning for connected component 2 phase algorithm#846
Optimization: Add pruning for connected component 2 phase algorithm#846WeichenXu123 wants to merge 5 commits into
Conversation
|
This is an optimization that has been implemented in Databricks internal graphframe lib. Now we decide to contribute it to OSS. CC @SemyonSinchenko |
There was a problem hiding this comment.
Pull request overview
This PR adds a pruning-based optimization to the two-phase connected components implementation, aiming to speed up convergence on sparse graphs by temporarily shrinking the graph (removing certain “leaf” nodes) and then joining results back to the original graph.
Changes:
- Add leaf-node pruning (
pruneLeafNodes) and reconstruction logic (joinBack) toTwoPhase. - Add heuristic gating for when to attempt pruning during iterations, plus a checkpoint-retention tweak to keep required checkpoint data around.
- Add unit tests covering pruning behavior, shrinkage gating, and join-back reconstruction.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| core/src/main/scala/org/graphframes/lib/TwoPhase.scala | Implements pruning/join-back optimization and related convergence/iteration logic changes. |
| core/src/test/scala/org/graphframes/lib/ConnectedComponentsSuite.scala | Adds tests for pruning, shrinkage condition, and join-back behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #846 +/- ##
==========================================
+ Coverage 80.75% 81.92% +1.16%
==========================================
Files 78 78
Lines 4421 4447 +26
Branches 543 559 +16
==========================================
+ Hits 3570 3643 +73
+ Misses 851 804 -47 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
First of all, thanks a lot for the contribution @WeichenXu123 ! I run local benchmarks on one of small LDBC' graphs. The first results are very promising ❤️🔥❤️🔥❤️🔥 Wiki-Talk, 2M/5M -- the same is run in CI and published on website. This PR Main On the defaults in GraphFrames GraphFrames is very old project. And ConnectedComponents is a widely used algorithm. So, we are trying not to change defaults in the minor-version updates and we are considering changing defaults as a breaking changes. At the same time, a couple of versions ago I was experimenting with So, we have actually two different implementations of the P.S. and to run in "AQE-mode": And thanks a again for this contribution! The idea is brilliant and the results look very promising! |
| if ((edgeCnt < sparsityThreshold * numNodes) && (edgeCnt > 0) | ||
| && (iteration >= optStartIter) && (!triedToOptimize)) { | ||
| edgesBeforePruning = ee | ||
| pruneLeafNodes(ee, intermediateStorageLevel, numNodes, shrinkageThreshold) match { |
There was a problem hiding this comment.
Should we update the currSum here? I mean if we shrink the graph and convergence should happen next iteration we won't catch that algorithm is converged because the preSum at the next iteration will be currSum from the iteration when shrinking happened but that sum was computed before shrinking and it is always be bigger.
What changes were proposed in this pull request?
Optimization: Add pruning for connected component 2 phase algorithm.
Why are the changes needed?
boost performance