Optimization: Add pruning for connected component 2 phase algorithm by WeichenXu123 · Pull Request #846 · graphframes/graphframes

WeichenXu123 · 2026-06-08T14:19:18Z

What changes were proposed in this pull request?

Optimization: Add pruning for connected component 2 phase algorithm.

Why are the changes needed?

To boost the performance of the two-phase connected components algorithm.

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 · 2026-06-08T14:20:30Z

This optimization is already implemented in the Databricks internal graphframe library. We have now decided to contribute it to OSS. CC @SemyonSinchenko

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Copilot

Pull request overview

This PR adds a pruning-based optimization to the two-phase connected components implementation, aiming to speed up convergence on sparse graphs by temporarily shrinking the graph (removing certain “leaf” nodes) and then joining results back to the original graph.

Changes:

Add leaf-node pruning (pruneLeafNodes) and reconstruction logic (joinBack) to TwoPhase.
Add heuristic gating for when to attempt pruning during iterations, plus a checkpoint-retention tweak to keep required checkpoint data around.
Add unit tests covering pruning behavior, shrinkage gating, and join-back reconstruction.
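
To make the prune-then-join-back idea in the overview concrete, here is a minimal sketch on plain Scala collections rather than Spark DataFrames. It is not the PR's code: the object name, the `components` helper, and the definition of a leaf as a vertex of degree 1 in a simple graph are all assumptions for illustration, and the real `pruneLeafNodes` and `joinBack` may work differently.

```scala
object LeafPruningSketch {
  type Edge = (Long, Long)

  // Reference connected components by naive label propagation:
  // every vertex ends up labelled with the smallest id in its component.
  def components(vertices: Set[Long], edges: Seq[Edge]): Map[Long, Long] = {
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        val m = math.min(labels(a), labels(b))
        if (labels(a) != m || labels(b) != m) {
          labels = labels.updated(a, m).updated(b, m)
          changed = true
        }
      }
    }
    labels
  }

  // Assumption: a "leaf" is a vertex with exactly one incident edge.
  // Returns the edges that survive pruning and a map leaf -> vertex it hangs from.
  def pruneLeaves(edges: Seq[Edge]): (Seq[Edge], Map[Long, Long]) = {
    val degree: Map[Long, Int] =
      edges.flatMap { case (a, b) => Seq(a, b) }
        .groupBy(identity)
        .map { case (v, vs) => v -> vs.size }
    def isLeaf(v: Long): Boolean = degree(v) == 1
    val attach: Map[Long, Long] = edges.flatMap { case (a, b) =>
      // An isolated pair of leaves: keep the smaller id so one endpoint survives.
      if (isLeaf(a) && isLeaf(b)) Seq(math.max(a, b) -> math.min(a, b))
      else if (isLeaf(a)) Seq(a -> b)
      else if (isLeaf(b)) Seq(b -> a)
      else Seq.empty[(Long, Long)]
    }.toMap
    val kept = edges.filterNot { case (a, b) => attach.contains(a) || attach.contains(b) }
    (kept, attach)
  }

  // Each pruned leaf inherits the component of the vertex it was attached to.
  def joinBack(pruned: Map[Long, Long], attach: Map[Long, Long]): Map[Long, Long] =
    pruned ++ attach.map { case (leaf, parent) => leaf -> pruned(parent) }

  def main(args: Array[String]): Unit = {
    val edges: Seq[Edge] = Seq((1L, 2L), (2L, 3L), (3L, 1L), (3L, 4L), (5L, 6L))
    val vertices = edges.flatMap { case (a, b) => Seq(a, b) }.toSet
    val (kept, attach) = pruneLeaves(edges)
    val labels = joinBack(components(vertices -- attach.keySet, kept), attach)
    println(labels.toSeq.sorted)
  }
}
```

One caveat of this sketch: if a pruned leaf carries the smallest id of its component, the joined-back label is no longer the component's minimum id, although the grouping of vertices is unchanged.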

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
core/src/main/scala/org/graphframes/lib/TwoPhase.scala	Implements pruning/join-back optimization and related convergence/iteration logic changes.
core/src/test/scala/org/graphframes/lib/ConnectedComponentsSuite.scala	Adds tests for pruning, shrinkage condition, and join-back behavior.

codecov-commenter · 2026-06-08T14:55:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.92%. Comparing base (a28a4e8) to head (0390ea6).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #846      +/-   ##
==========================================
+ Coverage   80.75%   81.92%   +1.16%     
==========================================
  Files          78       78              
  Lines        4421     4447      +26     
  Branches      543      559      +16     
==========================================
+ Hits         3570     3643      +73     
+ Misses        851      804      -47

SemyonSinchenko · 2026-06-08T16:37:26Z

First of all, thanks a lot for the contribution @WeichenXu123 !

I ran local benchmarks on one of the small LDBC graphs. The first results are very promising ❤️‍🔥❤️‍🔥❤️‍🔥

Wiki-Talk, 2M/5M; the same graph is run in CI and published on the website.
Spark 4.0.2

This PR

[info] Result "org.graphframes.benchmarks.ConnectedComponentsBenchmark.benchmarkConnectedComponents":
[info]   87.427 ±(99.9%) 38.537 s/op [Average]
[info]   (min, avg, max) = (85.203, 87.427, 89.406), stdev = 2.112
[info]   CI (99.9%): [48.890, 125.964] (assumes normal distribution)

Main

[info] Result "org.graphframes.benchmarks.ConnectedComponentsBenchmark.benchmarkConnectedComponents":
[info]   129.323 ±(99.9%) 32.521 s/op [Average]
[info]   (min, avg, max) = (127.966, 129.323, 131.341), stdev = 1.783
[info]   CI (99.9%): [96.802, 161.843] (assumes normal distribution)

On the defaults in GraphFrames

GraphFrames is a very old project, and ConnectedComponents is a widely used algorithm. So we try not to change defaults in minor-version updates, and we consider changing defaults a breaking change. At the same time, a couple of versions ago I experimented with skewedJoin in CC and found that, even on graphs with a gigantic component, AQE handles skew better than the manual collect + broadcast; that is the note about "AQE-mode" in the documentation.

So we actually have two different implementations of two_phase: run and runAQE. Can I ask you to add the same leaf-node optimization to runAQE as well? At the moment runAQE is still about 4x faster than run with the leaf-node optimization (and about 6x faster than master):

[info] Result "org.graphframes.benchmarks.ConnectedComponentsBenchmark.benchmarkConnectedComponents":
[info]   20.206 ±(99.9%) 12.791 s/op [Average]
[info]   (min, avg, max) = (19.401, 20.206, 20.685), stdev = 0.701
[info]   CI (99.9%): [7.415, 32.998] (assumes normal distribution)
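
For reference, the ratios quoted above can be recomputed from the three reported mean timings. The short sketch below only does that arithmetic; the object and function names are made up for illustration, and it ignores the wide confidence intervals JMH reports, so the ratios should be read as rough estimates (about 1.5x for this PR over main, about 4.3x and 6.4x for AQE mode over this PR and main).

```scala
object SpeedupSketch {
  // Speedup of a candidate over a baseline, both given as mean seconds per operation.
  def speedup(baselineSeconds: Double, candidateSeconds: Double): Double =
    baselineSeconds / candidateSeconds

  def main(args: Array[String]): Unit = {
    val mainBranch = 129.323 // mean s/op on main, as reported above
    val thisPr = 87.427      // mean s/op with this PR, as reported above
    val aqeMode = 20.206     // mean s/op in "AQE-mode", as reported above

    println(f"this PR vs main:  ${speedup(mainBranch, thisPr)}%.2fx")
    println(f"AQE vs this PR:   ${speedup(thisPr, aqeMode)}%.2fx")
    println(f"AQE vs main:      ${speedup(mainBranch, aqeMode)}%.2fx")
  }
}
```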

P.S.
To reproduce:

sbt:graphframes> benchmarks/Jmh/run -p graphName=wiki-Talk -p algorithm=two_phase -p broadcastThreshold="10000" org.graphframes.benchmarks.ConnectedComponentsBenchmark

and to run in "AQE-mode":

sbt:graphframes> benchmarks/Jmh/run -p graphName=wiki-Talk -p algorithm=two_phase -p broadcastThreshold="-1" org.graphframes.benchmarks.ConnectedComponentsBenchmark

And thanks again for this contribution! The idea is brilliant and the results look very promising!

SemyonSinchenko · 2026-06-08T17:00:42Z

+        if ((edgeCnt < sparsityThreshold * numNodes) && (edgeCnt > 0)
+          && (iteration >= optStartIter) && (!triedToOptimize)) {
+          edgesBeforePruning = ee
+          pruneLeafNodes(ee, intermediateStorageLevel, numNodes, shrinkageThreshold) match {


Should we update currSum here? If we shrink the graph and convergence would happen on the next iteration, we won't detect that the algorithm has converged: preSum at the next iteration will be the currSum from the iteration in which the shrinking happened, but that sum was computed before shrinking and will always be bigger.
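
A toy illustration of the concern, under loud assumptions: the convergence signal is modelled as the sum of source ids over the current edges, the stopping rule as "sum unchanged since the previous iteration", and pruning as simply dropping two vertices. None of these names or rules are taken from TwoPhase.scala; the sketch only shows why a sum computed before pruning cannot match a sum computed on the smaller graph.

```scala
object StaleSumSketch {
  type Edge = (Long, Long)

  // Assumed convergence signal: sum of source ids over the current edge set.
  def edgeSum(edges: Seq[Edge]): BigInt =
    edges.map { case (src, _) => BigInt(src) }.sum

  // Assumed stopping rule: the signal did not change between two iterations.
  def converged(prevSum: BigInt, currSum: BigInt): Boolean = prevSum == currSum

  def main(args: Array[String]): Unit = {
    // A star that is already in its final shape.
    val beforePruning: Seq[Edge] = Seq((1L, 2L), (1L, 3L), (1L, 4L), (1L, 5L))
    // Stand-in for pruning: pretend vertices 4 and 5 were removed.
    val afterPruning: Seq[Edge] = beforePruning.filter { case (_, dst) => dst <= 3L }

    val staleSum = edgeSum(beforePruning)    // computed before the graph shrank
    val refreshedSum = edgeSum(afterPruning) // recomputed on the pruned graph

    // The next iteration leaves the pruned graph unchanged.
    val nextSum = edgeSum(afterPruning)

    println(converged(staleSum, nextSum))     // stale previous sum: no match
    println(converged(refreshedSum, nextSum)) // refreshed previous sum: match
  }
}
```

In this toy model the stale sum costs one extra iteration before the stopping rule fires; whether the PR's loop behaves the same way depends on details not shown in the diff excerpt.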

WeichenXu123 added 3 commits June 8, 2026 19:24

init

71e6e1a

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

972e67d

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

72fdf98

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Copilot AI review requested due to automatic review settings June 8, 2026 14:19

Copilot started reviewing on behalf of WeichenXu123 June 8, 2026 14:19 View session

format

461ba12

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update

0390ea6

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Optimization: Add pruning for connected component 2 phase algorithm #846
WeichenXu123 wants to merge 5 commits into graphframes:main from WeichenXu123:connected-component-2phase-pruning