
Commit 0d634b6

Add Spark CBO config tips for boosting motif finding performance (#845)
* init

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

* Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* update

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

* format

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

---------

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
1 parent 6a203af commit 0d634b6

3 files changed

Lines changed: 58 additions & 4 deletions

core/src/main/scala/org/graphframes/GraphFrame.scala

Lines changed: 17 additions & 0 deletions
@@ -592,6 +592,23 @@ class GraphFrame private (
 * This can return duplicate rows. E.g., a query `"(u)-[]->()"` will return a result for each
 * matching edge, even if those edges share the same vertex `u`.
 *
+* ==Performance==
+* Motif finding translates patterns into a series of joins. Enabling Spark's Cost-Based
+* Optimizer (CBO) and join reordering can significantly improve performance by letting Spark
+* choose more efficient join orderings based on table statistics:
+* {{{
+* spark.conf.set("spark.sql.cbo.enabled", "true")
+* spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
+* }}}
+* The join reorder algorithm is bounded by `spark.sql.cbo.joinReorder.dp.threshold` (default:
+* `12`). If the estimated number of joins in your motif exceeds this threshold, increase it
+* accordingly:
+* {{{
+* spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "20")
+* }}}
+* CBO relies on table statistics, so run `ANALYZE TABLE <tableName> COMPUTE STATISTICS` on the
+* vertices and edges tables to ensure accurate statistics are available.
+*
 * @param pattern
 * Pattern specifying a motif to search for.
 * @return

docs/src/04-user-guide/04-motif-finding.md

Lines changed: 17 additions & 0 deletions
@@ -55,6 +55,23 @@ can be expressed by applying filters to the result `DataFrame`.
 This can return duplicate rows. E.g., a query `"(u)-[]->()"` will return a result for each
 matching edge, even if those edges share the same vertex `u`.
 
+## Performance
+
+Motif finding translates structural patterns into a series of joins. Enabling Spark's Cost-Based Optimizer (CBO) and join reordering allows Spark to pick more efficient join orderings based on table statistics, which can significantly boost motif finding performance:
+
+```
+spark.conf.set("spark.sql.cbo.enabled", "true")
+spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
+```
+
+The join reorder algorithm uses dynamic programming and is bounded by `spark.sql.cbo.joinReorder.dp.threshold` (default: `12`). If the estimated number of joins in your motif exceeds this threshold, increase it accordingly:
+
+```
+spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "20")
+```
+
+Note that CBO relies on table statistics. Run `ANALYZE TABLE <tableName> COMPUTE STATISTICS` on the vertices and edges tables, or use `spark.sql("ANALYZE TABLE ...")`, to ensure accurate statistics are available.
+
 ## Python API
 
 For API details, refer to the @:pydoc(graphframes.GraphFrame.find).
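
As a rough illustration of the threshold guidance added in this hunk, the sketch below builds the three settings for a given motif and raises `spark.sql.cbo.joinReorder.dp.threshold` only when an estimate exceeds the default of `12`. The helper names `estimate_join_items`, `cbo_settings`, and `apply_settings` are hypothetical and not part of GraphFrames. The estimate (up to three relations per motif term: source vertex, edge, destination vertex) is an assumed upper bound, not a description of how GraphFrames actually plans the query.

```python
DEFAULT_DP_THRESHOLD = 12  # default stated in the docs added by this commit


def estimate_join_items(pattern: str) -> int:
    """Assumed rough upper bound: three relations per ';'-separated motif term."""
    terms = [term for term in pattern.split(";") if term.strip()]
    return 3 * len(terms)


def cbo_settings(pattern: str) -> dict:
    """Return the Spark SQL settings suggested for this motif."""
    settings = {
        "spark.sql.cbo.enabled": "true",
        "spark.sql.cbo.joinReorder.enabled": "true",
    }
    estimate = estimate_join_items(pattern)
    # Only override the threshold when the estimate exceeds Spark's default.
    if estimate > DEFAULT_DP_THRESHOLD:
        settings["spark.sql.cbo.joinReorder.dp.threshold"] = str(estimate)
    return settings


def apply_settings(spark, settings: dict) -> None:
    """Apply settings to any object exposing `conf.set(key, value)`, e.g. a SparkSession."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

With a live session, one would call `apply_settings(spark, cbo_settings(pattern))` before `graph.find(pattern)`. Keeping the settings in a plain dictionary makes them easy to log or review before they are applied.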

python/graphframes/graphframe.py

Lines changed: 24 additions & 4 deletions
@@ -401,12 +401,32 @@ def pregel(self) -> Pregel:
 
 def find(self, pattern: str) -> DataFrame:
 """
-Motif finding.
+Motif finding: searching the graph for structural patterns.
 
-See Scala documentation for more details.
+Motif finding uses a simple Domain-Specific Language (DSL) for expressing structural
+queries. For example, ``graph.find("(a)-[e1]->(b); (b)-[e2]->(a)")`` will search for
+pairs of vertices ``a``, ``b`` connected by edges in both directions. It returns a
+:class:`DataFrame` of all such structures, with columns for each named element (vertex
+or edge) in the motif.
+
+**Performance tip:** Motif finding translates patterns into a series of joins. Enabling
+Spark's Cost-Based Optimizer (CBO) and join reordering can significantly improve
+performance::
+
+spark.conf.set("spark.sql.cbo.enabled", "true")
+spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
+
+The join reorder algorithm is bounded by ``spark.sql.cbo.joinReorder.dp.threshold``
+(default: ``12``). If the estimated number of joins in your motif exceeds this threshold,
+increase it accordingly::
+
+spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "20")
+
+CBO relies on table statistics, so run ``ANALYZE TABLE <tableName> COMPUTE STATISTICS`` on
+the vertices and edges tables (or temp views) to ensure accurate statistics are available.
 
-:param pattern: String describing the motif to search for.
-:return: DataFrame with one Row for each instance of the motif found
+:param pattern: String describing the motif to search for.
+:return: DataFrame with one Row for each instance of the motif found.
 """
 return self._impl.find(pattern=pattern)
 
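
The statistics step mentioned in the docstring could be scripted along these lines. This is a minimal sketch: the helper names `analyze_statements` and `compute_statistics` are hypothetical, and the table names `vertices` and `edges` in the usage note are placeholders that assume both DataFrames were saved as catalog tables. The optional `FOR ALL COLUMNS` clause is standard Spark SQL syntax that additionally collects column-level statistics.

```python
def analyze_statements(table_names, all_columns=False):
    """Build one ANALYZE TABLE statement per table name."""
    suffix = " FOR ALL COLUMNS" if all_columns else ""
    return [
        f"ANALYZE TABLE {name} COMPUTE STATISTICS{suffix}"
        for name in table_names
    ]


def compute_statistics(spark, table_names, all_columns=False):
    """Run the statements through any object exposing `sql(statement)`, e.g. a SparkSession."""
    for statement in analyze_statements(table_names, all_columns):
        spark.sql(statement)
```

For example, `compute_statistics(spark, ["vertices", "edges"])` would issue the two statements from the docstring before the motif query is run.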
