Merged
64 commits
f4e9cdb
Converted tests to pytest. Build a Python package. Update requirement…
rjurney Feb 16, 2025
c256244
Restore Python .gitignore
rjurney Feb 16, 2025
6c3df0b
Extra newline removed
rjurney Feb 16, 2025
b2838d2
Merge branch 'master' of github.com:graphframes/graphframes into rjur…
rjurney Feb 16, 2025
caf5091
Added VERSION file set to 0.8.5
rjurney Feb 16, 2025
7cfa2d1
isort; fix edgesDF variable name.
rjurney Feb 16, 2025
2ca9a15
Merge branch 'master' of github.com:graphframes/graphframes into rjur…
rjurney Feb 16, 2025
a8bf0be
Back out Dockerfile changes
rjurney Feb 16, 2025
54a942d
Back out version change in build.sbt
rjurney Feb 16, 2025
8b0e346
Backout changes to config and run-tests
rjurney Feb 16, 2025
46c2b93
Back out pytest conversion
rjurney Feb 16, 2025
18b5da0
Back out version changes to make nose tests pass
rjurney Feb 16, 2025
8eca097
Remove changes to requirements
rjurney Feb 16, 2025
277c06f
Put nose back in requirements.txt
rjurney Feb 16, 2025
b55ee48
Remove version bump to version.sbt
rjurney Feb 16, 2025
f8a8fd9
Remove packages related to testing
rjurney Feb 16, 2025
bc2cb36
Remove old setup.py / setup.cfg
rjurney Feb 16, 2025
728be33
New pyproject.toml and poetry.lock
rjurney Feb 16, 2025
3cea1a8
Short README for Python package, poetry won't allow a ../README.md path
rjurney Feb 16, 2025
87cc975
Remove requirements files in favor of pyproject.toml
rjurney Feb 16, 2025
6f84a5a
Try to poetrize CI build
rjurney Feb 16, 2025
9a8eef0
pyspark min 3.4
rjurney Feb 16, 2025
75ecd99
Local python README in pyproject.toml
rjurney Feb 16, 2025
80231d0
Trying to remove the working folder to debug scala issue
rjurney Feb 16, 2025
2a9170b
Set Python working directory again
rjurney Feb 16, 2025
3de2263
Accidental newline
rjurney Feb 16, 2025
4662717
Install Python for test...
rjurney Feb 17, 2025
1b7b9f8
Run tests from python/ folder
rjurney Feb 17, 2025
58da493
Try running tests from python/
rjurney Feb 17, 2025
9f4aa24
poetry run the unit tests
rjurney Feb 17, 2025
11b2782
poetry run the tests
rjurney Feb 17, 2025
9772344
Try just using 'python' instead of a path
rjurney Feb 17, 2025
d55dbfe
poetry run the last line, graphframes.main
rjurney Feb 17, 2025
2fc4d08
Remove test/ folder from style paths, it doesn't exist
rjurney Feb 17, 2025
8297a13
Remove .vscode
rjurney Feb 17, 2025
2035d98
VERSION back to 0.8.4
rjurney Feb 17, 2025
f9f4bd7
Remove tutorials reference
rjurney Feb 17, 2025
9ddd6b2
VERSION is a Python thing, it belongs in python/
rjurney Feb 17, 2025
7065647
Include the README.md and LICENSE in the Python package
rjurney Feb 17, 2025
a6c7e91
Some classifiers for pyproject.toml
rjurney Feb 17, 2025
51e3e6d
Trying poetry install action instead of manual install
rjurney Feb 17, 2025
272be06
Removing SPARK_HOME
rjurney Feb 17, 2025
4587999
Returned SPARK_HOME settings
rjurney Feb 17, 2025
2422b22
Minimized the PR to just these files
rjurney Feb 17, 2025
073dced
Merge in rjurney/build-upgrades and in turn master
rjurney Feb 17, 2025
0a1faba
Created tutorials dependency group to minimize main bloat
rjurney Feb 17, 2025
c0d6d7b
Make motif.py execute in whole again
rjurney Feb 17, 2025
5bb4c26
Minor isort format and cleanup of download.py
rjurney Feb 17, 2025
99e6a4d
Minor isort format and cleanup of utils.py
rjurney Feb 17, 2025
662e197
Removed case sensitivity from the script - that was confusing people …
rjurney Feb 17, 2025
beaa35d
motif.py now matches tutorial code, runs and handles case insensitivity.
rjurney Feb 17, 2025
1bf4a9e
Regenerate poetry.lock
rjurney Feb 21, 2025
ef19784
Setup a 'graphframes stackexchange' command.
rjurney Feb 21, 2025
4400cb4
Make graphframes.tutorials.motif use a checkpoint dir unique, and fro…
rjurney Feb 21, 2025
d549c56
Use spark.sparkContext.setCheckpointDir directly instead of instantia…
rjurney Feb 21, 2025
b970636
Using 'from __future__ import annotations' instead of List and Tuple
rjurney Feb 21, 2025
3788941
Now retry three times if we can't connect for any reason in 'graphfra…
rjurney Feb 21, 2025
e95bbbe
Merge master
rjurney Feb 25, 2025
413a915
Merge branch 'master' of github.com:graphframes/graphframes
rjurney Mar 8, 2025
37ff13a
Add missing image
rjurney Mar 10, 2025
ae3c90a
Final docs fixes pre-release
rjurney Mar 10, 2025
00d1bfb
Minor newline fix
rjurney Mar 10, 2025
b3e2ce9
Fix bash/python messup
rjurney Mar 10, 2025
67e9830
Another newline
rjurney Mar 10, 2025
Binary file added docs/img/Directed-Graphlet-G30.png
9 changes: 0 additions & 9 deletions docs/motif-tutorial.md
@@ -92,7 +92,6 @@ from pyspark import SparkContext
from pyspark.sql import DataFrame, SparkSession

# Initialize a SparkSession

spark: SparkSession = (
SparkSession.builder.appName("Stack Overflow Motif Analysis")
# Lets the Id:(Stack Overflow int) and id:(GraphFrames ULID) coexist
@@ -103,7 +102,6 @@ sc: SparkContext = spark.sparkContext
sc.setCheckpointDir("/tmp/graphframes-checkpoints")

# Change me if you download a different stackexchange site

STACKEXCHANGE_SITE = "stats.meta.stackexchange.com"
BASE_PATH = f"python/graphframes/tutorials/data/{STACKEXCHANGE_SITE}"
{% endhighlight %}
@@ -118,21 +116,17 @@ Load the nodes and edges of the graph from the `data` folder and count the types
#

# We created these in stackexchange.py from Stack Exchange data dump XML files

NODES_PATH: str = f"{BASE_PATH}/Nodes.parquet"
nodes_df: DataFrame = spark.read.parquet(NODES_PATH)

# Repartition the nodes to give our motif searches parallelism

nodes_df = nodes_df.repartition(50).checkpoint().cache()

# We created these in stackexchange.py from Stack Exchange data dump XML files

EDGES_PATH: str = f"{BASE_PATH}/Edges.parquet"
edges_df: DataFrame = spark.read.parquet(EDGES_PATH)

# Repartition the edges to give our motif searches parallelism

edges_df = edges_df.repartition(50).checkpoint().cache()
{% endhighlight %}
</div>
@@ -243,7 +237,6 @@ def add_missing_columns(df: DataFrame, all_cols: List[Tuple[str, T.StructField]]
return df

# Now apply this function to each of your DataFrames to get a consistent schema

posts_df = add_missing_columns(posts_df, all_cols).select(all_column_names)
post_links_df = add_missing_columns(post_links_df, all_cols).select(all_column_names)
users_df = add_missing_columns(users_df, all_cols).select(all_column_names)
@@ -322,7 +315,6 @@ valid_edge_count = (
)

# Just up and die if we have edges that point to non-existent nodes

assert (
edge_count == valid_edge_count
), f"Edge count {edge_count} != valid edge count {valid_edge_count}"
@@ -359,7 +351,6 @@ A complete description of the graph query language is in the [GraphFrames User G
paths = g.find("(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a)")

# Show the first path

paths.show(3)
{% endhighlight %}
</div>
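To make the triangle motif in the hunk above concrete, here is a minimal pure-Python sketch of what a pattern such as `(a)-[e1]->(b); (b)-[e2]->(c); (c)-[e3]->(a)` matches. It is not how GraphFrames implements `find`; the `find_triangles` helper and the sample edge list are hypothetical, and the sketch assumes only that the motif behaves like a join over edges that share vertex names.

```python
def find_triangles(edges):
    """Return every (a, b, c) such that edges a->b, b->c and c->a all exist.

    Hypothetical helper for illustration only; duplicate edges are collapsed.
    """
    # Index outgoing neighbours so each pattern step is a dictionary lookup.
    out_neighbors = {}
    for src, dst in edges:
        out_neighbors.setdefault(src, set()).add(dst)

    matches = []
    for a, b in set(edges):                      # binds (a)-[e1]->(b)
        for c in out_neighbors.get(b, ()):       # binds (b)-[e2]->(c)
            if a in out_neighbors.get(c, ()):    # requires (c)-[e3]->(a)
                matches.append((a, b, c))
    return sorted(matches)


# Hypothetical sample data: one directed triangle plus a dangling edge.
edges = [("x", "y"), ("y", "z"), ("z", "x"), ("z", "w")]
triangles = find_triangles(edges)
print(triangles)
```

In this sketch the single triangle is reported three times, once for each vertex that can play the role of `a`, which is worth keeping in mind when counting matches.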
2 changes: 2 additions & 0 deletions docs/quick-start.md
@@ -66,6 +66,7 @@ val v = spark.createDataFrame(List(
("b", "Bob", 36),
("c", "Charlie", 30)
)).toDF("id", "name", "age")

// Create an Edge DataFrame with "src" and "dst" columns
val e = spark.createDataFrame(List(
("a", "b", "friend"),
@@ -96,6 +97,7 @@ v = spark.createDataFrame([
("b", "Bob", 36),
("c", "Charlie", 30),
], ["id", "name", "age"])

# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
("a", "b", "friend"),
16 changes: 13 additions & 3 deletions docs/user-guide.md
@@ -45,6 +45,7 @@ val v = spark.createDataFrame(List(
("f", "Fanny", 36),
("g", "Gabby", 60)
)).toDF("id", "name", "age")

// Edge DataFrame
val e = spark.createDataFrame(List(
("a", "b", "friend"),
@@ -80,6 +81,7 @@ v = spark.createDataFrame([
("f", "Fanny", 36),
("g", "Gabby", 60)
], ["id", "name", "age"])

# Edge DataFrame
e = spark.createDataFrame([
("a", "b", "friend"),
@@ -172,8 +174,7 @@ from graphframes.examples import Graphs

g = Graphs(spark).friends() # Get example graph

-# Display the vertex and edge DataFrames
-
+# Display the vertex DataFrame
g.vertices.show()

# +--+-------+---+
@@ -188,6 +189,7 @@ g.vertices.show()
# | g| Gabby| 60|
# +--+-------+---+

# Display the edge DataFrame
g.edges.show()

# +---+---+------------+
@@ -368,6 +370,7 @@ from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import IntegerType
from graphframes.examples import Graphs


g = Graphs(spark).friends() # Get example graph

chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
@@ -476,7 +479,6 @@ paths = g.find("(a)-[e]->(b)")\
.filter("a.age < b.age")

# "paths" contains vertex info. Extract the edges

e2 = paths.select("e.src", "e.dst", "e.relationship")

# In Spark 1.5+, the user may simplify this call
@@ -539,6 +541,7 @@ For API details, refer to the [API docs](api/python/graphframes.html#graphframes
{% highlight python %}
from graphframes.examples import Graphs


g = Graphs(spark).friends() # Get example graph

# Search from "Esther" for users of age < 32
@@ -630,6 +633,7 @@ For API details, refer to the [API docs](api/python/graphframes.html#graphframes
{% highlight python %}
from graphframes.examples import Graphs


sc.setCheckpointDir("/tmp/spark-checkpoints")

g = Graphs(spark).friends() # Get example graph
@@ -678,6 +682,7 @@ For API details, refer to the [API docs](api/python/graphframes.html#graphframes
{% highlight python %}
from graphframes.examples import Graphs


g = Graphs(spark).friends() # Get example graph

result = g.labelPropagation(maxIter=5)
@@ -741,6 +746,7 @@ For API details, refer to the [API docs](api/python/graphframes.html#graphframes
{% highlight python %}
from graphframes.examples import Graphs


g = Graphs(spark).friends() # Get example graph

# Run PageRank until convergence to tolerance "tol"
@@ -796,6 +802,7 @@ For API details, refer to the [API docs](api/python/graphframes.html#graphframes
{% highlight python %}
from graphframes.examples import Graphs


g = Graphs(spark).friends() # Get example graph

results = g.shortestPaths(landmarks=["a", "d"])
@@ -832,6 +839,7 @@ For API details, refer to the [API docs](api/python/graphframes.html#graphframes
{% highlight python %}
from graphframes.examples import Graphs


g = Graphs(spark).friends() # Get example graph

results = g.triangleCount()
@@ -875,6 +883,7 @@ val sameG = GraphFrame(sameV, sameE)
{% highlight python %}
from graphframes.examples import Graphs


g = Graphs(spark).friends() # Get example graph

# Save vertices and edges as Parquet to some location
@@ -946,6 +955,7 @@ from graphframes.lib import AggregateMessages as AM
from graphframes.examples import Graphs
from pyspark.sql.functions import sum as sqlsum


g = Graphs(spark).friends() # Get example graph

# For each user, sum the ages of the adjacent users
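The comment above describes summing, for each user, the ages of the adjacent users. As a rough illustration of that aggregation, independent of the code elided from this hunk, here is a pure-Python sketch. The `sum_neighbor_ages` helper and the three-vertex graph are hypothetical, and the sketch assumes that "adjacent" means connected by an edge in either direction.

```python
def sum_neighbor_ages(ages, edges):
    """For each vertex, add up the ages of the vertices it shares an edge with.

    Hypothetical helper for illustration only; it treats every edge as sending
    one "message" to each endpoint, carrying the age of the other endpoint.
    """
    totals = {}
    for src, dst in edges:
        totals[src] = totals.get(src, 0) + ages[dst]  # message sent to src
        totals[dst] = totals.get(dst, 0) + ages[src]  # message sent to dst
    return totals


# Hypothetical sample data, not the full friends() example graph.
ages = {"a": 34, "b": 36, "c": 30}
edges = [("a", "b"), ("b", "c")]
summed = sum_neighbor_ages(ages, edges)
print(summed)
```

A vertex with no edges never receives a message, so it simply does not appear in the result of this sketch.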