Adding motif finding tutorial using the stats.meta.stackexchange.com data dump #473
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,34 +6,165 @@ | |
|
|
||
| # GraphFrames: DataFrame-based Graphs | ||
|
|
||
| This is a package for DataFrame-based graphs on top of Apache Spark. | ||
| Users can write highly expressive queries by leveraging the DataFrame API, combined with a new | ||
| API for motif finding. The user also benefits from DataFrame performance optimizations | ||
| within the Spark SQL engine. | ||
| This is a package for DataFrame-based graphs on top of Apache Spark. Users can write highly expressive queries by leveraging the DataFrame API, combined with a new API for network motif finding. The user also benefits from DataFrame performance optimizations within the Spark SQL engine. GraphFrames works in Java, Scala, and Python. | ||
|
|
||
| You can find user guide and API docs at https://graphframes.github.io/graphframes. | ||
| You can find the user guide and API docs at https://graphframes.github.io/graphframes | |
|
|
||
| ## Installation and Quick-Start | ||
|
|
||
| The easiest way to start using GraphFrames is through the [Spark Packages system](https://spark-packages.org/package/graphframes/graphframes). Just run one of the following commands: | |
|
|
||
| ```bash | ||
| # Interactive Scala/Java | ||
| $ spark-shell --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12 | ||
|
Contributor
graphframes:0.8.3 .4 I believe? |
||
|
|
||
| # Interactive Python | ||
| $ pyspark --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12 | ||
|
|
||
| # Submit a script in Scala/Java/Python | ||
| $ spark-submit --packages graphframes:graphframes:0.8.3-spark3.5-s_2.12 script.py | ||
| ``` | ||
|
|
||
| Now you can create a GraphFrame as follows. | ||
|
|
||
| In Python: | ||
|
|
||
| ```python | ||
| from pyspark.sql import SparkSession | ||
| from graphframes import GraphFrame | ||
|
|
||
| spark = SparkSession.builder.getOrCreate() | ||
|
|
||
| nodes = [ | ||
| (1, "Alice", 30), | ||
| (2, "Bob", 25), | ||
| (3, "Charlie", 35) | ||
| ] | ||
| nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"]) | ||
|
|
||
| edges = [ | ||
| (1, 2, "friend"), | ||
| (2, 1, "friend"), | ||
| (2, 3, "friend"), | ||
| (3, 2, "enemy") # eek! | ||
| ] | ||
| edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"]) | ||
|
|
||
| g = GraphFrame(nodes_df, edges_df) | ||
| ``` | ||
|
|
||
| Now let's run some graph algorithms at scale! | ||
|
|
||
| ```python | ||
| g.inDegrees.show() | ||
|
|
||
| # +---+--------+ | |
| # | id|inDegree| | |
| # +---+--------+ | |
| # |  2|       2| | |
| # |  1|       1| | |
| # |  3|       1| | |
| # +---+--------+ | |
|
|
||
| g.outDegrees.show() | ||
|
|
||
| # +---+---------+ | |
| # | id|outDegree| | |
| # +---+---------+ | |
| # |  1|        1| | |
| # |  2|        2| | |
| # |  3|        1| | |
| # +---+---------+ | |
|
|
||
| g.degrees.show() | ||
|
|
||
| # +---+------+ | |
| # | id|degree| | |
| # +---+------+ | |
| # |  1|     2| | |
| # |  2|     4| | |
| # |  3|     2| | |
| # +---+------+ | |
|
|
||
| g2 = g.pageRank(resetProbability=0.15, tol=0.01) | ||
| g2.vertices.show() | ||
|
|
||
| # +---+-------+---+------------------+ | |
| # | id|   name|age|          pagerank| | |
| # +---+-------+---+------------------+ | |
| # |  1|  Alice| 30|0.7758750474847483| | |
| # |  2|    Bob| 25|1.4482499050305027| | |
| # |  3|Charlie| 35|0.7758750474847483| | |
| # +---+-------+---+------------------+ | |
|
|
||
| # GraphFrames' most used feature... | ||
| # Connected components can do big data entity resolution on billions or even trillions of records! | ||
| # First connect records with a similarity metric, then run connectedComponents. | ||
| # This gives you groups of identical records, which you then link by same_as edges or merge into list-based master records. | ||
| spark.sparkContext.setCheckpointDir("/tmp/graphframes-example-connected-components") # required by GraphFrame.connectedComponents | |
| g.connectedComponents().show() | ||
|
|
||
| # +---+-------+---+---------+ | |
| # | id|   name|age|component| | |
| # +---+-------+---+---------+ | |
| # |  1|  Alice| 30|        1| | |
| # |  2|    Bob| 25|        1| | |
| # |  3|Charlie| 35|        1| | |
| # +---+-------+---+---------+ | |
|
|
||
| # Find frenemies with network motif finding! See how graph and relational queries are combined? | ||
| ( | ||
| g.find("(a)-[e]->(b); (b)-[e2]->(a)") | ||
| .filter("e.relationship = 'friend' and e2.relationship = 'enemy'") | ||
| .show() | ||
| ) | ||
|
|
||
| # These are paths, which you can aggregate and count to find complex patterns. | ||
| # +------------+--------------+----------------+-------------+ | |
| # |           a|             e|               b|           e2| | |
| # +------------+--------------+----------------+-------------+ | |
| # |{2, Bob, 25}|{2, 3, friend}|{3, Charlie, 35}|{3, 2, enemy}| | |
| # +------------+--------------+----------------+-------------+ | |
| ``` | ||
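The entity-resolution recipe described in the quick-start comments (link similar records, then group them with connected components) can be illustrated without Spark. The following is a minimal pure-Python sketch, not GraphFrames code: the `connected_components` helper and the sample `similar_pairs` are hypothetical, and a small union-find stands in for the distributed algorithm behind `g.connectedComponents()`.

```python
from collections import defaultdict


def connected_components(vertex_ids, edges):
    """Group vertex ids into components with a simple union-find (illustrative only)."""
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Walk up to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Keep the smaller id as the representative so labels are deterministic.
            low, high = sorted((root_src, root_dst))
            parent[high] = low

    groups = defaultdict(list)
    for v in vertex_ids:
        groups[find(v)].append(v)
    return dict(groups)


# Hypothetical records: pairs judged "the same entity" by some similarity metric.
record_ids = [1, 2, 3, 4, 5]
similar_pairs = [(1, 2), (2, 3), (4, 5)]

components = connected_components(record_ids, similar_pairs)
print(components)
```

Each key is the representative id of a group of records that would be merged into one master record or linked by `same_as` edges.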
|
|
||
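To make the semantics of the motif `(a)-[e]->(b); (b)-[e2]->(a)` from the quick start concrete, here is a small pure-Python sketch over the same four edges. It is an illustration under stated assumptions, not how GraphFrames evaluates motifs (GraphFrames expresses them as DataFrame joins); the `find_reciprocal_pairs` helper is hypothetical.

```python
# Same toy edges as the quick-start example: (src, dst, relationship).
edges = [
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "enemy"),
]


def find_reciprocal_pairs(edges):
    """Yield (e, e2) pairs matching the motif (a)-[e]->(b); (b)-[e2]->(a)."""
    for e in edges:
        for e2 in edges:
            # e goes a -> b, and e2 must go back from b to a.
            if e[1] == e2[0] and e2[1] == e[0]:
                yield e, e2


# Relational filter applied on top of the structural match,
# mirroring .filter("e.relationship = 'friend' and e2.relationship = 'enemy'").
frenemies = [
    (e, e2)
    for e, e2 in find_reciprocal_pairs(edges)
    if e[2] == "friend" and e2[2] == "enemy"
]
print(frenemies)
```

The structural match alone yields four edge pairs; the relationship filter narrows them to the single Bob/Charlie pair shown in the quick-start output.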
| ## Learn GraphFrames | ||
|
|
||
| To learn more about GraphFrames, check out these resources: | ||
|
|
||
| * [GraphFrames Network Motif Finding Tutorial](https://graphframes.github.io/graphframes/docs/_site/motif-tutorial.html) | ||
| * [Introducing GraphFrames](https://databricks.com/blog/2016/03/03/introducing-graphframes.html) | ||
| * [On-Time Flight Performance with GraphFrames for Apache Spark](https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html) | ||
|
|
||
| ## GraphFrames on PyPI is Unofficial | ||
|
|
||
| The GraphFrames project does not own or control the [graphframes PyPI package](https://pypi.org/project/graphframes/) (installs 0.6.0) or the [graphframes-latest PyPI package](https://pypi.org/project/graphframes-latest/) (installs 0.8.3), and neither package is maintained by the project. We recommend using the Spark Packages system to install the latest version of GraphFrames. | |
|
|
||
| If you are in control of one of these packages, please reach out to us to discuss how we can work together to keep them up to date. Hopefully this situation will be addressed in the near future. | ||
|
|
||
| See [Installation and Quick-Start](#installation-and-quick-start) for the best way to install and use GraphFrames. | ||
|
|
||
| ## GraphFrames Internals | ||
|
|
||
| To learn how GraphFrames works internally to combine graph and relational queries, check out the paper [GraphFrames: An Integrated API for Mixing Graph and | ||
|
Contributor
Add a note about the Google user group? |
||
| Relational Queries, Dave et al. 2016](https://people.eecs.berkeley.edu/~matei/papers/2016/grades_graphframes.pdf). | ||
|
|
||
| ## Building and running unit tests | ||
|
|
||
| To compile this project, run `build/sbt assembly` from the project home directory. | ||
| This will also run the Scala unit tests. | ||
| To compile this project, run `build/sbt assembly` from the project home directory. This will also run the Scala unit tests. | ||
|
|
||
| To run the Python unit tests, run the `run-tests.sh` script from the `python/` directory. | ||
| You will need to set `SPARK_HOME` to your local Spark installation directory. | ||
| To run the Python unit tests, run the `run-tests.sh` script from the `python/` directory. You will need to set `SPARK_HOME` to your local Spark installation directory. | ||
|
|
||
| ## Release new version | ||
|
|
||
| Please see guide `dev/release_guide.md`. | ||
|
|
||
| ## Spark version compatibility | ||
|
|
||
| This project is compatible with Spark 2.4+. However, significant speed improvements have been | ||
| made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest | ||
| Spark version. | ||
| This project is compatible with Spark 2.4+. However, significant speed improvements have been made to DataFrames in more recent versions of Spark, so you may see speedups from using the latest Spark version. | ||
|
Contributor
Spark 3.4 or something.. |
||
|
|
||
| ## Contributing | ||
|
|
||
| GraphFrames is collaborative effort among UC Berkeley, MIT, and Databricks. | ||
| We welcome open source contributions as well! | ||
| GraphFrames is a collaborative effort among UC Berkeley, MIT, Databricks, and the open source community. We welcome open source contributions as well! | |
|
|
||
| ## Releases: | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,28 +1,18 @@ | ||
| Welcome to the GraphFrames Spark Package documentation! | ||
|
|
||
| This readme will walk you through navigating and building the GraphFrames documentation, which is | ||
| included here with the source code. | ||
| This readme will walk you through navigating and building the GraphFrames documentation, which is included here with the source code. | ||
|
|
||
| Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the | ||
| documentation yourself. Why build it yourself? So that you have the docs that correspond to | ||
| whichever version of GraphFrames you currently have checked out of revision control. | ||
| Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that correspond to whichever version of GraphFrames you currently have checked out of revision control. | ||
|
|
||
| ## Generating the Documentation HTML | ||
|
|
||
| We include the GraphFrames documentation as part of the source (as opposed to using a hosted wiki, such as | ||
| the github wiki, as the definitive documentation) to enable the documentation to evolve along with | ||
| the source code and be captured by revision control (currently git). This way the code automatically | ||
| includes the version of the documentation that is relevant regardless of which version or release | ||
| you have checked out or downloaded. | ||
| We include the GraphFrames documentation as part of the source (as opposed to using a hosted wiki, such as the github wiki, as the definitive documentation) to enable the documentation to evolve along with the source code and be captured by revision control (currently git). This way the code automatically | ||
| includes the version of the documentation that is relevant regardless of which version or release you have checked out or downloaded. | ||
|
|
||
| In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can | ||
| read those text files directly if you want. Start with index.md. | ||
| In this directory you will find text files formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md. | |
|
|
||
| The markdown code can be compiled to HTML using the [Jekyll tool](http://jekyllrb.com). | ||
| `Jekyll` and a few dependencies must be installed for this to work. We recommend | ||
| installing via the Ruby Gem dependency manager. Since the exact HTML output | ||
| varies between versions of Jekyll and its dependencies, we list specific versions here | ||
| in some cases: | ||
| `Jekyll` and a few dependencies must be installed for this to work. We recommend installing via the Ruby Gem dependency manager. Since the exact HTML output varies between versions of Jekyll and its dependencies, we list specific versions here in some cases: | ||
|
|
||
| $ sudo gem install jekyll | ||
| $ sudo gem install jekyll-redirect-from | ||
|
Contributor
Remove the use of root to install Python packages. People today use some kind of environment for Python. |
||
|
|
@@ -32,8 +22,7 @@ On macOS, with the default Ruby, please install Jekyll with Bundler as [instruct | |
| $ sudo gem install jekyll bundler | ||
| $ sudo gem install jekyll-redirect-from | ||
|
|
||
| Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with Jekyll will create a directory | ||
| called `_site` containing index.html as well as the rest of the compiled files. | ||
| Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with Jekyll will create a directory called `_site` containing index.html as well as the rest of the compiled files. | ||
|
|
||
| You can modify the default Jekyll build as follows: | ||
|
|
||
|
|
@@ -45,27 +34,23 @@ You can modify the default Jekyll build as follows: | |
| $ PRODUCTION=1 jekyll build | ||
|
|
||
| Note that `SPARK_HOME` must be set to your local Spark installation in order to generate the docs. | ||
|
|
||
| To manually point to a specific `Spark` installation, | ||
| $ SPARK_HOME=<your-path-to-spark-home> PRODUCTION=1 jekyll build | ||
|
|
||
| ## Sphinx | ||
|
|
||
| We use Sphinx to generate Python API docs, so you will need to install it by running | ||
| `sudo pip install sphinx`. | ||
|
|
||
| sudo pip install sphinx | ||
|
|
||
| ## API Docs (Scaladoc, Sphinx) | ||
|
|
||
| You can build just the scaladoc by running `build/sbt unidoc` from the GRAPHFRAMES_PROJECT_ROOT directory. | ||
|
|
||
| Similarly, you can build just the Python docs by running `make html` from the | ||
| GRAPHFRAMES_PROJECT_ROOT/python/docs directory. Documentation is only generated for classes that are listed as | ||
| public in `__init__.py`. | ||
| Similarly, you can build just the Python docs by running `make html` from the GRAPHFRAMES_PROJECT_ROOT/python/docs directory. Documentation is only generated for classes that are listed as public in `__init__.py`. | ||
|
|
||
| When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various | ||
| subprojects into the `docs` directory (and then also into the `_site` directory). We use a | ||
| jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it | ||
| may take some time as it generates all of the scaladoc. The jekyll plugin also generates the | ||
| When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `build/sbt unidoc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the | ||
|
Contributor
Put this, dev/release_guide.md, and docs/_config.yml in their own PR (update docs). |
||
| Python docs using [Sphinx](http://sphinx-doc.org/). | |
|
|
||
| NOTE: To skip the step of building and copying over the Scala, Python API docs, run `SKIP_API=1 | ||
| jekyll build`. To skip building Scala API docs, run `SKIP_SCALADOC=1 jekyll build`; to skip building Python API docs, run `SKIP_PYTHONDOC=1 jekyll build`. | ||
| NOTE: To skip the step of building and copying over the Scala, Python API docs, run `SKIP_API=1 jekyll build`. To skip building Scala API docs, run `SKIP_SCALADOC=1 jekyll build`; to skip building Python API docs, run `SKIP_PYTHONDOC=1 jekyll build`. | ||
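The build switches described above (`SKIP_API`, `SKIP_SCALADOC`, `SKIP_PYTHONDOC`, `PRODUCTION`, `SPARK_HOME`) are plain environment variables, so they can also be set from a script. The sketch below is one possible wrapper, not part of the GraphFrames repository: the helper names `jekyll_build_env` and `build_docs` are hypothetical, and `build_docs` assumes `jekyll` is installed and on the `PATH`.

```python
import os
import subprocess


def jekyll_build_env(skip_scaladoc=False, skip_pythondoc=False,
                     production=False, spark_home=None, base_env=None):
    """Return a copy of the environment with the documented build switches set."""
    env = dict(os.environ if base_env is None else base_env)
    if skip_scaladoc and skip_pythondoc:
        env["SKIP_API"] = "1"  # skip both the Scala and Python API docs
    elif skip_scaladoc:
        env["SKIP_SCALADOC"] = "1"
    elif skip_pythondoc:
        env["SKIP_PYTHONDOC"] = "1"
    if production:
        env["PRODUCTION"] = "1"
    if spark_home is not None:
        env["SPARK_HOME"] = spark_home
    return env


def build_docs(docs_dir="docs", **switches):
    """Run `jekyll build` in docs_dir (requires Jekyll to be installed)."""
    subprocess.run(["jekyll", "build"], cwd=docs_dir,
                   env=jekyll_build_env(**switches), check=True)


if __name__ == "__main__":
    # Show which switches would be set, without invoking Jekyll.
    print(jekyll_build_env(skip_scaladoc=True, production=True, base_env={}))
```

Calling `build_docs(skip_pythondoc=True)` would then be equivalent to running `SKIP_PYTHONDOC=1 jekyll build` from the `docs/` directory.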
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -63,14 +63,21 @@ GraphFrames supplied as a package. | |
| * [Quick Start](quick-start.html): a quick introduction to the GraphFrames API; start here! | ||
| * [GraphFrames User Guide](user-guide.html): detailed overview of GraphFrames | ||
| in all supported languages (Scala, Java, Python) | ||
| * [Motif Finding Tutorial](motif-tutorial.html): learn to perform pattern recognition with GraphFrames using a technique called network motif finding over the knowledge graph for the `stackexchange.com` subdomain [data dump](https://archive.org/details/stackexchange) | ||
|
Contributor
Have a separate PR for the Motif Finding Tutorial. This seems to be big, and we need to try it ourselves before it is merged. |
||
|
|
||
| **API Docs:** | ||
|
|
||
| * [GraphFrames Scala API (Scaladoc)](api/scala/index.html#org.graphframes.package) | ||
| * [GraphFrames Python API (Sphinx)](api/python/index.html) | ||
|
|
||
| **Community Forums:** | ||
|
|
||
| * [GraphFrames Mailing List](https://groups.google.com/g/graphframes/): ask questions about GraphFrames here | ||
| * [#graphframes Discord Channel on GraphGeeks](https://discord.com/channels/1162999022819225631/1326257052368113674) | ||
|
|
||
| **External Resources:** | ||
|
|
||
| * [Apache Spark Homepage](http://spark.apache.org) | ||
| * [Apache Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK) | ||
| * [Mailing Lists](http://spark.apache.org/mailing-lists.html): Ask questions about Spark here | ||
| * [Apache Spark Mailing Lists](http://spark.apache.org/mailing-lists.html) | ||
| * [GraphFrames on Stack Overflow](https://stackoverflow.com/questions/tagged/graphframes) | ||
This is for a Dockerfile that we don't use here on GitHub.
Put this in its own PR,
e.g. one that updates the Dockerfile.