This is a Python library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).
Like PySpark, it allows you to build a plan through SQL or a DataFrame API against in-memory data, Parquet, or CSV files, run it in a multi-threaded environment, and obtain the result back in Python.
It also allows you to use UDFs and UDAFs for complex operations.
The major advantage of this library over other execution engines is that it achieves zero-copy between Python and its execution engine: there is no cost in using UDFs, UDAFs, or collecting results in Python beyond having to lock the GIL when running those operations.
Its query engine, DataFusion, is written in [Rust](https://www.rust-lang.org/), which provides strong guarantees of thread safety and freedom from memory leaks.
Technically, zero-copy is achieved via the [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
## Example Usage
The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.
```python
from datafusion import SessionContext

# Create a DataFusion context and register the Parquet file as a table
ctx = SessionContext()
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute a SQL query and convert the result into a Pandas DataFrame
df = ctx.sql(
    "select passenger_count, count(*) "
    "from taxi "
    "group by passenger_count "
    "order by passenger_count"
).to_pandas()

fig = df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')
```
This produces the following chart:

## How to install
### Pip
```bash
pip install datafusion
# or
python -m pip install datafusion
```
### Conda
```bash
conda install -c conda-forge datafusion
```
You can verify the installation by running:

```python
>>> import datafusion
>>> datafusion.__version__
'0.6.0'
```
## How to develop
This assumes that you have Rust and Cargo installed. We use the workflow recommended by [pyo3](https://github.com/PyO3/pyo3) and [maturin](https://github.com/PyO3/maturin).
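A typical first build under that workflow might look like the following (a sketch of the standard maturin setup; the virtualenv name and steps are illustrative):

```bash
# create and activate a virtualenv for the build
python3 -m venv venv
source venv/bin/activate

# maturin compiles the Rust extension and installs it into the venv
pip install maturin
maturin develop
```

After `maturin develop` succeeds, `import datafusion` in that virtualenv picks up the locally built extension.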