You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: posts/python_sql_engine.md
+7-5Lines changed: 7 additions & 5 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -3,7 +3,7 @@
3
3
4
4
## Introduction
5
5
6
-
When I first started writing SQLGlot in early 2021, my goal was just to translate SQL queries from SparkSQL to Presto and vice versa. However, over the last year and a half, I've ended up with a full-fledged SQL engine. SQLGlot can now parse and transpile between [18 SQL dialects](https://github.com/tobymao/sqlglot/blob/main/sqlglot/dialects/__init__.py) and can execute all 24 [TPC-H](https://www.tpc.org/tpch/) SQL queries. The parser and engine are all handwritten from scratch using Python.
6
+
When I first started writing SQLGlot in early 2021, my goal was just to translate SQL queries from SparkSQL to Presto and vice versa. However, over the last year and a half, I've ended up with a full-fledged SQL engine. SQLGlot can now parse and transpile between [18 SQL dialects](https://github.com/tobymao/sqlglot/blob/main/sqlglot/dialects/__init__.py) and can execute all 24 [TPC-H](https://www.tpc.org/tpch/) SQL queries. The parser and engine are all written from scratch using Python.
7
7
8
8
This post will cover [why](#why) I went through the effort of creating a Python SQL engine and [how](#how) a simple query goes from a string to actually transforming data. The following steps are briefly summarized:
9
9
@@ -14,7 +14,7 @@ This post will cover [why](#why) I went through the effort of creating a Python
14
14
*[Executing](#executing)
15
15
16
16
## Why?
17
-
I started working on SQLGlot because of my work on the [experimentation and metrics platform](https://netflixtechblog.com/reimagining-experimentation-analysis-at-netflix-71356393af21) at Netflix, where I built tools that allowed data scientists to define and compute SQL-based metrics. Netflix relied on multiple engines to query data (Spark, Presto, and Druid), so my team built the metrics platform around [PyPika](https://github.com/kayak/pypika), a Python SQL query builder. This way, definitions could be reused across multiple engines. However, it became quickly apparent that writing python code to programatically generate SQL was challenging for data scientists, especially those with academic backgrounds, since they were mostly familiar with R and SQL. But at the time, the only Python SQL parser was [sqlparse]([https://github.com/andialbrecht/sqlparse), which is not actually a parser but a tokenizer, so having users write raw SQL into the platform wasn't really an option. Some time later, I randomly stumbled across [Crafting Interpreters](https://craftinginterpreters.com/) and realized that I could use it as a guide towards creating my own SQL parser/transpiler.
17
+
I started working on SQLGlot because of my work on the [experimentation and metrics platform](https://netflixtechblog.com/reimagining-experimentation-analysis-at-netflix-71356393af21) at Netflix, where I built tools that allowed data scientists to define and compute SQL-based metrics. Netflix relied on multiple engines to query data (Spark, Presto, and Druid), so my team built the metrics platform around [PyPika](https://github.com/kayak/pypika), a Python SQL query builder. This way, definitions could be reused across multiple engines. However, it became quickly apparent that writing python code to programatically generate SQL was challenging for data scientists, especially those with academic backgrounds, since they were mostly familiar with R and SQL. At the time, the only Python SQL parser was [sqlparse]([https://github.com/andialbrecht/sqlparse), which is not actually a parser but a tokenizer, so having users write raw SQL into the platform wasn't really an option. Some time later, I randomly stumbled across [Crafting Interpreters](https://craftinginterpreters.com/) and realized that I could use it as a guide towards creating my own SQL parser/transpiler.
18
18
19
19
Why did I do this? Isn't a Python SQL engine going to be extremely slow?
20
20
@@ -42,7 +42,7 @@ There are many steps involved with actually running a simple query like:
42
42
```sql
43
43
SELECT
44
44
bar.a,
45
-
b +'y'AS b
45
+
b +1AS b
46
46
FROM bar
47
47
JOIN baz
48
48
ONbar.a=baz.a
@@ -94,7 +94,7 @@ Some example rules are:
94
94
- Ensure each column is unambiguous and expand stars.
95
95
96
96
```sql
97
-
SELECT*FROM x`;
97
+
SELECT*FROM x;
98
98
99
99
SELECT"db"."x"AS"x";
100
100
```
@@ -143,7 +143,7 @@ LEFT JOIN (
143
143
GROUP BYy.a
144
144
) AS"_u_0"
145
145
ONx.a="_u_0".a
146
-
WHERE ("_u_0".a = 1 AND NOT "_u_0".a IS NULL)'
146
+
WHERE ("_u_0".a =1AND NOT "_u_0".a IS NULL)
147
147
```
148
148
149
149
### pushdown_predicates
@@ -199,6 +199,8 @@ In order to keep things simple, it evaluates expressions with `eval`. Because SQ
199
199
## What's next
200
200
SQLGlot's main focus will always be on parsing/transpiling, but I plan to continue development on the execution engine. I'd like to pass [TPC-DS](https://www.tpc.org/tpcds/). If someone doesn't beat me to it, I may even take a stab at writing a Pandas/Arrow execution engine.
201
201
202
+
I'm hoping that over time, SQLGlot will spark the Python SQL ecosystem just like Calcite has for Java.
203
+
202
204
## Special thanks
203
205
SQLGlot would not be what it is without it's core contributors. In particular, the execution engine would not exist without [Barak Alon](https://github.com/barakalon) and [George Sittas](https://github.com/GeorgeSittas).
0 commit comments