Skip to content

Commit 5e703c0

Browse files
committed
cleanup
1 parent 627752b commit 5e703c0

1 file changed

Lines changed: 7 additions & 5 deletions

File tree

posts/python_sql_engine.md

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@
33

44
## Introduction
55

6-
When I first started writing SQLGlot in early 2021, my goal was just to translate SQL queries from SparkSQL to Presto and vice versa. However, over the last year and a half, I've ended up with a full-fledged SQL engine. SQLGlot can now parse and transpile between [18 SQL dialects](https://github.com/tobymao/sqlglot/blob/main/sqlglot/dialects/__init__.py) and can execute all 24 [TPC-H](https://www.tpc.org/tpch/) SQL queries. The parser and engine are all handwritten from scratch using Python.
6+
When I first started writing SQLGlot in early 2021, my goal was just to translate SQL queries from SparkSQL to Presto and vice versa. However, over the last year and a half, I've ended up with a full-fledged SQL engine. SQLGlot can now parse and transpile between [18 SQL dialects](https://github.com/tobymao/sqlglot/blob/main/sqlglot/dialects/__init__.py) and can execute all 24 [TPC-H](https://www.tpc.org/tpch/) SQL queries. The parser and engine are all written from scratch using Python.
77

88
This post will cover [why](#why) I went through the effort of creating a Python SQL engine and [how](#how) a simple query goes from a string to actually transforming data. The following steps are briefly summarized:
99

@@ -14,7 +14,7 @@ This post will cover [why](#why) I went through the effort of creating a Python
1414
* [Executing](#executing)
1515

1616
## Why?
17-
I started working on SQLGlot because of my work on the [experimentation and metrics platform](https://netflixtechblog.com/reimagining-experimentation-analysis-at-netflix-71356393af21) at Netflix, where I built tools that allowed data scientists to define and compute SQL-based metrics. Netflix relied on multiple engines to query data (Spark, Presto, and Druid), so my team built the metrics platform around [PyPika](https://github.com/kayak/pypika), a Python SQL query builder. This way, definitions could be reused across multiple engines. However, it became quickly apparent that writing python code to programatically generate SQL was challenging for data scientists, especially those with academic backgrounds, since they were mostly familiar with R and SQL. But at the time, the only Python SQL parser was [sqlparse]([https://github.com/andialbrecht/sqlparse), which is not actually a parser but a tokenizer, so having users write raw SQL into the platform wasn't really an option. Some time later, I randomly stumbled across [Crafting Interpreters](https://craftinginterpreters.com/) and realized that I could use it as a guide towards creating my own SQL parser/transpiler.
17+
I started working on SQLGlot because of my work on the [experimentation and metrics platform](https://netflixtechblog.com/reimagining-experimentation-analysis-at-netflix-71356393af21) at Netflix, where I built tools that allowed data scientists to define and compute SQL-based metrics. Netflix relied on multiple engines to query data (Spark, Presto, and Druid), so my team built the metrics platform around [PyPika](https://github.com/kayak/pypika), a Python SQL query builder. This way, definitions could be reused across multiple engines. However, it became quickly apparent that writing python code to programatically generate SQL was challenging for data scientists, especially those with academic backgrounds, since they were mostly familiar with R and SQL. At the time, the only Python SQL parser was [sqlparse]([https://github.com/andialbrecht/sqlparse), which is not actually a parser but a tokenizer, so having users write raw SQL into the platform wasn't really an option. Some time later, I randomly stumbled across [Crafting Interpreters](https://craftinginterpreters.com/) and realized that I could use it as a guide towards creating my own SQL parser/transpiler.
1818

1919
Why did I do this? Isn't a Python SQL engine going to be extremely slow?
2020

@@ -42,7 +42,7 @@ There are many steps involved with actually running a simple query like:
4242
```sql
4343
SELECT
4444
bar.a,
45-
b + 'y' AS b
45+
b + 1 AS b
4646
FROM bar
4747
JOIN baz
4848
ON bar.a = baz.a
@@ -94,7 +94,7 @@ Some example rules are:
9494
- Ensure each column is unambiguous and expand stars.
9595

9696
```sql
97-
SELECT * FROM x`;
97+
SELECT * FROM x;
9898

9999
SELECT "db"."x" AS "x";
100100
```
@@ -143,7 +143,7 @@ LEFT JOIN (
143143
GROUP BY y.a
144144
) AS "_u_0"
145145
ON x.a = "_u_0".a
146-
WHERE ("_u_0".a = 1 AND NOT "_u_0".a IS NULL)'
146+
WHERE ("_u_0".a = 1 AND NOT "_u_0".a IS NULL)
147147
```
148148

149149
### pushdown_predicates
@@ -199,6 +199,8 @@ In order to keep things simple, it evaluates expressions with `eval`. Because SQ
199199
## What's next
200200
SQLGlot's main focus will always be on parsing/transpiling, but I plan to continue development on the execution engine. I'd like to pass [TPC-DS](https://www.tpc.org/tpcds/). If someone doesn't beat me to it, I may even take a stab at writing a Pandas/Arrow execution engine.
201201

202+
I'm hoping that over time, SQLGlot will spark the Python SQL ecosystem just like Calcite has for Java.
203+
202204
## Special thanks
203205
SQLGlot would not be what it is without it's core contributors. In particular, the execution engine would not exist without [Barak Alon](https://github.com/barakalon) and [George Sittas](https://github.com/GeorgeSittas).
204206

0 commit comments

Comments
 (0)