Conversation
```go
rangeSQL := fmt.Sprintf(
	"SELECT min(%[1]s) as `min`, max(%[1]s) as `max`, %[2]s as `watermark` FROM %[3]s %[4]s",
```
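The `%[1]s`-style verbs in the snippet above are Go's indexed format arguments, which let a single argument (here the column name) be reused in several places of the format string. A minimal, self-contained sketch of the same pattern; the column and table names ("created_at", "events") are illustrative only:

```go
package main

import "fmt"

// buildRangeSQL mirrors the query above: %[1]s reuses the first
// argument, so the column name appears in both min() and max()
// without passing it twice.
func buildRangeSQL(col, watermark, table, where string) string {
	return fmt.Sprintf(
		"SELECT min(%[1]s) as `min`, max(%[1]s) as `max`, %[2]s as `watermark` FROM %[3]s %[4]s",
		col, watermark, table, where,
	)
}

func main() {
	fmt.Println(buildRangeSQL("created_at", "max(created_at)", "events", ""))
}
```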
---
This is not an efficient query, even when running on a partition column.

---
An optimization can be done where we check whether this is the table's partition column and read the min/max directly from the partition metadata.
Given this is an often-executed query, I think it can be done in a follow-up. @begelundmuller thoughts?

---
If the optimization can be done in a fast/cheap/safe way, then yeah, it sounds good to me.

---
It can be fast, but to ensure that we do not query information_schema again and again, we need to cache the fact that this is the table's partition column, which requires some changes. Will take it up separately.
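The caching idea above could be sketched roughly like this. The type, field names, and lookup function are hypothetical; a real implementation would query information_schema in `lookup` and handle errors and invalidation:

```go
package main

import (
	"fmt"
	"sync"
)

// partitionColCache remembers, per table, which column is the table's
// partition column, so the (expensive) metadata lookup runs at most
// once per table instead of on every range query.
type partitionColCache struct {
	mu     sync.Mutex
	cols   map[string]string          // table -> partition column ("" if none)
	lookup func(table string) string  // stands in for an information_schema query
	misses int                        // counts real lookups, for illustration
}

func (c *partitionColCache) partitionColumn(table string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if col, ok := c.cols[table]; ok {
		return col // cache hit: no metadata query
	}
	c.misses++
	col := c.lookup(table)
	c.cols[table] = col
	return col
}

func main() {
	c := &partitionColCache{
		cols:   map[string]string{},
		lookup: func(table string) string { return "dt" },
	}
	c.partitionColumn("events")
	c.partitionColumn("events") // served from cache; no second lookup
	fmt.Println(c.misses)       // exactly one real lookup
}
```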
```diff
@@ -180,33 +181,157 @@ func (q *TableHead) generalExport(ctx context.Context, rt *runtime.Runtime, inst
 }

 func (q *TableHead) buildTableHeadSQL(ctx context.Context, olap drivers.OLAPStore) (string, error) {
```
---
It seems like there's a huge complexity increase in this function. Two questions:

- We don't run `TableHead` very often, so is it necessary to optimize it so hard? In general, I would assume people who connect a BI tool to a data warehouse are fine with a `SELECT * FROM tbl LIMIT 100` query being run.
- If it really is necessary, is it possible to combine it into one nested query and push it into the dialect somehow?
---
- It is used in the data preview. On a 100 TB table this can cost a user 600 dollars. This can be a silent "trap" for a user, given BigQuery returns results very fast (as reported by users running such queries on big tables). I agree that users should not use bytes-processed-based pricing when connecting to a BI tool, but we should not leave such traps for users. For example, I found this issue in Superset, where the reporter refused to use Superset with BigQuery until this kind of query was removed: Select * Limit is DANGEROUS in BigQuery apache/superset#17299
- For partition pruning, the filter has to be a static filter; using a dynamic filter is not allowed.

If you are worried about dialect-specific complexity in runtime/queries, then we can take one of the following approaches:

- Disable the data preview for BigQuery in the UI and return an error in the API.
- Use the preview table API, which is free: https://docs.cloud.google.com/bigquery/docs/samples/bigquery-browse-table#bigquery_browse_table-go

Both approaches make this more optimised, given we don't have to scan even one partition (which can still be big).
---
That makes sense. Yeah, I'm just a little worried about the driver-specificity in `TableHead`, especially given we are not adding many new OLAP drivers.
I don't think we should disable previews, but it would just be nice if we could push this into the driver somehow. I'm good with any of these:

- Rewrite `SELECT * FROM tbl LIMIT n` into preview API calls inside `OLAPStore.Query` itself (similar to the code we have here: `rill/runtime/drivers/bigquery/warehouse.go`, lines 38 to 39 in 93b278f)
- Add a `Head` function on the `OLAPStore` interface (other drivers can implement it using a normal `SELECT *`)
- Add it to `drivers.Dialect` somehow (will become clean with Naman's refactors)
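The first option (rewriting `SELECT * FROM tbl LIMIT n` into preview API calls) hinges on recognizing that query shape. A deliberately narrow, hypothetical sketch of such a matcher in Go; a real driver would need a far more robust rewrite inside `OLAPStore.Query`:

```go
package main

import (
	"fmt"
	"regexp"
)

// headPattern matches only the simplest "SELECT * FROM tbl LIMIT n"
// shape; anything with projections, filters, or joins falls through
// to the normal (billed) query path.
var headPattern = regexp.MustCompile(`(?i)^\s*SELECT\s+\*\s+FROM\s+([\w.]+)\s+LIMIT\s+(\d+)\s*;?\s*$`)

// asPreviewCall reports whether sql can be served by the free preview
// (tabledata.list) API, and if so, which table and row limit to use.
func asPreviewCall(sql string) (table string, limit string, ok bool) {
	m := headPattern.FindStringSubmatch(sql)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	table, limit, ok := asPreviewCall("SELECT * FROM mydataset.events LIMIT 100")
	fmt.Println(table, limit, ok)
}
```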
---
I implemented the 2nd option. It leads to some duplicate code, but it seemed the cleanest/safest.
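The 2nd option could look roughly like the sketch below. The interface and method signatures are illustrative, not Rill's actual `drivers.OLAPStore` API: the default store builds a plain `SELECT *`, while a BigQuery store routes to the free preview API instead (represented here by a marker string):

```go
package main

import "fmt"

// olapStore sketches the interface change discussed above: a Head
// method for previewing the first rows of a table.
type olapStore interface {
	Head(table string, limit int) string
}

// sqlStore is the default implementation: a plain SELECT * works for
// most engines.
type sqlStore struct{}

func (sqlStore) Head(table string, limit int) string {
	return fmt.Sprintf("SELECT * FROM %s LIMIT %d", table, limit)
}

// bigqueryStore overrides Head to use the free preview API instead of
// scanning the table.
type bigqueryStore struct{}

func (bigqueryStore) Head(table string, limit int) string {
	return fmt.Sprintf("tabledata.list(%s, maxResults=%d)", table, limit)
}

func main() {
	var s olapStore = sqlStore{}
	fmt.Println(s.Head("events", 100))
	s = bigqueryStore{}
	fmt.Println(s.Head("events", 100))
}
```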
```diff
 type: "object",
 title: "BigQuery",
-"x-category": "warehouse",
+"x-category": "olap",
```
---
This might be problematic; most people will probably still want to use BigQuery as a source for now. I believe Applications would work on a way to give users a choice between OLAP and source for Snowflake and BigQuery, but if that hasn't landed yet, we should probably stick with the old default.
---
Actually, @nishantmonu51 asked to move Snowflake to the OLAP engine for now and handle it as a warehouse for DuckDB in subsequent OLAP work.
I followed the same approach here.
But on reflection, given the BigQuery OLAP connector has been a mixed experience, I am okay with leaving it as a warehouse.
Thoughts @begelundmuller @nishantmonu51?
---
Are you aware of this thread? I think Applications had to patch that change as it was breaking some flows. https://rilldata.slack.com/archives/C093UBT5NLV/p1775582170699349?thread_ts=1775491222.470509&cid=C093UBT5NLV
However, I see their patch didn't involve this specific flag, so maybe you need to change something else. I'll let you look at the thread and the patch and make any necessary changes. I do think we should not break the BigQuery as warehouse flows just yet.
---
No, I wasn't aware of this. Reverted all the UI changes per the patch fix.
closes https://linear.app/rilldata/issue/PLAT-450/metrics-views-on-bigquery

Added

TODOs to be done with follow-ups:

- `civil.Date` to `time.Time` in the rill driver and handle it wherever required

Checklist: