Add medallion architecture demo#6389
Conversation
Ecommerce Medallion Architecture demo pipeline. Renamed the other demo files so that the medallion architecture demo shows up second Signed-off-by: Anand Raman <anand.raman@feldera.com>
mythical-fred
left a comment
There was a problem hiding this comment.
Nice addition — medallion architecture is a great story to tell, and the bronze→silver→gold layering with LOCAL VIEWS for silver and MATERIALIZED for gold is exactly the right shape. The demo prose at the top is well written and the INSERT INTO bronze_suppliers ... 10000 ... trick to flip gold_realtime_inventory_alerts is a clean way to make incrementality visible in milliseconds.
A few semantic things to fix before merging, plus some cosmetic items. None of them break the demo, but a couple of them make the metrics technically wrong, which is bad in a showcase.
Main issues
-
gold_inventory_risk.avg_daily_unitsisSUM(quantity) / 30.0over all-time orders, not over the last 30 days. There's no date filter on thesalessubquery —SUM(quantity)covers the entire history ofsilver_confirmed_order_items. Dividing by 30 is then meaningless: if the dataset has 90 days of history, the "average daily units" is 3× too high; if it has 10 days, 3× too low. Either filter to a recent window (WHERE order_created_at > NOW() - INTERVAL '30' DAY) or divide by the actual number of days observed (COUNT(DISTINCT DATE_TRUNC(order_created_at, DAY))). As written, the CRITICAL/WARNING/OK classification is calibrated against a fictitious denominator. Inline comment with details. -
gold_supplier_performancegroups on(supplier_name, supplier_country, lead_time_days)rather than onsupplier_id. This is the only join key all the way down — orders_by_supplier / products_by_supplier / orders_count_by_supplier / silver_inventory_by_supplier all hang off it. Two different suppliers that happen to share a (name, country, lead_time) tuple will silently collapse into one row, andsilver_inventory_by_supplierfurther only groups bysupplier_name, which compounds it. Usesupplier_idas the grain throughout and carry the descriptive columns along for display. -
The connector story disagrees with the connector config. The header says "fed by a Delta snapshot connector and then kept current by the CDC stream", and several Gold-view comments reference "MERGE commits transition order statuses (the core incremental demo)". But every
bronze_*table only has a singlemode: snapshotdelta_table_inputconnector — no follower CDC stream. After the initial snapshot the only way to see incrementality is the ad-hoc INSERT trick shown in the prose. Either drop the CDC language or add the CDC follower connector (mode: cdc/transaction_mode: cdc).
Smaller things
-
silver_orders_enrichedinner-joins to the aggregatedorder_itemssubquery, so orders with zero qualifying line items disappear before reaching gold — including the cancelled-but-empty case. If that's intentional (only emit orders that actually have line items), worth a one-line comment; otherwiseLEFT JOINandCOALESCEthe aggregates. -
silver_inventory_current.total_soldmixes two event types into one number:SUM(ABS(...)) WHERE event_type='sale_reserve'minusSUM(...) WHERE event_type='cancellation_restock'. The sign convention buried inquantity_change(sale_reserve presumably stored negative? cancellation_restock positive?) isn't documented anywhere in the SQL. Even one comment about the sign ofquantity_changeper event_type would save the next reader. -
Formatting is inconsistent — first column of several
CREATE TABLEs starts at column 0 (event_id,order_item_id,product_id,inventory_event_id) while later columns are indented 4 spaces; the'skip_unused_columns'/'connectors'keys also drift between column 0 and indented. Running the file through a formatter would help. The two ORDER BYs ingold_weekly_revenue_trendandgold_cancellation_impactwindow specs sit at column 0 inside an already-indentedOVER (...)clause — same fix. -
The DDL has no
WHEREfilter onbronze_clickstream_eventsanywhere downstream — none of silver or gold references it. If it's there as future material for someone to demo from, fine; if it's leftover, drop it (it has the largest row count and the heaviest connector, so it costs the most to ingest). -
Prose nit: "Append-and-update tables" in the bronze description, but the bronze tables are declared with
PRIMARY KEY(so they're really upsert tables). "Append-only" in the inline comment a few lines down then contradicts the prose. Pick one.
Will be happy to re-approve once 1, 2, and 3 are addressed.
| JOIN ( | ||
| SELECT product_id, | ||
| SUM(quantity) AS units_sold_recent, | ||
| SUM(quantity) / 30.0 AS avg_daily_units, |
There was a problem hiding this comment.
SUM(quantity) here is over all of silver_confirmed_order_items, with no WHERE order_created_at > NOW() - INTERVAL '30' DAY and no GROUP BY ... DATE_TRUNC(...). Dividing the lifetime total by 30 doesn't produce "average daily units" — it scales whatever history the demo dataset happens to contain by a fixed denominator. The downstream days_of_stock_remaining = total_stock / avg_daily_units and the CRITICAL/WARNING thresholds (< lead_time_days * 1.5 / * 3.0) inherit the same bias. Either bound the window or divide by the actual number of distinct days observed.
| SUM(total_restocked) AS total_restocked, | ||
| SUM(total_returned) AS total_returned | ||
| FROM silver_inventory_current | ||
| GROUP BY supplier_name; |
There was a problem hiding this comment.
Grouping on supplier_name alone (and on (supplier_name, supplier_country, lead_time_days) in gold_supplier_performance) silently merges any two suppliers that share those values. Use supplier_id as the grain throughout silver and gold; carry name/country/lead_time as descriptive columns.
| -- Instead of recomputing silver and gold layers on a schedule, Feldera will maintain everything incrementally, using a fraction of the compute required by batch engines. | ||
| -- | ||
| -- ### Bronze — Raw Ingestion (`CREATE TABLE`) | ||
| -- Append-and-update tables that mirror the source systems with no transformation. Each is fed by a Delta snapshot connector and then kept current by the CDC stream. |
There was a problem hiding this comment.
I assume the cdc stream is not always on right in the sandbox? How do we capture this visually?
There was a problem hiding this comment.
the cdc stream is a python script that reads the files from s3 and then pushes the changes to the pipeline. Makes it interactive for the users. I'll add a link to the docs which contains the script and the spark notebook for comparison shortly!
Added demo and demo assets - spark notebook, script to push cdc changes. Made docs updates to sidebar and added a 3 part demo Signed-off-by: Anand Raman <anand.raman@feldera.com>
Ecommerce Medallion Architecture demo pipeline. Renamed the other demo files so that the medallion architecture demo shows up second
Describe Manual Test Plan
Ran the
uv run crates/pipeline-manager/demos/run.py --api-url http://localhost:8080command successfullyChecklist
Breaking Changes?
Mark if you think the answer is yes for any of these components:
Describe Incompatible Changes