Skip to content

Add medallion architecture demo#6389

Open
anandbraman wants to merge 3 commits into
mainfrom
medallion-architecture-demo
Open

Add medallion architecture demo#6389
anandbraman wants to merge 3 commits into
mainfrom
medallion-architecture-demo

Conversation

@anandbraman
Copy link
Copy Markdown
Contributor

Ecommerce Medallion Architecture demo pipeline. Renamed the other demo files so that the medallion architecture demo shows up second

Describe Manual Test Plan

Ran the uv run crates/pipeline-manager/demos/run.py --api-url http://localhost:8080 command successfully

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

Describe Incompatible Changes

Ecommerce Medallion Architecture demo pipeline. Renamed the other demo files so that the medallion architecture demo shows up second

Signed-off-by: Anand Raman <anand.raman@feldera.com>
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition — medallion architecture is a great story to tell, and the bronze→silver→gold layering with LOCAL VIEWS for silver and MATERIALIZED for gold is exactly the right shape. The demo prose at the top is well written and the INSERT INTO bronze_suppliers ... 10000 ... trick to flip gold_realtime_inventory_alerts is a clean way to make incrementality visible in milliseconds.

A few semantic things to fix before merging, plus some cosmetic items. None of them break the demo, but a couple of them make the metrics technically wrong, which is bad in a showcase.

Main issues

  1. gold_inventory_risk.avg_daily_units is SUM(quantity) / 30.0 over all-time orders, not over the last 30 days. There's no date filter on the sales subquery — SUM(quantity) covers the entire history of silver_confirmed_order_items. Dividing by 30 is then meaningless: if the dataset has 90 days of history, the "average daily units" is 3× too high; if it has 10 days, 3× too low. Either filter to a recent window (WHERE order_created_at > NOW() - INTERVAL '30' DAY) or divide by the actual number of days observed (COUNT(DISTINCT DATE_TRUNC(order_created_at, DAY))). As written, the CRITICAL/WARNING/OK classification is calibrated against a fictitious denominator. Inline comment with details.

  2. gold_supplier_performance groups on (supplier_name, supplier_country, lead_time_days) rather than on supplier_id. This is the only join key all the way down — orders_by_supplier / products_by_supplier / orders_count_by_supplier / silver_inventory_by_supplier all hang off it. Two different suppliers that happen to share a (name, country, lead_time) tuple will silently collapse into one row, and silver_inventory_by_supplier further only groups by supplier_name, which compounds it. Use supplier_id as the grain throughout and carry the descriptive columns along for display.

  3. The connector story disagrees with the connector config. The header says "fed by a Delta snapshot connector and then kept current by the CDC stream", and several Gold-view comments reference "MERGE commits transition order statuses (the core incremental demo)". But every bronze_* table only has a single mode: snapshot delta_table_input connector — no follower CDC stream. After the initial snapshot the only way to see incrementality is the ad-hoc INSERT trick shown in the prose. Either drop the CDC language or add the CDC follower connector (mode: cdc / transaction_mode: cdc).

Smaller things

  1. silver_orders_enriched inner-joins to the aggregated order_items subquery, so orders with zero qualifying line items disappear before reaching gold — including the cancelled-but-empty case. If that's intentional (only emit orders that actually have line items), worth a one-line comment; otherwise LEFT JOIN and COALESCE the aggregates.

  2. silver_inventory_current.total_sold mixes two event types into one number: SUM(ABS(...)) WHERE event_type='sale_reserve' minus SUM(...) WHERE event_type='cancellation_restock'. The sign convention buried in quantity_change (sale_reserve presumably stored negative? cancellation_restock positive?) isn't documented anywhere in the SQL. Even one comment about the sign of quantity_change per event_type would save the next reader.

  3. Formatting is inconsistent — first column of several CREATE TABLEs starts at column 0 (event_id, order_item_id, product_id, inventory_event_id) while later columns are indented 4 spaces; the 'skip_unused_columns'/'connectors' keys also drift between column 0 and indented. Running the file through a formatter would help. The two ORDER BYs in gold_weekly_revenue_trend and gold_cancellation_impact window specs sit at column 0 inside an already-indented OVER (...) clause — same fix.

  4. The DDL has no WHERE filter on bronze_clickstream_events anywhere downstream — none of silver or gold references it. If it's there as future material for someone to demo from, fine; if it's leftover, drop it (it has the largest row count and the heaviest connector, so it costs the most to ingest).

  5. Prose nit: "Append-and-update tables" in the bronze description, but the bronze tables are declared with PRIMARY KEY (so they're really upsert tables). "Append-only" in the inline comment a few lines down then contradicts the prose. Pick one.

Will be happy to re-approve once 1, 2, and 3 are addressed.

JOIN (
SELECT product_id,
SUM(quantity) AS units_sold_recent,
SUM(quantity) / 30.0 AS avg_daily_units,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUM(quantity) here is over all of silver_confirmed_order_items, with no WHERE order_created_at > NOW() - INTERVAL '30' DAY and no GROUP BY ... DATE_TRUNC(...). Dividing the lifetime total by 30 doesn't produce "average daily units" — it scales whatever history the demo dataset happens to contain by a fixed denominator. The downstream days_of_stock_remaining = total_stock / avg_daily_units and the CRITICAL/WARNING thresholds (< lead_time_days * 1.5 / * 3.0) inherit the same bias. Either bound the window or divide by the actual number of distinct days observed.

SUM(total_restocked) AS total_restocked,
SUM(total_returned) AS total_returned
FROM silver_inventory_current
GROUP BY supplier_name;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grouping on supplier_name alone (and on (supplier_name, supplier_country, lead_time_days) in gold_supplier_performance) silently merges any two suppliers that share those values. Use supplier_id as the grain throughout silver and gold; carry name/country/lead_time as descriptive columns.

Comment thread crates/pipeline-manager/demos/sql/01-medallion-architecture.sql Outdated
-- Instead of recomputing silver and gold layers on a schedule, Feldera will maintain everything incrementally, using a fraction of the compute required by batch engines.
--
-- ### Bronze — Raw Ingestion (`CREATE TABLE`)
-- Append-and-update tables that mirror the source systems with no transformation. Each is fed by a Delta snapshot connector and then kept current by the CDC stream.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the cdc stream is not always on right in the sandbox? How do we capture this visually?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cdc stream is a python script that reads the files from s3 and then pushes the changes to the pipeline. Makes it interactive for the users. I'll add a link to the docs which contains the script and the spark notebook for comparison shortly!

Added demo and demo assets - spark notebook, script to push cdc changes. Made docs updates to sidebar and added a 3 part demo

Signed-off-by: Anand Raman <anand.raman@feldera.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants