diff --git a/blog/ETL-pipeline-tutorial/Img/adf-dataset.png b/blog/ETL-pipeline-tutorial/Img/adf-dataset.png
new file mode 100644
index 00000000..734f4792
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/adf-dataset.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/adf-elt-flow.png b/blog/ETL-pipeline-tutorial/Img/adf-elt-flow.png
new file mode 100644
index 00000000..08b8387d
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/adf-elt-flow.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/adf-linked-service.png b/blog/ETL-pipeline-tutorial/Img/adf-linked-service.png
new file mode 100644
index 00000000..11dba622
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/adf-linked-service.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/adf-monitor.png b/blog/ETL-pipeline-tutorial/Img/adf-monitor.png
new file mode 100644
index 00000000..4b5672fb
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/adf-monitor.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/adf-pipeline-debug.png b/blog/ETL-pipeline-tutorial/Img/adf-pipeline-debug.png
new file mode 100644
index 00000000..b0fff1e2
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/adf-pipeline-debug.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/adf-pipeline-overview.png b/blog/ETL-pipeline-tutorial/Img/adf-pipeline-overview.png
new file mode 100644
index 00000000..ddf55d30
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/adf-pipeline-overview.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/adf_flow.png b/blog/ETL-pipeline-tutorial/Img/adf_flow.png
new file mode 100644
index 00000000..1c27b687
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/adf_flow.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/pipeline.png b/blog/ETL-pipeline-tutorial/Img/pipeline.png
new file mode 100644
index 00000000..7ec60c3c
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/pipeline.png differ
diff --git a/blog/ETL-pipeline-tutorial/Img/step-1.png b/blog/ETL-pipeline-tutorial/Img/step-1.png
new file mode 100644
index 00000000..1c629ce7
Binary files /dev/null and b/blog/ETL-pipeline-tutorial/Img/step-1.png differ
diff --git a/blog/ETL-pipeline-tutorial/index.md b/blog/ETL-pipeline-tutorial/index.md
new file mode 100644
index 00000000..065a2f2d
--- /dev/null
+++ b/blog/ETL-pipeline-tutorial/index.md
@@ -0,0 +1,365 @@
+---
+title: "Azure Data Factory Pipeline: Build Your First ETL in 10 Minutes"
+authors: [Aditya-Singh-Rathore]
+sidebar_label: "Azure Data Factory — Build Your First ETL"
+tags: [azure-data-factory, adf, etl, data-pipeline, data-engineering, azure, blob-storage, adls, copy-activity, linked-service, dataset, trigger]
+date: 2026-05-06
+
+description: Azure Data Factory is Microsoft's cloud-native ETL service — a visual, no-code platform for moving and transforming data at scale. This step-by-step guide walks you through building your first real pipeline in under 10 minutes, explaining every concept along the way.
+
+draft: false
+canonical_url: https://www.recodehive.com/blog/azure-data-factory-build-first-etl
+
+meta:
+  - name: "robots"
+    content: "index, follow"
+  - property: "og:title"
+    content: "Azure Data Factory Pipeline: Build Your First ETL in 10 Minutes"
+  - property: "og:description"
+    content: "Azure Data Factory is Microsoft's cloud-native ETL service. Here's how to build your first real pipeline in under 10 minutes, step by step."
+  - property: "og:type"
+    content: "article"
+  - property: "og:url"
+    content: "https://www.recodehive.com/blog/azure-data-factory-build-first-etl"
+  - property: "og:image"
+    content: "/img/blogs/adf-cover_image.png"
+  - name: "twitter:card"
+    content: "summary_large_image"
+  - name: "twitter:title"
+    content: "Azure Data Factory Pipeline: Build Your First ETL in 10 Minutes"
+  - name: "twitter:description"
+    content: "Azure Data Factory is Microsoft's cloud-native ETL service. Here's how to build your first real pipeline in under 10 minutes."
+  - name: "twitter:image"
+    content: "/img/blogs/adf-cover_image.png"
+
+---
+
+
+
+# Azure Data Factory Pipeline: Build Your First ETL in 10 Minutes
+
+The first time someone asked me to "build an ETL pipeline," I nodded confidently and then quietly searched "what is ETL" on my second monitor.
+
+Extract. Transform. Load.
+
+Three words that describe something every data team does dozens of times a day — pulling data from somewhere, doing something to it, and putting it somewhere more useful. Simple idea. Historically, painful to implement.
+
+You'd write Python scripts that broke when the source schema changed. You'd schedule them with cron jobs that nobody monitored. You'd debug failures at 2am by reading raw logs.
+
+**Azure Data Factory** (ADF) exists to replace all of that with a visual, managed, scalable pipeline service, one where you can build a working ETL in minutes, not days, and monitor it from a dashboard instead of a terminal.
+
+This guide walks you through everything: the concepts, the components, and a complete step-by-step pipeline you can build right now.
+
+
+
+## What Is Azure Data Factory?
+
+Azure Data Factory is Microsoft's cloud-native ETL and data integration service. It lets you build **data pipelines**, workflows that move data from one place to another, transform it along the way, and load it into a destination where it's actually useful.
+
+The key word is *visual*. ADF gives you a drag-and-drop canvas where you connect activities, configure sources and destinations, and build complex workflows without writing infrastructure code.
+
+Under the hood, it handles:
+- Connecting to 90+ data sources (databases, APIs, files, SaaS apps)
+- Moving data at scale using managed compute
+- Scheduling and triggering pipeline runs
+- Monitoring, alerting, and retry logic
+
+Think of it as the **orchestration layer** of your Azure data stack, the thing that decides what data moves where, when, and how.
+
+
+
+
+## The 4 Concepts You Need to Know First
+
+Before you touch the UI, these four concepts need to click. Everything in ADF is built on them.
+
+### 1. Linked Service: The Connection
+
+A **Linked Service** works much like a connection string. It tells ADF how to connect to an external resource, such as a storage account, a database, or an API.
+
+Think of it as the key to a door. Before ADF can read from your Blob Storage or write to your SQL database, it needs a Linked Service that holds the credentials and connection details for that resource.
+
+You create a Linked Service once, then reuse it across as many datasets and pipelines as you need.
+
+
+
+**Examples:**
+- `AzureStorageLinkedService` → connects to your ADLS Gen2 account
+- `AzureSqlLinkedService` → connects to your Azure SQL Database
+- `RestApiLinkedService` → connects to an external HTTP API
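+
+Behind the UI, ADF stores each Linked Service as a JSON definition. The sketch below builds one as a Python dictionary so the shape is easy to see. `AzureBlobFS` is the connector type ADF uses for ADLS Gen2; the account name `mydatalake` and the placeholder key are hypothetical, and in a real factory the secret would normally come from Azure Key Vault rather than sit inline.
+
+```python
+import json
+
+
+def adls_linked_service(name: str, account: str, account_key: str) -> dict:
+    """Build an ADF-style linked service definition for an ADLS Gen2 account."""
+    return {
+        "name": name,
+        "properties": {
+            "type": "AzureBlobFS",  # ADF's connector type for ADLS Gen2
+            "typeProperties": {
+                # ADLS Gen2 is addressed through the account's dfs endpoint
+                "url": f"https://{account}.dfs.core.windows.net",
+                "accountKey": {"type": "SecureString", "value": account_key},
+            },
+        },
+    }
+
+
+# "mydatalake" and "<account-key>" are placeholders, not real values
+definition = adls_linked_service("AzureStorageLinkedService", "mydatalake", "<account-key>")
+print(json.dumps(definition, indent=2))
+```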
+
+### 2. Dataset: The Pointer
+
+A **Dataset** points to the specific data within a Linked Service.
+
+If the Linked Service is the key to the building, the Dataset is the directions to a specific room inside it. It tells ADF: *"The data I care about is in this container, in this folder, in this file format."*
+
+**Examples:**
+- A Dataset pointing to `bronze/sales/2024/jan/*.csv` in your ADLS Gen2 account
+- A Dataset pointing to the `[dbo].[orders]` table in your SQL database
+- A Dataset describing a Parquet file with a known schema
+
+### 3. Activity: The Work
+
+An **Activity** is a single step of work inside a pipeline. ADF has three categories:
+
+- **Data Movement** — Copy data from source to destination. The **Copy Activity** is the most common one you'll use.
+- **Data Transformation** — Transform data using Mapping Data Flows, Databricks notebooks, or stored procedures.
+- **Control Flow** — Logic and orchestration: If/Else conditions, ForEach loops, Wait activities, Execute Pipeline (call another pipeline).
+
+### 4. Pipeline: The Workflow
+
+A **Pipeline** is a logical grouping of activities that together perform a unit of work.
+
+Your pipeline might have three activities: a Copy Activity to land raw data, a Data Flow activity to clean it, and a Stored Procedure activity to update a watermark table. Together they form one repeatable workflow.
+
+
+
+## The ETL Flow in ADF: Visualised
+
+Here's how all four concepts connect in a real pipeline:
+
+
+
+
+
+
+
+## Build Your First Pipeline: Step by Step
+
+Let's build a real pipeline: copy a CSV file from a `source` container into a `bronze/sales/` path, both in the same ADLS Gen2 storage account.
+
+**What you need before starting:**
+- An Azure account (free trial works fine)
+- A Storage Account with hierarchical namespace enabled (ADLS Gen2)
+- A CSV file uploaded to a container called `source/`
+
+
+
+### Step 1: Create an Azure Data Factory
+
+1. Go to the [Azure Portal](https://portal.azure.com)
+2. Search for **Data Factory** → click **Create**
+3. Fill in the details:
+   - Resource Group: your existing one or create new
+   - Name: `sales-data-factory` (must be globally unique)
+   - Region: same as your storage account
+4. Click **Review + Create** → **Create**
+5. Once deployed, click **Launch Studio**
+
+You're now in **ADF Studio**, the visual authoring environment.
+
+
+
+
+
+### Step 2: Create a Linked Service for Your Storage Account
+
+1. In ADF Studio, click **Manage** (toolbox icon, left sidebar)
+2. Click **Linked Services** → **New**
+3. Search for **Azure Data Lake Storage Gen2** → Select → Continue
+4. Fill in:
+   - Name: `ADLSGen2LinkedService`
+   - Authentication: Account Key (simplest for now)
+   - Storage Account: select yours from the dropdown
+5. Click **Test Connection**: you should see ✅ Connection successful
+6. Click **Create**
+
+
+
+
+
+
+### Step 3: Create the Source Dataset
+
+This dataset points to the CSV file in your `source/` container.
+
+1. Click **Author** (pencil icon, left sidebar)
+2. Click **+** → **Dataset**
+3. Search for **Azure Data Lake Storage Gen2** → Continue
+4. Select **Delimited Text** (CSV format) → Continue
+5. Fill in:
+   - Name: `SourceCSVDataset`
+   - Linked Service: `ADLSGen2LinkedService`
+   - File path: `source/` → browse and select your CSV file
+   - First row as header: ✅ checked
+6. Click **OK**
+
+
+
+
+
+### Step 4: Create the Sink Dataset
+
+This dataset points to where the file should land, your `bronze/` folder.
+
+1. Click **+** → **Dataset** again
+2. Same steps — **Azure Data Lake Storage Gen2** → **Delimited Text**
+3. Fill in:
+   - Name: `BronzeCSVDataset`
+   - Linked Service: `ADLSGen2LinkedService`
+   - File path: `bronze/sales/` (type this manually; the folder doesn't need to exist yet because ADF will create it)
+4. Click **OK**
+
+
+
+### Step 5: Build the Pipeline
+
+1. Click **+** → **Pipeline** → name it `CopySalesToBronze`
+2. From the **Activities** panel on the left, expand **Move & Transform**
+3. Drag **Copy data** onto the canvas
+4. Click the Copy Activity to open its settings:
+
+**Source tab:**
+- Source dataset: `SourceCSVDataset`
+
+**Sink tab:**
+- Sink dataset: `BronzeCSVDataset`
+- Copy behavior: `PreserveHierarchy`
+
+**Mapping tab:**
+- Click **Import schemas** - ADF reads your CSV headers and maps columns automatically
+
+5. Click **Validate** (toolbar) - you should see no errors
+6. Click **Debug** - this runs the pipeline immediately without publishing
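+
+Everything you configured on the canvas is saved as a JSON pipeline definition. The following is a trimmed sketch of what that definition looks like for this Copy Activity, written as a Python dictionary. The activity name `CopySourceToBronze` is a hypothetical label, and a real export from ADF Studio contains more settings (format options, policies, column mappings) than shown here.
+
+```python
+import json
+
+pipeline = {
+    "name": "CopySalesToBronze",
+    "properties": {
+        "activities": [
+            {
+                "name": "CopySourceToBronze",  # hypothetical activity name
+                "type": "Copy",
+                "inputs": [
+                    {"referenceName": "SourceCSVDataset", "type": "DatasetReference"}
+                ],
+                "outputs": [
+                    {"referenceName": "BronzeCSVDataset", "type": "DatasetReference"}
+                ],
+                "typeProperties": {
+                    "source": {"type": "DelimitedTextSource"},
+                    "sink": {
+                        "type": "DelimitedTextSink",
+                        # the "Copy behavior" option from the Sink tab
+                        "storeSettings": {
+                            "type": "AzureBlobFSWriteSettings",
+                            "copyBehavior": "PreserveHierarchy",
+                        },
+                    },
+                },
+            }
+        ]
+    },
+}
+
+
+def referenced_datasets(definition: dict) -> set[str]:
+    """Collect every dataset name the pipeline's activities read or write."""
+    names = set()
+    for activity in definition["properties"]["activities"]:
+        for ref in activity.get("inputs", []) + activity.get("outputs", []):
+            names.add(ref["referenceName"])
+    return names
+
+
+print(json.dumps(pipeline, indent=2))
+print(sorted(referenced_datasets(pipeline)))
+```
+
+Reading the definition this way makes the four concepts visible at once: the pipeline holds an activity, the activity references two datasets, and each dataset in turn points at a Linked Service.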
+
+
+
+
+### Step 6: Publish and Add a Trigger
+
+Once Debug runs successfully:
+
+1. Click **Publish All** (top toolbar) - this saves everything to ADF
+2. Click **Add trigger** → **New/Edit**
+3. Click **New** → configure:
+   - Type: **Schedule**
+   - Start: today's date
+   - Recurrence: **Every 1 Day** at `02:00 AM`
+4. Click **OK** → **OK**
+5. Click **Publish All** again
+
+Your pipeline now runs automatically every night at 2am, copying new sales data into your bronze layer.
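+
+As a rough sketch, the schedule configured above corresponds to a trigger definition like the one below, followed by a small helper that works out when the next runs would fire. The trigger name `NightlyAt2am`, the start time, and the UTC time zone are assumptions for illustration, and `next_runs` is a hypothetical helper, not part of ADF.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+trigger = {
+    "name": "NightlyAt2am",  # hypothetical trigger name
+    "properties": {
+        "type": "ScheduleTrigger",
+        "typeProperties": {
+            "recurrence": {
+                "frequency": "Day",
+                "interval": 1,
+                "startTime": "2026-05-06T00:00:00Z",  # assumed start
+                "timeZone": "UTC",                    # assumed time zone
+                "schedule": {"hours": [2], "minutes": [0]},
+            }
+        },
+        "pipelines": [
+            {
+                "pipelineReference": {
+                    "referenceName": "CopySalesToBronze",
+                    "type": "PipelineReference",
+                }
+            }
+        ],
+    },
+}
+
+
+def next_runs(after: datetime, hour: int, minute: int, count: int) -> list[datetime]:
+    """Return the next `count` daily run times strictly after `after`."""
+    first = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
+    if first <= after:
+        first += timedelta(days=1)  # today's slot has already passed
+    return [first + timedelta(days=offset) for offset in range(count)]
+
+
+schedule = trigger["properties"]["typeProperties"]["recurrence"]["schedule"]
+now = datetime(2026, 5, 6, 10, 0, tzinfo=timezone.utc)
+for run in next_runs(now, schedule["hours"][0], schedule["minutes"][0], count=3):
+    print(run.isoformat())
+```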
+
+
+
+### Step 7: Monitor Your Pipeline
+
+1. Click **Monitor** (chart icon, left sidebar)
+2. You'll see all pipeline runs - status, duration, rows copied
+3. Click any run to see activity-level details
+4. If something fails, click the error icon to see exactly which activity failed and why
+
+
+
+
+
+
+## What Just Happened: The Full Picture
+
+Let's step back and look at what you built:
+
+
+
+This is the **Extract and Load** part of ETL. The file is extracted from the source container and loaded into the bronze layer, untouched, exactly as it arrived.
+
+
+
+## What Comes Next: Transform
+
+The pipeline you built moves data. To transform it, you add one of two things after the Copy Activity:
+
+**Option 1 — Mapping Data Flow** (no-code)
+A visual transformation canvas inside ADF. Drag and drop transformations such as Filter, Join, Aggregate, and Derived Column. Runs on Spark under the hood. Great for teams that don't want to write code.
+
+**Option 2 — Databricks Notebook Activity**
+Call an existing Databricks notebook from your ADF pipeline. The notebook runs your Python/Spark transformation logic and writes cleaned data to the silver layer. Best for complex transformations that need code.
+
+The full Medallion Architecture flow in ADF looks like this:
+
+```
+Source API / Database
+ ↓
+Copy Activity → bronze/ (raw data, as-is)
+ ↓
+Mapping Data Flow / Databricks Notebook → silver/ (cleaned, validated)
+ ↓
+Mapping Data Flow / Databricks Notebook → gold/ (aggregated, business-ready)
+ ↓
+Power BI DirectLake → Dashboard
+```
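+
+To make the layers in the flow above concrete, here is a toy sketch in plain Python. It stands in for what a Data Flow or Databricks notebook would do at scale; the sample rows, column names, and the functions `to_silver` and `to_gold` are all made up for illustration.
+
+```python
+import csv
+import io
+from collections import defaultdict
+
+# Bronze: raw text exactly as it arrived, including a row with a missing amount
+BRONZE_CSV = "order_id,region,amount\n1,north,100.50\n2,south,\n3,north,49.50\n"
+
+
+def to_silver(bronze_csv: str) -> list[dict]:
+    """Clean and validate: drop rows without an amount and cast the types."""
+    rows = csv.DictReader(io.StringIO(bronze_csv))
+    return [
+        {"order_id": int(row["order_id"]), "region": row["region"], "amount": float(row["amount"])}
+        for row in rows
+        if row["amount"]
+    ]
+
+
+def to_gold(silver_rows: list[dict]) -> dict[str, float]:
+    """Aggregate: total revenue per region, ready for a dashboard."""
+    totals: dict[str, float] = defaultdict(float)
+    for row in silver_rows:
+        totals[row["region"]] += row["amount"]
+    return dict(totals)
+
+
+silver = to_silver(BRONZE_CSV)
+gold = to_gold(silver)
+print(silver)
+print(gold)
+```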
+
+
+
+## Triggers: When Does Your Pipeline Run?
+
+ADF gives you three trigger types, plus manual (on-demand) runs:
+
+| Trigger Type | When it fires | Use case |
+|---|---|---|
+| **Schedule** | At a fixed time/frequency | Nightly batch loads |
+| **Tumbling Window** | Fixed intervals with state | Hourly incremental loads |
+| **Storage Event** | When a file arrives in storage | File-arrival driven pipelines |
+| **Manual** | On demand | One-time loads, testing |
+
+For file-driven production pipelines, **Storage Event triggers** are often the best fit: your pipeline fires automatically when a new file lands in your container, with no polling or scheduling lag.
+
+
+
+## Common Mistakes Beginners Make
+
+**1. Using the same Linked Service for every environment**
+Create separate Linked Services for dev, staging, and production. Use ADF's **parameterisation** to swap them out without changing pipeline logic.
+
+**2. Not testing with Debug before publishing**
+Always Debug first. Publishing without testing means failures hit production. Debug runs are listed separately from triggered runs in the Monitor hub, so test runs don't clutter your production run history.
+
+**3. Hardcoding file paths in datasets**
+Parameterise your datasets so the same pipeline can process different files dynamically. One pipeline, many files, not one pipeline per file.
+
+**4. No monitoring alerts**
+Set up Azure Monitor alerts for pipeline failures. You shouldn't find out a pipeline failed when someone asks why last night's data is missing.
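+
+To illustrate mistake 3, the idea behind a parameterised dataset is that the path is computed from inputs instead of being typed in once. In ADF you would declare dataset parameters and reference them with expressions such as `@dataset().fileName`; the Python sketch below shows only the idea, and the function `bronze_path` and its folder layout are hypothetical.
+
+```python
+from datetime import date
+
+
+def bronze_path(domain: str, run_date: date, file_name: str) -> str:
+    """Build a bronze-layer path from parameters instead of hardcoding it."""
+    return f"bronze/{domain}/{run_date:%Y/%m}/{file_name}"
+
+
+# One function (one pipeline), many files
+for name in ["orders.csv", "customers.csv"]:
+    print(bronze_path("sales", date(2024, 1, 15), name))
+```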
+
+
+
+## Key Takeaways
+
+**1. ADF is built on four concepts.** Linked Services (connections), Datasets (pointers), Activities (work), Pipelines (workflows). Everything else is a variation of these four.
+
+**2. The Copy Activity is your workhorse.** It works with ADF's 90+ connectors and handles schema mapping, file format conversion, and retry logic out of the box.
+
+**3. ADF is the orchestration layer, not the transformation layer.** For heavy transformations, the pipeline hands the work to Databricks or to Mapping Data Flows running on Spark; the pipeline itself only coordinates the steps.
+
+**4. Triggers make pipelines production-ready.** A pipeline without a trigger is just a script you run manually. Add a trigger and it becomes infrastructure.
+
+**5. ADF fits naturally into Medallion Architecture.** Copy Activity lands data in bronze. Data Flows or Databricks jobs process silver and gold. ADF orchestrates the whole sequence.
+
+
+## References & Further Reading
+
+- [Microsoft Docs: Introduction to Azure Data Factory](https://learn.microsoft.com/en-us/azure/data-factory/introduction)
+- [Microsoft Docs: Copy Activity in ADF](https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview)
+- [Microsoft Docs: ADF Tutorial - Copy data using Azure portal](https://learn.microsoft.com/en-us/azure/data-factory/tutorial-copy-data-portal)
+- [Microsoft Docs: Mapping Data Flows](https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-overview)
+- [Microsoft Docs: Triggers in ADF](https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers)
+- [RecodeHive - Azure Storage & ADLS Gen2: Where Does Your Data Actually Live?](https://www.recodehive.com/blog/azure-storage-options)
+- [RecodeHive - Microsoft Fabric: One Platform, One Lake](https://www.recodehive.com/blog/microsoft-fabric-explained)
+
+
+## About the Author
+
+I'm **Aditya Singh Rathore**, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on [RecodeHive](https://www.recodehive.com/), breaking down complex concepts into things you can actually use.
+
+🔗 [LinkedIn](https://www.linkedin.com/in/aditya-singh-rathore0017/) | [GitHub](https://github.com/Adez017)
+
+📩 Stuck on a specific ADF activity or pipeline pattern? Drop your question in the comments.
+
+
diff --git a/blog/azure-storage-options/img/azure-storage-four-types.png b/blog/azure-storage-options/img/azure-storage-four-types.png
new file mode 100644
index 00000000..5717ba6a
Binary files /dev/null and b/blog/azure-storage-options/img/azure-storage-four-types.png differ
diff --git a/blog/azure-storage-options/img/azure-storage-full-pipeline.png b/blog/azure-storage-options/img/azure-storage-full-pipeline.png
new file mode 100644
index 00000000..0d21c81e
Binary files /dev/null and b/blog/azure-storage-options/img/azure-storage-full-pipeline.png differ
diff --git a/blog/azure-storage-options/img/blob-vs-adls-comparison.png b/blog/azure-storage-options/img/blob-vs-adls-comparison.png
new file mode 100644
index 00000000..88caa795
Binary files /dev/null and b/blog/azure-storage-options/img/blob-vs-adls-comparison.png differ
diff --git a/blog/azure-storage-options/img/blob_types.png b/blog/azure-storage-options/img/blob_types.png
new file mode 100644
index 00000000..84da686f
Binary files /dev/null and b/blog/azure-storage-options/img/blob_types.png differ
diff --git a/blog/azure-storage-options/img/queue_storage.png b/blog/azure-storage-options/img/queue_storage.png
new file mode 100644
index 00000000..60ae2fb6
Binary files /dev/null and b/blog/azure-storage-options/img/queue_storage.png differ
diff --git a/blog/azure-storage-options/index.md b/blog/azure-storage-options/index.md
new file mode 100644
index 00000000..6172ae85
--- /dev/null
+++ b/blog/azure-storage-options/index.md
@@ -0,0 +1,379 @@
+---
+title: "Azure Storage & ADLS Gen2: Where Does Your Data Actually Live?"
+authors: [Aditya-Singh-Rathore]
+sidebar_label: "Azure Storage & ADLS Gen2: Complete Guide"
+tags: [azure-storage, blob-storage, adls-gen2, azure-data-lake, queue-storage, table-storage, file-storage, data-engineering, azure, big-data, medallion-architecture]
+date: 2026-05-06
+
+description: Every Azure data pipeline needs a place to store data. But Azure gives you four different storage types and choosing the wrong one is easier than you think. This guide explains all four, shows how they work together in a real pipeline, and goes deep on ADLS Gen2, the storage layer that powers modern Azure data engineering.
+
+draft: false
+canonical_url: https://www.recodehive.com/blog/azure-storage-adls-gen2-complete-guide
+
+meta:
+  - name: "robots"
+    content: "index, follow"
+  - property: "og:title"
+    content: "Azure Storage & ADLS Gen2: Where Does Your Data Actually Live?"
+  - property: "og:description"
+    content: "Azure gives you four storage types and choosing the wrong one is easier than you think. Here's how they all fit together, and why ADLS Gen2 is the one that matters most for data engineering."
+  - property: "og:type"
+    content: "article"
+  - property: "og:url"
+    content: "https://www.recodehive.com/blog/azure-storage-adls-gen2-complete-guide"
+  - property: "og:image"
+    content: "./img/azure-storage-cover.png"
+  - name: "twitter:card"
+    content: "summary_large_image"
+  - name: "twitter:title"
+    content: "Azure Storage & ADLS Gen2: Where Does Your Data Actually Live?"
+  - name: "twitter:description"
+    content: "Azure gives you four storage types. Here's how they all fit together, and why ADLS Gen2 is the one that matters most for data engineering."
+  - name: "twitter:image"
+    content: "./img/azure-storage-cover.png"
+
+---
+
+
+
+# Azure Storage & ADLS Gen2: Where Does Your Data Actually Live?
+
+My first week working with Azure, I broke a pipeline before it even started.
+
+I had a simple job: land some raw CSV files from a sales API into Azure so a Spark job could pick them up later. I searched "Azure storage", saw five different options staring back at me, panicked slightly, and clicked the first one that sounded sensible - **Azure Table Storage**.
+
+Three hours later, I was staring at an error I didn't understand, in a service that was never designed for files.
+
+Table Storage is a NoSQL key-value store. It stores entities and properties, not CSV files. My data had nowhere to go.
+
+That confusion is more common than most Azure tutorials admit. And it happens because nobody explains the one question that actually matters before anything else:
+
+**Where does your data actually live in Azure, and why?**
+
+This blog answers that. We'll walk through all four Azure storage types, show exactly where each one fits in a real data pipeline, and then go deep on the one that changes everything for data engineering: **Azure Data Lake Storage Gen2**.
+
+
+## Azure Has Four Storage Types. Here's the Map.
+
+Before we build anything, let's get oriented.
+
+Azure bundles all storage services under a single **Storage Account**, one entry point, one namespace, one billing account. Inside that account, you get access to four distinct storage services, each built for a different job.
+
+
+
+
+Here's the quick map before we go deeper:
+
+| Storage Type | Think of it as | Stores | Used in pipelines for |
+|---|---|---|---|
+| **Blob Storage** | A file cabinet | Any file: CSV, JSON, Parquet, images, logs | Raw data landing zone |
+| **Queue Storage** | A mailbox | Messages between services | Triggering pipeline steps |
+| **Table Storage** | A ledger | Structured key-value rows | Tracking run state, metadata |
+| **File Storage** | A shared network drive | Files accessed over SMB | Legacy app file shares |
+
+None of these is "better." They serve different stages of the same pipeline. The mistake most beginners make, including me, is picking one at random instead of understanding the job each one does.
+
+Let's walk through them in the order they matter for a real data engineering workflow.
+
+## Blob Storage: The Foundation of Everything
+
+When data arrives in Azure, it almost always lands in **Blob Storage** first.
+
+Blob stands for **Binary Large Object**, which is just a fancy way of saying "any file." CSV, JSON, Parquet, images, videos, audio, ZIP archives, raw log dumps: Blob Storage holds all of it without caring about structure or format.
+
+There's no schema enforcement, no type checking. You put a file in, you get it back out. At any scale.
+
+### The three blob types
+
+Depending on how your data is written, you'll use one of three blob types:
+
+
+
+- **Block Blob:** Upload a file all at once. This covers the vast majority of data engineering use cases: your CSVs, Parquet files, and JSON exports all go here.
+- **Append Blob:** Add data continuously without modifying what's already there. Perfect for log files that grow over time.
+- **Page Blob:** Optimised for random read/write operations. Used mainly for VM disks. You'll rarely touch this directly.
+
+### Access tiers: storage that adjusts to how often you actually need the data
+
+One of Blob Storage's most underrated features is **access tiering**:
+
+- **Hot:** Data you access daily. Higher storage cost, lowest read cost.
+- **Cool:** Data you access occasionally. Cheaper to store, slightly more to read. 30-day minimum storage duration.
+- **Archive:** Data you almost never access. Extremely cheap to store, but takes hours to retrieve. Think old compliance records.
+
+You can set **lifecycle policies** to move data automatically between tiers as it ages. Last month's raw files move from hot to cool. Last year's move to archive. You save money without touching anything manually.
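+
+The rule a lifecycle policy applies can be sketched in a few lines of Python. Real policies are configured as JSON rules on the storage account; the function `target_tier` and its thresholds of 30 and 365 days are assumptions that mirror the "last month" and "last year" example above.
+
+```python
+from datetime import date
+
+
+def target_tier(
+    last_modified: date,
+    today: date,
+    cool_after_days: int = 30,      # assumed threshold
+    archive_after_days: int = 365,  # assumed threshold
+) -> str:
+    """Pick the access tier a file should sit in, based only on its age."""
+    age_days = (today - last_modified).days
+    if age_days >= archive_after_days:
+        return "Archive"
+    if age_days >= cool_after_days:
+        return "Cool"
+    return "Hot"
+
+
+today = date(2024, 6, 1)
+for modified in [date(2024, 5, 25), date(2024, 4, 1), date(2023, 1, 1)]:
+    print(modified, target_tier(modified, today))
+```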
+
+### Where Blob Storage fits in a pipeline
+
+In Medallion Architecture, Blob Storage is the natural home for the **Bronze layer**, the raw, unprocessed data exactly as it arrived from source systems. Nothing is cleaned. Nothing is validated. It just lands and waits.
+
+But here's where things get interesting.
+
+Plain Blob Storage works perfectly for general file storage. But for big data analytics pipelines, the kind where you're processing millions of files, running Spark jobs, and building Bronze/Silver/Gold layers, it has a critical limitation that most tutorials don't mention until you've already hit it.
+
+
+## The Problem with Plain Blob Storage at Scale
+
+Here's something I found out the hard way six months into working with Azure pipelines.
+
+I had a container full of raw sales data — about 40,000 Parquet files organised under a path that looked like `raw/2024/`. My team decided to rename it to `bronze/2024/` to match our Medallion Architecture convention. Simple enough, right?
+
+It took **47 minutes**.
+
+Not because Azure was slow. Because what looked like a folder called `raw/` was never actually a folder. In plain Blob Storage, everything lives at the same flat level. The slashes in a path like `raw/2024/jan/file.parquet` are just characters in a key name, the same way a filename on your desktop could technically be called `raw-2024-jan-file.parquet` with dashes instead.
+
+There is no directory underneath. So renaming means Azure copies each file to the new key name and deletes the old one, one file at a time, 40,000 times in a row.
+
+At big data scale, where you're managing millions of files across Bronze, Silver, and Gold layers, that's not a minor inconvenience. It's a pipeline blocker.
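+
+A toy model shows why the two layouts behave so differently. This is an in-memory simulation, not the Azure SDK: a flat dictionary of keys stands in for plain Blob Storage, a nested dictionary stands in for a real directory tree, and both function names are made up for illustration.
+
+```python
+def rename_prefix_flat(store: dict[str, bytes], old: str, new: str) -> int:
+    """Rename a 'folder' in a flat key store; returns the operations performed."""
+    operations = 0
+    for key in [k for k in store if k.startswith(old)]:
+        store[new + key[len(old):]] = store[key]  # copy to the new key
+        del store[key]                            # delete the old key
+        operations += 2
+    return operations
+
+
+def rename_dir_hierarchical(tree: dict, old: str, new: str) -> int:
+    """Rename a real directory node: one operation, whatever it contains."""
+    tree[new] = tree.pop(old)
+    return 1
+
+
+flat = {f"raw/2024/file_{i}.parquet": b"" for i in range(40_000)}
+tree = {"raw": {"2024": {f"file_{i}.parquet": b"" for i in range(40_000)}}}
+
+flat_ops = rename_prefix_flat(flat, "raw/", "bronze/")
+tree_ops = rename_dir_hierarchical(tree, "raw", "bronze")
+print(flat_ops)  # 80000: one copy and one delete per file
+print(tree_ops)  # 1
+```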
+
+This is the exact problem **ADLS Gen2** was built to fix.
+
+
+
+## ADLS Gen2: Blob Storage, Evolved
+
+**Azure Data Lake Storage Gen2 (ADLS Gen2)** is not a separate service. It's Blob Storage with one critical feature enabled: the **Hierarchical Namespace**.
+
+With hierarchical namespace turned on, folders become real. A directory with ten million files inside it can be renamed or deleted in a **single atomic operation**, because only the directory's metadata changes, regardless of how many files it contains.
+
+That one change makes ADLS Gen2 fast enough for serious analytics workloads. It's the storage layer that Databricks, Synapse, Azure Data Factory, and Microsoft Fabric are all built to work with.
+
+
+
+
+### The full ADLS Gen2 structure
+
+ADLS Gen2 organises data in three real levels:
+
+```
+Storage Account
+  └── Container (called a File System in ADLS)
+        └── Directories (real, nested folders)
+              └── Files (your actual data)
+```
+
+In practice, for a Medallion Architecture pipeline:
+
+```
+my-datalake/
+  └── data/
+        ├── bronze/
+        │     └── sales/
+        │           └── 2024/jan/raw_orders.parquet
+        ├── silver/
+        │     └── sales/
+        │           └── 2024/jan/cleaned_orders.parquet
+        └── gold/
+              └── sales/
+                    └── 2024/jan/monthly_revenue.parquet
+```
+
+Bronze, Silver, and Gold are real directories. Spark jobs move data between them. ADF pipelines write to them. Power BI reads from them. The Medallion pattern isn't an abstract concept; it's a folder structure in ADLS Gen2 with transformation logic connecting the layers.
+
+### The ABFS driver: why this matters for Spark
+
+When Spark, Databricks, Synapse, or Fabric connect to ADLS Gen2, they use the **Azure Blob File System (ABFS) driver**, accessed via the `abfss://` protocol.
+
+This driver was purpose-built for analytics workloads. It's significantly faster than the old WASB driver for directory-heavy operations, and it's the reason tools like Databricks can list, read, and write millions of files in ADLS Gen2 efficiently.
+
+Every time you see `abfss://container@storageaccount.dfs.core.windows.net/` in a notebook or pipeline config, that's ADLS Gen2 being accessed via the ABFS driver.
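+
+Since that URI shape shows up everywhere, a small helper can build it and take it apart again. This is a sketch using only the standard library; `abfss_uri` is a hypothetical helper, and the container `data` and account `mydatalake` are placeholder names.
+
+```python
+from urllib.parse import urlsplit
+
+
+def abfss_uri(container: str, account: str, path: str = "") -> str:
+    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
+    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"
+
+
+uri = abfss_uri("data", "mydatalake", "bronze/sales/2024/jan/")
+print(uri)
+
+# The container sits where a username normally goes; the host is the dfs endpoint
+parts = urlsplit(uri)
+print(parts.username, parts.hostname, parts.path)
+```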
+
+### Fine-grained access control with POSIX ACLs
+
+Regular Blob Storage gives you Role-Based Access Control (RBAC) at the container level. ADLS Gen2 goes further with [**POSIX-style Access Control Lists (ACLs)**](https://www.komprise.com/glossary_terms/posix-acls/), the same permission model used in Linux file systems.
+
+This means you can grant a data science team read access to only the `silver/` directory, without exposing `bronze/` (raw, potentially sensitive data) or `gold/` (business metrics). Fine-grained, at the folder and file level.
+
+For regulated industries such as finance, healthcare, and government, this isn't a nice-to-have. It's a requirement.
+
+### Storage tiers work at directory level
+
+Just like Blob Storage, ADLS Gen2 supports Hot, Cool, and Archive tiers. You can also scope lifecycle policies to a path prefix, automatically archiving `bronze/2023/` partitions when they're more than a year old while keeping `bronze/2024/` hot for active pipeline use.
+
+### ADLS Gen2 is what OneLake is built on
+
+If you've read about [Microsoft Fabric](https://www.recodehive.com/blog/microsoft-fabric-explained), you know that OneLake is Fabric's unified data lake, the single storage layer that every Fabric workload reads from and writes to.
+
+OneLake is fundamentally ADLS Gen2 with a unified namespace across your entire Fabric workspace. Understanding ADLS Gen2 means you understand the storage engine that powers Fabric, Synapse, Databricks on Azure, and every serious Azure data platform.
+
+| Azure Service | How it uses ADLS Gen2 |
+|---|---|
+| **Azure Data Factory** | Reads source files, writes pipeline outputs |
+| **Azure Databricks** | Reads/writes Delta tables via ABFS driver |
+| **Azure Synapse Analytics** | Queries files directly with serverless SQL pools |
+| **Microsoft Fabric / OneLake** | OneLake is built on ADLS Gen2 with a unified namespace |
+| **Azure Machine Learning** | Stores training datasets and model artifacts |
+| **Power BI** | Direct Lake mode reads Delta tables from OneLake, which is built on ADLS Gen2 |
+
+
+
+## The Supporting Cast: Queue and Table Storage
+
+ADLS Gen2 stores your data. But a pipeline isn't just storage, it's coordination, state management, and event triggering. That's where Queue Storage and Table Storage come in.
+
+They're not glamorous. But remove them from a production pipeline and things fall apart quickly.
+
+### Queue Storage: The Pipeline Trigger
+
+Queue Storage stores **messages**, small packets of information passed between services asynchronously.
+
+
+
+In a data pipeline context, Queue Storage is typically used as a **trigger mechanism**. When a new file lands in ADLS Gen2, the storage account raises a blob-created event, and Event Grid can route that event into a Queue as a message. A consumer such as an Azure Function listens to that Queue and kicks off the Azure Data Factory pipeline automatically.
+
+```
+New file lands in ADLS Gen2 bronze/
+ → Event triggers a Queue message: "new file: sales_2024_jan.parquet"
+ → An Azure Function reads the message and starts the ADF pipeline
+ → Pipeline runs transformation
+ → Cleaned data written to silver/
+```
+
+Without Queue Storage, you'd either poll for new files on a schedule (wasteful) or trigger pipelines manually (not scalable).
+
+**Key facts:**
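+- Messages up to **64 KB** in size
+- A queue can grow up to the capacity limit of its storage account
+- Messages expire after **7 days** by default if unconsumed (the time-to-live is configurable)
+- Built-in retry behaviour: if a consumer fails to delete a message before its visibility timeout ends, the message reappears for another attempt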
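+
+The retry behaviour in the last point is easier to see in a simulation. The class below is an in-memory stand-in, not the Azure SDK: `TinyQueue`, its methods, and the 30-second timeout are all hypothetical, and time is passed in explicitly so the example is deterministic.
+
+```python
+class TinyQueue:
+    """In-memory stand-in for a queue with a visibility timeout."""
+
+    MAX_MESSAGE_BYTES = 64 * 1024  # the 64 KB message limit
+
+    def __init__(self) -> None:
+        self._messages: dict[int, list] = {}  # id -> [body, visible_at]
+        self._next_id = 0
+
+    def put(self, body: str) -> int:
+        if len(body.encode("utf-8")) > self.MAX_MESSAGE_BYTES:
+            raise ValueError("message exceeds 64 KB")
+        self._next_id += 1
+        self._messages[self._next_id] = [body, 0.0]  # visible immediately
+        return self._next_id
+
+    def get(self, now: float, visibility_timeout: float = 30.0):
+        """Return (id, body) of the first visible message and hide it for a while."""
+        for message_id, entry in self._messages.items():
+            if entry[1] <= now:
+                entry[1] = now + visibility_timeout
+                return message_id, entry[0]
+        return None
+
+    def delete(self, message_id: int) -> None:
+        """A consumer deletes the message only after processing succeeds."""
+        del self._messages[message_id]
+
+
+queue = TinyQueue()
+queue.put("new file: sales_2024_jan.parquet")
+
+first = queue.get(now=0.0)    # a consumer receives the message, then crashes
+hidden = queue.get(now=10.0)  # still hidden from other consumers
+retry = queue.get(now=31.0)   # timeout elapsed, so the message reappears
+queue.delete(retry[0])        # second attempt succeeds and removes it
+print(first, hidden, retry)
+```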
+
+
+### Table Storage: The Pipeline Memory
+
+Table Storage is Azure's **NoSQL key-value store**, schemaless rows of properties, queried by partition and row key.
+
+In data pipelines, Table Storage earns its place as the **watermark store**, the place that remembers where a pipeline left off.
+
+Imagine your ADF pipeline runs every night and ingests new rows from a source database. It can't re-read everything from day one every night. Instead, it records the `last_run_timestamp` in a Table Storage entity. Use a custom property for this, because the built-in `Timestamp` property is maintained by the service and can't be set by your pipeline:
+
+```
+PartitionKey: "sales_pipeline"
+RowKey: "last_run"
+LastRunTimestamp: "2024-01-15T02:00:00Z"
+```
+
+Next run, the pipeline reads this value, queries only rows updated since then, and updates the watermark when done. This is called **incremental ingestion** and Table Storage is the simplest, cheapest place to track it.
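+
+The read-filter-update cycle can be sketched in a few lines of Python. This is a simplified illustration, not ADF or SDK code: a plain dict stands in for the Table Storage entity, and the `LastRunTimestamp` property, the `updated_at` column, and the sample rows are all assumptions made for the example.
+
+```python
+from datetime import datetime, timezone
+
+# Stand-in for the Table Storage entity; a real pipeline would read and upsert it via the SDK.
+watermark_entity = {
+    "PartitionKey": "sales_pipeline",
+    "RowKey": "last_run",
+    "LastRunTimestamp": datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc),
+}
+
+# Hypothetical source rows, each with an updated_at column.
+source_rows = [
+    {"order_id": 1, "updated_at": datetime(2024, 1, 14, 23, 0, tzinfo=timezone.utc)},
+    {"order_id": 2, "updated_at": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)},
+    {"order_id": 3, "updated_at": datetime(2024, 1, 15, 18, 45, tzinfo=timezone.utc)},
+]
+
+
+def run_incremental_load(rows, entity):
+    """Return rows newer than the watermark, then advance the watermark."""
+    since = entity["LastRunTimestamp"]
+    new_rows = [r for r in rows if r["updated_at"] > since]
+    if new_rows:
+        # Advance to the newest row actually loaded, not to "now", so nothing is skipped.
+        entity["LastRunTimestamp"] = max(r["updated_at"] for r in new_rows)
+    return new_rows
+
+
+loaded = run_incremental_load(source_rows, watermark_entity)
+print([r["order_id"] for r in loaded])  # [2, 3]
+```
+
+Running the function a second time against the same rows returns nothing, because the watermark has already moved past them.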
+
+**Other pipeline uses for Table Storage:**
+- Pipeline run metadata (status, row counts, duration)
+- Configuration values shared across pipeline activities
+- Simple lookup tables for reference data enrichment
+
+
+## File Storage: A Quick Note
+
+Azure File Storage provides a **managed SMB file share** in the cloud, the kind you mount as a network drive in Windows (`\\server\share`).
+
+For data engineering pipelines, you'll rarely reach for File Storage. It's primarily useful for **lift-and-shift migrations**, moving on-premises applications to Azure when those applications expect to read from a network file share and you don't want to refactor them.
+
+If you're building a new pipeline from scratch, ADLS Gen2 is almost always the right choice over File Storage for analytics workloads.
+
+
+
+## ADLS Gen2 vs Plain Blob Storage — When to Use Which
+
+| Scenario | Use |
+|---|---|
+| Raw file landing zone for a big data pipeline | **ADLS Gen2** |
+| Serving images or videos to a web application | **Blob Storage** |
+| VM disk backups or snapshots | **Blob Storage** |
+| Spark / Databricks / Synapse analytics workloads | **ADLS Gen2** |
+| Bronze / Silver / Gold Medallion layers | **ADLS Gen2** |
+| Simple static file hosting | **Blob Storage** |
+| ML training datasets and model artifacts | **ADLS Gen2** |
+| Microsoft Fabric / OneLake backend | **ADLS Gen2** |
+
+Storage capacity is priced almost identically, although transaction pricing can differ once the hierarchical namespace is enabled. The real difference is the **hierarchical namespace** and the performance characteristics it unlocks for analytics.
+
+
+## The Full Picture: One Pipeline, All Four Storage Types
+
+Here's how everything we've covered fits into a single, real data engineering pipeline — the kind you'd actually build in Azure:
+
+
+
+
+```
+REST API (sales data source)
+ ↓
+Azure Data Factory (orchestration)
+ ↓ writes raw Parquet
+ADLS Gen2 — bronze/sales/2024/
+ ↓
+Azure Databricks (Spark: clean, deduplicate, validate)
+ ↓ writes Delta tables
+ADLS Gen2 — silver/sales/2024/
+ ↓
+Azure Databricks (Spark: aggregate, calculate metrics)
+ ↓ writes business-ready Delta tables
+ADLS Gen2 — gold/sales/2024/
+ ↓
+Power BI (DirectLake mode — no import, always current)
+ ↓
+Business dashboard
+
+Supporting roles:
+├── Queue Storage → ADF pipeline triggered by file arrival event
+└── Table Storage → watermark ("last ingested: 2024-01-15T02:00:00Z")
+```
+
+Every storage type has one job. None of them overlap. And ADLS Gen2 is the spine the whole thing runs on.
+
+
+## The Decision Guide: One Question at a Time
+
+When you're building a pipeline and need to decide where something lives, ask these questions in order:
+
+**Is it a file that a Spark job or analytics tool needs to read?**
+→ ADLS Gen2
+
+**Is it a file served to end users (images, videos, downloads)?**
+→ Blob Storage
+
+**Is it a message that needs to trigger something downstream?**
+→ Queue Storage
+
+**Is it small structured data - a config value, a watermark, a metadata record?**
+→ Table Storage
+
+**Is it a file share that a VM or legacy app needs to mount over SMB?**
+→ File Storage
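+
+If it helps to see the ordering made explicit, the same guide can be written as a small function. This is purely illustrative: `choose_storage` and its flag names are made up for this post, and the first question that matches wins.
+
+```python
+def choose_storage(*, read_by_analytics=False, served_to_users=False,
+                   is_message=False, small_structured=False, needs_smb_mount=False):
+    """Walk the questions in the same order as the guide; first match wins."""
+    if read_by_analytics:
+        return "ADLS Gen2"
+    if served_to_users:
+        return "Blob Storage"
+    if is_message:
+        return "Queue Storage"
+    if small_structured:
+        return "Table Storage"
+    if needs_smb_mount:
+        return "File Storage"
+    return "undecided"
+
+
+print(choose_storage(small_structured=True))                         # Table Storage
+print(choose_storage(read_by_analytics=True, served_to_users=True))  # ADLS Gen2
+```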
+
+
+
+## The Key Lessons
+
+**1. Azure storage is four different things.** Each one has a specific job. Using the wrong one is a surprisingly easy mistake to make on day one and a frustrating one to debug.
+
+**2. ADLS Gen2 is Blob Storage with one upgrade that changes everything.** The hierarchical namespace turns flat object storage into a real file system. That single feature is why every serious Azure analytics service is built on top of it.
+
+**3. ADLS Gen2 is the Bronze/Silver/Gold spine of Medallion Architecture.** The layers aren't abstract concepts, they're real directories in a container, with Spark jobs and ADF pipelines connecting them.
+
+**4. Queue and Table Storage are the glue.** They're not glamorous, but production pipelines depend on them for event triggering and state management.
+
+**5. OneLake is built on ADLS Gen2.** When you use Microsoft Fabric, you're using the same storage technology underneath. Understanding the storage layer means you understand what every Azure data platform is actually built on.
+
+
+
+## References & Further Reading
+
+- [Microsoft Docs — Introduction to Azure Data Lake Storage Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction)
+- [Microsoft Docs — Azure Storage Overview](https://learn.microsoft.com/en-us/azure/storage/common/storage-introduction)
+- [Microsoft Docs — Storage Account Overview](https://learn.microsoft.com/en-us/azure/storage/common/storage-account-overview)
+- [Microsoft Docs — ABFS Driver for ADLS Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-abfs-driver)
+- [RecodeHive — Medallion Architecture Explained](https://www.recodehive.com/blog/medallion-architecture)
+- [RecodeHive — Microsoft Fabric: One Platform, One Lake](https://www.recodehive.com/blog/microsoft-fabric-one-platform-one-lake-every-data-workload)
+- [RecodeHive — Lakehouse vs Data Warehouse](https://www.recodehive.com/blog/lakehouse-vs-data-warehouse)
+
+
+
+## About the Author
+
+I'm **Aditya Singh Rathore**, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on [RecodeHive](https://www.recodehive.com/) — breaking down complex concepts into things you can actually use.
+
+🔗 [LinkedIn](https://www.linkedin.com/in/aditya-singh-rathore0017/) | [GitHub](https://github.com/Adez017)
+
+📩 Building something on Azure and stuck on storage decisions? Drop your question in the comments.
+
+
diff --git a/blog/azure-synapse-analytics/img/azure-synapse-cover.png b/blog/azure-synapse-analytics/img/azure-synapse-cover.png
new file mode 100644
index 00000000..1664a83d
Binary files /dev/null and b/blog/azure-synapse-analytics/img/azure-synapse-cover.png differ
diff --git a/blog/azure-synapse-analytics/img/synapse-architecture.png b/blog/azure-synapse-analytics/img/synapse-architecture.png
new file mode 100644
index 00000000..62f6de56
Binary files /dev/null and b/blog/azure-synapse-analytics/img/synapse-architecture.png differ
diff --git a/blog/azure-synapse-analytics/img/synapse-vs-fabric-decision.png b/blog/azure-synapse-analytics/img/synapse-vs-fabric-decision.png
new file mode 100644
index 00000000..56c4856c
Binary files /dev/null and b/blog/azure-synapse-analytics/img/synapse-vs-fabric-decision.png differ
diff --git a/blog/azure-synapse-analytics/index.md b/blog/azure-synapse-analytics/index.md
new file mode 100644
index 00000000..6f6b31d9
--- /dev/null
+++ b/blog/azure-synapse-analytics/index.md
@@ -0,0 +1,274 @@
+---
+title: "Azure Synapse Analytics: When to Use It (And When to Choose Fabric Instead)"
+authors: [Aditya-Singh-Rathore]
+sidebar_label: "Azure Synapse Analytics"
+tags: [azure-synapse-analytics, data-engineering, sql-pools, apache-spark, microsoft-fabric, data-warehouse, adls-gen2, azure, big-data, etl]
+date: 2026-05-06
+
+description: Azure Synapse Analytics is one of the most powerful tools in the Azure data stack. But in 2026, with Microsoft Fabric growing fast, the question isn't just "what is Synapse?" — it's "when should you still use it, and when should you move to Fabric?" Here's the honest answer.
+
+draft: false
+canonical_url: https://www.recodehive.com/blog/azure-synapse-analytics-when-to-use-it
+
+meta:
+ - name: "robots"
+ content: "index, follow"
+ - property: "og:title"
+ content: "Azure Synapse Analytics: When to Use It (And When to Choose Fabric Instead)"
+ - property: "og:description"
+ content: "Synapse is powerful. But in 2026, Microsoft Fabric is growing fast. Here's when to still use Synapse, when to move to Fabric, and how to think about the transition."
+ - property: "og:type"
+ content: "article"
+ - property: "og:url"
+ content: "https://www.recodehive.com/blog/azure-synapse-analytics-when-to-use-it"
+ - property: "og:image"
+ content: "./img/azure-synapse-cover.png"
+ - name: "twitter:card"
+ content: "summary_large_image"
+ - name: "twitter:title"
+ content: "Azure Synapse Analytics: When to Use It (And When to Choose Fabric Instead)"
+ - name: "twitter:description"
+ content: "Synapse is powerful. Fabric is the future. Here's the honest breakdown of when to use each in 2026."
+ - name: "twitter:image"
+ content: "./img/azure-synapse-cover.png"
+
+---
+
+
+
+# Azure Synapse Analytics: When to Use It (And When to Choose Fabric Instead)
+
+When I first started working seriously with Azure, Synapse was the answer to almost every data question.
+
+Need a SQL warehouse? Synapse. Need Spark for big data? Synapse. Need pipelines to move data? Synapse. Need to query files sitting in ADLS Gen2 without loading them anywhere? Synapse.
+
+It was genuinely impressive: one workspace that brought together SQL, Spark, pipelines, and storage in a single studio. I built three production pipelines on it, and it worked well.
+
+Then Microsoft Fabric arrived.
+
+And now the question I get asked most often is: *"Should I still use Synapse, or should I move to Fabric?"*
+
+The honest answer is: **it depends on where you are in your Azure journey.** This blog gives you the full picture: what Synapse actually is, when it's the right call, when Fabric is the better choice, and how to think about the transition if you're already on Synapse.
+
+
+## What Azure Synapse Analytics Actually Is
+
+Azure Synapse Analytics started as the next step beyond Azure SQL Data Warehouse, but over time it evolved into a much broader analytics platform rather than remaining just a cloud data warehouse solution.
+
+What changed significantly was the addition of multiple processing engines and integrated tooling within a single workspace. Instead of working only with SQL-based warehousing, teams could now combine:
+- large-scale Spark processing
+- SQL analytics
+- real-time exploration capabilities
+- orchestration pipelines
+- integrated data lake access
+
+This shift made Synapse more of a unified analytics ecosystem on Azure, where data engineering, big data processing, and reporting workloads could coexist within the same platform experience.
+
+One of the biggest differences compared to the earlier SQL Data Warehouse model is that Synapse tries to reduce the fragmentation between storage, transformation, orchestration, and analytics services that previously had to be managed separately.
+
+In plain terms: it's a unified analytics platform that brings together four things that used to require four separate Azure services:
+
+- **SQL analytics** - for querying structured data at scale
+- **Apache Spark** - for big data processing, ML, and complex transformations
+- **Data integration (Synapse Pipelines)** - for moving and transforming data across systems
+- **A unified workspace (Synapse Studio)** - where all of the above live together
+
+
+
+The key architectural principle underneath all of this is the **separation of compute and storage**. This decoupling lets organizations scale processing power independently of data volume. Compute can be ramped up to handle peak query loads and then scaled down, or even paused, during periods of inactivity, all without affecting the underlying data stored in ADLS Gen2.
+
+That's a big deal in practice. You pay for compute only when you use it.
+
+
+
+## The Four Core Components - What Each One Does
+
+### 1. Dedicated SQL Pools: High-Performance Data Warehousing
+
+Dedicated SQL Pools are Synapse's data warehousing engine. You provision a fixed amount of compute capacity measured in **Data Warehouse Units (DWUs)**, and in return you get consistent, predictable query performance for production workloads, scheduled reports, and dashboards.
+
+This is the right choice when:
+- You have large, structured datasets that are queried repeatedly by BI tools
+- You need consistent sub-second query performance for dashboards
+- Your team works primarily in T-SQL
+- You're migrating from an on-premises SQL Server or Oracle data warehouse
+
+The trade-off: you pay for the provisioned DWUs whether you're running queries or not. It's expensive to leave a Dedicated SQL Pool running 24/7 for workloads that only query it during business hours.
+
+**The practical fix:** pause your Dedicated SQL Pool outside business hours. Synapse lets you do this programmatically via Azure Automation or ADF pipelines, so you pay for compute only while the pool is running (storage is still billed while it's paused).
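+
+To get a feel for what pausing is worth, here is a rough back-of-the-envelope calculation. The hourly rate is a made-up placeholder, not a real Azure price, and storage charges are ignored because they apply either way.
+
+```python
+HOURS_PER_WEEK = 24 * 7
+
+
+def weekly_compute_cost(hourly_rate: float, running_hours: float) -> float:
+    """Compute cost for one week; storage is billed separately and ignored here."""
+    return hourly_rate * running_hours
+
+
+rate = 12.0  # hypothetical price per hour for the chosen DWU level
+always_on = weekly_compute_cost(rate, HOURS_PER_WEEK)   # 168 hours
+business_hours = weekly_compute_cost(rate, 10 * 5)      # 10 h/day, Monday to Friday
+savings = 1 - business_hours / always_on
+print(f"{always_on:.0f} vs {business_hours:.0f}, saving {savings:.0%}")
+# 2016 vs 600, saving 70%
+```
+
+The percentage doesn't depend on the rate you plug in: running 50 hours instead of 168 cuts the compute bill by about 70% at any DWU level.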
+
+### 2. Serverless SQL Pool: Query Without Loading
+
+Serverless SQL Pool is probably one of the most practical and underrated capabilities inside Azure Synapse.
+
+What makes it interesting is how quickly you can start querying data directly from your data lake without provisioning dedicated infrastructure upfront. Instead of maintaining a constantly running cluster, the engine dynamically allocates compute only when a query is executed.
+
+Under the hood, queries are distributed across multiple compute resources and processed in parallel, which makes it surprisingly efficient for exploratory analysis and lightweight analytical workloads.
+
+The pricing model is also very different from traditional warehouses. Since billing is based on the amount of data scanned per query, it works particularly well for:
+- ad-hoc analysis
+- one-time investigations
+- querying historical files
+- lightweight reporting workloads
+- infrequently accessed datasets
+
+The first time I used it, the biggest surprise was how quickly I could run SQL directly on files sitting in ADLS without setting up ingestion pipelines or persistent compute.
+
+In practice: you can write a SQL query directly against Parquet, CSV, or Delta files sitting in ADLS Gen2 **without loading them into any database first**.
+
+```sql
+-- Query a Parquet file in ADLS Gen2 directly — no loading required
+SELECT
+ region,
+ SUM(amount) AS total_revenue,
+ COUNT(order_id) AS total_orders
+FROM
+ OPENROWSET(
+ BULK 'https://mylake.dfs.core.windows.net/silver/sales/2024/**',
+ FORMAT = 'PARQUET'
+ ) AS sales_data
+GROUP BY region
+ORDER BY total_revenue DESC;
+```
+
+You pay for the bytes scanned by that query. Nothing more.
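+
+Because billing follows bytes scanned, how much data a query touches matters more than how long it runs. A rough sketch of the arithmetic follows; the price per TB is an assumed figure, so check the current pricing page for your region before relying on it.
+
+```python
+TB = 1024 ** 4
+GB = 1024 ** 3
+
+
+def query_cost(bytes_scanned: int, price_per_tb: float) -> float:
+    """Estimated cost of one serverless query from the bytes it scans."""
+    return bytes_scanned / TB * price_per_tb
+
+
+price = 5.0  # assumed price per TB scanned
+full_scan = query_cost(2 * TB, price)   # reading every byte of a 2 TB dataset
+pruned = query_cost(50 * GB, price)     # 50 GB after selecting fewer columns and partitions
+print(f"full scan: {full_scan:.2f}, pruned: {pruned:.2f}")
+# full scan: 10.00, pruned: 0.24
+```
+
+This is why columnar formats like Parquet and a partitioned folder layout pay off here: selecting fewer columns and filtering on partition folders directly reduces the bytes scanned.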
+
+This is the right choice when:
+- You need to explore raw data in ADLS Gen2 before deciding how to model it
+- You have analysts who know SQL but don't want to write Spark code
+- You're running occasional ad-hoc queries that don't justify provisioning a dedicated warehouse
+- You want to build a **logical data warehouse** on top of your data lake without moving data
+
+### 3. Apache Spark Pools: Big Data and ML Workloads
+
+Azure Synapse Analytics includes deeply integrated Apache Spark capabilities, allowing teams to work with large-scale data processing directly within the Synapse workspace instead of managing separate big data platforms.
+
+Spark Pools provide a managed Spark environment where engineers and data scientists can build ETL pipelines, prepare large datasets, process semi-structured or unstructured data, and develop machine learning workflows using familiar notebook-based development.
+
+One thing I found particularly useful is that infrastructure management is mostly abstracted away. You can write notebooks using Python, Scala, SQL, or R while Synapse handles much of the operational overhead like cluster provisioning, scaling, and session management behind the scenes.
+
+This makes Spark Pools especially practical for workloads that go beyond traditional SQL transformations and require distributed computation at scale.
+
+This is the right choice when:
+- Your transformations are too complex for SQL alone
+- You're building ML pipelines or training models on large datasets
+- You need to process semi-structured data (JSON, nested arrays) at scale
+- Your data engineering team is comfortable in PySpark or Scala
+
+The key advantage over standalone Spark clusters: Spark Pools share the same workspace as your SQL Pools and Pipelines. A Spark notebook can write a Delta table that a SQL analyst can immediately query without any data movement or cross-service configuration.
+
+### 4. Synapse Pipelines: Data Integration and Orchestration
+
+Synapse Pipelines is the data integration layer. It uses the same engine as Azure Data Factory, so teams already using ADF will recognize the interface and functionality. Pipelines handle the movement and transformation of data across systems: connecting to sources, extracting data, applying transformations, and loading results into destinations.
+
+It's the same visual, activity-based orchestration tool, with 90+ built-in connectors to external systems, built directly into the Synapse workspace.
+
+The advantage over standalone ADF: your pipelines live in the same workspace as your SQL and Spark workloads. You can trigger a Spark notebook, run a SQL script, and copy data to ADLS Gen2, all within a single pipeline, without leaving Synapse Studio.
+
+
+## What Synapse Studio Actually Looks Like
+
+Synapse Studio is the unified web-based interface that ties everything together. From one interface, teams can write and execute SQL queries against data warehouse tables, build and run Apache Spark notebooks, design data pipelines using visual drag-and-drop tools, monitor jobs, manage resources, and configure security settings. Data engineers building pipelines and analysts writing reports work in the same environment with access to the same underlying data.
+
+In practice, this means less context-switching. When I was building pipelines on Synapse, the biggest quality-of-life win was being able to debug a Spark notebook, run a SQL query against its output, and check the pipeline that triggered it, all in the same browser tab.
+
+
+## Real-World Use Cases - When Synapse Is the Right Call
+
+### Use Case 1: Enterprise Data Warehouse Migration
+
+Organizations moving from on-premises data warehouses like SQL Server or Oracle to Azure Synapse benefit from enhanced scalability, cost savings, and better performance.
+
+If your team is deeply invested in T-SQL, has existing stored procedures and reporting logic, and is migrating from SQL Server or Azure SQL DW — Synapse's Dedicated SQL Pool is the most natural landing spot. The syntax is familiar, the tooling is mature, and the migration path is well-documented.
+
+### Use Case 2: Ad-Hoc Exploration on a Data Lake
+
+You've landed months of raw data in ADLS Gen2 and need to understand what's in it before building a formal pipeline. Serverless SQL Pool lets analysts write SQL against those files immediately without waiting for a data engineer to model the data first.
+
+This is one of Synapse's strongest differentiators: few Azure services let SQL analysts query raw Parquet files on a data lake this directly and this cheaply.
+
+### Use Case 3: Mixed SQL + Spark Workloads
+
+Your team has SQL analysts querying a data warehouse and data engineers running Spark transformation jobs. In most stacks, these two groups work in separate tools with separate data copies.
+
+In Synapse, Spark can write a Delta table that the SQL pool reads, and SQL results can feed back into Spark notebooks without data movement between services. Both groups work against the same underlying data in ADLS Gen2.
+
+### Use Case 4: Regulated Industries Requiring Network Isolation
+
+Synapse has mature support for managed virtual networks and private endpoints. For teams in finance, healthcare, or government, where strict data residency and network isolation are non-negotiable, these networking controls are a significant advantage over Fabric, whose networking story is still evolving.
+
+
+## Synapse vs Fabric: The Honest Comparison
+
+Azure Synapse Analytics is a platform-as-a-service (PaaS) solution that provides modular components giving fine-grained control over data workflows. Microsoft Fabric represents a software-as-a-service (SaaS) approach bringing everything together into a single unified platform with shared governance, compute, and storage through OneLake.
+
+| Dimension | Azure Synapse | Microsoft Fabric |
+|---|---|---|
+| **Deployment model** | PaaS - you manage compute resources | SaaS - fully managed |
+| **Storage** | ADLS Gen2 (you manage) | OneLake (unified, managed for you) |
+| **SQL engine** | Dedicated + Serverless SQL Pools | Fabric Warehouse + SQL analytics endpoint |
+| **Spark** | Apache Spark Pools | Fabric Spark (same engine, newer experience) |
+| **Pipelines** | Synapse Pipelines (ADF engine) | Fabric Data Factory (next-gen ADF) |
+| **Real-time** | Data Explorer (partially retired) | Eventstreams + Eventhouse (KQL) |
+| **Network isolation** | Mature - managed VNet, private endpoints | Still evolving |
+| **T-SQL support** | Full | Some gaps (OPENROWSET and others) |
+| **AI / Copilot** | Limited | Built-in Copilot across all workloads |
+| **Direction** | Maintenance mode | Active investment - new features land here first |
+| **Best for** | Existing investments, regulated industries, SQL-heavy teams | Greenfield projects, unified analytics, AI workloads |
+
+
+## Should You Migrate from Synapse to Fabric?
+
+If you're already on Synapse, here's the pragmatic framework:
+
+**Migrate these workloads to Fabric now:**
+- Spark-based data engineering notebooks and jobs
+- Synapse Pipelines (the migration assistant handles most of this automatically)
+- Real-time analytics workloads (Fabric's Eventhouse is better than Data Explorer)
+- Power BI-connected workloads (DirectLake mode is a significant upgrade)
+
+**Keep these on Synapse for now:**
+- Workloads that depend heavily on Dedicated SQL Pool features
+- Pipelines that require complex network isolation or private endpoints
+- Anything using features that don't have a Fabric equivalent yet (OPENROWSET, Synapse Link for some sources)
+
+A phased approach works best: migrate greenfield workloads to Fabric immediately, then build a roadmap for existing Synapse workloads as Fabric's feature gaps close.
+
+The good news: the migration assistant automatically migrates core Spark artifacts from Azure Synapse Analytics into Fabric Data Engineering, bringing over Spark pools, notebooks, and Spark job definitions with no data moved during the process.
+
+
+## The Key Lessons
+
+**1. Synapse is not dead, but it's not the future either.** It's a fully supported, production-ready platform that will be around for years. But Microsoft's innovation is going into Fabric, not Synapse.
+
+**2. Serverless SQL Pool is genuinely underrated.** The ability to query raw files in ADLS Gen2 with SQL, paying only for bytes scanned, is one of the most cost-efficient features in the entire Azure data stack. Even if you move to Fabric, this pattern is worth understanding.
+
+**3. For greenfield projects in 2026, start with Fabric.** The OneLake architecture, the unified experience, and the Copilot integration make it the better starting point for anything new.
+
+**4. For existing Synapse investments, migrate in phases.** Don't rush a full migration. Move Spark workloads and pipelines first. Evaluate Dedicated SQL Pool workloads carefully before touching them.
+
+**5. The separation of compute and storage matters.** Whether you're on Synapse or Fabric, the underlying principle is the same: your data lives in ADLS Gen2 / OneLake, and your compute scales independently. Understanding this makes both platforms easier to reason about.
+
+
+## References & Further Reading
+
+- [Microsoft Docs - Azure Synapse Analytics Overview](https://learn.microsoft.com/en-us/azure/synapse-analytics/overview-what-is)
+- [Microsoft Docs - Serverless SQL Pool](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview)
+- [Microsoft Fabric Blog - Migrating from Synapse to Fabric](https://community.fabric.microsoft.com/t5/Fabric-Updates-Blogs/From-Azure-Synapse-and-Azure-Data-Factory-to-Microsoft-Fabric/ba-p/5172227)
+- [Microsoft Docs - Migrate Synapse Pipelines to Fabric](https://learn.microsoft.com/en-us/fabric/data-engineering/migrate-synapse-data-pipelines)
+- [RecodeHive - Microsoft Fabric: One Platform, One Lake](https://www.recodehive.com/blog/microsoft-fabric-explained)
+- [RecodeHive - Azure Storage & ADLS Gen2](https://www.recodehive.com/blog/azure-storage-options)
+- [RecodeHive - Lakehouse vs Data Warehouse](https://www.recodehive.com/blog/lakehouse-vs-warehouse)
+
+
+## About the Author
+
+I'm **Aditya Singh Rathore**, a Data Engineer passionate about building modern, scalable data platforms on Azure. I write about data engineering, cloud architecture, and real-world pipelines on [RecodeHive](https://www.recodehive.com/), breaking down complex concepts into things you can actually use.
+
+🔗 [LinkedIn](https://www.linkedin.com/in/aditya-singh-rathore0017/) | [GitHub](https://github.com/Adez017)
+
+📩 Still on Synapse and thinking about Fabric? Drop your questions in the comments, happy to help you think through the migration.
+
+
diff --git a/blog/batch-vs-stream-processing/img/batch-streaming-combined-architecture.png b/blog/batch-vs-stream-processing/img/batch-streaming-combined-architecture.png
new file mode 100644
index 00000000..99398ec6
Binary files /dev/null and b/blog/batch-vs-stream-processing/img/batch-streaming-combined-architecture.png differ
diff --git a/blog/batch-vs-stream-processing/index.md b/blog/batch-vs-stream-processing/index.md
new file mode 100644
index 00000000..bac49582
--- /dev/null
+++ b/blog/batch-vs-stream-processing/index.md
@@ -0,0 +1,273 @@
+---
+title: "Why We Rolled Back Our Kafka Pipeline to Batch After 6 Months"
+authors: [Aditya-Singh-Rathore]
+sidebar_label: "Why We Rolled Back Our Kafka Pipeline to Batch After 6 Months"
+tags: [batch-processing, stream-processing, data-engineering, apache-kafka, apache-flink, apache-spark, data-pipeline, real-time, azure, medallion-architecture, data-architecture]
+date: 2026-05-06
+
+description: Everyone talks about the benefits of streaming pipelines — real-time insights, millisecond latency, live dashboards. Nobody talks about what it actually costs you. I rebuilt a working batch pipeline as a streaming system. Here's what I learned the hard way.
+
+draft: false
+canonical_url: https://www.recodehive.com/blog/hidden-cost-of-streaming-pipelines
+
+meta:
+ - name: "robots"
+ content: "index, follow"
+ - property: "og:title"
+ content: "The Hidden Cost of Streaming Pipelines Nobody Talks About"
+ - property: "og:description"
+ content: "Everyone talks about the benefits of real-time streaming. Nobody talks about what it actually costs. Here's the honest breakdown from someone who built both."
+ - property: "og:type"
+ content: "article"
+ - property: "og:url"
+ content: "https://www.recodehive.com/blog/hidden-cost-of-streaming-pipelines"
+ - property: "og:image"
+ content: "./img/streaming-hidden-cost-cover.png"
+ - name: "twitter:card"
+ content: "summary_large_image"
+ - name: "twitter:title"
+ content: "The Hidden Cost of Streaming Pipelines Nobody Talks About"
+ - name: "twitter:description"
+ content: "Everyone talks about real-time streaming benefits. Nobody talks about what it costs. Here's the honest breakdown."
+ - name: "twitter:image"
+ content: "./img/streaming-hidden-cost-cover.png"
+
+---
+
+
+
+# The Hidden Cost of Streaming Pipelines Nobody Talks About
+
+Everyone in data engineering is obsessed with real time.
+
+Kafka. Flink. Event-driven architectures. Millisecond latency. Live dashboards. It's the direction every conference talk points, every job description asks for, every architecture diagram proudly features.
+
+And I bought into it completely.
+
+About a year into my data engineering career, our product team came to us with a request: customers wanted to see their order status update in real time. Our existing batch pipeline ran at 2am every night, and customers were calling support asking where their orders were.
+
+Reasonable ask. So we rebuilt the pipeline as a streaming system.
+
+Six months later, I had learned more about the real cost of streaming than any blog post or conference talk had ever prepared me for.
+
+This is that story — and the honest breakdown I wish someone had given me before I started.
+
+
+## What We Had Before (And Why It Worked)
+
+Our original order pipeline was batch. It ran every night at 2am via Azure Data Factory, pulled 24 hours of orders from our SQL database, ran a Spark transformation job, and wrote clean Delta tables to ADLS Gen2.
+
+```
+Every night at 2am:
+ ↓
+ADF Pipeline triggers
+ ↓
+Pull all orders from the last 24 hours
+ ↓
+Spark: clean → deduplicate → join product catalog
+ ↓
+Write to Silver layer (Delta table on ADLS Gen2)
+ ↓
+Aggregate into Gold layer
+ ↓
+Power BI refreshes — customers see updated status
+```
+
+It ran in 45 minutes. Our Spark cluster spun up, did its job, and shut down. We paid for 45 minutes of compute per day. The pipeline was simple, debuggable, and recoverable: if something broke, we fixed it and replayed from Bronze.
+
+The only problem: customers saw data that was 6 to 30 hours old depending on when they ordered.
+
+For most use cases, that's fine. For order status, it wasn't.
+
+
+## Hidden Cost #1 - Infrastructure That Never Sleeps
+
+The first thing that surprised me about our streaming pipeline was the infrastructure bill.
+
+Our batch Spark cluster ran 45 minutes a day. Our Kafka + Flink setup ran **every minute of every day**: 24 hours a day, 7 days a week, whether there were 10 events per second or 10,000.
+
+Streaming infrastructure requires 24/7 uptime. You can't spin it down overnight to save money. You can't schedule it during off-peak hours. The pipeline is always on, always consuming resources, always incurring cost.
+
+For our team, the monthly compute cost for the streaming pipeline was roughly **4x** what the equivalent batch job cost, and that was before accounting for the additional engineering time to maintain it.
+
+> **The question to ask before going streaming:** Is the business value of real-time data worth 4x the infrastructure cost? Sometimes the answer is yes. Often it isn't.
+
+
+## Hidden Cost #2 - Late-Arriving Data Will Break Your Logic
+
+In a batch pipeline, late data is not a problem. If an event arrives 3 hours late, it's in the next batch. The pipeline processes it, life goes on.
+
+In a streaming pipeline, late-arriving data is one of the hardest problems in distributed systems.
+
+Events can arrive out of order due to network delays, retries, or clock skew between services. Your Flink job is processing event #1,000 when event #987 suddenly arrives 45 seconds late. What do you do?
+
+The answer involves **watermarking**, telling your stream processor "wait X seconds after the event time before closing a window, to account for late arrivals." But choosing the right watermark is a balance:
+
+- Too short: you miss late-arriving events, your aggregations are wrong
+- Too long: you hold state in memory longer, increasing latency and memory pressure
+
+We got this wrong twice before landing on a configuration that worked. Both times, our order counts were silently off by 1-3%, small enough to look like noise, large enough to cause problems in financial reconciliation.
+
+```
+Late data problem illustrated:
+
+Event #:      1      2      3      4      5
+Event time:   10:00  10:01  10:02  10:03  10:04
+Arrived at:   10:00  10:01  10:04  10:03  10:05
+                              ↑
+              event #3 arrived 2 minutes late
+              - already missed the 10:02 window
+              - your aggregate is wrong
+```
+
+In batch, this doesn't exist as a problem. In streaming, it's a constant engineering challenge.
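+
+To make the trade-off concrete, here is a toy model of watermarking in plain Python. It is a simplified sketch, not Flink code: it assumes 1-minute tumbling windows, a watermark equal to the highest event time seen minus the allowed lateness, and events given as minutes after 10:00. The function name and the numbers are illustrative only.
+
+```python
+from collections import Counter
+
+
+def count_per_window(events, allowed_lateness):
+    """Count events per 1-minute tumbling window, dropping those behind the watermark.
+
+    events: (event_minute, arrival_minute) pairs; allowed_lateness is in minutes.
+    """
+    counts, dropped = Counter(), []
+    max_event_time = float("-inf")
+    # Process events in the order they arrive, not the order they happened.
+    for event_minute, _arrival in sorted(events, key=lambda e: e[1]):
+        max_event_time = max(max_event_time, event_minute)
+        watermark = max_event_time - allowed_lateness
+        window_end = event_minute + 1
+        if window_end <= watermark:
+            dropped.append(event_minute)   # window already closed
+        else:
+            counts[event_minute] += 1
+    return dict(counts), dropped
+
+
+# Event #3 (event time 10:02) arrives at 10:04, after event #4.
+events = [(0, 0), (1, 1), (2, 4), (3, 3), (4, 5)]
+strict = count_per_window(events, allowed_lateness=0)
+lenient = count_per_window(events, allowed_lateness=2)
+print(strict[1])   # [2]  -> the 10:02 event is silently dropped
+print(lenient[1])  # []   -> nothing dropped, but windows stay open longer
+```
+
+With zero allowed lateness the 10:02 window is already closed when its event shows up, which is exactly the kind of silent undercount described above. Allowing two minutes of lateness fixes the count at the price of holding each window open two minutes longer.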
+
+
+## Hidden Cost #3 - Exactly-Once Is Harder Than It Sounds
+
+Handling failures in batch pipelines is usually predictable. If a batch job fails, you typically resolve the issue and rerun the pipeline from the beginning. Since the processing happens on bounded data, recovery is relatively straightforward.
+
+Streaming systems work very differently.
+
+In platforms like Kafka and Flink, data is continuously flowing through the system. If a streaming job crashes midway through processing, recovery becomes much more complex than simply restarting the job.
+
+For example, after recovery:
+- Should previously processed events be replayed?
+- Could some records get skipped unintentionally?
+- Is there a possibility that certain events are processed more than once?
+
+This challenge is commonly addressed through **exactly-once processing guarantees**, where the goal is to ensure that every event affects the system exactly one time even during failures and restarts.
+
+Achieving reliable exactly-once behavior usually depends on several components working together correctly:
+
+- Proper Kafka offset management
+- Reliable Flink checkpointing and state recovery
+- Idempotent writes to downstream systems
+- Consistent state synchronization during failover scenarios
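+
+Of these, idempotent writes are the easiest to illustrate in isolation. The sketch below is a simplified in-memory model, not Kafka or Flink code: `IdempotentSink`, `replay`, and the `event_id` field are hypothetical names, and a real sink would persist the applied IDs together with the data.
+
+```python
+class IdempotentSink:
+    """Hypothetical downstream store that applies each event at most once."""
+
+    def __init__(self):
+        self.applied_ids = set()
+        self.order_totals = {}
+
+    def write(self, event):
+        """Apply the event unless its ID was already applied."""
+        if event["event_id"] in self.applied_ids:
+            return False  # replayed event: ignore it
+        self.applied_ids.add(event["event_id"])
+        order_id = event["order_id"]
+        self.order_totals[order_id] = self.order_totals.get(order_id, 0) + event["amount"]
+        return True
+
+
+def replay(sink, events, from_offset):
+    """Re-deliver events starting at the last committed offset, as after a restart."""
+    for event in events[from_offset:]:
+        sink.write(event)
+
+
+events = [
+    {"event_id": "e1", "order_id": "A", "amount": 10},
+    {"event_id": "e2", "order_id": "A", "amount": 20},
+    {"event_id": "e3", "order_id": "B", "amount": 5},
+]
+
+sink = IdempotentSink()
+replay(sink, events, from_offset=0)  # first run processes everything
+replay(sink, events, from_offset=1)  # crash before the offset commit: e2 and e3 are re-delivered
+print(sink.order_totals)  # {'A': 30, 'B': 5}
+```
+
+Without the `applied_ids` check, the second delivery would double-count orders A and B, which is exactly the duplicate-processing failure described in this section.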
+
+In practice, recovery bugs in streaming systems can have real operational impact. A single restart issue can lead to duplicate event processing, inconsistent downstream data, repeated customer notifications, or inaccurate analytics until the state is corrected.
+
+Unlike batch systems, where failures often leave datasets untouched until rerun, streaming failures can leave systems in partially updated states that are significantly harder to debug and recover from.
+
+
+## Hidden Cost #4 - Testing Is a Different Discipline
+
+Testing a batch pipeline is relatively straightforward. You have a dataset, you run the transformation, you check the output. Deterministic, reproducible, easy to validate.
+
+Testing a streaming pipeline requires simulating event streams with realistic timing, ordering, and volume. You need to test:
+
+- What happens when events arrive out of order?
+- What happens when a consumer crashes and restarts?
+- What happens when Kafka lag builds up during a traffic spike?
+- What happens when an upstream service sends a malformed event?
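+
+The first question, at least, can be checked cheaply with a seeded shuffle. This is a minimal sketch under the assumption that the aggregation should be order-independent; `total_per_order` and `is_order_independent` are hypothetical helpers, not part of any streaming framework.
+
+```python
+import random
+
+
+def total_per_order(events):
+    """Aggregation under test: sum amounts per order ID."""
+    totals = {}
+    for order_id, amount in events:
+        totals[order_id] = totals.get(order_id, 0) + amount
+    return totals
+
+
+def is_order_independent(aggregate, events, trials=20, seed=42):
+    """Return True if the aggregate gives the same result for shuffled arrival orders."""
+    expected = aggregate(events)
+    rng = random.Random(seed)  # fixed seed keeps the test reproducible
+    for _ in range(trials):
+        shuffled = list(events)
+        rng.shuffle(shuffled)
+        if aggregate(shuffled) != expected:
+            return False
+    return True
+
+
+events = [("A", 10), ("B", 5), ("A", 20), ("C", 7)]
+print(is_order_independent(total_per_order, events))  # True
+```
+
+A check like this only covers ordering; crashes, lag, and malformed events each need their own simulated scenario.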
+
+We discovered most of our edge cases in production, not in testing. Not because we were careless, but because accurately simulating a live event stream in a test environment is genuinely difficult.
+
+Our batch pipeline had a test suite that ran in 8 minutes. Our streaming pipeline's test suite took 40 minutes and still missed three production bugs in the first month.
+
+
+
+## Hidden Cost #5 - Your Team Needs Streaming-Specific Skills
+
+This one is easy to underestimate.
+
+Batch data engineering skills (Spark, SQL, dbt, ADF) are well understood, well documented, and widely held. If someone on your team leaves, finding a replacement with those skills is manageable.
+
+Streaming-specific skills (Kafka internals, Flink state management, watermarking strategies, consumer group management, exactly-once configuration) are genuinely harder to find and take longer to develop.
+
+When we hit our first major Flink issue (a state backend misconfiguration causing memory pressure under load), our team spent three days debugging something that an experienced Flink engineer would have spotted in 20 minutes. We didn't have one. We learned on the job, which is fine, but it was expensive learning.
+
+> Before committing to a streaming architecture, ask: does your team have the skills to maintain it? And if not, what's the cost of developing those skills or hiring them?
+
+
+
+## So When Is Streaming Actually Worth It?
+
+None of this means streaming is wrong. It means streaming has a real cost that should be weighed against a real business need.
+
+Streaming is worth it when the business problem **genuinely cannot tolerate batch latency.** Here's a clear test:
+
+**Reach for streaming when:**
+- Fraud needs to be detected **before** a transaction completes — batch latency means the fraud already happened
+- A customer's app needs to reflect a change **within seconds** of it occurring
+- A system needs to **react** to an event automatically — alerts, triggers, automated responses
+- You're processing IoT sensor data where stale readings are dangerous, not just inconvenient
+
+**Stick with batch when:**
+- You're building monthly reports, financial summaries, or historical analyses
+- Your stakeholders check dashboards in the morning, not the second
+- Your transformations involve complex aggregations over large historical datasets
+- Your team is small and operational simplicity matters more than latency
+
+The tech industry is currently obsessed with "real-time," which has led many organizations to over-engineer their stacks by implementing complex stream-processing frameworks where a simple batch job would have sufficed. A well-built batch pipeline is more reliable, cheaper, and easier to maintain than a poorly justified streaming one.
+
+## The Architecture That Actually Works: Both
+
+Here's what I'd tell myself before starting that project:
+
+**You probably need both, not either/or.**
+
+Our final architecture uses batch for everything that can tolerate it, and streaming only for the specific cases that genuinely can't:
+
+```
+Streaming layer (Kafka + Flink):
+ Order events → real-time status updates (Cassandra)
+ Fraud signals → real-time alerts (notification service)
+
+Batch layer (Spark + ADF):
+ Nightly order aggregations → Silver → Gold (Power BI)
+ Monthly revenue reports (finance team)
+ ML training datasets (data science team)
+```
+
+
+
+
+The streaming layer handles the 5% of use cases where seconds matter. The batch layer handles the 95% where they don't: more reliably, more cheaply, and with less operational overhead.
+
+[Microsoft Fabric](https://www.recodehive.com/blog/microsoft-fabric-explained) is built around exactly this pattern: Eventstreams for real-time ingestion, ADF Pipelines and Spark Notebooks for batch transformation, both writing to the same OneLake. You don't have to choose one architecture. You choose the right tool for each use case within the same platform.
+
+
+## The Honest Summary
+
+| | Batch | Streaming |
+|---|---|---|
+| **Infrastructure cost** | Low - runs on schedule | High - always on |
+| **Latency** | Minutes to hours | Milliseconds to seconds |
+| **Late data** | Not a problem | Significant engineering challenge |
+| **Failure recovery** | Fix and rerun | Complex - risk of duplicates or data loss |
+| **Testing** | Straightforward | Requires stream simulation |
+| **Team skills needed** | Spark, SQL, ADF | Kafka, Flink, state management |
+| **Best for** | Analytics, reporting, ML | Fraud detection, live status, alerts |
+| **Operational complexity** | Low | High |
+
+Streaming pipelines are powerful. They enable product experiences that batch simply can't deliver.
+
+But they come with real costs - infrastructure that never sleeps, late-data handling that never stops being tricky, failure recovery that's genuinely hard to get right, and a skills requirement that's easy to underestimate.
+
+The next time someone on your team says "we should make this real time", ask the question first:
+
+**How long can the business actually wait for this data?**
+
+If the honest answer is "overnight is fine" — keep the batch job. It's not boring. It's the right call.
+
+
+## References & Further Reading
+
+- [Databricks - Batch vs Streaming](https://docs.databricks.com/aws/en/data-engineering/batch-vs-streaming)
+- [Apache Flink - Watermarks and Late Data](https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/time/)
+- [Apache Kafka Documentation](https://kafka.apache.org/documentation/)
+- [Microsoft Fabric - Real-Time Intelligence](https://learn.microsoft.com/en-us/fabric/real-time-intelligence/overview)
+- [RecodeHive - How Netflix Handles Millions of Events Every Minute](https://www.recodehive.com/blog/netflix-data-engineering)
+- [RecodeHive - Medallion Architecture Explained](https://www.recodehive.com/blog/medallion-architecture)
+- [RecodeHive - Microsoft Fabric: One Platform, One Lake](https://www.recodehive.com/blog/microsoft-fabric-explained)
+
+
+## About the Author
+
+I'm **Aditya Singh Rathore**, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on [RecodeHive](https://www.recodehive.com/), turning hard-won lessons into content anyone can learn from.
+
+🔗 [LinkedIn](https://www.linkedin.com/in/aditya-singh-rathore0017/) | [GitHub](https://github.com/Adez017)
+
+📩 Have you been burned by a streaming pipeline that didn't need to be? Drop it in the comments.
+
+
diff --git a/blog/medallion-architecture/Img/medallion-architecture-flow.png b/blog/medallion-architecture/Img/medallion-architecture-flow.png
new file mode 100644
index 00000000..eaba2611
Binary files /dev/null and b/blog/medallion-architecture/Img/medallion-architecture-flow.png differ
diff --git a/blog/medallion-architecture/Img/medallion-extended-layers.png b/blog/medallion-architecture/Img/medallion-extended-layers.png
new file mode 100644
index 00000000..3a2288ae
Binary files /dev/null and b/blog/medallion-architecture/Img/medallion-extended-layers.png differ
diff --git a/blog/medallion-architecture/index.md b/blog/medallion-architecture/index.md
new file mode 100644
index 00000000..736da68d
--- /dev/null
+++ b/blog/medallion-architecture/index.md
@@ -0,0 +1,344 @@
+---
+title: "Medallion Architecture: How to Stop Your Data Pipeline from Becoming a Nightmare"
+authors: [Aditya-Singh-Rathore]
+sidebar_label: "Medallion Architecture Explained"
+tags: [medallion-architecture, data-engineering, bronze-silver-gold, data-pipeline, delta-lake, spark, databricks, microsoft-fabric, data-quality]
+date: 2026-05-07
+
+description: Most data pipelines don't fail because of bad technology. They fail because raw data flows directly into reports with no checkpoints, no validation, and no clear ownership. Medallion Architecture fixes exactly this — here's how it works, why it matters, and how to implement it in practice.
+
+draft: false
+canonical_url: https://www.recodehive.com/blog/medallion-architecture
+
+meta:
+ - name: "robots"
+ content: "index, follow"
+ - property: "og:title"
+ content: "Medallion Architecture: How to Stop Your Data Pipeline from Becoming a Nightmare"
+ - property: "og:description"
+ content: "Most pipelines fail not because of bad technology but bad organization. Medallion Architecture fixes this with Bronze, Silver, and Gold layers. Here's how it works."
+ - property: "og:type"
+ content: "article"
+ - property: "og:url"
+ content: "https://www.recodehive.com/blog/medallion-architecture"
+ - property: "og:image"
+ content: "./img/medallion-architecture-cover.png"
+ - name: "twitter:card"
+ content: "summary_large_image"
+ - name: "twitter:title"
+ content: "Medallion Architecture: How to Stop Your Data Pipeline from Becoming a Nightmare"
+ - name: "twitter:description"
+ content: "Most pipelines fail not because of bad tech but bad organization. Here's how Medallion Architecture fixes that."
+ - name: "twitter:image"
+ content: "./img/medallion-architecture-cover.png"
+
+---
+
+
+
+# Medallion Architecture: How to Stop Your Data Pipeline from Becoming a Nightmare
+
+It was a Tuesday afternoon when our analytics lead sent a message that made my stomach drop.
+
+*"The revenue numbers in the dashboard don't match what finance is reporting. We're off by $180,000. Can you check the pipeline?"*
+
+I spent the next four hours tracing data through a tangled mess of transformations, none of them documented, some running directly on raw API responses, others written six months ago by someone who had since left the team. By the time I found the issue (a deduplication step that had silently stopped working after a schema change upstream), the damage was done. Three teams had been working off wrong numbers for two weeks.
+
+That incident is what introduced me to **Medallion Architecture**.
+
+Not as a concept from a blog post. As a solution to a real, expensive, embarrassing problem that could have been caught immediately if we'd had any structure in how data moved through our pipeline.
+
+
+## So, What Is It?
+
+Think of Medallion Architecture like a water filtration system.
+
+Water from a river (your raw data) goes through multiple stages of filtering before it's safe to drink (your final reports). You wouldn't drink straight from the river — and you shouldn't build reports directly on raw, unvalidated data either.
+
+The architecture divides your data journey into three layers:
+
+> **Bronze → Silver → Gold**
+
+Each layer has one job. Each layer makes the data a little more trustworthy. By the time data reaches the end, it's reliable, consistent, and ready to power real business decisions.
+
+
+
+
+## 🥉 Bronze: The "Keep Everything" Layer
+
+Bronze is where data arrives, exactly as it came from the source. No cleaning, no filtering, no judgment.
+
+APIs, databases, logs, CSV exports: it all lands here, untouched.
+
+After the revenue incident, the first thing we did was create a Bronze layer in ADLS Gen2, a dedicated folder where every raw API response landed as-is, timestamped, and never overwritten.
+
+**Why not clean it immediately?**
+
+Because you *will* make mistakes in your pipeline. And when you do, you need to be able to go back to the original data and start over, without re-calling the API, without re-pulling from a source that may have already changed.
+
+Bronze is your safety net. It's immutable, append-only, and complete.
+
+> **Think of it as your data's long-term memory**, messy, raw, but irreplaceable.
+
+### What Bronze looks like in practice
+
+```
+adls-gen2/
+  └── bronze/
+      └── sales/
+          └── 2024/
+              ├── 01/raw_orders_20240115.parquet
+              ├── 02/raw_orders_20240201.parquet
+              └── 03/raw_orders_20240305.parquet
+```
+
+Files land here partitioned by date. Nothing is modified after landing. If the pipeline fails three steps later, you don't re-ingest, you reprocess from Bronze.
+
+### Key rules for Bronze
+
+- **Append only**: never overwrite or delete records
+- **No transformation**: store exactly what the source sent, including bad records
+- **Schema as-received**: don't enforce structure here, even if the source changes its format
+- **Partition by ingestion date**: makes reprocessing specific time ranges simple
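+
+Two of these rules, date partitioning and append-only landing, can be sketched with the standard library alone. The helpers `bronze_path` and `land` are hypothetical, and the in-memory set stands in for a real existence check against the storage account.
+
+```python
+from datetime import date
+from pathlib import PurePosixPath
+
+
+def bronze_path(domain, entity, ingested_on, base="bronze"):
+    """Build a date-partitioned Bronze path such as bronze/sales/2024/01/raw_orders_20240115.parquet."""
+    filename = f"raw_{entity}_{ingested_on:%Y%m%d}.parquet"
+    return PurePosixPath(base) / domain / f"{ingested_on:%Y}" / f"{ingested_on:%m}" / filename
+
+
+def land(path, already_landed):
+    """Register a new Bronze file; refuse to overwrite an existing one."""
+    if path in already_landed:
+        raise FileExistsError(f"Bronze is append-only: {path} already exists")
+    already_landed.add(path)
+
+
+landed = set()
+path = bronze_path("sales", "orders", date(2024, 1, 15))
+land(path, landed)
+print(path)  # bronze/sales/2024/01/raw_orders_20240115.parquet
+```
+
+Calling `land` a second time with the same path raises `FileExistsError`, which is the behavior you want from an append-only layer: a rerun must write a new file, never replace an old one.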
+
+
+## 🥈 Silver: Where the Real Work Happens
+
+This is where data engineering gets interesting and where most of the actual work lives.
+
+In the Silver layer, you take everything from Bronze and make it usable:
+
+- **Deduplicate** - remove duplicate records from retry logic or overlapping ingestion windows
+- **Standardize** - dates in ISO format, currencies in base units, strings trimmed and consistent
+- **Validate** - flag or quarantine records that fail business rules (negative prices, missing required fields)
+- **Enforce schema** - write Delta tables with defined column types and constraints
+- **Enrich** - join raw records with reference data (product names, region codes, customer tiers)
+
+Most of the heavy lifting in a data pipeline lives here. It's not glamorous work, but it's what separates trustworthy analytics from chaos.
+
+> **Think of it as the editorial desk**, messy raw material goes in, clean, consistent content comes out.
+
+### What Silver looks like in practice
+
+Here's a simple PySpark transformation from Bronze to Silver ([reference code](https://oneuptime.com/blog/post/2026-02-17-how-to-build-a-data-lakehouse-architecture-on-gcp-using-cloud-storage-dataproc-and-bigquery/view)):
+
+```python
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import col, to_date, lower, trim, when
+
+spark = SparkSession.builder.appName("BronzeToSilver").getOrCreate()
+
+# Read from Bronze
+bronze_df = spark.read.format("parquet").load(
+    "abfss://data@mylake.dfs.core.windows.net/bronze/sales/2024/"
+)
+
+# Clean and validate
+silver_df = (
+    bronze_df
+    .dropDuplicates(["order_id"])                          # deduplicate
+    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
+    .withColumn("region", lower(trim(col("region"))))      # standardize
+    .withColumn("product", lower(trim(col("product"))))
+    .withColumn(
+        "is_valid",
+        when(col("amount") > 0, True).otherwise(False)     # validate
+    )
+    .filter(col("order_id").isNotNull())                   # remove nulls
+)
+
+# Write to Silver as Delta table
+(
+    silver_df.write
+    .format("delta")
+    .mode("overwrite")
+    .option("overwriteSchema", "true")
+    .save("abfss://data@mylake.dfs.core.windows.net/silver/sales/")
+)
+
+print(f"Silver layer written: {silver_df.count()} records")
+```
+
+The deduplication step alone would have prevented our $180,000 revenue discrepancy. The raw Bronze data had duplicate order records from a retry bug in the API client. Silver catches them. Gold never sees them.
+
+One big win beyond fixing bugs: multiple teams can now pull from the *same* Silver datasets instead of each building their own version of the truth. That alone eliminates an enormous amount of duplicate work and conflicting numbers.
+
+### What Silver looks like in storage
+
+```
+adls-gen2/
+  └── silver/
+      └── sales/
+          ├── _delta_log/          ← Delta Lake transaction log
+          ├── part-00000.parquet
+          └── part-00001.parquet
+```
+
+Unlike Bronze (raw files), Silver is a proper **Delta table** with ACID guarantees, time travel, and schema enforcement.
+
+
+## 🥇 Gold: Built for Business, Not Engineers
+
+Gold is what your stakeholders actually see.
+
+This layer takes clean Silver data and shapes it for specific use cases: sales dashboards, executive reports, product metrics. It's aggregated, optimized, and structured for fast queries.
+
+You're not building for flexibility here. You're building for **clarity**.
+
+> **Think of it as the finished product on the shelf**, packaged, polished, and ready to use.
+
+### What Gold looks like in practice
+
+```python
+from pyspark.sql import functions as F
+
+# Assumes the `spark` session created in the Silver example
+# Read from Silver
+silver_df = spark.read.format("delta").load(
+    "abfss://data@mylake.dfs.core.windows.net/silver/sales/"
+)
+
+# Build Gold: monthly revenue by region
+gold_df = (
+    silver_df
+    .filter(F.col("is_valid"))
+    # Truncate each order date to the first day of its month
+    .withColumn("order_month", F.trunc(F.col("order_date"), "month"))
+    .groupBy("region", "order_month")
+    .agg(
+        F.count("order_id").alias("total_orders"),
+        F.sum("amount").alias("total_revenue"),
+        F.avg("amount").alias("avg_order_value"),
+    )
+    .orderBy("order_month", "region")
+)
+
+# Write to Gold
+(
+    gold_df.write
+    .format("delta")
+    .mode("overwrite")
+    .save("abfss://data@mylake.dfs.core.windows.net/gold/revenue_by_region/")
+)
+```
+
+The Gold table is what Power BI connects to. Pre-aggregated, fast, shaped exactly for the business question it answers.
+
+### What Gold looks like in storage
+
+```
+adls-gen2/
+  └── gold/
+      ├── revenue_by_region/    ← one table per business use case
+      ├── customer_summary/
+      └── product_performance/
+```
+
+Notice: Gold is not one big table. Each Gold table answers one specific business question.
+
+
+## Why This Actually Matters
+
+Here's what Medallion Architecture would have changed about our Tuesday afternoon incident:
+
+| Problem we had | Without Medallion | With Medallion |
+|---|---|---|
+| Duplicate orders from API retry bug | Silently corrupted revenue reports | Caught and removed in Silver |
+| Couldn't find where numbers went wrong | Four hours of undocumented rabbit holes | Isolated to exactly one layer |
+| Re-ingesting data after the fix | Re-called the API (data had since changed) | Replayed from Bronze (data preserved) |
+| Finance and analytics had different numbers | Both teams built their own transforms | Both teams use the same Silver table |
+| Schema changed upstream, broke pipeline | Broke everything simultaneously | Bronze absorbed it, Silver flagged it |
+
+The pattern isn't just about organization; it's about **trust**. When your team knows exactly where data came from and how it was transformed at each step, confidence in analytics goes up. Decisions improve. Four-hour debugging sessions stop happening.
+
+
+## It's Not Always Perfect
+
+Let's be honest: Medallion Architecture does add complexity.
+
+More layers = more storage, more pipelines, more things to maintain. For a small team doing simple reporting, it might genuinely be overkill.
+
+**It's a great fit when:**
+- You have multiple data sources with varying quality
+- Multiple teams consume the same data
+- Data quality is non-negotiable
+- Pipelines need to be recoverable and replayable
+- You need to audit exactly where a number came from
+
+**It's probably overkill when:**
+- You have one small, clean dataset
+- It's a one-time analysis
+- You're just building a proof of concept
+
+
+## Beyond the Three Layers
+
+In practice, teams often extend the model:
+
+- **Landing / Staging layer** — temporary storage before Bronze, used when data needs to be decrypted, unzipped, or format-converted before it can be stored
+- **Feature layer** — prepared datasets for ML model training, maintained by data science teams on top of Silver
+- **Semantic layer** — business-friendly models sitting between Gold and end users for self-serve BI
+
+
+
+The three-tier model is a starting point, not a ceiling. The right number of layers is whatever your team actually needs.
+
+
+## The Full Folder Structure
+
+Here's what a complete Medallion Architecture implementation looks like in ADLS Gen2:
+
+```text
+adls-gen2/
+  └── data/
+      ├── bronze/
+      │   ├── sales/2024/01/raw_orders_20240115.parquet
+      │   └── customers/2024/01/raw_customers_20240115.json
+      │
+      ├── silver/
+      │   ├── sales/
+      │   │   ├── _delta_log/
+      │   │   └── part-00000.parquet
+      │   └── customers/
+      │       ├── _delta_log/
+      │       └── part-00000.parquet
+      │
+      └── gold/
+          ├── revenue_by_region/
+          ├── customer_summary/
+          └── product_performance/
+```
+
+This is the exact structure we adopted after the revenue incident. Bronze preserved everything. Silver caught the duplicates. Gold gave the business team numbers they could trust.
+
+
+## The Key Lessons
+
+**1. Raw data and report data should never live in the same layer.** The moment raw data flows directly into a dashboard, you've lost the ability to catch errors before they reach stakeholders.
+
+**2. Bronze is not a dumping ground; it's a source of truth.** Its value is that it's complete and immutable. The messiness is the point.
+
+**3. Most data engineering work happens in Silver.** Deduplication, validation, standardization: this is where pipeline quality is actually built.
+
+**4. Gold tables are specific, not flexible.** One table per business use case. Pre-aggregated, fast, and shaped exactly for the question it answers.
+
+**5. When something breaks, you replay from Bronze.** You never re-ingest from source. Bronze is your checkpoint.
+
+
+## References & Further Reading
+
+- [Databricks - Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
+- [Microsoft Learn - Medallion Lakehouse Architecture](https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion)
+- [Delta Lake - What is Delta Lake?](https://docs.delta.io/)
+- [RecodeHive - Lakehouse vs Data Warehouse](https://www.recodehive.com/blog/lakehouse-vs-warehouse)
+- [RecodeHive - Microsoft Fabric: One Platform, One Lake](https://www.recodehive.com/blog/microsoft-fabric-explained)
+- [RecodeHive - Azure Storage & ADLS Gen2](https://www.recodehive.com/blog/azure-storage-options)
+- [OneUptime - Build a Data Lakehouse on GCP](https://oneuptime.com/blog/post/2026-02-17-how-to-build-a-data-lakehouse-architecture-on-gcp-using-cloud-storage-dataproc-and-bigquery/view)
+
+## About the Author
+
+I'm **Aditya Singh Rathore**, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on [RecodeHive](https://www.recodehive.com/) — turning hard-won lessons into content anyone can learn from.
+
+🔗 [LinkedIn](https://www.linkedin.com/in/aditya-singh-rathore0017/) | [GitHub](https://github.com/Adez017)
+
+📩 Had a similar pipeline disaster? Drop it in the comments. I'd love to hear how you solved it.
+
+
diff --git a/blog/single-sign-on/index.md b/blog/single-sign-on/index.md
index 406c297f..467b3884 100644
--- a/blog/single-sign-on/index.md
+++ b/blog/single-sign-on/index.md
@@ -1,7 +1,7 @@
---
-title: "How SSO Actually Works"
+title: "How SSO Works - Case Study"
authors: [Aditya-Singh-Rathore, sanjay-kv]
-sidebar_label: "How SSO Actually Works"
+sidebar_label: "How SSO Works - Case Study"
tags: [sso, single-sign-on, authentication, identity-provider, oauth, openid-connect, saml, security, web]
date: 2026-05-04
diff --git a/docusaurus.config.ts b/docusaurus.config.ts
index 74c1b3af..40d02595 100644
--- a/docusaurus.config.ts
+++ b/docusaurus.config.ts
@@ -151,8 +151,10 @@ const config: Config = {
value: `