Foundations of a Bank-Grade Lakehouse
This guide establishes the definitive standards for data modeling within the Databricks Lakehouse. It provides a prescriptive, auditable framework to ensure the bank's data architecture is performant, scalable, and compliant with stringent regulatory requirements such as BCBS 239, SOX, and AML/KYC. The goal is to translate complex regulatory principles into tangible, enforceable technical controls for delivery teams.
The Medallion Data Flow
The Medallion Architecture logically organizes data into three layers—Bronze, Silver, and Gold—representing progressively higher levels of quality, structure, and value. This structured approach is fundamental to satisfying regulatory demands for data quality, traceability, and usability.
Bronze: Raw & Immutable
The permanent, auditable archive of all source data, captured exactly as received. This ensures full replayability and source fidelity.
Silver: Conformed & Governed
A validated, consistent view of core business entities. Data is cleansed, conformed, and mastered into a canonical model.
Gold: Curated & Optimized
Purpose-built, aggregated data marts for specific business functions, regulatory reporting, and machine learning.
Data Quality Progression
As data moves through the Medallion layers, its quality, reliability, and business value increase significantly, with key data governance metrics improving at each stage.
🥉 Bronze Layer: The Immutable Landing Zone
The Bronze layer's philosophy is absolute fidelity to the source. It serves as a permanent, immutable archive, ensuring the bank can always trace data to its origin and reprocess pipelines from a known state. This is the foundation of auditability.
Key Principles
- Immutable & Append-Only: Data is never modified in place.
- Source Fidelity: Store data "as-is" with original payload.
- Full Replayability: Enables reprocessing from a trusted state.
- Automated Ingestion: Use Databricks Auto Loader for scalability.
Standard Table Schema
All Bronze tables must include these columns for lineage.
- payload (BINARY/STRING): The original, unparsed record.
- metadata_struct (STRUCT): Container for source system, path, and timestamps.
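A minimal DDL sketch of this standard schema follows, assuming illustrative catalog, schema, table, and metadata field names; adapt them to the bank's approved conventions.

```python
# Minimal Bronze table DDL sketch; catalog, schema, and metadata field
# names are illustrative assumptions, not prescribed values.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev_bronze_catalog.core_banking.transactions_raw (
        payload         STRING,                -- original, unparsed record
        metadata_struct STRUCT<
            source_system: STRING,             -- originating system identifier
            source_path:   STRING,             -- file or topic the record came from
            ingest_ts:     TIMESTAMP           -- time the record was ingested
        >,
        meta_ingestion_date DATE               -- partition column (see checklist)
    )
    USING DELTA
    PARTITIONED BY (meta_ingestion_date)
    TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```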
Implementation Checklist
- ✓ Use Auto Loader with file notification mode (see the ingestion sketch below).
- ✓ Enforce data contracts with Great Expectations.
- ✓ Quarantine records that fail contract validation.
- ✓ Enable Delta table properties for auto-compaction.
- ✓ Partition tables by ingestion date (`meta_ingestion_date`).
- ✓ Follow the naming convention: `[env]_bronze_catalog.[source].[table]_raw`.
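The sketch below shows one way to wire Auto Loader in file notification mode into a Bronze table; the landing path, checkpoint location, source system name, and table names are placeholders, and contract validation with quarantine routing is omitted for brevity.

```python
# Auto Loader ingestion sketch for a Bronze table; paths, names, and the
# text source format are illustrative assumptions.
from pyspark.sql import functions as F

raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")              # capture each record unparsed
    .option("cloudFiles.useNotifications", "true")    # file notification mode
    .load("/Volumes/landing/core_banking/transactions/")
)

bronze_df = raw_stream.select(
    F.col("value").alias("payload"),                  # original record, as received
    F.struct(
        F.lit("core_banking").alias("source_system"),
        F.col("_metadata.file_path").alias("source_path"),
        F.current_timestamp().alias("ingest_ts"),
    ).alias("metadata_struct"),
    F.current_date().alias("meta_ingestion_date"),
)

(bronze_df.writeStream
    .option("checkpointLocation", "/Volumes/landing/_checkpoints/transactions_raw")
    .trigger(availableNow=True)                       # incremental, batch-style run
    .toTable("dev_bronze_catalog.core_banking.transactions_raw"))
```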
🥈 Silver Layer: The Conformed Enterprise View
The Silver layer transforms raw data into a clean, integrated, and business-centric view of core entities like customers, accounts, and transactions. This layer is the enterprise's single source of truth, providing a stable foundation for all downstream analytics.
Core Objectives
- Data Quality Enforcement: Use Delta Live Tables (DLT) Expectations.
- Canonical Modeling: Conform data into a standardized business model.
- Entity Resolution: Create a single view of each customer/party.
- Historical Preservation: Model key entities as SCD Type 2 tables.
SCD Type 2 Schema
Required for auditable history on dimensional tables.
- effective_start_date (TIMESTAMP): When the record version became active.
- effective_end_date (TIMESTAMP): When the record version was superseded.
- is_current (BOOLEAN): Flag to easily find the active record.
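A minimal MERGE-based sketch of this SCD Type 2 pattern follows; the `party` table, `party_id` natural key, and `attributes_hash` change-detection column are assumptions for illustration, and surrogate key generation is omitted.

```python
# SCD Type 2 sketch for a Silver dimension; table and column names are
# illustrative. attributes_hash is assumed to be a hash of tracked columns.
from pyspark.sql import functions as F
from delta.tables import DeltaTable

updates = spark.table("dev_silver_catalog.staging.party_updates")
current = spark.table("dev_silver_catalog.canonical.party").where("is_current = true")

# Records that are brand new or whose tracked attributes changed.
changed_or_new = (
    updates.alias("u")
    .join(current.alias("c"), F.col("u.party_id") == F.col("c.party_id"), "left")
    .where("c.party_id IS NULL OR c.attributes_hash <> u.attributes_hash")
    .select("u.*")
)

# Step 1: close out the currently active version of changed parties.
target = DeltaTable.forName(spark, "dev_silver_catalog.canonical.party")
(target.alias("t")
    .merge(changed_or_new.alias("u"), "t.party_id = u.party_id AND t.is_current = true")
    .whenMatchedUpdate(set={
        "effective_end_date": "current_timestamp()",
        "is_current": "false",
    })
    .execute())

# Step 2: append the new versions as the active records.
(changed_or_new
    .withColumn("effective_start_date", F.current_timestamp())
    .withColumn("effective_end_date", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
    .write.mode("append")
    .saveAsTable("dev_silver_catalog.canonical.party"))
```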
Implementation Checklist
- ✓ Implement all pipelines using Delta Live Tables (DLT).
- ✓ Define DLT Expectations for all critical data quality rules (sketched below).
- ✓ Use surrogate keys (`_sk`) as primary keys for all entities.
- ✓ Implement SCD Type 2 logic for dimensional entities (`party`, `account`).
- ✓ Use `DECIMAL(38, 18)` for all monetary values.
- ✓ Follow the naming convention: `[env]_silver_catalog.canonical.[entity]`.
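The following is a hedged DLT sketch showing how expectations attach to a Silver table; the source table, payload schema, and rule names are assumptions rather than prescribed values.

```python
# Delta Live Tables sketch with expectations; source table, schema, and
# rule definitions are illustrative assumptions.
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Assumed shape of the raw payload; real schemas come from the data contract.
account_schema = StructType([
    StructField("account_id", StringType()),
    StructField("party_id", StringType()),
    StructField("balance_amount", DecimalType(38, 18)),
])

@dlt.table(name="account", comment="Conformed account entity in the canonical model.")
@dlt.expect_or_drop("valid_account_id", "account_id IS NOT NULL")
@dlt.expect("non_negative_balance", "balance_amount >= 0")
def account():
    raw = spark.readStream.table("dev_bronze_catalog.core_banking.accounts_raw")
    return (
        raw.select(F.from_json("payload", account_schema).alias("r"))
           .select("r.*")
           .withColumn("account_sk", F.sha2(F.col("account_id"), 256))  # surrogate key
    )
```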
🥇 Gold Layer: Curated & Optimized Datamarts
The Gold layer contains highly refined, purpose-built data models for specific consumption patterns. It serves business analysts, data scientists, and regulatory reporting engines with data that is aggregated, denormalized, and easy to query.
Consumption Patterns
- Star Schemas: For BI and regulatory reporting (e.g., BCBS 239 Risk Mart).
- One Big Table (OBT): Highly denormalized tables for ML feature engineering.
- Semantic Layer: Governed views defining official business metrics and KPIs.
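As an illustration of the semantic-layer pattern, the sketch below defines a governed view over an assumed star schema; the mart, tables, and metric definitions are hypothetical.

```python
# Governed semantic-layer view sketch defining an official KPI; catalog,
# mart, fact/dimension tables, and metric formulas are assumptions.
spark.sql("""
    CREATE OR REPLACE VIEW dev_gold_catalog.risk.vw_monthly_credit_exposure AS
    SELECT
        d.reporting_month,
        d.legal_entity,
        SUM(f.exposure_amount)                 AS total_credit_exposure,
        SUM(f.exposure_amount * f.risk_weight) AS risk_weighted_assets
    FROM dev_gold_catalog.risk.fct_credit_exposure f
    JOIN dev_gold_catalog.risk.dim_reporting_date d
      ON f.reporting_date_sk = d.reporting_date_sk
    GROUP BY d.reporting_month, d.legal_entity
""")
```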
Performance Optimization
Gold tables are read-heavy and must be aggressively optimized.
- Liquid Clustering: Preferred method for data layout optimization.
- Materialization: All core facts and dimensions are physical tables.
- Photon Engine: Enable for all SQL warehouses for vectorized query speed.
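A hedged sketch of a Liquid Clustered Gold fact table follows; the table name, columns, and clustering keys are illustrative assumptions.

```python
# Liquid Clustering sketch for a read-heavy Gold fact table; names and
# clustering columns are illustrative assumptions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dev_gold_catalog.risk.fct_credit_exposure (
        exposure_sk        BIGINT,
        reporting_date_sk  BIGINT,
        legal_entity       STRING,
        exposure_amount    DECIMAL(38, 18),
        risk_weight        DECIMAL(38, 18)
    )
    USING DELTA
    CLUSTER BY (reporting_date_sk, legal_entity)
""")

# Incremental clustering and compaction run as part of routine maintenance.
spark.sql("OPTIMIZE dev_gold_catalog.risk.fct_credit_exposure")
```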
Implementation Checklist
- ✓ Model follows an approved pattern (Star Schema or OBT).
- ✓ Use Liquid Clustering on frequently filtered columns.
- ✓ Ensure business-friendly column names and documented metrics.
- ✓ Apply all required security controls (RLS, Column Masking).
- ✓ Perform impact analysis using lineage before any changes.
- ✓ Follow the naming convention: `[env]_gold_catalog.[domain].[mart]`.
Unified Governance & Security
A robust governance framework is integral to the Lakehouse. Unity Catalog provides the tools to implement security and privacy policies in a centralized, scalable, and auditable manner, ensuring compliance by design.
Attribute-Based Access Control (ABAC)
The primary strategy for managing access. Policies are driven by centrally managed tags (e.g., `sensitivity`, `jurisdiction`) applied to data assets, which is more scalable than managing permissions for individual users.
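A minimal sketch of tagging assets that ABAC policies key on is shown below; the tag keys, values, and object names are illustrative, and the policies that reference these tags are assumed to be managed centrally in Unity Catalog.

```python
# Applying governed tags at table and column level; tag keys, values,
# and object names are illustrative assumptions.
spark.sql("""
    ALTER TABLE dev_silver_catalog.canonical.party
    SET TAGS ('sensitivity' = 'confidential', 'jurisdiction' = 'EU')
""")

spark.sql("""
    ALTER TABLE dev_silver_catalog.canonical.party
    ALTER COLUMN national_id SET TAGS ('sensitivity' = 'pii')
""")
```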
Row-Level Security (RLS) & Dynamic Data Masking
Fine-grained controls are implemented directly on tables using SQL UDFs. RLS filters which rows a user can see (e.g., a manager sees only their team's data), while masking redacts sensitive column data (e.g., `***-**-****` for SSN) for unauthorized users.
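The sketch below shows the general shape of a row filter and a column mask built from SQL UDFs; the group names, `dim_party` table, and its columns are assumptions for illustration.

```python
# Row filter and column mask sketch using SQL UDFs; group names, table,
# and columns are illustrative assumptions.
spark.sql("""
    CREATE OR REPLACE FUNCTION dev_gold_catalog.security.jurisdiction_filter(jurisdiction STRING)
    RETURN is_account_group_member('global_compliance')
        OR (is_account_group_member('emea_analysts') AND jurisdiction = 'EU')
""")

spark.sql("""
    ALTER TABLE dev_gold_catalog.customer.dim_party
    SET ROW FILTER dev_gold_catalog.security.jurisdiction_filter ON (jurisdiction)
""")

spark.sql("""
    CREATE OR REPLACE FUNCTION dev_gold_catalog.security.mask_national_id(national_id STRING)
    RETURN CASE WHEN is_account_group_member('pii_readers')
                THEN national_id ELSE '***-**-****' END
""")

spark.sql("""
    ALTER TABLE dev_gold_catalog.customer.dim_party
    ALTER COLUMN national_id SET MASK dev_gold_catalog.security.mask_national_id
""")
```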
Automated Lineage for Audit & Impact Analysis
Unity Catalog automatically captures column-level lineage, providing an irrefutable audit trail for regulators. This graph is also mandatory for performing impact analysis before any schema or pipeline logic changes are deployed.
Operational Readiness Guide
This section provides the tactical resources for delivery teams to build, deploy, and operate the Lakehouse according to the defined standards, covering naming conventions, optimization patterns, and handling common data issues.
Performance & Cost Optimization
- Use Job Clusters: All production pipelines must run on lower-cost, ephemeral Job clusters.
- Enable Autoscaling: Clusters must scale up and down automatically to match workload.
- Activate Photon Engine: Use the vectorized engine for significant performance gains.
- Run `OPTIMIZE` Periodically: Compact small files on frequently updated tables.
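For reference, a sketch of an ephemeral job-cluster specification with autoscaling and Photon enabled follows; the runtime version, node type, worker counts, and table name are assumptions to be tuned per workload.

```python
# Job-cluster spec sketch (Databricks Jobs API style fields); values are
# illustrative assumptions, not prescribed settings.
job_cluster_spec = {
    "spark_version": "15.4.x-scala2.12",                # supported LTS runtime
    "node_type_id": "i3.xlarge",                        # cloud-specific instance type
    "runtime_engine": "PHOTON",                         # vectorized execution
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with workload
    "data_security_mode": "SINGLE_USER",                # Unity Catalog access mode
}

# Periodic small-file compaction on a frequently updated table (placeholder name).
spark.sql("OPTIMIZE dev_silver_catalog.canonical.transaction")
```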
Handling Late-Arriving Data
When a fact record (e.g., a transaction) arrives before its corresponding dimension (e.g., a new customer), it must be handled gracefully. The fact is loaded with a placeholder key (e.g., -1 for "Unknown Customer") and logged to a reconciliation table. A periodic job then re-attempts the lookup and updates the fact table with the correct key once the dimension record arrives.
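A hedged sketch of the re-keying step follows; the fact, dimension, and reconciliation table names, the `-1` placeholder key, and the column names are assumptions for illustration.

```python
# Re-keying late-arriving facts once the dimension record lands; all
# table and column names are illustrative assumptions.
from delta.tables import DeltaTable

facts = DeltaTable.forName(spark, "dev_gold_catalog.payments.fct_transaction")

# Facts previously loaded with the "Unknown Customer" placeholder key (-1),
# tracked in a reconciliation table alongside the natural customer_id.
pending = spark.table("dev_gold_catalog.payments.recon_unresolved_customer")
dim_party = spark.table("dev_silver_catalog.canonical.party").where("is_current = true")

resolved = (
    pending.alias("p")
    .join(dim_party.alias("d"), "customer_id")     # dimension has now arrived
    .select("p.transaction_id", "d.party_sk")
)

# Replace the placeholder key with the correct surrogate key.
(facts.alias("f")
    .merge(resolved.alias("r"), "f.transaction_id = r.transaction_id AND f.party_sk = -1")
    .whenMatchedUpdate(set={"party_sk": "r.party_sk"})
    .execute())

# Resolved entries would then be removed from the reconciliation table.
```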