Move Data & Governance to Databricks Unity Catalog
Project Goals
This playbook provides a comprehensive guide for migrating data, metadata, and governance controls to Databricks Unity Catalog (UC). The primary objectives are to establish a secure, auditable, and well-governed data platform while ensuring a smooth transition for existing workloads.
Create a step-by-step migration plan without breaking workloads.
Stand up secure access controls using the principle of least privilege.
Establish trustworthy data lineage and auditability for all UC objects.
Define a robust cataloging strategy aligned to domains and data quality tiers.
Document best practices for data governance, security, and ongoing UC operations.
Intended Audience
This guide is designed for technical stakeholders involved in the data platform lifecycle.
Key Deliverables
The migration process yields key artifacts across planning, implementation, and governance: a phased migration plan, an access control and privilege model, cataloging and metadata standards, infrastructure-as-code and policy snippets, and pre- and post-migration checklists.
High-Level Migration Plan
The migration is organized into eight phases, detailed in the sections that follow.
Unity Catalog Object Model
The Unity Catalog model establishes a clear hierarchy for organizing and securing data assets. The metastore is the top-level container, scoped per region, and all other objects exist within it. Ownership is a key concept, preferably assigned to groups to manage privileges effectively.
Metastore (1 per region)
  Catalog (e.g., Finance, Sales)
    Schema (e.g., Core, Analytics)
      Table / View
      Function
      Volume
  Storage Credential
  External Location
  Share / Recipient
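A minimal sketch of how these levels appear in SQL (object and group names reuse the examples used elsewhere in this playbook; the column list is illustrative):
-- Catalog within the metastore, schema within the catalog, table within the schema
CREATE CATALOG IF NOT EXISTS fin COMMENT 'Finance domain';
CREATE SCHEMA IF NOT EXISTS fin.core COMMENT 'Authoritative core';
CREATE TABLE IF NOT EXISTS fin.core.gl_entries (entry_id BIGINT, dept STRING, amount DECIMAL(18,2));

-- Assign ownership to groups, not individuals
ALTER CATALOG fin OWNER TO `data-stewards`;
ALTER SCHEMA fin.core OWNER TO `engineers_finance`;

-- Every table is addressed by its three-level name: <catalog>.<schema>.<table>
SELECT * FROM fin.core.gl_entries LIMIT 10;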
Access Control Design
A robust access model is founded on the principle of least privilege, using groups to streamline permissions and ensure separation of duties. This approach simplifies audits and management.
Core Principles
Use groups, not individuals, for all grants.
Separate duties: platform-admins, data-stewards, engineers, analysts.
Service principals for CI/CD and production jobs; no personal tokens.
Least privilege; deny by default; inherit where possible.
Example Privilege Model
-- Create the Finance catalog and grant usage to both groups
CREATE CATALOG fin COMMENT 'Finance domain';
GRANT USE CATALOG ON CATALOG fin TO `analysts_finance`;
GRANT USE CATALOG ON CATALOG fin TO `engineers_finance`;

-- Create the core schema; analysts get usage, engineers can create and modify tables
CREATE SCHEMA fin.core COMMENT 'Authoritative core';
GRANT USE SCHEMA ON SCHEMA fin.core TO `analysts_finance`;
GRANT USE SCHEMA ON SCHEMA fin.core TO `engineers_finance`;
GRANT CREATE TABLE, MODIFY ON SCHEMA fin.core TO `engineers_finance`;

-- Analysts read individual tables
CREATE TABLE fin.core.gl_entries (...);
GRANT SELECT ON TABLE fin.core.gl_entries TO `analysts_finance`;
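The same grants extend to automation identities. A sketch for a CI/CD service principal follows; in SQL grants a service principal is referenced by its application ID, and the UUID below is a placeholder:
-- Production jobs authenticate as a service principal, never a personal token
GRANT USE CATALOG ON CATALOG fin TO `11111111-2222-3333-4444-555555555555`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY ON SCHEMA fin.core TO `11111111-2222-3333-4444-555555555555`;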
Metadata & Cataloging Standards
Consistent metadata is crucial for data discovery and trust. A well-defined taxonomy, naming conventions, and liberal use of comments and tags create a self-documenting data platform.
Taxonomy
Use domain-oriented catalogs (e.g., `sales`, `hr`) and schemas organized by subdomain or quality tier (e.g., `bronze`, `silver`, `gold`).
Naming Conventions
Follow a predictable pattern like <domain>.<tier>.<entity> (e.g., sales.gold.orders).
Comments & Tags
Add descriptive comments to all objects. Use table properties for tags like 'sensitivity'='PII' and 'owner'='team_sales'.
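A minimal sketch of these standards applied to a table (the table and property values mirror the examples above and are illustrative):
-- Document the table and record governance tags as table properties
COMMENT ON TABLE sales.gold.orders IS 'Curated order facts, one row per order';

ALTER TABLE sales.gold.orders SET TBLPROPERTIES (
  'sensitivity' = 'PII',
  'owner'       = 'team_sales'
);

-- Column-level comments aid discovery
ALTER TABLE sales.gold.orders ALTER COLUMN order_id COMMENT 'Natural order key';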
Code & Implementation Assets
This section provides ready-to-use code snippets for provisioning resources, defining security policies, and ingesting data: Terraform for infrastructure, SQL for policies and objects, and PySpark for data pipelines.
Provisioning with Terraform
# Define Storage Credentials and External Locations
resource "databricks_storage_credential" "sc_fin" {
  name = "sc_fin"
  # ... cloud-specific IAM role or SPN reference
}

resource "databricks_external_location" "loc_fin_raw" {
  name            = "fin_raw"
  url             = "s3://company-fin/raw"
  credential_name = databricks_storage_credential.sc_fin.name
}

# Define Catalogs, Schemas, and Grants
resource "databricks_catalog" "fin" {
  name    = "fin"
  comment = "Finance domain"
}

resource "databricks_schema" "fin_core" {
  catalog_name = databricks_catalog.fin.name
  name         = "core"
}

resource "databricks_grants" "fin_core" {
  schema = databricks_schema.fin_core.id

  grant {
    principal  = "engineers_finance"
    privileges = ["USE_SCHEMA", "CREATE_TABLE", "MODIFY"]
  }

  grant {
    principal  = "analysts_finance"
    privileges = ["USE_SCHEMA"]
  }
}
Column & Row Level Security with SQL
-- Column masking for PII: a SQL UDF applied to the column as a mask
CREATE FUNCTION fin.core.pii_mask(val STRING)
  RETURNS STRING
  RETURN CASE WHEN is_member('pii_readers') THEN val ELSE '****' END;

ALTER TABLE fin.core.customers ALTER COLUMN ssn SET MASK fin.core.pii_mask;

-- Row-level security via a dynamic view
CREATE VIEW fin.secure.v_gl_entries AS
SELECT * FROM fin.core.gl_entries
WHERE (is_member('fin_ap') AND dept = 'AP')
   OR (is_member('fin_ar') AND dept = 'AR');

GRANT SELECT ON VIEW fin.secure.v_gl_entries TO `analysts_finance`;
Data Quality Constraints with SQL
-- Set DQ expectations as table properties (for external tools)
ALTER TABLE sales.silver.orders SET TBLPROPERTIES (
  'dq_expectations' = 'not_null:order_id; range:amount>0'
);

-- Enforce a NOT NULL constraint at the table level
ALTER TABLE sales.silver.orders ALTER COLUMN order_id SET NOT NULL;

-- Add a CHECK constraint to enforce business rules
ALTER TABLE sales.silver.orders ADD CONSTRAINT chk_amount CHECK (amount > 0);
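To confirm these declarations are in place, the properties can be read back; a quick sketch (the second statement lists all properties, where Delta surfaces CHECK constraints under delta.constraints.* keys):
-- Read the declared expectations back, e.g. from an external DQ tool
SHOW TBLPROPERTIES sales.silver.orders ('dq_expectations');

-- List all properties; CHECK constraints appear as delta.constraints.<name>
SHOW TBLPROPERTIES sales.silver.orders;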
Streaming Ingestion with PySpark & Volumes
# Read from a Unity Catalog Volume using Auto Loader (cloudFiles)
# Volume paths follow the pattern /Volumes/<catalog>/<schema>/<volume>/<path>
df = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    # Auto Loader requires a schema location for schema inference (path is illustrative)
    .option('cloudFiles.schemaLocation', '/checkpoints/sales_orders/_schema')
    .load('/Volumes/landing/sales/')
)

# Write the stream to a Delta table governed by Unity Catalog
(df.writeStream
    .format('delta')
    .option('checkpointLocation', '/checkpoints/sales_orders')
    .table('sales.bronze.orders_raw')
)
Data Lineage & Audit
Unity Catalog automatically captures lineage for all workloads that read from and write to UC objects. This is critical for impact analysis, root cause analysis, and regulatory compliance. The operating model should enforce practices that guarantee lineage is captured.
Ensure all production workloads use UC-managed tables, views, and volumes.
Standardize on Databricks Jobs and SQL Warehouses to maximize traceability.
Document critical data products with lineage graphs and perform validation after changes.
Configure and retain audit logs to monitor access patterns and investigate incidents.
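A minimal sketch of what this monitoring can look like, assuming system tables are enabled for the account; the system.access.audit and system.access.table_lineage schemas are the documented system tables, but column availability can vary by release:
-- Who has been accessing Unity Catalog objects in the last 7 days?
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY event_time DESC;

-- Which downstream tables are built from fin.core.gl_entries?
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE source_table_full_name = 'fin.core.gl_entries'
ORDER BY event_time DESC;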
Pre-Migration Checklist
Confirm all readiness criteria are met before starting the migration.
Post-Migration Checklist
Verify the success of the cutover and ensure operational stability.
Best Practices Summary
Adhering to these best practices for governance, security, and catalog usage will ensure the long-term success and maintainability of your Databricks platform.