Move Data & Governance to Databricks Unity Catalog
Project Goals
This playbook provides a comprehensive guide for migrating data, metadata, and governance controls to Databricks Unity Catalog (UC). The primary objectives are to establish a secure, auditable, and well-governed data platform while ensuring a smooth transition for existing workloads.
Create a step-by-step migration plan without breaking workloads.
Stand up secure access controls using the principle of least privilege.
Establish trustworthy data lineage and auditability for all UC objects.
Define a robust cataloging strategy aligned to domains and data quality tiers.
Document best practices for data governance, security, and ongoing UC operations.
Intended Audience
This guide is designed for technical stakeholders involved in the data platform lifecycle.
Key Deliverables
The migration process yields key artifacts across planning, implementation, and governance: a phased migration plan, an access control and privilege model, cataloging and metadata standards, infrastructure-as-code and policy snippets, and pre- and post-migration checklists.
High-Level Migration Plan
The migration is organized into eight phases, detailed in the sections that follow.
Unity Catalog Object Model
The Unity Catalog model establishes a clear hierarchy for organizing and securing data assets. The metastore is the top-level container, scoped per region, and all other objects exist within it. Ownership is a key concept, preferably assigned to groups to manage privileges effectively.
Metastore (1 per region)
  Catalog (e.g., Finance, Sales)
    Schema (e.g., Core, Analytics)
      Table / View
      Function
      Volume
  Storage Credential
  External Location
  Share / Recipient
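A minimal sketch of how these levels appear in SQL (object and group names reuse the examples used elsewhere in this playbook; the column list is illustrative):
-- Catalog within the metastore, schema within the catalog, table within the schema
CREATE CATALOG IF NOT EXISTS fin COMMENT 'Finance domain';
CREATE SCHEMA IF NOT EXISTS fin.core COMMENT 'Authoritative core';
CREATE TABLE IF NOT EXISTS fin.core.gl_entries (entry_id BIGINT, dept STRING, amount DECIMAL(18,2));

-- Assign ownership to groups, not individuals
ALTER CATALOG fin OWNER TO `data-stewards`;
ALTER SCHEMA fin.core OWNER TO `engineers_finance`;

-- Every table is addressed by its three-level name: <catalog>.<schema>.<table>
SELECT * FROM fin.core.gl_entries LIMIT 10;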
Access Control Design
A robust access model is founded on the principle of least privilege, using groups to streamline permissions and ensure separation of duties. This approach simplifies audits and management.
Core Principles
Use groups, not individuals, for all grants.
Separate duties: platform-admins, data-stewards, engineers, analysts.
Service principals for CI/CD and production jobs; no personal tokens.
Least privilege; deny by default; inherit where possible.
Example Privilege Model
-- Create the Finance catalog and grant usage to both groups
CREATE CATALOG fin COMMENT 'Finance domain';
GRANT USE CATALOG ON CATALOG fin TO `analysts_finance`;
GRANT USE CATALOG ON CATALOG fin TO `engineers_finance`;

-- Create the core schema; analysts get usage, engineers can create and modify tables
CREATE SCHEMA fin.core COMMENT 'Authoritative core';
GRANT USE SCHEMA ON SCHEMA fin.core TO `analysts_finance`;
GRANT USE SCHEMA ON SCHEMA fin.core TO `engineers_finance`;
GRANT CREATE TABLE, MODIFY ON SCHEMA fin.core TO `engineers_finance`;

-- Analysts read individual tables
CREATE TABLE fin.core.gl_entries (...);
GRANT SELECT ON TABLE fin.core.gl_entries TO `analysts_finance`;
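The same grants extend to automation identities. A sketch for a CI/CD service principal follows; in SQL grants a service principal is referenced by its application ID, and the UUID below is a placeholder:
-- Production jobs authenticate as a service principal, never a personal token
GRANT USE CATALOG ON CATALOG fin TO `11111111-2222-3333-4444-555555555555`;
GRANT USE SCHEMA, CREATE TABLE, MODIFY ON SCHEMA fin.core TO `11111111-2222-3333-4444-555555555555`;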
Metadata & Cataloging Standards
Consistent metadata is crucial for data discovery and trust. A well-defined taxonomy, naming conventions, and liberal use of comments and tags create a self-documenting data platform.
Taxonomy
Use domain-oriented catalogs (e.g., `sales`, `hr`) and schemas organized by subdomain or quality tier (e.g., `bronze`, `silver`, `gold`).
Naming Conventions
Follow a predictable pattern like <domain>.<tier>.<entity> (e.g., sales.gold.orders).
Comments & Tags
Add descriptive comments to all objects. Use table properties for tags like 'sensitivity'='PII' and 'owner'='team_sales'.
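A minimal sketch of these standards applied to a table (the table and property values mirror the examples above and are illustrative):
-- Document the table and record governance tags as table properties
COMMENT ON TABLE sales.gold.orders IS 'Curated order facts, one row per order';

ALTER TABLE sales.gold.orders SET TBLPROPERTIES (
  'sensitivity' = 'PII',
  'owner'       = 'team_sales'
);

-- Column-level comments aid discovery
ALTER TABLE sales.gold.orders ALTER COLUMN order_id COMMENT 'Natural order key';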
Code & Implementation Assets
This section provides ready-to-use code snippets for provisioning resources, defining security policies, and ingesting data: Terraform for infrastructure, SQL for policies and objects, and PySpark for data pipelines.
Provisioning with Terraform
# Define Storage Credentials and External Locations
resource "databricks_storage_credential" "sc_fin" {
  name = "sc_fin"
  # ... cloud-specific IAM role or SPN reference
}

resource "databricks_external_location" "loc_fin_raw" {
  name            = "fin_raw"
  url             = "s3://company-fin/raw"
  credential_name = databricks_storage_credential.sc_fin.name
}

# Define Catalogs, Schemas, and Grants
resource "databricks_catalog" "fin" {
  name    = "fin"
  comment = "Finance domain"
}

resource "databricks_schema" "fin_core" {
  catalog_name = databricks_catalog.fin.name
  name         = "core"
}

resource "databricks_grants" "fin_core" {
  schema = databricks_schema.fin_core.id

  grant {
    principal  = "engineers_finance"
    privileges = ["USE_SCHEMA", "CREATE_TABLE", "MODIFY"]
  }

  grant {
    principal  = "analysts_finance"
    privileges = ["USE_SCHEMA"]
  }
}
Column & Row Level Security with SQL
-- Column masking for PII: a SQL UDF applied to the column as a mask
CREATE FUNCTION fin.core.pii_mask(val STRING)
  RETURNS STRING
  RETURN CASE WHEN is_member('pii_readers') THEN val ELSE '****' END;

ALTER TABLE fin.core.customers ALTER COLUMN ssn SET MASK fin.core.pii_mask;

-- Row-level security via a dynamic view
CREATE VIEW fin.secure.v_gl_entries AS
SELECT * FROM fin.core.gl_entries
WHERE (is_member('fin_ap') AND dept = 'AP')
   OR (is_member('fin_ar') AND dept = 'AR');

GRANT SELECT ON VIEW fin.secure.v_gl_entries TO `analysts_finance`;
Data Quality Constraints with SQL
-- Set DQ expectations as table properties (for external tools)
ALTER TABLE sales.silver.orders SET TBLPROPERTIES (
  'dq_expectations' = 'not_null:order_id; range:amount>0'
);

-- Enforce a NOT NULL constraint at the table level
ALTER TABLE sales.silver.orders ALTER COLUMN order_id SET NOT NULL;

-- Add a CHECK constraint to enforce business rules
ALTER TABLE sales.silver.orders ADD CONSTRAINT chk_amount CHECK (amount > 0);
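To confirm these declarations are in place, the properties can be read back; a quick sketch (the second statement lists all properties, where Delta surfaces CHECK constraints under delta.constraints.* keys):
-- Read the declared expectations back, e.g. from an external DQ tool
SHOW TBLPROPERTIES sales.silver.orders ('dq_expectations');

-- List all properties; CHECK constraints appear as delta.constraints.<name>
SHOW TBLPROPERTIES sales.silver.orders;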
Streaming Ingestion with PySpark & Volumes
# Read from a Unity Catalog Volume using Auto Loader (cloudFiles)
# Volume paths follow the pattern /Volumes/<catalog>/<schema>/<volume>/<path>
df = (spark.readStream
    .format('cloudFiles')
    .option('cloudFiles.format', 'json')
    # Auto Loader requires a schema location for schema inference (path is illustrative)
    .option('cloudFiles.schemaLocation', '/checkpoints/sales_orders/_schema')
    .load('/Volumes/landing/sales/')
)

# Write the stream to a Delta table governed by Unity Catalog
(df.writeStream
    .format('delta')
    .option('checkpointLocation', '/checkpoints/sales_orders')
    .table('sales.bronze.orders_raw')
)
Data Lineage & Audit
Unity Catalog automatically captures lineage for all workloads that read from and write to UC objects. This is critical for impact analysis, root cause analysis, and regulatory compliance. The operating model should enforce practices that guarantee lineage is captured.
Ensure all production workloads use UC-managed tables, views, and volumes.
Standardize on Databricks Jobs and SQL Warehouses to maximize traceability.
Document critical data products with lineage graphs and perform validation after changes.
Configure and retain audit logs to monitor access patterns and investigate incidents.
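A minimal sketch of what this monitoring can look like, assuming system tables are enabled for the account; the system.access.audit and system.access.table_lineage schemas are the documented system tables, but column availability can vary by release:
-- Who has been accessing Unity Catalog objects in the last 7 days?
SELECT event_time, user_identity.email, action_name, request_params
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND event_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY event_time DESC;

-- Which downstream tables are built from fin.core.gl_entries?
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE source_table_full_name = 'fin.core.gl_entries'
ORDER BY event_time DESC;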
Pre-Migration Checklist
Confirm all readiness criteria are met before starting the migration.
Post-Migration Checklist
Verify the success of the cutover and ensure operational stability.
Best Practices Summary
Adhering to these best practices for governance, security, and catalog usage will ensure the long-term success and maintainability of your Databricks platform.