[ STRUCTURED ] Unity Catalog Concept in Databricks
>>> Topics in UC
- Definition
- History
- Before UC vs After UC
- Features and Limitations
- Example
>>> Additional Terms
- Catalog Binding
- Delta Sharing
- Lakehouse Federation
_______________________________________________________________________
>> Definition
- A unified governance and data security solution for all data assets across Databricks workspaces
>> History
- Before UC, Databricks lacked fundamental, centralized data governance across workspaces
>> Before vs After
Previous Approach: Without UC [Separate]
- Lacked governance and centralization
- Governance had to be configured on each workspace individually
- Separate user management and metastore per workspace
- Data cloning was used to share data across workspaces, which increased duplication and cost
Current Approach: With UC [One]
- Unity Catalog
- Centralized governance model that sits on top of the workspaces
- A single entity from which you can govern each and every object in your workspaces, from one place
Main Challenges it Solves:
1. User Groups:
- We did not have a centralized place to manage our user groups; now we do (account-level groups)
2. Metastore:
- Share data between workspaces
- e.g., from notebook NB1 in one workspace to NB2 in another
- Just set permissions; no copying needed (see the GRANT sketch below)
Pic: "Define once, secure everywhere"; everything under one umbrella
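A minimal sketch of "just set permissions", assuming a table main.sales.orders and an account-level group analysts (both hypothetical names); any workspace attached to the same metastore can then read the table without cloning it:
-- Define once at the metastore level; secure everywhere
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `analysts`;
-- NB2 in another workspace can now simply query it
SELECT * FROM main.sales.orders;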
>> Features
- Access Control
- Auditing
- Data Discovery
- Data Lineage
1: Access Control:
- Out of, say, 100 users in a project, only a few should have access to tables, views, and functions
2: Auditing:
- Automatic audit logs (who is creating and deleting tables, who is accessing views); see the sketch after this list
3: Data Discovery:
- Provides a search interface over data assets
4: Data Lineage:
- Keeps track of data flow from source to destination
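A small sketch of auditing and discovery via system tables; system.access.audit and system.information_schema are Databricks system objects, but the action names and filters here are illustrative assumptions:
-- Auditing: who created or deleted tables recently
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE action_name IN ('createTable', 'deleteTable')
ORDER BY event_time DESC;
-- Discovery: search for tables by name across the metastore
SELECT table_catalog, table_schema, table_name
FROM system.information_schema.tables
WHERE table_name LIKE '%orders%';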
>> Limitations
- Bucketing is not supported on Unity Catalog tables; attempting it throws an exception
- Python UDFs are not supported
- Scala UDFs are not supported
- Shallow clones are not supported
- Groups that were previously created at the workspace level cannot be used in Unity Catalog GRANT statements
>> The Unity Catalog Object Model
(Top-Level Container) => Metastore
LEVEL 1 => Catalog
LEVEL 2 => Schema
LEVEL 3 => Tables, Views, Volumes, Functions
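The metastore never appears in object names; everything below it is addressed through the three-level namespace. A quick sketch with hypothetical names:
-- catalog.schema.object (the metastore is implicit)
SELECT * FROM hospital_catalog.cardiology.patients;
-- Volumes are addressed by path: /Volumes/<catalog>/<schema>/<volume>/...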
>> Example
Healthcare Data Management: a Hospital
- Manage and secure their data
- Keep sensitive patient data protected (see the sketch below)
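One way UC protects sensitive data is column masking. A minimal sketch, assuming a patients table with an ssn column and an account group medical_staff (all hypothetical names):
-- Masking function: only medical_staff sees the real value
CREATE OR REPLACE FUNCTION hospital_catalog.cardiology.mask_ssn(ssn STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('medical_staff') THEN ssn ELSE '***-**-****' END;
-- Attach the mask to the sensitive column
ALTER TABLE hospital_catalog.cardiology.patients
ALTER COLUMN ssn SET MASK hospital_catalog.cardiology.mask_ssn;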
_______________________________________________________________________________
>>> Delta Sharing
- Secure data sharing with users outside the organization
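A minimal Databricks SQL sketch, assuming a table main.sales.orders and a recipient named partner_org (both hypothetical names):
-- Bundle tables into a share
CREATE SHARE sales_share COMMENT 'Orders shared with an external partner';
ALTER SHARE sales_share ADD TABLE main.sales.orders;
-- Register the outside party and grant them the share
CREATE RECIPIENT partner_org;
GRANT SELECT ON SHARE sales_share TO RECIPIENT partner_org;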
>>> Lakehouse Federation
- In layman's language: connecting to and querying data that lives in third-party systems without first migrating it, and here "connector" means Oracle, Snowflake, MySQL, etc. (see the sketch below)
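A minimal sketch using PostgreSQL as the connector, assuming credentials live in a secret scope called db_secrets (all names hypothetical):
-- One-time: register the external system as a connection
CREATE CONNECTION pg_conn TYPE postgresql
OPTIONS (
  host 'pg.example.com',
  port '5432',
  user secret('db_secrets', 'pg_user'),
  password secret('db_secrets', 'pg_password')
);
-- Mirror one of its databases as a foreign catalog
CREATE FOREIGN CATALOG pg_shop USING CONNECTION pg_conn
OPTIONS (database 'shopdb');
-- Query it in place; the data is not copied into Databricks
SELECT * FROM pg_shop.public.customers;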
>>> Catalog Binding
- There are three environments, and we bind each catalog to only particular workspaces (so sensitive info is available only where it should be)
Three Workspace Environments in a Project using UC (see the sketch after this list):
- Development
- Testing
- Production
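A sketch of the per-environment layout (hypothetical catalog names); the binding itself is configured in Catalog Explorer or via the workspace-bindings REST API, not in SQL:
-- One catalog per environment
CREATE CATALOG dev_catalog;
CREATE CATALOG test_catalog;
CREATE CATALOG prod_catalog;
-- prod_catalog is then bound to the Production workspace only,
-- so its sensitive data is not visible from the Dev or Test workspaces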
>> How To Do UC Setup?
Requirements before setting up Unity Catalog:
- Must have the Global Administrator role (Azure AD / Entra ID)
- Must have the Premium plan of Azure Databricks
- Create permission: to create an Azure Data Lake Storage Gen2 account
Go to portal.azure.com and create these three: Azure Databricks Workspace, Azure Storage Account, Access Connector for Azure Databricks
> Resource Group
1. Azure Databricks Workspace:
- Select the Premium tier
2. Azure Storage Account:
- Enable Hierarchical Namespace: ON
- Account type: Block Blobs
3. Access Connector for Azure Databricks:
- Copy and keep the Resource ID; you will need it while creating the metastore, because your files will no longer be stored at the DBFS mount point (/mnt) but at an external location (storage on AWS, Azure, or GCP)
Then,
Set IAM Role (on the storage account):
- Add role: Storage Blob Data Contributor
- Assign access to: Managed Identity (select the access connector)
Then,
Log in to the Databricks account console: accounts.azuredatabricks.net
>> Create Metastore:
- Go to the Data section
- Click Create Metastore
- Name: metastore_2409
- ADLS Gen2 path
- Access Connector ID (the Resource ID you copied earlier)
- Delta Sharing: turn this ON
>> Code
CREATE CATALOG <<catalog_name>>;
USE CATALOG <<catalog_name>>;
CREATE DATABASE <<database_name>>;   -- DATABASE and SCHEMA are synonyms here
CREATE TABLE <<table_name>> (.......);
INSERT INTO .....;
SELECT * FROM .........;
Now you have made the THREE-LEVEL NAMESPACE: catalog.database.table
In this way we set up Unity Catalog