[Unstructured to Structured] Learn, Relearn and Unlearn Technology
# Terms Used by a Data Engineer
> Tables: Only structured data
> Volumes: All types [structured + unstructured]
> Catalog Binding: Restricts user permissions
Different categories: 1. Production 2. Development 3. Testing
> SCIM Provisioning:
- We don't need to explicitly link/sync Azure Active Directory identities
- When a resource (user) leaves the company, we don't need to explicitly remove their ID; this automatic sync is what SCIM provisioning means
- Storage Credentials
- External locations
- Celebal Tech utility: Unity Launcher (UR utility); UCX is a similar utility to Unity Launcher
- Two-level --> three-level namespace (UC only supports the three-level namespace)
- If we use mounts, we need to replace them with external locations
- If we use RDDs, replace them with DataFrames
- Group: handled by users
- Service Principal: handled by a machine
2 September, Monday to 7 September, Saturday
Index:
1.UC Technical Training
- UC Essentials
- Demo Training-1
- Demo Training-2
- Demo Training-3
- Demo Training-4
- Demo Training-5
- Day 1: Topics => Python Basics
- Day 2: Topics => Python Basics
- Day 3: Topics => Python's important data structures: list, tuple, dict
- Day 4: Topics => OOP Concepts
- OOP Pillars (a small sketch follows this list):
- Class
- Object
- Data Abstraction
- Encapsulation
- Inheritance
- Polymorphism
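A minimal Python sketch of these pillars (class names like `Vehicle`/`Car` are illustrative, not from the training):

```python
from abc import ABC, abstractmethod


class Vehicle(ABC):                      # Class + data abstraction (abstract base class)
    def __init__(self, brand):
        self.__brand = brand             # Encapsulation: name-mangled "private" attribute

    @property
    def brand(self):                     # Controlled read access to the private attribute
        return self.__brand

    @abstractmethod
    def wheels(self):                    # Abstract method: subclasses must implement it
        ...


class Car(Vehicle):                      # Inheritance: Car is-a Vehicle
    def wheels(self):
        return 4


class Bike(Vehicle):
    def wheels(self):
        return 2


for v in (Car("Tata"), Bike("Hero")):    # Object creation + polymorphism: same call, different result
    print(v.brand, v.wheels())
```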
Summary:
1. Unity Catalog Essentials:
- Overview of UC : 6 Sessions
- Compute Resource and UC
- Data Access Control in UC
- UC patterns and Best practice
- UC is central hub for administering and securing your data
- UC enables => a. access control b. auditing across the Databricks platform
- Describe UC key concepts and how UC integrates with the Databricks platform
- Manage groups, users and service principals
- Create and manage a UC metastore
Let's understand the Databricks Lakehouse Platform =>
1] Overview
Question: What is it?
Ans: It's a data platform that combines the data lake + data warehouse and allows organizations to store, process and analyze data in various formats.
Question: What does it do?
Ans: It provides a scalable environment for DE + DS + ML and allows seamless data collaboration + integration tools + faster insights.
Question: What services make it up?
Ans: The platform integrates various services, including:
- Delta Lake [for structured + unstructured data management]
- Databricks SQL [for querying and visualizing data]
- Databricks ML and Databricks Data Science & Engineering [for building and deploying machine learning models...]
2] Key Components
Delta Lake:
- Storage layer => [ACID transactions]
- Batch + streaming
Databricks SQL:
- Service for running SQL queries on your data lake
Structured Streaming:
- Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
Databricks Runtime:
- Set of core components that ensure high performance for processing large-scale data; optimized versions of Apache Spark
3] Lakehouse Architecture
- Combines [data lake + data warehouse] in a single platform, eliminating the need for separate systems
- Understanding how Databricks handles data governance, security and data quality is crucial.
4] Common Use Cases
DE => Building and managing data pipelines to transform raw data into actionable insights
DS => Developing, training and deploying ML models
Data Analysts => Using SQL and other analytical tools to explore and visualize data
5] Integration with Other Tools
Cloud integration =>
- Integrates with major cloud platforms like AWS, GCP and Azure
- Integration with tools like Git, MLflow and other APIs for better collaboration and version control
- Data Governance in UC Video
- Key Concept in UC
- UC architecture
- Roles in UC
- UC identities
- Security Model in UC
- Data Access Control: e.g. 100 users in a project
- Data Access Audit: audit logs in DLT
- Data Discovery: Search bar
- Data Lineage: keeps track of data flow from source to destination
# Challenges in Data Lakes:
- No fine-grained access (control)
- No common metadata layer
- Non-standard, cloud-specific governance models
- Hard to audit
- No common governance model across data asset types (like HR data, employee data, client data, data from different sectors)
# What Unity Catalog brings:
- Unify governance across clouds
- Unify data and AI assets
- Unify existing catalogs
- Fine-grained access for data lakes across clouds, based on the open standard ANSI SQL
- Centrally share, audit, secure and manage all data types with one simple interface
- Works in concert with existing data, storage and catalogs: no hard migration required
- Unity Catalog has Three Level Namespace
- Traditional two level : select * from schema.table
- Unity Catalog Three Level namespace: select * from catalog.schema.table
- view only
- can not modify data
- Cloud Storage
- Read-only logical collection
- Administers underlying cloud resources
- Storage Accounts/buckets
- IAM role/service principals/Managed Identities
3. Account Administrator
4. Metastore Admin
5. Data Owner
6. Workspace Administrator
- Principal ----- sends query ----->> Compute
- Access legacy metastore
- Managing principals overview
- Adding and deleting users
- Adding and deleting service principals
- Adding and deleting groups
- Assigning service principals and groups to workspaces
- Log in to the account console
- User Management
- Add User
- Main
- Permission
- Grant
- Name: db.analyst (Search Bar)
- User
- Test user
- Delete user
- User Management
- Service principal
- Add service principal
- Name: Terraform
- Identities
- nesting Groups
- User Management
- Groups
- Add Group
- Group Name: Analyst
- Account Console
- Workspace
- Student-MNF
- Permission
- Add Permission
- Search => (user, group or service principal)
- Account Administrator capabilities
- Cloud resources to support the metastore
- Completed the "Managing Principals in UC" demo
- Log in
- Data
- create metastore
- Name => main us-east
- Region=> N.virginia
- S3.bucket path
- IAM Role ARN
- Create
- Skip
- A metastore must be located in the same region as the workspaces it is assigned to
- A metastore can be assigned to multiple workspaces as long as they are all in the same region
- A workspace can only have one metastore assigned at any given time
- Data
- Development
- Workspace (TAB)
- Assign to workspaces
- Data
- Development
- Owner=> sauru@databricks.com
- Choose another user
- Describe UC key concepts and how UC integrates with the Databricks platform
- Manage groups, users and service principals
- Create and manage a UC metastore
- Importance of UC
- Features of UC
- Metastore
- Two-level namespace vs three-level namespace
- Storage Credentials and External Location
- Databricks Workspace: Before and After UC
- Improves Data Values
- Reduce Data Cost
- Increase Revenue
- Ensure Security
- Promote Clarity
- Simplifies Data System
- Discovery
- Access Control (Only Authorize users)
- Lineage
- Monitoring
- Auditing
- Data Sharing
- Metastore: top-level container | data and metadata of all data in the Databricks workspace
- Audit Log: actions and events on data, e.g. via the DLT framework in Databricks
- Account Level Management
- Storage Credentials
- ACL Store
- Access Control : roles and policies
- Data Explorer : interface for browsing and discovering data assets
- Lineage Explorer : Data Flow (Source to Destination)
- Top Level Container
- Data and metadata of all data in the Databricks workspace are stored in one centralized repository, i.e. the metastore
- Note:
- In one region => only one metastore
- Under one metastore => multiple Databricks workspaces
- Metastore > Catalogs > Schemas > Tables, Views, Volumes, Models, Functions
- select * from schema.tablename
- Unity Catalog adds the ability to create multiple catalogs apart from the Hive metastore (e.g. Catalog 1, Catalog 2, etc., depending on the business use case)
- select * from catalog.schema.table
- By creating different catalogs we can segregate the organization's data by department or environment (a short sketch follows this list), e.g.:
- Finance Department
- Data Engineering
- Data Analyst
- Production, Development, Staging Data
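A small sketch of the two-level vs three-level references above, assuming a Databricks notebook where `spark` is predefined and hypothetical names (`finance` catalog, `sales` schema, `invoices` table):

```python
# Traditional Hive metastore style: two-level reference (schema.table)
spark.sql("SELECT * FROM sales.invoices").show()

# Unity Catalog style: three-level reference (catalog.schema.table)
spark.sql("SELECT * FROM finance.sales.invoices").show()

# Creating a department-specific catalog to segregate data (requires CREATE CATALOG privilege)
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.sales")
```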
- Once a storage credential is created, access to it can be granted to principals (users and groups); a short sketch follows
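A hedged sketch of that flow, assuming a notebook where `spark` is predefined and placeholder names (`my_cred` storage credential, `landing_zone` external location, an example S3 path):

```python
# Create an external location backed by an existing storage credential
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing_zone
    URL 's3://my-bucket/landing/'
    WITH (STORAGE CREDENTIAL my_cred)
""")

# Grant principals (users/groups) access on the external location
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION landing_zone TO `data_engineers`")
spark.sql("GRANT CREATE EXTERNAL TABLE ON EXTERNAL LOCATION landing_zone TO `data_engineers`")
```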
- Before and After UC
- Lineage
- What's Workspace Level Catalog binding
- Cluster Level Catalog binding
- Notebook Level Catalog Binding
- Discussion of migrating assets as part of this Unity Catalog migration, like:
- Table Migration
- Cluster Migration
- Job Migration
- This feature is available only in Unity Catalog, not in Hive
- Eg: lineage direction: upstream and downstream
- code : select * from <<upstream>> <<downstream>>
- The Catalog Binding is in three level
- Workspace
- Cluster
- Notebook
- Even with the three-level namespace we can still use two-level references here... this is the default-catalog side of catalog binding
- Settings > Advanced > Other > Default catalog (set it for two-level references)
- Before, references were three-level, but once the default catalog option is set, both the three-level and two-level forms give the same output
- hms_externl_cloud_table
- hms_external_dbfs_table
- hms_managed_table
- we create table
- Each Hive metastore table will be migrated to UC
- The metadata is replicated in Unity Catalog
- Compute
- All Purpose
- More.... Clone
- Change Name of Cluster to UC Cluster
- Policy: Unrestricted
- Mutlinode
- Access Mode: Single User / Shared (UC is enabled only on DBR 10.4 LTS and above)
- Photon => ON/OFF
- IAM Role
- Enable credential passthrough
- (Don't enable it... UC will not be available)
- Deep Clone
- SYNC
- CTAS
- Lineage
- Binding
- Data Assets Migration Like:
- Table Migration
- Cluster Migration
- Jobs Migration
- Improves Security
- Question 1] Role-based access control: see the MS account documentation
- Question 2] Creating a service principal and granting it access
- How does UC differ from the Hive metastore?
- How would you map Hive metastore permissions to Unity Catalog?
- After migrating a Hive table to UC, how do you validate that the data and schema are correct? Ans: validations ---> table counts: rows and columns
- Question 3] How does UC handle multi-tenancy?
- Question 4] How can you access/audit data access in UC?
- Question: managed table, external table at DBFS ||| I migrate Hive to UC. Ans: create table as select [CTAS] (a sketch follows)
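A minimal CTAS sketch for that last question, with placeholder table names (`hive_metastore.default.orders` into `main.bronze.orders`) and the row/column validations mentioned above; assumes a Databricks notebook where `spark` is predefined:

```python
# Migrate: create the UC table from the legacy Hive metastore table (CTAS)
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.bronze.orders
    AS SELECT * FROM hive_metastore.default.orders
""")

# Validate: row counts and column lists should match between source and target
src = spark.table("hive_metastore.default.orders")
tgt = spark.table("main.bronze.orders")
assert src.count() == tgt.count(), "row count mismatch"
assert src.columns == tgt.columns, "schema (column) mismatch"
```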
- Workflows list
- Start or stop workflow using API
- Cluster List (Show...)
- Workspace NB Counts ---(Start| STOP |Terminate Cluster)
- Create Cluster Through API
- Notebook list
- Stop (pause) a job through the API (a small REST sketch follows)
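A rough sketch of two of these tasks (list clusters, list workflows/jobs) against the standard Databricks REST endpoints `/api/2.0/clusters/list` and `/api/2.1/jobs/list`; the environment-variable names are placeholders:

```python
import os
import requests

HOST = os.environ["DATABRICKS_HOST"]        # e.g. https://<workspace>.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]      # personal access token
headers = {"Authorization": f"Bearer {TOKEN}"}

# List clusters in the workspace
clusters = requests.get(f"{HOST}/api/2.0/clusters/list", headers=headers).json()
for c in clusters.get("clusters", []):
    print(c["cluster_name"], c["state"])

# List workflows (jobs)
jobs = requests.get(f"{HOST}/api/2.1/jobs/list", headers=headers).json()
print("job count:", len(jobs.get("jobs", [])))
```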
- RDD
- Dataframes
- You Can use anywhere
- It depends on the data
- Data shuffling (more)
- By default: 200 partitions, 200 tasks
- If you say you want fewer partitions, e.g. 10 or 50, then it's repartition
- Each partition produces one task
- It depends on how many narrow/wide transformations we have (a sketch follows after this list)
- Data Schema
- Partitions
- Optimization Techniques... etc
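A small PySpark sketch of the points above (default shuffle partitions, one task per partition, repartition vs coalesce), using a synthetic DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default number of partitions/tasks produced by a shuffle (wide transformation)
print(spark.conf.get("spark.sql.shuffle.partitions"))   # "200" by default

df = spark.range(1_000_000)

# Each partition becomes one task; a wide transformation (groupBy) triggers a shuffle
agg = df.groupBy((df.id % 10).alias("bucket")).count()

# Asking for fewer partitions: repartition (full shuffle) vs coalesce (merges partitions, cheaper)
fewer_shuffled = agg.repartition(10)
fewer_cheap = agg.coalesce(10)
print(fewer_shuffled.rdd.getNumPartitions(), fewer_cheap.rdd.getNumPartitions())
```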
- Python Fundamentals
- Why Python
- Python vs Java
- Python variable
- Python Data types
- String Functions:
- upper
- lower
- capitalize
- lstrip
- rstrip
- startswith
- endswith
- Methods:
- split
- join
- Python lists and their functions
- Python tuples and their functions (a quick sketch follows)
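A quick Python sketch of the string functions, methods, and list/tuple basics listed above:

```python
s = "  hello Databricks  "

print(s.upper())                        # "  HELLO DATABRICKS  "
print(s.lower())                        # "  hello databricks  "
print(s.strip().capitalize())           # "Hello databricks"
print(s.lstrip())                       # left-trim only
print(s.rstrip())                       # right-trim only
print(s.strip().startswith("hello"))    # True
print(s.strip().endswith("bricks"))     # True

# split and join
words = "unity,catalog,migration".split(",")    # ['unity', 'catalog', 'migration']
joined = " | ".join(words)                       # 'unity | catalog | migration'
print(words, joined)

# list and tuple basics
nums = [3, 1, 2]
nums.append(4)          # lists are mutable
nums.sort()
point = (10, 20)        # tuples are immutable
print(nums, point[0])
```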
______________________________________________________________________
>>> Spotify End to End Pipeline Project PPT Preparation
Module 1:
- ETL Pipeline
- Architecture
- Spotify API
- Cloud Providers : AWS
- AWS Services
- Storage => Amazon S3
- Compute => AWS Lambda
- Logs/Triggers => Amazon CloudWatch
- Data Crawler => AWS Glue Crawler
- Data Catalog => AWS Glue Data Catalog
- Analytics Query => Amazon Athena
Module 2:
- Go to Spotify: sign up for a developer account
- Jupyter Notebook (Python libraries)
- Client ID, Client Secret
- Write Python code to extract data from the Spotify API (and perform all transformations on that data....); a hedged sketch follows
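A hedged sketch of that extraction step, assuming the `spotipy` client library; the credentials and playlist ID are placeholders and the actual project code may differ:

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Client ID / Secret come from the Spotify developer dashboard (placeholders here)
auth = SpotifyClientCredentials(client_id="YOUR_CLIENT_ID",
                                client_secret="YOUR_CLIENT_SECRET")
sp = spotipy.Spotify(auth_manager=auth)

# Pull tracks from a playlist and keep a few fields for later transformation
playlist_id = "YOUR_PLAYLIST_ID"
results = sp.playlist_tracks(playlist_id)

rows = []
for item in results["items"]:
    track = item["track"]
    rows.append({
        "song": track["name"],
        "artist": track["artists"][0]["name"],
        "album": track["album"]["name"],
        "popularity": track["popularity"],
    })
print(rows[:3])
```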
Module 3:
- AWS Account needed
- Select region: North Virginia
- Billing dashboard: (receiving billing alerts)
- Moving Data
>>> SCD and CDF: implement practically
From Rajas DE
>>> SCD Types
Slowly Changing Dimensions (SCD) can be implemented using Delta Lake, which provides capabilities for handling big data and supports ACID transactions.
# Types :
- SCD Type 1 (Overwrite)
- SCD Type 2 (History)
- SCD Type 3 (Add new attribute)
- SCD Type 4 (Historical Table)
- SCD Type 6 (Hybrid SCD)
_______________________________________________________________________
A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse (fed from a source system). It is considered and implemented as one of the most critical ETL tasks for tracking the history of dimension records.
# Types=>
- SCD Type 1 (overwriting existing data)
- SCD Type 2 (maintaining full history by creating another dimension record)
- SCD Type 3 (maintaining one-time history in a new column)
- Type 1: use when you don't want to maintain any history in your target tables
- Eg: customer changes address
- # if you don't need any history
- Type 2: maintaining history as more records
- Eg: customer phone number
- # if you need all history
- Type 3: maintain one-time history in a new column
- # if you need only the previous value (a Delta merge sketch for Types 1 and 2 follows)
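A hedged Delta Lake sketch of Type 1 (overwrite in place) and a simplified Type 2 flow (close the current record, append a new one); assumes a Databricks notebook where `spark` is predefined, and all table/column names are placeholders:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

target = DeltaTable.forName(spark, "dim_customer")     # placeholder dimension table
updates = spark.table("staging_customer")              # placeholder staging table

# SCD Type 1: overwrite the existing attribute, no history kept
(target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdate(set={"address": "s.address"})
    .whenNotMatchedInsertAll()
    .execute())

# SCD Type 2 (simplified two-step sketch): close the current record when the tracked
# attribute changes, then append the incoming rows as the new "current" records
(target.alias("t")
    .merge(updates.alias("s"),
           "t.customer_id = s.customer_id AND t.is_current = true AND t.phone <> s.phone")
    .whenMatchedUpdate(set={"is_current": "false", "end_date": "current_date()"})
    .execute())

(updates.withColumn("is_current", F.lit(True))
        .withColumn("start_date", F.current_date())
        .withColumn("end_date", F.lit(None).cast("date"))
        .write.format("delta").mode("append").saveAsTable("dim_customer"))
```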
Reference of video : Youtube Techlake Channel
~sauru_6527
30 th august 2024
________________________________________________________________________
>> Atharva Interview Questions
> Python
- Split Code
- Join Code
- Count of Vowels Code
- Exception Handling
- List, tuple
- Architecture
- Cache / Persist
- Coalesce / Repartition
- Catalyst Optimizer
- AQE
>> Change Data Capture(CDC) ~ Seattle Data Guy
A] Intro
CDC [Change Data Capture]
- Captures the changes [insert, update, delete] that appear in a database; these changes are often stored in some sort of log
> Real-Time Data Landscape:
- Open source, e.g. Apache Spark
- Hybrid, e.g. Databricks
- Managed service, e.g. DeltaStream
B] Write-Ahead Logs [WAL]
- A common approach used is the WAL
- Real-time data (if you want to do real-time analytics...)
- Historical data preservation (a Delta CDF sketch follows)
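A small Delta sketch of capturing those insert/update/delete changes with Change Data Feed (CDF), which the later tasks mention; assumes a notebook where `spark` is available and an `orders` table as a placeholder:

```python
# Enable the change data feed on a Delta table so inserts/updates/deletes are recorded
spark.sql("""
    ALTER TABLE orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the captured changes (similar in spirit to tailing a write-ahead log)
changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 0)
           .table("orders"))

# _change_type is insert, update_preimage, update_postimage or delete;
# order_id is a placeholder business column
changes.select("order_id", "_change_type", "_commit_version", "_commit_timestamp").show()
```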
D] Where have I used CDC?
Examples of CDC solutions that I have used in the past
>> Now I have to do the CDC by Rajas DE
27th August, Tuesday, 2024
Tasks assigned today
- CDC, CDF From Youtube intro
- Rajas DE Playlist
- UC Migration Links sent By Pratima Jain
- CDC, CDF, Apache Spark, Rajas DE playlist, UC migration links + be prepared for the interview
- Do LeetCode in your spare time
- Do Apache Spark Course
- Do Rajas Data Engineering Playlist Make A plan to complete 133 videos
- Do 30 Pyspark Questions
- Taxi Sheet Questions to do
- UC Migration (Main Project Where mam is working)
- PyCharm: Daily Practice i.e(@ PG Complete the Darshil python Course Again)
- Make PPT Of Topics Covered till date today and submit to Mayank
________________________________________________________________
Today's Date: 23 August, Friday 2024
- Two Tasks given by Nishitha Mam are # Task 1 and Task 2 Refer: Intern Group on Teams [yet to do...]
- 30 + Pyspark Questions [5 Done...] [yet to do...] : pyspark.xlsx
____________________________________________________
>> Four Questions based on project
Q1) Estimate of 1000 GB data
- Driver: 4 Cores, 32 GB RAM Memory
- Executors : 3 Executor per Instance
- Instance : 2
- Cost : $ 2.17 Per Hour => 182.6 Per Hour
- Understand the Medallion Architecture
- Generate random data with a Python function and then apply Auto Loader on it (a sketch follows this list)
- DLT (Delta Live Tables)
- 5 questions done (2 remaining...)
- widgets concept in it...
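A hedged sketch of that Auto Loader task: drop a randomly generated JSON file into a landing path, then stream it in with `cloudFiles`; assumes a Databricks notebook (`spark` and `dbutils` predefined) and placeholder paths/table names:

```python
import json
import random
import uuid

# Step 1: generate a small random JSON file into the landing folder (path is a placeholder)
landing = "/tmp/autoloader_demo/landing"
record = {"id": str(uuid.uuid4()), "value": random.randint(1, 100)}
dbutils.fs.put(f"{landing}/{record['id']}.json", json.dumps(record), overwrite=True)

# Step 2: Auto Loader picks up new files incrementally from the landing folder
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/tmp/autoloader_demo/_schema")
          .load(landing))

# Step 3: write the stream into a bronze table, processing whatever is available now
(stream.writeStream
       .option("checkpointLocation", "/tmp/autoloader_demo/_checkpoint")
       .trigger(once=True)
       .toTable("bronze_random_events"))
```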
>>> Questions By Pratima
>>> Cluster/ Compute
1] All-Purpose Compute
- Analyze data in NB
- Create, Terminate and Restart
- Cost: Expensive
2] Job Compute
- Only supports running a notebook as a job
- No restart
- Cost: low
3] Instance Pools
- An instance pool is a pool of resources (a set of VMs), like a swimming pool of machines
- E.g. if you have a job that requires a lot of processing power, you can assign more instances from the pool
- If the workload decreases, you can release the instances back to the pool
- Optional (On/Off)
- Improves performance; use when running a lot of SQL code; cost-optimal
>>> Databricks Architecture
>> Control Plane:
- Cluster Manager
- Handled by Databricks
>> Data Plane:
- Storage: VMs, Blob storage
- Handled by the cloud provider (Azure, GCP)
>>> Unity launcher (Celebal Product)
- Boost Your UC Migration By 70%
>>> Tree in UC
Levels :
- Metastore
- Catalogs
- Schemas
- Tables, Volumes
>>> Prerequisites in UC
- Latest: 11.3 LTS
- Current: 10.4
>>> 3 Workspace Environment in Project using UC
- Development
- Testing
- Production
REST API Understanding
1: Introduction To REST API
REST API (Representational State Transfer)
API => Application Programming Interface
If your application is DYNAMIC => Eg: ZOMATO
Dynamic App (Zomato) -----> (Request) -----> Web server
Dynamic App (Zomato) <----- (Response) <----- Web server
We get the response in HTML or JSON format,
but an HTML response is effectively unstructured data for this purpose.
HTML response => wrong
We need structured data => right (e.g. a single value)
> JSON format (dict, key-value)
> XML format (hierarchical data structure)
>>> Issue: we need a lot of methods to get the required information from the server; to solve this issue we use a REST API
> REST API definition:
A REST API creates an object and thereafter sends the values of that object in response to client requests
OBJECT => IMPORTANT POINT (we get only one object's value)
> Eg:
city : Jaipur
restaurant name : Moms tiffin
food item : Dal bati
The data above is sent to our dynamic app (Zomato)
2: REST API connection with Databricks using the fake API JSONPlaceholder
code:
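The actual code wasn't captured in these notes; a minimal sketch of what that JSONPlaceholder call looks like:

```python
import requests

# JSONPlaceholder is a free fake REST API; /posts/1 returns a single structured JSON object
url = "https://jsonplaceholder.typicode.com/posts/1"
response = requests.get(url)

post = response.json()          # dict with keys like userId, id, title, body
print(response.status_code)     # 200 on success
print(post["title"])
```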
>>>>> YOUTUBE SECTION
Links to refer :
> YouTube Videos and playlist:
> Channel name : Rajas DE
> Unity catalog in 60 Seconds : Youtube
Youtube Video 1: Connect REST API in DataBricks via JSON placeholder [yet to do]
Summary: Youtube Video
Important :
Create API endpoint =>
Eg: url = "https://jsonplaceholder.typicode.com/post/1"
Youtube Video 2: How to get Access Token in DataBricks [steps..]
Summary:
Steps :
- user>user setting
- developer>access token
- manage>generate new token
- Copy and Use that token wherever you want
.
.
.
Questions By Pratima Mam
Note: solve these using the REST API; refer to these docs: Link
Question 1: Fetch all the schemas present in the Hive metastore; I want a list
Question 2: Fetch all the tables present in each schema
Question 3: I want the count of tables and views for each schema
Sample output =>
Schema Name: employee
View(Count): 15 views
Table(Count): 10 tables
Addn. Question 4: Get the number of workflows and notebooks in your Databricks account (a hedged REST sketch for Questions 1-3 follows)
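One hedged way to approach Questions 1-3, assuming the workspace exposes the Unity Catalog REST endpoints (`/api/2.1/unity-catalog/schemas` and `/tables`) and that the legacy data is visible under the `hive_metastore` catalog; adapt to whatever the shared docs actually prescribe:

```python
import os
from collections import Counter

import requests

HOST = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
catalog = "hive_metastore"

# Q1: list all schemas in the catalog
schemas = requests.get(f"{HOST}/api/2.1/unity-catalog/schemas",
                       headers=headers, params={"catalog_name": catalog}).json()
schema_names = [s["name"] for s in schemas.get("schemas", [])]
print(schema_names)

# Q2/Q3: tables per schema, counting views vs tables via the table_type field
for schema in schema_names:
    tables = requests.get(f"{HOST}/api/2.1/unity-catalog/tables",
                          headers=headers,
                          params={"catalog_name": catalog, "schema_name": schema}).json()
    kinds = Counter(t["table_type"] for t in tables.get("tables", []))
    views = kinds.get("VIEW", 0)
    others = sum(kinds.values()) - views
    print(f"Schema Name: {schema} | View(Count): {views} | Table(Count): {others}")
```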
.
____________________________________________________________________________
>>> Doubted Questions: [Don't be doubtful; ask the questions and get their answers]
- What is your name ?
>>>>> Full Roadmap of associate DE-DBC
- Lakehouse (24%)
- ETL With Spark SQL and Python (29%)
- Incremental Data processing (22%)
- Production Pipelines (16%)
- Data Governance (9%)
- Benefits of Unity catalog
- Unity catalog features
- Configuring access to production tables and databases
- Creating different levels of permissions for users and groups
- Data Governance
- Defn
- With and Without UC
- Key Features
- UC Catalog Object Model
- Metastore
- Object Hierarchy in metastore
- Working with database objects in UC
- other Securable Objects
- Granting and revoking access to database objects and other securable objects in uc
- admin roles
- managed vs external tables and volumes
- Data isolation using managed storage
- workspace catalog binding
- auditing data access
- tracing data Lineage
- Lakehouse Federation and UC
- Delta Sharing, Databricks Marketplace and UC
- How i setup Unity Catalog for my organization
- Migrating an existing workspace to UC
- UC catalog requirement and restrictions
- Region Support
- Compute Requirements
- File Format Support
- Securable Object naming requirements
- Limitations
- Resource Quota
- Data And AI Summit 2023 : Whats-new-unity-catalog-data-and-ai-summit-2023
2. Entity Permission
# Configuring access to production tables and databases
- Manage privileges in UC
- Who can manage Privilege
- Workspace Catalog Privileges
- Inheritance Model
- Show Grant And Revoke privilege
- Show grants on Objects in UC metastore
- Show my grants on objects in UC metastore
- Revoke permission on object in UC metastore
- Show grants on metastore
- Grant Permission on metastore
- Revoke permission on the metastore (a grant/revoke sketch follows)
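A short sketch of the grant/revoke/show-grants commands above, with placeholder object and group names; assumes a notebook where `spark` is available:

```python
# Grant a group SELECT on a table and USE on its parent catalog/schema
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Inspect grants
spark.sql("SHOW GRANTS ON TABLE main.sales.orders").show()   # grants on an object
spark.sql("SHOW GRANTS ON METASTORE").show()                  # grants at the metastore level

# Revoke when access is no longer needed
spark.sql("REVOKE SELECT ON TABLE main.sales.orders FROM `analysts`")
```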
- Access control lists
- Access control list overview
- Manage Access control list with folders
- AI/BI dashboard ACLs
- Alerts ACLs
- Compute ACLs
- Legacy dashboard ACLs
- Delta Live Tables ACLs
- Feature table ACLs
- File ACLs
- Folder ACLs
- Genie space ACLs
- Git folder ACLs
- Job ACLs
- MLflow experiment ACLs
- MLflow model ACLs
- Notebook ACLs
- Pool ACLs
- Query ACLs
- Secret ACLs
- Serving endpoint ACLs
- SQL warehouse ACLs
- Data governance with UC
- Centralize access control using UC
- track Data lineage using UC
- Discover data using Catalog explorer
- Share data using Delta Sharing
- Configure Audit Logging
- Configure Identity
- legacy data governance solutions