[Unstructured to Structured] Learn, Relearn and Unlearn Technology

# Terms Used by Data Engineers

> Tables: structured data only

> Volumes: all types [structured + unstructured]

> Catalog Binding: restricts user permissions

Different categories: 1. Production 2. Development 3. Testing

> SCIM Provisioning (noted in the session as "Scheme Provisioning"):

  • No need to explicitly define/link identities; they sync automatically from Azure Active Directory
  • When a resource (person) leaves the company we don't need to explicitly remove their ID; this automatic identity sync is what SCIM provisioning means
> UC: 
  • Storage Credentials 
  • External locations
  • Celebal Tech utility (UNITY LAUNCHER) => UCX is similar to Unity Launcher (our utility)
> Sync and deep clone

> Migrating notebooks from Hive metastore to UC [Steps]:
  1. Two-level namespace --> three-level namespace (UC only supports the three-level namespace)
  2. If we use mounts: replace them with external locations
  3. If we use RDDs: replace them with DataFrames
> Group vs Service Principal
  • Group: handled by users (humans)
  • Service Principal: handled by a machine (automation)
> CI/CD and Branching Strategies in Databricks
> Q: How would you deal with merge conflicts ?




2 September (Monday) to 7 September (Saturday)

Index:

1.UC Technical Training

  • UC Essentials
  • Demo Training-1
  • Demo Training-2
  • Demo Training-3
  • Demo Training-4
  • Demo Training-5
2. Delta Live Tables (YouTube playlist: Naval Yemul + S/W Engineer Testing)

3. Nishtha Questions

4. Python Boot Camp Sessions (3, 4, 5, 6)
  • Day 1: Topics => Python Basics
  • Day 2: Topics => Python Basics
  • Day 3: Topics => Python important data structures: list, tuple, dict
  • Day 4: Topics => OOPS Concepts
    • OOPS Pillar :
      • Class
      • Object
      • Data Abstraction
      • Encapsulation
      • Inheritance
      • Polymorphism


5. Saturday (7 September) [ Deep Dive into Spark and databricks...]

Summary:

1. Unity Catalog Essentials: 

  1. Overview of UC :  6 Sessions
  2. Compute Resource and UC
  3. Data Access Control in UC
  4. UC patterns and Best practice
1] Introduction to UC:
  • UC is the central hub for administering and securing your data
  • UC enables => a. Access control b. Auditing across the Databricks platform
By the end of this course you will be able to:
  • Describe UC key concepts and how it integrates with the Databricks platform
  • Manage users, groups and service principals
  • Create and manage a UC metastore
Before attempting this course you must know the high-level concepts of the Databricks Lakehouse Platform
(What is it? | What does it do? | Which services make it up?)


Let's Understand Databricks Lake House  Platform=>

1] Overview

Question: What is it?

Ans: It's a data platform that combines the data lake + data warehouse and allows organizations to store, process and analyze data in various formats.

Question: What does it do?

Ans: It provides a scalable environment for DE + DS + ML and enables seamless data collaboration + integration tools + faster insights.

Question : What Services Make it up?

Ans: The platform integrates various services, including:

  • Delta Lake [For Structured + Unstructured Data management]
  • Databricks SQL [For Querying and Visualizing Data]
  • Databricks ML and Databricks Data-Science Engineering [For building and Deploying Machine Learning models...]

2] Key Components

Delta Lake:

  • Storage layer => [ACID Transactions]
  • Batch + Streaming
Databricks SQL
  • Service for running SQL Queries on your data lake
Structured Streaming
  • Scalable and Fault tolerant Stream Processing Engine built on Spark SQL Engine
Databricks Runtime:
  • Set of core components that ensure high performance for processing large-scale data, including an optimized version of Apache Spark
3] Architecture and Data Management
a. Lakehouse Architecture:
  • Combines [data lake + data warehouse]
  • in a single platform, eliminating the need for separate systems
b. Data Management:
  • Understanding how Databricks handles data governance, security and data quality is crucial.

4] Common usecase 

DE => Building and managing data pipelines to transform raw data into actionable insights

DS=> Developing, training and deploying ML Models

Data Analysts=> Using SQL and other Analytical Tools to Explore and visualize data

5] Integration with other tools

Cloud Integration =>

  • Integrates with major cloud platforms like AWS, GCP and Azure
Collaboration Tools =>

  • Integration with tools like Git, MLflow, and other APIs for better collaboration and version control
OVERVIEW OF UNITY CATALOG
  1. Data Governance in UC Video
  2. Key Concept in UC
  3. UC architecture 
  4. Roles in UC
  5. UC identities
  6. Security Model in UC
1] Data Governance in UC 
Data Governance: four key functional areas:
  1. Data Access Control: e.g., controlling what each of 100 users on a project can access
  2. Data Access Audit: audit logs (like in DLT)
  3. Data Discovery: Search bar
  4. Data Lineage: keeps track of data flow from source to destination

# Challenges in a Data Lake:

  • No fine-grained access control
  • No common metadata layer
  • Non-standard, cloud-specific governance models
  • Hard to AUDIT
  • No common governance model across data asset types (e.g., HR data, employee data, client data, data from different sectors)
Then,
How does UC tackle these limitations?
  1. Unify governance across clouds
  2. Unify data and AI assets
  3. Unify existing catalogs
1] Unify Governance Across Clouds:
  • Fine-grained access control for data lakes across clouds, based on the open ANSI SQL standard
2] Unify Data and AI Assets
  • Centrally share, audit, secure and manage all data types with one simple interface
3] Unify Existing Catalogs
  • Works in concert with existing data, storage and catalogs: no hard migration required

2] Key Concepts in UC
  • Unity Catalog has a three-level namespace
  • Traditional two-level: select * from schema.table
  • Unity Catalog three-level namespace: select * from catalog.schema.table
Managed table => data resides in the metastore's managed storage location (i.e., a cloud storage container)
External table => data resides in other (external / 3rd-party managed) storage registered as an external location (see the sketch below)
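
A minimal runnable sketch of these concepts; the catalog, schema, table and path names below are hypothetical, not from the course:

```python
# Three-level (UC) vs two-level (legacy Hive metastore) references
spark.sql("SELECT * FROM demo_schema.employee")                # two-level: resolves inside the default catalog
spark.sql("SELECT * FROM demo_catalog.demo_schema.employee")   # three-level: catalog.schema.table

# Managed table: data lives in the metastore's managed storage location
spark.sql("CREATE TABLE demo_catalog.demo_schema.managed_employee (id INT, name STRING)")

# External table: data lives in an external location that UC governs via a storage credential
spark.sql("""
    CREATE TABLE demo_catalog.demo_schema.external_employee (id INT, name STRING)
    LOCATION 'abfss://container@storageaccount.dfs.core.windows.net/tables/external_employee'
""")
```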

# Views:
  • view only 
  • can not modify data
# Functions:
  • Reusable logic (user-defined functions) registered inside a schema
# Shares and Recipients (Delta Sharing):
  • A share is a read-only logical collection of tables; a recipient is the party it is shared with
3] Unity Catalog Architecture

Before UC vs With UC:

Before UC:
  • Each workspace manages its own users/groups, metastore, access controls and compute resources separately

With UC:
  • Users/groups, the metastore and access controls are managed once at the account level and shared across all workspaces; compute resources stay at the workspace level

4] Roles in UC
1.Cloud Administrator
  • Administers underlying cloud resources
  • Storage Accounts/buckets
  • IAM role/service principals/Managed Identities
2. Identity Administrator
3. Account Administrator
4. Metastore Admin
5. Data Owner
6. Workspace Administrator

> User
first name=
last name=
password=
admin role=

> Service principal
app id = GUID
name = 
admin role = 

>Groups
analyst
developer

>Nesting groups
all user
developer

>Identity Federation

6] Security Model in UC
1) Query lifecycle
  • Principal ------- sends query ------->>   Compute
2) Check namespace
  • Access legacy metastore
Section B] MANAGING PRINCIPALS IN UC [5 Videos]
  1. Managing Principals overview
  2. Adding and deleting users
  3. Adding and deleting service principals
  4. Adding and deleting groups
  5. Assigning users, service principals, and groups to workspaces
1] Managing Principals overview
> Identities 
user
fname=
lname=
passwd=
admin role=

>service principal
applnid = GUID
name = 
adminrole = ON/OFF

2] Adding and Deleting Users Video
# Steps =>
  • Log in to the account console
  • User Management
  • Add User
  • Email
# Grant Privileges
  • Main
  • Permissions
  • Grant
  • Name: db.analyst (search bar)
# Delete
  • Users
  • Test user
  • Delete user
3] Adding and Deleting Service Principals
Steps:
  • User Management
  • Service principals
  • Add service principal
  • Name: Terraform
4] Adding and Deleting Groups Video
  • Identities 
  • nesting Groups
Steps:
  • User Management
  • Groups
  • Add Group
  • Group Name: Analyst
5] Assigning users, service principals and groups to a workspace
Steps:
  • Account Console
  • Workspace
  • Student-MNF
  • Permission
  • Add Permission
  • Search => (user / group / service principal)
SECTION C=> MANAGING UNITY CATALOG METASTORE [3 VIDEOS]
1] Creating and Deleting Metastore in UC Video
2] Assigning a Metastore to Workspace in Unity Catalog
3] Assigning Metastore Administrators in UC

1] Creating and Deleting Metastore in UC Video
Prerequisite:
  • Account administrator capabilities
  • Cloud resources to support the metastore
  • Completed the Managing Principals in UC demo
Steps:
  • Log in
  • Data
  • create metastore
  • Name => main us-east
  • Region=> N.virginia
  • S3.bucket path 
  • IAM Role ARN
  • Create 
  • Skip
2] Assigning a metastore to a workspace in UC
Note:
  • A metastore must be located in the same region as the workspace it is being assigned to
  • A metastore can be assigned to multiple workspaces, as long as all of them are in the same region
  • A workspace can only have one metastore attached at any given time
Steps:
  • Data
  • Development
  • Workspace (TAB)
  • Assign to workspaces
3] Assigning Metastore Administrators in UC
Eg: transferring ownership/access to someone else if a person has left the organization

Steps:
  • Data 
  • Development
  • Owner=> sauru@databricks.com 
  • Choose another user
Congrats! Completed UNITY CATALOG OVERVIEW
Topics covered =>
  1. Describe UC key concepts and how it integrates with the Databricks platform
  2. Manage groups, users and service principals
  3. Create and manage a UC metastore
September 3
Demo Training-1
Demo Training-2
Demo Training-3
Demo Training-4
Demo Training-5


Summary of Demo Training 1 to 5


> DEMO TRAINING -1
TITLE: Unity Catalog Demo 
INDEX
Intro:
Agenda:
  1. Importance of UC
  2. Features of UC
  3. Metastore
  4. Two-level Namespace vs 3 level namespace
  5. Storage Credentials and External Location
  6. Databricks Workspace: Before and After UC
Explanation of Demo training 1=>
1. Importance of UC : unified Governance For Data 
Benefits of Data Governance:
  • Improves data value
  • Reduces data costs
  • Increases revenue
  • Ensures security
  • Promotes clarity
  • Simplifies data systems
Addn:
  • Discovery
  • Access Control (Only Authorize users)
  • Lineage
  • Monitoring
  • Auditing
  • Data Sharing
2. Features of UC
  • Metastore: top-level container | stores the data and metadata of all data in the Databricks workspace
  • Audit Log: actions and events on data, like the DLT framework logs in Databricks
  • Account-Level Management
  • Storage Credentials
  • ACL Store
  • Access Control: roles and policies
  • Data Explorer: interface for browsing and discovering data assets
  • Lineage Explorer: data flow (source to destination)
3. Metastore
  • Top Level Container
  • Data and metadata of all data in the Databricks workspace are stored in one centralized repository, i.e., the metastore
  • Note: 
    • In 1 Region=> only 1 metastore
    • Under 1 metastore => Multiple Databricks Workspaces
  • Metastore > catalogs > schemas > tables, views, volumes, models, functions
4. Two-level namespace vs three-level namespace in UC
Two Level
  • select * from schema.tablename
Non-UC [only 1 catalog => hive_metastore (present by default)]
  • Unity Catalog adds the ability to create multiple catalogs apart from the hive_metastore (e.g., Catalog 1, Catalog 2, etc., depending on the business use case.)
Three Level
  • select * from catalog.schema.table
Question: What is the use of having different catalogs?
Ans:
  • By creating different catalogs we can segregate the organization's data by department or environment
  • Eg: 
    • Finance Department
    • Data Engineering
    • Data Analyst
    • Production, Development, Staging Data
5. Storage Credentials and External Locations
  • Once a storage credential is created, access to it can be granted to principals (users and groups)
6. Databricks Workspace
  • Before and After UC

DEMO TRAINING - 2

Watch Videos:
Hands on to understand :
UC Features Like :
  1. Lineage
  2. What's Workspace Level Catalog binding 
  3. Cluster Level Catalog binding
  4. Notebook Level Catalog Binding 
  5. Discussion of migrating assets as part of this Unity Catalog migration, like:
    1. Table Migration
    2. Cluster Migration
    3. Job Migration
Let's get Started :
1] First feature: Lineage
  • This feature is available only in Unity Catalog, not in Hive
  • Eg: lineage direction: upstream and downstream
  • code : select * from <<upstream>> <<downstream>>
2] Catalog Binding
  • Catalog binding works at three levels:
    1. Workspace
    2. Cluster
    3. Notebook
  • Even though UC uses the three-level namespace, catalog binding lets us keep using two-level names by setting a default catalog
  • Settings > Advanced > Other > Default catalog (set) for two-level references
  • Before, queries used the three-level form; after enabling the default-catalog option, the three-level and two-level forms give the same output
1. Workspace level catalog binding
code: select * from demo_catalog.demo_schema04.employee04

2. Cluster Catalog Binding 
> Compute > Cluster > Advanced options > Spark config:
spark.master local[*]
spark.databricks.cluster.profile singleNode
spark.databricks.sql.initial.catalog.name cluster_binding

Demo
>catalog>catalog Explorer> cluster binding
>demo_schema2
tables=>
employee
department,etc...

3. Notebook Binding
code: USE CATALOG notebook_binding

# Note: IMPORTANT
The override precedence, from notebook down to workspace (the narrower setting wins, see the sketch below):
notebook binding > cluster binding > workspace binding
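
A small sketch of notebook-level binding (catalog/schema names are hypothetical): once a default catalog is set for the notebook session, two-level names resolve against it, overriding the cluster- and workspace-level defaults.

```python
# Set the default catalog (and schema) for this notebook session
spark.sql("USE CATALOG notebook_binding")
spark.sql("USE SCHEMA demo_schema2")

# Two-level names now resolve inside the notebook_binding catalog
spark.sql("SELECT * FROM demo_schema2.employee").show()
```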

# Migration of Tables
Three kinds of HMS tables to migrate:
1. HMS managed table
2. HMS external table on cloud storage (e.g., ADLS Gen2)
3. HMS external table on DBFS root

Done with Demo training 2



DEMO TRAINING - 3 

Watching Video:

>>> Migration of tables to UC
  1. hms_external_cloud_table
  2. hms_external_dbfs_table
  3. hms_managed_table
Steps:
  1. We create the tables
  2. Each Hive metastore table will be migrated to UC
1] Migrating an HMS managed table to a UC managed table using DEEP CLONE
Deep Clone:
  • Replicates both the data and the metadata into Unity Catalog
Steps:
Syntax: CREATE TABLE <uc_managed_schema.uc_managed_table> (destination) DEEP CLONE <hms_managed_schema.hms_managed_table> (source)
 

2] Migrating HMS External Table on Cloud Storage to UC External Table using "SYNC"

Syntax:
SYNC TABLE uc_managed_catalog.uc_managed_schema.hms_external_cloud_table
FROM
hive_metastore.hms_managed_schema.hms_external_cloud_table

Note: When migrating a table using the SYNC command, the destination and source tables must have the same name

3] Migrating an HMS external table on DBFS root to a UC external table
Syntax:
> CREATE TABLE uc_managed_catalog.uc_managed_schema.uc_external_table
LOCATION 's3://ct-demo-akash-databricks-external-location/tables/external-tables'
(CTAS statement)
AS SELECT * FROM hive_metastore.hms_managed_schema.hms_external_dbfs_table

# Cluster Migration in UC
Q: How do we migrate a cluster to UC?
UC is enabled only on DBR 10.4 LTS and above.
Don't select the access mode "No Isolation Shared" (it does not support UC).


Databricks
  • Compute
  • All Purpose
  • More.... > Clone
  • Change the cluster name to "UC Cluster"
  • Policy: Unrestricted
  • Multi node
  • Access mode: Single User / Shared (UC is enabled only on 10.4 LTS & above)
  • Photon => ON/OFF
> Advanced Options

  • IAM Role
  • Credential passthrough
  • (Don't enable credential passthrough... UC will not be available)
# Migration of JOBS
> Workflow
> Tasks
> task name: UC Job
> Type :  Notebook
> Source : Workspace
> Path : ...../UCDemo/job migration/Non UC Notebook

(Let's Upgrade Non UC Job ----> UC JOB)

> More
>Clone Job
> Job name : UC Job 
> Cluster : SIMO Cluster [UC Must be enabled]

Note:
The non-UC notebook is the same as the UC notebook; only a few changes were made to enable Unity Catalog in it....

> Dependent Libraries:
For the dependent-libraries part, in the non-UC cluster the JAR file lives at a DBFS location, but that is not supported in UC.
So I downloaded the JAR file from the FileStore instead.

# Changes in Notebook
Non-UC notebook =>
Syntax:
select * from hive_metastore.demo_schema04.employee04
change to...
select * from demo_catalog.demo_schema04.employee04

Q: How do we migrate a Hive metastore table to UC?
Ans: There are many different ways to migrate a table (managed and external tables...)
We use (see the consolidated sketch after this list):
  1. Deep Clone 
  2. SYNC 
  3. CTAS
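
A consolidated sketch of the three approaches, wrapped in PySpark; the catalog, schema, table and bucket names are placeholders, not the demo's actual ones:

```python
# 1) HMS managed table -> UC managed table via DEEP CLONE (copies data + metadata)
spark.sql("""
    CREATE TABLE uc_catalog.uc_schema.hms_managed_table
    DEEP CLONE hive_metastore.hms_schema.hms_managed_table
""")

# 2) HMS external table on cloud storage -> UC external table via SYNC
#    (metadata-only upgrade; source and destination keep the same table name)
spark.sql("""
    SYNC TABLE uc_catalog.uc_schema.hms_external_cloud_table
    FROM hive_metastore.hms_schema.hms_external_cloud_table
""")

# 3) HMS external table on DBFS root -> UC external table via CTAS (rewrites the data)
spark.sql("""
    CREATE TABLE uc_catalog.uc_schema.uc_external_table
    LOCATION 's3://my-external-location/tables/uc_external_table'
    AS SELECT * FROM hive_metastore.hms_schema.hms_external_dbfs_table
""")
```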
In this session we discussed these topics:
  • Lineage
  • Binding
  • Data asset migration, like:
    • Table migration
    • Cluster migration
    • Job migration
Inside job migration we also discussed how to do notebook migration

DEMO TRAINING - 4

Title:  Catalog Isolation Strategy
  • Improves Security
Diagram: (not captured in these notes)








DEMO TRAINING - 5




>>> NISHTHA JAIN, 4th September meeting
Topics in ma'am's session: 1. Delta Live Tables 2. UC 3. CDC, CDF 4. Do your own study at home
  • Question 1] Role-Based Access Control: see the MS Account documentation
  • Question 2] Creating a service principal and granting it access
  • How does UC differ from the Hive metastore?
  • How would you map Hive metastore permissions to Unity Catalog?
  • After migrating a Hive table to UC, how do you validate that the data and schema are correct? Ans: validations ---> table counts: rows and columns (see the sketch after this list)
  • Question 3] How does UC handle multi-tenancy?
  • Question 4] How can you access/audit data access in UC?
  • Question: managed table, external table on DBFS ||| how do I migrate them from Hive to UC? Ans: create table as select [CTAS]
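
A minimal validation sketch for the point above (table names are hypothetical): compare row counts, column counts and schemas between the Hive metastore source and the migrated UC table.

```python
src = spark.table("hive_metastore.hms_schema.employee")
tgt = spark.table("uc_catalog.uc_schema.employee")

assert src.count() == tgt.count(), "Row counts differ after migration"
assert len(src.columns) == len(tgt.columns), "Column counts differ after migration"
assert src.schema == tgt.schema, "Schemas differ after migration"
```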
# CDF Enable and Disable
Commands => CDF commands
select * from table_changes('<table>', <starting_version>)
optimize
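
A hedged sketch of the CDF commands above; the table name and versions are hypothetical:

```python
# Enable Change Data Feed on an existing Delta table
spark.sql("""
    ALTER TABLE demo_catalog.demo_schema.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the captured row-level changes (insert/update/delete) from version 0 onwards
spark.sql("SELECT * FROM table_changes('demo_catalog.demo_schema.orders', 0)").show()

# Disable it again if it is no longer needed
spark.sql("""
    ALTER TABLE demo_catalog.demo_schema.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = false)
""")
```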

USE CASE by NISHTHA =>


Start the cluster through the API
  1. List workflows
  2. Start or stop a workflow using the API
  3. List clusters (show...)
  4. Workspace notebook counts --- (start | stop | terminate cluster)
  5. Create a cluster through the API
  6. List notebooks
  7. Stop a job through the API [pause it] (see the sketch below)
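
A hedged sketch for this use case using the Databricks REST API with the `requests` library; the host, token, job and cluster IDs are placeholders you would substitute:

```python
import requests

HOST = "https://<workspace-url>"          # e.g. https://adb-1234567890123456.7.azuredatabricks.net
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# 1. List workflows (jobs)
jobs = requests.get(f"{HOST}/api/2.1/jobs/list", headers=HEADERS).json().get("jobs", [])

# 2. Start a workflow (trigger a job run)
requests.post(f"{HOST}/api/2.1/jobs/run-now", headers=HEADERS, json={"job_id": 123})

# 3. List clusters
clusters = requests.get(f"{HOST}/api/2.0/clusters/list", headers=HEADERS).json().get("clusters", [])

# 4. Count notebooks in a workspace folder
objects = requests.get(f"{HOST}/api/2.0/workspace/list", headers=HEADERS,
                       params={"path": "/Users/<user>@<domain>"}).json().get("objects", [])
notebook_count = sum(1 for o in objects if o.get("object_type") == "NOTEBOOK")

# 5. Start / terminate a cluster
requests.post(f"{HOST}/api/2.0/clusters/start", headers=HEADERS, json={"cluster_id": "<cluster-id>"})
requests.post(f"{HOST}/api/2.0/clusters/delete", headers=HEADERS, json={"cluster_id": "<cluster-id>"})  # terminates

# 6. Cancel a running job (the closest thing to "pausing" it)
requests.post(f"{HOST}/api/2.1/jobs/runs/cancel", headers=HEADERS, json={"run_id": 456})
```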
>>> Rishabh Pandey [7 th september Saturday 2024]
Session Title: Deep dive into Spark and Databricks
# Spark Fundamentals

Topics like:
> Spark Architecture in technical terms vs layman's language
> Hadoop vs Spark
> Spark Components
> Serverless
> RDD
> Create RDD 
> DAG
> Narrow vs Wide




> Spark Components:
  • RDD
  • Dataframes
> Serverless [need more info on this]
  • You can use it anywhere
  • It depends on the data
# Note: RDD output is always a list..

# Transformation: filter, group by, where
# Actions: display, collect, show

# Create an RDD from a collection
> data = [ ('..........)]
 rdd = sc.parallelize(data)
print(rdd.collect())

# Define schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Convert the RDD to a DataFrame
df = spark.createDataFrame(rdd, schema)

# Job
Suppose you write any query:
job-9 => df.display() (action)
job-10 => df.collect() (action)

As many actions as you call, that many jobs get created. This is how jobs work....

Say "Jai Shree Ram" and in Databricks write:
from pyspark.sql.functions import *
from pyspark.sql.types import *

# DAG

# Task
Filter => narrow transformation
group by, cross join, etc. => data shuffling happens across multiple partitions; by default there are 200 shuffle partitions

If you don't want those 200 partitions you can repartition (and then say, for example, that you just want 10-15 partitions....)

Q. Narrow vs Wide

WIDE =>
  • More data shuffling
  • By default: 200 partitions, 200 tasks
  • If you want fewer partitions, e.g. 10 or 50, then you repartition
  • Each partition produces one task
 # Stages
  • The number of stages depends on how many narrow/wide transformations we have (see the sketch below)
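
A small sketch of these ideas; the DataFrame and numbers are illustrative:

```python
df = spark.range(1_000_000)                                        # hypothetical input DataFrame

narrow = df.filter(df.id % 2 == 0)                                  # narrow: no shuffle, partition count unchanged
wide = narrow.groupBy((narrow.id % 10).alias("bucket")).count()     # wide: shuffle boundary

# The shuffle uses 200 partitions by default; tune it, or repartition explicitly
spark.conf.set("spark.sql.shuffle.partitions", "10")
fewer = narrow.repartition(10)

wide.count()   # an action: triggers one job, with stages split at the shuffle boundary
```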
# Terms to be known
  • Data Schema
  • Partitions
  • Optimization Techniques... etc 




"""
    
Python BootCamp By Rishabh Pandey 
Day 1: [Present]
  • Python Fundamentals
  • Why Python
  • Python vs Java
  • Python variable
  • Python Data types
  • String Functions:
    • upper
    • lower
    • capitalize
    • lstrip
    • rstrip
    • startswith 
    • endswith
  • Methods:
    • split
    • join
  • python list and its functions
  • python tuple and its functions


Day 2: [Present]
list 
tuple
set
dict
Day 3: [Absent]

OOPS Concepts

Day 4: [Absent]
OOPS Concept










______________________________________________________________________

>>> Spotify End to End Pipeline Project PPT Preparation

Module 1:

  1. ETL Pipeline
  2. Architecture
  3. Spotify API
  4. Cloud Providers : AWS
  5. AWS Services 
    1.  Storage => Amazon S3
    2. Compute => AWS Lambda
    3. Logs/Triggers => Amazon CloudWatch
    4. Data Crawler => Crawler
    5. Data Catalog => AWS Glue Data Catalog
    6. Analytics Query => Amazon Athena
Module 2: 
  1. Go to Spotify: Sign up
  2. Jupyter Notebook (Python Libraries)
  3. Client ID, Secret ID
  4. Write Python Code: To Extract Data from that spotify API (Perform All Transformation on that data....)
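
A hedged sketch of that extraction step using the spotipy library; the client ID/secret and the playlist ID are placeholders, not project values:

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="<client-id>", client_secret="<client-secret>"))

# Pull the tracks of one playlist and keep a few fields for later transformation
results = sp.playlist_tracks("<playlist-id>")
rows = []
for item in results["items"]:
    track = item["track"]
    rows.append({
        "song": track["name"],
        "album": track["album"]["name"],
        "artist": track["artists"][0]["name"],
        "popularity": track["popularity"],
    })
print(rows[:3])
```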

Module 3:

  1. AWS Account needed
  2. Selection Region: North Virginia
  3. Billing Dashboard: (Receiving billing alerts)
Billing Dashboard: 
Cloud Watch: Logs/Trigger
Glue: Data Catalog
S3: Storage
Total Tax: 0.0 USD
4) Lambda..

Module 4: 
  • Moving Data 
(Note: Listen to the video again and make the PPT quickly on your own: as fast as possible)
[Plan it after Wednesday, as your schedule is currently set for the UC Catalog project and you have to complete it before Tuesday]









>>> SCD and CDF Implement practically

From Rajas DE














>>> SCD Types

 Slowly Changing Dimensions (SCD) can be implemented using Delta Lake, which provides capabilities for handling big data and supports ACID transactions.

# Types :

  1. SCD Type 1 (Overwrite)
  2. SCD Type 2 (History)
  3. SCD Type 3 (Add new attribute)
  4. SCD Type 4 (Historical Table)
  5. SCD Type 6 (Hybrid SCD)

_______________________________________________________________________

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse (source system). It is considered and implemented as one of the most critical ETL tasks for tracking the history of dimension records.


# Types=> 

  1. SCD Type 1 (overwriting existing data)
  2. SCD Type 2 (maintaining full history by creating another dimension record per change)
  3. SCD Type 3 (maintaining one-time history in a new column)
From the above three, the following types are derived =>
Type 0, Type 4, Type 5, Type 6, Type 7

Real-life examples:
Type 1:
  • If you don't want to maintain any history in your target dimension
  • Eg: customer changes address
  • # use when you don't want to maintain any history


Type 2:
  • Maintaining history as additional records (a new row per change)
  • Eg: customer phone number
  • # use when you need the full history
Type 3:
  • Maintain one-time history in a new column
  • # use when you only need the previous value (limited history) (see the Delta Lake sketch below)
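
A minimal Delta Lake sketch of SCD Type 1; the table and column names are hypothetical, and `updates_df` is assumed to be the incoming batch. Type 2 would instead expire the current row (end date / current flag) and insert a new row per change.

```python
from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "gold.dim_customer")   # existing dimension table

# SCD Type 1: overwrite changed attributes in place, insert brand-new customers, keep no history
(dim.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```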





Reference of video : Youtube Techlake Channel

~sauru_6527

30 th august 2024








________________________________________________________________________

>> Atharva Interview Questions

> Python

  • Split Code
  • Join Code
  • Count of Vowels Code
  • Exception Handling
  • List, tuple
> Spark
  • Architecture
  • Cache / Persist
  • Coalesce / Repartition
  • Catalyst Optimizer
  • AQE



>> Change Data Capture(CDC)  ~ Seattle Data Guy

A] Intro

CDC [Change Data Capture]

  • Captures the changes [insert, update, delete] that appear in a database; these changes are often stored in some sort of log






> Real Time Data Landscape:

  • Open Source :Eg:  Apache Spark
  • Hybrid :Eg:  Databricks
  • Managed Service : Eg: Delta Stream



B] Write Ahead Logs [WAL]

  • Common Approach used is WAL
What is the main function of WAL?


C] Trigger-Based CDC
Why do companies use CDC?
# Benefits:
  • Real-time data (if you want to do real-time analytics...)
  • Historical data preservation

D] Where have I used CDC?

Examples of CDC solutions that I have used in the past



>> Now I have to do the CDC by Rajas DE

27 th August Tuesday 2024 

Tasks given today

  • CDC, CDF From Youtube intro
  • Rajas DE Playlist
  • UC Migration Links  sent By Pratima Jain
Topics to do:
  • CDC, CDF, Apache Spark, Rajas DE Playlist, UC Migration Links + be prepared for the interview
Important to Do :
  • Do LeetCode in your spare time
  • Do the Apache Spark course
  • Do the Rajas Data Engineering playlist; make a plan to complete the 133 videos
  • Do the 30 PySpark questions
  • Do the Taxi Sheet questions
  • UC Migration (the main project where ma'am is working)
  • PyCharm: daily practice, i.e. (@ PG, complete the Darshil Python course again)
  • Make a PPT of the topics covered till date and submit it to Mayank



________________________________________________________________

Today's Date: 23 August, Friday 2024

  • Two tasks given by Nishtha ma'am: # Task 1 and Task 2, refer to the Intern group on Teams [yet to do...]
  • 30+ PySpark questions [5 done...] [yet to do...]: pyspark.xlsx
  • Check the Redshift mapping and create the Synapse mapping accordingly: two Excel files sent by ma'am [IMPORTANT]; do the Synapse to Redshift mapping

____________________________________________________

>> Four Questions based on project 

Q1) Estimate for 1000 GB of data

Azure Instance Standard E16S_V4:
    • Driver: 4 cores, 32 GB RAM
    • Executors: 3 executors per instance
    • Instances: 2
    • Cost: $2.17 per hour => 182.6 per hour
    It provides a good balance of compute power, memory, and local storage in an environment like Databricks

    Q2) Autoloader
    • Understand the Medallion Architecture
    • Generate random data with a Python function and then apply Auto Loader on it (see the sketch after this list)
    • DLT (Delta Live Tables)
    Q3) TAXI data
    • 5 questions done (2 remaining...)
    Q4) Call one notebook (NB) from another
    • widgets concept in it...
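
A minimal Auto Loader sketch for Q2; the paths and table names are hypothetical. It incrementally picks up new files dropped into the landing folder.

```python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/Volumes/demo/landing/_schema")
      .load("/Volumes/demo/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/Volumes/demo/landing/_checkpoint")
   .trigger(availableNow=True)
   .toTable("demo_catalog.bronze.events"))
```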

    >>> Questions By Pratima 

    More on UC
    Q. How will you migrate a managed table?
    Q. How will you migrate an external table?
    Q. CDC, CDF: do the code and L&D on this
    Q. Solve one question on Delta Live Tables
    Q. Apache Spark vs Hadoop vs Hive


    >>> Cluster/ Compute

    1] All-Purpose Compute

    • Analyze data in notebooks
    • Can create, terminate and restart
    • Cost: expensive
    2] Job Compute
    • Only supports running a notebook as a job
    • No restart
    • Cost: low
    3] Pools
    • Instance pool: a pool of resources (a set of VMs), like a swimming pool of instances
    • Eg: if you have a job that requires a lot of processing power you can assign more instances from the pool
    • If the workload decreases you can release the instances back

    >>> Photon
    • Optional (On/Off)
    • Improves performance; use it when there is a lot of SQL code; cost-optimal

    >>> Databricks Architecture

    >> Control Plane:

    • Cluster Manger
    • Handled By Databricks

    >> Data Plane: 

    • Storage: VM, BLOB
    • Handled by Cloud Provider (Azure, GCP)

    >>> Unity Launcher (Celebal product)

    • Boosts your UC migration by 70%


    >>> Tree in UC

    Levels :  

    •  Metastore 
    • Catalogs 
    • Schemas  
    • Tables, Volumes

    >>> Prerequisites in UC

    • Latest: 11.3 LTS
    • Current: 10.4  

    >>> 3 Workspace Environment in Project using UC

    • Development
    • Testing
    • Production 

    REST API Understanding

    1: Introduction To REST API

    Rest API  (Representational State Transfer)

    API => Application Programming Interface


    If your application is DYNAMIC => Eg: ZOMATO

    Dynamic App (Zomato)    -----> (Request)         -------> Web server

                                               <----- (Response)       <-------


    We get the response in HTML or JSON format,

    but an HTML response is essentially unstructured data.

    HTML response => WRONG (unstructured)

    We need a structured response => RIGHT (e.g., a single value)


    > JSON format (dict, key-value pairs)

    > XML format (hierarchical data structure)


    >>> Issue: We need a lot of methods to get the required information from the server; to solve this issue we use a REST API

    > REST API definition:

        A REST API creates an object and thereafter sends the values of that object in response to client requests

     OBJECT => IMPORTANT POINT (we get only one object's values)

    > Eg:

    city : Jaipur

    restaurant name : Moms tiffin

    food item : Dal bati


    This data above is sent to our DYNAMIC APP (ZOMATO)


    2: REST API Connection with Databricks using the fake API JSONPlaceholder

    code: 

    >>>>> YOUTUBE SECTION

    Links to refer : 

    > YouTube Videos and playlist: 

    > Channel name : Rajas DE

    > Unity catalog in 60 Seconds : Youtube

    Youtube Video 1:  Connect REST API in DataBricks via JSON placeholder [yet to do]

    Summary: Youtube Video

    Important :

     Create API endpoint =>

    Eg: url = "https://jsonplaceholder.typicode.com/posts/1"


    Youtube Video 2: How to get Access Token in DataBricks [steps..]

    Summary: 

    Steps : 

    1. user>user setting
    2. developer>access token
    3. manage>generate new token
    4. Copy and Use that token wherever you want
    (Note: I saw this interface when I actually met my team in the new Celebal building, AK47 team)


    Concept from Azure  3 : ADF  : Azure Data Factory

    .

    .

    .

    Questions By Pratima Mam

    Note : Solve it using REST API, refer this docs : Link

    Question 1: Fetch all the schemas present in the hive metastore; I want the list

    Question 2: Fetch all the tables present in each schema

    Question 3: I want Count of tables and views of each schema

    Sample OP => 

    Schema Name: employee

    View(Count) : 15 Views

    Table(Count) : 10 Tables

    Addn Question 4: Get the Number of Workflow and notebook in your Databricks Account
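
    A hedged sketch for these questions using the Databricks REST API; the host, token and workspace path are placeholders, and it assumes the workspace exposes the Unity Catalog, Jobs and Workspace endpoints.

```python
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Q1: list all schemas in the hive_metastore catalog
schemas = requests.get(f"{HOST}/api/2.1/unity-catalog/schemas", headers=HEADERS,
                       params={"catalog_name": "hive_metastore"}).json().get("schemas", [])
print([s["name"] for s in schemas])

# Q2 + Q3: tables in each schema, plus table vs view counts
for s in schemas:
    tables = requests.get(f"{HOST}/api/2.1/unity-catalog/tables", headers=HEADERS,
                          params={"catalog_name": "hive_metastore",
                                  "schema_name": s["name"]}).json().get("tables", [])
    views = [t for t in tables if t.get("table_type") == "VIEW"]
    print(f"Schema Name: {s['name']}")
    print(f"View(Count): {len(views)}   Table(Count): {len(tables) - len(views)}")

# Q4: number of workflows (jobs) and notebooks
jobs = requests.get(f"{HOST}/api/2.1/jobs/list", headers=HEADERS).json().get("jobs", [])
objs = requests.get(f"{HOST}/api/2.0/workspace/list", headers=HEADERS,
                    params={"path": "/Users/<user>@<domain>"}).json().get("objects", [])
notebooks = [o for o in objs if o.get("object_type") == "NOTEBOOK"]
print(f"Workflows: {len(jobs)}, Notebooks: {len(notebooks)}")
```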

    .


    ____________________________________________________________________________

    >>> Doubtful Questions: [Don't be doubtful; ask questions and get their answers, bro]

    1. What is your name ?


    _____________________________________________________________________________

    >>>>> Full Roadmap of associate DE-DBC

    1. Lakehouse (24%)
    2. ETL With Spark SQL and Python (29%)
    3. Incremental Data processing (22%)
    4. Production Pipelines (16%)
    5. Data Governance (9%)
    Now,


    5. Data Governance

    >> INDEX

    1. Unity Catalog
    • Benefits of Unity catalog
    • Unity catalog features
    2. Entity Permission
    • Configuring access to production tables and databases
    • Creating different levels of permissions for users and groups
    • Data Governance
    Summary

    # Benefits of UC
    • Defn
    • With and Without UC
    • Key Features
    • UC Catalog Object Model
    • Metastore
    • Object Hierarchy in metastore
    • Working with database objects in UC
    • other Securable Objects
    • Granting and revoking access to database objects and other securable objects in uc
    • admin roles
    • managed vs external tables and volumes
    • Data isolation using managed storage
    • workspace catalog binding
    • auditing data access
    • tracing data lineage
    • Lakehouse Federation and UC
    • Delta Sharing, Databricks Marketplace and UC
    • How to set up Unity Catalog for my organization
    • Migrating an existing workspace to UC
    • UC catalog requirement and restrictions
    • Region Support
    • Compute Requirements
    • File Format Support
    • Securable Object naming requirements
    • Limitations
    • Resource Quota
    # UC Features :


    2. Entity Permission

    # Configuring access to production tables and databases

    • Manage privileges in UC
    • Who can manage Privilege
    • Workspace Catalog Privileges
    • Inheritance Model
    • Show Grant And Revoke privilege
    • Show grants on Objects in UC metastore
    • Show my grants on objects in UC metastore
    • Revoke permission on object in UC metastore
    • Show grants on metastore
    • Grant Permission on metastore
    • Revoke Permission on metastore
    # Creating different levels of permission for users and groups
    • Access control lists
    • Access control list overview
    • Manage Access control list with folders
    • AI/BI dashboard ACLS
    • Alerts ACLS
    • Compute ACLS
    • Legacy dashboard ACLS
    • Delta live tables ACLS
    • feature tables ACLS
    • File ACLS
    • Folder ACLS
    • Genie Space ACLS
    • Git Folder ACLS
    • Job ACLS
    • ML flow experiment ACLS
    • ML Flow model ACLS
    • Notebook ACLS
    • Pool ACLS
    • Query ACLS
    • Secret ACLS
    • Serving Endpoint ACLS
    • SQL warehouse ACLS
    # Data Governance
    • Data governance with UC
    • Centralize access control using UC
    • track Data lineage using UC
    • Discover data using Catalog explorer
    • Share Data using Dela sharing
    • Configure Audit Logging
    • Configure Identity
    • legacy data governance solutions






