SAURABH NOTES BLOG

POST 1] Unity Catalog Links and Quick Notes By Nishtha Jain and Pratima Jain

To Do Now =>

1: Make All Links Proper Here [ ]

2: Start Learning and Making Notes [ ]

>> 6 Augut Call by => Nishita Jain @ Morning

Azure data bricks, UC, Data governance ,Why we use UC ? , Features, Advantages, Limitation, Community edition, Complete UC, Setup db UC, Scheme, Commands?, UC then Migration, Data Plane and Control Plane ,Project : How to UC migrate, UC 3 Level : Hive and then Schema and table, Hive metastore, Catlog Data governace security, Make Documents, Share on group, 2 weeks UC complete

Links:>

Videos [24]

>> How to Setup Databricks Unity Catalog for Azure

>> How To Setup Databricks Unity Catalog for AWS

>> How To Setup SCIM in the Account Console

>> How to upgrade your Hive metastore tables to Unity Catalog Using Sync

>> How to setup Databricks Unity Catalog with Terraform

>> How to Getting Started with Databricks Connect V2

>> How Getting started with Databricks Terraform Modules

>> How to Getting started with Delta Sharing in Databricks

>> How to Getting started with Delta Market Place

>> How to Getting started with Catalog binding in Unity Catalog

>> How to use The Undrop Feature (undrop table in databricks)

>> How to run a job as a service principo

POST 2] ALL CELEBAL LINKS

Celebal Tracker:

Tracker :> Link

Sample :> Link

Colab :> Link

Curiculum :> Link

Youtube Links

https://www.w3schools.com/sql/sql_ref_sqlserver.asp

Naval Sir Channel

https://www.youtube.com/@thedatamaster

https://youtu.be/tZw4Nrv5X9g

Company Initernals

Celebal Company Internal : Drive

Naval Yemul Sirs Link:

Databricks: https://learn.microsoft.com/en-gb/azure/databricks/

Delta Lake: https://delta.io/

Pyspark:>>> https://sparkbyexamples.com/

Azure Linked Service:>> https://youtube.com/playlist?list=PL7S7dD8r4QdUcHTTF-EYS3aYzuZiisuax

-------------------------------------------------------------------------------------------------

30 June :>>>>>>>>

DP 203: Link

Azure Data Factory: Youtube

Architecture:> Architecture

Image : img

----------------------------------------------------------------------------------------------------------

27 June :>>>

Naval Sir NoteBook Link: Navalsir27-Link

Datasets Naval Sirs Drive: > Drive

----------------------------------------------------------------------------------------------------------------

23rd June:>

Youtube

1) Big DATA PySpark :> https://www.youtube.com/playlist?list=PL7S7dD8r4QdWUXL4zOC6NYnxDD8OmIVLr

2) Delta Lake > 40,41

https://www.youtube.com/playlist?list=PL7S7dD8r4QdVzOYRzIG2UJdCaCasqBv1F

All about Delta Lake:> https://youtu.be/agtUI25LxuA

Naval Yemul:> 9689777700

3) Spark Eg> To Learn Spark

https://sparkbyexamples.com/pyspark-tutorial/?expand_article=1

4) Do These Channel You Will Be Master in SQL:>

Do Hands On....

Ankit Bansal:> https://youtu.be/dOLBRfwzYcU

Kudvenkat:> https://youtu.be/7GVFYt6_ZFM

SIMPLE FUNDA:>

<<<<<<<<<<<<<<<SQL------PYHTON------PYSPARK>>>>>>>>>>

------------------------------------------------------------------------------------------------------------------------

DE Course (SQL) Curiculum :> Link

SQL W3 Schools :---->>> Link

You Tube :---->>> Youtube

Joins :------->> Link

Import data From CSV to SQL Server :> Youtube

Big Data: PySpark ~ Naval Sir : Youtube

Azure Cloud | Make Data your Career ~ Naval Sir : Youtube

Start New > Try Databricks free : Link

Data Bricks Login : Link

https://spark.apache.org/docs/latest/api/python/reference/index.html

spark

notebook

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4223192473721802/3413217916972752/2212737435575810/latest.html

Databricks: https://learn.microsoft.com/en-gb/azure/databricks/

Delta Lake: https://delta.io/

Pyspark:>>> https://sparkbyexamples.com/

Azure Linked Service:>> https://youtube.com/playlist?list=PL7S7dD8r4QdUcHTTF-EYS3aYzuZiisuax

-------------------------------------------------------------------------------------------------

30 June :>>>>>>>>

DP 203: Link

Azure Data Factory: Youtube

Architecture:> Architecture

Image : img

----------------------------------------------------------------------------------------------------------

27 June :>>>

Naval Sir NoteBook Link: Navalsir27-Link

Datasets Naval Sirs Drive: > Drive

----------------------------------------------------------------------------------------------------------------

23rd June:>

Youtube

1) Big DATA PySpark :> https://www.youtube.com/playlist?list=PL7S7dD8r4QdWUXL4zOC6NYnxDD8OmIVLr

2) Delta Lake > 40,41

https://www.youtube.com/playlist?list=PL7S7dD8r4QdVzOYRzIG2UJdCaCasqBv1F

All about Delta Lake:> https://youtu.be/agtUI25LxuA

Naval Yemul:> 9689777700

3) Spark Eg> To Learn Spark

https://sparkbyexamples.com/pyspark-tutorial/?expand_article=1

4) Do These Channel You Will Be Master in SQL:>

Do Hands On....

Ankit Bansal:> https://youtu.be/dOLBRfwzYcU

Kudvenkat:> https://youtu.be/7GVFYt6_ZFM

SIMPLE FUNDA:>

<<<<<<<<<<<<<<<SQL------PYHTON------PYSPARK>>>>>>>>>>

------------------------------------------------------------------------------------------------------------------------

DE Course (SQL) Curiculum :> Link

SQL W3 Schools :---->>> Link

You Tube :---->>> Youtube

Joins :------->> Link

Import data From CSV to SQL Server :> Youtube

Big Data: PySpark ~ Naval Sir : Youtube

Azure Cloud | Make Data your Career ~ Naval Sir : Youtube

Start New > Try Databricks free : Link

Data Bricks Login : Link

https://spark.apache.org/docs/latest/api/python/reference/index.html

spark

notebook

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4223192473721802/3413217916972752/2212737435575810/latest.html

POST 3] Fuel Internship Notes

Internships:

Fuel Business School as Summer Internship
Celebal Technology as Data Engineer

1. Fuel Business School:

What Internship Taught [Technical]:

AI Basics
Python Basics : Numpy, Pandas

Other activities :

Attended Events
Linked in Tracking and its PPT presentation

Internship Experience: [FBS]

Title: Future Skills Training in AI (Artificial Intelligence) along with Python and Personality Development

FBS Intro :

FBS(Fuel Business Schools)
FUEL(Friends Union and Energizing Lives)
CEO : Founded by Ashoka Fellow (Ketan Deshpande),
Mentored by : Entreprenur : Santosh Huralikoppi

Fuel offers :

100% CSR Scholarship for girls
Courses: PGDM,BBA
2 Months of Sponsered intership for college students

Internship Duration : 2 Months

Feb to March

Fuel Internship Things Done By me:

AI [AI Theory]
Python : Basics,Pandas,Numpy
Personality Development : Sessions
Certificates: Linked in Learnings
AMCAT Test (Technical coding preparation)
Project : (Which Project you did ?)

Fuel Learnings:

Future Skills Training in AI

Artificial Intelligence [Theory]

Types
Whats is AI
Eg
Python packages for AI
Need of AI
History

Summary of AI :

Machine Learning [Theory]

Defn
Supervised,Unsupervised
How ML works ?
Types
Supervised: Regression,Classification
Unsupervised: Clustering, Dimensionality Reduction

Summary of ML:

Deep Learning [Theory]

Defn
Watch Youtube Video

Summary of DL:

Python

Teachings:

Basics
Pandas, Numpy

Python Basics:

Day 1:

Datatype
List
functions in list: Append,Extend,isinstance,pop,clear,del
deleting values from list by various functions
sorting functions: sort,reverse,index,count,min,max
reverse function
Tuple
defn,properties,syntax
function:index,count,sorted,reverse,max,min
Compare list and tuple ?-- refer drive notes

Day 2:

String
defn
indexing,slicing
string functions: (1 to 14)
functions:capitilize,lower,upper,title,lstrip,rstrip,replace,count,endswith,startswith,islower,isspace,isdigit

Day 3:

Dictionary
defn
Properties
Functions:(1 to 8)
Functions:get,keys,values,update,delete function,pop item,clear,del
32 Keywords (All with Examples)

Day 4:

Questions on functions: 17 Questions

Summary of basics python:

Python Libraries:

Pandas
Numpy
Matplotlib
Seaborn
Sklearn

Numpy

(Write here what was taken in fuel lecture )

Numpy:

Intro
numpy typecasting
1D
2D
indexing and Slicing
Numpy Functions
Zeros and ones
random
Matrix Operation
Math operation
Mean,Mode Median

Numpy Functions

Zeros and Ones:

np.zeros
np.ones

Random Functions

Randint
rand

Matrix Operation

np.add(array,array1)

Math Operation Functions:

np.floor
np.ceil
np.around

Mean,Mode Median(Explain with Eg

Summary of Numpy Lib in Python

Pandas

Pandas >>> Python data analysis It is used for data manipulation and analysis

Topics:

Read File
Library Intro : Matplotlib,Seaborn
Data Structure
Use Case: Backend 90%works
How to create dataframe
Excel File
Pandas Function
Access Columns and rows
Handling Missing Values
Drop Null Value
Drop and DropNA

data structure:

dataframe =====> 2d
dataseries

Use Cases :

Read CSV,Excel and json
EDA (exploratory data anaylysis) (50% Work done here) (int values kiticoln kiti rows etc...)
Data Cleaning: 1.Mising values 2.Outliers
Encoding: [[ object(character) to integer ]] 1.One hot encoding --> we dont have labels like male,female for encoding here ..... 2.label encoding --> Male 0 Female 1
Feature Selection [[Best Columns Top 50]]
Join Multiple Columns

Note: Backend 90% Works..............

So Pandas is Important

Library Intro

Matplotlib,Seaborn
sklearn.linearmodel import linear regression

How to create dataframe:

Array
Dict
List
Tuple
Read CSV file

Excel File

excel_file_path = 'output_data.xlsx'

df1.to_excel(excel_file_path, index=False)

print(f"DataFrame successfully converted and saved to {excel_file_path}.")

Pandas Function¶

df.info()
df.describe()
df.columns
df.index
df.dtypes

Access Columns and rows

df.iloc

Pandas Function

1.df.info()                            #give full information about dataset
2.df.describe()                        #give statistcial value
3.df.columns()                         #only show column
4.df.index()                           #only show index(row)
5.df.dtype()                           #show column datatypes
6.df.shape()                           #show dataset shape(rows and columns)
7.df.iloc[row_index,column_index]      #work like slicing
8.df.loc[row_labels,columns_labels]    #work like slicing
9.df[[column1,column2,......columnN]]  #want to access multiple columns
10.df.head(default=5)
11.df.tail(default=5)
12df[column_name]                      #want to access one column

Drop null value

# 1. df.dropna()

# 2. df.drop

Note: Drop and Dropna is always the Last option of any data scientist

Data Filling must be my priority

Personality Development [Events]

i) 6th Feb: Fuel Conclave - CSR & HR teams shared inspiring experiences.
ii) 9th Feb: Fujitsu Team - Session on Effective Communication.
iii) 24th Feb: MEDIPOINT Team - Health Awareness session.
iv) 26th Feb: Japanese Introductory Session.
v) 27th Feb: Yes Bank Session on Financial Literacy.
vi) 8th March: AXA Team - Job Opportunities.
vii) 11th March: IGS Team - Market Trends Awareness.

Note:

Q: Write here all 8 Session (Paste the list from linked in)

Q: What you learnt from those session summary.

Summary of What you learnt from all 8 events at fuel.

Certificates (Linked in)

Total 44 Certificates Done: here is link : Link2DrivePdf

Note: Mention only imp (Properly done)

Links :

ALL Jupyter Notebook : DriveFolderLInk

POST 4] Celebal Training

Celebal Training:

Databricks
Spark
Cloud
SQL
Python

1. Databricks: [12.2] LTS (March)

Topics Covered:

What is databricks
databricks UI Explain
Medalllion Architecture (Multihop)
Control Plane and Data Plane--- (Components,Architecture)
Execution of Query on Databricks
Lazy Evolution
Wide and Narrow Transform
What is Cluster,driver node,Worker Node

Define Terms:

Databricks: Analytical platform + Process and Transform Big DATA
Azure : Cloud ,Microsoft, Service=>VM, storage, database, datawarehouse.
Apache Spark: Analytical Platform + Process and Tranform : Bigdata [Fast]
Hadoop: Open Source Framework: Store and Process data
Storage: BLOB, ADLS[Has Folders]
Cluster: Group of Computers(VM)
Worker: Executor => Executes Spark tasks and process data
Driver : Machine => Databricks Notebook is running
Need of Cloud

23 rd June:>

Medallion Architecture : 3 Layers
Databricks Lakehouse platform: Datawarehouse + Deltalake
Deltalake Features
SCD,CDC : Slow changing data, Change Data Capture

# Databricks:

Data + AI company
Its platform for spark (Web based UI)
Databricks Bring DE+DS+ML Together
Its apache-spark based analytical platform
It provides: one click support, streamline workflow and interactive workspaces that enables collaboration between DE,DS, and business Analytics
Spark only built Databricks because of that databricks sit on top of the spark
Whenever we want to process Big Dat and transform it into Some meaningful Info We use Databricks

# How databricks Work:

Databricks is just a platform.
It is software(Dont have cloud), it totally rely on cloud for storage and processing
So, you need : 1) Cluster[VM] 2)Storage
Databricks Rely on : 1)Azure 2)AWS 3)GCP

# Deployment in Databricks

AZURE => [PAS] ~ (Manage Compute, VM on own

AWS => [SAS] ~ (You just use databricks AWS take care of all)

# Databricks UI

Create :Cluster,Notebook,Table
Workspaces
Recent
Searches
Data : DBFS,Database Tables

# whats Dataplane and control plane (Architecture)

1. Control Plane

Workflows, jobs
WebUI, NB, Cluster Manager
HANDLED BY DATABRICKS

2. Data Plane

VM, Storage
Storage: Azure=> BLOB, AWS:S3
HANDLED BY AZURE, GCP

#Lazy Evolution:

Once transformation is done, output will not be displayed to give output there we use actions
Actions: show(), display()

# Narrow and Wider Transformation

1. Narrow

1 to 1
Data remain in same portion
map(),filter()

2. Wider (Shuffling)

group of key
reduce by key
your rows will be shuffled

#DBFS (Databricks File System)

It acts as virtual storage for cloud
Abstraction of that storage
Its storage account
it just build connection between : 1) Storage Account 2) Databricks

# Repos

version control
stores all revison history (when add,deletd,updated)
Repository like: Git Hub

# Schema

Different Ways: List, String, etc..
Structtype,dataframe
df(data, schema)

# databricks Workflows Jobs

"This is data Ingestion here we use Autoloader for it" [ETL,ELT]
Full orchestration: Automatic,Pipeline

# Delta Live table

There is no connection between delta live table and delta lake
Its totally "New framework"
used for "Data Orchestration"
Maintains Quality of Data [bronze-silver-gold]
Data Ingestion
Data Pipeline

# Databricks SQL:

Serverless(dont worry about computation)
just use like Datawarehouse
TO use SQL : You have to enable : SQL Warehouse

# Streaming in Databricks

We use Structured Streaming API in databricks i.e:=> "Autloader"

# When you want to bring data to databricks:

1. Autoloader

2. Copy Into:

SQL code
No Streaming Data
Everytime you have to Run "COPY INTO" to copy data

Functions

Package=> from pyspark.sql.functions import col

# Select

Four ways to specify columns=>

string
column
df.<<column_name>>
df['coln_name]

1.String

> df.select("age") OR >df.select("*")

2.Column:

> df.select(col("age"))

3.df.colnname

> df.select(df.age)

4. df[' ']

> df.select(['age'])

# Alias

> df.select(col("age").alias("sauru"))

# head,tail

> df.head(5) ----------- (Top 5 records)

> df.tail(5) ----------- (Last 5 records)

# withColumn (To add 1$)

> df.withColumn("unitprice",col("unitprice")+1)

# withcolumnRenamed

> df.withColumnRenamed("sauru","sauru_bro")

# describe

> df.describe()

# dtypes or Print Schema

> df.dtypes

# len

# limit

> df.limit(5)

# isNull

> df.select("id","name").where("name ISNULL")

.where( col("name" ).isNull())

# drop

> df.drop("name","age")

# dropduplicates

> df.dropDuplicates()

# dropna (when entire row is null,it is deleted)

> df.dropna("All")

# fillna

> df.fillna("sauru")

> df.fillna(100)

# sort (orderby)

> df.sort("season")

> df.sort("season", ascending = False)

sort:orderBy

# group

> df.groupBy("dept").count()

# where

> df.where("instance = 2")

READ AND WRITE FILES

1. CSV

> df = spark.read.csv(" ")

> df = df4.write.csv(" ")

2. JSON

> df = spark.read.json(" ")

> df = df4.write.json(" ")

3. Parquet

> df = spark.read.parquet(" ")

> df = df4.write.parquet (" " )

2. Spark [3.4.1]

Questions:

What is Spark
What are Spark Functions
How to read and write file in spark
Describe about spark architecture
Components of Spark

3. Cloud

INDEX:

Cloud : using someone else's resource, Pay as you go
Cloud Providers : AWS,AZURE,GCP
Azure Analytics: Compute, Storage, database, data warehouse
Important abbrevations
Azure (categories - Services)
Cloud Service Model

IAAS [Virtual Machine] : user can access and manage fundamental computing resources
PAAS [Azure ]: provides complete development and deployment environment
SAAS [Power BI]: user can access and use software application over net via subscripton

Types of Cloud

Public: Infrastructure and service (provided by 3rd party) accessible over internet
Private : Infrastructure and service deployed and managed with organization
Hybrid : PUBLIC + PRIVATE
Community

DE Architecture
ADF : Azure Data Factory [ETL]
Why do we use Spark: Suppports many language, Memory, Fast
Nested JSON: Dictionary=>{key,value}
Databricks : Analytics Platform run on cloud
Datawarehouse tools: Azure Synapse, Aws RedShift
Hadoop Diagram
Hadoop Components : 1] HDFS: Store 2] Mapreduce: Processing 3]Yarn : Process Manager
Diff betn BATCH and STREAMING=> Batch:OLAP, STREAM: OLTP
Diff betn Hadoop and Spark => Spark(100,memory,all lang) | hadoop(Slow,Disk,Java)
Memory vs Harddisk => Fast is memory | Slow is Hardisk for Large Data
Spark Architecture

# Cloud Questions:

What is Cloud?
What are cloud providers ?
What is Azure ?
What are Cloud Service Models ?
What are Cloud Deployment models ?
What is Hadoop and its Components ?
What is ADF ?
What is ADLS : Storage, Analytics, Cluster
What is Linked Sevice in AZURE ?
What is BIG DATA?
Diff Betn BATCH and STREAMIG ?
What is DeltaLake?
Diff betn SPARK and HADOOP ?
What is Synapse ?
What is MetaStore ?
ETL vs ELT ?

# Cloud:

using someone else cloud
Pay As You Go

# Cloud Providers

AWS : 34%
Azure : 21%
GCP : 10%
IBM
SAP
Alibaba
VMware

# Azure Analytics

Compute
VM
Storage
Database
Datawarehouse
BLOB, ADLS
Azure SQL
Transformation
ETL
Synapse
ADF
Databricks + HD Insights

# Abbrevations:

BLOB : Binary Large Object
ADLS : Azure Datalake Storage
S3 : Simple Storage Service
RDS : Relational database Service
ADF : Azure Data Factory
ETL : Extract Transform Load
EMR : Elastic Map Reduce

# Categories :

AZURE | AWS

Compute: VM |

Storage: BLOB,ADLS(Has Folders) |

Database: Azure SQL

DataWarehouse: Synapse

ETL: ADF

Transformation: Databricks + HD insights

# Cloud service model:

1. IAAS : VM

2. PASS : SQL DB in Cloud

3. SAAS : Power BI

# Types of Cloud

1. Public 2. Private 3. Hybrid Cloud

# ADF :

for complete ETL and ELT
bought by azure

# Note: We just have to drag and drop in ADLS (No Code)

# Why are we using PySpark ? (If no Code)

We have to drag and drop in ADLS (But Behind the Scene Pyspark code Runs)

# Nested Json:

Used to handle the Complex Data

# Databricks:

Databricks (Cloud Platform ) ---- Rely on---- > Azure,GCP,AWS

# Datawarehouse Tools

Azure: Synapse
AWS : RedShift
Primary Tool in Azure: Synapse
Databricks can be Integrated with other Azure Services:

Azure Data Lake Storage [ ADLS]
Azure SQL DB
Azure Synapse Analytics

All these above 1,2,3 to build Data Analytics Solution

A: ADF(No Streaming),ADLS => No Code

B :Databricks => Complex Code

A: For No code===> ADF,ADLS

B: Why we need to code (If ADLS is available) => For Complex Scenario We need code(For this tool is DataBricks)

Hadoop

# Hadoop Evolution Story

Problem:

1]Storage [2003]

Google had problem to store data
Google released paper to describe how to store large dataset on Distributed Storage
This paper was called as Google File System [GFS]

2] Process [2004]

Google had problem to process data
Google released paper : how to process large dataset on processing
This paper was called as Map Reduce

Hadoop [2006]

Yahoo took this paper and implemented it.
The implementation of GFS was named as HDFS (Hadoop Distributed File System)
The Implementaion of Map Reduce was named as Map Reduce

> HDFS Components:

Store : HDFS
Process : Map Reduce

>Yarn:

Process Manager
Job Scheduling

> Limitation:

Stores data on disk
only batch processing

# Note: In Hadoop only Batch Processing is Done, While Streaming is not possible

The Scientist [~ Matei Zaharia] came and developed Spark (Added:Streaming)

So, People Shifted from Hadoop to Spark

# Difference between Batch and Streaming

1.Batch:

Previous/last 5 Years data [OLAP]
Eg: DMART

2. Streaming:

Runtime/Realtime data [OLTP]
Online Purchase / Online Payment
Eg: PhonePay

Streaming Eg is of Phonepe :

No delay,No Latency
Sender send :500 his account get deducted by 500 , Receiver receives :500 his account gets 500 added

# Difference between Hadoop and Spark

Spark:

100 Times Faster than Hadoop
It Works on Memory
Languages: Spark can be integrated with Many languages

Hadoop

Slow
It works on Disk Calculation
Language: Only Java

# Memory vs Harddisk best eg by naval yemul sir

Naval: hey saurabh tell my mail ?
Saurabh : Thinking what would be naval sirs mail...hence slow its Harddisk
Naval : Hey saurabh tell me your own mail?
Saurabh : Quickly says : saurabh6527@gmail.com (hence Quick it is Memory)

# Story of Spark

Matei Zaharia : Spark Founder
He Founded and donated to "Apache Spark"
So,To Know databricks you should know "Apache Spark"
People Who know and worked with "Apache Spark" they have worked together with "AZURE" developers and they toghether invented "Azure Databricks"

# How Code Executes On Spark ?

1] Write Code

Behind Scene:

Spark Driver =>(Submits Code) => DAG Scheduler

DAG

(Creates Logical and Physical Plan)
Optimizes plan of your Code and Break in two stage: 1) Stage 2) Task

(Now, these Tasks are sent to "Worker Node" for "Execution"

# DAG ("Direct Acyclic Graph)

When You do transformation -> Nothing Happens....
When You take Action => BHS (DAG Works)
[Filter, group by, etc...]
(How to Show result in Best "OPTIMISED FORMAT" )

# Spark Context:

Important: "Spark Context" is Entry Point to Start "Spark Driver"

# Define :

Databricks
Azure
Apache Spark
Hadoop
Cluster
Worker Node
Driver Node
DBU

1. Databricks (Refer above in Databricks Notes)

2. Azure:

Azure is a cloud computing platform and service offered by microsoft
It provides wide range of cloud services like: Compute,VM,Storage,Database,etc..

3: Apache Spark:

Apache Spark is open source distributed computing system that provides a fast and efficient platform for processing large amount of data
Databricks on other hand is unified analytics platform built on top of spark

4. Hadoop

Hadoop is an open source framework that allows for distributed storage and processing of large dataset across cluster of computer

Components:

Hadoop distributed File System : [To Store Large Data ]

2. Map Reduce [ process data]

3. YARN (Yet another Resource Negotiator) [Job Scheduling,FrameWork]

# Cluster

Cluster is group of Computers[VM] that work together to process data and Run Computation
Consists of Workers,Drivers

# Worker Node [Executors work in parallel to execute spark task and process data ] (Student)

The Worker Node is Called as "Executors".The Executors are responsible for executing and processing data in parallel across the cluster.

(Executor works togethers in parallel to execute spark task and process data)

# Driver Node [Notebook | Job => Running]

The Driver Node is Databricks is the machine where the databricks Notebook or job is running

# DBU

Databricks Unit
Compulsory 2/3 DBU Consumed per Hour
It is based on Costing

# RDD, Dataframe and Dataset

ETL ( Extract Transform Load )

RDD [Immutable]
RDD[Backbone of Spark] : Its DS to handle both Structured and Unstructured
Dataframe : Looks Like Table
Dataset : RDD+Dataframe => Support only Scala, Java

# Our Work

Formuale 1 Racing Dataset
Raw Data: Azure Datalake (CSV,JSON,Semi-Structured Folder)
ADLS data=> PULL TO => Databricks (Here we do transformation and convert to Parquet /Delta Lake)
Write to DatawareHouse(Clean Data)
Connect to Power Bi

# Apache Spark

RDD:Backbone of Spark
Allows store and process data,faster data processing => RAM

> Spark Language:

R,scala,python,SQL
Real Time Processing
Process Big Data

> Spark Architecture:

Eg: Standalone Computer => Used to play games,movie,etc..

But to Process BIG DATA => We need group of computer => Cluster => "Framework"(Apache Spark")

> Components:

Driver Process : BOSS
Executor Process : Worker

Driver Process : Keeps track of all info about spark,analyze the work/Task and distribute to worker node

# How we are going to Do:

SQL
Python
PySpark
DataBricks (We run it on Azure)
Azure
ADF
Small Touch on : A) Synapse B) Power BI

# Medallion Architecture [Multi Hop]

# Databricks Lakehouse Platform

#Delta Lake

# ADF

# Synapse

# MetaStore

# ETL vs ELT

# Unity Catlog

# What are Linked Services in Azure

# Big data 5vs

# Functions

# Medallion Architecture :

It is a data designed pattern used to logically organize data in "LakeHouse"
To Improve Quality of data it goes through 3 layers:

Bronze Layer
Silver Layer
Gold Layer

We store Data in Lakehouse

# Databricks Lakehouse Platform

Datawarehouse + Datalake => Lakehouse(Delta Lake)

# Story:

B4 we extracted only structure data from datawarehouse and then BI and Dashboard Preparation

Now, We Merged both Dataware house and Data Lake

To store this type of data we use DELTA LAKE and this is stored in Parquet Form

# DeltaLake

[Parquett + (transaction log) form]

[Parquett + transaction log [JSON + CRC] ]

Parquett is compressed (50% - 60 % Reduction)

(Because It is in "Columnar Format" )

Delta has Both :

Streaming
Batch

Delta lake stores data in Parquet form by default

More about Datalake => Naval Sirs Youtube Chaneel (40,41)

# Delta Lake :

> Diagram => delta.io

Features:

ACID Transaction
Scalable Metadata
Time Travel
Open Source
Unified Batch / Streaming
Schema Evolution / Enforcement
Audit History
DML Operations

Note:

# Mainly we do all these things on realtime data

1: SCD (Slowly Changing Dimension)

2: CDC (Change Data Capture)

# Mainly these three operations are done:

1: UPDATE

2: INSERT

3: DELETE

# To Know CDC you must know

1: Python

2: SQL

3: Pyspark

# ADF madun ADLS madhe Data JATO...

# CI/CD : Continuous Integration (Bug,Testing) / Continuous Deployment (Pass all test )

# Summary

Note: ADLS == DATALAKE

# Delta lake :

1. combination of (Datawarehouse + Datalake)

Datawarehouse: Structured

Datalake: Both

2. Datalake Madhe apn file la convert to delta Karto

(df.write.format("delta"))

tyala delta lake mhnta

3.Delta Lake Provides [Parquet + Transaction Log]

Top Best Famous Features of Transaction Log :

1. Time travel (Previous Transaction Data)

2. ACID Properties

3. Batch + Streaming [We canntot do this in data lake]

4. DML operations [Insert Update Delete]

Data Warehouse + Data Lake ==> on Top is Lake House

(Data Warehouse: Structured Data)

(Data Lake : All type of Data)

(Data Lake : File from data lake is converted to parquet i.e Delta Format) => You see "Lake House"

# ADF (Azure Data Factory)

ADF is ETL and ELT service
It provides Multiple Connectors

Which Helps:

Data Extract
Data Clean
Data Store

Why we use ADF => "Data Orchestration"

(Automatic Process of load and clean data)

> Benefits=>

Reduced Cost: No Infrastructure (Subscription)
Increase productivity : Drag and Drop (No complex Code)
Flexibility : Connect Various Data Sources (On premise cloud)
Scalabilty : up and down (Pay as you go)
Security : Authentiction and Autherization

# Synapse

Storage => Blob,ADLS

ETL => ADF

Modern Tool

DataWarehouse => "Synapse"

1. Synapse is a modern DataWarehouse

2. You can do this in Synapse:

ADF
Connect ADLS
Power BI
Data warehouse

# Meta Store

It is top level of databrick Hierarchy
This Way :

Metastore
Catlog
Schema
Table

To Use Metastore you need "Unity Catlog"

>Unity Catlog=>

For Delta Sharing
Delta Permission
You can Create different Catlog with Metastore

# ETL vs ELT

ETL (Extract transform Load)

Extract: Extract code from datalake that may be in any format

(Json,CSV,Parquet)

Extract to => ADF,Synapse

Transform :

Remove Null
Add New Column => With Column
Clean Data

Load:

Synapse
Snowflake
LakeHouse=>PowerBI => Visualization

# ELT

First we load data and then transform
We Keep data Loaded...

# What are Linked Service in AZURE

Source ====> Destination

Source:

Read from Sources like : 1. BLOB 2. SQLDB

Between===:

Copy data from source to target
Connecting Stringer

Destination:

Write Data

# Big Data 5V`S

Value
Variety
Varacity
Volume
Velocity

# Functions:

1. Show, Display

2. Describe, dtype

3.Select,Alias,WithColumn,WithColumnRenamed

4. Take,head,limit,sort,drop,dropduplicate,isNull,dropna,fillna

4. SQL

A] Topic Covered

B] Question that can be asked in SQL

A] Topic Covered

# Data: Information Stored in database

Data is collection of facts in form of words,numbers or even pictures

# Database: Structured Collection of data (Rows and Columns)

Structured Data has Rows and Columns

# Popular Database Software:

Oracle database
Microsoft SQL Server
My SQL
Postgrese SQL
Mongo DB
IBM DB2

# Five Submodules of SQL

1: DDL [ Definition ] [Auto commit]

2: DML [ Manipulation] [Not Auto commited]

3: DCL [ Control]

4: TCL [ Transaction]

5: DQL [ Query]

# Data WareHouse:

defn:

It support business intellegence activities such as Reporting,Analysis and decision making
Handles only Structured data, Huge Volume, Huge than Database

# Modern Datawarehouse[We use clouds data] Companies:

Azure
AWS
GCP
Databricks
Snowflake

# Why we need Datawarehouse ?

Overall datawarehouse provides a robust and flexible solution for storing and analyzing data, and are essential for organization that want to gain Insights from their data, and make informed decison

# OLAP vs OLTP

1. OLAP [Online Analytical Processing]

OLAP system are designed to support analytical and reporting activites by providing fast and flexible access to large volumes of data
Eg: General Store : Data

2. OLTP [Online Transaction Processing]

OLTP Systems are design to support day to day operations: by Processing high volume of transaction quickly and effectively and provides the real time access to critical business data
Types of database processing used for realtime transaction oriented application
Eg: Online Transaction APP

# Why do we need Datawarehouse ?

overall datawarehouse provides a robust and flexible solution for storing and analyzing data,

and are essential for organization that want to gain Insights from their data, and make informed decision

# Datalake :

a data lake is centralized repository of data that allows organization to store and manage large volume of data which is of 1. Structured, 2. Semi-Structured 3. Unstructured Data

# What is SQL query order of execution ?

Ans:

from

where

select

order by

-----------------------

from

group by

select

order by

----------------------

from

where

group by

having

select

order by

limit

from > where > group by > having > select > order by > limit

Example :

select cat,avg(sales) as avg_sales

from sales data

where year>2000

group by category

having count(*) > 10

order by avg_sales DESC

limit 3

# SQL

SQl is standard language for storing data, manipulate, and retrieving data

# SQL vs MYSQL

SQL is Structured / Standaralized Query Language

While,

> MySQl is opensource relational database management system that uses SQL

>MySQL is platform where SQL queries are runned

Datatypes

Int: Stores Whole Number (Eg: 1,2,3,etc)

dec: Stores Decimal number (Eg: 3.14)

char: Stores Fixed Length Character String (Eg:Hello)

varchar: Stores Variable Length Character String (Eg: Helloworld)

date: Stores date (Eg:2023-05-07)

datetime: Stores both (Eg: 2023-05-07 12:30:00)

money: stores monetary values with upto 4 decimal place

boolean: Stores boolean values (Eg: true/false)

Clauses

clauses are keywords/commands used to perform various operations on data in database table

where: used to filter data based on condition.

group by: used to group data base on 1/more columns in table

order by: sort the data in asec/desc or 1/more columns in table

having: used to filter data based on condition after grouping

select : used to retrieve data from 1/more table

from: used to specify table from to retrieve data

limit: used to limit the number of rows returned by a query

join: used to combine data from two or more tables based on common column

Constraints

Constraints are rules that are applied to database table to ensure the accuracy,consistency and integrity of the data.

1: primary key: ensures that values in specific columns are "unique and not null"

2: foreign key: ensures values in specific columns in one table matches values in primary key column of another table

3: Unique: ensures values in specific column are unique

4: Not Null: ensures values in specified column are not null

5: Check: enforces a specific condition on data in specified column

6: default

Joins

inner, left ,right, full outer join , self join

Views :

Create virtual/temporary table

Syntax: create view <<viewname>> as select * from <<tname>>

Eg: create view temp table as select * from employee

select * from temptable

Five Submodules of SQL

1: DDL [ Definition ] [Auto commit]

2: DML [ Manipulation] [Not Auto commited]

3: DCL [ Control]

4: TCL [ Transaction]

5: DQL [ Query]

1: DDL:

Definition:

Types:

Create
Alter
Drop [Delete all data and table]
Truncate [Delete only data in rows and columns]

Syntax:

1. Create: create table <<tablename>> (coln_name Datatype );

2. Alter: alter table <<tablename>> add <<coln_name>> <<soln_defn>>

3. Drop: drop table <<tablename>>

4. Truncate: truncate table <<tablename>>

2: DML:

Definition: Used to modify database

Types:

Insert
Update
Delete

Syntax:

1. Insert : insert into <<tablename>> ( <<coln1, coln2,coln3>> ) values ( <<values1,values2,..>> ) ;

2. Update : update <<tablename>> set (coln = value1 ) where <<condn>>;

3. Delete : delete from <<tablename>> where <<condn>>;

Example:

1: Insert into emp (id,name) values (1,"dj");

2: Update student set stud_name = "Sauru" where id = 3

3: delete from student where name = 'Sauru'

3: DCL

Definition:

Types:

Grant [Gives Access Privilage]
Revoke [Take Back Access Permission]

Eg:

4: TCL

Definition:

Types:

Commit
Rollback

Eg:

5: DQL

Definition:

Types:

Select

Eg:

Drop:

drop the view

drop view <<viewname>>

# Operators:

Logical Operator

1: AND

used to compare more than one condition TRUE TRUE=> TRUE (Both must be True)

Eg: select * from employee where AGE>25 AND age<30;

Output: 27,28

2: OR

use to compare data with more than one condition

If any condition is true it returns TRUE

Eg: select * from employee where age>25 OR age<30

Output: All records from table => 26,27,28,29

3: Between

The Begin and END values are Included.

To avoid multiple (AND,EQUAL=>)

Eg: select * form student where age between 20 AND 30

Output: 20,21,22,23...24

4:IN

To avoid multiple OR

Eg: Select * from student where fname IN ('vinod', 'sauru','ram')

fname = 'vinod' OR fname = 'sauru' OR fname = 'ram'

5: NOT

Except the records

Eg: select * from student where fname NOT IN ('sauru','vinod')

Eg: select * from student where age NOT Between 20 AND 30

6: IS NULL

select * from student

where name IS NULL;

7: IS NOT NULL

select * from student

where name IS NOT NULL

# UNION and UNION ALL :

Union : REMOVES DUPLICATE

Union ALL : HAS DUPLICATES

# SUBQUERIES

Queries That are embedded within another query

Two Types: 1. Co-related 2. Non Co-related

# Pattern Matching

(%, _)

Wild Cards are used with LIKE operator

% => Represent Zero or More Character

_ => Represent Single Character.

Write SQL query to find the names of employees that begin with 'A' ?

Ans: Select * from employee

where ename like 'A%';

# Functions:

A] Aggregate Functions:

Sum
Count()
Min
Max
Avg

Count()=>

It returns the no. of rows that matches a specific criteria

select count(rno) from stud;

Avg()=>

It returns the average value of a numeric column

select avg(rno) from stud;

Sum ()=>

It returns total sum of a numeric column

select sum(rno) from stud

Max()=>

select max(age) from stud;

Min()=>

select min(age) from stud;

Note: used in companies for employee total salary count

B] String Functions:

String Functions: concat, reverse, replace, upper, lower, len , ltrim, rtrim, left, right

1. Concat

select concat(emp_name, '--> ', dept) from employees

2. Reverse

select reverse(emp_name) from employees

3. Upper

select upper(emp_name) from employees

4. Lower

select lower(emp_name) from employees

5. Replace

select replace (emp_name,'john','saurabh') from employees

6.Len

select len(emp_name) from employees

7. LEFT[JOHN]

select left(emp_name,3) from employees op: JOH

8. RIGHT[JOHN]

select right(emp_name,3) from employee op:OHN

9. SUBSTRING [JOHN]

select substring(emp_name,2,4) from employee op: OHN

10. CharIndex [JOHN]

select charindex ('$',emp_name) from employee op: 1

11. Replace

select replace (emp_name,'JOHN','sauru')

12. LTRIM

select ltrim(emp_name,'JOHN','SAURU')

13. RTRIM

select rtrim (' sauru') op:sauru

14. STUFF [john]

select stuff(emp_name,2,3,'!') op:J!

15: Replicate

select replicate (emp_name,3) op:JohnJohnJohn

16:Space

select space (' ') op: ' '

C] The Datetime Functions

1. getdate()

2. sysDateTime

3.current_timestamp

4. DateName

5. DateAdd()

6.DateDiff

7. DatePart

8.Day(), Month(), Year(), IsDate()

# Where clause:

A) Logical and B)Comparision Operation

A] Logical Operator:

AND
OR
NOT
LIKE
BETWEEN
IN, NOT IN
Exists, Not Exists
IS NULL, IS NOT NULL

B] Comparison Operator

B] Question that can be asked in SQL

Practice Questions=>

Q: Write SQL Query to find duplicate rows in table

Ans: Select salary,count(salary) from employee

group by salary;

having count(salary) > 1;

Q: Write Query to calculate even/ odd records from table

Ans: Select * from employee

where MOD(id, 2) = 0 => EVEN

where MOD(id,2) = 1 => ODD

Q: Find First and Last record

Ans:

First Record:

select * from employee

where id = (select min(id) from employee)

Last Record:

select * from employee

where id = (select max(id) from employee)

Q: How do you copy all records of table ?

Ans:

create table temp as select * from employee;

Q: Print 1 to 10

Ans: declare @start int

declare @end int

set @start = 1

set @end = 10

While @start <= end

begin

print @start

set @start = @start +1

end

Q: Fetch Alternate records

Ans: select * from emp

where (emp%2) = 1

Q: Find Second Highest Salary (Using Max())

Ans:

Select max(salary) from employee

where salary Not in (select max(salary) from employee)

Q. Write Query to find third highest salary from employee

Ans:

select * from employee

Limit 2,1;

(Limit 2,1=> Fetch on row after 2nd)

Q.Write Query to find N th Highest Salary

Ans:

1.Max()

2nd Highest

select max(salary) from employee

where salary < (select max(salary) from employees)

3rd highest

select max(salary) from employee

where salary < (select max(salary) from employee)

where salary< (select max(salary) from employee)

2. Limit
select distinct salary from employee

order by salary desc

limit n-1,1

2000 => limit(0,1)

1000 => limit(1,1) : after 1st row 1st record is fetched

500 => limit(2,1) : after 2nd row 1 st record is fetched

3. Self Join (Without Limit)

select salary from employee e1

where 3-1 = select count(distinct salary)

from employee

where e2.salary > e1.salary

4. Not IN
select max(salary) from employee

where salary NOT IN (select max(salary) from employee)

5. dense_rank

# Tricky Question:

Q. Add New Column To Table

Ans: alter table customer

add phone_number varchar(21);

Q. Change Datatype

Ans: alter table customer

alter column phone_number varchar(21);

Q. View Specific Row

Ans: Select Top * from employee

Q. Add primary Key to table

Ans: alter table student

ADD primarykey(ID);

Q. Add Unique Index/Constraint to table

Ans: alter table <<tblname>>

ADD Constraint <<indexname>>

UNIQUE (Coln_name);

Q. What is Alias Command

Ans: Alias name can be given to any

table or column

Q. What is SQL Query order of execution

Ans:

from

where

select

order by

-----------------------

from

group by

select

order by

----------------------

from

where

group by

having

select

order by

limit

Q. Write SQL query to find the names of employees that begin with 'A' ?

Ans: Select * from employee

where ename like 'A%';

Therotical Question

#List Ways of Getting COunt of records frim table

Ans: Two ways =>

Select * from employee;

select count(*) from employees;

#What do you mean by table and field in SQL?

Table: Collection of data (rows and Columns)

Field: Columns in Table

Employee Info => Table

--------------------------------

emp_id | Name | emp_age => Fields

0 | sauru | 56

1 | akash | 21

# How to insert NULL values ?

Ans: Two Ways =>

1 ] Implicity ommiting Column from Column List

2 ] Explicitly providing NULL keyword in values clause

# Are Null values same as blank Space or ZERO

Ans:

Null=> unavailable or Unknown

Zero=> Number

Blank Space => Character

# SQL System Defined Functions?

Ans: Aggregate, String, Date

# SQL Injection

Ans: @Backend unwanted code is executed and Queries are modified Data is Stored or deleted

# Stored Procedure

--can be saved and reused like functions

create procedure <<procedurename>>

select * from emp

GO;

Execute <<procedure_name>>

# Alias vs Rename

> Alias : Temporary

Select name as Nav from emp;

> Rename : Permanent

exec sp_rename 'emp', 'emp1'

# view and Stored procedure

> View: Querying Data

> Stored Procedure : Used for executing predefined logic and operation on database

# Primary vs Foreign =>

Primary=> Not NULL and Unique Values

Unique => Unique and accepts only 1 NULL (values)

Foreign => Relationship between two tables

5. Python:

1: Python

___________________________________________________________________________________

POST 5]# MAKE Notes Online

Notes(File)

1. Celebal Training:

Databricks
Spark
Cloud
Python (Basic + Adv Topics) [Complete at Pune this notes]
SQL(Basic + Adv Topics) [Few Topics Remain Cover it.. ]

2: MY Project [Complete at Pune this notes]

De First project
IPL
UBER

3: SQL

4:Python

5: Project File

6. Notebook: [One Day 1-5 Notebooks take pdf save and also write fast notes important]

Roadmap Guidance + Azure 900
Data engineer Roadmap
Celebal Mentorship Program
Rajas De (Youtube)
BIG DATA
Course : Apache Spark Notebook

In My Room i Designed this Map

[Python DE SQL]:

1. Python

~Created by : Guido van Rossum

Learn From Data With Darshil DE Course

(Do whole Course again and give this for Interview i did this.....)

2. DE

Photos Of DE roadmap and how DE Works

Identify The Data Engineer Problem and Solve It

3. SQL

Learn SQL Baisc to Advance via proper course

Complete Sql Roadmap 2023 : SQLROADMAPLINK

Here is list of function i use everyday on day to day basics as a Data engineer

Select
Where Clause
Order by
Aggregation Function (Min,Max,avg,count)
JOINS (Inner,left,full)
Union
Group By
Case Statements
Working with Datetime/Timestamp (Adding/Substraction/Extracting Year / Month and Day)
IF NULL
Coalesce
Numbering Functions (Rank,Dense _Rank, Row_number)
Conversion Functions: CAST as INT/FLOAT/STRING
Formating Datetime
Creating Function
Merge Statement (For UPSERT)
Qualify

Search This Blog

Code With Sauru💻 [ALL]