SAURABH NOTES BLOG

 POST 1] Unity Catalog Links and Quick Notes By Nishtha Jain and Pratima Jain

To Do Now =>
        1: Make All Links Proper Here  [  ]
        2: Start Learning and Making Notes  [  ]




>> 6 August Call by => Nishtha Jain @ Morning

Azure Databricks, UC, Data governance, Why we use UC?, Features, Advantages, Limitations, Community edition, Complete UC, Set up Databricks UC, Schema, Commands?, UC then migration, Data Plane and Control Plane, Project: how to migrate to UC, UC 3 levels: Catalog, then Schema and Table, Hive metastore, Catalog data governance security, Make documents, Share on the group, 2 weeks to complete UC


Links:>

Videos [24]

>> How to Set Up Databricks Unity Catalog for Azure

>> How to Set Up Databricks Unity Catalog for AWS

>> How to Set Up SCIM in the Account Console

>> How to Upgrade Your Hive Metastore Tables to Unity Catalog Using SYNC

>> How to Set Up Databricks Unity Catalog with Terraform

>> How to Get Started with Databricks Connect V2

>> How to Get Started with Databricks Terraform Modules

>> How to Get Started with Delta Sharing in Databricks

>> How to Get Started with Delta Marketplace

>> How to Get Started with Catalog Binding in Unity Catalog

>> How to Use the UNDROP Feature (undrop table in Databricks)

>> How to Run a Job as a Service Principal


POST 2] ALL CELEBAL LINKS

 Celebal Tracker:

 Tracker :> Link

Sample :> Link

Colab :> Link

Curriculum :> Link

Youtube Links 

https://www.w3schools.com/sql/sql_ref_sqlserver.asp

Naval Sir Channel

https://www.youtube.com/@thedatamaster

https://youtu.be/tZw4Nrv5X9g

Company Internals

 Celebal Company Internal : Drive



Naval Yemul Sir's Links:

Databricks: https://learn.microsoft.com/en-gb/azure/databricks/

Delta Lake:   https://delta.io/

Pyspark:>>> https://sparkbyexamples.com/

Azure Linked Service:>> https://youtube.com/playlist?list=PL7S7dD8r4QdUcHTTF-EYS3aYzuZiisuax


-------------------------------------------------------------------------------------------------

30 June :>>>>>>>>

DP 203:  Link  

Azure Data Factory: Youtube

Architecture:> Architecture

Image : img

----------------------------------------------------------------------------------------------------------

27 June :>>>

Naval Sir NoteBook Link: Navalsir27-Link

Datasets Naval Sirs Drive: > Drive

----------------------------------------------------------------------------------------------------------------

23rd June:>


Youtube 

1) Big DATA PySpark :>  https://www.youtube.com/playlist?list=PL7S7dD8r4QdWUXL4zOC6NYnxDD8OmIVLr

2) Delta Lake > 40,41
https://www.youtube.com/playlist?list=PL7S7dD8r4QdVzOYRzIG2UJdCaCasqBv1F
All about Delta Lake:> https://youtu.be/agtUI25LxuA

Naval Yemul:> 9689777700


3) Spark Eg> To Learn Spark
https://sparkbyexamples.com/pyspark-tutorial/?expand_article=1


4) Do these channels and you will be a master in SQL:>
Do hands-on....
Ankit Bansal:> https://youtu.be/dOLBRfwzYcU
Kudvenkat:> https://youtu.be/7GVFYt6_ZFM


SIMPLE FUNDA:>
<<<<<<<<<<<<<<<SQL------PYTHON------PYSPARK>>>>>>>>>>

------------------------------------------------------------------------------------------------------------------------

 

DE Course (SQL) Curriculum :> Link

SQL W3 Schools :---->>> Link  

You Tube :---->>>  Youtube

Joins :------->> Link

Import data From CSV to SQL Server :> Youtube

Big Data: PySpark ~ Naval Sir : Youtube

Azure Cloud | Make Data your Career ~ Naval Sir  : Youtube

Start New > Try Databricks free  : Link

Data Bricks Login : Link 

https://spark.apache.org/docs/latest/api/python/reference/index.html
spark

notebook

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4223192473721802/3413217916972752/2212737435575810/latest.html




POST 3] Fuel Internship Notes

  Internships:

  1. Fuel Business School as Summer Internship
  2. Celebal Technology as Data Engineer


1. Fuel Business School:

What Internship Taught [Technical]:
  1. AI Basics
  2. Python Basics : Numpy, Pandas
Other activities : 
  1. Attended Events
  2. LinkedIn tracking and its PPT presentation
Internship Experience: [FBS]
Title: Future Skills Training in AI (Artificial Intelligence) along with Python and Personality Development
FBS Intro : 
  •     FBS (Fuel Business School)
  •     FUEL (Friends Union for Energising Lives)
  •     CEO: Founded by Ashoka Fellow Ketan Deshpande
  •     Mentored by: Entrepreneur Santosh Huralikoppi

Fuel offers:
  • 100% CSR scholarship for girls
  • Courses: PGDM, BBA
  • 2 months of sponsored internship for college students
Internship Duration: 2 months
  • Feb to March

Fuel Internship Things Done By me:

AI [AI Theory]
Python : Basics,Pandas,Numpy
Personality Development : Sessions
Certificates: LinkedIn Learning
AMCAT Test (Technical coding preparation)
Project : (Which Project you did ?)


Fuel Learnings: 
 
 Future Skills Training in AI

Artificial Intelligence [Theory]
  • Types
  • What is AI
  • Examples
  • Python packages for AI
  • Need for AI
  • History
Summary of AI : 
  

Machine Learning  [Theory]
  • Defn
  • Supervised,Unsupervised
  • How ML works ?
  • Types
  • Supervised: Regression,Classification 
  • Unsupervised: Clustering, Dimensionality Reduction
  Summary of ML: 

Deep Learning [Theory]
  • Defn
  • Watch Youtube Video
Summary of DL: 

Python

Teachings: 
  • Basics
  • Pandas, Numpy
Python Basics:
   Day 1:
  • Datatypes
  • List
  • Functions in a list: append, extend, isinstance, pop, clear, del
  • Deleting values from a list with various functions
  • Sorting functions: sort, reverse, index, count, min, max
  • reverse function
  • Tuple
  • Definition, properties, syntax
  • Functions: index, count, sorted, reverse, max, min
  • Compare list and tuple? -- refer to drive notes (see the sketch below)
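A quick runnable sketch of the Day 1 list and tuple functions named above:

nums = [3, 1, 2]
nums.append(4)                 # [3, 1, 2, 4]
nums.extend([5, 6])            # [3, 1, 2, 4, 5, 6]
print(isinstance(nums, list))  # True
nums.pop()                     # removes and returns 6
del nums[0]                    # deletes 3 by index -> [1, 2, 4, 5]
nums.sort()                    # ascending order
nums.reverse()                 # [5, 4, 2, 1]
print(nums.index(4), nums.count(4), min(nums), max(nums))  # 1 1 1 5

t = (10, 20, 20)               # tuple: immutable, so no append/sort methods
print(t.index(20), t.count(20), sorted(t), max(t), min(t))  # 1 2 [10, 20, 20] 20 10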
Day 2:
  • String
  • Definition
  • Indexing, slicing
  • String functions: (1 to 14)
  • Functions: capitalize, lower, upper, title, lstrip, rstrip, replace, count, endswith, startswith, islower, isspace, isdigit (see the sketch below)
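A quick runnable sketch of the Day 2 string functions:

s = "  hello World  "
print(s.strip().capitalize())            # Hello world
print(s.upper(), s.lower())              # uppercase / lowercase copies
print("data engineer".title())           # Data Engineer
print(s.lstrip(), s.rstrip())            # trim left / right whitespace
print(s.replace("World", "Sauru"))       # substring replacement
print(s.count("l"))                      # 3
print("notes.txt".endswith(".txt"), "notes.txt".startswith("no"))  # True True
print("abc".islower(), "   ".isspace(), "123".isdigit())           # True True True
print(s[2:7])                            # indexing and slicing -> "hello"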
Day 3:
  • Dictionary
  • Definition
  • Properties
  • Functions: (1 to 8)
  • Functions: get, keys, values, update, delete function, popitem, clear, del (see the sketch below)
  • 32 keywords (all with examples)
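A quick runnable sketch of the Day 3 dictionary functions:

d = {"id": 1, "name": "sauru"}
print(d.get("name"))              # sauru (returns None if the key is missing)
print(list(d.keys()), list(d.values()))
d.update({"age": 21})             # add or overwrite keys
d.pop("age")                      # remove by key, returns 21
d["city"] = "Pune"
d.popitem()                       # removes the last inserted pair ("city", "Pune")
del d["id"]                       # delete one key
d.clear()                         # empty the dict
print(d)                          # {}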
Day 4:
  • Questions on functions: 17 Questions
 

Summary of basics python:

Python Libraries:
  1.  Pandas
  2.  Numpy  
  3. Matplotlib
  4. Seaborn
  5. Sklearn


 Numpy
(Write here what was taken in fuel lecture )
Numpy: 
  • Intro
  • numpy typecasting
  • 1D
  • 2D
  • indexing and Slicing
  • Numpy Functions  
  • Zeros and ones
  • random
  • Matrix Operation
  • Math operation
  • Mean,Mode Median
Numpy Functions
1.
2.
3.
4.
6.
7.
8.

Zeros and Ones:
  • np.zeros
  • np.ones
Random Functions
  • Randint
  • rand
Matrix Operation
  • np.add(array,array1)
Math Operation Functions:
  • np.floor
  • np.ceil
  • np.around
Mean, Mode, Median (explain with an example; see the sketch below)
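A small runnable sketch covering the NumPy topics listed above (note: mode is not built into NumPy itself; scipy.stats.mode is the usual choice):

import numpy as np

a = np.array([1, 2, 3], dtype=np.float64)   # typecasting at creation
m = np.array([[1, 2], [3, 4]])              # 2D array
print(m[0, 1], m[:, 1])                     # indexing and slicing -> 2, [2 4]
print(np.zeros((2, 2)), np.ones(3))         # zeros and ones
print(np.random.randint(1, 10, size=3))     # random integers
print(np.random.rand(2))                    # uniform floats in [0, 1)
print(np.add(m, m))                         # matrix operation
print(np.floor(2.7), np.ceil(2.1), np.around(2.456, 2))  # 2.0 3.0 2.46
print(np.mean(a), np.median(a))             # 2.0 2.0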



Summary of Numpy Lib in Python
Pandas

Pandas >>> Python Data Analysis. It is used for data manipulation and analysis
Topics:
  • Read File
  • Library Intro : Matplotlib,Seaborn
  • Data Structure
  • Use Case: Backend 90%works
  • How to create dataframe
  • Excel File
  • Pandas Function
  • Access Columns and rows
  • Handling Missing Values
  • Drop Null Value
  • Drop and DropNA


data structure:

  1. DataFrame =====> 2D
  2. Series =====> 1D


Use Cases :
  1. Read CSV, Excel and JSON
  2. EDA (exploratory data analysis) (50% of the work is done here) (data types, how many columns, how many rows, etc...)
  3. Data Cleaning: 1. Missing values 2. Outliers
  4. Encoding: [[ object (character) to integer ]] 1. One-hot encoding --> when we don't have labels like male/female for encoding here ..... 2. Label encoding --> Male 0, Female 1
  5. Feature Selection [[best columns, top 50]]
  6. Join multiple columns

Note: 90% of backend work happens here..............

So Pandas is important



Library Intro
  •  Matplotlib,Seaborn
  • from sklearn.linear_model import LinearRegression
How to create a dataframe (see the sketch below):
  1. Array
  2. Dict
  3. List
  4. Tuple
  5. Read a CSV file
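A minimal sketch of each creation path; the CSV path is hypothetical:

import pandas as pd
import numpy as np

df_dict = pd.DataFrame({"id": [1, 2], "name": ["sauru", "akash"]})       # from a dict
df_list = pd.DataFrame([[1, "a"], [2, "b"]], columns=["id", "val"])      # from a list
df_arr = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])    # from an array
df_tup = pd.DataFrame([(1, "a"), (2, "b")], columns=["id", "val"])       # from tuples
# df_csv = pd.read_csv("data.csv")                                       # hypothetical file path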
Excel File
# assumes a dataframe df1 already exists (requires the openpyxl package)
excel_file_path = 'output_data.xlsx'
df1.to_excel(excel_file_path, index=False)

print(f"DataFrame successfully converted and saved to {excel_file_path}.")



Pandas Function

  1. df.info()
  2. df.describe()
  3. df.columns
  4. df.index
  5. df.dtypes

Access Columns and rows
  • df.iloc

Pandas Function

1. df.info()                            # full information about the dataset
2. df.describe()                        # statistical summary
3. df.columns                           # column labels (attribute, no parentheses)
4. df.index                             # row index (attribute)
5. df.dtypes                            # column datatypes (attribute)
6. df.shape                             # dataset shape (rows, columns) (attribute)
7. df.iloc[row_index, column_index]     # position-based, works like slicing
8. df.loc[row_labels, column_labels]    # label-based, works like slicing
9. df[[column1, column2, ... columnN]]  # access multiple columns
10. df.head(n=5)                        # first n rows (default 5)
11. df.tail(n=5)                        # last n rows (default 5)
12. df[column_name]                     # access one column


Drop null values

# 1. df.dropna()
# 2. df.drop()

Note: drop and dropna are always the last option for any data scientist;
filling the data must be the priority (see the sketch below)
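A small sketch of the fill-first, drop-last idea; the columns are made up for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [21, np.nan, 25], "name": ["a", "b", None]})
df["age"] = df["age"].fillna(df["age"].mean())   # prefer filling (mean here)...
df["name"] = df["name"].fillna("unknown")
# ...and only drop as a last resort:
# df = df.dropna()                 # drops any row that still has a null
# df = df.drop(columns=["name"])   # drops a whole column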


  Personality Development [Events]
  
i) 6th Feb: Fuel Conclave - CSR & HR teams shared inspiring experiences.
ii) 9th Feb: Fujitsu Team - Session on Effective Communication.
iii) 24th Feb: MEDIPOINT Team - Health Awareness session.
iv) 26th Feb: Japanese Introductory Session.
v) 27th Feb: Yes Bank Session on Financial Literacy.
vi) 8th March: AXA Team - Job Opportunities.
vii) 11th March: IGS Team - Market Trends Awareness.

Note:
Q: Write here all 8 sessions (paste the list from LinkedIn)
Q: What you learnt from those sessions (summary)
Summary of what you learnt from all 8 events at Fuel.

 Certificates (LinkedIn)
Total 44 certificates done; here is the link: Link2DrivePdf
Note: Mention only the important ones (properly done)

Links : 
ALL  Jupyter Notebook : DriveFolderLInk



POST 4] Celebal Training

  Celebal Training:

  1. Databricks
  2. Spark
  3. Cloud
  4. SQL
  5. Python

 1. Databricks: [12.2]  LTS (March)

Topics Covered:
  • What is databricks
  • databricks UI Explain
  • Medalllion Architecture (Multihop)
  • Control Plane and Data Plane--- (Components,Architecture)
  • Execution of Query on Databricks
  • Lazy Evolution
  • Wide and Narrow Transform
  • What is Cluster,driver node,Worker Node
Define Terms:
  • Databricks: Analytical platform + Process and Transform Big DATA
  • Azure :  Cloud ,Microsoft, Service=>VM, storage, database, datawarehouse.
  • Apache Spark: Analytical platform + process and transform big data [fast]
  • Hadoop: Open Source Framework: Store and Process data
  • Storage: BLOB, ADLS[Has Folders]
  • Cluster: Group of Computers(VM)
  • Worker: Executor => Executes Spark tasks and process data
  • Driver : Machine => Databricks Notebook is running
  • Need of Cloud 

23rd June:>
  • Medallion Architecture: 3 layers
  • Databricks Lakehouse Platform: data warehouse + delta lake
  • Delta Lake features
  • SCD, CDC: Slowly Changing Dimension, Change Data Capture
# Databricks:
  1. Data + AI company
  2. It's a platform for Spark (web-based UI)
  3. Databricks brings DE + DS + ML together
  4. It's an Apache Spark-based analytical platform
  5. It provides one-click setup, streamlined workflows and interactive workspaces that enable collaboration between DE, DS, and business analysts
  6. The Spark creators built Databricks, which is why Databricks sits on top of Spark
  7. Whenever we want to process big data and transform it into some meaningful info, we use Databricks
# How databricks Work:
  • Databricks is just a platform.
  • It is software (it has no cloud of its own); it relies entirely on the cloud for storage and processing
  • So you need: 1) Cluster [VM] 2) Storage
  • Databricks relies on: 1) Azure 2) AWS 3) GCP
# Deployment in Databricks

AZURE => [PaaS] ~ (you manage compute and VMs on your own)
AWS   => [SaaS] ~ (you just use Databricks; AWS takes care of everything)

# Databricks UI
  • Create :Cluster,Notebook,Table
  • Workspaces
  • Recent
  • Searches
  • Data : DBFS,Database Tables
# What are the Data Plane and Control Plane (architecture)

1. Control Plane
  • Workflows, jobs
  • WebUI, NB, Cluster Manager
  • HANDLED BY DATABRICKS
2. Data Plane
  • VM, Storage
  • Storage: Azure=> BLOB, AWS:S3
  • HANDLED BY AZURE, GCP
# Lazy Evaluation:
  • A transformation alone produces no output; to display output we use actions (see the sketch below)
  • Actions: show(), display()
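A minimal sketch of lazy evaluation, assuming a Databricks/PySpark session where `spark` is predefined:

df = spark.range(10)               # transformation: builds a plan, nothing runs yet
filtered = df.filter(df.id > 5)    # still lazy: only extends the plan
filtered.show()                    # action: Spark now actually executes the plan
print(filtered.count())            # count() is also an action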
# Narrow and Wide Transformations
1. Narrow
  • 1 to 1
  • Data remains in the same partition
  • map(), filter()
2. Wide (shuffling)
  • groupByKey()
  • reduceByKey()
  • your rows get shuffled (see the sketch below)
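A small sketch contrasting the two, assuming the same `spark` session; explain() makes the shuffle visible:

df = spark.range(100)
narrow = df.filter(df.id % 2 == 0)                        # narrow: each row stays in its partition
wide = df.groupBy((df.id % 10).alias("bucket")).count()   # wide: keys get shuffled across partitions
wide.explain()                                            # the plan shows an Exchange (shuffle) step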
# DBFS (Databricks File System)
  • Acts as virtual storage for the cloud
  • An abstraction over that storage
  • It's a storage account
  • It just builds the connection between: 1) Storage Account 2) Databricks
# Repos 
  • version control
  • stores all revision history (when added, deleted, updated)
  • Repository like: GitHub
# Schema
  • Different ways: list, string, etc.
  • StructType, DataFrame
  • df(data, schema) (see the sketch below)
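A minimal sketch of defining a schema both ways (StructType and a DDL string), assuming the `spark` session:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [(1, "sauru"), (2, "akash")]
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
df = spark.createDataFrame(data, schema)                   # df(data, schema)
df2 = spark.createDataFrame(data, "id INT, name STRING")   # schema as a DDL string
df.printSchema()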
# Databricks Workflows / Jobs

  • "This is data ingestion; here we use Autoloader for it" [ETL, ELT]
  • Full orchestration: automatic pipelines
# Delta Live Tables
  • There is no connection between Delta Live Tables and Delta Lake
  • It's a totally "new framework"
  • Used for "data orchestration"
  • Maintains quality of data [bronze-silver-gold]
  • Data ingestion
  • Data pipelines
# Databricks SQL:
  • Serverless (don't worry about computation)
  • Just use it like a data warehouse
  • To use SQL you have to enable a SQL Warehouse
# Streaming in Databricks
  • We use the Structured Streaming API in Databricks, i.e. "Autoloader"
# When you want to bring data into Databricks (see the sketch below):
 1. Autoloader

2. COPY INTO:
  • SQL code
  • No streaming data
  • You have to re-run "COPY INTO" every time to copy new data
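A hedged sketch of both ingestion paths; the paths and the bronze_events table name are assumptions, not from the original notes:

# Auto Loader: streaming ingestion via the cloudFiles source
stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", "/tmp/schema")   # assumed path
          .load("/mnt/raw/input"))                              # assumed path
(stream.writeStream
       .option("checkpointLocation", "/tmp/checkpoint")         # assumed path
       .toTable("bronze_events"))                               # assumed table name

# COPY INTO: batch SQL you re-run whenever new files land
spark.sql("""
    COPY INTO bronze_events
    FROM '/mnt/raw/input'
    FILEFORMAT = CSV
""")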

Functions
Package=> from pyspark.sql.functions import col

# Select
Four ways to specify columns =>
  1. string
  2. column
  3. df.<<column_name>>
  4. df['column_name']
1. String
    > df.select("age") OR > df.select("*")
2. Column:
    > df.select(col("age"))
3. df.column_name
    > df.select(df.age)
4. df['column_name']
    > df.select(df['age'])



# Alias
    > df.select(col("age").alias("sauru"))

# head,tail
    > df.head(5)    -----------  (Top 5 records) 
    >  df.tail(5)      ----------- (Last 5 records)

# withColumn (to add $1)
    > df.withColumn("unitprice",col("unitprice")+1)

# withcolumnRenamed
    > df.withColumnRenamed("sauru","sauru_bro")

# describe
    > df.describe()

# dtypes or Print Schema
     > df.dtypes

# len

# limit
    > df.limit(5)

# isNull
    > df.select("id","name").where("name ISNULL")
        
      .where( col("name" ).isNull())

# drop
    > df.drop("name","age")

# dropduplicates
    > df.dropDuplicates()

# dropna (when an entire row is null, it is deleted)
    > df.dropna(how="all")

# fillna
    > df.fillna("sauru")
    > df.fillna(100)

# sort (orderby)
    > df.sort("season")
    > df.sort("season", ascending = False)
             sort:orderBy

# group
    > df.groupBy("dept").count()

# where
    > df.where("instance = 2")


READ AND WRITE FILES
1. CSV
    > df = spark.read.csv("         ")
    > df4.write.csv("         ")

2. JSON
    > df = spark.read.json("   ")
    > df4.write.json("    ")

3. Parquet
    > df = spark.read.parquet("  ")
    > df4.write.parquet("    ")

 2. Spark [3.4.1]

Questions:
  1. What is Spark
  2. What are Spark Functions
  3. How to read and write file in spark
  4. Describe about spark architecture
  5. Components of Spark
3. Cloud

INDEX:
  • Cloud : using someone else's resource, Pay as you go
  • Cloud Providers : AWS,AZURE,GCP
  • Azure Analytics: Compute, Storage, database, data warehouse
  • Important abbreviations
  • Azure (categories - services)
  • Cloud Service Models
      1. IaaS [Virtual Machine]: the user can access and manage fundamental computing resources
      2. PaaS [Azure SQL DB]: provides a complete development and deployment environment
      3. SaaS [Power BI]: the user can access and use software applications over the internet via subscription
  • Types of Cloud
      1. Public: infrastructure and services (provided by a 3rd party) accessible over the internet
      2. Private: infrastructure and services deployed and managed within an organization
      3. Hybrid: PUBLIC + PRIVATE
      4. Community
  • DE Architecture
  • ADF :  Azure Data Factory [ETL]
  • Why do we use Spark: supports many languages, in-memory, fast
  • Nested JSON: Dictionary=>{key,value}
  • Databricks : Analytics Platform run on cloud
  • Datawarehouse tools: Azure Synapse, Aws RedShift
  • Hadoop Diagram
  • Hadoop components: 1] HDFS: storage 2] MapReduce: processing 3] YARN: process manager
  • Diff between BATCH and STREAMING => Batch: OLAP, Stream: OLTP
  • Diff between Hadoop and Spark => Spark (100x faster, in-memory, many languages) | Hadoop (slow, disk-based, Java only)
  • Memory vs hard disk => memory is fast | the hard disk is slow for large data
  • Spark Architecture
# Cloud Questions:
  1. What is Cloud?
  2. What are cloud providers ?
  3. What is Azure ?
  4. What are Cloud Service Models ?
  5. What are Cloud Deployment models ?
  6. What is Hadoop and its Components ?
  7. What is ADF ?
  8. What is ADLS : Storage, Analytics, Cluster
  9. What is a Linked Service in AZURE ?
  10. What is BIG DATA?
  11. Diff between BATCH and STREAMING ?
  12. What is DeltaLake?
  13. Diff betn SPARK and HADOOP ?
  14. What is Synapse ?
  15. What is MetaStore ?
  16. ETL vs ELT ?
# Cloud:
  •     using someone else's resources
  • Pay As You Go
# Cloud Providers
  1.  AWS : 34%
  2. Azure : 21%
  3. GCP : 10%
  4. IBM
  5. SAP
  6. Alibaba
  7. VMware
# Azure Analytics
  1. Compute
  2. VM
  3. Storage
  4. Database
  5. Datawarehouse
  6. BLOB, ADLS
  7. Azure SQL
  8. Transformation
  9. ETL
  10. Synapse
  11. ADF
  12. Databricks + HD Insights
# Abbreviations:
  1. BLOB : Binary Large Object
  2. ADLS : Azure Datalake Storage
  3. S3 : Simple Storage Service
  4. RDS : Relational database Service
  5. ADF : Azure Data Factory
  6.  ETL : Extract Transform Load
  7. EMR : Elastic Map Reduce
# Categories:
Category        | Azure                    | AWS
Compute         | VM                       |
Storage         | BLOB, ADLS (has folders) | S3
Database        | Azure SQL                | RDS
Data Warehouse  | Synapse                  | Redshift
ETL             | ADF                      |
Transformation  | Databricks + HD Insights | EMR

# Cloud service models:
1. IaaS : VM
2. PaaS : SQL DB in the cloud
3. SaaS : Power BI

# Types of Cloud
1. Public 2. Private 3. Hybrid Cloud 

# ADF:
  • for complete ETL and ELT
  • an Azure-native service
# Note: We just have to drag and drop in ADF (no code)

# Why are we using PySpark? (if there's no code)
  • We just drag and drop in ADF (but behind the scenes, PySpark code runs)
# Nested Json:
  • Used to handle the Complex Data
# Databricks:
  • Databricks (cloud platform) ----relies on----> Azure, GCP, AWS
# Datawarehouse Tools
  •   Azure: Synapse
  •   AWS : RedShift
  •  Primary Tool in Azure: Synapse
  • Databricks can be Integrated with other Azure Services: 
    1. Azure Data Lake Storage [ ADLS]
    2. Azure SQL DB
    3. Azure Synapse Analytics
All of the above (1, 2, 3) are used to build a data analytics solution

A: ADF (no streaming), ADLS => no code
B: Databricks => complex code

A: For no code ===> ADF, ADLS
B: Why do we need code (if ADF is available)? => for complex scenarios we need code (the tool for this is Databricks)


Hadoop
# Hadoop Evolution Story

Problem:
1] Storage [2003]
  • Google had a problem storing data
  • Google released a paper describing how to store large datasets on distributed storage
  • This paper was called the Google File System [GFS]
2] Processing [2004]
  • Google had a problem processing data
  • Google released a paper on how to process large datasets
  • This paper was called MapReduce
Hadoop [2006]
  • Yahoo took these papers and implemented them.
  • The implementation of GFS was named HDFS (Hadoop Distributed File System)
  • The implementation of MapReduce kept the name MapReduce

> Hadoop Components:
  1. Store: HDFS
  2. Process: MapReduce
>Yarn:
  • Process Manager
  • Job Scheduling
> Limitations:
  1. Stores data on disk
  2. Only batch processing
# Note: In Hadoop only batch processing is done; streaming is not possible


A scientist [~ Matei Zaharia] came along and developed Spark (added streaming),
so people shifted from Hadoop to Spark

# Difference between Batch and Streaming

1. Batch:
  • Previous/last 5 years of data [OLAP]
  • Eg: DMART
2. Streaming:
  • Runtime/realtime data [OLTP]
  • Online purchase / online payment
  • Eg: PhonePe
Streaming example (PhonePe):
  • No delay, no latency
  • The sender sends 500 and his account is debited by 500; the receiver receives 500 and his account is credited with 500
# Difference between Hadoop and Spark

Spark:
  1. 100 times faster than Hadoop
  2. It works in memory
  3. Languages: Spark can be integrated with many languages
Hadoop
  1. Slow
  2. It works on Disk Calculation
  3. Language: Only Java
# Memory vs hard disk: best example, by Naval Yemul sir
  • Naval: Hey Saurabh, what's my email?
  • Saurabh: thinks about what Naval sir's email might be... slow, hence it's the hard disk
  • Naval: Hey Saurabh, what's your own email?
  • Saurabh: quickly says saurabh6527@gmail.com (quick, hence it's memory)

# Story of Spark
  •  Matei Zaharia : Spark Founder
  • He founded it and donated it to Apache as "Apache Spark"
  • So, to know Databricks you should know "Apache Spark"
  • People who knew and worked with "Apache Spark" worked together with "AZURE" developers, and together they invented "Azure Databricks"
# How does code execute on Spark?
1] Write code
Behind the scenes:
Spark Driver => (submits code) => DAG Scheduler
DAG
  • (Creates the logical and physical plans)
  • Optimizes the plan of your code and breaks it into: 1) Stages 2) Tasks
(Now these tasks are sent to the "Worker Nodes" for "execution")

# DAG ("Direct Acyclic Graph)
  • When You do transformation -> Nothing Happens....
  • When You take Action => BHS (DAG Works)
  • [Filter, group by, etc...]
  • (How to Show result in Best "OPTIMISED FORMAT" )
# Spark Context:
 Important: the "Spark Context" is the entry point that starts the "Spark Driver"


# Define :
  1. Databricks
  2. Azure
  3. Apache Spark
  4. Hadoop
  5. Cluster
  6. Worker Node
  7. Driver Node
  8. DBU
1. Databricks (Refer above in Databricks Notes)
2. Azure:
  • Azure is a cloud computing platform and service offered by Microsoft
  • It provides a wide range of cloud services like compute, VMs, storage, databases, etc.
3: Apache Spark:
  • Apache Spark is an open-source distributed computing system that provides a fast and efficient platform for processing large amounts of data
  • Databricks, on the other hand, is a unified analytics platform built on top of Spark
4. Hadoop
  • Hadoop is an open-source framework that allows distributed storage and processing of large datasets across clusters of computers
Components:
  1. Hadoop Distributed File System (HDFS) [stores large data]
  2. MapReduce [processes data]
  3. YARN (Yet Another Resource Negotiator) [job scheduling framework]

# Cluster
  • A cluster is a group of computers [VMs] that work together to process data and run computations
  • Consists of workers and drivers
# Worker Node [executors work in parallel to execute Spark tasks and process data] (student)
  • The worker nodes are called "executors". The executors are responsible for executing tasks and processing data in parallel across the cluster.
(Executors work together in parallel to execute Spark tasks and process data)

# Driver Node [notebook | job => running]
  • The driver node in Databricks is the machine where the Databricks notebook or job is running
# DBU
  • Databricks Unit
  • Around 2-3 DBUs are consumed per hour
  • It is the unit used for costing
#  RDD, DataFrame and Dataset

ETL (Extract Transform Load)
  • RDD [immutable]
  • RDD [backbone of Spark]: a data structure that handles both structured and unstructured data
  • DataFrame: looks like a table
  • Dataset: RDD + DataFrame => supported only in Scala and Java
# Our Work
  1. Formula 1 racing dataset
  2. Raw data: Azure Data Lake (CSV, JSON, semi-structured folders)
  3. ADLS data => PULLED INTO => Databricks (here we do transformations and convert to Parquet / Delta Lake)
  4. Write to the data warehouse (clean data)
  5. Connect to Power BI
# Apache Spark
  • RDD: backbone of Spark
  • Allows storing and processing data; faster data processing => RAM
> Spark languages:
  • R, Scala, Python, SQL
  • Real-time processing
  • Processes big data
> Spark architecture:
  • Eg: a standalone computer => used to play games, movies, etc.
But to process BIG DATA => we need a group of computers => a cluster => a "framework" ("Apache Spark")

> Components:
  1. Driver process: BOSS
  2. Executor processes: workers
 Driver process: keeps track of all info about Spark, analyzes the work/tasks and distributes them to the worker nodes


# How we are going to do it:
  1. SQL
  2. Python
  3. PySpark
  4. DataBricks (We run it on Azure)
  5. Azure
  6. ADF
  7. Small Touch on : A) Synapse B) Power BI
# Medallion Architecture [Multi Hop]
# Databricks Lakehouse Platform
#Delta Lake
# ADF
# Synapse
# MetaStore
# ETL vs ELT
# Unity Catalog
# What are Linked Services in Azure
# Big data 5vs
# Functions


# Medallion Architecture:
  • It is a data design pattern used to logically organize data in a "lakehouse"
  • To improve the quality of the data, it goes through 3 layers:
    1. Bronze layer
    2. Silver layer
    3. Gold layer
  • We store the data in the lakehouse (see the sketch below)
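A minimal sketch of data moving through the three layers; all paths and the event_type column are assumed for illustration:

# Bronze: raw data ingested as-is
bronze = spark.read.json("/mnt/raw/events")                       # assumed path
bronze.write.format("delta").mode("overwrite").save("/mnt/bronze/events")

# Silver: cleaned and filtered
silver = spark.read.format("delta").load("/mnt/bronze/events").dropna()
silver.write.format("delta").mode("overwrite").save("/mnt/silver/events")

# Gold: business-level aggregates
gold = silver.groupBy("event_type").count()                       # assumed column
gold.write.format("delta").mode("overwrite").save("/mnt/gold/event_counts")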
# Databricks Lakehouse Platform
    Datawarehouse + Datalake => Lakehouse(Delta Lake)


# Story:
Before, we extracted only structured data from the data warehouse, and then did BI and dashboard preparation.
Now we have merged both the data warehouse and the data lake.
To store this type of data we use DELTA LAKE, and it is stored in Parquet form.

# Delta Lake
[Parquet + (transaction log) form]
[Parquet + transaction log [JSON + CRC]]

Parquet is compressed (50% - 60% reduction)
(because it is in "columnar format")


Delta has Both :
  1. Streaming 
  2. Batch
Delta Lake stores data in Parquet form by default

More about Delta Lake => Naval sir's YouTube channel (videos 40, 41)

# Delta Lake :
> Diagram => delta.io 
Features:
  1. ACID Transaction
  2. Scalable Metadata
  3. Time Travel
  4. Open Source
  5. Unified Batch / Streaming 
  6. Schema Evolution / Enforcement
  7. Audit History
  8. DML Operations (see the sketch below)
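A hedged sketch of time travel, DML and audit history on a Delta table; the df and the sales table name are assumptions:

df.write.format("delta").mode("overwrite").saveAsTable("sales")   # assumed df and table
spark.sql("UPDATE sales SET amount = amount + 1")                 # DML on a Delta table
old = spark.sql("SELECT * FROM sales VERSION AS OF 0")            # time travel to version 0
spark.sql("DESCRIBE HISTORY sales").show()                        # audit history from the transaction log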

Note:
# Mainly we do all these things on realtime data:
            1: SCD (Slowly Changing Dimension)
            2: CDC (Change Data Capture)
# Mainly these three operations are done (see the MERGE sketch below):
            1: UPDATE
            2: INSERT
            3: DELETE
# To know CDC you must know:
                1: Python
                2: SQL
                3: PySpark
# Data goes from ADF into ADLS...
# CI/CD: Continuous Integration (bugs, testing) / Continuous Deployment (passes all tests)
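A hedged sketch of how UPDATE/INSERT/DELETE usually land on a Delta table for CDC, via MERGE; the customers and updates tables and the op column are assumptions:

spark.sql("""
    MERGE INTO customers AS t                 -- assumed target Delta table
    USING updates AS s                        -- assumed incoming CDC batch
    ON t.id = s.id
    WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")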


# Summary 
Note: ADLS == DATALAKE

# Delta Lake:

1. Combination of (data warehouse + data lake)
    Data warehouse: structured
    Data lake: both

2. In the data lake, we convert the file to Delta format
    (df.write.format("delta"))
    and that is called Delta Lake

3. Delta Lake provides [Parquet + transaction log]

The most famous features of the transaction log:
    1. Time travel (previous transaction data)
    2. ACID properties
    3. Batch + streaming [we cannot do this in a data lake]
    4. DML operations [Insert, Update, Delete]

Data Warehouse + Data Lake ==> on top is the Lakehouse

(Data warehouse: structured data)
(Data lake: all types of data)
(Data lake: a file from the data lake is converted to Parquet, i.e. Delta format) => you get the "Lakehouse"


# ADF (Azure Data Factory)
  1. ADF is an ETL and ELT service
  2. It provides multiple connectors
      • Which help with:
      1. Data extraction
      2. Data cleaning
      3. Data storage
  3. Why we use ADF => "data orchestration"
                      (the automated process of loading and cleaning data)

> Benefits =>
  1. Reduced cost: no infrastructure (subscription)
  2. Increased productivity: drag and drop (no complex code)
  3. Flexibility: connect various data sources (on-premise, cloud)
  4. Scalability: up and down (pay as you go)
  5. Security: authentication and authorization
# Synapse
Storage => Blob,ADLS
ETL => ADF
Modern Tool
DataWarehouse => "Synapse"

1. Synapse is a modern DataWarehouse
2. You can do this in Synapse:
  • ADF
  • Connect ADLS
  • Power BI
  • Data warehouse
# Metastore
  • It is the top level of the Databricks hierarchy
  • This way:
    1. Metastore
    2. Catalog
    3. Schema
    4. Table
  • To use the metastore you need "Unity Catalog"
> Unity Catalog =>
  • For Delta Sharing
  • Data permissions
  • You can create different catalogs within a metastore

# ETL vs ELT

ETL (Extract Transform Load)

Extract: extract data from the data lake; it may be in any format
(JSON, CSV, Parquet)

Extract to => ADF, Synapse

Transform:
  • Remove nulls
  • Add new columns => withColumn
  • Clean data
Load:
  • Synapse
  • Snowflake
  • Lakehouse => Power BI => Visualization
# ELT
  • First we load the data and then transform it
  • We keep the data loaded...
# What are Linked Services in AZURE?
  Source ====> Destination

Source:
  • Read from sources like: 1. BLOB 2. SQL DB

In between ===:
  • Copy data from source to target
  • Connection string
Destination:
  • Write data
    
# Big Data 5V`S
  1. Value
  2. Variety
  3. Veracity
  4. Volume
  5. Velocity
# Functions:

1. show, display
2. describe, dtypes
3. select, alias, withColumn, withColumnRenamed
4. take, head, limit, sort, drop, dropDuplicates, isNull, dropna, fillna


4. SQL


A] Topic Covered
B] Question that can be asked in SQL


A] Topic Covered


 #  Data: information stored in a database
    Data is a collection of facts in the form of words, numbers or even pictures

 Database: a structured collection of data (rows and columns)
    Structured data has rows and columns

Popular database software:
  1. Oracle Database
  2. Microsoft SQL Server
  3. MySQL
  4. PostgreSQL
  5. MongoDB
  6. IBM DB2
# Five Submodules of SQL
1: DDL [Definition] [auto-committed]
2: DML [Manipulation] [not auto-committed]
3: DCL [Control]
4: TCL [Transaction]
5: DQL [Query]


Data Warehouse:
defn:
  1. It supports business intelligence activities such as reporting, analysis and decision making
  2. Handles only structured data; huge volume, larger than a database
Modern data warehouse (we use the cloud's data) companies:
  1. Azure
  2. AWS
  3. GCP
  4. Databricks
  5. Snowflake
Why do we need a data warehouse?
  • Overall, data warehouses provide a robust and flexible solution for storing and analyzing data, and are essential for organizations that want to gain insights from their data and make informed decisions
#   OLAP vs OLTP

1. OLAP [Online Analytical Processing]
  • OLAP systems are designed to support analytical and reporting activities by providing fast and flexible access to large volumes of data
  • Eg: a general store's data

2. OLTP [Online Transaction Processing]
  • OLTP systems are designed to support day-to-day operations by processing high volumes of transactions quickly and effectively, and provide real-time access to critical business data
  • A type of database processing used for realtime, transaction-oriented applications
  • Eg: an online transaction app


# Data Lake:
A data lake is a centralized repository that allows organizations to store and manage large volumes of data, whether it is 1. structured, 2. semi-structured or 3. unstructured

# What is SQL query order of execution ?
Ans: 
from 
where
select 
order by
-----------------------
from 
group by
select 
order by
----------------------
from 
where 
group by
having
select 
order by
limit

from > where > group by > having > select > order by > limit
Example:
select category, avg(sales) as avg_sales
from sales_data
where year > 2000
group by category
having count(*) > 10
order by avg_sales DESC
limit 3




 
 # SQL
 SQL is the standard language for storing, manipulating, and retrieving data

# SQL vs MySQL
SQL is the Structured / Standardized Query Language,
while
> MySQL is an open-source relational database management system that uses SQL
> MySQL is a platform where SQL queries are run


 Datatypes
 int: stores whole numbers (Eg: 1, 2, 3, etc.)
 dec: stores decimal numbers (Eg: 3.14)
 char: stores fixed-length character strings (Eg: Hello)
 varchar: stores variable-length character strings (Eg: Helloworld)
date: stores a date (Eg: 2023-05-07)
datetime: stores both date and time (Eg: 2023-05-07 12:30:00)
money: stores monetary values with up to 4 decimal places
 boolean: stores boolean values (Eg: true/false)
   

 Clauses
Clauses are keywords/commands used to perform various operations on the data in a database table

where: used to filter data based on a condition.

 group by: used to group data based on 1/more columns in a table
 order by: sorts the data asc/desc on 1/more columns in a table
 having: used to filter data based on a condition after grouping
select: used to retrieve data from 1/more tables
from: used to specify the table from which to retrieve data
limit: used to limit the number of rows returned by a query
join: used to combine data from two or more tables based on a common column
 

Constraints
Constraints are rules applied to a database table to ensure the accuracy, consistency and integrity of the data.
1: primary key: ensures that values in specific columns are "unique and not null"
2: foreign key: ensures values in specific columns of one table match values in the primary key column of another table
3: unique: ensures values in a specific column are unique
4: not null: ensures values in the specified column are not null
5: check: enforces a specific condition on the data in the specified column
6: default


Joins 
inner, left ,right, full outer join , self join  


Views:
 Create a virtual/temporary table
Syntax: create view <<viewname>> as select * from <<tname>>
Eg: create view temptable as select * from employee
select * from temptable


Five Submodules of SQL

1: DDL [Definition] [auto-committed]
2: DML [Manipulation] [not auto-committed]
3: DCL [Control]
4: TCL [Transaction]
5: DQL [Query]

1: DDL:
Definition: 
Types:
  • Create
  • Alter
  • Drop [deletes the table along with all its data]
  • Truncate [deletes only the data in the rows; the table structure remains]
Syntax:
1. Create: create table <<tablename>> (coln_name datatype);
2. Alter: alter table <<tablename>> add <<coln_name>> <<coln_defn>>
3. Drop: drop table <<tablename>>
4. Truncate: truncate table <<tablename>>

2: DML:

Definition: Used to modify database
Types:
  • Insert
  • Update 
  • Delete 
Syntax:
 1. Insert: insert into <<tablename>> (<<coln1, coln2, coln3>>) values (<<value1, value2, ...>>);
 2. Update: update <<tablename>> set coln = value1 where <<condn>>;
 3. Delete: delete from <<tablename>> where <<condn>>;

Example:
1: insert into emp (id, name) values (1, 'dj');
2: update student set stud_name = 'Sauru' where id = 3;
3: delete from student where name = 'Sauru';

 


3: DCL
Definition: 
Types:
  • Grant [gives access privileges]
  • Revoke [Take Back Access Permission]
Eg:

4: TCL
Definition: 
Types:
  • Commit
  • Rollback
Eg:

5: DQL
Definition: 
Types:
  • Select
Eg:

Drop: 
drop the view 
drop view <<viewname>>




# Operators:
Logical Operators
1: AND
Used to compare more than one condition: TRUE TRUE => TRUE (both must be true)
Eg: select * from employee where age>25 AND age<30;
Output: 27, 28

2: OR
Used to compare data with more than one condition;
if any condition is true it returns TRUE
Eg: select * from employee where age>25 OR age<30
Output: all records from the table => 26, 27, 28, 29

3: Between
The begin and end values are included.
Used to avoid multiple (AND, EQUAL =>) conditions
Eg: select * from student where age between 20 AND 30
Output: 20, 21, 22, 23, ..., 30

4: IN
Used to avoid multiple ORs
Eg: select * from student where fname IN ('vinod', 'sauru', 'ram')
Equivalent to: fname = 'vinod' OR fname = 'sauru' OR fname = 'ram'


5: NOT
Excludes the matching records
Eg: select * from student where fname NOT IN ('sauru', 'vinod')
Eg: select * from student where age NOT BETWEEN 20 AND 30


6: IS NULL
select * from student 
where name IS NULL;


7: IS NOT NULL
select * from student 
where name IS NOT NULL




# UNION and UNION ALL :
Union: REMOVES DUPLICATES
Union All: KEEPS DUPLICATES

# SUBQUERIES
Queries that are embedded within another query
Two types: 1. Correlated 2. Non-correlated

# Pattern Matching
(%, _)
Wildcards are used with the LIKE operator

% => represents zero or more characters
_ => represents a single character.

 Write SQL query to find the names of employees that begin with 'A' ?
Ans: Select * from employee
where ename like 'A%';


# Functions:

A] Aggregate Functions: 
  1. Sum
  2. Count()
  3. Min
  4. Max
  5. Avg
Count() =>
Returns the number of rows that match a specific criterion
select count(rno) from stud;

Avg() =>
Returns the average value of a numeric column
select avg(rno) from stud;

Sum() =>
Returns the total sum of a numeric column
select sum(rno) from stud;

Max()=>
select max(age) from stud;

Min()=>
select min(age) from stud;

Note: used in companies for employee total salary count
 
B] String Functions:

String Functions: concat, reverse, replace, upper, lower, len , ltrim, rtrim, left, right


1. Concat
  • select concat(emp_name, '--> ', dept) from employees
2. Reverse
  • select reverse(emp_name) from employees

3. Upper 
  • select upper(emp_name) from employees
4. Lower
  • select lower(emp_name) from employees
5. Replace
  • select replace (emp_name,'john','saurabh') from employees
6.Len
  • select len(emp_name) from employees
7. LEFT[JOHN]
  • select left(emp_name,3) from employees op: JOH 
8. RIGHT[JOHN]
  • select right(emp_name,3) from employee op:OHN
9. SUBSTRING [JOHN]
  • select substring(emp_name,2,4) from employee op: OHN 
10. CharIndex [JOHN]
  • select charindex('O', emp_name) from employee op: 2
11. Replace
  • select replace(emp_name, 'JOHN', 'sauru')
12. LTRIM
  • select ltrim('     sauru') op: sauru
13. RTRIM
  • select rtrim('sauru     ') op: sauru
14. STUFF [john]
  • select stuff(emp_name, 2, 3, '!') op: j!
15: Replicate
  • select replicate(emp_name, 3) op: JohnJohnJohn
16: Space
  • select space(3) op: '   '

C] The Datetime Functions
1. GetDate()
2. SysDateTime()
3. Current_Timestamp
4. DateName()
5. DateAdd()
6. DateDiff()
7. DatePart()
8. Day(), Month(), Year(), IsDate()


# Where clause:
A) Logical and B) Comparison Operators
A] Logical Operator: 
  1. AND
  2. OR
  3. NOT 
  4. LIKE
  5. BETWEEN
  6. IN, NOT IN
  7. Exists, Not Exists
  8. IS NULL, IS NOT NULL
B] Comparison  Operator



B] Question that can be asked in SQL

Practice Questions=>

Q: Write an SQL query to find duplicate rows in a table
Ans: select salary, count(salary) from employee
group by salary
having count(salary) > 1;

Q: Write a query to fetch even/odd records from a table
Ans: select * from employee
where MOD(id, 2) = 0  => EVEN
where MOD(id, 2) = 1  => ODD

Q: Find First and Last record
Ans: 
First Record:
select * from employee
where id = (select min(id) from employee)

Last Record: 
select * from employee
where id  = (select max(id) from employee)


Q: How do you copy all records of table ?
Ans:
create table temp as select * from employee;

Q: Print 1 to 10
Ans: declare @start int
declare @end int
set @start = 1
set @end = 10

while @start <= @end
begin
print @start
set @start = @start + 1
end

Q: Fetch alternate records
Ans: select * from emp
where MOD(id, 2) = 1


Q: Find the second highest salary (using Max())
Ans:
select max(salary) from employee
where salary NOT IN (select max(salary) from employee)

Q. Write a query to find the third highest salary from employee
Ans:
select distinct salary from employee
order by salary desc
limit 2, 1;
(limit 2,1 => fetch one row after skipping the first two)

Q. Write a query to find the Nth highest salary
Ans:

1. Max()

2nd highest:
select max(salary) from employee
where salary < (select max(salary) from employee)

3rd highest:
select max(salary) from employee
where salary < (select max(salary) from employee
                where salary < (select max(salary) from employee))

2. Limit
select distinct salary from employee
order by salary desc
limit n-1, 1

2000 => limit(0,1)
1000 => limit(1,1): after the 1st row, 1 record is fetched
500  => limit(2,1): after the 2nd row, 1 record is fetched

3. Self Join (without Limit)
select salary from employee e1
where 3-1 = (select count(distinct salary)
             from employee e2
             where e2.salary > e1.salary)


4. Not IN
select max(salary) from employee
where salary NOT IN (select max(salary) from employee)

5. dense_rank
select salary from (
    select salary, dense_rank() over (order by salary desc) as rnk
    from employee
) t
where rnk = N;




# Tricky Question: 
Q. Add New Column To Table 
Ans: alter table customer
add phone_number varchar(21);

Q. Change Datatype
Ans: alter table customer 
alter column phone_number varchar(21);

Q. View a specific row
Ans: select top 1 * from employee

Q. Add a primary key to a table
Ans: alter table student
add primary key (ID);

Q. Add Unique Index/Constraint to table
Ans: alter table <<tblname>> 
ADD Constraint <<indexname>>
UNIQUE (Coln_name);

Q. What is the Alias command?
Ans: An alias name can be given to any
table or column

Q. What is SQL Query order of execution
Ans: 
from 
where
select 
order by
-----------------------
from 
group by
select 
order by
----------------------
from 
where 
group by
having
select 
order by
limit

Q. Write SQL query to find the names of employees that begin with 'A' ?
Ans: Select * from employee
where ename like 'A%';


Theoretical Questions

# List ways of getting the count of records from a table
Ans: Two ways =>
select * from employee;
select count(*) from employee;

#What do you mean by table and field in SQL?
Table: Collection of data (rows and Columns)
Field: Columns in Table

Employee Info                  => Table
--------------------------------
emp_id | Name  | emp_age       => Fields

0      | sauru | 56
1      | akash | 21

# How to insert NULL values?
Ans: Two ways =>
1] Implicitly, by omitting the column from the column list
2] Explicitly, by providing the NULL keyword in the values clause

# Are NULL values the same as blank space or zero?
Ans:
Null => unavailable or unknown
Zero => a number
Blank space => a character

# SQL System Defined Functions?
Ans: Aggregate, String, Date

# SQL Injection
Ans: At the backend, unwanted code is executed and queries are modified; data is stolen or deleted

# Stored Procedure
  -- can be saved and reused like a function
create procedure <<procedure_name>>
as
select * from emp
go

exec <<procedure_name>>

# Alias vs Rename
> Alias : Temporary 
Select name as Nav from emp;

> Rename : Permanent
exec sp_rename 'emp', 'emp1'

# View vs Stored Procedure
> View: for querying data
> Stored procedure: used for executing predefined logic and operations on the database

# Primary vs Foreign =>
Primary=> Not NULL and Unique Values
Unique => Unique and accepts only 1 NULL (values)
Foreign => Relationship between two tables

5. Python:

1: Python

___________________________________________________________________________________

POST 5] Make Notes Online

 Notes(File)

1. Celebal Training:

  • Databricks 
  • Spark
  • Cloud
  • Python (Basic + Adv Topics)  [complete these notes at Pune]
  • SQL (Basic + Adv Topics)  [a few topics remain; cover them..]
 2: My Projects [complete these notes at Pune]
  • DE first project
  • IPL
  • UBER


3: SQL

4:Python

5: Project File


6. Notebooks: [One day, take PDFs of notebooks 1-5, save them, and also quickly write down the important notes]
  1. Roadmap Guidance + Azure 900
  2. Data engineer Roadmap
  3. Celebal Mentorship Program
  4. Rajas De (Youtube)
  5. BIG DATA
  6. Course : Apache Spark Notebook

In my room I designed this map
 [Python DE SQL]:

1. Python 
~Created by : Guido van Rossum


Learn from Data With Darshil's DE course

(Do the whole course again so you can say in interviews that you did this.....)



2. DE
Photos of the DE roadmap and how DE works

Identify the data engineering problem and solve it



3. SQL

Learn SQL basics to advanced via a proper course

Complete SQL Roadmap 2023: SQLROADMAPLINK

Here is the list of functions I use every day, on a day-to-day basis, as a Data Engineer
  • Select
  • Where Clause
  • Order by
  • Aggregation Function (Min,Max,avg,count)
  • JOINS (Inner,left,full)
  • Union
  • Group By
  • Case Statements
  • Working with Datetime/Timestamp (adding/subtracting/extracting year, month and day)
  • IFNULL
  • Coalesce
  • Numbering Functions (Rank, Dense_Rank, Row_Number)
  • Conversion Functions: CAST as INT/FLOAT/STRING
  • Formatting Datetime
  • Creating Function
  • Merge Statement (For UPSERT)
  • Qualify





