SAURABH NOTES BLOG
POST 1] Unity Catalog Links and Quick Notes by Nishtha Jain and Pratima Jain
>> 6 August Call by => Nishtha Jain @ Morning
Azure Databricks, UC, Data governance, Why do we use UC?, Features, Advantages, Limitations, Community edition, Complete UC, Set up Databricks UC, Schema, Commands, UC then Migration, Data Plane and Control Plane, Project: how to migrate to UC, UC 3-level namespace: Catalog, then Schema, then Table, Hive metastore, Catalog data governance and security, Make documents, Share on the group, complete UC in 2 weeks
Links:>
Videos [24]
>> How to Setup Databricks Unity Catalog for Azure
>> How To Setup Databricks Unity Catalog for AWS
>> How To Setup SCIM in the Account Console
>> How to upgrade your Hive metastore tables to Unity Catalog Using Sync
>> How to setup Databricks Unity Catalog with Terraform
>> How to Get Started with Databricks Connect V2
Celebal Tracker:
Tracker :> Link
Sample :> Link
Colab :> Link
Curriculum :> Link
Youtube Links
https://www.w3schools.com/sql/sql_ref_sqlserver.asp
Naval Sir Channel
https://www.youtube.com/@thedatamaster
https://youtu.be/tZw4Nrv5X9g
Company Internals
Celebal Company Internal : Drive
Databricks: https://learn.microsoft.com/en-gb/azure/databricks/
-------------------------------------------------------------------------------------------------
30 June :>>>>>>>>
----------------------------------------------------------------------------------------------------------
27 June :>>>
----------------------------------------------------------------------------------------------------------------
23rd June:>
Naval Yemul:> 9689777700
------------------------------------------------------------------------------------------------------------------------
DE Course (SQL) Curriculum :> Link
SQL W3 Schools :---->>> Link
You Tube :---->>> Youtube
Joins :------->> Link
Import data From CSV to SQL Server :> Youtube
Internships:
- Fuel Business School as Summer Internship
- Celebal Technologies as Data Engineer
- AI Basics
- Python Basics : Numpy, Pandas
- Attended Events
- LinkedIn tracking and its PPT presentation
- FBS (Fuel Business School)
- FUEL (Friends Union for Energising Lives)
- CEO: founded by Ashoka Fellow Ketan Deshpande
- Mentored by entrepreneur Santosh Huralikoppi
- 100% CSR scholarship for girls
- Courses: PGDM, BBA
- 2 months of sponsored internship for college students
- Feb to March
Python: Basics, Pandas, NumPy
Personality Development: sessions
Certificates: LinkedIn Learning
AMCAT Test (technical coding preparation)
Project: (Which project did you do?)
- Types
- What is AI
- Examples
- Python packages for AI
- Need for AI
- History
- Definition
- Supervised, Unsupervised
- How does ML work?
- Types
- Supervised: Regression, Classification
- Unsupervised: Clustering, Dimensionality Reduction
- Definition
- Watch YouTube video
- Basics
- Pandas, Numpy
- Datatype
- List
- list methods: append, extend, pop, clear, del (isinstance is a general built-in type check, not a list method)
- deleting values from a list via the various functions above
- sorting-related functions: sort, reverse, index, count, min, max
- reverse function
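A minimal sketch of the list functions above (sample values are made up):

    nums = [3, 1, 4]
    nums.append(1)                       # add one element -> [3, 1, 4, 1]
    nums.extend([5, 9])                  # add many elements -> [3, 1, 4, 1, 5, 9]
    nums.pop()                           # remove and return the last element (9)
    del nums[0]                          # delete by index -> [1, 4, 1, 5]
    print(isinstance(nums, list))        # True (built-in type check, not a list method)
    print(nums.index(4), nums.count(1))  # 1, 2
    print(min(nums), max(nums))          # 1, 5
    nums.sort()                          # in-place ascending sort
    nums.reverse()                       # in-place reversal
    nums.clear()                         # empty the list -> []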
- Tuple
- definition, properties, syntax
- functions: index, count (tuple methods); sorted, reversed, max, min (built-ins that also work on tuples)
- Compare list and tuple? -- refer to drive notes
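A quick list-vs-tuple sketch (values are illustrative):

    point = (2, 7, 2)
    print(point.index(7), point.count(2))         # tuple methods: 1, 2
    print(sorted(point), max(point), min(point))  # built-ins also work: [2, 2, 7], 7, 2
    print(tuple(reversed(point)))                 # reversed via built-in
    try:
        point[0] = 9                              # tuples are immutable
    except TypeError as err:
        print("immutable:", err)
    coords = list(point)                          # lists are mutable
    coords[0] = 9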
- String
- definition
- indexing, slicing
- string functions: (1 to 14)
- functions: capitalize, lower, upper, title, lstrip, rstrip, replace, count, endswith, startswith, islower, isspace, isdigit
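A short sketch of the string functions above (sample strings are made up):

    name = "  john DOE  "
    print(name.strip().title())                          # "John Doe"
    print(name.lstrip(), "|", name.rstrip())             # strip left / right whitespace
    s = "saurabh"
    print(s.capitalize(), s.upper(), s.lower(), s.title())
    print(s.replace("s", "S"), s.count("a"))             # "Saurabh", 2
    print(s.startswith("sa"), s.endswith("bh"))          # True, True
    print(s.islower(), "123".isdigit(), "  ".isspace())  # True, True, True
    print(s[0:4], s[::-1])                               # indexing/slicing: "saur", "hbaruas"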
- Dictionary
- definition
- properties
- functions: (1 to 8)
- functions: get, keys, values, update, pop, popitem, clear, del
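A small sketch of the dictionary functions above (keys and values are made up):

    emp = {"name": "John", "dept": "DE"}
    print(emp.get("name"), emp.get("city", "N/A"))  # get, with a default for missing keys
    print(list(emp.keys()), list(emp.values()))     # keys and values
    emp.update({"dept": "ML", "city": "Pune"})      # add / overwrite entries
    emp.pop("city")                                 # remove by key -> "Pune"
    print(emp.popitem())                            # remove the last pair -> ("dept", "ML")
    del emp["name"]                                 # del statement
    emp.clear()                                     # {}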
- 32 Keywords (All with Examples)
- Questions on functions: 17 Questions
- Pandas
- Numpy
- Matplotlib
- Seaborn
- Sklearn
- Intro
- numpy typecasting
- 1D
- 2D
- indexing and Slicing
- Numpy Functions
- Zeros and ones
- random
- Matrix Operation
- Math operation
- Mean, Median, Mode
- np.zeros
- np.ones
- np.random.randint
- np.random.rand
- np.add(array, array1)
- np.floor
- np.ceil
- np.around
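A minimal NumPy sketch covering the functions above (shapes and values are illustrative):

    import numpy as np

    a = np.zeros((2, 3))                  # 2x3 array of zeros
    b = np.ones((2, 3))                   # 2x3 array of ones
    r = np.random.randint(1, 10, size=5)  # 5 random ints in [1, 10)
    u = np.random.rand(2, 2)              # uniform floats in [0, 1)
    print(np.add(a, b))                   # elementwise addition (matrix operation)
    x = np.array([1.2, 2.5, 3.7])
    print(np.floor(x), np.ceil(x), np.around(x))
    print(x.mean(), np.median(x))         # mean and median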
- Read File
- Library Intro : Matplotlib,Seaborn
- Data Structure
- Use case: backend (~90% of the work)
- How to create dataframe
- Excel File
- Pandas Function
- Access Columns and rows
- Handling missing values
- Drop null values
- drop() vs dropna()
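A small sketch of drop() vs dropna() (column names are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"name": ["John", "Amy", None], "age": [28, np.nan, 31]})
    print(df.isna().sum())                         # missing values per column
    clean = df.dropna()                            # drop rows containing any NaN
    filled = df.fillna({"age": df["age"].mean()})  # or impute instead of dropping
    trimmed = df.drop(columns=["age"])             # drop() removes columns/rows by label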
Data structures:
- DataFrame => 2D
- Series => 1D
- Read CSV, Excel, and JSON
- EDA (exploratory data analysis) (50% of the work is done here) (dtypes, how many columns, how many rows, etc...)
- Data cleaning: 1. Missing values 2. Outliers
- Encoding [[ object (character) to integer ]]: 1. One-hot encoding --> used when categories like male/female have no order; each category becomes its own 0/1 column 2. Label encoding --> maps categories to integers, e.g. Male 0, Female 1
- Feature selection [[ best columns, top 50 ]]
- Join multiple columns
Note: ~90% of backend data work happens here,
so Pandas is important
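A small encoding sketch with pandas (the gender column is a made-up example):

    import pandas as pd

    df = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
    # Label encoding: map each category to an integer
    df["gender_label"] = df["gender"].map({"Male": 0, "Female": 1})
    # One-hot encoding: one 0/1 column per category, no ordering implied
    one_hot = pd.get_dummies(df["gender"], prefix="gender")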
- Matplotlib, Seaborn
- from sklearn.linear_model import LinearRegression
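A minimal LinearRegression sketch on toy data (purely illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1], [2], [3], [4]])   # feature matrix
    y = np.array([2, 4, 6, 8])           # target
    model = LinearRegression().fit(X, y)
    print(model.predict([[5]]))          # ~[10.]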
- Array
- Dict
- List
- Tuple
- Read CSV file
- df.iloc
Pandas Functions
1. df.info() # full information about the dataset
2. df.describe() # statistical summary
3. df.columns # column labels (an attribute, not a method)
4. df.index # row index (an attribute, not a method)
5. df.dtypes # column datatypes (an attribute, not a method)
6. df.shape # (rows, columns) of the dataset (an attribute, not a method)
7. df.iloc[row_index, column_index] # position-based, works like slicing
8. df.loc[row_labels, column_labels] # label-based, works like slicing
9. df[[column1, column2, ..., columnN]] # to access multiple columns
10. df.head(n=5) # first n rows (default 5)
11. df.tail(n=5) # last n rows (default 5)
12. df[column_name] # to access one column
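A quick usage sketch of the functions above, on a tiny made-up frame:

    import pandas as pd

    df = pd.DataFrame({"name": ["John", "Amy"], "age": [28, 31]})
    df.info()                         # dtypes, non-null counts, memory usage
    print(df.describe())              # stats for numeric columns
    print(df.shape, df.dtypes)        # attributes, no parentheses
    print(df.iloc[0, 1])              # position-based -> 28
    print(df.loc[0, "name"])          # label-based -> "John"
    print(df[["name", "age"]].head())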
Drop null value
ii) 9th Feb: Fujitsu Team - Session on Effective Communication.
iii) 24th Feb: MEDIPOINT Team - Health Awareness session.
iv) 26th Feb: Japanese Introductory Session.
v) 27th Feb: Yes Bank Session on Financial Literacy.
vi) 8th March: AXA Team - Job Opportunities.
vii) 11th March: IGS Team - Market Trends Awareness.
Celebal Training:
- Databricks
- Spark
- Cloud
- SQL
- Python
- What is Databricks
- Databricks UI explained
- Medallion Architecture (Multi-hop)
- Control Plane and Data Plane (components, architecture)
- Execution of a query on Databricks
- Lazy Evaluation
- Wide and Narrow Transformations
- What is a Cluster, Driver Node, Worker Node
- Databricks: analytical platform + processes and transforms big data
- Azure: cloud, Microsoft; services => VMs, storage, databases, data warehouses
- Apache Spark: analytical platform + processes and transforms big data [fast]
- Hadoop: open-source framework: stores and processes data
- Storage: BLOB, ADLS [has folders]
- Cluster: group of computers (VMs)
- Worker: executor => executes Spark tasks and processes data
- Driver: the machine => where the Databricks notebook is running
- Need of Cloud
- Medallion Architecture : 3 Layers
- Databricks Lakehouse platform: data warehouse + Delta Lake
- Delta Lake features
- SCD, CDC: Slowly Changing Dimensions, Change Data Capture
- Data + AI company
- It's a platform for Spark (web-based UI)
- Databricks brings DE + DS + ML together
- It's an Apache Spark-based analytics platform
- It provides one-click setup, streamlined workflows, and interactive workspaces that enable collaboration between data engineers, data scientists, and business analysts
- Databricks was built by the creators of Spark, which is why Databricks sits on top of Spark
- Whenever we want to process big data and transform it into meaningful information, we use Databricks
- Databricks is just a platform.
- It is software (it has no cloud of its own); it relies entirely on the cloud for storage and processing
- So you need: 1) Cluster [VMs] 2) Storage
- Databricks relies on: 1) Azure 2) AWS 3) GCP
- Create: Cluster, Notebook, Table
- Workspaces
- Recents
- Searches
- Data: DBFS, database tables
- Workflows, jobs
- Web UI, notebooks, cluster manager
- HANDLED BY DATABRICKS (control plane)
- VMs, storage
- Storage: Azure => BLOB, AWS => S3
- HANDLED BY AZURE, AWS, GCP (data plane)
- Once a transformation is defined, no output is produced; to get output we use actions
- Actions: show(), display()
- Narrow transformation: 1 to 1
- Data remains in the same partition
- map(), filter()
- Wide transformation: grouping by key
- reduceByKey()
- your rows get shuffled across partitions
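A small PySpark sketch of lazy evaluation and narrow vs wide transformations (assumes an active SparkSession named spark; values are made up):

    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
    narrow = rdd.map(lambda kv: (kv[0], kv[1] * 2))   # narrow: no shuffle, same partition
    wide = narrow.reduceByKey(lambda x, y: x + y)     # wide: shuffles rows by key
    # Nothing has executed yet (lazy evaluation); the action below triggers the DAG
    print(wide.collect())                             # e.g. [('a', 8), ('b', 4)]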
- It acts as virtual storage for the cloud
- An abstraction over that storage
- It's a storage account
- It just builds a connection between: 1) the Storage Account 2) Databricks
- Version control
- Stores all revision history (when added, deleted, updated)
- Repository, like GitHub
- Different ways: from a list, strings, etc..
- StructType, DataFrame
- spark.createDataFrame(data, schema)
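A minimal createDataFrame sketch with an explicit StructType schema (column names are illustrative; assumes an active SparkSession spark):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])
    data = [("John", 28), ("Amy", 31)]
    df = spark.createDataFrame(data, schema)
    df.show()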
- "This is data Ingestion here we use Autoloader for it" [ETL,ELT]
- Full orchestration: Automatic,Pipeline
- There is no connection between delta live table and delta lake
- Its totally "New framework"
- used for "Data Orchestration"
- Maintains Quality of Data [bronze-silver-gold]
- Data Ingestion
- Data Pipeline
- Serverless(dont worry about computation)
- just use like Datawarehouse
- TO use SQL : You have to enable : SQL Warehouse
- We use the Structured Streaming API in Databricks, i.e. "Auto Loader"
- COPY INTO: SQL code
- No streaming data
- Every time, you have to re-run "COPY INTO" to copy new data
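A minimal Auto Loader sketch (paths, checkpoint location, and table name are made-up placeholders; runs on Databricks):

    df = (spark.readStream
          .format("cloudFiles")                 # Auto Loader source
          .option("cloudFiles.format", "json")  # format of the incoming files
          .load("/mnt/raw/events/"))            # hypothetical landing path

    (df.writeStream
       .option("checkpointLocation", "/mnt/chk/events/")  # tracks which files were processed
       .trigger(availableNow=True)                        # incremental, batch-style run
       .toTable("bronze_events"))                         # hypothetical target table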
- string
- column
- df.<column_name>
- df['column_name']
- What is Spark
- What are Spark Functions
- How to read and write file in spark
- Describe the Spark architecture
- Components of Spark
- Cloud : using someone else's resource, Pay as you go
- Cloud Providers : AWS,AZURE,GCP
- Azure Analytics: Compute, Storage, database, data warehouse
- Important abbreviations
- Azure (categories - Services)
- Cloud Service Model
- IaaS [Virtual Machines]: the user can access and manage fundamental computing resources
- PaaS [Azure]: provides a complete development and deployment environment
- SaaS [Power BI]: the user can access and use software applications over the internet via subscription
- Types of Cloud
- Public: infrastructure and services (provided by a 3rd party) accessible over the internet
- Private: infrastructure and services deployed and managed within the organization
- Hybrid: PUBLIC + PRIVATE
- Community
- DE Architecture
- ADF : Azure Data Factory [ETL]
- Why do we use Spark: supports many languages, in-memory, fast
- Nested JSON: Dictionary=>{key,value}
- Databricks : Analytics Platform run on cloud
- Data warehouse tools: Azure Synapse, AWS Redshift
- Hadoop diagram
- Hadoop components: 1] HDFS: storage 2] MapReduce: processing 3] YARN: resource manager
- Diff between BATCH and STREAMING => Batch: OLAP, Stream: OLTP
- Diff between Hadoop and Spark => Spark (up to 100x faster, in-memory, many languages) | Hadoop (slower, disk-based, Java)
- Memory vs Hard disk => memory is fast | hard disk is slow for large data
- Spark Architecture
- What is Cloud?
- What are cloud providers ?
- What is Azure ?
- What are Cloud Service Models ?
- What are Cloud Deployment models ?
- What is Hadoop and its Components ?
- What is ADF ?
- What is ADLS : Storage, Analytics, Cluster
- What is a Linked Service in AZURE?
- What is BIG DATA?
- Diff between BATCH and STREAMING?
- What is DeltaLake?
- Diff betn SPARK and HADOOP ?
- What is Synapse ?
- What is MetaStore ?
- ETL vs ELT ?
- using someone else's cloud
- Pay As You Go
- AWS : 34%
- Azure : 21%
- GCP : 10%
- IBM
- SAP
- Alibaba
- VMware
- Compute
- VM
- Storage
- Database
- Datawarehouse
- BLOB, ADLS
- Azure SQL
- Transformation
- ETL
- Synapse
- ADF
- Databricks + HD Insights
- BLOB : Binary Large Object
- ADLS : Azure Datalake Storage
- S3 : Simple Storage Service
- RDS : Relational database Service
- ADF : Azure Data Factory
- ETL : Extract Transform Load
- EMR : Elastic Map Reduce
- for complete ETL and ELT
- offered by Azure
- We drag and drop in ADF (but behind the scenes, Spark code runs)
- Used to handle complex data
- Databricks (cloud platform) --relies on--> Azure, GCP, AWS
- Azure: Synapse
- AWS : RedShift
- Primary Tool in Azure: Synapse
- Databricks can be Integrated with other Azure Services:
- Azure Data Lake Storage [ ADLS]
- Azure SQL DB
- Azure Synapse Analytics
- Google had a problem storing data
- Google released a paper describing how to store large datasets on distributed storage
- This paper was called the Google File System [GFS]
- Google had a problem processing data
- Google released a paper: how to process large datasets in a distributed way
- This paper was called MapReduce
- Yahoo took these papers and implemented them.
- The implementation of GFS was named HDFS (Hadoop Distributed File System)
- The implementation of MapReduce kept the name MapReduce
- Store: HDFS
- Process: MapReduce
- Process Manager
- Job Scheduling
- Stores data on disk
- only batch processing
- Previous/last 5 Years data [OLAP]
- Eg: DMART
- Runtime/real-time data [OLTP]
- Online purchases / online payments
- Eg: PhonePe
- No delay, no latency
- The sender sends 500 and their account is debited by 500; the receiver receives 500 and their account is credited with 500
- Up to 100 times faster than Hadoop
- It works in memory
- Languages: Spark can be integrated with many languages
- Slow
- It works with disk-based calculation
- Language: only Java
- Naval: "Hey Saurabh, tell me my mail?"
- Saurabh: thinks about what Naval sir's mail would be... hence slow, like a hard disk
- Naval: "Hey Saurabh, tell me your own mail?"
- Saurabh: quickly says saurabh6527@gmail.com (hence quick, like memory)
- Matei Zaharia: Spark founder
- He founded it and donated it to Apache as "Apache Spark"
- So, to know Databricks you should know "Apache Spark"
- The people who built and worked on "Apache Spark" worked together with "AZURE" developers, and together they invented "Azure Databricks"
- (Creates the logical and physical plans)
- Optimizes the plan of your code and breaks it into: 1) Stages 2) Tasks
- When you do a transformation -> nothing happens....
- When you take an action => behind the scenes, the DAG runs
- [filter, group by, etc...]
- (how to produce the result in the best "OPTIMISED FORMAT")
- Databricks
- Azure
- Apache Spark
- Hadoop
- Cluster
- Worker Node
- Driver Node
- DBU
- Azure is a cloud computing platform and service offered by Microsoft
- It provides a wide range of cloud services like compute, VMs, storage, databases, etc..
- Apache Spark is an open-source distributed computing system that provides a fast and efficient platform for processing large amounts of data
- Databricks, on the other hand, is a unified analytics platform built on top of Spark
- Hadoop is an open-source framework that allows distributed storage and processing of large datasets across clusters of computers
- Hadoop Distributed File System: [to store large data]
- A cluster is a group of computers [VMs] that work together to process data and run computations
- Consists of workers and drivers
- The worker nodes are called "executors". The executors are responsible for executing tasks and processing data in parallel across the cluster.
- The driver node in Databricks is the machine where the Databricks notebook or job is running
- Databricks Unit
- Typically 2/3 DBUs are consumed per hour
- It is the basis for costing
- RDD [immutable]
- RDD [backbone of Spark]: it's a data structure to handle both structured and unstructured data
- DataFrame: looks like a table
- Dataset: RDD + DataFrame => supported only in Scala and Java
- Formula 1 Racing dataset
- Raw data: Azure Data Lake (CSV, JSON, semi-structured folders)
- ADLS data => PULLED INTO => Databricks (here we do transformations and convert to Parquet / Delta Lake)
- Write to the data warehouse (clean data)
- Connect to Power BI
- RDD: backbone of Spark
- Allows storing and processing data; faster data processing => RAM
- R, Scala, Python, SQL
- Real-time processing
- Processes big data
- Eg: a standalone computer => used to play games, movies, etc..
- Driver process: BOSS
- Executor process: worker
- SQL
- Python
- PySpark
- DataBricks (We run it on Azure)
- Azure
- ADF
- A brief touch on: A) Synapse B) Power BI
- It is a data design pattern used to logically organize data in a "Lakehouse"
- To improve data quality it goes through 3 layers:
- Bronze layer (raw)
- Silver layer (cleaned)
- Gold layer (aggregated)
- We store the data in the Lakehouse
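A minimal multi-hop (bronze-silver-gold) sketch in PySpark (paths, columns, and table names are made up; a real pipeline might use Auto Loader or DLT):

    from pyspark.sql import functions as F

    bronze = spark.read.json("/mnt/raw/orders/")              # Bronze: raw data as-is
    silver = (bronze.dropna(subset=["order_id"])              # Silver: cleaned / validated
                    .withColumn("amount", F.col("amount").cast("double")))
    gold = (silver.groupBy("customer_id")                     # Gold: business-level aggregate
                  .agg(F.sum("amount").alias("total_spend")))
    gold.write.format("delta").mode("overwrite").saveAsTable("gold_customer_spend")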
- Streaming
- Batch
- ACID Transaction
- Scalable Metadata
- Time Travel
- Open Source
- Unified Batch / Streaming
- Schema Evolution / Enforcement
- Audit History
- DML Operations
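A small Delta Lake time-travel sketch (the orders table name is illustrative; assumes it is a Delta table on Databricks):

    spark.sql("UPDATE orders SET amount = amount * 1.1")     # DML on a Delta table
    spark.sql("DESCRIBE HISTORY orders").show()              # audit history of the table
    old = spark.sql("SELECT * FROM orders VERSION AS OF 0")  # time travel to version 0
    old.show()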
- ADF is an ETL and ELT service
- It provides multiple connectors
- which help with:
- data extraction
- data cleaning
- data storage
- Why we use ADF => "data orchestration"
- Reduced cost: no infrastructure (subscription)
- Increased productivity: drag and drop (no complex code)
- Flexibility: connects various data sources (on-premises and cloud)
- Scalability: up and down (pay as you go)
- Security: authentication and authorization
- ADF
- Connect ADLS
- Power BI
- Data warehouse
- It is the top level of the Databricks hierarchy
- This way:
- Metastore
- Catalog
- Schema
- Table
- To use the metastore you need "Unity Catalog"
- For Delta Sharing
- For data permissions
- You can create different catalogs within a metastore
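A sketch of the UC three-level namespace via Spark SQL (catalog/schema/table and group names are made up; requires a UC-enabled workspace and privileges):

    spark.sql("CREATE CATALOG IF NOT EXISTS dev")
    spark.sql("CREATE SCHEMA IF NOT EXISTS dev.sales")
    spark.sql("CREATE TABLE IF NOT EXISTS dev.sales.orders (id INT, amount DOUBLE)")
    spark.sql("GRANT SELECT ON TABLE dev.sales.orders TO `analysts`")  # governance: grant to a made-up group
    spark.sql("SELECT * FROM dev.sales.orders").show()                 # catalog.schema.table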
- Remove nulls
- Add a new column => withColumn
- Clean data
- Synapse
- Snowflake
- Lakehouse => Power BI => Visualization
- First we load the data and then transform it
- We keep the data loaded...
- Read from sources like: 1. BLOB 2. SQL DB
- Copy data from source to target
- Connection string
- Write data
- Value
- Variety
- Veracity
- Volume
- Velocity
- Oracle database
- Microsoft SQL Server
- MySQL
- PostgreSQL
- MongoDB
- IBM DB2
- It supports business intelligence activities such as reporting, analysis, and decision making
- Handles only structured data; huge volume, much larger than a database
- Azure
- AWS
- GCP
- Databricks
- Snowflake
- Overall, data warehouses provide a robust and flexible solution for storing and analyzing data, and are essential for organizations that want to gain insights from their data and make informed decisions
- OLAP systems are designed to support analytical and reporting activities by providing fast and flexible access to large volumes of data
- Eg: a general store's data
- OLTP systems are designed to support day-to-day operations by processing high volumes of transactions quickly and effectively, and provide real-time access to critical business data
- A type of database processing used for real-time, transaction-oriented applications
- Eg: an online transaction app
DDL:
- Create
- Alter
- Drop [deletes the table and all its data]
- Truncate [deletes only the rows; the table structure remains]
DML:
- Insert
- Update
- Delete
DCL:
- Grant [gives access privileges]
- Revoke [takes back access permissions]
TCL:
- Commit
- Rollback
DQL / aggregates:
- Select
- Sum()
- Count()
- Min()
- Max()
- Avg()
- select concat(emp_name, '--> ', dept) from employees
- select reverse(emp_name) from employees
- select upper(emp_name) from employees
- select lower(emp_name) from employees
- select replace(emp_name, 'john', 'saurabh') from employees
- select len(emp_name) from employees
- select left(emp_name, 3) from employees -- op: JOH
- select right(emp_name, 3) from employees -- op: OHN
- select substring(emp_name, 2, 3) from employees -- op: OHN
- select charindex('O', emp_name) from employees -- op: 2 (1-based position of the first match)
- select replace(emp_name, 'JOHN', 'sauru')
- select ltrim('   sauru') -- op: 'sauru' (removes leading spaces)
- select rtrim('sauru   ') -- op: 'sauru' (removes trailing spaces)
- select stuff(emp_name, 2, 3, '!') -- op: J! (deletes 3 chars from position 2, inserts '!')
- select replicate(emp_name, 3) -- op: JohnJohnJohn
- select space(3) -- op: '   ' (three spaces)
- AND
- OR
- NOT
- LIKE
- BETWEEN
- IN, NOT IN
- Exists, Not Exists
- IS NULL, IS NOT NULL
select distinct salary from employee
select max(salary) from employee
Notes(File)
1. Celebal Training:
- Databricks
- Spark
- Cloud
- Python (Basic + Adv topics) [complete these notes at Pune]
- SQL (Basic + Adv topics) [a few topics remain; cover them..]
- DE first project
- IPL
- UBER
- Roadmap guidance + Azure AZ-900
- Data engineer Roadmap
- Celebal Mentorship Program
- Rajas De (Youtube)
- BIG DATA
- Course : Apache Spark Notebook
- Select
- Where Clause
- Order by
- Aggregation Function (Min,Max,avg,count)
- JOINS (Inner,left,full)
- Union
- Group By
- Case Statements
- Working with Datetime/Timestamp (Addition / Subtraction / Extracting Year, Month, and Day)
- IFNULL
- Coalesce
- Numbering Functions (RANK, DENSE_RANK, ROW_NUMBER)
- Conversion Functions: CAST as INT/FLOAT/STRING
- Formatting Datetime
- Creating Function
- Merge Statement (For UPSERT)
- Qualify
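A short sketch of the numbering functions and MERGE via Spark SQL (table and column names are made up; MERGE requires a Delta table):

    spark.sql("""
        SELECT emp_name, salary,
               RANK()       OVER (ORDER BY salary DESC) AS rnk,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk,
               ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
        FROM employees
    """).show()

    spark.sql("""
        MERGE INTO employees t
        USING staged_updates s
        ON t.emp_id = s.emp_id
        WHEN MATCHED THEN UPDATE SET t.salary = s.salary
        WHEN NOT MATCHED THEN INSERT *
    """)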