[STRUCTURED] Data with Darshil [Apache Spark with Databricks for Data Engineering]

[STRUCTURED] Apache Spark (Quick in 10 Min)

# Reference => Apache Spark in 10 Min ~ DatawithDarshil

Index

  1. Intro
  2. Hadoop
  3. Apache Spark
1: Intro
  • Big Data: needs to be organized to extract business insights
  • Old Tech: Hadoop
  • New Tech: Apache Spark [in-memory processing, supports streaming], Databricks [cloud-based platform; supports Apache Spark and other big data frameworks], Delta Lake [ACID transactions], Lakehouse [Data Lake + Data Warehouse]
2: Hadoop [Batch Processing]

>> Components
  1. HDFS       => "Storage"
  2. MapReduce  => "Process"
>> Limitations
  1. Batches    => "Wait..."
  2. Disk       => "Slow"
3: Apache Spark [In-Memory Processing]

>> Components
  1. Driver Process    => "Manages Tasks" [HEART: manages the application]
  2. Executor Process  => "Actual Work" [runs on Worker Nodes]
>> Features 
  1. Memory => "RDD" [Backbone; data is processed in memory, see the sketch below]
  2. Speed  => roughly 10x faster than Hadoop MapReduce on disk, far more in memory
  3. Multi-language support
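
Since RDDs are the backbone here, a minimal sketch of creating and transforming one (the variable names are illustrative):

# Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An RDD is a distributed collection held in memory across executors
nums = spark.sparkContext.parallelize(range(10))

# map/filter run in parallel on the RDD's partitions
even_squares = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())  # [0, 4, 16, 36, 64]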
> Architecture">
>> Architecture
  • Driver process
  • Cluster Manager (manages both driver and executors)
  • Executor process

Photo: [Spark architecture diagram: Driver process, Cluster Manager, Executor processes]

Cluster Manager: manages and coordinates the execution of tasks across a cluster of computers.
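
A minimal sketch of how the cluster manager is chosen when a session is built (the local[4] master is just an assumption for a single machine):

# Code:
from pyspark.sql import SparkSession

# .master() tells Spark which cluster manager to connect to:
#   "local[4]"          = run locally with 4 threads (no real cluster manager)
#   "yarn"              = Hadoop YARN
#   "spark://host:7077" = Spark standalone cluster manager
spark = (
    SparkSession.builder
    .appName("demo")
    .master("local[4]")
    .getOrCreate()
)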

>> How does Apache Spark execute code in parallel?

> Compulsory Step:
  • Create a Spark session
> Spark session [multi-language support]:
  • Python, Java, Scala
> Sample Code of Spark Session
# Code:
from pyspark.sql import SparkSession

# Entry point for every Spark application
spark = SparkSession.builder.getOrCreate()

# A DataFrame with one column "number" holding the values 0..999
myRange = spark.range(1000).toDF("number")
myRange.show()


>>> What's a DataFrame in Python vs. Spark?
DataFrame => rows and columns, e.g. an MS Excel sheet
Python (pandas) : lives on a single computer
Spark           : distributed across multiple computers
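
A minimal sketch of moving between the two worlds (assumes pandas is installed alongside PySpark; the data is illustrative):

# Code:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas DataFrame: all rows sit in one machine's memory
pdf = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [30, 25]})

# Spark DataFrame: the same rows, partitioned across the cluster
sdf = spark.createDataFrame(pdf)
sdf.show()

# Back to pandas (collects everything to the driver; only for small data)
pdf2 = sdf.toPandas()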


>>> Terms
Lazy Evaluation : Spark builds a DAG of transformations and runs nothing until an action is called (see the sketch below)
Transformation  : e.g. filter by gender
Action          : e.g. .show()
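
A minimal sketch of the three terms working together (the gender column and values are illustrative):

# Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Asha", "F"), ("Ravi", "M"), ("Meera", "F")],
    ["name", "gender"],
)

# Transformation: nothing runs yet; Spark only records this step in the DAG
females = df.filter(df.gender == "F")

# Action: triggers the whole DAG to actually execute
females.show()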

>>> Mini-project on Spark preprocessing
Steps
  1. Create a Spark session
  2. Import the dataset
  3. Convert it to a table
  4. Write SQL queries on it
Addn: You can also run NumPy/pandas-style functions on it.
# Code
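A minimal sketch of those four steps, assuming a hypothetical people.csv with a gender column:

from pyspark.sql import SparkSession

# 1. Create a Spark session
spark = SparkSession.builder.getOrCreate()

# 2. Import the dataset (people.csv is an assumed example file)
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# 3. Convert it to a table (a temporary view queryable with SQL)
df.createOrReplaceTempView("people")

# 4. Write a SQL query on it
spark.sql("SELECT gender, COUNT(*) AS cnt FROM people GROUP BY gender").show()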

>>> Apache Spark Optimizing Techniques (sketch below)
  • coalesce()    : reduces the number of partitions by merging them, avoiding a full shuffle (less shuffling)
  • repartition() : changes the number of partitions with a full shuffle (more shuffling)
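
A minimal sketch contrasting the two (the partition counts are arbitrary):

# Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)

# repartition(): full shuffle; can increase or decrease the partition count
df_more = df.repartition(16)

# coalesce(): merges existing partitions; only decreases, avoids a full shuffle
df_less = df_more.coalesce(4)

print(df_less.rdd.getNumPartitions())  # 4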

____________________________________________________________________________

# Reference Darshil Notes : Obsidian Link

Main Topics in Apache Spark in this Course: [Total 38 Topics] [Doing...]

A] Apache Spark Guide [1-8]
1. What is Apache Spark?
2. Why do we need Spark? The Big Data Problem
3. Spark Architecture
4. Spark DataFrame
5. Partitions
6. Transformations
7. Lazy Evaluation
8. Actions

B] End to End Example [9]
9. End-to-End Example

C] Structured API [10-16]
10. Structured API Overview
11. Basic Structured Operations
12. Working with Different Types of Data
13. User Defined Functions
14. Joins
15. Data Sources
16. Spark SQL

D] Lower Level API [17-19]
17. Resilient Distributed Datasets (RDDs)
18. Advanced RDDs
19. Distributed Shared Variables

E] Production Applications (Deployment and Debugging) [20-24]
20. How Spark Runs on a Cluster
21. The Lifecycle of a Spark Application
22. Spark Deployment
23. Monitoring and Debugging
24. Debugging and Common Errors

F] Databricks [25-38]
25. Overview of Databricks and Its Ecosystem
26. Databricks Architecture
27. Databricks Lakehouse Architecture
28. Setting Up the Databricks Environment
29. Understanding the Databricks Workspace
30. Creating a Cluster
31. Databricks Notebooks and the File System (DBFS)
32. Delta Lake
33. Advanced Delta Lake (Time Travel, Optimize, Vacuum)
34. Databases and Tables on Databricks
35. Views in Databricks
36. Delta Tables (Working with Files)
37. Medallion Architecture Complete Guide
38. In-Depth Parquet Files

_________________________________________________________________________________

Now My Language [English + Marathi + Hindi Notes]





