[STRUCTURED] Data with Darshil [Apache Spark with Databricks for Data Engineering]

[STRUCTURED] Apache Spark (Quick in 10 Min)

# Reference => Apache Spark in 10 Min ~ DatawithDarshil

Index

  1. Intro
  2. Hadoop
  3. Apache Spark
1: Intro
  • Big Data: needs to be organized to extract business insights
  • Old Tech: Hadoop
  • New Tech: Apache Spark [in-memory processing, supports streaming], Databricks [cloud-based platform; supports Apache Spark and other big data frameworks], Delta Lake [ACID transactions], Lakehouse [Data Lake + Data Warehouse]
2: Hadoop [Batch Processing]

>> Components
  1. HDFS       => "Storage"
  2. MapReduce  => "Process"
>> Limitations
  1. Batches    => "Wait..."
  2. Disk       => "Slow"
3: Apache Spark [In-Memory Processing]

>> Components
  1. Driver Process    => "Manages Tasks" [HEART: manages the application]
  2. Executor Process  => "Actual Work" [runs on Worker Nodes]
>> Features 
  1. Memory => "RDD" [Backbone; data is processed in memory, see the sketch below]
  2. Speed  => roughly 10x faster than Hadoop MapReduce on disk, far more in memory
  3. Multi-language support
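
Since RDDs are the backbone here, a minimal sketch of creating and transforming one (the variable names are illustrative):

# Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An RDD is a distributed collection held in memory across executors
nums = spark.sparkContext.parallelize(range(10))

# map/filter run in parallel on the RDD's partitions
even_squares = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())  # [0, 4, 16, 36, 64]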
> Architecture">
>> Architecture
  • Driver process
  • Cluster Manager (manages both driver and executors)
  • Executor process

Photo: [Spark architecture diagram: Driver process, Cluster Manager, Executor processes]

Cluster Manager: manages and coordinates the execution of tasks across a cluster of computers.
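
A minimal sketch of how the cluster manager is chosen when a session is built (the local[4] master is just an assumption for a single machine):

# Code:
from pyspark.sql import SparkSession

# .master() tells Spark which cluster manager to connect to:
#   "local[4]"          = run locally with 4 threads (no real cluster manager)
#   "yarn"              = Hadoop YARN
#   "spark://host:7077" = Spark standalone cluster manager
spark = (
    SparkSession.builder
    .appName("demo")
    .master("local[4]")
    .getOrCreate()
)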

>> How does Apache Spark execute code in parallel?

> Compulsory Step:
  • Create a Spark session
> Spark session [multi-language support]:
  • Python, Java, Scala
> Sample Code of Spark Session
# Code:
from pyspark.sql import SparkSession

# Entry point for every Spark application
spark = SparkSession.builder.getOrCreate()

# A DataFrame with one column "number" holding the values 0..999
myRange = spark.range(1000).toDF("number")
myRange.show()


>>> What's a DataFrame in Python vs. Spark?
DataFrame => rows and columns, e.g. an MS Excel sheet
Python (pandas) : lives on a single computer
Spark           : distributed across multiple computers
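
A minimal sketch of moving between the two worlds (assumes pandas is installed alongside PySpark; the data is illustrative):

# Code:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# pandas DataFrame: all rows sit in one machine's memory
pdf = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [30, 25]})

# Spark DataFrame: the same rows, partitioned across the cluster
sdf = spark.createDataFrame(pdf)
sdf.show()

# Back to pandas (collects everything to the driver; only for small data)
pdf2 = sdf.toPandas()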


>>> Terms
Lazy Evaluation : Spark builds a DAG of transformations and runs nothing until an action is called (see the sketch below)
Transformation  : e.g. filter by gender
Action          : e.g. .show()
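
A minimal sketch of the three terms working together (the gender column and values are illustrative):

# Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Asha", "F"), ("Ravi", "M"), ("Meera", "F")],
    ["name", "gender"],
)

# Transformation: nothing runs yet; Spark only records this step in the DAG
females = df.filter(df.gender == "F")

# Action: triggers the whole DAG to actually execute
females.show()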

>>> Mini-project on Spark preprocessing
Steps
  1. Create a Spark session
  2. Import the dataset
  3. Convert it to a table
  4. Write SQL queries on it
Addn: You can also run NumPy/pandas-style functions on it.
# Code
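A minimal sketch of those four steps, assuming a hypothetical people.csv with a gender column:

from pyspark.sql import SparkSession

# 1. Create a Spark session
spark = SparkSession.builder.getOrCreate()

# 2. Import the dataset (people.csv is an assumed example file)
df = spark.read.csv("people.csv", header=True, inferSchema=True)

# 3. Convert it to a table (a temporary view queryable with SQL)
df.createOrReplaceTempView("people")

# 4. Write a SQL query on it
spark.sql("SELECT gender, COUNT(*) AS cnt FROM people GROUP BY gender").show()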

>>> Apache Spark Optimizing Techniques (sketch below)
  • coalesce()    : reduces the number of partitions by merging them, avoiding a full shuffle (less shuffling)
  • repartition() : changes the number of partitions with a full shuffle (more shuffling)
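
A minimal sketch contrasting the two (the partition counts are arbitrary):

# Code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)

# repartition(): full shuffle; can increase or decrease the partition count
df_more = df.repartition(16)

# coalesce(): merges existing partitions; only decreases, avoids a full shuffle
df_less = df_more.coalesce(4)

print(df_less.rdd.getNumPartitions())  # 4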

____________________________________________________________________________

# Reference Darshil Notes : Obsidian Link

Main Topics in Apache Spark in this Course: [Total 38 Topics] [Doing...]

A] Apache Spark Guide [1-8]
1. What is Apache Spark?
2. Why do we need Spark? The Big Data Problem
3. Spark Architecture
4. Spark DataFrame
5. Partitions
6. Transformations
7. Lazy Evaluation
8. Actions

B] End to End Example [9]
9. End-to-End Example

C] Structured API [10-16]
10. Structured API Overview
11. Basic Structured Operations
12. Working with Different Types of Data
13. User Defined Functions
14. Joins
15. Data Sources
16. Spark SQL

D] Lower Level API [17-19]
17. Resilient Distributed Datasets (RDDs)
18. Advanced RDDs
19. Distributed Shared Variables

E] Production Applications (Deployment and Debugging) [20-24]
20. How Spark Runs on a Cluster
21. The Lifecycle of a Spark Application
22. Spark Deployment
23. Monitoring and Debugging
24. Debugging and Common Errors

F] Databricks [25-38]
25. Overview of Databricks and Its Ecosystem
26. Databricks Architecture
27. Databricks Lakehouse Architecture
28. Setting Up the Databricks Environment
29. Understanding the Databricks Workspace
30. Creating a Cluster
31. Databricks Notebooks and the File System (DBFS)
32. Delta Lake
33. Advanced Delta Lake (Time Travel, Optimize, Vacuum)
34. Databases and Tables on Databricks
35. Views in Databricks
36. Delta Tables (Working with Files)
37. Medallion Architecture Complete Guide
38. In-Depth Parquet Files

_________________________________________________________________________________

Now My Language [English + Marathi + Hindi Notes]





