Posts

Showing posts from August, 2024

[STRUCTURED] Data with Darshil [Apache Spark With Databricks For Data Engineering ]

[STRUCTURED Apache Spark] (Quick in 10 Min)
# Reference => Apache in 10 Min ~ DatawithDarshil
Index: Intro, Hadoop, Apache Spark

[ STRUCTURED Databricks ]

>>>>> Databricks Cluster UI (Ref: YouTube video by Mr.Ktalks Tech)
# Cluster Diagram Explanation:
Cluster: a set of VMs (nodes) that host the executors
1 Node = 1 Executor
Each Executor: has 1 or more cores
1 Core = 1 Partition (a core processes one partition at a time)
Each Core: takes 1 task at a time and is the unit of parallelism
Driver:
Step 1: Write the code; it runs on the driver
Step 2: Everything gets divided into STAGES and TASKS with the help of the DAG
DAG: divides every job into stages and tasks
Each TASK: executes on an executor
Each EXECUTOR: is made up of cores
Each CORE: defines the degree of parallelism when the JOB RUNS
# Cluster: a set of virtual machines that do the work
Creating compute resources for processing big data
Types: 1] All-Purpose Compute 2] Job Compute
1] All-Purpose Compute [Everything]
Analyze data in notebooks (NB)
Create, terminate and restart
Cost: Expensive
2] Job Compute [Just to run an NB as a job with an ADF pipeline and Databricks]
Only supports running an NB as a job
No restart
Cost : L...
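The node, executor, core and task relationship in the notes above can be checked with a little arithmetic. This is only a rough sketch in plain Python, not Spark code: the function names `task_slots` and `waves` and the sample numbers (4 workers, 8 cores, 200 partitions) are made up for illustration, and it assumes the "1 Node = 1 Executor" layout from the notes.

```python
import math


def task_slots(worker_nodes: int, cores_per_executor: int) -> int:
    """Tasks that can run at the same time, assuming 1 executor per node
    and 1 task per core (the layout described in the notes)."""
    return worker_nodes * cores_per_executor


def waves(num_partitions: int, slots: int) -> int:
    """Rounds of tasks needed when each partition becomes one task."""
    return math.ceil(num_partitions / slots)


# Hypothetical cluster: 4 worker nodes, 8 cores each, 200 partitions in a stage.
slots = task_slots(worker_nodes=4, cores_per_executor=8)
print(slots)              # 32 tasks can run in parallel
print(waves(200, slots))  # 7 rounds to finish 200 tasks
```

With fewer partitions than slots, some cores sit idle, which is why the core count sets the upper bound on parallelism for a job.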

[Unstructured to Structured ]Learn Relearn and Unlearn Technology

# Terms Used by Data Engineers
> Tables: structured data only
> Volume: all types [structured + unstructured]
> Catalog Binding: restricts user permissions by category: 1. production 2. development 3. testing
> SCIM Provisioning: no need to explicitly link/sync Azure Active Directory; when a resource leaves the company, we don't need to explicitly remove their ID
> UC: Storage Credentials, External Locations
Celebal Tech Utility (UNITY LAUNCHER) => UCX is similar to Unity Launcher (UR Utility)
> Sync and deep clone
> Migrating Notebooks from Hive to UC [Steps]:
Two-level --> three-level namespace (UC only supports the three-level namespace)
If we use mounts: replace them with external locations
If we use RDDs: replace them with DataFrames
> Group vs Service Principal
Group: handled by users
Service Principal: handled by machines
> CI/CD and Branching Strategies in Databricks
> Q: Ho...
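The two-level to three-level namespace step in the notes above is essentially a rename of every table reference. Here is a minimal sketch in plain Python, assuming references are simple dot-separated names without backtick quoting; the helper `to_three_level` and the catalog name `main` are hypothetical, chosen only for illustration.

```python
def to_three_level(name: str, catalog: str) -> str:
    """Turn a two-level `schema.table` name into `catalog.schema.table`.

    Names that already have three levels are returned unchanged.
    """
    parts = name.split(".")
    if len(parts) == 3:
        return name
    if len(parts) == 2:
        return f"{catalog}.{name}"
    raise ValueError(f"expected schema.table or catalog.schema.table, got {name!r}")


# Hypothetical table names found in an old notebook.
for old in ["sales.orders", "main.sales.orders"]:
    print(old, "->", to_three_level(old, catalog="main"))
```

A real migration would also need to handle quoted identifiers and names built dynamically in code, which this sketch does not attempt.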