Data Engineering Using Databricks on AWS and Azure
Last updated 3/2023
MP4 | Video: h264, 1280x720 | Audio: AAC, 44.1 kHz
Language: English (US) | Size: 10.36 GB | Duration: 18h 57m
Build Data Engineering Pipelines using Databricks core features such as Spark, Delta Lake, cloudFiles, etc.
What you'll learn
Data Engineering leveraging Databricks features
Databricks CLI to manage files, Data Engineering jobs and clusters for Data Engineering Pipelines
Deploying Data Engineering applications developed using PySpark on job clusters
Deploying Data Engineering applications developed using PySpark using Notebooks on job clusters
Perform CRUD Operations leveraging Delta Lake using Spark SQL for Data Engineering Applications or Pipelines
Perform CRUD Operations leveraging Delta Lake using PySpark for Data Engineering Applications or Pipelines
Setting up development environment to develop Data Engineering applications using Databricks
Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.
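To make the Auto Loader items above more concrete, here is a minimal sketch, not taken from the course material, of how incremental file processing with cloudFiles is commonly wired up in PySpark. The helper names (autoloader_options, read_incremental, write_once) and the choice of JSON input are assumptions made for illustration. The cloudFiles.* option keys are the ones documented for Auto Loader, and the stream itself can only run on a Databricks cluster, where the cloudFiles source is available.

```python
# Illustrative sketch only: the helper names are hypothetical, and the
# streaming calls need a Databricks cluster that provides "cloudFiles".

def autoloader_options(file_format, schema_location, use_notifications=False):
    """Build the option map for an Auto Loader (cloudFiles) source.

    use_notifications=False keeps the default directory listing mode;
    True switches file discovery to file notifications.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.useNotifications": "true" if use_notifications else "false",
    }


def read_incremental(spark, source_path, options):
    """Return a streaming DataFrame over files not yet processed."""
    return (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )


def write_once(stream_df, target_path, checkpoint_path):
    """Write the newly discovered data in Delta format, then stop."""
    return (
        stream_df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start(target_path)
    )
```

On a cluster, a run might chain these as write_once(read_incremental(spark, source, autoloader_options("json", schema_dir)), target, checkpoint_dir), where source, schema_dir, target, and checkpoint_dir are placeholder storage paths. Running the same call again after new files arrive should load only those files, because the checkpoint records what has already been ingested.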
Requirements
Programming experience using Python
Data Engineering experience using Spark
Ability to write and interpret SQL Queries
This course is ideal for experienced data engineers who want to add Databricks as a key skill on their profile
Description
As part of this course, you will learn Data Engineering using a cloud platform-agnostic technology called Databricks.

About Data Engineering
Data Engineering is the processing of data to meet downstream needs. As part of Data Engineering, we need to build different kinds of pipelines, such as Batch Pipelines and Streaming Pipelines. All roles related to Data Processing are consolidated under Data Engineering. Conventionally, they are known as ETL Development, Data Warehouse Development, etc.

About Databricks
Databricks is a popular cloud platform-agnostic data engineering tech stack. Its engineers are committers to the Apache Spark project. The Databricks Runtime provides Spark while leveraging the elasticity of the cloud. With Databricks, you pay for what you use. Over time, Databricks introduced the Lakehouse concept, providing the features required for traditional BI as well as AI and ML. Here are some of the core features of Databricks.
Spark - Distributed Computing
Delta Lake - Perform CRUD Operations. It is primarily used to build capabilities such as inserting, updating, and deleting the data from files in Data Lake.
cloudFiles - Get the files in an incremental fashion in the most efficient way leveraging cloud features.
Databricks SQL - A Photon-based interface that is fine-tuned for running queries submitted for reporting and visualization by reporting tools. It is also used for ad hoc analysis.

Course Details
As part of this course, you will be learning Data Engineering using Databricks.
Getting Started with Databricks
Setup Local Development Environment to develop Data Engineering Applications using Databricks
Using Databricks CLI to manage files, jobs, clusters, etc. related to Data Engineering Applications
Spark Application Development Cycle to build Data Engineering Applications
Databricks Jobs and Clusters
Deploy and Run Data Engineering Jobs on Databricks Job Clusters as Python Application
Deploy and Run Data Engineering Jobs on Databricks Job Clusters using Notebooks
Deep Dive into Delta Lake using Dataframes on Databricks Platform
Deep Dive into Delta Lake using Spark SQL on Databricks Platform
Building Data Engineering Pipelines using Spark Structured Streaming on Databricks Clusters
Incremental File Processing using Spark Structured Streaming leveraging Databricks Auto Loader cloudFiles
Overview of Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between Auto Loader cloudFiles File Discovery Modes - Directory Listing and File Notifications
Differences between traditional Spark Structured Streaming and leveraging Databricks Auto Loader cloudFiles for incremental file processing.
Overview of Databricks SQL for Data Analysis and reporting.
We will be adding a few more modules related to PySpark, Spark with Scala, Spark SQL, and Streaming Pipelines in the coming weeks.

Desired Audience
Here is the desired audience for this advanced course.
Experienced application developers with prior knowledge and experience of Spark who want to gain expertise in Data Engineering.
Experienced Data Engineers who want to gain enough skills to add Databricks to their profile.
Testers who want to improve their testing capabilities related to Data Engineering applications using Databricks.

Prerequisites
Logistics
Computer with a decent configuration (at least 4 GB RAM; 8 GB is highly desired)
Dual Core is required and Quad-Core is highly desired
Chrome Browser
High-Speed Internet
Valid AWS Account
Valid Databricks Account (a free Databricks Account is not sufficient)
Experience as a Data Engineer, especially using Apache Spark
Knowledge of cloud concepts such as storage, users, roles, etc.

Associated Costs
As part of the training, you will only get the material. You need to practice on your own or corporate cloud account and Databricks Account.
You need to take care of the associated AWS or Azure costs.
You need to take care of the associated Databricks costs.

Training Approach
Here are the details related to the training approach.
It is self-paced with reference material, code snippets, and videos provided as part of Udemy.
One needs to sign up for their own Databricks environment to practice all the core features of Databricks.
We recommend completing 2 modules every week by spending 4 to 5 hours per week.
It is highly recommended to complete all the tasks so that one can get real experience with Databricks.
Support will be provided through Udemy Q&A.

Here is the detailed course outline.

Getting Started with Databricks on Azure
As part of this section, we will go through the details of signing up for Azure and setting up the Databricks cluster on Azure.
Getting Started with Databricks on Azure
Signup for the Azure Account
Login and Increase Quotas for regional vCPUs in Azure
Create Azure Databricks Workspace
Launching Azure Databricks Workspace or Cluster
Quick Walkthrough of Azure Databricks UI
Create Azure Databricks Single Node Cluster
Upload Data using Azure Databricks UI
Overview of Creating Notebook and Validating Files using Azure Databricks
Develop Spark Application using Azure Databricks Notebook
Validate Spark Jobs using Azure Databricks Notebook
Export and Import of Azure Databricks Notebooks
Terminating Azure Databricks Cluster and Deleting Configuration
Delete Azure Databricks Workspace by deleting Resource Group

Azure Essentials for Databricks - Azure CLI
As part of this section, we will go through the details of setting up Azure CLI to manage Azure resources using relevant commands.
Azure Essentials for Databricks - Azure CLI
Azure CLI using Azure Portal Cloud Shell
Getting Started with Azure CLI on Mac
Getting Started with Azure CLI on Windows
Warming up with Azure CLI - Overview
Create Resource Group using Azure CLI
Create ADLS Storage Account within Resource Group
Add Container as part of Storage Account
Overview of Uploading the data into ADLS File System or Container
Setup Data Set locally to upload into ADLS File System or Container
Upload local directory into Azure ADLS File System or Container
Delete Azure ADLS Storage Account using Azure CLI
Delete Azure Resource Group using Azure CLI

Mount ADLS on to Azure Databricks to access files from Azure Blob Storage
As part of this section, we will go through the details related to mounting Azure Data Lake Storage (ADLS) on to Azure Databricks Clusters.
Mount ADLS on to Azure Databricks - Introduction
Ensure Azure Databricks Workspace
Setup Databricks CLI on Mac or Windows using Python Virtual Environment
Configure Databricks CLI for new Azure Databricks Workspace
Register an Azure Active Directory Application
Create Databricks Secret for AD Application Client Secret
Create ADLS Storage Account
Assign IAM Role on Storage Account to Azure AD Application
Setup Retail DB Dataset
Create ADLS Container or File System and Upload Data
Start Databricks Cluster to mount ADLS
Mount ADLS Storage Account on to Azure Databricks
Validate ADLS Mount Point on Azure Databricks Clusters
Unmount the mount point from Databricks
Delete Azure Resource Group used for Mounting ADLS on to Azure Databricks

Setup Local Development Environment for Databricks
As part of this section, we will go through the details of setting up a local development environment for Databricks using tools such as PyCharm, Databricks dbconnect, Databricks dbutils, etc.
Setup Single Node Databricks Cluster
Install Databricks Connect
Configure Databricks Connect
Integrating PyCharm with Databricks Connect
Integrate Databricks Cluster with Glue Catalog
Setup AWS S3 Bucket and Grant Permissions
Mounting S3 Buckets into Databricks Clusters
Using Databricks dbutils from IDEs such as PyCharm

Using Databricks CLI
As part of this section, we will get an overview of Databricks CLI to interact with Databricks File System or DBFS.
Introduction to Databricks CLI
Install and Configure Databricks CLI
Interacting with Databricks File System using Databricks CLI
Getting Databricks Cluster Details using Databricks CLI

Databricks Jobs and Clusters
As part of this section, we will go through the details related to Databricks Jobs and Clusters.
Introduction to Databricks Jobs and Clusters
Creating Pools in Databricks Platform
Create Cluster on Azure Databricks
Request to Increase CPU Quota on Azure
Creating Job on Databricks
Submitting Jobs using Databricks Job Cluster
Create Pool in Databricks
Running Job using Interactive Databricks Cluster Attached to Pool
Running Job Using Databricks Job Cluster Attached to Pool
Exercise - Submit the application as a job using Databricks interactive cluster

Deploy and Run Spark Applications on Databricks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications.
Prepare PyCharm for Databricks
Prepare Data Sets
Move files to ghactivity
Refactor Code for Databricks
Validating Data using Databricks
Setup Data Set for Production Deployment
Access File Metadata using Databricks dbutils
Build Deployable bundle for Databricks
Running Jobs using Databricks Web UI
Get Job and Run Details using Databricks CLI
Submitting Databricks Jobs using CLI
Setup and Validate Databricks Client Library
Resetting the Job using Databricks Jobs API
Run Databricks Job programmatically using Python
Detailed Validation of Data using Databricks Notebooks

Deploy and Run Spark Jobs using Notebooks
As part of this section, we will go through the details related to deploying Spark Applications on Databricks Clusters and also running those applications using Databricks Notebooks.
Modularizing Databricks Notebooks
Running Job using Databricks Notebook
Refactor application as Databricks Notebooks
Run Notebook using Databricks Development Cluster

Deep Dive into Delta Lake using Spark Data Frames on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark Data Frames.
Introduction to Delta Lake using Spark Data Frames on Databricks
Creating Spark Data Frames for Delta Lake on Databricks
Writing Spark Data Frame using Delta Format on Databricks
Updating Existing Data using Delta Format on Databricks
Delete Existing Data using Delta Format on Databricks
Merge or Upsert Data using Delta Format on Databricks
Deleting using Merge in Delta Lake on Databricks
Point in Snapshot Recovery using Delta Logs on Databricks
Deleting unnecessary Delta Files using Vacuum on Databricks
Compaction of Delta Lake Files on Databricks

Deep Dive into Delta Lake using Spark SQL on Databricks
As part of this section, we will go through all the important details related to Databricks Delta Lake using Spark SQL.
Introduction to Delta Lake using Spark SQL on Databricks
Create Delta Lake Table using Spark SQL on Databricks
Insert Data to Delta Lake Table using Spark SQL on Databricks
Update Data in Delta Lake Table using Spark SQL on Databricks
Delete Data from Delta Lake Table using Spark SQL on Databricks
Merge or Upsert Data into Delta Lake Table using Spark SQL on Databricks
Using Merge Function over Delta Lake Table using Spark SQL on Databricks
Point in Snapshot Recovery using Delta Lake Table using Spark SQL on Databricks
Vacuuming Delta Lake Tables using Spark SQL on Databricks
Compaction of Delta Lake Tables using Spark SQL on Databricks

Accessing Databricks Cluster Terminal via Web as well as SSH
As part of this section, we will see how to access the terminal of a Databricks Cluster via Web as well as SSH.
Enable Web Terminal in Databricks Admin Console
Launch Web Terminal for Databricks Cluster
Setup SSH for the Databricks Cluster Driver Node
Validate SSH Connectivity to the Databricks Driver Node on AWS
Limitations of SSH and comparison with Web Terminal related to Databricks Clusters

Installing Software on Databricks Clusters using init scripts
As part of this section, we will see how to bootstrap Databricks clusters by installing relevant 3rd party libraries for our applications.
Setup gen_logs on Databricks Cluster
Overview of Init Scripts for Databricks Clusters
Create Script to install software from git on Databricks Cluster
Copy init script to DBFS location
Create Databricks Standalone Cluster with init script

Quick Recap of Spark Structured Streaming
As part of this section, we will get a quick recap of Spark Structured Streaming.
Validate Netcat on Databricks Driver Node
Push log messages to Netcat Webserver on Databricks Driver Node
Reading Web Server logs using Spark Structured Streaming
Writing Streaming Data to Files

Incremental Loads using Spark Structured Streaming on Databricks
As part of this section, we will understand how to perform incremental loads using Spark Structured Streaming on Databricks.
Overview of Spark Structured Streaming
Steps for Incremental Data Processing on Databricks
Configure Databricks Cluster with Instance Profile
Upload GHArchive Files to AWS S3 using Databricks Notebooks
Read JSON Data using Spark Structured Streaming on Databricks
Write using Delta file format using Trigger Once on Databricks
Analyze GHArchive Data in Delta files using Spark on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Validate Incremental Load on Databricks
Internals of Spark Structured Streaming File Processing on Databricks

Incremental Loads using Auto Loader cloudFiles on Databricks
As part of this section, we will see how to perform incremental loads using Auto Loader cloudFiles on Databricks Clusters.
Overview of Auto Loader cloudFiles on Databricks
Upload GHArchive Files to S3 on Databricks
Write Data using Auto Loader cloudFiles on Databricks
Add New GHActivity JSON files on Databricks
Load Data Incrementally to Target Table on Databricks
Add New GHActivity JSON files on Databricks
Overview of Handling S3 Events using AWS Services on Databricks
Configure IAM Role for cloudFiles file notifications on Databricks
Incremental Load using cloudFiles File Notifications on Databricks
Review AWS Services for cloudFiles Event Notifications on Databricks
Review Metadata Generated for cloudFiles Checkpointing on Databricks

Overview of Databricks SQL Clusters
As part of this section, we will get an overview of Databricks SQL Clusters.
Overview of Databricks SQL Platform - Introduction
Run First Query using SQL Editor of Databricks SQL
Overview of Dashboards using Databricks SQL
Overview of Databricks SQL Data Explorer to review Metastore Databases and Tables
Use Databricks SQL Editor to develop scripts or queries
Review Metadata of Tables using Databricks SQL Platform
Overview of loading data into retail_db tables
Configure Databricks CLI to push data into the Databricks Platfor