Databricks Overview

What is Databricks?

Databricks, developed by the original creators of Apache Spark, offers a comprehensive platform for the full data lifecycle: storage, SQL analytics with Spark SQL, visualization through tools such as Tableau and Power BI, and model building with SparkML. It facilitates cross-functional collaboration among data analytics, data engineering, and data science teams, and it provides a compelling, far simpler alternative to the Hadoop MapReduce stack.

The platform seamlessly integrates with leading cloud services such as Microsoft Azure, Google Cloud Platform, and Amazon Web Services, enabling efficient handling of large datasets and model execution.

For data engineers and data scientists, Databricks serves as an indispensable tool for everyday tasks, empowering them to develop applications using languages like R, SQL, Scala, or Python.
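
To make that concrete, here is a minimal sketch of the kind of PySpark code a data engineer might run in a Databricks notebook. It is illustrative only: the table name (sales.orders) and its columns (order_ts, amount) are assumptions, and spark is the SparkSession object that Databricks notebooks provide automatically.

```python
# Runs in a Databricks notebook, where `spark` (a SparkSession) is predefined.
# Hypothetical example: aggregate daily order totals from an assumed table.
from pyspark.sql import functions as F

orders = spark.table("sales.orders")  # assumed table name

daily_totals = (
    orders
    .groupBy(F.to_date("order_ts").alias("order_date"))  # assumed columns
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy("order_date")
)

daily_totals.show()  # or display(daily_totals) for the notebook's rich output
```

The same logic could be expressed in SQL, Scala, or R against the same cluster, which is what makes the platform workable for mixed-language teams.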

Databricks simplifies big data analytics through its lakehouse architecture, which bridges the gap between data lakes and data warehouses. By unifying the two, it helps eliminate the data silos that commonly arise when they are maintained as separate systems.
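
To see what bridging lakes and warehouses looks like in practice, the hedged sketch below writes a DataFrame to Delta Lake, the open table format that underpins the Databricks lakehouse, and then queries the same table with SQL. The schema and table names are illustrative.

```python
# Hypothetical illustration of the lakehouse pattern in a Databricks notebook:
# one Delta table backs both SQL analytics and Python/ML workloads.
spark.sql("CREATE SCHEMA IF NOT EXISTS demo")

events = spark.createDataFrame(
    [("2024-01-01", "click"), ("2024-01-01", "purchase")],
    ["event_date", "event_type"],
)

# Persist as a managed table (Delta is the default table format on Databricks).
events.write.mode("overwrite").saveAsTable("demo.events")

# The same table is immediately queryable with warehouse-style SQL...
spark.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM demo.events
    GROUP BY event_type
""").show()

# ...and usable as a DataFrame for Python or SparkML work.
df = spark.table("demo.events")
```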

What are Databricks Clusters?

A Databricks cluster is a set of compute resources on which workloads run, either as commands in a notebook or as automated jobs. There are two types of Databricks clusters:

  • All-purpose Clusters: These clusters are designed for collaborative data analysis using interactive notebooks, and they can be created through the UI, CLI, or REST API (see the sketch after this list). All-purpose clusters can be manually terminated and restarted, and multiple users can share them for interactive, collaborative work.

  • Job Clusters: These clusters are designed for running fast, reliable automated jobs. The job scheduler creates one when a job is launched on a new job cluster and terminates it automatically once the job completes; a job cluster cannot be restarted after it is terminated.
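
As referenced above, clusters can be created programmatically. The sketch below uses the Clusters REST API from Python; the workspace URL, access token, runtime version, and node type are all placeholders that vary by workspace and cloud, so treat this as a template rather than a copy-paste recipe.

```python
# Hedged sketch: creating an all-purpose cluster via the Clusters REST API.
import requests

WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                               # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "interactive-analysis",
        "spark_version": "13.3.x-scala2.12",  # example runtime label
        "node_type_id": "Standard_DS3_v2",    # example Azure VM type
        "num_workers": 2,
        "autotermination_minutes": 60,        # avoid paying for idle compute
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])  # ID of the newly created cluster
```

A job cluster is specified in much the same way, as a new_cluster block inside a job definition, with the scheduler handling its lifecycle.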

Databricks Architecture

Databricks is designed to facilitate secure collaboration across different teams and functions.

  1. Control Plane - The control plane comprises the backend services that Databricks manages in your Databricks account. It stores notebook commands and various workspace configurations, encrypted at rest.

  2. Compute Plane - In Azure Databricks, compute resources for most workloads run inside your own Azure subscription, in what is known as the classic compute plane, along with the network infrastructure and associated resources. Azure Databricks uses the classic compute plane to run notebooks and jobs and to host both pro and classic Databricks SQL warehouses.

What are Cluster Node Types?

  1. Driver Node - The driver node maintains the state information of all notebooks attached to the cluster. It also hosts the Apache Spark master, which maintains the SparkContext and coordinates tasks with the Spark executors.

  2. Worker Node - Worker nodes run the Spark executors; because Databricks runs one executor per worker node, the terms executor and worker are often used interchangeably. When you leverage Spark's distributed computing, the bulk of the processing happens on the worker nodes (see the sketch after this list).

  3. GPU Instance Types - Databricks offers GPU-accelerated clusters tailored for computationally intensive tasks, particularly beneficial for high-performance computing needs like deep learning.
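
The division of labor between driver and workers is easiest to see in code. In this hedged sketch, the DataFrame definition happens on the driver, while the action at the end triggers distributed tasks on the executors; spark is again the notebook-provided SparkSession.

```python
# Illustrative only: where work happens on a Databricks cluster.

# This line runs on the driver and only builds a logical plan;
# no data is processed yet (Spark transformations are lazy).
df = spark.range(0, 10_000_000).withColumnRenamed("id", "n")

# Calling an action triggers distributed execution: the driver schedules
# tasks, and the executors on the worker nodes do the actual summing.
total = df.selectExpr("sum(n)").first()[0]

print(total)  # the scalar result is collected back to the driver
```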

What are the types of clusters in Azure Databricks?

In Databricks, there are three primary types of clusters, each catering to different use cases:

  1. Standard Clusters: These clusters are well-suited for individual users. They typically consist of a driver node and one or more worker nodes, offering a balanced configuration for various workloads.

  2. High Concurrency Clusters: Designed to handle concurrent workloads efficiently, these clusters are optimized for scenarios where multiple users need to run jobs simultaneously without compromising performance.

  3. Single Node Clusters: Unlike standard clusters, which require at least one worker node in addition to the driver node, single node clusters run all workloads solely on the driver node. They are suitable for lightweight workloads or testing, where the scalability provided by worker nodes is not necessary (a configuration sketch follows this list).
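
For contrast, here is a hedged sketch of the request payload that creates a single node cluster through the same Clusters REST API shown earlier. A standard cluster simply sets num_workers to one or more; the single node variant sets it to zero and adds the documented single-node Spark settings. Runtime and node type strings are placeholders.

```python
# Hedged sketch: Clusters API payload for a Single Node cluster.
single_node_cluster = {
    "cluster_name": "single-node-dev",
    "spark_version": "13.3.x-scala2.12",  # placeholder runtime label
    "node_type_id": "Standard_DS3_v2",    # placeholder Azure VM type
    "num_workers": 0,                     # driver only, no worker nodes
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",       # run tasks inside the driver
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```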