Apache Spark on Azure Databricks
This article describes how Apache Spark is related to Azure Databricks and the Databricks Data Intelligence Platform.
Apache Spark is at the heart of the Azure Databricks platform and is the technology powering compute clusters and SQL warehouses. Azure Databricks is an optimized platform for Apache Spark, providing an efficient and simple environment for running Apache Spark workloads.
Spark transformations and actions
In Apache Spark, all operations are defined as either transformations or actions.
- Transformations: add some processing logic to the plan. Examples include reading data, joins, aggregations, and type casting.
- Actions: trigger processing logic to evaluate and output a result. Examples include writes, displaying or previewing results, manual caching, or getting the count of rows.
Apache Spark uses a lazy execution model, meaning that none of the logic defined by a collection of operations is evaluated until an action is triggered. To avoid unnecessary evaluation of logic, only use actions to save results back to a target table.
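As a minimal sketch of this model in PySpark, the following chain of transformations only builds a query plan; nothing executes until the final write. The table names (`samples.nyctaxi.trips`, `main.default.avg_fare_by_zip`) are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Azure Databricks the `spark` session is pre-created; the builder call
# is included only so this sketch is self-contained.
spark = SparkSession.builder.getOrCreate()

# Transformations: each call only extends the query plan; nothing runs yet.
trips = spark.read.table("samples.nyctaxi.trips")        # read
nonzero = trips.filter(F.col("trip_distance") > 0)       # filter
by_zip = nonzero.groupBy("pickup_zip").agg(              # aggregation
    F.avg("fare_amount").alias("avg_fare")
)

# Action: triggers evaluation of the entire plan and saves the result.
by_zip.write.mode("overwrite").saveAsTable("main.default.avg_fare_by_zip")
```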
Because each action is the point at which accumulated logic gets evaluated, Azure Databricks adds numerous optimizations on top of those already present in Apache Spark to ensure optimal logic execution. These optimizations consider all transformations triggered by a given action at once and find the optimal plan based on the physical layout of the data. Manually caching data or returning preview results in production pipelines can interrupt these optimizations and increase cost and latency.
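To make the anti-pattern concrete, here is a hypothetical production pipeline (the table names are made up); the commented-out calls are the intermediate actions to avoid, leaving a single write as the only action:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created on Azure Databricks

orders = spark.read.table("raw.orders")
customers = spark.read.table("raw.customers")
enriched = orders.join(customers, "customer_id")

# Avoid in production: each of these forces an extra evaluation and can
# prevent the optimizer from planning the pipeline end to end.
# enriched.cache()     # manual caching
# enriched.count()     # action run only to inspect progress
# display(enriched)    # previewing results in a scheduled job

# Prefer a single action that saves the final result to the target table.
enriched.write.mode("append").saveAsTable("gold.enriched_orders")
```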
What is the relationship of Apache Spark to Azure Databricks?
The Databricks company was founded by the original creators of Apache Spark. As an open source software project, Apache Spark has committers from many top companies, including Databricks.
Databricks continues to develop and release features to Apache Spark. The Databricks Runtime includes additional optimizations and proprietary features that build on and extend Apache Spark, including Photon, an optimized version of Apache Spark rewritten in C++.
How does Apache Spark work on Azure Databricks?
When you deploy a compute cluster or SQL warehouse on Azure Databricks, Apache Spark is configured and deployed to virtual machines. You don’t need to configure or initialize a Spark context or Spark session, as these are managed for you by Azure Databricks.
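For example, inside a Databricks notebook the following runs as-is, because `spark` (the SparkSession) and `sc` (the SparkContext) are already defined:

```python
# No SparkSession.builder or SparkContext setup is required on Azure
# Databricks; `spark`, `sc`, and `dbutils` are already defined in the
# notebook environment.
df = spark.range(10).toDF("n")   # use the pre-created session directly
df.show()

print(spark.version)             # Spark version bundled with the runtime
print(sc.defaultParallelism)     # pre-initialized SparkContext
```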
Can I use Azure Databricks without using Apache Spark?
Azure Databricks supports a variety of workloads and includes open source libraries in the Databricks Runtime. Databricks SQL uses Apache Spark under the hood, but end users write standard SQL syntax to create and query database objects.
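One way to see this from a notebook is `spark.sql`, which runs the same standard SQL text through Spark (the sample table name below is illustrative):

```python
# Standard SQL text, executed by Spark under the hood. Uses the
# pre-created `spark` session available in Databricks notebooks.
top_fares = spark.sql("""
    SELECT pickup_zip, AVG(fare_amount) AS avg_fare
    FROM samples.nyctaxi.trips
    GROUP BY pickup_zip
    ORDER BY avg_fare DESC
    LIMIT 10
""")
top_fares.show()
```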
Databricks Runtime for Machine Learning is optimized for ML workloads, and many data scientists use popular open source libraries such as TensorFlow and scikit-learn while working on Azure Databricks. You can use jobs to schedule arbitrary workloads against compute resources deployed and managed by Azure Databricks.
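As a minimal sketch of such a non-Spark workload, this scikit-learn snippet trains a classifier entirely on the driver node; nothing here touches Spark APIs:

```python
# Single-node scikit-learn training; runs on the driver and does not
# require Spark at all.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```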
Why use Apache Spark on Azure Databricks?
The Databricks platform provides a secure, collaborative environment for developing and deploying enterprise solutions that scale with your business. Databricks employees include many of the world’s most knowledgeable Apache Spark maintainers and users. The company continuously develops and releases new optimizations to ensure users can access the fastest environment for running Apache Spark.
How can I learn more about using Apache Spark on Azure Databricks?
To get started with Apache Spark on Azure Databricks, dive right in! The Apache Spark DataFrames tutorial walks through loading and transforming data in Python, R, or Scala. See Tutorial: Load and transform data using Apache Spark DataFrames.
Additional information on Python, R, and Scala language support in Spark can be found in the PySpark on Azure Databricks, SparkR overview, and Azure Databricks for Scala developers sections, as well as in Reference for Apache Spark APIs.