This write-up covers the features and improvements in Apache Spark 3.0. Spark 3.0 is a major release, currently available in preview mode, that bundles new features and performance improvements and deprecates some of the older APIs.
Apache Spark 3.0 is also reported to be 17 times faster than current versions on the TPC-DS benchmark, which is quite impressive.
Apache Spark 3.0 Feature List
- Language Support
- Adaptive Execution
- Binary files data source
- Dynamic Partition Pruning
- DataSource V2 Improvements
- Enhanced Support for Deep Learning
- Better Kubernetes Integration
- Graph Feature
- ACID Transaction with Delta Lake
- Growing integration with Apache Arrow data format
- YARN Features
Programming Language Support
- Apache Spark 3.0 runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+.
- Support for Java 8 versions prior to 8u92 is deprecated as of Spark 3.0.0.
- Support for Python 2, and for Python 3 versions prior to 3.6, is deprecated as of Spark 3.0.0.
- Support for R versions prior to 3.4 is deprecated as of Spark 3.0.0.
- For the Scala API, Spark 3.0.0-preview uses Scala 2.12.
- You will need to use a compatible Scala version (2.12.x).
DPP: Dynamic Partition Pruning
Apache Spark 3.0 introduces Dynamic Partition Pruning (DPP), a key performance improvement for SQL analytics workloads that in turn can make integration with BI tools much better.
The idea behind DPP is to take the filter set on the dimension table (usually small and used in a broadcast hash join) and apply it directly to the fact table, so that Spark can skip scanning partitions it does not need.
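The pruning idea can be sketched in plain Python. This is only a toy model under stated assumptions, not Spark's implementation: the tables, the date_id partition key and the join_with_pruning helper are all hypothetical names made up for illustration.

```python
# Toy model of dynamic partition pruning (not Spark's actual implementation).
# The fact table is stored as partitions keyed by the hypothetical column date_id.
fact_partitions = {
    20200101: [{"date_id": 20200101, "amount": 10},
               {"date_id": 20200101, "amount": 5}],
    20200102: [{"date_id": 20200102, "amount": 7}],
    20200103: [{"date_id": 20200103, "amount": 3}],
}

# Small dimension table, the side that would normally be broadcast.
dim_dates = [
    {"date_id": 20200101, "is_holiday": True},
    {"date_id": 20200102, "is_holiday": False},
    {"date_id": 20200103, "is_holiday": False},
]


def join_with_pruning(fact_partitions, dim_rows, dim_filter):
    """Join fact and dimension rows, scanning only the fact partitions needed."""
    # Step 1: apply the filter to the (small) dimension table.
    kept = {row["date_id"]: row for row in dim_rows if dim_filter(row)}
    # Step 2: reuse the surviving join keys as a partition filter on the fact table.
    scanned = [key for key in fact_partitions if key in kept]
    # Step 3: scan only those partitions and join each fact row to its dimension row.
    joined = [
        {**fact_row, **kept[key]}
        for key in scanned
        for fact_row in fact_partitions[key]
    ]
    return joined, scanned


joined, scanned = join_with_pruning(
    fact_partitions, dim_dates, lambda row: row["is_holiday"]
)
print("partitions scanned:", scanned)
print("total amount:", sum(row["amount"] for row in joined))
```

Only one of the three fact partitions is read here, because the filter on the dimension side leaves a single join key. In Spark itself the optimization is governed by the spark.sql.optimizer.dynamicPartitionPruning.enabled setting.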
Spark SQL: Adaptive Execution
This feature helps when statistics on the data source do not exist or are inaccurate. Until now, Spark applied certain optimizations only in the planning phase, based on statistics about the data (e.g. the ones captured by the ANALYZE command), for example when deciding whether to perform a broadcast hash join (BHJ) instead of an expensive sort-merge join. When these statistics are missing or inaccurate, BHJ might not kick in. With adaptive execution in Spark 3.0, Spark can examine the data at runtime once it has been loaded and opt in to BHJ at that point, even if it could not detect the opportunity in the planning phase.
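The decision described above can be illustrated with a small Python sketch. It is a simplified model, not Spark's planner: choose_join and plan_then_adapt are hypothetical helpers, and the only value borrowed from Spark is the 10 MB default of spark.sql.autoBroadcastJoinThreshold.

```python
# Simplified model of choosing a join strategy at planning time vs. at runtime.
# Assumption: a side is broadcast when its known size is within the threshold;
# an unknown size (None) is treated as too large, so the planner plays safe.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # Spark's default broadcast threshold


def choose_join(size_in_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """Return the join strategy for a join side of the given size."""
    if size_in_bytes is not None and size_in_bytes <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"


def plan_then_adapt(estimated_size, actual_size):
    """Compare the static plan with the plan chosen once real sizes are known."""
    planned = choose_join(estimated_size)  # planning phase: statistics only
    adapted = choose_join(actual_size)     # runtime: size measured after loading
    return planned, adapted


# No statistics at planning time, but the table turns out to be only 2 MB.
planned, adapted = plan_then_adapt(estimated_size=None, actual_size=2 * 1024 * 1024)
print(planned, "->", adapted)
```

With missing statistics the static plan falls back to a sort-merge join, while the runtime measurement allows the switch to a broadcast hash join. In Spark, adaptive execution is turned on through the spark.sql.adaptive.enabled setting.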
Graph Feature
Graph processing can be used in data science for several applications, including recommendation engines and fraud detection.
ACID Transactions with Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark 3.0. Because it is easy to implement and existing Spark applications are easy to upgrade to it, it brings reliability to data lakes.
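As a rough illustration of how small the change to an existing application can be, the sketch below writes and reads a table through the delta format name. It assumes the Delta Lake package is available to the Spark session; save_as_delta and load_delta are hypothetical helper names, and df and spark stand for an existing DataFrame and SparkSession.

```python
# Minimal sketch, assuming Delta Lake is installed alongside Spark.
# Compared with a Parquet-based job, only the format name changes.


def save_as_delta(df, path):
    """Write an existing DataFrame to `path` as a Delta table, replacing old data."""
    df.write.format("delta").mode("overwrite").save(path)


def load_delta(spark, path):
    """Read the Delta table stored at `path` back into a DataFrame."""
    return spark.read.format("delta").load(path)
```

An application that previously called format("parquet") in these two places would, under the same assumptions, only need the format string changed.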
Binary files data source
Apache Spark 3.0 supports a binary file data source, selected with the format name binaryFile. It reads binary files and converts each one into a single row that contains the raw content and metadata of the file. The resulting DataFrame contains the following columns, plus any partition columns:
- path (StringType)
- modificationTime (TimestampType)
- length (LongType)
- content (BinaryType)
Writing a binary DataFrame/RDD back is currently not supported.
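A minimal PySpark sketch of reading with the binary file data source is shown below. The read_binary_files helper, the directory argument and the default glob are illustrative assumptions; spark stands for an existing SparkSession.

```python
# Minimal sketch: load binary files into a DataFrame, one row per file.


def read_binary_files(spark, directory, glob="*.png"):
    """Read files in `directory` whose names match `glob` using the binaryFile source."""
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", glob)  # only pick up files matching the pattern
        .load(directory)
    )
```

With a live session, read_binary_files(spark, "/path/to/data") would return a DataFrame holding the file content and metadata, which can then be inspected with printSchema().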
Check out the Apache Spark 3.0.0 docs for more info on how to get the most out of Spark.