Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning workloads at scale on bare-metal or Kubernetes clusters. In this talk we will show you how to set up and run Spark workloads on Kubernetes using Charmed Spark, a set of tools supported by Canonical that makes life simpler for data scientists, data engineers, and administrators.
To do so, we will start by deploying a fully functional Kubernetes cluster using MicroK8s. Once Kubernetes is up and running, we will use the Spark Client snap to configure the roles and permissions that Spark requires. In this demo we focus on a single user, but multiple users can be managed just as easily. We will then demonstrate how to use the spark-shell and pyspark utilities provided in the snap to work with Spark interactively, so that a user can quickly test Spark functionality in Scala or Python. We will also show how to submit regular batch jobs via the spark-submit command provided in the snap, and how to monitor the status of the different jobs using the Spark History Server, a component deployed and managed via a charmed operator on top of Juju.
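As a rough sketch, the workflow described above looks something like the following. Command names follow the spark-client snap's documented pattern, but the exact flags, snap channels, service-account name, and application path are illustrative assumptions and may differ by version:

```shell
# Deploy a local Kubernetes cluster with MicroK8s.
sudo snap install microk8s --classic
microk8s enable dns hostpath-storage

# Install the Spark Client snap and register a service account; this sets up
# the Kubernetes roles and permissions Spark needs in the chosen namespace.
sudo snap install spark-client
spark-client.service-account-registry create --username spark --namespace spark

# Interactive use, in Scala or Python.
spark-client.spark-shell --username spark --namespace spark
spark-client.pyspark --username spark --namespace spark

# Submit a regular batch job (the application path is a placeholder).
spark-client.spark-submit --username spark --namespace spark ./my_job.py
```

The Spark History Server mentioned above would then be deployed and managed as a charm on top of Juju, rather than through the snap.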
Finally, we will show how to integrate this Spark solution with other Data Platform products, such as Kafka, and use Spark's streaming engine to compute metrics over streams of data produced by Kafka.
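A minimal PySpark sketch of such a streaming metric computation might look like the following. The broker address, topic name, and message schema are all illustrative assumptions, not part of the talk's actual setup; the parsing helper simply makes the assumed message format explicit.

```python
import json


def extract_temperature(raw: bytes) -> float:
    """Parse one JSON-encoded Kafka message and pull out the temperature field."""
    return float(json.loads(raw)["temperature"])


def run_metrics_job() -> None:
    """Average temperature per 1-minute window over a Kafka stream (sketch)."""
    # pyspark is imported here so the parsing helper above stays usable
    # without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg, col, from_json, window
    from pyspark.sql.types import DoubleType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka-metrics-demo").getOrCreate()

    schema = StructType([
        StructField("temperature", DoubleType()),
        StructField("ts", TimestampType()),
    ])

    # Broker address and topic name are placeholders for whatever the
    # Kafka deployment actually exposes.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "sensor-events")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # The metric: average temperature per 1-minute event-time window.
    metrics = events.groupBy(window(col("ts"), "1 minute")).agg(avg("temperature"))

    query = metrics.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
```

Running the job also requires the Spark/Kafka integration package (`spark-sql-kafka`) on the classpath, which a Charmed Spark deployment would provide or which can be supplied via `--packages`.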
Session author's bio
Paolo Sottovia is a software engineer on the Data Platform team at Canonical. He is passionate about distributed systems, data processing, and data explanation. He spent almost 10 years in database research, working on projects that help users extract knowledge from their data. He currently develops the Charmed Spark solution, a complete suite of tools for easily running Spark on Kubernetes.
Level of Difficulty: Intermediate