How to Accelerate Your Big Data and Associated Deep Learning Applications with Hadoop and Spark: A Hands-On Approach
Event Type: Tutorial
Tags: VisDataAnalytics, Tutorial, Half-Day
Time: Tuesday, July 24, 1:30pm - 5pm
Location: Sterlings 1
Description: Apache Hadoop and Spark are gaining prominence in handling Big Data analytics and associated Deep Learning workloads. Recent studies have shown that default Hadoop and Spark cannot efficiently leverage the high-performance networking and storage architectures of modern HPC clusters, such as Remote Direct Memory Access (RDMA)-enabled high-performance interconnects and heterogeneous, high-speed storage systems (e.g., HDD, SSD, NVMe-SSD, and Lustre). These middleware stacks are traditionally written with sockets and do not deliver the best performance on modern high-performance networks. In this tutorial, we will provide an in-depth overview of the architecture of Hadoop components (HDFS, MapReduce, etc.) and Spark. We will examine the challenges in re-designing the networking and I/O components of these middleware stacks for RDMA-capable interconnects and protocols (such as InfiniBand and RoCE) and modern storage architectures. Using the publicly available software packages from the High-Performance Big Data (HiBD) project, we will present case studies of the new designs for several Hadoop/Spark components and their associated benefits. Through these case studies, we will also examine the interplay between high-performance interconnects, high-speed storage systems, and multi-core platforms in achieving the best solutions for these components and for Big Data processing and Deep Learning applications on modern HPC clusters. The tutorial will include hands-on sessions with Hadoop and Spark on the SDSC Comet supercomputer.