Annual Meeting 2021

The UCF Consortium held its 2021 annual meeting and workshop virtually in December. The meeting covered the consortium's growing set of projects, including UCX, UCC, and OpenSNAPI, along with their latest developments, usage, and future plans.

Date Time Topic Speaker/Moderator
11/30 08:00-08:15
Opening Remarks and UCF – slides, video

Unified Communication Framework (UCF) – a collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications. In this talk we will present recent advances in the development of UCF projects, including Open UCX and Apache Spark UCX, as well as incubation projects in the areas of SmartNIC programming, benchmarking, and other areas of accelerated compute.

Gilad Shainer, NVIDIA

Gilad Shainer serves as senior vice president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. He serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortiums, a member of the IBTA, and a contributor to the PCISIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds MSc and BSc degrees in Electrical Engineering from the Technion Institute of Technology in Israel.

08:15-09:00
Accelerating recommendation model training using ByteCCL and UCX – slides, video

BytePS [1] is a state-of-the-art open-source distributed training framework for machine learning in the fields of computer vision (CV), natural language processing (NLP), and speech. Recently, there has been growing interest in leveraging GPUs for training recommendation systems and computational advertising models (e.g., DLRM [2]), which involve large-scale datasets and require distributed training. While most models in CV/NLP/speech can fit onto a single GPU, DLRM-like models require model parallelism, which shards model parameters across devices, and they exhibit a much lower computation-to-communication ratio. To accelerate large-scale training for these models, we developed ByteCCL, the next generation of BytePS with UCX as the communication backend, providing communication primitives such as alltoall, allreduce, gather, scatter, and send/recv. The library is designed with (1) asynchronous APIs, (2) support for multiple deep learning frameworks, (3) zero-copy and GPU-direct RDMA support, (4) support for multiple hardware targets (CPU and GPU), and (5) topology-aware optimizations. We discuss how ByteCCL leverages UCX for high-performance data transfers, and how model parallelism for DLRM-like models can be supported by ByteCCL primitives for both synchronous and asynchronous training. As a result, ByteCCL enables asynchronous DLRM-like model training with GPUs and provides significant speedup over CPU-based asynchronous training systems. For synchronous DLRM-like model training, we present alltoall micro-benchmarks showing that the performance of ByteCCL is on par with NCCL on an 800 Gb/s CX6 A100 GPU cluster and with HPCX on a 200 Gb/s CX6 AMD CPU cluster, respectively. We further show that ByteCCL provides up to 9% and 12% speedup for end-to-end model training over NCCL and HPCX, respectively. We plan to open source ByteCCL, and will conclude by discussing future work. [1] Jiang, Yimin, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. "A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters." In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 463-479. 2020. [2] Naumov, Maxim, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang et al. "Deep Learning Recommendation Model for Personalization and Recommendation Systems." arXiv preprint arXiv:1906.00091 (2019).

Haibin Lin*, Bytedance Inc.

Haibin Lin is a research scientist at Bytedance Inc. He works on machine learning systems.

Mikhail Brinskii, NVIDIA

Mikhail Brinskii is a UCX developer with a main focus on networking and HPC solutions. Before joining NVIDIA, Mikhail worked at Intel developing the highly optimized Intel MPI Library.

Yimin Jiang, Bytedance Inc.

Yulu Jia, Bytedance Inc.

Chengyu Dai, Bytedance Inc.

Yibo Zhu, Bytedance Inc.

09:00-09:30
UCX on Azure HPC/AI Clusters – slides, video

Recent technology advancements have substantially improved the performance potential of virtualization. As a result, the performance gap between bare-metal and cloud clusters continues to shrink. This is quite evident as public clouds such as Microsoft Azure have climbed into the top rankings of the Graph500 and Top500 lists. Moreover, public clouds democratize these technology advancements with a focus on performance, scalability, and cost-efficiency. Though platform technologies and features continue to evolve, communication runtimes such as UCX play a key role in enabling applications to take advantage of these advancements with high performance. This talk focuses on how UCX efficiently enables the latest technology advancements in Azure HPC and AI clusters. It will also provide an overview of the latest HPC and AI offerings in Microsoft Azure along with their performance characteristics, cover the Microsoft Azure HPC marketplace images that include UCX-powered MPI libraries, and share recommendations and best practices for Microsoft Azure HPC. We will also discuss performance and scalability characteristics using micro-benchmarks and HPC applications.

Jithin Jose, Microsoft

09:30-10:00
MPICH + UCX: State of the Union – slides, video

In this talk, we will discuss the current state of MPICH support for the UCX library, focusing on changes since the last annual meeting. Topics covered will include build configuration, point-to-point communication, RMA, multi-threading, MPI-4 partitioned communication, and more. We also look towards future UCX development items for the coming year.
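
For readers unfamiliar with MPI-4 partitioned communication, one of the features listed above, the sketch below shows the basic pattern using the standard MPI-4 calls (MPI_Psend_init, MPI_Precv_init, MPI_Pready). It assumes an MPI-4-capable MPICH build; the partition count and buffer sizes are arbitrary illustrative choices, not taken from the talk.

#include <mpi.h>
#include <stdio.h>

#define PARTITIONS 4
#define PER_PART   256

/* Minimal MPI-4 partitioned send/receive between ranks 0 and 1 (illustrative). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[PARTITIONS * PER_PART];
    MPI_Request req;

    if (rank == 0) {
        MPI_Psend_init(buf, PARTITIONS, PER_PART, MPI_DOUBLE, 1, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        for (int p = 0; p < PARTITIONS; p++) {
            for (int i = 0; i < PER_PART; i++)
                buf[p * PER_PART + i] = p;   /* fill partition p */
            MPI_Pready(p, req);              /* partition p may be sent now */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
    } else if (rank == 1) {
        MPI_Precv_init(buf, PARTITIONS, PER_PART, MPI_DOUBLE, 0, 0,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Request_free(&req);
        printf("rank 1 received %d partitions\n", PARTITIONS);
    }

    MPI_Finalize();
    return 0;
}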

Ken Raffenetti, Argonne National Laboratory

Ken is a Principal Software Development Specialist at Argonne National Laboratory.

10:00-10:30 Break
10:30-11:00
UCX.jl — Feature rich UCX bindings for Julia – slides, video

Julia is a high-level programming language with a focus on technical computing. It has seen increased adoption in HPC fields such as CFD due to its expressiveness, high performance, and support for GPU-based programming. In this talk I will introduce UCX.jl, a Julia package that contains both low-level and high-level bindings to UCX from Julia. I will talk about the challenges of integrating UCX with Julia, in particular integrating with Julia's task runtime.

Valentin Churavy, MIT

11:00-11:30
Go bindings for UCX – slides, video

Go is a popular programming language for cloud services. UCX can provide big benefits not only for traditional HPC applications but also for other network-intensive, distributed applications. Java and Python bindings are already being used by the community in various applications and frameworks. UCX v1.12 will include bindings for the Go language. We will present an example of API usage and the next steps for Go bindings development.

Peter Rudenko, NVIDIA

Peter Rudenko is a software engineer on the High Performance Computing team, focusing on accelerating data-intensive applications, developing the UCX communication library, and building various big data solutions.

11:30-12:00
UCX-Py: Stability and Feature Updates – slides, video

UCX-Py is a high-level library enabling UCX connectivity for Python applications. Because Python is a high-level language, it is very difficult to directly interface with specialized hardware to take advantage of high-performance communication. UCX is a crucial component of modern HPC clusters, and UCX-Py delivers that capability to various Python data science applications, including RAPIDS and Dask. UCX-Py has been under development for three years, starting as a simplified wrapper for UCP that evolved to support Python asyncio; today both implementations are available depending on the application's programming model and needs. The first couple of years of development brought many features to UCX-Py and served to prove UCX performance as a de-facto necessity for HPC data science in Python. The third year focused largely on stability improvements and made UCX-Py more stable than ever, but some new features were included as well. Handling of endpoint errors is now done via UCX endpoint callbacks, allowing for cleaner endpoint lifetime management. Newly introduced Active Message support provides an alternative to the TAG API, offering a more familiar communication pattern for Python developers. Support for RMA operations and for creating endpoints directly from a remote worker via the UCP address add to the list of new UCX-Py features. Heavy testing and benchmarking on hardware providing connectivity such as InfiniBand and NVLink has also made UCX-Py an important tool for catching numerous UCX bugs. Following a complete split of the UCX-Py asyncio implementation from the core synchronous piece of the library, a smooth and complete upstreaming from UCX-Py's own repository into the mainline OpenUCX repository is now possible. This process has already started but is still in its early stages.

Peter Entschev, NVIDIA

Peter Entschev is a system software engineer at NVIDIA, working on distributed computing and communications on the RAPIDS team.

12:00-12:30 Break
12:30-13:00
OpenSHMEM and Rust – slides, video

Many institutions engaged in High Performance Computing (HPC) are interested in moving from C-based setups to newer languages like Rust and Go to improve code safety and security. The Partitioned Global Address Space (PGAS) programming family is a set of HPC libraries and languages that employs a relaxed parallelism model utilizing Remote Memory Access (RMA or RDMA). Traditionally most libraries are targeted at C/C++ (and also Fortran) and are then used to implement PGAS programming languages or models. The OpenSHMEM library provides a communication and memory management API for C/C++. However, the use of raw C pointers creates safety issues in languages like Rust, and that detracts from the end-user’s experience. This presentation will look at how we can integrate OpenSHMEM into Rust while retaining Rust’s safety guarantees.

Tony Curtis*, Stony Brook University

Rebecca Hassett, Stony Brook University

13:00-13:30
Towards Cost-Effective and Elastic Cloud Database Deployment via Memory Disaggregation – slides, video

It is challenging for cloud-native relational databases to meet the ever-increasing needs of scaling compute and memory resources independently and elastically. The recent emergence of the memory disaggregation architecture, relying on high-speed RDMA networks, offers opportunities to build cost-effective and elastic cloud-native databases. There exist proposals to let unmodified applications run transparently on disaggregated systems. However, running a relational database kernel atop such proposals experiences notable performance degradation and time-consuming failure recovery, offsetting the benefits of disaggregation. To address these challenges, we propose a novel database architecture called LegoBase, which explores the co-design of the database kernel and memory disaggregation. It pushes memory management back to the database layer, bypassing the Linux I/O stack and re-using or designing (remote) memory access optimizations with an understanding of data access patterns. LegoBase leverages RDMA and further splits the conventional ARIES fault tolerance protocol to independently handle local and remote memory failures for fast recovery of compute instances. We implemented LegoBase atop MySQL. We compare LegoBase against MySQL running on a standalone machine and the state-of-the-art disaggregation proposal Infiniswap. Our evaluation shows that even with a large fraction of data placed in remote memory, LegoBase's system performance in terms of throughput (up to 9.41% drop) and P99 latency (up to 11.58% increase) is comparable to the monolithic MySQL setup, and significantly outperforms (by 1.99×-2.33×, respectively) the deployment of MySQL over Infiniswap. Meanwhile, LegoBase delivers up to 3.87× and 5.48× speedups of the recovery and warm-up time, respectively, over monolithic MySQL and MySQL over Infiniswap when handling failures or planned re-configurations.

Cheng Li, University of Science and Technology of China

13:30-13:45 Adjourn
12/01 08:00-08:45
Opening Remarks and UCX – slides
Pavel Shamis (Pasha), Arm

Pavel Shamis is a Principal Research Engineer at Arm. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development of multiple projects in the high-performance communication domain, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, OpenMPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and a co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI.

08:45-09:00
Porting UCX for Tofu-D interconnect of Fugaku – slides, video

Tofu-D is the interconnect designed for the Fugaku supercomputer. It has a 6-dimensional torus network for high scalability, up to 160K nodes in Fugaku. The Tofu-D network interface is integrated into the Fujitsu A64FX CPU chip. A library called the uTofu API is provided by Fujitsu to make use of Tofu-D from userland. Currently, we are working on porting UCX to Tofu-D so that users can use UCX ecosystems with native support. As a preliminary evaluation before the porting, we evaluated UCX's TCP mode with the tag messaging API on Tofu-D. We measured the bandwidth with a ping-pong benchmark and compared it with Fujitsu MPI, which supports Tofu-D natively. The result was disappointing: the bandwidth of UCX in TCP mode is about 170 MB/s with a 4KB message, while that of Fujitsu MPI is about 6.2 GB/s. This implies we need to port UCX directly on top of uTofu to achieve better performance. With our latest implementation, we are able to run UCT's zero-copy PUT API on Tofu-D without using TCP mode. In a performance evaluation and comparison with Fujitsu MPI and uTofu, the modified UCX achieves about 6.3 GB/s with a 4KB message, matching the performance of uTofu. It is a promising result that UCX scales better than MPI.

Yutaka Watanabe*, University of Tsukuba

Mitsuhisa Sato, RIKEN

Miwako Tsuji, RIKEN

Hitoshi Murai, RIKEN

Taisuke Boku, University of Tsukuba

09:00-10:00
UCP Active Messages – slides, video

Active Messages is a common messaging interface for various PGAS (Partitioned Global Address Space) APIs and libraries. UCP provides a rich and performance-efficient API suitable for implementing the networking layer of different types of frameworks. In this talk we will describe the UCP Active Message API, highlighting its advantages for applications and frameworks that do not require tag-matching capabilities. With the Active Message API, an application can benefit from all major UCX features available with the well-known tag-matching API, such as GPU memory and multi-HCA support, error-handling support, etc. The API provides optimal performance by auto-selecting proper protocols depending on the message size being transferred. Besides that, the Active Message API has several benefits compared to the tag-matching API: 1) no extra memory copies on the receiver, even with eager protocols; 2) the ability for peer-to-peer communication (a message can be received in the scope of the endpoint); 3) better error handling support; 4) no tag-matching overhead (for frameworks where tag matching is not really needed). All of these capabilities make the UCP Active Message API a preferable choice when implementing the networking layer for HPC and AI communication frameworks.
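
As a rough illustration of the API described above, the sketch below registers an Active Message receive handler and sends a message using the public UCP calls ucp_worker_set_am_recv_handler and ucp_am_send_nbx. The worker and endpoint creation, error handling (e.g., UCS_PTR_IS_ERR checks), and the AM_ID_GREETING identifier are simplifications assumed for brevity; consult the UCX documentation for the complete flow.

#include <ucp/api/ucp.h>
#include <stdio.h>

#define AM_ID_GREETING 7   /* application-chosen Active Message ID (illustrative) */

/* Receive callback invoked by UCX when an AM with AM_ID_GREETING arrives. */
static ucs_status_t am_recv_cb(void *arg, const void *header, size_t header_length,
                               void *data, size_t length,
                               const ucp_am_recv_param_t *param)
{
    printf("received %zu-byte active message\n", length);
    return UCS_OK;   /* data is consumed inside the callback */
}

/* Register the callback for AM_ID_GREETING on a worker created elsewhere. */
static ucs_status_t register_am_handler(ucp_worker_h worker)
{
    ucp_am_handler_param_t param = {
        .field_mask = UCP_AM_HANDLER_PARAM_FIELD_ID |
                      UCP_AM_HANDLER_PARAM_FIELD_CB,
        .id         = AM_ID_GREETING,
        .cb         = am_recv_cb
    };
    return ucp_worker_set_am_recv_handler(worker, &param);
}

/* Send a small payload as an active message on an endpoint created elsewhere. */
static void send_greeting(ucp_worker_h worker, ucp_ep_h ep)
{
    const char msg[] = "hello from UCP AM";
    ucp_request_param_t param = { .op_attr_mask = 0 };
    ucs_status_ptr_t req = ucp_am_send_nbx(ep, AM_ID_GREETING,
                                           NULL, 0,          /* no user header */
                                           msg, sizeof(msg), &param);
    if (UCS_PTR_IS_PTR(req)) {
        while (ucp_request_check_status(req) == UCS_INPROGRESS) {
            ucp_worker_progress(worker);   /* drive communication progress */
        }
        ucp_request_free(req);
    }
}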

Mikhail Brinskii*, NVIDIA

Mikhail Brinskii is a UCX developer with a main focus on networking and HPC solutions. Before joining NVIDIA, Mikhail worked at Intel developing the highly optimized Intel MPI Library.

Yossi Itigin, NVIDIA

Yossi Itigin is the UCX team lead at NVIDIA, focusing on high-performance communication middleware, and a maintainer of the OpenUCX project. Prior to joining NVIDIA, Mr. Itigin spent nine years at Mellanox Technologies in different technical roles, all related to developing and optimizing RDMA software.

10:00-10:30 Break
10:30-12:00
UCX GPU support – slides, video

We would like to discuss last year's changes and future plans for UCX GPU support, including but not limited to: pipeline protocols, out-of-the-box performance optimization, shared-memory two-stage pipeline, device memory pipeline, GPU-CPU locality information, DMA-buf support and moving the registration cache to UCP, limiting/extending GPU memory registration, and the memory-type cache.

Yossi Itigin*, NVIDIA

Yossi Itigin is the UCX team lead at NVIDIA, focusing on high-performance communication middleware, and a maintainer of the OpenUCX project. Prior to joining NVIDIA, Mr. Itigin spent nine years at Mellanox Technologies in different technical roles, all related to developing and optimizing RDMA software.

Akshay Venkatesh, NVIDIA

Devendar Bureddy, NVIDIA

12:00-12:30 Break
12:30-13:00
UCX on AWS: Adding support for Amazon EFA in UCX – slides, video

HPC in the cloud is an emerging trend that provides users with easier, more flexible, and more affordable access to HPC resources. Amazon Web Services (AWS) provides its users with compute instances that are equipped with a low-latency, high-bandwidth network interface as well as multiple NVIDIA GPUs. More specifically, AWS has introduced the Elastic Fabric Adapter (EFA): an HPC-optimized 100 Gbps network interface with GPUDirect RDMA capability to satisfy the high-performance communication requirements of HPC/DL applications. EFA exposes two transport protocols: 1) Unreliable Datagram (UD) and 2) Scalable Reliable Datagram (SRD). The UD transport is very similar to the UD transport in IB. However, EFA is not an IB device, so it lacks certain features and assumptions that the existing transport layers in UCX depend on. SRD is a new transport developed by Amazon that is exposed as a new QP type in rdma-core verbs; therefore, it is not compatible with any of the existing transports in UCX. SRD is a reliable, unordered datagram protocol that relies on sending packets over many network paths to achieve the best performance on AWS. Out-of-order packet arrival is very frequent with SRD; therefore, efficient handling of such packets is the key to achieving high-performance communication with UCX on AWS. Currently, UCX does not support EFA, which means UCX usage on AWS is limited to TCP, which has a per-flow bandwidth cap of ~1.2 GB/s. Adding support for EFA through SRD allows UCX to achieve ~12 GB/s bandwidth on the AWS network and deliver lower latency too. It also enables UCX-optimized software, such as GPU-accelerated libraries from the RAPIDS suite, to run directly on AWS. In this talk, we discuss how we add support for EFA in UCX. More specifically, 1) we add a new EFA memory domain that extends the existing IB memory domain to capture EFA-specific features and limitations, 2) we add a new UCT interface for SRD and discuss the challenges and solutions for this reliable but relaxed-ordering protocol, and 3) we update the existing UD transport in UCX to make it work over EFA as well. Our micro-benchmark results show good performance for MPI communication between two EC2 p4d.24xlarge instances. Using Open MPI plus UCX with a prototype SRD interface, we achieve the maximum bandwidth offered by the EFA device (12 GB/s) for both host and GPU communication. This is 10x higher than what we achieve without EFA support, i.e., 1.2 GB/s with UCX + TCP. Latency results are encouraging too. For host memory, we achieve latencies close to the rdma-core performance tests, i.e., ~19 us with UD and ~20 us with SRD. This is 44% lower than the latency we achieve with UCX + TCP. Moreover, by taking advantage of GPUDirect RDMA in UCX, we achieve the same latency and bandwidth as host communication for large-message GPU communication.

Hessam Mirsadeghi*, NVIDIA

Akshay Venkatesh, NVIDIA

Jim Dinan, NVIDIA

Sreeram Potluri, NVIDIA

13:00-13:30
rdma-core update – slides, video

A 30-minute talk reviewing changes in rdma-core and the kernel over the last year. The RDMA stack provides the low-level hardware interfaces that UCX rides on top of.

Jason Gunthorpe, NVIDIA

Jason is the maintainer for RDMA in the Linux kernel and userspace rdma-core.

13:30-14:00
Congestion Control for Large Scale RoCEv2 Deployments – slides, video

100 Gbps Ethernet with RoCEv2 is currently being deployed in High Performance Computing (HPC) and Machine Learning (ML) environments. For large scale applications, the network congestion control (CC) plays a big role in application performance. RoCEv2 does not scale well if only Priority-based Flow Control (PFC) is used. The use of PFC can lead to head-of-line blocking that results in traffic interference. RoCEv2 with ECN-based congestion control schemes scales well without requiring PFC. In this talk, we will present the benefits of hardware-based congestion control for RoCEv2. We will also discuss congestion control enhancements to further improve RoCEv2 performance in large scale deployments.

Hemal Shah*, Broadcom

Moshe Voloshin, Broadcom

14:00-14:15 Adjourn
12/02 08:00-08:15
Opening Remarks

Pavel Shamis (Pasha), Arm

Pavel Shamis is a Principal Research Engineer at Arm. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development of multiple projects in the high-performance communication domain, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, OpenMPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and a co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI.

08:15-09:30
Unified Collective Communication (UCC) State of the Union 2021 – slides, video

UCC is an open-source collective library for current and emerging programming models. The goal is to provide a single library for collectives serving the various use cases. In particular, UCC aims to unify the collective communication interfaces and semantics for (i) parallel programming models including HPC (message passing and PGAS), deep learning, and I/O, (ii) collectives execution on CPUs and GPUs, and (iii) collectives over software and hardware transports. In this talk, we will first highlight some of the design principles, abstractions, and the current implementation. Then, we will provide highlights of recent advances in the project and upcoming plans.

Manjunath Gorentla Venkata, NVIDIA

Manjunath Gorentla Venkata is an HPC software architect at NVIDIA. His focus is on programming models and network libraries for HPC systems. Previously, he was a research scientist and the Languages Team lead at Oak Ridge National Laboratory. He’s served on open standards committees for parallel programming models, including OpenSHMEM and MPI for many years, and he is the author of more than 50 research papers in this area. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico.

09:30-10:00
High Performance Compute Availability (HPCA) Benchmark Project for Smart Networks – slides, video

Benchmarking has always been a critical piece of procuring and accepting new systems, as well as providing concrete data to application developers about the performance they can expect from specific computing platforms. While the community already benefits from many benchmarking suites, the latest improvements in hardware technologies are creating a gap between the metrics captured by these benchmarks and new hardware-level capabilities. An example of such a gap, and the associated response by the High-Performance Computing (HPC) community, is the introduction of the exaflop metric, which proved to be more suitable for modern workloads that include machine learning or deep learning algorithms running on Graphical Processing Units (GPUs). That new metric is not meant to replace other metrics, such as floating-point operations per second (FLOPS), but rather to complement them. The focus of the High-Performance Compute Availability (HPCA) group is to provide a benchmark suite for smart networks that are based on devices such as smart switches, Data Processing Units (DPUs), or Network integrated Processing Units (IPUs). These innovative technologies provide a unique opportunity to offload computation to the network, freeing valuable resources on the hosts and enabling new optimizations. For example, smart networks can be used to offload collective operations, such as non-blocking operations from the Message Passing Interface (MPI) standard. Our project is therefore an effort to create and compile a new set of metrics for ranking HPC and Artificial Intelligence (AI) system performance and capabilities, by providing all the necessary benchmarks to capture the benefits enabled by state-of-the-art smart network devices. HPCA is intended as a complement to existing benchmarks, such as HPCC and HPCG. The current prototype from our group, called OpenHPCA, provides a set of metrics based on MPI. Two existing benchmarks, an extension of an existing benchmark, and a brand-new suite have been integrated together. The two existing benchmarks are the Ohio State University (OSU) micro-benchmarks and the Sandia micro-benchmarks (SMB). The modified benchmark is an extension of the OSU micro-benchmarks for non-contiguous memory, which is useful for evaluating new capabilities such as User-Mode Memory Registration (UMR). The new benchmark aims to complete the overlap evaluation provided by both OSU and SMB by adding a suite that provides a set of metrics based on a different statistical methodology for computing overlap. The new overlap benchmarks therefore provide a more complete overview of overlapping capabilities when combined with the other metrics. The current implementation of OpenHPCA provides a framework for the execution of all the individual benchmarks and analysis of the results. It also provides a graphical interface that makes it easier for users to visualize and analyze the results. Our future work includes the development of benchmarks for new programming interfaces targeting state-of-the-art smart networks, such as OpenSNAPI. By doing so, we plan to provide a novel set of metrics that capture the benefits of current and future smart network devices and their impact on applications.
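
As a point of reference for what the overlap benchmarks measure, the sketch below shows a generic computation/communication overlap pattern built around a non-blocking MPI collective. This is not OpenHPCA code; the buffer sizes, iteration count, and the busy_compute helper are arbitrary illustrative choices.

#include <mpi.h>
#include <stdio.h>

/* Placeholder host work intended to run while communication is in flight. */
static void busy_compute(double *acc, int iters)
{
    for (int i = 0; i < iters; i++)
        *acc += (double)i * 1e-9;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double sendbuf[1024], recvbuf[1024], acc = 0.0;
    for (int i = 0; i < 1024; i++)
        sendbuf[i] = 1.0;

    MPI_Request req;
    double t0 = MPI_Wtime();
    MPI_Iallreduce(sendbuf, recvbuf, 1024, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    busy_compute(&acc, 1000000);        /* computation that should overlap */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double overlapped = MPI_Wtime() - t0;

    /* An overlap ratio is derived by comparing 'overlapped' against separately
     * measured pure-communication and pure-computation times. */
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("time with compute in flight: %f s (acc=%f)\n", overlapped, acc);

    MPI_Finalize();
    return 0;
}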

Geoffroy Vallee*, NVIDIA

Richard Graham, NVIDIA

Steve Poole, Los Alamos National Laboratory

10:00-10:30 Break
10:30-11:00
Cloud-Native Supercomputing Performance isolation – slides, video

High-performance computing and artificial intelligence have evolved to become the primary data processing engines for wide commercial use, hosting a variety of users and applications. While providing the highest performance, supercomputers must also offer multi-tenancy security isolation and multi-tenancy performance isolation. Therefore, they need to be designed as cloud-native platforms. In this session we will focus on the latter part, discussing the key technological network elements enabling performance isolation, including the data processing unit (DPU), a fully integrated data-center-on-a-chip platform, and an innovative proactive/reactive end-to-end congestion control. We will present application performance results from on-premises supercomputing systems, as well as the Azure HPC cloud infrastructure.

Gilad Shainer*, NVIDIA

Richard Graham, NVIDIA

Jithin Jose, Microsoft

11:00-12:00
ifunc: UCX Programming Interface for Remote Function Injection and Invocation – slides, video

Network library APIs have historically been developed with an emphasis on data movement, placement, and communication semantics. Many communication semantics are available across a large variety of network libraries, such as send-receive, data streaming, put/get/atomic, RPC, active messages, collective communication, etc. In this work we introduce new compute and data movement APIs that overcome the constraints of the single-program, multiple-data (SPMD) programming model by allowing users to send binary executable code between processing elements. Our proof-of-concept implementation of the API is based on the UCX communication framework and leverages the RDMA network for fast compute migration. We envision the API being used to dispatch user functions from a host CPU to a SmartNIC (DPU), computational storage drive (CSD), or remote servers. In addition, the API can be used by large-scale irregular applications (such as semantic graph analysis), composed of many coordinating tasks operating on a data set so big that it has to be stored on many physical devices. In such cases, it may be more efficient to dynamically choose where code runs as the application progresses.

Wenbin Lu*, Stony Brook University

Luis E. Peña, Arm Research

12:00-12:30 Break
12:30-13:00
Remote OpenMP Offloading with UCX – slides, video

OpenMP has a long and successful history in parallel programming for CPUs. Since the introduction of accelerator offloading it has evolved into a promising candidate for all intra-node parallel computing needs. While this addition broke with the shared memory assumption OpenMP was initially developed with, efforts to employ OpenMP in other non-shared memory domains are practically non-existent. Through Remote OpenMP offloading, we have shown that the OpenMP accelerator offloading model is sufficient to seamlessly and efficiently utilize more than a single compute node, and its connected accelerators. Our runtime allows an OpenMP offload capable program, without source code or compiler modifications, to be run on a remote CPU, or remote accelerator (e.g., GPU), as if it was a local one. For applications that support multi-device offloading, any combination of local and remote CPUs and accelerators can be utilized simultaneously, fully transparent to the user. The original backend, which was based on gRPC, has been upstreamed into LLVM. The new backend in development is based on UCX, in order to support InfiniBand, as well as to efficiently utilize other features such as RDMA, and inter-GPU communication in the near future. Our initial implementation has been tested on up to 72 remote GPUs with two popular HPC proxy apps – RSBench and XSBench – that were most amenable to utilizing multiple remote accelerators through OpenMP target offloading. In this talk, we will discuss our progress so far, on-going challenges in development, as well as a roadmap for the future.
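
For context, the sketch below is an ordinary OpenMP target-offload kernel of the kind the abstract refers to; it uses only standard OpenMP directives and nothing specific to the remote-offloading runtime. With such a runtime in place, the same unmodified code could, as described above, be dispatched to a remote CPU or GPU.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int n = 1 << 20;
    static double x[1 << 20];
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        x[i] = 1.0;

    /* Offload the reduction loop to the default device; a remote-offloading
     * plugin can make that "device" a CPU or GPU on another node,
     * transparently to this code. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; i++)
        sum += x[i];

    printf("sum = %f (expected %d)\n", sum, n);
    return 0;
}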

Atmn Patel*, University of Waterloo

Johannes Doerfert, Argonne National Laboratory

13:00-13:30
From naive to smart: leveraging offloaded capabilities to enable intelligent NICs – slides, video

We are witnessing a resurgence of interest in SmartNICs, i.e., network interfaces with advanced offloading capabilities. Typically, in this context to be “smart” is to be “programmable” in a broad sense of the term that covers diverse types of hardware, from CPUs to FPGAs. While much of the research regarding the design and use of SmartNICs is focused on the Cloud or datacenters, HPC is taking notice as well, with recent work investigating how SmartNICs might indirectly or directly assist scientific computing. While standard contemporary HPC networks offer some forms of offloading, they are not “smart” in the sense currently implied by the term “SmartNIC”: they do not expose general-purpose compute capabilities to communication middleware or scientific software developers. One way to bring HPC NICs into the fold is to extend current offloaded capabilities with general-purpose compute hardware. In this talk I present an alternative strategy that leverages existing offloaded capabilities — namely, message matching, triggered operations, and atomics — to provide general-purpose compute functionality. I describe lower- and higher-level languages for expressing codes written for this architecture, and provide simulated performance results. Finally, I briefly discuss possible use cases.

Whit Schonbein, Sandia National Laboratories

13:30-14:00
Using Data Processing Units to manage large sets of small files – slides, video

The data flow model in HPC centers is being increasingly challenged by new workflows. Researchers are ingesting vast amounts of data to train AI models, sending input to many small simulations, and performing other tasks many times over large datasets. The results are also increasingly in the form of small individual chunks of data. HPC-center high-performance file systems are straining under the load of small data, and industry databases are architected for socket connections over TCP/IP, leaving HPC interconnects underutilized. Data Processing Units (DPUs) offer the chance to marry the very best of HPC hardware capabilities and industry software to create better solutions for data movement and retrieval. This presentation will detail how these challenges affected Oak Ridge National Laboratory's Gigadock project, described in the paper "Supercomputer-Based Ensemble Docking Drug Discovery Pipeline with Application to Covid-19" [1]. It will cover the challenges and failed attempts at managing datasets with more than one billion entries for use in this project, and early work using DPUs and UCX-Py to address these challenges. [1] Acharya A, Agarwal R, Baker MB, et al. Supercomputer-Based Ensemble Docking Drug Discovery Pipeline with Application to Covid-19. J Chem Inf Model. 2020;60(12):5832-5852. doi:10.1021/acs.jcim.0c01010

Matthew Baker, Oak Ridge National Laboratory

Matthew graduated from East Tennessee State University (ETSU) with a master's degree in applied computer science. His past work has included programming models and high-performance network libraries. His current work includes monitoring systems at large scale and allowing applications to make decisions at run time.

14:00-14:15 Adjourn
Monday, 10 December, 2018

UCX – State Of the Union
Tuesday, 11 December, 2018

UCX – API Discussions
Wednesday, 12 December, 2018

RDMA CORE
8:30 AM Reception, Arm Lobby (all three days)
9:00 AM Keynote
Future of HPC, Steve Poole, Los Alamos National Laboratory
Keynote
UCX and MPI on Astra at Sandia National Laboratory
Keynote
RDMA Core Overview, Jason Gunthorpe
10:00 AM Topic
UCX Roadmap 2019 and UCX 2.0
Abstract
Things we would like to change/optimize/cleanup in next UCP API, and backward compatibility considerations

Speaker
Pasha/ARM and Yossi/Mellanox

Topic
UCT component architecture
Abstract
Split UCT to modules, and load them dynamically, so missing dependencies would disable only the relevant transports

Speaker
Yossi/Mellanox

Topic
RDMA-CM discussion
Speaker
Yossi/Mellanox

Topic
Verbs API send/complete disaggregation API directions

Speaker
Jason Gunthorpe / Mellanox

11:00 AM Topic
Open MPI integration with UCX Updates
Speaker
Mellanox
Topic
UCP Active message API
Abstract
Discuss active messages implementation on UCP level

Speaker
Yossi/Mellanox

Topic
Verbs, DevX and DV / how UCX will be using these featuresSpeaker
Jason Gunthorpe/Mellanox
12:00 PM ** Working Lunch **

Topic
Regression and testing for multiple uarch (x86/Power/ARM) and interconnects (RoCE, iWARP, TCP, etc.)

Speaker
All

** Working Lunch **

Topic
1. Multi-uarch support for various architectures,
2. Internal memcpy (DPDK style)

Speaker
All

** Working Lunch **

Topic
UCX Upstream RDMA-core support status

Speaker
Yossi/Mellanox

Topic
SELinux

Speaker
Daniel Jurgens/Mellanox

1:00 PM Topic
UCX specification and man pages update
Speaker
Brad/AMD
Topic
1. Async progress for protocols,
2. Internal memcpy (DPDK style)
Abstract
Progress various protocols, such as rendezvous, stream, disconnect, RMA/AMO emulation using progress thread

Speaker
All

Topic
1. Verbs ODP MR improvements,
2. RDMA and containers
Speaker
Parav/Mellanox
2:00 PM Topic
Support for shmem signaled put
Abstract
How to support new OpenSHMEM primitive – put with signal

Speaker
Yossi/Mellanox

Topic
OpenMPI BTL over UCT
Abstract
UCT API freeze

Speaker
Nathan/Los Alamos National Laboratory

Topic
Thread safety, fine-grained locking
Abstract
Discuss what is needed in UCP and UCT to support better concurrency than a big global lock

Speaker
Yossi/Mellanox

3:00 PM Topic
MPICH with UCX – State of the union
Speaker
Ken/Argonne National Laboratory
Topic
OpenSHMEM context to UCX worker mapping
Speaker
Manju/Mellanox
Topic
Xpmem support for tag matching
Abstract
Use 1-copy for expected eager messages using UCT tag-offload API

Speaker
Yossi/Mellanox

4:00 PM Topic
OSSS SHMEM with UCX update
Speaker
Tony/Stony Brook University
Topic
UCX GPU Support
Abstract
1. State of the Union by AMD and NVIDIA
2. Datatypes for GPU devices

Speaker
Akshay/NVIDIA, Khaled/AMD, Brad/AMD

Topic
Stream API and close protocol
Abstract
Using stream API as replacement for TCP and considerations of closing/flushing a connection

Speaker
Yossi/Mellanox

5:00 PM Topic
MVAPICH Status
Abstract
Latest development around MVAPICH

Speaker
Ohio State University

Topic
UCX Collectives
Speaker
Khaled/AMD

Topic
UCX + Python bindings

Speaker
Akshay/NVIDIA

Topic
High availability, failover
Abstract
How to implement fabric error recovery by using multiple devices/ports

Speaker
Yossi/Mellanox

6:00 PM Open Discussion (all three days)