Visitor Dr. Teranishi (Sandia National Labs)

On January 21, 2019, Dr. Keita Teranishi, Principal Member of Technical Staff at Sandia National Laboratories, USA, will visit and give the following presentation (precise time and location TBD):

Title: Local Failure Local Recovery Toward Scalable Resilient Parallel Programming Model

Abstract: With the growing scale and complexity of computational systems, HPC applications are increasingly susceptible to a wide variety of hardware and software faults. Yet applications are ill-equipped to deal with the full spectrum of possible faults, and their response, particularly in synchronous programming models, is often disproportionate to the fault rate. As an alternative, Local Failure Local Recovery (LFLR) is based on the notion that fault recovery localized around the point of occurrence is more scalable and efficient than the bulk response that characterizes traditional checkpoint/restart. LFLR is more amenable to asynchronous programming models than to synchronous ones. In this study, we review the existing resilient parallel programming models and then demonstrate the efficiency and scalability of our resilient programming model for traditional message passing and emerging asynchronous many-task programming models.