Robust Heterogeneous Computing for CPS

Muhammad Shafique

In today’s smart era, billions of heterogeneous devices, ranging from embedded to high-end computing machines, are increasingly being integrated to realize complex abundant-data systems that must process and classify massive amounts of data reliably under tight performance and energy constraints. Such robust computing systems are crucial for the infrastructures of individuals, organizations, industries, and nations, and bear significant safety-related, social, and economic impacts. Applications that would greatly benefit belong to domains such as automotive, security, avionics, industrial systems, wearable healthcare devices, and many others from the emerging areas of the Internet of Things (IoT) and Cyber-Physical Systems (CPS).
Computing devices fabricated with nano-scale transistors are susceptible to a wide range of robustness threats such as soft errors, thermal stress, process variations, and diverse aging effects (such as Negative/Positive Bias Temperature Instability, Hot Carrier Injection, and Time-Dependent Dielectric Breakdown). These threats jeopardize the correct execution of applications, leading to functional and timing errors, which can pose catastrophic risks (such as malfunctioning healthcare equipment and automotive crashes) and cause enormous economic losses in financial and banking systems. Therefore, robustness is an extremely important design criterion for computing systems deployed in CPS and IoT.
This talk will provide an overview of important robustness issues, prominent state-of-the-art techniques, and various hardware-software modeling and optimization techniques developed by my team. A key focus will be on bridging the gap between hardware and software to achieve highly accurate reliability models for the higher system layers at different levels of granularity. Such models provide a foundation for developing and employing diverse robustness optimizations at different system layers. Towards the end, this talk will shed light on the dark silicon problem and how it can be leveraged to explore new challenges and opportunities for the design and management of thermally constrained computing systems, with the aim of improving quality metrics (reliability, performance, etc.) within peak power and thermal constraints.