With many different components, communication layers, and custom hardware accelerators to manage, a dedicated Operating System (OS) and Runtime (RTM) layer is required to maximize overall performance. This is the focus of this research area.
We are building a complex, hierarchical computing infrastructure built around FPGAs as its basic computational blocks. Research focuses on the interconnections both within and among the components of the system.
The workload must be spread effectively among all the available nodes in the datacenter. What we envision is a system in which every node runs a thin OS layer capable of receiving data and reconfiguring its FPGAs at runtime, depending on the currently executing HPC workload.
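The following is a minimal sketch of what such a per-node runtime loop could look like. The class and method names (NodeRuntime, load_bitstream) and the in-process task queue are hypothetical placeholders: a real thin OS layer would receive work over the datacenter network and call vendor-specific reconfiguration APIs.

```python
import queue

class NodeRuntime:
    """Hypothetical thin per-node runtime: accepts tasks and reconfigures the FPGA on demand."""

    def __init__(self):
        self.tasks = queue.Queue()
        self.loaded_bitstream = None

    def load_bitstream(self, path):
        # Placeholder for a vendor-specific FPGA reconfiguration call.
        print(f"reconfiguring FPGA with {path}")
        self.loaded_bitstream = path

    def submit(self, kernel_bitstream, payload):
        self.tasks.put((kernel_bitstream, payload))

    def run_once(self):
        kernel_bitstream, payload = self.tasks.get()
        # Reconfigure only when the requested accelerator differs
        # from the one currently loaded on the FPGA.
        if kernel_bitstream != self.loaded_bitstream:
            self.load_bitstream(kernel_bitstream)
        print(f"executing payload of {len(payload)} bytes")

if __name__ == "__main__":
    node = NodeRuntime()
    node.submit("fft_kernel.bit", b"\x00" * 1024)
    node.run_once()
```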
The mapping done at design time is an estimate of the best possible solution. Being an estimate, it may deviate from the optimal solution, which can be known only at runtime. Moreover, in a large system nodes fail from time to time, and their assigned workload must be reassigned to other nodes. For these reasons, the DRTM must load-balance the workload among the available nodes at runtime, for the sake of both efficiency and reliability.
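As an illustration of the rebalancing idea (not the DRTM itself), the sketch below reassigns the tasks of a failed node to the least-loaded surviving node. The node names and the load metric (queue length) are assumptions made for the example.

```python
class LoadBalancer:
    """Toy runtime load balancer: tracks per-node task lists and rebalances on failure."""

    def __init__(self, nodes):
        self.assignments = {n: [] for n in nodes}

    def least_loaded(self):
        # Use queue length as a stand-in for a real load metric.
        return min(self.assignments, key=lambda n: len(self.assignments[n]))

    def assign(self, task):
        self.assignments[self.least_loaded()].append(task)

    def handle_failure(self, failed_node):
        # Remove the failed node and redistribute its pending tasks.
        orphaned = self.assignments.pop(failed_node, [])
        for task in orphaned:
            self.assign(task)

if __name__ == "__main__":
    lb = LoadBalancer(["node0", "node1", "node2"])
    for i in range(9):
        lb.assign(f"task{i}")
    lb.handle_failure("node1")
    print(lb.assignments)
```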
As a computing system scales up, the probability that one of its components fails at any given moment grows with that scale. This implies that any computation running in the system might fail, potentially compromising the overall result. For this reason, a Fault Management system must be in place to recover effectively from such situations.
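One common recovery pattern is checkpoint-and-retry; the sketch below shows it under that assumption, with a deliberately flaky step standing in for a node fault. It is not meant to describe the actual recovery policy of the system.

```python
import pickle
import random

def run_with_recovery(step_fn, state, max_retries=3):
    """Checkpoint the state, then re-execute the step from the checkpoint on failure."""
    checkpoint = pickle.dumps(state)          # save state before the step
    for attempt in range(max_retries):
        try:
            return step_fn(pickle.loads(checkpoint))
        except RuntimeError as err:
            print(f"attempt {attempt + 1} failed: {err}; restoring checkpoint")
    raise RuntimeError("step failed after all retries")

def flaky_step(state):
    # Hypothetical step that fails randomly to emulate a node fault.
    if random.random() < 0.5:
        raise RuntimeError("node fault")
    state["done"] = True
    return state

if __name__ == "__main__":
    print(run_with_recovery(flaky_step, {"done": False}))
```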
This research focuses on how to monitor the performance of a large-scale computing system through metrics such as power consumption, throughput, and network utilization. This is the first step toward implementing any kind of adaptive performance system, which is another active research area.
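A minimal monitoring sketch is shown below: each node periodically reports a few metrics to a central collector. The metric values here are simulated stand-ins; in practice they would come from board sensors, performance counters, and network statistics.

```python
import random
import time

def sample_node_metrics(node_id):
    """Return one metrics sample for a node (values simulated for the example)."""
    return {
        "node": node_id,
        "power_w": round(random.uniform(40, 80), 1),         # simulated power draw
        "throughput_gbps": round(random.uniform(1, 10), 2),   # simulated throughput
        "net_util_pct": round(random.uniform(0, 100), 1),     # simulated network utilization
    }

def collect(nodes, interval_s=1.0, rounds=2):
    # A real collector would persist these samples (e.g. in a time-series store)
    # and feed them to an adaptive performance layer.
    for _ in range(rounds):
        snapshot = [sample_node_metrics(n) for n in nodes]
        print(snapshot)
        time.sleep(interval_s)

if __name__ == "__main__":
    collect(["node0", "node1"])
```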