Distributed Runtime Management (DRTM)
The workload must be spread effectively across all the available nodes in the datacenter. We envision a system in which every node has a thin OS layer capable of receiving data and reconfiguring FPGAs at runtime, depending on the HPC workload currently executing.
Inter-cluster runtime load balancing (LB)
The mapping done at design time is an estimate of the best possible solution. Being an estimate, it may deviate from the optimal solution, which can only be known at runtime. Moreover, in a large system nodes fail from time to time, and their assigned workload must be reassigned to other nodes. For these reasons, the DRTM must balance the workload among the available nodes at runtime, for the sake of both efficiency and reliability.
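The reassignment step can be illustrated with a minimal sketch. It is not the DRTM's actual policy: it assumes a hypothetical design-time mapping given as a dictionary from node to task list, hypothetical per-task cost estimates, and a simple greedy heuristic that hands each orphaned task, most expensive first, to the currently least-loaded surviving node.

```python
import heapq


def rebalance(assignment, costs, failed):
    """Reassign tasks of failed nodes to the least-loaded surviving nodes.

    assignment: dict mapping node name -> list of task ids (hypothetical
                design-time mapping)
    costs:      dict mapping task id -> estimated cost (hypothetical)
    failed:     set of node names that have failed
    """
    # Copy the task lists of surviving nodes so the input is not mutated.
    survivors = {n: list(t) for n, t in assignment.items() if n not in failed}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the workload")

    # Min-heap of (current load, node) so the least-loaded node pops first.
    heap = [(sum(costs[t] for t in tasks), n) for n, tasks in survivors.items()]
    heapq.heapify(heap)

    # Collect the tasks that lost their node.
    orphaned = [t for n in failed for t in assignment.get(n, [])]

    # Greedy placement: most expensive orphaned task first.
    for task in sorted(orphaned, key=lambda t: costs[t], reverse=True):
        load, node = heapq.heappop(heap)
        survivors[node].append(task)
        heapq.heappush(heap, (load + costs[task], node))
    return survivors


mapping = {"n0": ["a", "b"], "n1": ["c"], "n2": ["d"]}
task_costs = {"a": 4, "b": 2, "c": 1, "d": 3}
new_mapping = rebalance(mapping, task_costs, failed={"n0"})
```

In this example node n0 fails, and its two tasks are split between n1 and n2 so that both end up with the same estimated load. A real system would also have to account for data movement and FPGA reconfiguration cost, which this sketch ignores.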
Fault Management (FM)
When computing systems scale up, the probability that at least one of their components fails at any given moment grows roughly in proportion to the size of the system. This implies that any computation running in the system might fail, potentially compromising the overall result. For this reason, a Fault Management system must be in place to recover effectively from such situations.
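The scaling argument can be made concrete under a simplifying assumption that is ours, not the source's: the system has $n$ components, each failing independently with the same small probability $p$ in a given time interval.

```latex
\[
  P_{\text{fail}}(n) \;=\; 1 - (1 - p)^{n} \;\approx\; n\,p
  \qquad \text{for } n\,p \ll 1 .
\]
```

Under this assumption the failure probability grows approximately linearly with $n$ while $np$ is small, and it approaches 1 as the system grows further, which is why failures have to be treated as a normal operating condition at scale.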
Performance Monitoring (PM)
This research focuses on how to monitor the performance of a large-scale computing system through measures such as power consumption, throughput, and network-related metrics. Monitoring is the first step toward implementing any kind of adaptive performance system, which is itself another active research area.