Operating Systems and Runtime

With many different components, communication layers and custom hardware accelerators to manage, a dedicated Operating System (OS) and Runtime (RTM) layer is required to maximize overall performance. This is the focus of this research area.

We are building a complex, hierarchical computing infrastructure that revolves around FPGAs as its basic computational blocks. Research focuses on the interconnections between and among the components of the system.

Selected topics

Distributed Runtime Management (DRTM)

The workload must be spread effectively among all the available nodes in the datacenter. We envision a system where every node has a thin OS layer capable of receiving data and reconfiguring FPGAs at runtime, depending on the currently executing HPC workload.

Inter-cluster runtime load balancing (LB)

The mapping done at design time is an estimate of the best possible solution. Being an estimate, it might differ from the optimal solution, which can be known only at runtime. Moreover, in a large system, nodes fail every now and then, and their assigned workload must be reassigned to another node. For these reasons, the DRTM must balance the workload among the available nodes at runtime, for the sake of efficiency and reliability.
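A minimal sketch of the reassignment step, assuming a greedy policy that hands each orphaned task to the currently least-loaded surviving node. The function name, the task representation as (name, cost) pairs and the policy itself are illustrative assumptions, not the project's actual DRTM design.

```python
import heapq


def reassign_failed(assignments, failed):
    """Move tasks from failed nodes to the least-loaded surviving nodes.

    assignments: dict mapping node name -> list of (task_name, cost) tuples
                 (hypothetical representation, chosen for illustration).
    failed: set of node names that are no longer available.
    Returns a new dict containing only the surviving nodes.
    """
    survivors = {n: list(t) for n, t in assignments.items() if n not in failed}
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the workload")

    # Min-heap of (current load, node): the least-loaded node pops first.
    heap = [(sum(cost for _, cost in tasks), node)
            for node, tasks in survivors.items()]
    heapq.heapify(heap)

    # Place the most expensive orphaned tasks first (greedy heuristic).
    orphans = [task for n in failed for task in assignments.get(n, [])]
    for task in sorted(orphans, key=lambda t: t[1], reverse=True):
        load, node = heapq.heappop(heap)
        survivors[node].append(task)
        heapq.heappush(heap, (load + task[1], node))
    return survivors
```

A real balancer would also migrate tasks between healthy nodes when the design-time estimate turns out to be off; the sketch covers only the failure case.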

Fault Management (FM)

As computing systems scale up, the probability that at least one of their components fails at any given moment increases roughly in proportion to the system's size. This implies that any computation running in the system might fail, potentially compromising the overall computation. For this reason, a Fault Management system must be in place to recover effectively from such situations.
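The scaling argument can be made concrete with a small calculation. This sketch assumes that components fail independently and with the same probability p over the time window of interest, which is a simplification introduced here for illustration; under it, the probability that at least one of n components fails is 1 - (1 - p)^n, which is close to n * p while n * p stays small.

```python
def prob_any_failure(p, n):
    """Probability that at least one of n components fails.

    Assumes independent failures with identical per-component
    probability p (an illustrative simplification).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Compare the exact value with the linear approximation n * p.
    for n in (10, 100, 1000, 10000):
        print(n, prob_any_failure(1e-4, n), n * 1e-4)
```

For large n the exact value saturates towards 1, so the linear approximation only holds for the small-probability regime.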

Performance Monitoring (PM)

This research focuses on how to monitor the performance of a large-scale computing system through measures such as power, throughput and network-related metrics. This is the first step towards implementing any kind of adaptive performance system, another active research area.
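As a rough illustration of the kind of building block such monitoring needs, the sketch below keeps a sliding window of recent samples per node and metric and reports their mean. The class name, the metric names and the window size are hypothetical and chosen only for this example.

```python
from collections import defaultdict, deque


class MetricWindow:
    """Sliding window of recent samples per (node, metric) pair."""

    def __init__(self, size=60):
        # Each window keeps at most `size` samples; older ones are dropped.
        self.samples = defaultdict(lambda: deque(maxlen=size))

    def record(self, node, metric, value):
        """Store one sample, e.g. record("node0", "power_w", 41.5)."""
        self.samples[(node, metric)].append(float(value))

    def mean(self, node, metric):
        """Mean of the samples in the window, or None if there are none."""
        window = self.samples.get((node, metric))
        if not window:
            return None
        return sum(window) / len(window)
```

An adaptive layer could poll `mean` periodically and react when a value crosses a threshold, for instance by triggering a rebalancing step.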

...and many more!


Think you might be interested in an exaFPGA-related thesis? Just drop an email to Riccardo Cattaneo!
(and have a look at the people and contacts section!)