OmpSs-2@Cluster

OmpSs-2@Cluster extends OmpSs-2 [1] to support transparent offloading of tasks for execution on other nodes. Any OmpSs-2 program with a full specification of task dependencies is compatible with OmpSs-2@Cluster. The programming model inherits sequential semantics, tasks with dependencies, and a common address space from OmpSs-2, and it uses the advanced OmpSs-2 dependency system to offload work concurrently with task execution. The Nanos6@Cluster runtime system builds the distributed dependency graph, tracks dependencies among tasks, schedules tasks for execution, and performs data transfers as necessary. In DEEP-SEA, we implemented numerous optimizations to achieve scalability on up to 16 nodes [2], and we extended the programming model to allow subtask memory allocation and improve programmability [3]. Figure 1 illustrates the OmpSs-2@Cluster architecture and components.
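
As an illustration, below is a minimal sketch of the kind of code that is compatible with OmpSs-2@Cluster: a plain C OmpSs-2 program whose tasks declare their full data footprint, so the runtime can offload them and transfer data transparently. The #pragma oss directives and the [start;size] array-section notation follow the OmpSs-2 specification; the block size and function name are illustrative.

    #include <stdlib.h>

    #define N  4096
    #define BS 512   /* illustrative block size */

    /* Each task declares its complete data footprint via in/out
     * clauses, so the runtime can build the distributed dependency
     * graph and, under OmpSs-2@Cluster, execute tasks on any node. */
    void vector_add(const double *a, const double *b, double *c)
    {
        for (int i = 0; i < N; i += BS) {
            #pragma oss task in(a[i;BS], b[i;BS]) out(c[i;BS])
            for (int j = i; j < i + BS; j++)
                c[j] = a[j] + b[j];
        }
        /* Wait for all tasks, local or offloaded, to complete. */
        #pragma oss taskwait
    }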

Dynamic load balancing using OmpSs-2@Cluster

Load imbalance is an important source of inefficiency in high-performance computing. In DEEP-SEA, we automatically improve the load balance of MPI + OmpSs-2 programs by employing OmpSs-2@Cluster to offload tasks among nodes [4]. The runtime system starts additional processes on each node that are able to execute offloaded tasks. If the program is already load balanced, it executes as normal; if the loads on different nodes are imbalanced, OmpSs-2@Cluster offloads tasks to balance them. We combine a work-stealing scheduler with BSC's Dynamic Load Balancing (DLB) library, which reallocates compute resources on each node. Figure 2 shows the OmpSs-2@Cluster load balancing scheme.
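
The scheme is transparent to the application source code. The following hedged sketch shows a hybrid MPI + OmpSs-2 program of the kind this applies to; the kernel, block size, and the artificial imbalance (ranks owning different numbers of blocks) are illustrative. Because every task carries full dependency information, the runtime may execute it locally or offload it to a helper process on a less loaded node, while DLB shifts cores between the processes on each node.

    #include <mpi.h>
    #include <stdlib.h>

    static void process_block(double *block, int bs)
    {
        for (int i = 0; i < bs; i++)
            block[i] *= 2.0;   /* illustrative per-block kernel */
    }

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Artificial imbalance: each rank owns a different number
         * of blocks, so some nodes have more work than others. */
        int bs = 512, nblocks = 64 * (rank + 1);
        double *data = calloc((size_t)nblocks * bs, sizeof(double));

        for (int b = 0; b < nblocks; b++) {
            /* Fully specified dependencies allow the runtime to run
             * each task locally or offload it to another node. */
            #pragma oss task inout(data[b * bs; bs])
            process_block(&data[b * bs], bs);
        }
        #pragma oss taskwait

        free(data);
        MPI_Finalize();
        return 0;
    }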

Malleability

In DEEP-SEA, we are extending OmpSs-2@Cluster to transparently support node-level malleability, i.e., to allow a running job to dynamically add and remove nodes and use their computational resources. All interaction with Slurm and the MPI library, as well as application data redistribution, is handled by the Nanos6@Cluster runtime library. We have prototyped an implementation based on MPI_Comm_spawn and the MPI world model [2], and we will move to MPI Sessions in the second half of the project. This approach enables an application to become malleable without any changes to the application code.
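
As an illustration of the world-model prototype, the hedged sketch below shows the MPI spawn-and-merge step that a malleable runtime can perform internally when new nodes are granted: it spawns additional runtime processes and merges the resulting inter-communicator into a single job-wide intra-communicator. The function name, worker binary name, and process count are illustrative; the real Nanos6@Cluster implementation additionally coordinates with Slurm and redistributes application data.

    #include <mpi.h>

    /* Sketch: grow the job by new_procs processes and return a new
     * job-wide communicator covering old and new processes. */
    MPI_Comm grow_world(MPI_Comm current, int new_procs)
    {
        MPI_Comm intercomm, merged;

        /* Launch additional runtime processes; a real system would
         * pass an MPI_Info object to place them on the new nodes. */
        MPI_Comm_spawn("nanos6_worker", MPI_ARGV_NULL, new_procs,
                       MPI_INFO_NULL, 0, current, &intercomm,
                       MPI_ERRCODES_IGNORE);

        /* Merge the parent/child inter-communicator into one
         * intra-communicator spanning all processes. */
        MPI_Intercomm_merge(intercomm, /* high = */ 0, &merged);
        MPI_Comm_free(&intercomm);

        return merged;   /* replaces the previous communicator */
    }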

References

  1. Jimmy Aguilar Mena, Omar Shaaban, Vicenç Beltran, Paul Carpenter, Eduard Ayguade, and Jesus Labarta. OmpSs-2@Cluster: Distributed memory execution of nested OpenMP-style tasks. European Conference on Parallel Processing, Euro-Par 2022.
    https://doi.org/10.1007/978-3-031-12597-3_20.
  2. Jimmy Aguilar Mena. Methodology for malleable applications on distributed memory systems. PhD thesis (on deposit). Advisor: Paul M. Carpenter.
  3. Omar Shaaban, Jimmy Aguilar Mena, Vicenç Beltran, Paul Carpenter, Eduard Ayguade, and Jesus Labarta. Automatic aggregation of subtask accesses for nested OpenMP-style tasks. SBAC-PAD 2022.
  4. Jimmy Aguilar Mena, Omar Shaaban, Victor Lopez, Marta Garcia, Paul Carpenter, Eduard Ayguade, and Jesus Labarta. Transparent load balancing of MPI programs using OmpSs-2@Cluster and DLB. ICPP 2022.
    Presentation available at https://www.youtube.com/watch?v=Bnyr_IWhyOo.

Additional Links

BSC OmpSs-2@Cluster
GitHub