Dynamic Load Balancing (DLB)

DLB optimises the performance of hybrid (MPI + OpenMP/OmpSs) parallel applications and maximises their utilisation of computational resources. It is a dynamic library that is transparent to the user and requires no modification of the application code. DLB interacts with several levels of a typical HPC software stack, as illustrated in Figure 1.

DLB improves the load balance of hybrid applications (see Figure 2) by managing the number of threads used on the compute nodes. The library is compatible with MPI, OpenMP and OmpSs. Since version 3, it includes three modules that implement orthogonal balancing methods and policies: Lend When Idle (LeWI), Dynamic Resource Ownership Management (DROM) and Tracking Application Live Performance (TALP), described below.

LeWI: Lend When Idle

This module optimises the performance of hybrid applications without requiring prior analysis or code modifications. It improves the load balance of the outer level of parallelism by redistributing computational resources at the inner level of parallelism. This readjustment happens dynamically at runtime, which allows DLB to react to different sources of imbalance: algorithm, data, hardware architecture and resource availability, among others. Because DLB redistributes computational resources at runtime according to the instantaneous demand, it can improve performance in several situations:

  • Hybrid applications with an imbalance problem at the outer level of parallelism
  • Hybrid applications with an imbalance problem at the inner level of parallelism
  • Hybrid applications with serialized parts of the code
  • Multiple applications with different parallelism patterns

Figure 3 shows an example of LeWI. The application runs two MPI processes on one compute node, each with two OpenMP threads. When MPI process 1 reaches a blocking MPI call, it lends its assigned CPUs (numbers 1 and 2) to MPI process 2, which is running on the same node. This allows MPI process 2 to finish its computation faster.
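The effect of lending can be illustrated with a small toy model. The sketch below does not use DLB or its API. It assumes perfectly divisible work, ideal scaling with the CPU count, and an even split of lent CPUs among the processes still computing. The function names and work values are hypothetical and chosen only for illustration.

```python
def makespan_static(work, cpus_per_process):
    """Finish time when every process keeps only its own CPUs."""
    return max(w / cpus_per_process for w in work)


def makespan_with_lending(work, cpus_per_process):
    """Finish time when a finished process lends its CPUs to the others.

    Toy assumptions: lent CPUs are shared evenly among the processes that
    are still computing, and work scales perfectly with the CPU count.
    """
    total_cpus = cpus_per_process * len(work)
    remaining = [float(w) for w in work if w > 0]
    elapsed = 0.0
    while remaining:
        rate = total_cpus / len(remaining)  # CPUs per active process
        step = min(remaining) / rate        # time until the next process finishes
        elapsed += step
        remaining = [r - step * rate for r in remaining]
        remaining = [r for r in remaining if r > 1e-12]
    return elapsed


# Hypothetical work units for MPI process 1 and MPI process 2.
work = [4.0, 12.0]
print(makespan_static(work, cpus_per_process=2))        # 6.0
print(makespan_with_lending(work, cpus_per_process=2))  # 4.0
```

In this model, process 1 finishes after 2 time units and lends its two CPUs. Process 2 then completes its remaining 8 work units on four CPUs, so the node finishes at time 4 instead of 6.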

DROM: Dynamic Resource Ownership Management

This module allows computational resources to be reassigned from one process to another depending on demand. DROM offers an API that can be used by an external entity, such as a job scheduler or resource manager. With the DROM API, a CPU can be removed from a running application and given to another one:

  • To a new application to allow co-location of applications
  • To an existing application to speed up its execution

TALP: Tracking Application Live Performance

TALP is a lightweight, portable, extensible and scalable tool for online parallel performance measurement. The efficiency metrics reported by TALP allow HPC users to evaluate the parallel efficiency of their executions, both post-mortem and at runtime. The API that TALP provides allows the running application or a resource manager to collect performance metrics at runtime, which makes it possible to adapt the execution based on the metrics collected dynamically. The set of metrics collected by TALP is well defined, independent of the tool, and consolidated.
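The text does not list the individual metrics, so the sketch below assumes the widely used POP-style definitions, in which parallel efficiency is the product of load balance and communication efficiency. It does not call TALP. The function name `efficiency_metrics` and the timing values are hypothetical.

```python
def efficiency_metrics(useful_times, elapsed_time):
    """Compute POP-style efficiency metrics from per-process useful time.

    useful_times: seconds each process spent in useful computation.
    elapsed_time: wall-clock seconds of the measured region.
    """
    average = sum(useful_times) / len(useful_times)
    slowest = max(useful_times)
    return {
        # How evenly useful work is spread across processes.
        "load_balance": average / slowest,
        # Share of the elapsed time the busiest process spent computing.
        "communication_efficiency": slowest / elapsed_time,
        # Product of the two metrics above, simplified algebraically.
        "parallel_efficiency": average / elapsed_time,
    }


# Hypothetical measurements for four MPI processes.
metrics = efficiency_metrics([8.0, 6.0, 4.0, 6.0], elapsed_time=10.0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these example values, a load balance of 0.75 indicates that the imbalance between processes costs more than communication does, which is the kind of situation LeWI is designed to address.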

DLB in DEEP-SEA

The DLB library is being extended to support OpenMP applications through OMPT, the tools interface provided by the OpenMP standard. A new API will make it possible to enforce resource assignments and to enable malleability at node level. This API is intended for use by resource managers, such as Slurm, or by other runtime systems.

References

  1. Dynamic Load Balancing for Hybrid Applications. PhD Thesis of Marta Garcia-Gasulla, 2017.
    Available at https://pm.bsc.es/ftp/dlb/doc/tesi_mgg.pdf
  2. Hints to improve automatic load balancing with LeWI for hybrid applications, ICPP 2009.
    Available at https://pm.bsc.es/ftp/dlb/doc/LeWI_ICPP09.pdf
  3. DROM: Enabling Efficient and Effortless Malleability for Resource Managers, ICPP 2018.
    Available at https://pm.bsc.es/ftp/dlb/doc/drom_preprint.pdf
  4. TALP: A Lightweight Tool to Unveil Parallel Efficiency of Large-scale Executions, PERMAVOST 2021.
    Available at https://pm.bsc.es/ftp/dlb/doc/TALP_PERMAVOST21.pdf

Additional Links

BSC Dynamic Load Balancing