The MSA provides unprecedented flexibility, efficiency and performance by combining modules with different characteristics. Moreover, some modules can themselves be heterogeneous, combining different computing, memory and network devices on the same node. These two levels of heterogeneity, intra-node and inter-node, are hard to exploit with programming models that rely only on the traditional fork-join and Single Program Multiple Data (SPMD) execution models. In the DEEP projects, we have developed a hybrid programming model that uses the OmpSs-2 dataflow execution model to orchestrate computations and memory transfers between multi-core CPUs and accelerators inside a node, together with inter-node communications using MPI.
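
As a rough illustration of the dataflow execution model mentioned above, the following minimal sketch annotates three ordinary C functions as OmpSs-2 tasks. The function names (`init`, `scale`, `sum`, `pipeline`) are hypothetical and chosen only for this example. The `in`, `out` and `inout` clauses declare what each task reads and writes, and an OmpSs-2 toolchain is expected to derive the execution order from them. A standard C compiler ignores the pragmas and runs the same code serially, which should give the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Producer: fills the vector with 0, 1, 2, ... */
static void init(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (double)i;
}

/* Transformer: scales the vector in place. */
static void scale(double *x, size_t n, double factor)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= factor;
}

/* Consumer: reduces the vector to a single value. */
static void sum(const double *x, size_t n, double *result)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    *result = acc;
}

/* Hypothetical driver: the three calls become tasks ordered only by the
 * data they declare, not by an explicit barrier between them. */
double pipeline(double *x, size_t n, double factor)
{
    double result = 0.0;
    assert(x != NULL && n > 0);

    #pragma oss task out(x[0;n])
    init(x, n);

    #pragma oss task inout(x[0;n])
    scale(x, n, factor);

    #pragma oss task in(x[0;n]) shared(result) out(result)
    sum(x, n, &result);

    /* Wait for all tasks created above before reading the result. */
    #pragma oss taskwait
    return result;
}
```

Calling `pipeline(buf, 4, 2.0)` on a four-element buffer fills it with 0, 2, 4, 6 and returns their sum. The point of the design is that the programmer states data accesses rather than synchronisation points, leaving the ordering to the runtime.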

CUDA Support

We have extended OmpSs-2 to support CUDA C kernels that can be invoked like regular tasks, easing the development of hybrid applications. The runtime transparently manages the synchronisation between CUDA C kernels and other tasks. If CUDA Unified Memory (UM) is used, the hardware manages the memory transfers. Otherwise, the OmpSs-2 runtime uses a software directory and cache to explicitly manage the memory copies between the host and the accelerator in both directions. CUDA-enabled libraries such as cuBLAS or cuFFT are also supported. In DEEP-SEA, we are developing a new high-performance implementation of the directory and cache to support fine-grained CUDA C kernels.

Task-Aware MPI

A clean integration of OmpSs-2 with ParaStation MPI was achieved using the Task-Aware MPI (TAMPI) library. This library improves the interoperability between task-based programming models and MPI by allowing both blocking and non-blocking MPI operations inside tasks. On the one hand, the library avoids the potential deadlocks that can arise when blocking MPI primitives are called inside tasks. On the other hand, non-blocking primitives are directly integrated into the dataflow model: the release of a task's dependencies is linked to the completion of all the non-blocking MPI operations that have been executed inside it.
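
The link between dependency release and the completion of non-blocking operations can be pictured with a small counter model. The sketch below is a toy, not TAMPI's or the runtime's actual API: the `toy_task` type and every `toy_*` function are hypothetical names introduced only for illustration. It assumes a task holds one pending event for its own body plus one for each non-blocking request bound to it, and that its dependencies are released only when the counter reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a runtime task descriptor. */
typedef struct {
    int pending_events; /* 1 for the running body + 1 per bound request */
    bool released;      /* true once successor tasks may start */
} toy_task;

/* The task starts running: only its own body is pending. */
void toy_task_start(toy_task *t)
{
    t->pending_events = 1;
    t->released = false;
}

/* Called where a task would hand a non-blocking request to the library. */
void toy_bind_request(toy_task *t)
{
    assert(!t->released);
    t->pending_events++;
}

/* Shared bookkeeping: release dependencies when nothing is pending. */
static void toy_decrease(toy_task *t)
{
    assert(t->pending_events > 0);
    if (--t->pending_events == 0)
        t->released = true;
}

/* The task body returns; bound requests may still be in flight. */
void toy_body_finished(toy_task *t)
{
    toy_decrease(t);
}

/* A progress engine detects that one bound request has completed. */
void toy_request_completed(toy_task *t)
{
    toy_decrease(t);
}
```

In this model, a task that binds two requests and then returns is not yet released. Its successors become runnable only after both requests complete, so the core is free for other tasks in the meantime.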

Multi-Core Support

In DEEP-SEA, we have augmented the OmpSs-2 runtime system with an online monitoring infrastructure that gathers real-time information about the performance and energy consumption of application tasks [1]. The information provided by this framework has been used to improve the integration of the OmpSs-2 runtime with the DLB library [2].

We have also extended OmpSs-2 with a user-friendly API to allocate memory on NUMA systems. The information provided by the monitoring infrastructure and the memory placement API has been used to enhance the scheduling policies to maximise application performance and minimise overall energy consumption by scheduling each task to the best-suited available computing device [3].


  1. Combining Dynamic Concurrency Throttling with Voltage and Frequency Scaling on Task-based Programming Models. ICPP 2021.
  2. Enhancing Resource Management Through Prediction-Based Policies. Euro-Par 2020.
  3. Combining Dynamic Concurrency Throttling with Voltage and Frequency Scaling on Task-based Programming Models. ICPP 2021.

Additional Links

BSC OmpSs-2
GitHub