

## **Towards Highly Efficient Heterogeneous Supercomputers - the DEEP Approach**

Hans-Christian Hoppe, Jülich Supercomputing Centre

Plasma Physics Towards the ExaScale Era workshop at HiPEAC 2024



### **Road to Exascale – Slower than Expected**



#### Top #1: HPL Rpeak [PFLOP/s]



**1997:** First **1 TFlop/s** computer (*ASCI Red/9152*)

**2008:** First **1 PFlop/s** computer (*Roadrunner*)

So... first 1 EFlop/s computer: 2018?

- Well... not really
- It took 4 years longer: *Frontier* appeared in **2022**

### **Exascale Challenges**

Application parallelism

- Applications must support billions of individual threads
- Lower-scaling applications / parts of applications should not run on a full Exascale system

Truly scalable systems

- Huge numbers of devices need to exchange data with each other
- Collective communication operations slow down as system sizes grow
- Network contention and reliability become worries

Energy efficiency

- Accelerators clearly beat CPUs for many (most?) codes
- System heterogeneity is a must
- Yet portable accelerator programming is hard

Memory and storage

- Ever growing gap between compute throughput and memory bandwidth
- New technologies like HBM suffer from capacity limitations & high energy consumption

Workload diversity

- Exascale centers must run a wide variety of HPC, AI and data analytics workloads with highest energy efficiency
- One size does not fit all














### **Heterogeneous Systems – HPC Centre View**







#### Accelerated Nodes + Special Nodes



Different workloads need different CPU vs. accelerator ratios

- Statically configured systems are always a compromise
  - "Dark" silicon consumes energy for nothing on some workloads
  - Achievable performance is restricted on other workloads
- Adding "special" nodes only helps so much ...

What is really wanted: the ability to compose arbitrary mixes of CPUs plus accelerators







### **DEEP Series of Prototype Systems**





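| Prototype | Nodes                                       | Interconnect                            | Performance |
|-----------|---------------------------------------------|-----------------------------------------|-------------|
| DEEP      | 128 Xeon + 284 KNC                          | InfiniBand + 1.5 Gbit Extoll            | 550 TFlop/s |
| DEEP-ER   | 16 Xeon + 8 KNL                             | 100 Gbit Extoll                         | 40 TFlop/s  |
| DEEP-EST  | 55 Cluster + 75 Booster + 16 Data Analytics | 100 Gbit Extoll + InfiniBand + Ethernet | 800 TFlop/s |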



### **Heterogeneous Systems – Application View**



- Space Weather simulation
  - Simulates plasma produced in solar eruptions and its interaction with the Earth magnetosphere
  - Particle-in-Cell (PIC) code
  - Authors: KU Leuven
- Two solvers:
  - Field solver: Computes electromagnetic (EM) field evolution
    - Limited code scalability
    - Frequent, global communication
  - Particle solver: Calculates motion of charged particles in EM-fields
    - Highly parallel
    - Billions of particles
    - Long-range communication



**A. Kreuzer,** J. Amaya, N. Eicker, E. Suarez, *"Application performance on a Cluster-Booster system"*, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), HCW (20th International Heterogeneity in Computing Workshop), Vancouver (2018), pp. 69-78. doi: 10.1109/IPDPSW.2018.00019



### xPic – Small Scale Performance Results

Figure: results for the Cluster, Booster and Cluster+Booster (C+B) configurations

- Particle solver: 1.35× faster on Booster
- Field solver: 6× faster on Cluster
- Overall performance gain:
  - 1 node: 28% gain compared to Cluster alone, 21% gain compared to Booster alone
  - 8 nodes: 38% gain compared to Cluster only, 34% gain compared to Booster only
- 3%-4% overhead per solver for C+B communication (point to point)
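The per-solver speedups above suggest a simple way to reason about why the split run wins. The following is a minimal sketch, not the model used in the cited paper: it assumes the two solvers run one after the other with no overlap, takes the 1.35× and 6× factors and the upper 4% communication overhead from the results above, and uses made-up Cluster runtimes (`field_s`, `particle_s`) purely as placeholders. It is not expected to reproduce the measured gains.

```python
# Toy runtime model for a two-solver code on a Cluster-Booster system.
# Speedup factors and overhead are taken from the slide; the input
# runtimes are hypothetical placeholders, not measurements.

PARTICLE_SPEEDUP_ON_BOOSTER = 1.35
FIELD_SPEEDUP_ON_CLUSTER = 6.0
CB_COMM_OVERHEAD = 0.04  # upper end of the 3%-4% per-solver overhead


def modelled_runtimes(field_s, particle_s):
    """Return (cluster_only, booster_only, cluster_plus_booster) in seconds.

    field_s and particle_s are the solver runtimes on the Cluster.
    The solvers are assumed to run sequentially, with no overlap.
    """
    particle_on_booster = particle_s / PARTICLE_SPEEDUP_ON_BOOSTER
    field_on_booster = field_s * FIELD_SPEEDUP_ON_CLUSTER

    cluster_only = field_s + particle_s
    booster_only = field_on_booster + particle_on_booster
    # Each solver runs where it is fastest; both pay the C+B overhead.
    split = (field_s + particle_on_booster) * (1.0 + CB_COMM_OVERHEAD)
    return cluster_only, booster_only, split


if __name__ == "__main__":
    c, b, cb = modelled_runtimes(field_s=10.0, particle_s=90.0)
    print(f"Cluster only: {c:.1f} s, Booster only: {b:.1f} s, C+B: {cb:.1f} s")
```

In this model the split configuration wins whenever the time saved by running each solver on its preferred module exceeds the communication overhead.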

### xPic – Strong Scaling Behaviour



Figure: Variable-ratio modular strong scaling (x-axis: number of Booster nodes, with 4 Cluster nodes)

| Parameter               | Value        |
|-------------------------|--------------|
| #cells per node         | 36864        |
| #particles per cell     | 1024         |
| #blocks per MPI process | 12, 32 or 64 |

- JSC JURECA system: Intel® Xeon® plus Intel® Xeon Phi™ (KNL)
- Code portions can be scaled up independently
  - Particles scale almost linearly on the Booster
  - Fields kept constant on the Cluster (4 Cluster nodes)
- A configuration is reached where the same time is spent on Cluster and Booster
  - An additional 2× time saving can be reached by co-scheduling "matching" xPic jobs
### Integrated Exascale-Ready SW Stack





#### Heterogeneous / Modular Hardware

Public release at https://gitlab.jsc.fz-juelich.de/deep-sea/wp3/software/easybuild-repository-deep-sea

### **Seven Co-Design Applications**







### **Application Mapping Optimisation Cycle**





### Use Case: PATMOS

Solves the neutron transport equations to simulate evolution of physical quantities for complex systems

Cross-section computation accounts for 60% to 90% of total runtime

- Porting cross section computation to GPU
- Offload batch-size particles at a time
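
Since cross-section computation accounts for 60% to 90% of the runtime, Amdahl's law gives a rough upper bound on what offloading that part alone can achieve. The sketch below is a back-of-the-envelope estimate, not a PATMOS measurement: the GPU speedup factor `kernel_speedup` is a hypothetical input, and per-batch transfer costs are ignored.

```python
def overall_speedup(offloaded_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only a fraction
    of the runtime is accelerated by kernel_speedup."""
    remaining = (1.0 - offloaded_fraction) + offloaded_fraction / kernel_speedup
    return 1.0 / remaining


def speedup_limit(offloaded_fraction):
    """Upper bound on the speedup as kernel_speedup grows without limit."""
    return 1.0 / (1.0 - offloaded_fraction)


if __name__ == "__main__":
    for fraction in (0.6, 0.9):
        # kernel_speedup=20 is an arbitrary illustrative value.
        print(f"fraction={fraction}: "
              f"20x kernel -> {overall_speedup(fraction, 20.0):.2f}x overall, "
              f"limit {speedup_limit(fraction):.1f}x")
```

Under these assumptions the bound is about 2.5× when 60% of the runtime is offloaded and about 10× at 90%, so the share of runtime spent in cross sections matters at least as much as raw GPU speed.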



#### Split of application depends on batch size







### Heterogeneous/Hierarchical Memory

Examples:

- Scratchpad (embedded systems-on-chip, GPUs)
- High-bandwidth memory (Intel Xeon Phi, GPUs)
- DDR DRAM
- Byte-addressable non-volatile memory (HP's Machine, Intel Optane)
- Compute Express Link (CXL): high-speed interface to accelerators and memory modules





### **Heterogeneous/Hierarchical Memory Tools**

- To which degree do the applications need to be modified?
- Which layer manages the memory? When?
- How much can the applications benefit?
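
One way to think about these questions is as a placement problem: which data objects go into the small, fast memory tier? The sketch below is a generic greedy heuristic, not the method of any tool mentioned in this talk. Object names, sizes and access counts are invented for illustration, and all sizes are assumed to be positive.

```python
def place_in_fast_memory(objects, fast_capacity_bytes):
    """Greedy placement by access density (accesses per byte).

    objects: dict mapping name -> (size_bytes, access_count)
    Returns (fast, slow): lists of object names per tier.
    """
    ranked = sorted(
        objects,
        key=lambda name: objects[name][1] / objects[name][0],
        reverse=True,
    )
    fast, slow, used = [], [], 0
    for name in ranked:
        size = objects[name][0]
        if used + size <= fast_capacity_bytes:
            fast.append(name)
            used += size
        else:
            slow.append(name)
    return fast, slow


if __name__ == "__main__":
    # Hypothetical data objects of a sparse kernel.
    profile = {
        "matrix_values": (800, 8000),
        "index_array": (400, 8000),
        "output_log": (1000, 10),
    }
    print(place_in_fast_memory(profile, fast_capacity_bytes=1000))
```

A profile like the hypothetical `profile` dictionary could come from a tool, the runtime, or the application itself, which is exactly the "which layer manages the memory" question.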







#### SHAMBLES scatter plot example for sparse kernel





### Malleability

Usual HPC workload resource reservation (constant # cores or nodes over time)

Actual use of resources varies over time (yellow curve)

Workload is able to use more resources in certain phases (arrow)

Ideal resource allocation for the workload in green

Malleable applications

- Release resources not required
- Acquire more resources if advantageous

Changes in the number of nodes require data redistribution in the workload

DEEP-SEA provides MPI & Slurm prototypes for enabling application-driven (active) malleability
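
The cost of data redistribution can be illustrated with a small sketch. It assumes a plain 1D block decomposition and does not use or model the DEEP-SEA MPI and Slurm prototypes; the function names are hypothetical helpers written for this example.

```python
def block_ranges(n_items, n_nodes):
    """Split n_items into n_nodes contiguous blocks of near-equal size.

    Returns a list of (start, end) half-open index ranges, one per node.
    """
    base, extra = divmod(n_items, n_nodes)
    ranges, start = [], 0
    for rank in range(n_nodes):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def owner_of(index, ranges):
    """Return the rank whose block contains the given item index."""
    for rank, (lo, hi) in enumerate(ranges):
        if lo <= index < hi:
            return rank
    raise IndexError(index)


def items_to_move(n_items, old_nodes, new_nodes):
    """Count items whose owning rank changes when the node count changes."""
    old = block_ranges(n_items, old_nodes)
    new = block_ranges(n_items, new_nodes)
    return sum(
        1 for i in range(n_items) if owner_of(i, old) != owner_of(i, new)
    )


if __name__ == "__main__":
    print(block_ranges(10, 3))
    print(items_to_move(12, old_nodes=4, new_nodes=6))
```

Even in this simple layout most items change owner when the job grows from 4 to 6 nodes, which is why a malleable application has to weigh the redistribution cost against the benefit of the new allocation.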







## **Funding Acknowledgement**










The DEEP Projects have received funding from the European Commission's FP7, H2020, and EuroHPC JU Programmes, under Grant Agreements n° 287530, 610476, 754304, and 955606. The DEEP-SEA project also receives support from Belgium, France, Germany, Greece, Spain, Sweden, and Switzerland.



www.deep-projects.eu @DEEPprojects