# FPGA acceleration for HPC supercapacitor simulations

#### **Charles Prouveur**

Work done in collaboration with M. Haefele, T. Kenter & N. Voss



#### What is an FPGA ?

- A chip historically used in embedded systems
- Can be found in accelerator cards nowadays







Xilinx U200

Intel Stratix10

Intel Agilex 7 (F)

#### **FPGA** acceleration

• Why is it different ?

Physical design vs instructions executions

• How does it work ?

Data is streamed from DRAM into pipelines on the chip

• What is the point ?

FPGAs' TDP's ~ 5-6 x less than GPUs'











## FPGA parallelism

- Acceleration is proportional to the number of pipes
- BUT : also proportionnal to the length of the pipe(!)
- Computing time = number of operations / (frequency x number of pipes x length of the pipe) + latency



- Molecular dynamic production code used by Sorbonne university researchers to simulate electrochemical systems such as «supercapacitors»
- Fortran 90 base code, parallelised with MPI
- Most of the computing time = electrostatic potential computing
- Computing efficiency published : model running during several weeks on 512 cores while maintaining a parallel efficiciency above 75 %
- Currently using CPU and GPU implementations (OpenAcc)

#### Miniapp extracted from the production code



# Target device on Jumax at Juelich supercomputing center

| Machine (8101MB total)                                                                                |                         |  |  |  |
|-------------------------------------------------------------------------------------------------------|-------------------------|--|--|--|
| Package L#0                                                                                           | 8,0 8,0 4,0 PCI 03:00.0 |  |  |  |
| NUMANode L#0 P#0 (8101MB)                                                                             | 4,0 4,0 PCI 04:00.0     |  |  |  |
| L3 (6144KB)                                                                                           | 4,0 4,0 PCI 05:00.0     |  |  |  |
| L2 (256KB) L2 (256KB) L2 (256KB) L2 (256KB)                                                           | 4,0 4,0 PCI 06:00.0     |  |  |  |
| L1d (32KB) L1d (32KB) L1d (32KB) L1d (32KB)                                                           | 4,0 PCI 07:00.0         |  |  |  |
| L1i (32KB) L1i (32KB) L1i (32KB) L1i (32KB)                                                           | 4,0 4,0 PCI 08:00.0     |  |  |  |
| Core L#0 Core L#1 Core L#2 Core L#3                                                                   | 4,0 4,0 PCI 09:00.0     |  |  |  |
| PU L#0         PU L#2         PU L#4         PU L#6           P#0         P#1         P#2         P#3 | 4,0 4,0 PCI 0a:00.0     |  |  |  |
| PU L#1 PU L#3 PU L#5 PU L#7<br>P#4 P#5 P#6 P#7                                                        | 4,0 4,0 PCI 0b:00.0     |  |  |  |
|                                                                                                       | Net ib0                 |  |  |  |
|                                                                                                       | OpenFabrics mlx4_0      |  |  |  |
|                                                                                                       | 4,0 4,0 PCI 0c:00.0     |  |  |  |
|                                                                                                       | Net ib1                 |  |  |  |
|                                                                                                       | OpenFabrics mlx4_1      |  |  |  |
|                                                                                                       | PCI 00:02.0             |  |  |  |
|                                                                                                       | PCI 00:19.0             |  |  |  |
| Net eth0                                                                                              |                         |  |  |  |
|                                                                                                       | PCI 00:1f.2             |  |  |  |
| Host: pandora<br>Date: ven. 14 mai 2021 15:38:52                                                      |                         |  |  |  |



Figure 1: Overview of a Maxeler acceleration system

- CPU host : AMD EPYC 7601
- 8 nodes Xilinx VU9P on one blade as target devices
- 8x PCIe 2.0 lanes  $\rightarrow$  4.0 GB/s for all nodes
- Each node has three 16-GB DDR4 Dual In-Line Memory Modules (DIMMs), which provide a theoretical peak bandwidth of 15 GB/s each

### The FPGA used here



- Max5C is a U200 with one less DIMM
- 3 SLR : the chip is divided in 3 Super Logic Region connected by solders (very low bandwidth)
- SLR communication should be avoided as much as possible
- Design done using Maxj with Maxeler toolchain

|                      | MAX5C           | Alveo U200    |  |
|----------------------|-----------------|---------------|--|
| FPGA                 | VU9P            | VU9P          |  |
| SLRs                 | 3               | 3             |  |
| LUTs (k)             | 1,182           | 1,182         |  |
| FFs (k)              | 2,364           | 2,364         |  |
| DSPs                 | 6,840           | 6,840         |  |
| BRAM18s              | 4,320           | 4,320         |  |
| URAMs                | 960             | 960           |  |
| On Chip              | 42.0 MButo      | 43.2 MByte    |  |
| Memory Capacity      | 43.2 MByte      |               |  |
| DDR DIMMs            | 3               | 4             |  |
| DDR Capacity         | 16 GB           | 16 CP         |  |
| per DIMM             | 10 GB           | 16 GB         |  |
| Supported DDR        | 933, 1066, 1200 | 1200          |  |
| Frequencies (MHz)    | 933, 1000, 1200 | 1200          |  |
|                      | DIMM_0 ->SLR0   | DIMM_0 ->SLR0 |  |
| DIMM to              | DIMM_1 ->SLR1   | DIMM₋1 ->SLR1 |  |
| SLR Mapping          | DIMM_2 ->SLR2   | DIMM_2 ->SLR1 |  |
|                      |                 | DIMM_3 ->SLR2 |  |
| PCIe Placement       | SLR1            | SLR0          |  |
| PCle                 | PCle Gen 2 x8   | PCle Gen 2 x8 |  |
| Networking           | 1 x 100 GBit/s  | 2 x 100GBit/s |  |
| Networking Placement | SLR2            | SLR2          |  |

## Designing all kernels on one FPGA

- Due to hardware contraints, each kernel is put into one SLR (Super Logic Region)
- For simplicity, the conjugate gradient is computed on the host
- We want the highest frequency possible
- We also want the biggest number of separate pipelines
- All kernels work synchronously
- Use as much as possible the device's DRAM to reduce communications

### Challenges

- Limited ressources
- More logic available  $\rightarrow$  higher design frequency likely
- More pipelines  $\rightarrow$  less logic available
- Using DRAM makes meeting timings harder

 $\rightarrow$  harder compilation, ie we are less likely to achieve high frequency with as many pipelines

### Balancing the kernels

- Need for synchronicity  $\rightarrow$  balance needed
- Theoretical time for each sequential kernel is known : N² / freq , N²/2 / freq and N x M / freq
- Balance is case dependent (i.e. : dependent on N and M)
- For the production test case considered here, the balance is (8,4,1) :

8 pipes for the kernel in  $N^2$ , 4 pipes for the kernel in  $N^2/2$  and one pipe for the kernel in N x M

#### Numerical accuracy analysis



We can save ressources but there is a catch : the number of iterations to converge increases

# Ressources usage of the multiple kernels designs

| Design Name                                                     | 64 bits Design | 40 bits Design | Final design |
|-----------------------------------------------------------------|----------------|----------------|--------------|
| Design frequency (MHz )                                         | 200            | 200            | 300          |
| Pipes $(U_{\mathrm{lr},0}, U_{\mathrm{sr}}, U_{\mathrm{lr},+})$ | (8,4,1)        | (16, 8, 2)     | (32, 16, 4)  |
| Logic (LUTs & FFs)                                              | 27.7%          | 33.4%          | 44.6 %       |
| DSPs                                                            | 33.42%         | 29.52%         | 53.3%        |
| On-chip Mem                                                     | 22.7%          | 20.3%          | 28.8%        |

DSP limited in the U\_Ir,0 kernel (87 % DSPs used in its SLR)

## Design using the device's DRAM

- The previous design does not use the DRAM
- Q has to be sent every iteration from and to the FPGA
- (x,y,z) they can be stored
- Using LMEM makes the design harder to compile

 $\rightarrow$  have to make concessions

• Best design is (24,12,3) with a frequency of 260 MHz

# Airview of the compiled final design with all kernels on the FPGA



#### Results



Measured FPGA power efficiency is twice the GPU's

# Ressources usage of the single kernel designs

| Design Name             | Design $U_{\rm lr,0}$ | Design $U_{\rm sr}$ | Design $U_{\rm lr,+}$ |
|-------------------------|-----------------------|---------------------|-----------------------|
| Design frequency (MHz ) | 300                   | 300                 | 300                   |
| Total number of pipes   | 96                    | 48                  | 42                    |
| Logic (LUTs & FFs)      | 51.9%                 | 63.3%               | 62.8%                 |
| DSPs                    | 87.2%                 | 55.4%               | 83.5%                 |
| On-chip Mem             | 27.2%                 | 25.8%               | 38.4%                 |

- DSP limited in two kernels and Logic limited in one kernel
- Adapting the balance to a number of FPGAs allocated per kernel

### Speedup using multiple FPGAs



Charles Prouveur

27/05/23, Davos, PASC23

### FPGA targeting with OneAPI

- Development done in C++
- Easy to use and to make you first design targeting Intel FPGAs
- Another hardware and another toochain
- Long term support is more likely

### OneAPI implementation of Metalwalls

- Targeting Stratix 10 at Paderborn supercomputing center
- Poor resources usage (max 8 pipes @ 400 Mhz)
- Mixed precision did not save as many resources (if any)
- Obtaining proper throughput was a big challenge
- Single kernel designs scaling done up to 14 nodes (we could only use one out of two FPGAs on each node)
- Higher TDP with the Stratix 10

#### Speedup at PC<sup>2</sup>



Good scaling, but ~10x worse than with maxeler design (but 1 month of development vs ~2 years)

#### Conclusion

- We Implemented Metalwalls kernels on FPGA using Maxeler tools
- Our FPGA implementation showed 2x better performance per watt than a GPU of similar silicon technology and even better against skylake CPU with maxeler design
- Multiple FPGAs performance bottlenecked by old interconnect technology but achieved nonetheless
- OneAPI implementation : promising technology, better development environment, not mature enough for HPC yet but developping fast

## Thank you for your attention

#### Annexes

#### Original code vs Maxcompiler

#### \$acc data present(xyz\_atoms(:,3), q\_elec(:), V(:))

```
itype = 1, num elec types
count n = electrodes(itype)%count
offset n = electrodes(itype)%offset
do jtype = 1, itype-1
   count m = electrodes(jtype)%count
   offset m = electrodes(itype)%offset
   num block full = localwork%pair atom2atom num block full local(itype,jtype)
   iblock offset = localwork%pair atom2atom full iblock offset(itype, itype)
   do iblock = 1. num block full
      call update other block boundaries(iblock offset+iblock. &
            offset n. count n. offset m. count m. &
            istart, iend, istart, iend)
       !Sacc parallel loop private(zij, zijsg, pot ij, vi)
      do i = istart. iend
         vi = 0.0 wp
         !$acc loop reduction(+:vi)
         do j = jstart, jend
            pot ij = volfactor * (sqrpialpha*exp(-zijsq*alphasq) + pi*zij*erf(zij*alpha))
            V(1) = V(1) - q elec(1) * pot 11
            vi = vi + q elec(j) * pot ij
         !sacc end loop
      !Sacc end parallel loop
num block diag = localwork%pair atom2atom num block diag local(itype)
iblock offset = localwork%pair atom2atom diag iblock offset(itype)
do iblock = 1, num_block_diag
   call update diag block boundaries(iblock offset+iblock, offset n, count n, istart, iend)
    !$acc parallel loop private(zij, zijsq, pot_ij, vi)
   do i = istart, iend
      vi = 0.0 wp
      !$acc loop reduction(+:vi)
      do j = istart, iend
         pot_ij = volfactor * (sqrpialpha*exp(-zijsq*alphasq) + pi*zij*erf(zij*alpha))
      !$acc end loop
    !$acc end loop
```

num block full = localwork%pair atom2atom num block full local(itype.itype)

#### DFEVector<DFEStruct> xyz = io.input("xyz", xyztypeVec,load)

DFEVectorType<DFEVar> vectorType = new DFEVectorType<DFEVar>(dfeFloat(11, 53), numPipes); DFEVector<DFEVar> q = io.input("q", vectorType,load);

#### ArrayList-Memory-OFEVar>> ZRam = new ArrayList-Memory-OFEVar>>(); ArrayList-Memory-OFEVar>> qRam = new ArrayList-Memory-OFEVar>>(); for(int i=0; i=numPipes; i++) { zRam.add(nem.alloc(scalar, maxParticlesPerPipe)); qRam.add(nem.alloc(scalar, maxParticlesPerPipe));

or(int i=0; iscumPipes; i++) {
 DFEVar Address 0 = counter\_atoms.getCount().cast(dfeUInt(MathUtils.bitsToAddress(maxParticlesPerPipe)));
 DFEVar q\_j = q[i].cast(scalar);
 DFEVar z\_j = (DFEVar) xyz[i][\*z\*].cast(scalar);
 zRam[i].write(Address\_0, q\_j, load);

#### DFEVar[] k0sum = new DFEVar[numPipes];

refinite(execution) [finites:temPipes; i++) { DFEVar jCount = i + counter\_atoms.getCount()\*numPipes; DFEVar Address = counter\_atoms.getCount().cast(dfeUInt(MathUtils.bitsToAddress(maxParticlesPerPipe))); DFEVar zj = load ? (DFEVar) xyz[i][\*z'].cast(scalar) : ztAan[i].read(Address); DFEVar zj = load ? (Di.cast(scalar) : qRem[i].read(Address);

#### // Compute the potential

DFEVar zij = zj - zi; DFEVar zijSq = zij \* zij; DFEVar erfZij = ErrorFunction.erf(alpha \* zij); DFEVar expZij = KernelMath.exp(-alphasq \* zijSq); DFEVar vij = counter zi.getCount().cast(dfeUInt(32)) < numWall\_without\_padding & jCount < numWall\_without\_padding ? qj \* ((sqrpialpha \* expZi)) + (pirzi)\*erfZij)) : constant.var(o).cast(scalar);

k0sum[i] = FloatingPointAccumulator.accumulateFixedLength(vij.cast(dfeFloat(11, 53)), constant.var(true), numWallPerPipe.cast(dfeUInt(32)), true);

// sum contributions from each pipe
DFEVar k0pot = k0PotFactor \* FloatingPointMultiAdder.add(k0sum);

// Output
io.output("k0pot", k0pot, dfeFloat(11, 53),counter\_atoms.getWrap());

#### OneAPI:

```
d kernel_k0( int num_atoms, double alpha, double alphasq, double sqrpialpha, double k0PotFactor,
double *z, double *q, double *k0 ){
```

//"first\_index\_i", (num\_start\_slr + num\_slr\_in\_fpga) \* Nperslr)

```
double q_loc[num_max];
double z_loc[num_max];
for (int i = 0; i < num_atoms; i++) {
        z_loc[i] = z[i];
        _ q_loc[i] = q[i];
```

```
for (int i = 0; i < num_atoms; i++) { // MPI division here
    double zi = z loc[i];
    double s = 0.0;
    for (int j = 0; j < num_atoms; j++) { // pipe parallelization here
        double zij = z loc[j] - zi;
        double zijsq = zij*zij;
        double pot_ij = sgrziplapha*exp(-zijsq*alphasq) + M_PI*zij*erf(zij*alpha);
        s += - q_loc[j] * pot_ij;
        }
        k0[i] = s * k00tFactor;
    }
</pre>
```

```
#ifndef K0 INNER L
#define K0 INNER U 16
#ifndef K0 OUTER U
#define K0 READ U 16
const int num max = 43008; //8*1024;
class k0Kernel;
void kernel k0(const int num atoms, const double alpha,
       const double alphasg, const double sgrpialpha.
       const double k0PotFactor,
       double *z, double *q, double *k0 ){
       Real q loc[num max];
       Real z loc[num max];
       #pragma unroll K0 READ U
       for (int i = 0; i < num atoms; ++i) {</pre>
            q_loc[i] = q[i];//fpga_reg(q[i]);
       for (int ib = 0; ib < num atoms; ib+=K0 OUTER U) { // MPI division here</pre>
            Real zi[K0 OUTER U];
            #pragma unroll
            for (int ii = 0; ii < K0 OUTER U; ii++) {</pre>
            [[intel::initiation interval(12)]] // 12 seems perfect
            for (int jb = 0; jb < num atoms; jb+=K0 INNER U) {</pre>
               #pragma unroll
                    Real s block = 0.0;
                    #pragma unroll
                    for (int jj = 0; jj < K0_INNER_U; jj++) {</pre>
                        Real zij = z_loc[jb+jj]-zi[ii];//fpga_reg(z_loc[jb+jj]) - fpga_reg(zi[ii]);
                        Real zijsq = zij*zij;
                        Real c = q_loc[jb+jj] * (sqrpialpha*my_exp(-zijsq*(Real)alphasq) + M_PI*zij*my_erf(zij*alpha))
                        s block += c;
                    s[ii] -= (double) s block;
            #pragma unroll
            for (int ii = 0; ii < K0 OUTER U; ii++) {</pre>
               k0[ib+ii] = s[ii] * k0PotFactor;
```

| Tool/Target    | oneAPI/Stratix 10 |                 | Maxeler/VU9P |             |
|----------------|-------------------|-----------------|--------------|-------------|
| Data Type      | double            | ap_float<10,35> | double       | float<8,32> |
| Multiplication | [3253, 14]        | [504, 1]        | [132,7]      | [427,0]     |
| Addition       | [3484, 18]        | [1696, 5]       | [582,3]      | [95,4]      |
| Division       | [6034, 50]        | [1864, 21]      | [3135, 0]    | [1247, 0]   |
| exp()          | [5504, 20]        | [2218, 9]       | [1290, 38]   | [ 555, 16]  |
| sqrt()         | [2613, 28]        | [1029, 6.5]     | [1653, 0]    | [673,0]     |
| fused sin/cos  | [7548, 43]        | [7548, 43]?     |              |             |
| sin()          |                   |                 | [1261, 28]   | [771,10]    |

#### Table 4: [LUTs, DSPs] usage of basic arithmetic functions based on oneAPI reports and Maxeler documentation.