

### Use in-chip memory for RDMA operations

Roie Danino, UCX Team | UCF Conference, December 2023



## UCX Library Unified Communication X

- Abstracts communication transports
- Selects best available route(s) between endpoints
- TCP, RDMA, Shared Memory, GPU
- Zero-copy GPU memory transfers over RDMA
- RDMA requires network support (IB or RoCE)
- http://openucx.org



## **InfiniBand MEMIC - Introduction**

- ConnectX and BlueField devices contain a fast memory chip called MEMIC (Memory Mapped to InterConnect)
- The MEMIC can be used for RDMA operations, as well as mapped to a process on the CPU
- Accessible over the network or the PCIe







# InfiniBand MEMIC Motivation

- Accessing host memory requires a round-trip of >350 nsec over the PCIe bus when performing atomic operations
- Using local on-NIC memory avoids that round-trip
- Reduce the latency of RDMA read and fetching atomic operations

[Figure: dual x86 Gen5 PCIe capable CPUs; 4 x Densilink cables to communicate ConnectX-7 network devices to external ports]



## InfiniBand MEMIC – Atomic Fetch & ADD



### Host Memory






### NIC Memory (MEMIC)






### InfiniBand MEMIC Usage / API UCX

- A new memory type was introduced: UCS\_MEMORY\_TYPE\_RDMA
- MEMIC usage somewhat resembles GPU memory usage, since in both cases it's a region in the process address space that's accessible to the transport but not to the CPU.
- It can be allocated just like any other memory type using ucp\_mem\_map passing UCS\_MEMORY\_TYPE\_RDMA as a memory type parameter.



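```
ucp_mem_map_params_t mem_map_params;
ucp_mem_h            mem_h;
ucs_status_t         status;

/* Ask UCP to allocate (not just register) memory of the RDMA type */
mem_map_params.field_mask  = UCP_MEM_MAP_PARAM_FIELD_LENGTH |
                             UCP_MEM_MAP_PARAM_FIELD_FLAGS |
                             UCP_MEM_MAP_PARAM_FIELD_MEMORY_TYPE;
mem_map_params.length      = length; /* requested size in bytes */
mem_map_params.flags       = UCP_MEM_MAP_ALLOCATE;
mem_map_params.memory_type = UCS_MEMORY_TYPE_RDMA;

status = ucp_mem_map(context, &mem_map_params, &mem_h);
```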

### InfiniBand MEMIC Usage / API **OpenSHMEM**

- Use the SHMEM\_HINT\_DEVICE\_NIC\_MEM hint for allocating device memory.
- If MEMIC is not available, host memory is allocated instead.



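```
#include <shmem.h>

int main()
{
    int *buffer;
    int my_id;

    shmem_init();
    my_id = shmem_my_pe();

    /* Request NIC device memory; host memory is used if MEMIC is unavailable */
    buffer = shmem_malloc_with_hints(sizeof(*buffer),
                                     SHMEM_HINT_DEVICE_NIC_MEM);

    /* The buffer is used like any other symmetric allocation */
    shmem_int_atomic_set(buffer, 0, my_id);

    shmem_free(buffer);
    shmem_finalize();
    return 0;
}
```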

### InfiniBand MEMIC – Implementation Device memory allocation flow in UCX

1. Device memory is allocated using ibv\_alloc\_dm().

2. An address range is reserved using mmap, ensuring it is inaccessible from the CPU:

```
address = mmap(NULL, dm_attr.length, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
```

3. The MEMIC region is registered using ibv\_reg\_dm\_mr(). The address is 0, because device memory is zero-based:

```
ibv_reg_dm_mr(md->pd, dm, 0, length, access_flags | IBV_ACCESS_ZERO_BASED)
```
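The address reservation step can be tried without any RDMA hardware. The sketch below is a standalone illustration rather than UCX code: the helper names `reserve_range` and `release_range` are hypothetical. It reserves a range with `PROT_NONE`, so no physical memory is committed and any CPU load or store into the range would fault, which is the property the flow relies on.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve `length` bytes of virtual address space that
 * the CPU cannot read or write. The range only provides a unique, non-zero
 * address to associate with one device memory allocation. */
static void *reserve_range(size_t length)
{
    void *address;

    assert(length > 0);
    address = mmap(NULL, length, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (address == MAP_FAILED) ? NULL : address;
}

/* Hypothetical helper: return the reserved range to the kernel.
 * Returns 0 on success, -1 on failure (as munmap does). */
static int release_range(void *address, size_t length)
{
    return munmap(address, length);
}
```

Two calls to `reserve_range` return distinct addresses, so each device memory allocation can be given its own address even though every allocation is zero-based on the device.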





## InfiniBand MEMIC – KSM Mapping

Problem:

- Device memory is zero-based, meaning its RDMA address is 0. Using 0 as the base address for all memory segments allocated on the IB device is undesirable.

Solution:

1. Use *mmap* to reserve an address range that will be dedicated for each device memory allocation
2. Map 0 to the reserved address using the KSM mechanism of the NIC

KSM enables indirect memory mapping in the NIC, for example: mapping an existing remote key to a different custom RDMA address.

| Key       | Value |
|-----------|-------|
| 0xfff2340 | 0x100 |
| 0xffab320 | 0     |
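To make the indirection concrete, here is a toy software model of the idea. It is not the NIC's KSM interface, and the type and function names are hypothetical. Each entry records the reserved address range that stands in for one zero-based device memory region; translating an exposed address subtracts the reserved base to obtain the offset inside that region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one indirect mapping: a reserved (mmap'ed) address
 * range that stands in for a zero-based device memory region. */
typedef struct {
    uint64_t reserved_base; /* address exposed to remote peers */
    uint64_t length;        /* size of the device memory region */
} ksm_model_entry_t;

/* Translate an exposed address to a zero-based device memory offset.
 * On success stores the matching entry index and offset and returns 0;
 * returns -1 if the address falls in no known region. */
static int ksm_model_translate(const ksm_model_entry_t *table, size_t count,
                               uint64_t address, size_t *index,
                               uint64_t *dm_offset)
{
    size_t i;

    assert(table != NULL && index != NULL && dm_offset != NULL);
    for (i = 0; i < count; i++) {
        if (address >= table[i].reserved_base &&
            address - table[i].reserved_base < table[i].length) {
            *index     = i;
            *dm_offset = address - table[i].reserved_base;
            return 0;
        }
    }
    return -1;
}
```

In this model, the base of every reserved range translates to offset 0 of its own region, so several allocations can coexist without all of them being addressed as 0.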





### InfiniBand MEMIC – Benchmarks P2P (ucx\_perftest) Atomics Benchmarks

### ucx\_perftest

- A single sender and a single receiver (one direction)

### Benchmarks

- ucp\_fadd: fetch & add, atomically adds and returns the old value.
- ucp\_add: atomic add, a posted operation.

### Command Line Example:

```
$ ./ucx_perftest -t ucp_fadd -c 0 -m host,rdma -s 8 -O16 <other host>
```

| Command Line Parameter | Description |
|------------------------|-------------|
| -t | ucx\_perftest test name: ucp\_fadd, ucp\_add |
| -c | CPU affinity |
| -m | Memory type: `<sender, receiver>` |
| -s | Message size (bytes): either 4 or 8 for atomics |
| -O | Window size: number of uncompleted outstanding sends |



## InfiniBand MEMIC - Results P2P (ucx\_perftest)

### Units

- Latency: µsec (lower is better)
- Message rate: ops / sec (higher is better)

| Operation   | Metric       | Host (window 1) | MEMIC (window 1) | Diff (window 1) | Host (window 16) | MEMIC (window 16) | Diff (window 16) |
|-------------|--------------|-----------------|------------------|-----------------|------------------|-------------------|------------------|
| Fetch & Add | Latency      | 1.653           | 1.426            | -14%            | 0.42             | 0.123             | -71%             |
| Fetch & Add | Message Rate | 604999          | 701437           | 16%             | 2381344          | 8131869           | 241%             |
| Atomic Add  | Latency      | 0.627           | 0.299            | -52%            | 0.588            | 0.3               | -49%             |
| Atomic Add  | Message Rate | 1595301         | 3341689          | 109%            | 1699894          | 3328350           | 96%              |
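The Diff columns can be reproduced from the Host and MEMIC values as a relative change against host memory. The short sketch below does that arithmetic; the function names are illustrative only. It also includes a reciprocal helper, since the fetch & add message-rate values as printed appear to be close to 1e6 divided by the latency in µsec, which would suggest (as an inference, not a statement from the slides) that the latency column is an average time per operation.

```c
#include <assert.h>

/* Relative change of the MEMIC value versus the host value, in percent. */
static double relative_change_percent(double host, double memic)
{
    assert(host != 0.0);
    return (memic - host) / host * 100.0;
}

/* Round half away from zero without depending on libm. */
static long round_to_long(double value)
{
    return (long)(value < 0.0 ? value - 0.5 : value + 0.5);
}

/* Operations per second implied by an average per-operation time in usec. */
static double rate_from_latency_usec(double latency_usec)
{
    assert(latency_usec > 0.0);
    return 1e6 / latency_usec;
}
```

For example, `relative_change_percent(1.653, 1.426)` is about -13.7, which rounds to the -14% shown for fetch & add latency with a window size of 1.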

### Hardware

- 2 nodes, each with:
  - 2x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (15 cores each)
  - 2x ConnectX-6 device

**Software:** UCX – v1.16.0



### **Contended Atomics Benchmark OpenSHMEM Fetch & Add**

1. Root process allocates an atomic variable in either:
   - symmetric heap
   - global data segment
   - MEMIC
2. Its value is set to 0
3. Barrier
4. For each iteration:
   - Fetch and add 1 to the shared atomic variable
5. Barrier
6. Collect results from all processes



```
/* pwrk and psync are symmetric work arrays, defined and initialized elsewhere */
void benchmark_fadd(int my_id, int npes, int *atomic_variable,
                    unsigned long iterations)
{
    double begin, end;
    unsigned long i;
    /* static: reduction sources and destinations must be symmetric objects */
    static double rate = 0, sum_rate = 0, min_rate = 0, max_rate = 0;

    rate = 0; /* root does not measure; it contributes 0 to sum and max */
    shmem_int_atomic_set(atomic_variable, 0, my_id);
    shmem_barrier_all();

    if (my_id != 0) {
        int value = 1;
        int old_value;

        begin = get_wall_time();
        for (i = 0; i < iterations; i++) {
            /* All non-root PEs contend on the variable owned by PE 0 */
            old_value = shmem_int_fadd(atomic_variable, value, 0);
        }
        end = get_wall_time();
        (void)old_value;
        rate = ((double)iterations) / (end - begin);
    }

    shmem_barrier_all();
    shmem_double_sum_to_all(&sum_rate, &rate, 1, 0, 0, npes, pwrk, psync);
    shmem_double_max_to_all(&max_rate, &rate, 1, 0, 0, npes, pwrk, psync);

    /* Small hack to exclude root process from minimum calculation */
    if (my_id == 0) {
        rate = DBL_MAX;
    }
    shmem_double_min_to_all(&min_rate, &rate, 1, 0, 0, npes, pwrk, psync);

    print_operation_rate(my_id, "shmem_int_fadd", sum_rate/1e6, min_rate/1e6,
                         max_rate/1e6, npes);
}
```
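The benchmark calls two helpers, `get_wall_time` and `print_operation_rate`, whose definitions are not shown on the slide. The following is one plausible sketch of them, with signatures inferred from the call sites; the bodies, the output format and the `int` return value are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 199309L /* for clock_gettime under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Assumed definition: monotonic wall-clock time in seconds. */
static double get_wall_time(void)
{
    struct timespec ts;
    int ret = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(ret == 0);
    (void)ret; /* avoid an unused-variable warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Assumed definition: only the root PE prints the aggregate, minimum and
 * maximum rates, which the caller has already scaled to million ops/sec.
 * Returns the number of characters printed, or 0 on non-root PEs. */
static int print_operation_rate(int my_id, const char *name, double sum_rate,
                                double min_rate, double max_rate, int npes)
{
    if (my_id != 0) {
        return 0;
    }
    return printf("%s: %d PEs, total %.3f, min %.3f, max %.3f Mops/sec\n",
                  name, npes, sum_rate, min_rate, max_rate);
}
```

A monotonic clock is used here so that the measured interval is not affected by system time adjustments during the run.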



### **InfiniBand MEMIC - Results** SHMEM Fetch & Add Benchmark – 15 Nodes

**Software:** UCX – v1.16.0, OpenMPI – v5.0.0

# InfiniBand MEMIC – Conclusions

1. Using NIC memory for atomics improves performance compared to host memory.
2. It is noticeable that from a certain number of processes per NIC the improvement is less significant for each added process.
3. Using more NICs preserves per-process performance when using NIC memory, as opposed to host memory.





## Thanks

