
Module 8: RDMA Fundamentals

Overview

This module introduces Remote Direct Memory Access (RDMA), a technology that lets a NIC read and write application memory directly, bypassing the kernel and eliminating CPU-mediated data copies.


1. RDMA vs Traditional Networking

Traditional Path (Sockets)

┌──────────────────────────────────────────────────────────────┐
│ Application                                                  │
│    │                                                         │
│    │ send(fd, buf, len)                                      │
│    ▼                                                         │
│ ┌─────────────────┐                                          │
│ │ Kernel (socket) │ ◄─── CPU copy #1                         │
│ │    sk_buff      │                                          │
│ └────────┬────────┘                                          │
│          │                                                   │
│          ▼                                                   │
│ ┌─────────────────┐                                          │
│ │   NIC Driver    │ ◄─── DMA to NIC                          │
│ └────────┬────────┘                                          │
│          │                                                   │
│          ▼                                                   │
│      [  Wire  ]                                              │
└──────────────────────────────────────────────────────────────┘

Latency: ~10-50 microseconds
CPU: Involved in every packet
Copies: 2 per packet (send + receive)

RDMA Path

┌──────────────────────────────────────────────────────────────┐
│ Application                                                  │
│    │                                                         │
│    │ ibv_post_send(qp, wr, ...)                              │
│    ▼                                                         │
│ ┌─────────────────┐                                          │
│ │ User Buffer     │ ◄─── Memory registered with ibv_reg_mr   │
│ │ (pinned in RAM) │                                          │
│ └────────┬────────┘                                          │
│          │                                                   │
│          │  DMA directly from user buffer                    │
│          ▼                                                   │
│ ┌─────────────────┐                                          │
│ │   RDMA NIC      │ ◄─── No kernel involved!                 │
│ │   (RNIC)        │                                          │
│ └────────┬────────┘                                          │
│          │                                                   │
│          ▼                                                   │
│      [  Wire  ]                                              │
└──────────────────────────────────────────────────────────────┘

Latency: ~1-2 microseconds
CPU: Not involved in data path
Copies: 0 (zero-copy)

2. RDMA Concepts

Memory Registration

Before RDMA can access memory, it must be registered:

struct ibv_mr *mr = ibv_reg_mr(
    pd,                          // Protection domain
    buffer,                      // Virtual address
    size,                        // Buffer size
    IBV_ACCESS_LOCAL_WRITE |     // Allow local writes
    IBV_ACCESS_REMOTE_WRITE |    // Allow remote writes
    IBV_ACCESS_REMOTE_READ       // Allow remote reads
);

What happens:

  1. Pages are pinned (no swap, no migration)
  2. Physical addresses are recorded
  3. NIC is given translation table
  4. lkey/rkey returned for operations

Queue Pairs (QP)

┌─────────────────────────────────────────────────────────────┐
│                     Queue Pair (QP)                          │
│                                                              │
│  ┌─────────────────────┐    ┌─────────────────────────────┐ │
│  │    Send Queue       │    │    Receive Queue            │ │
│  │                     │    │                             │ │
│  │ [Work Request 0]    │    │ [Work Request 0]            │ │
│  │ [Work Request 1]    │    │ [Work Request 1]            │ │
│  │ [Work Request 2]    │    │                             │ │
│  │        ...          │    │                             │ │
│  └──────────┬──────────┘    └──────────────┬──────────────┘ │
│             │                              │                 │
│             ▼                              │                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │              Completion Queue (CQ)                    │   │
│  │                                                       │   │
│  │ [Completion 0] [Completion 1] [Completion 2] ...      │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

RDMA Operations

Operation   Description                          Remote CPU?
────────────────────────────────────────────────────────────
SEND        Push data to remote receive buffer   Yes (consumes a posted RECV)
RECV        Prepare buffer for incoming SEND     -
WRITE       Write to remote memory               No
READ        Read from remote memory              No

3. Basic RDMA Code

Setup

#include <infiniband/verbs.h>

// 1. Get device list
struct ibv_device **dev_list = ibv_get_device_list(NULL);
struct ibv_context *ctx = ibv_open_device(dev_list[0]);

// 2. Allocate protection domain
struct ibv_pd *pd = ibv_alloc_pd(ctx);

// 3. Create completion queue
struct ibv_cq *cq = ibv_create_cq(ctx, 10, NULL, NULL, 0);

// 4. Create queue pair
struct ibv_qp_init_attr qp_attr = {
    .send_cq = cq,
    .recv_cq = cq,
    .qp_type = IBV_QPT_RC,  // Reliable Connection
    .cap = {
        .max_send_wr = 10,
        .max_recv_wr = 10,
        .max_send_sge = 1,
        .max_recv_sge = 1,
    },
};
struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);

// 5. Register memory
char buffer[4096];
struct ibv_mr *mr = ibv_reg_mr(pd, buffer, sizeof(buffer),
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

// 6. Before posting work requests, the QP must be transitioned
//    RESET → INIT → RTR → RTS with ibv_modify_qp(), using the peer's
//    QP number and address exchanged out of band (e.g. over TCP).

RDMA Write (Zero-Copy)

// Remote side has shared: raddr (address), rkey (remote key)

struct ibv_sge sge = {
    .addr = (uintptr_t)buffer,
    .length = data_len,
    .lkey = mr->lkey,
};

struct ibv_send_wr wr = {
    .opcode = IBV_WR_RDMA_WRITE,
    .sg_list = &sge,
    .num_sge = 1,
    .send_flags = IBV_SEND_SIGNALED,
    .wr.rdma = {
        .remote_addr = raddr,
        .rkey = rkey,
    },
};

struct ibv_send_wr *bad_wr;
int ret = ibv_post_send(qp, &wr, &bad_wr);
if (ret)
    fprintf(stderr, "ibv_post_send failed: %d\n", ret);

// Poll for completion (busy-wait; ibv_poll_cq returns the
// number of completions retrieved)
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0) {
    // Wait
}

if (wc.status != IBV_WC_SUCCESS) {
    fprintf(stderr, "RDMA write failed: %s\n",
            ibv_wc_status_str(wc.status));
}

4. RDMA on Loopback (SoftROCE)

Setup SoftROCE

# Load RXE module
$ sudo modprobe rdma_rxe

# Add RXE device on lo interface
$ sudo rdma link add rxe0 type rxe netdev lo

# Verify
$ rdma link
link rxe0/1 state ACTIVE physical_state LINK_UP netdev lo

$ ibv_devices
    device          node GUID
    ------          ---------
    rxe0            505400fffef6f6f6

Test with rping

# Terminal 1: Server
$ rping -s -v

# Terminal 2: Client
$ rping -c -a 127.0.0.1 -v

5. Why RDMA is Faster

Latency Comparison

Operation          Socket      RDMA
──────────────────────────────────────
Small message      10-50 μs    1-2 μs
Context switch     Yes         No
Copies             2           0
CPU per message    High        Near zero

Throughput Comparison

Socket (100 Gbps NIC): ~40 Gbps (CPU limited)
RDMA   (100 Gbps NIC): ~95 Gbps (line rate)

6. Practice Exercises

Exercise 1: Setup SoftROCE

Configure RXE device and run ibv_devinfo to see attributes.

Exercise 2: Measure Registration Cost

Time ibv_reg_mr for different buffer sizes. Plot the results.

Exercise 3: Compare Latency

Implement simple ping-pong using:

  1. UDP sockets
  2. RDMA SEND/RECV

Compare the latency distributions.

Next Module

Module 9: Maple Tree & VMA →



AXIOMATIC EXERCISES — BRUTE FORCE CALCULATION

EXERCISE A: MEMORY REGISTRATION CALCULATION

GIVEN:
  Buffer size = 1GB = 1073741824 bytes
  Page size = 4096 bytes
  Each page needs physical address entry: 8 bytes

TASK:

1. Pages in buffer = ___ / 4096 = ___ pages
2. Translation table size = ___ × 8 = ___ bytes = ___ MB
3. If NIC can hold 1MB of translation entries:
   Max registrable memory = 1MB / 8 × 4096 = ___ bytes = ___ GB

EXERCISE B: QUEUE PAIR SIZING

GIVEN:
  Max outstanding sends = 128
  Max outstanding receives = 64
  Each WQE (work queue entry) = 64 bytes
  Each CQE (completion queue entry) = 32 bytes

TASK:

1. Send queue size = ___ × 64 = ___ bytes
2. Receive queue size = ___ × 64 = ___ bytes
3. Total QP size = ___ + ___ = ___ bytes = ___ KB
4. CQ size for 128+64 completions = ___ × 32 = ___ bytes

EXERCISE C: RDMA WRITE WORK REQUEST

GIVEN:
  local_buffer = 0x7F00_0000_0000
  local_lkey = 0x1234
  remote_addr = 0x7F00_1000_0000
  remote_rkey = 0x5678
  length = 4096

TASK: Fill ibv_send_wr structure

struct ibv_sge sge = {
    .addr = 0x___,
    .length = ___,
    .lkey = 0x___,
};

struct ibv_send_wr wr = {
    .opcode = IBV_WR_RDMA___,
    .sg_list = &sge,
    .num_sge = ___,
    .wr.rdma.remote_addr = 0x___,
    .wr.rdma.rkey = 0x___,
};

EXERCISE D: LATENCY COMPARISON

GIVEN:
  Socket send: 25μs
  Socket recv: 25μs
  RDMA post_send: 0.5μs
  RDMA poll_cq: 0.5μs
  Network RTT: 5μs

TASK:

Socket round-trip = ___ + ___ + ___ + ___ = ___ μs
RDMA round-trip = ___ + ___ + ___ = ___ μs
Speedup = ___ / ___ = ___×

FAILURE PREDICTIONS

FAILURE 1: Forgetting to register memory → NIC cannot DMA → fault
FAILURE 2: Using wrong rkey → remote side rejects RDMA
FAILURE 3: Buffer not page-aligned → registration may fail or be slow
FAILURE 4: num_sge wrong → reading garbage scatter-gather entries
FAILURE 5: Not polling CQ → completions lost, resources exhausted

W-QUESTIONS — NUMERICAL ANSWERS

WHAT: Memory Registration

ibv_reg_mr(pd, buf, 1GB, flags) returns:
  mr->lkey = 0x1234 (local key for local operations)
  mr->rkey = 0x5678 (remote key for RDMA WRITE/READ)
Internal: 1GB / 4KB = 262144 pages pinned
Translation table: 262144 × 8 = 2MB NIC memory used

WHY: Pin Pages

Socket: page can swap during transfer → copy to kernel first
RDMA: NIC DMAs directly from user buffer
If page swaps: NIC reads wrong data (or crashes)
Pin = guarantee physical address stable
262144 pages pinned = 1GB RAM unswappable

WHERE: Queue Lives

QP queues live in host memory mapped into the process and DMA-visible to the NIC:
  Send Queue at 0x7F00_0000_0000 (user VA)
  Recv Queue at 0x7F00_0001_0000 (user VA)
  CQ at 0x7F00_0002_0000 (user VA)
Doorbell register: write to notify NIC of new WR

WHO: Moves Data

Socket: CPU executes memcpy from/to kernel
RDMA: NIC DMA engine moves data
RDMA WRITE: local NIC → remote RAM (remote CPU idle)
RDMA READ: remote RAM → local NIC → local RAM
CPU involvement: 0 for data, only for posting WRs

WHEN: Completion Generated

T₁: ibv_post_send(qp, wr) → WR in send queue
T₂: NIC processes WR → DMA starts
T₃: DMA completes → NIC generates CQE
T₄: ibv_poll_cq(cq, 1, &wc) → wc.status = IBV_WC_SUCCESS
Latency: T₄ - T₁ = 1-2 μs for small message

WITHOUT: No RDMA

1μs RDMA latency vs 25μs socket latency
25× faster per operation
1 million ops/sec:
  Socket: 1M × 25μs = 25 seconds CPU time
  RDMA: 1M × 1μs = 1 second, but offloaded to NIC

WHICH: Operation Type

IBV_WR_RDMA_WRITE = 0: write to remote memory
IBV_WR_SEND = 2: push to remote recv buffer
IBV_WR_RDMA_READ = 4: read from remote memory
IBV_WR_ATOMIC_CMP_AND_SWP = 5: atomic compare-swap
WRITE/READ: remote CPU unaware, memory directly accessed

ANNOYING CALCULATIONS — BREAKDOWN

Annoying: Registration Table Size

Buffer = 16GB
Pages = 16GB / 4KB = 4194304 pages
Entry size = 8 bytes (PA per page)
Table = 4194304 × 8 = 33554432 bytes = 32MB
NIC SRAM typically 16MB → cannot register 16GB in one MR!

Annoying: Work Request Posting

WR addr = 0x7F0000001000
sge.addr = 0x7F0000002000 (data buffer)
sge.length = 4096
sge.lkey = 0x1234
Doorbell write at T₀
NIC reads WR at T₁ = T₀ + 100ns
DMA starts at T₂ = T₁ + 50ns

Annoying: Poll CQ

while (ibv_poll_cq(cq, 1, &wc) == 0) → spin
Return 1 → one completion retrieved
wc.status = 0 → IBV_WC_SUCCESS
wc.status = 10 → IBV_WC_REM_ACCESS_ERR (bad rkey or permissions)
wc.wr_id = user-provided ID for matching completions to WRs

ATTACK PLAN

1. ibv_reg_mr: pin pages, get lkey/rkey
2. Exchange rkey/raddr with remote via sockets
3. ibv_post_send with RDMA_WRITE
4. ibv_poll_cq until completion
5. Check wc.status == IBV_WC_SUCCESS

ADDITIONAL FAILURE PREDICTIONS

FAILURE 6: rkey valid only for that specific MR → wrong rkey = error
FAILURE 7: Forgot to post recv buffer → SEND fails with RNR NAK
FAILURE 8: Registration table size limit → large MR fails
FAILURE 9: Pages must stay pinned the whole time → munmap breaks RDMA

SHELL COMMANDS — PARADOXICAL THINKING EXERCISES

COMMAND 1: Setup SoftROCE for Testing

# Load RXE (software RDMA over Ethernet)
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev lo

# Verify
rdma link
ibv_devices
ibv_devinfo -d rxe0

# WHAT: Software RDMA implementation on loopback
# WHY: Test RDMA code without hardware
# WHERE: Kernel module rdma_rxe + lo interface
# WHO: RDMA core + rxe provider
# WHEN: Any time, no special NIC needed
# WITHOUT: Need Mellanox/Intel/Broadcom RNIC
# WHICH: rxe0 device created, use as normal RDMA

# CALCULATION:
# Real RDMA NIC: 1-2μs latency
# SoftROCE: ~50μs latency (software overhead)
# Still shows zero-copy benefit for bandwidth tests

COMMAND 2: Run RDMA Bandwidth Test

# Terminal 1: Server
ib_write_bw -d rxe0

# Terminal 2: Client
ib_write_bw -d rxe0 127.0.0.1

# OUTPUT shows:
# bytes     iterations   BW peak[MB/sec]   BW average[MB/sec]
# 65536     1000         5000.00           4800.00
#
# CALCULATION:
# 64KB messages × 1000 = 64MB transferred
# 4800 MB/sec = 4.8 GB/sec = 38.4 Gbps
# 
# For real 100Gbps NIC:
# Expected: ~12 GB/sec = 96 Gbps (line rate - overhead)

COMMAND 3: Memory Registration Analysis

cat << 'EOF' > /tmp/reg_test.c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>
#include <sys/time.h>

int main() {
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t sizes[] = {4096, 1<<20, 1<<30};  // 4KB, 1MB, 1GB

    for (int i = 0; i < 3; i++) {
        void *buf = malloc(sizes[i]);
        if (!buf)
            continue;

        struct timeval start, end;
        gettimeofday(&start, NULL);

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizes[i],
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

        gettimeofday(&end, NULL);

        if (!mr) {
            perror("ibv_reg_mr");
            free(buf);
            continue;
        }

        long usec = (end.tv_sec - start.tv_sec) * 1000000 +
                    (end.tv_usec - start.tv_usec);

        printf("Size %10zu: reg time = %ld μs, pages = %zu\n",
               sizes[i], usec, sizes[i] / 4096);

        ibv_dereg_mr(mr);
        free(buf);
    }

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
EOF
# Compile: gcc /tmp/reg_test.c -o /tmp/reg_test -libverbs

# EXPECTED OUTPUT:
# Size       4096: reg time = 50 μs, pages = 1
# Size    1048576: reg time = 500 μs, pages = 256
# Size 1073741824: reg time = 500000 μs, pages = 262144
#
# CALCULATION:
# Registration pins all pages (mlock equivalent)
# 1GB = 262144 pages to pin
# Each page: validate, add to NIC translation table
# ~2μs per page → 262144 × 2μs = 524ms

COMMAND 4: RDMA Write vs Socket Send

# Latency comparison
ib_write_lat -d rxe0 &   # Server
ib_write_lat -d rxe0 127.0.0.1  # Client

# Shows:
# bytes    iterations    t_min[μs]    t_max[μs]    t_avg[μs]
# 2        1000          10.00        50.00        15.00

# Compare with:
# ping -c 1000 localhost | tail -1
# rtt min/avg/max = 0.020/0.025/0.050 ms = 20/25/50 μs

# CALCULATION:
# RDMA write latency: 15μs (softROCE)
# Socket ping latency: 25μs
# Ratio: 25/15 = 1.67× faster (RDMA)
#
# With real hardware:
# RDMA: 1-2μs
# Socket: 10-25μs
# Ratio: 10-25× faster!

FINAL PARADOX QUESTIONS

Q1: RDMA is "zero copy" but registration takes ~500ms for 1GB?
    
    ANSWER:
    Registration is a ONE-TIME cost
    Once registered, every transfer is zero-copy
    1GB registration = ~500ms of CPU, paid once
    1000 × 1GB RDMA transfers = ~500ms total CPU (NIC moves the data)
    Socket: no setup, but ~200ms of CPU copying per 1GB transfer
            → 1000 transfers = 200 seconds of CPU
    
Q2: Remote CPU "doesn't know" about RDMA write. How to notify?
    
    ANSWER METHODS:
    1. Polling: receiver loops on memory location
    2. Send with completion: final SEND wakes receiver
    3. Atomic: CAS on a counter
    
    Cost:
    Polling: ~0 added latency, but 100% CPU burned on the receiver
    Send: +1μs latency, receiver CPU idle until the completion arrives
    
Q3: Why is RDMA not used everywhere?
    
    REASONS:
    1. Special NIC required: $500-5000
    2. Memory must be registered: 500ms for 1GB
    3. Programming model different: no sockets
    4. Security: remote can write your memory!

AXIOMATIC DIAGRAMMATIC DEBUGGER TRACE

TRACE 1: RDMA WRITE POSTING

START: IBV_POST_SEND

R1. WR_PREP:
      WR.ADDR  = 0x1000_0000 (local buffer)
      WR.LKEY  = 0xABCD (MR key)
      WR.RKEY  = 0x1234 (remote key)
      WR.RADDR = 0x2000_0000 (remote address)

R2. DOORBELL:
      MMIO_WRITE(BAR + 0x10) = QP_NUM
      CPU → PCIe bus → NIC. NIC wakes up.

R3. NIC_FETCH_WQE:
      NIC DMA-reads the WQE from the send queue in user memory.
      WQE decode: RDMA_WRITE, len = 4096.

R4. NIC_DMA_READ:
      Look up LKEY 0xABCD in the NIC MTT (translation table).
      VA 0x1000_0000 → PA 0x3000_0000.
      DMA-read 4096B from PA 0x3000_0000 into the NIC.

R5. PACKET_TX:
      Construct IB packet: DstLID, Op=RDMA_WRITE, RKEY=0x1234,
      RADDR=0x2000_0000. Payload = 4096B. Send to wire.

R6. COMPLETION:
      ACK from remote. NIC writes CQE to the user CQ buffer.
      User polls CQ… Found ✓

