This module introduces Remote Direct Memory Access (RDMA), a technology that removes CPU-mediated data copies from the network data path.

The traditional socket send path:
┌──────────────────────────────────────────────────────────────┐
│ Application │
│ │ │
│ │ send(fd, buf, len) │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Kernel (socket) │ ◄─── CPU copy #1 │
│ │ sk_buff │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ NIC Driver │ ◄─── DMA to NIC │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ [ Wire ] │
└──────────────────────────────────────────────────────────────┘
Latency: ~10-50 microseconds
CPU: Involved in every packet
Copies: 2 per packet (send + receive)

The RDMA send path:
┌──────────────────────────────────────────────────────────────┐
│ Application │
│ │ │
│ │ ibv_post_send(qp, wr, ...) │
│ ▼ │
│ ┌─────────────────┐ │
│ │ User Buffer │ ◄─── Memory registered with ibv_reg_mr │
│ │ (pinned in RAM) │ │
│ └────────┬────────┘ │
│ │ │
│ │ DMA directly from user buffer │
│ ▼ │
│ ┌─────────────────┐ │
│ │ RDMA NIC │ ◄─── No kernel involved! │
│ │ (RNIC) │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ [ Wire ] │
└──────────────────────────────────────────────────────────────┘
Latency: ~1-2 microseconds
CPU: Not involved in data path
Copies: 0 (zero-copy)
Before RDMA can access memory, it must be registered:
struct ibv_mr *mr = ibv_reg_mr(
pd, // Protection domain
buffer, // Virtual address
size, // Buffer size
IBV_ACCESS_LOCAL_WRITE | // Allow local writes
IBV_ACCESS_REMOTE_WRITE | // Allow remote writes
IBV_ACCESS_REMOTE_READ // Allow remote reads
);
What happens: the kernel pins the buffer's pages in RAM, the NIC builds a virtual-to-physical translation table for them, and the call returns an lkey (used in local work requests) and an rkey (handed to remote peers for RDMA READ/WRITE). Work is then submitted through a Queue Pair (QP):
┌─────────────────────────────────────────────────────────────┐
│ Queue Pair (QP) │
│ │
│ ┌─────────────────────┐ ┌─────────────────────────────┐ │
│ │ Send Queue │ │ Receive Queue │ │
│ │ │ │ │ │
│ │ [Work Request 0] │ │ [Work Request 0] │ │
│ │ [Work Request 1] │ │ [Work Request 1] │ │
│ │ [Work Request 2] │ │ │ │
│ │ ... │ │ │ │
│ └──────────┬──────────┘ └──────────────┬──────────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Completion Queue (CQ) │ │
│ │ │ │
│ │ [Completion 0] [Completion 1] [Completion 2] ... │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
| Operation | Description | Remote CPU involved? |
|---|---|---|
| SEND | Push data into a buffer the remote side posted with RECV | Yes (generates a completion) |
| RECV | Post a buffer for an incoming SEND | — |
| WRITE | Write directly into remote memory | No |
| READ | Read directly from remote memory | No |
#include <infiniband/verbs.h>
// 1. Get device list
struct ibv_device **dev_list = ibv_get_device_list(NULL);
struct ibv_context *ctx = ibv_open_device(dev_list[0]);
// 2. Allocate protection domain
struct ibv_pd *pd = ibv_alloc_pd(ctx);
// 3. Create completion queue
struct ibv_cq *cq = ibv_create_cq(ctx, 10, NULL, NULL, 0);
// 4. Create queue pair
struct ibv_qp_init_attr qp_attr = {
.send_cq = cq,
.recv_cq = cq,
.qp_type = IBV_QPT_RC, // Reliable Connection
.cap = {
.max_send_wr = 10,
.max_recv_wr = 10,
.max_send_sge = 1,
.max_recv_sge = 1,
},
};
struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
// 5. Register memory
char buffer[4096];
struct ibv_mr *mr = ibv_reg_mr(pd, buffer, sizeof(buffer),
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
// Before posting, the QP must be connected: transition it INIT → RTR → RTS
// with ibv_modify_qp, and exchange raddr (address) and rkey (remote key)
// with the peer out of band (e.g., over a TCP socket).
struct ibv_sge sge = {
.addr = (uintptr_t)buffer,
.length = data_len,
.lkey = mr->lkey,
};
struct ibv_send_wr wr = {
.opcode = IBV_WR_RDMA_WRITE,
.sg_list = &sge,
.num_sge = 1,
.send_flags = IBV_SEND_SIGNALED,
.wr.rdma = {
.remote_addr = raddr,
.rkey = rkey,
},
};
struct ibv_send_wr *bad_wr;
ibv_post_send(qp, &wr, &bad_wr);
// Poll for completion
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0) {
// Wait
}
if (wc.status != IBV_WC_SUCCESS) {
fprintf(stderr, "RDMA write failed\n");
}
# Load RXE module
$ sudo modprobe rdma_rxe
# Add RXE device on lo interface
$ sudo rdma link add rxe0 type rxe netdev lo
# Verify
$ rdma link
link rxe0/1 state ACTIVE physical_state LINK_UP netdev lo
$ ibv_devices
device node GUID
------ ---------
rxe0 505400fffef6f6f6
# Terminal 1: Server
$ rping -s -v
# Terminal 2: Client
$ rping -c -a 127.0.0.1 -v
Operation Socket RDMA
──────────────────────────────────────
Small message 10-50 μs 1-2 μs
Context switch Yes No
Copies 2 0
CPU per message High Near zero
Socket (100 Gbps NIC): ~40 Gbps (CPU limited)
RDMA (100 Gbps NIC): ~95 Gbps (line rate)
Exercises:
1. Configure the RXE device and run ibv_devinfo to see its attributes.
2. Time ibv_reg_mr for different buffer sizes and plot the results.
3. Implement a simple ping-pong using SEND/RECV work requests and completion polling.
GIVEN:
Buffer size = 1GB = 1073741824 bytes
Page size = 4096 bytes
Each page needs physical address entry: 8 bytes
TASK:
1. Pages in buffer = ___ / 4096 = ___ pages
2. Translation table size = ___ × 8 = ___ bytes = ___ MB
3. If NIC can hold 1MB of translation entries:
Max registrable memory = 1MB / 8 × 4096 = ___ bytes = ___ GB
GIVEN:
Max outstanding sends = 128
Max outstanding receives = 64
Each WQE (work queue entry) = 64 bytes
Each CQE (completion queue entry) = 32 bytes
TASK:
1. Send queue size = ___ × 64 = ___ bytes
2. Receive queue size = ___ × 64 = ___ bytes
3. Total QP size = ___ + ___ = ___ bytes = ___ KB
4. CQ size for 128+64 completions = ___ × 32 = ___ bytes
GIVEN:
local_buffer = 0x7F00_0000_0000
local_lkey = 0x1234
remote_addr = 0x7F00_1000_0000
remote_rkey = 0x5678
length = 4096
TASK: Fill ibv_send_wr structure
struct ibv_sge sge = {
.addr = 0x___,
.length = ___,
.lkey = 0x___,
};
struct ibv_send_wr wr = {
.opcode = IBV_WR_RDMA___,
.sg_list = &sge,
.num_sge = ___,
.wr.rdma.remote_addr = 0x___,
.wr.rdma.rkey = 0x___,
};
GIVEN:
Socket send: 25μs
Socket recv: 25μs
RDMA post_send: 0.5μs
RDMA poll_cq: 0.5μs
Network RTT: 5μs
TASK:
Socket round-trip = ___ + ___ + ___ + ___ = ___ μs
RDMA round-trip = ___ + ___ + ___ = ___ μs
Speedup = ___ / ___ = ___×
FAILURE 1: Forgetting to register memory → NIC cannot DMA → fault
FAILURE 2: Using wrong rkey → remote side rejects RDMA
FAILURE 3: Buffer not page-aligned → registration may fail or be slow
FAILURE 4: num_sge wrong → reading garbage scatter-gather entries
FAILURE 5: Not polling CQ → completions lost, resources exhausted
ibv_reg_mr(pd, buf, 1GB, flags) returns:
mr->lkey = 0x1234 (local key for local operations)
mr->rkey = 0x5678 (remote key for RDMA WRITE/READ)
Internal: 1GB / 4KB = 262144 pages pinned
Translation table: 262144 × 8 = 2MB NIC memory used
Socket: page can swap during transfer → copy to kernel first
RDMA: NIC DMAs directly from user buffer
If page swaps: NIC reads wrong data (or crashes)
Pin = guarantee physical address stable
262144 pages pinned = 1GB RAM unswappable
QP created in user-mapped NIC memory:
Send Queue at 0x7F00_0000_0000 (user VA)
Recv Queue at 0x7F00_0001_0000 (user VA)
CQ at 0x7F00_0002_0000 (user VA)
Doorbell register: write to notify NIC of new WR
Socket: CPU executes memcpy from/to kernel
RDMA: NIC DMA engine moves data
RDMA WRITE: local NIC → remote RAM (remote CPU idle)
RDMA READ: remote RAM → local NIC → local RAM
CPU involvement: 0 for data, only for posting WRs
T₁: ibv_post_send(qp, wr) → WR in send queue
T₂: NIC processes WR → DMA starts
T₃: DMA completes → NIC generates CQE
T₄: ibv_poll_cq(cq, 1, &wc) → wc.status = IBV_WC_SUCCESS
Latency: T₄ - T₁ = 1-2 μs for small message
1μs RDMA latency vs 25μs socket latency
25× faster per operation
1 million ops/sec:
Socket: 1M × 25μs = 25 seconds CPU time
RDMA: 1M × 1μs = 1 second, but offloaded to NIC
IBV_WR_SEND: push to remote recv buffer
IBV_WR_RDMA_WRITE: write to remote memory
IBV_WR_RDMA_READ: read from remote memory
IBV_WR_ATOMIC_CMP_AND_SWP: atomic compare-and-swap
WRITE/READ: remote CPU unaware, memory accessed directly by the NIC
Buffer = 16GB
Pages = 16GB / 4KB = 4194304 pages
Entry size = 8 bytes (PA per page)
Table = 4194304 × 8 = 33554432 bytes = 32MB
NIC SRAM typically 16MB → cannot register 16GB in one MR!
WR addr = 0x7F0000001000
sge.addr = 0x7F0000002000 (data buffer)
sge.length = 4096
sge.lkey = 0x1234
Doorbell write at T₀
NIC reads WR at T₁ = T₀ + 100ns
DMA starts at T₂ = T₁ + 50ns
while (ibv_poll_cq(cq, 1, &wc) == 0) → spin
Return 1 → one completion retrieved
wc.status == IBV_WC_SUCCESS → success
wc.status == IBV_WC_REM_ACCESS_ERR → remote access error (bad rkey or permissions)
wc.wr_id = user-provided ID for matching completions to requests
1. ibv_reg_mr: pin pages, get lkey/rkey
2. Exchange rkey/raddr with remote via sockets
3. ibv_post_send with RDMA_WRITE
4. ibv_poll_cq until completion
5. Check wc.status == IBV_WC_SUCCESS
FAILURE 7: rkey valid only for that specific MR → wrong rkey = error
FAILURE 8: Forgot to post recv buffer → SEND fails with RNR NAK (Receiver Not Ready)
FAILURE 9: Registration table size limit → large MR fails
FAILURE 10: Page must be pinned entire time → munmap breaks RDMA
# Load RXE (software RDMA over Ethernet)
sudo modprobe rdma_rxe
sudo rdma link add rxe0 type rxe netdev lo
# Verify
rdma link
ibv_devices
ibv_devinfo rxe0
# WHAT: Software RDMA implementation on loopback
# WHY: Test RDMA code without hardware
# WHERE: Kernel module rdma_rxe + lo interface
# WHO: RDMA core + rxe provider
# WHEN: Any time, no special NIC needed
# WITHOUT: Need Mellanox/Intel/Broadcom RNIC
# WHICH: rxe0 device created, use as normal RDMA
# CALCULATION:
# Real RDMA NIC: 1-2μs latency
# SoftROCE: ~50μs latency (software overhead)
# Still shows zero-copy benefit for bandwidth tests
# Terminal 1: Server
ib_write_bw -d rxe0
# Terminal 2: Client
ib_write_bw -d rxe0 127.0.0.1
# OUTPUT shows:
# bytes iterations BW peak[MB/sec] BW average[MB/sec]
# 65536 1000 5000.00 4800.00
#
# CALCULATION:
# 64KB messages × 1000 = 64MB transferred
# 4800 MB/sec = 4.8 GB/sec = 38.4 Gbps
#
# For real 100Gbps NIC:
# Expected: ~12 GB/sec = 96 Gbps (line rate - overhead)
cat << 'EOF' > /tmp/reg_test.c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>
#include <sys/time.h>
int main() {
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) { fprintf(stderr, "no RDMA device\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    size_t sizes[] = {4096, 1<<20, 1<<30}; // 4KB, 1MB, 1GB
    for (int i = 0; i < 3; i++) {
        void *buf = malloc(sizes[i]);
        if (!buf) continue;
        struct timeval start, end;
        gettimeofday(&start, NULL);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizes[i],
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
        gettimeofday(&end, NULL);
        long usec = (end.tv_sec - start.tv_sec) * 1000000 +
                    (end.tv_usec - start.tv_usec);
        if (!mr) { perror("ibv_reg_mr"); free(buf); continue; }
        printf("Size %10zu: reg time = %ld μs, pages = %zu\n",
               sizes[i], usec, sizes[i] / 4096);
        ibv_dereg_mr(mr);
        free(buf);
    }
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
EOF
# Compile: gcc /tmp/reg_test.c -o /tmp/reg_test -libverbs
# EXPECTED OUTPUT:
# Size 4096: reg time = 50 μs, pages = 1
# Size 1048576: reg time = 5000 μs, pages = 256
# Size 1073741824: reg time = 500000 μs, pages = 262144
#
# CALCULATION:
# Registration pins all pages (mlock equivalent)
# 1GB = 262144 pages to pin
# Each page: validate, add to NIC translation table
# ~2μs per page → 262144 × 2μs = 524ms
# Latency comparison
ib_write_lat -d rxe0 & # Server
ib_write_lat -d rxe0 127.0.0.1 # Client
# Shows:
# bytes iterations t_min[μs] t_max[μs] t_avg[μs]
# 2 1000 10.00 50.00 15.00
# Compare with:
# ping -c 1000 localhost | tail -1
# rtt min/avg/max = 0.020/0.025/0.050 ms = 20/25/50 μs
# CALCULATION:
# RDMA write latency: 15μs (softROCE)
# Socket ping latency: 25μs
# Ratio: 25/15 = 1.67× faster (RDMA)
#
# With real hardware:
# RDMA: 1-2μs
# Socket: 10-25μs
# Ratio: 10-25× faster!
Q1: RDMA is "zero-copy," yet registering 1GB takes ~500ms. Contradiction?
ANSWER:
Registration is a ONE-TIME cost
Once registered, every transfer is zero-copy
1000 × 1GB transfers: 500ms registration + ~0 copy overhead = 500ms total overhead
Socket: 0ms setup, but ~200ms of copy overhead per 1GB → 200 seconds total
Q2: Remote CPU "doesn't know" about RDMA write. How to notify?
ANSWER METHODS:
1. Polling: receiver loops on memory location
2. Send with completion: final SEND wakes receiver
3. Atomic: CAS on a counter
Cost trade-off:
Polling: lowest latency, but burns 100% of a receiver CPU core
Send: adds ~1μs of latency, but the receiver CPU can sleep until the completion
Q3: Why is RDMA not used everywhere?
REASONS:
1. Special NIC required: $500-5000
2. Memory must be registered: 500ms for 1GB
3. Programming model different: no sockets
4. Security: remote can write your memory!
START: ibv_post_send
R1. WR_PREP:
    WR.ADDR  = 0x1000_0000 (local buffer)
    WR.LKEY  = 0xABCD (MR key)
    WR.RKEY  = 0x1234 (remote key)
    WR.RADDR = 0x2000_0000 (remote address)
R2. DOORBELL:
    MMIO write to BAR + 0x10 = QP_NUM
    CPU → PCIe bus → NIC; NIC wakes up
R3. NIC_FETCH_WQE:
    NIC DMA-reads the WQE from the send queue in user memory
    WQE decodes to: RDMA_WRITE, len = 4096
R4. NIC_DMA_READ:
    Look up LKEY 0xABCD in the NIC MTT (translation table)
    VA 0x1000_0000 → PA 0x3000_0000
    DMA-read 4096B from PA 0x3000_0000 into the NIC
R5. PACKET_TX:
    Build IB packet: DstLID, Op = RDMA_WRITE, RKEY = 0x1234, RADDR = 0x2000_0000
    Payload = 4096B; send to wire
R6. COMPLETION:
    ACK from remote; NIC writes CQE to the user CQ buffer
    User polls CQ… found ✓