Harnessing 3200Gbps Network (3): libfabric
libfabric is a generic high-performance network interface that’s very similar in style to the RDMA ibverbs interface mentioned earlier, but is easier to use. Applications only need to call libfabric’s high-level interface, while specific protocols are implemented by different Providers. libfabric’s official Providers include tcp, udp, shm (shared memory), verbs (i.e., RDMA ibverbs), and most importantly for this article, efa.
Concepts and Terminology
libfabric defines many concepts and terms, and includes many abbreviations. Here’s a slightly informal overview.
Software Object Model
- Fabric: Collection of all hardware resources and software states, similar to a structure for storing global state.
- Domain: Similar to a network card, like eth0. A Fabric can have multiple Domains.
- (EP) Endpoint: A point for sending and receiving data. A Domain can have multiple Endpoints, and each Endpoint can have a different type. For example, on the same Ethernet card, you can listen on multiple ip:ports and use different protocol types like TCP and UDP.
- (EQ) Event Queue: Reports completion of control plane operations. Not used in this article.
- (CQ) Completion Queue: Reports completion of data plane operations (RECV/SEND/WRITE/READ). A CQ is a resource of a Domain and can be bound to an Endpoint.
- (AV) Address Vector: Stores resolved network addresses. Communication partners must be added to the AV before initiating communication. An AV is a resource of a Domain and can be bound to an Endpoint.
- (MR) Memory Region: Buffer for sending and receiving data. All RDMA operations require specifying an MR. Registering an MR requires going through the operating system kernel because the kernel needs to set up CPU page tables and other PCIe device page tables.
Communication Methods
- Endpoint Types
  - FI_EP_MSG (Reliable-connected): Similar to RC QP in RDMA, connection-based reliable transport. EFA doesn’t support this type.
  - FI_EP_RDM (Reliable-unconnected): Reliable datagram-based transport.
    - Includes retransmission, guarantees message delivery order.
    - Can transmit data of any size. Can use both One-sided RDMA (WRITE/READ) and Two-sided RDMA (RECV/SEND).
    - EFA uses Amazon’s custom SRD (Scalable Reliable Datagram) protocol, making this EFA’s primary endpoint type.
    - The verbs Provider doesn’t natively support this endpoint type because RDMA itself doesn’t have a similar QP type. However, libfabric can emulate this endpoint type through the RxM (RDM over MSG) Provider.
  - FI_EP_DGRAM (Unreliable datagram): Similar to UD QP in RDMA, unreliable datagram-based transport. On EFA, it can only transmit data smaller than the MTU and can only use Two-sided RDMA. Won’t be covered in detail here.
- Endpoint Capabilities
  - FI_MSG: Two-sided RDMA: RECV/SEND. See fi_msg(3).
  - FI_TAGGED: Similar to FI_MSG, but each message carries a tag, and the receiver can select a buffer based on this tag. Won’t be covered in detail. See fi_tagged(3).
  - FI_RMA: One-sided RDMA: WRITE/WRITE_IMM/READ. See fi_rma(3).
  - FI_ATOMIC: One-sided RDMA atomic operations. Won’t be covered in detail. See fi_atomic(3).
Installing libfabric
libfabric depends on rdma-core for RDMA operations and GDRCopy for implementing GPUDirect DMA. Additionally, libfabric provides a performance testing suite called fabtests. The following script creates a build/ directory in the current location, installs rdma-core from the system package manager, downloads the source code of the other libraries into build/, and compiles and installs them into subdirectories of build/.
mkdir -p build
cd build
BUILD_DIR=$(pwd)
# RDMA Core
sudo apt-get install rdma-core
# GDRCopy
wget -O gdrcopy-2.4.4.tar.gz https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.4.tar.gz
tar xf gdrcopy-2.4.4.tar.gz
cd gdrcopy-2.4.4/
make prefix="$BUILD_DIR/gdrcopy" \
CUDA=/usr/local/cuda \
-j$(nproc --all) all install
cd ..
export LD_LIBRARY_PATH="$BUILD_DIR/gdrcopy/lib:$LD_LIBRARY_PATH"
# libfabric
wget https://github.com/ofiwg/libfabric/releases/download/v2.0.0/libfabric-2.0.0.tar.bz2
tar xf libfabric-2.0.0.tar.bz2
cd libfabric-2.0.0
./configure --prefix="$BUILD_DIR/libfabric" \
--with-cuda=/usr/local/cuda \
--with-gdrcopy="$BUILD_DIR/gdrcopy"
make -j$(nproc --all)
make install
cd ..
export LD_LIBRARY_PATH="$BUILD_DIR/libfabric/lib:$LD_LIBRARY_PATH"
# fabtests
wget https://github.com/ofiwg/libfabric/releases/download/v2.0.0/fabtests-2.0.0.tar.bz2
tar xf fabtests-2.0.0.tar.bz2
cd fabtests-2.0.0
./configure --prefix="$BUILD_DIR/fabtests" \
--with-cuda=/usr/local/cuda \
--with-libfabric="$BUILD_DIR/libfabric"
make -j$(nproc --all)
make install
cd ..
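Before moving on, it can help to confirm that the two binaries used in the rest of this article ended up where the script intended. The following is a minimal sketch and not part of the build script itself: check_install is a hypothetical helper name, and it only tests that fi_info and fi_rma_bw exist and are executable under the given build directory. It does not verify that they work with your hardware.

```shell
# Hypothetical helper: verify that the tools used later in this article
# were installed under the given build directory (defaults to ./build).
check_install() {
    local build_dir="${1:-./build}"
    local missing=0
    local tool
    for tool in libfabric/bin/fi_info fabtests/bin/fi_rma_bw; do
        if [ -x "$build_dir/$tool" ]; then
            echo "ok: $build_dir/$tool"
        else
            echo "missing: $build_dir/$tool"
            missing=1
        fi
    done
    # Exit status 0 means everything was found, 1 means something is missing.
    return "$missing"
}

# Example: report the state of ./build without aborting the shell.
check_install ./build || echo "some tools are missing; re-run the build steps above"
```

The helper returns a non-zero status when anything is missing, so it can also be used as a guard in front of later commands.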
Running Example Programs
Now we can run libfabric’s built-in example programs. This serves two purposes: verifying that the software is properly installed and getting an initial understanding of our hardware.
Getting Basic NIC Information
./build/libfabric/bin/fi_info --verbose
---
fi_info:
caps: [ FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SOURCE, FI_DIRECTED_RECV ]
mode: [ ]
addr_format: FI_ADDR_EFA
src_addrlen: 32
dest_addrlen: 0
src_addr: fi_addr_efa://[fe80::8e7:efff:feee:e81d]:0:0
dest_addr: (null)
handle: (nil)
fi_tx_attr:
caps: [ FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND ]
mode: [ ]
op_flags: [ FI_COMPLETION, FI_INJECT, FI_TRANSMIT_COMPLETE, FI_DELIVERY_COMPLETE ]
msg_order: [ FI_ORDER_SAS, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
inject_size: 4096
size: 4096
iov_limit: 4
rma_iov_limit: 1
tclass: 0x0
fi_rx_attr:
caps: [ FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_SOURCE, FI_DIRECTED_RECV ]
mode: [ ]
op_flags: [ FI_MULTI_RECV, FI_COMPLETION ]
msg_order: [ FI_ORDER_SAS, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
size: 8192
iov_limit: 4
fi_ep_attr:
type: FI_EP_RDM
protocol: FI_PROTO_EFA
protocol_version: 4
max_msg_size: 18446744073709551615
msg_prefix_size: 0
max_order_raw_size: 8776
max_order_war_size: 8776
max_order_waw_size: 8776
mem_tag_format: 0xaaaaaaaaaaaaaaaa
tx_ctx_cnt: 1
rx_ctx_cnt: 1
auth_key_size: 0
fi_domain_attr:
domain: 0x0
name: rdmap79s0-rdm
threading: FI_THREAD_SAFE
progress: FI_PROGRESS_AUTO
resource_mgmt: FI_RM_ENABLED
av_type: FI_AV_TABLE
mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY, FI_MR_HMEM ]
mr_key_size: 4
cq_data_size: 4
cq_cnt: 512
ep_cnt: 256
tx_ctx_cnt: 256
rx_ctx_cnt: 256
max_ep_tx_ctx: 1
max_ep_rx_ctx: 1
max_ep_stx_ctx: 0
max_ep_srx_ctx: 0
cntr_cnt: 0
mr_iov_limit: 1
caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
mode: [ ]
auth_key_size: 0
max_err_data: 0
mr_cnt: 262144
tclass: 0x0
fi_fabric_attr:
name: efa
prov_name: efa
prov_version: 200.0
api_version: 2.0
nic:
fi_device_attr:
name: rdmap79s0
device_id: 0xefa1
device_version: 6
vendor_id: 0x1d0f
driver: efa
firmware: 0.0.0.0
fi_bus_attr:
bus_type: FI_BUS_PCI
fi_pci_attr:
domain_id: 0
bus_id: 79
device_id: 0
function_id: 0
fi_link_attr:
address: EFA-fe80::8e7:efff:feee:e81d
mtu: 8760
speed: 100000000000
state: FI_LINK_UP
network_type: Ethernet
---
...
fi_info outputs information about all network cards. Above is the complete output for the first EFA card, for reference. The key points that we’ll use in later articles are highlighted below:
src_addrlen: 32                         # EFA address length is 32 bytes
fi_tx_attr:
    size: 4096                          # Send queue can hold 4096 operations
    iov_limit: 4                        # Send operation can specify up to 4 local buffers
    rma_iov_limit: 1                    # Send operation can only specify one remote buffer
fi_rx_attr:
    size: 8192                          # Receive queue can hold 8192 operations
    iov_limit: 4                        # Receive operation can specify up to 4 local buffers
fi_ep_attr:
    max_msg_size: 18446744073709551615  # Can send data of any size
fi_domain_attr:
    name: rdmap79s0-rdm                 # Domain name
    mr_key_size: 4                      # MR remote key size is 4 bytes
    cq_data_size: 4                     # WRITE_IMM immediate data is 4 bytes
    mr_iov_limit: 1                     # Can only specify one buffer when registering MR
nic:
    fi_device_attr:
        name: rdmap79s0                 # NIC name
        driver: efa                     # EFA NIC
    fi_link_attr:
        mtu: 8760                       # Maximum packet size is 8760 bytes
        speed: 100000000000             # NIC speed is 100 Gbps
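If you want to pull a few of these values out in a script instead of reading the full dump, a small awk filter is usually enough. The sketch below is an illustration only: fi_field is a hypothetical helper name, it treats every line of the form "name:" as the start of a new section, and it has only been reasoned through against the output format shown above. The sample text is copied from that output.

```shell
# Hypothetical helper: print the value of `key` inside `section` from
# `fi_info --verbose` style text on stdin. Only the first match is printed.
fi_field() {
    awk -v section="$1:" -v key="$2:" '
        $1 == section           { in_section = 1; next }  # entered the wanted section
        NF == 1 && /:$/         { in_section = 0 }        # any other section header
        in_section && $1 == key { print $2; exit }
    '
}

# A few lines taken from the fi_info output shown above.
sample='fi_domain_attr:
name: rdmap79s0-rdm
mr_key_size: 4
nic:
fi_device_attr:
name: rdmap79s0
fi_link_attr:
mtu: 8760
speed: 100000000000'

printf '%s\n' "$sample" | fi_field fi_link_attr mtu      # prints 8760
printf '%s\n' "$sample" | fi_field fi_device_attr name   # prints rdmap79s0
```

On a real machine the same helper could be fed from ./build/libfabric/bin/fi_info --verbose. Because fi_info lists all network cards and the helper stops at the first match, this reports the value for the first card only.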
Bandwidth Testing
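We can also run the bandwidth test from fabtests. The following commands perform GPUDirect RDMA WRITE operations between two machines: the first machine starts the test as the server, and the second connects to it by IP address. As we can see, it reaches a maximum speed of 11843.86 MB/sec (at a message size of 768k), which is 94.751 Gbps, nearly saturating the NIC’s 100 Gbps bandwidth.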
ip-172-19-226-174$ ./build/fabtests/bin/fi_rma_bw -p efa -o write -E -D cuda -S all
ip-172-19-230-131$ ./build/fabtests/bin/fi_rma_bw -p efa -o write -E -D cuda -S all 172.19.226.174
bytes iters total time MB/sec usec/xfer Mxfers/sec
1 20k 19k 0.03s 0.75 1.33 0.75
2 20k 39k 0.03s 1.58 1.27 0.79
3 20k 58k 0.03s 2.36 1.27 0.79
4 20k 78k 0.03s 3.18 1.26 0.79
6 20k 117k 0.03s 4.72 1.27 0.79
8 20k 156k 0.03s 6.37 1.26 0.80
12 20k 234k 0.03s 9.48 1.27 0.79
16 20k 312k 0.03s 12.66 1.26 0.79
24 20k 468k 0.03s 19.18 1.25 0.80
32 20k 625k 0.03s 24.78 1.29 0.77
48 20k 937k 0.02s 38.75 1.24 0.81
64 20k 1.2m 0.02s 51.87 1.23 0.81
96 20k 1.8m 0.02s 77.48 1.24 0.81
128 20k 2.4m 0.02s 103.46 1.24 0.81
192 20k 3.6m 0.02s 155.55 1.23 0.81
256 20k 4.8m 0.02s 206.08 1.24 0.80
384 20k 7.3m 0.02s 307.40 1.25 0.80
512 20k 9.7m 0.02s 412.82 1.24 0.81
768 20k 14m 0.03s 614.11 1.25 0.80
1k 20k 19m 0.03s 814.12 1.26 0.80
1.5k 20k 29m 0.03s 1212.41 1.27 0.79
2k 20k 39m 0.03s 1590.49 1.29 0.78
3k 20k 58m 0.03s 2340.84 1.31 0.76
4k 20k 78m 0.03s 3068.05 1.34 0.75
6k 20k 117m 0.03s 4462.85 1.38 0.73
8k 20k 156m 0.03s 5440.12 1.51 0.66
12k 20k 234m 0.04s 6770.81 1.81 0.55
16k 20k 312m 0.04s 7921.86 2.07 0.48
24k 20k 468m 0.06s 8319.43 2.95 0.34
32k 20k 625m 0.07s 9470.25 3.46 0.29
48k 20k 937m 0.10s 10127.33 4.85 0.21
64k 2k 125m 0.01s 10465.67 6.26 0.16
96k 2k 187m 0.02s 10885.78 9.03 0.11
128k 2k 250m 0.03s 10215.66 12.83 0.08
192k 2k 375m 0.03s 11301.58 17.40 0.06
256k 2k 500m 0.05s 11501.58 22.79 0.04
384k 2k 750m 0.08s 10123.87 38.84 0.03
512k 2k 1000m 0.10s 10749.66 48.77 0.02
768k 2k 1.4g 0.13s 11843.86 66.40 0.02
1m 200 200m 0.02s 10968.94 95.60 0.01
1.5m 200 300m 0.03s 10837.99 145.12 0.01
2m 200 400m 0.04s 10204.87 205.51 0.00
3m 200 600m 0.06s 10522.41 298.95 0.00
4m 200 800m 0.08s 10826.94 387.39 0.00
6m 200 1.1g 0.11s 11488.20 547.65 0.00
8m 200 1.5g 0.16s 10504.60 798.57 0.00
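The conversion from the MB/sec column to Gbps can be reproduced with a short calculation. This is a sketch under one assumption: that MB here means 10^6 bytes, which is the interpretation consistent with the 94.751 Gbps figure quoted above. Both function names are hypothetical helpers, and the link speed is the speed value reported by fi_info.

```shell
# Convert an fi_rma_bw "MB/sec" figure to Gbps.
# Assumption: MB means 10^6 bytes, which matches the 94.751 Gbps quoted above.
mb_per_sec_to_gbps() {
    awk -v mb="$1" 'BEGIN { printf "%.3f\n", mb * 8 / 1000 }'
}

# Utilization of a link in percent, given the measured MB/sec and the link
# speed in bits per second (the unit fi_info uses for "speed").
link_utilization_pct() {
    awk -v mb="$1" -v bps="$2" 'BEGIN { printf "%.1f\n", mb * 8e6 / bps * 100 }'
}

mb_per_sec_to_gbps 11843.86                  # prints 94.751
link_utilization_pct 11843.86 100000000000   # prints 94.8
```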
Now that we’ve verified that both hardware and software are ready, in the next chapter we can start writing code.