libfabric is a generic high-performance network interface that’s very similar in style to the RDMA ibverbs interface mentioned earlier, but is easier to use. Applications only need to call libfabric’s high-level interface, while specific protocols are implemented by different Providers. libfabric’s official Providers include tcp, udp, shm (shared memory), verbs (i.e., RDMA ibverbs), and most importantly for this article, efa.

Concepts and Terminology

libfabric defines many concepts and terms, and includes many abbreviations. Here’s a slightly informal overview.

Software Object Model

[Figure: libfabric Software Object Model]

  • Fabric: Collection of all hardware resources and software states, similar to a structure for storing global state.
  • Domain: Similar to a network card, like eth0. A Fabric can have multiple Domains.
  • (EP) Endpoint: The point where data is sent and received. A Domain can have multiple Endpoints, and each Endpoint can be of a different type. As an analogy, on the same Ethernet card you can listen on multiple IP:port pairs and use different protocols such as TCP and UDP.
  • (EQ) Event Queue: Reports completion of control plane operations. Not used in this article.
  • (CQ) Completion Queue: Reports completion of data plane operations (RECV / SEND / WRITE / READ). A CQ is a resource of a Domain and can be bound to an Endpoint.
  • (AV) Address Vector: Stores resolved network addresses. A communication partner must be added to the AV before communication with it can start. An AV is a resource of a Domain and can be bound to an Endpoint.
  • (MR) Memory Region: Buffer for sending and receiving data. All RDMA operations require specifying an MR. Registering an MR requires going through the operating system kernel because the kernel needs to set up CPU page tables and other PCIe device page tables.
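
To make the ownership and binding relationships above easier to picture, here is a toy model in Python. It is only a sketch of the relationships described in the list: none of the class or method names are libfabric API, completions are recorded immediately instead of asynchronously, and the 32-byte address is just a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class CompletionQueue:
    """Toy CQ: records completed data plane operations."""
    completions: List[Tuple[str, int, int]] = field(default_factory=list)


@dataclass
class AddressVector:
    """Toy AV: stores peer addresses and hands back an integer handle."""
    peers: List[bytes] = field(default_factory=list)

    def insert(self, raw_addr: bytes) -> int:
        self.peers.append(raw_addr)
        return len(self.peers) - 1


@dataclass
class Endpoint:
    """Toy endpoint: must have a CQ and an AV bound before sending."""
    cq: Optional[CompletionQueue] = None
    av: Optional[AddressVector] = None

    def bind(self, resource) -> None:
        if isinstance(resource, CompletionQueue):
            self.cq = resource
        elif isinstance(resource, AddressVector):
            self.av = resource
        else:
            raise TypeError("only a CQ or an AV can be bound in this sketch")

    def send(self, peer: int, payload: bytes) -> None:
        if self.cq is None or self.av is None:
            raise RuntimeError("bind a CQ and an AV first")
        if not 0 <= peer < len(self.av.peers):
            raise KeyError("peer was never inserted into the AV")
        # A real NIC completes asynchronously; this sketch completes at once.
        self.cq.completions.append(("SEND", peer, len(payload)))


@dataclass
class Domain:
    """Toy domain (think: one NIC) that keeps track of its endpoints."""
    name: str
    endpoints: List[Endpoint] = field(default_factory=list)

    def open_endpoint(self) -> Endpoint:
        ep = Endpoint()
        self.endpoints.append(ep)
        return ep


@dataclass
class Fabric:
    """Toy fabric: global state holding every domain."""
    domains: List[Domain] = field(default_factory=list)

    def open_domain(self, name: str) -> Domain:
        domain = Domain(name)
        self.domains.append(domain)
        return domain


fabric = Fabric()
domain = fabric.open_domain("rdmap79s0-rdm")
ep = domain.open_endpoint()
cq, av = CompletionQueue(), AddressVector()
ep.bind(cq)
ep.bind(av)
peer = av.insert(bytes(32))  # placeholder address bytes, not a real EFA address
ep.send(peer, b"hello")
print(cq.completions)
```

The point of the sketch is the order of operations: open the Domain, create the Endpoint, bind a CQ and an AV to it, insert the peer into the AV, and only then send.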

Communication Methods

  • Endpoint Types
    1. FI_EP_MSG (Reliable-connected): Similar to RC QP in RDMA, connection-based reliable transport. EFA doesn’t support this type.
    2. FI_EP_RDM (Reliable-unconnected): Reliable datagram-based transport.
      • Includes retransmission, guarantees message delivery order.
      • Can transmit data of any size. Can use both One-sided RDMA (WRITE / READ) and Two-sided RDMA (RECV / SEND).
      • EFA uses Amazon’s custom SRD (Scalable Reliable Datagram) protocol, making this EFA’s primary endpoint type.
      • verbs Provider doesn’t natively support this endpoint type because RDMA itself doesn’t have a similar QP type. However, libfabric can emulate this endpoint type through the RxM (RDM over MSG) Provider.
    3. FI_EP_DGRAM (Unreliable datagram): Similar to UD QP in RDMA, unreliable datagram-based transport. On EFA, can only transmit data smaller than MTU and can only use Two-sided RDMA. Won’t be covered in detail here.
  • Endpoint Capabilities
    1. FI_MSG: Two-sided RDMA (RECV / SEND). See fi_msg(3)
    2. FI_TAGGED: Similar to FI_MSG, but each message carries a tag and the receiver can select a buffer based on this tag. Won’t be covered in detail. See fi_tagged(3)
    3. FI_RMA: One-sided RDMA (WRITE / WRITE_IMM / READ). See fi_rma(3)
    4. FI_ATOMIC: One-sided RDMA atomic operations. Won’t be covered in detail. See fi_atomic(3)

Installing libfabric

libfabric depends on rdma-core for RDMA operations and GDRCopy for implementing GPUDirect DMA. Additionally, libfabric provides a performance testing suite called fabtests. The following script installs rdma-core from the system package manager, then creates a build/ directory in the current location, downloads the source code of GDRCopy, libfabric, and fabtests into it, and compiles and installs each of them into its own subdirectory of build/.

mkdir -p build
cd build
BUILD_DIR=$(pwd)

# RDMA Core
sudo apt-get install rdma-core

# GDRCopy
wget -O gdrcopy-2.4.4.tar.gz https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.4.tar.gz
tar xf gdrcopy-2.4.4.tar.gz
cd gdrcopy-2.4.4/
make prefix="$BUILD_DIR/gdrcopy" \
    CUDA=/usr/local/cuda \
    -j$(nproc --all) all install
cd ..
export LD_LIBRARY_PATH="$BUILD_DIR/gdrcopy/lib:$LD_LIBRARY_PATH"

# libfabric
wget https://github.com/ofiwg/libfabric/releases/download/v2.0.0/libfabric-2.0.0.tar.bz2
tar xf libfabric-2.0.0.tar.bz2
cd libfabric-2.0.0
./configure --prefix="$BUILD_DIR/libfabric" \
    --with-cuda=/usr/local/cuda \
    --with-gdrcopy="$BUILD_DIR/gdrcopy"
make -j$(nproc --all)
make install
cd ..
export LD_LIBRARY_PATH="$BUILD_DIR/libfabric/lib:$LD_LIBRARY_PATH"

# fabtests
wget https://github.com/ofiwg/libfabric/releases/download/v2.0.0/fabtests-2.0.0.tar.bz2
tar xf fabtests-2.0.0.tar.bz2
cd fabtests-2.0.0
./configure --prefix="$BUILD_DIR/fabtests" \
    --with-cuda=/usr/local/cuda \
    --with-libfabric="$BUILD_DIR/libfabric"
make -j$(nproc --all)
make install
cd ..

Running Example Programs

Now we can run libfabric’s built-in example programs. This serves two purposes: verifying that the software is properly installed and getting an initial understanding of our hardware.

Getting Basic NIC Information

./build/libfabric/bin/fi_info --verbose
---
fi_info:
    caps: [ FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_RECV, FI_SEND, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_REMOTE_COMM, FI_SOURCE, FI_DIRECTED_RECV ]
    mode: [  ]
    addr_format: FI_ADDR_EFA
    src_addrlen: 32
    dest_addrlen: 0
    src_addr: fi_addr_efa://[fe80::8e7:efff:feee:e81d]:0:0
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_READ, FI_WRITE, FI_SEND ]
        mode: [  ]
        op_flags: [ FI_COMPLETION, FI_INJECT, FI_TRANSMIT_COMPLETE, FI_DELIVERY_COMPLETE ]
        msg_order: [ FI_ORDER_SAS, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        inject_size: 4096
        size: 4096
        iov_limit: 4
        rma_iov_limit: 1
        tclass: 0x0
    fi_rx_attr:
        caps: [ FI_MSG, FI_RMA, FI_TAGGED, FI_ATOMIC, FI_RECV, FI_REMOTE_READ, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_SOURCE, FI_DIRECTED_RECV ]
        mode: [  ]
        op_flags: [ FI_MULTI_RECV, FI_COMPLETION ]
        msg_order: [ FI_ORDER_SAS, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        size: 8192
        iov_limit: 4
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
        protocol_version: 4
        max_msg_size: 18446744073709551615
        msg_prefix_size: 0
        max_order_raw_size: 8776
        max_order_war_size: 8776
        max_order_waw_size: 8776
        mem_tag_format: 0xaaaaaaaaaaaaaaaa
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        auth_key_size: 0
    fi_domain_attr:
        domain: 0x0
        name: rdmap79s0-rdm
        threading: FI_THREAD_SAFE
        progress: FI_PROGRESS_AUTO
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_TABLE
        mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_ALLOCATED, FI_MR_PROV_KEY, FI_MR_HMEM ]
        mr_key_size: 4
        cq_data_size: 4
        cq_cnt: 512
        ep_cnt: 256
        tx_ctx_cnt: 256
        rx_ctx_cnt: 256
        max_ep_tx_ctx: 1
        max_ep_rx_ctx: 1
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 0
        mr_iov_limit: 1
        caps: [ FI_LOCAL_COMM, FI_REMOTE_COMM ]
        mode: [  ]
        auth_key_size: 0
        max_err_data: 0
        mr_cnt: 262144
        tclass: 0x0
    fi_fabric_attr:
        name: efa
        prov_name: efa
        prov_version: 200.0
        api_version: 2.0
    nic:
        fi_device_attr:
            name: rdmap79s0
            device_id: 0xefa1
            device_version: 6
            vendor_id: 0x1d0f
            driver: efa
            firmware: 0.0.0.0
        fi_bus_attr:
            bus_type: FI_BUS_PCI
            fi_pci_attr:
                domain_id: 0
                bus_id: 79
                device_id: 0
                function_id: 0
        fi_link_attr:
            address: EFA-fe80::8e7:efff:feee:e81d
            mtu: 8760
            speed: 100000000000
            state: FI_LINK_UP
            network_type: Ethernet
---
...

fi_info prints one record for each network card. The output above is the complete record for the first EFA card, shown for reference. The fields that we’ll rely on in later articles are highlighted below:

src_addrlen: 32                         # EFA address length is 32 bytes
fi_tx_attr:
    size: 4096                          # Send queue can hold 4096 operations
    iov_limit: 4                        # Send operation can specify up to 4 local buffers
    rma_iov_limit: 1                    # RMA operation can target only one remote buffer
fi_rx_attr:
    size: 8192                          # Receive queue can hold 8192 operations
    iov_limit: 4                        # Receive operation can specify up to 4 local buffers
fi_ep_attr:
    max_msg_size: 18446744073709551615  # Can send data of any size
fi_domain_attr:
    name: rdmap79s0-rdm                 # Domain name
    mr_key_size: 4                      # MR remote key size is 4 bytes
    cq_data_size: 4                     # WRITE_IMM immediate data is 4 bytes
    mr_iov_limit: 1                     # Can only specify one buffer when registering MR
nic:
    fi_device_attr:
        name: rdmap79s0                 # NIC name
        driver: efa                     # EFA NIC
    fi_link_attr:
        mtu: 8760                       # Maximum packet size is 8760 bytes
        speed: 100000000000             # NIC speed is 100 Gbps
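
If you want to pull these fields out programmatically, the indented "key: value" layout is easy to parse. The following is a minimal sketch, assuming the layout shown above for a single record; the verbose output is meant for humans and may change between versions, and parse_fi_info is a hypothetical helper, not part of libfabric. The packet count at the end ignores any per-packet protocol headers, so treat it as a lower bound.

```python
import math

# A trimmed copy of the record shown above (one NIC only).
SAMPLE = """\
fi_info:
    src_addrlen: 32
    fi_tx_attr:
        size: 4096
        iov_limit: 4
    fi_ep_attr:
        type: FI_EP_RDM
    fi_domain_attr:
        name: rdmap79s0-rdm
    nic:
        fi_link_attr:
            mtu: 8760
            speed: 100000000000
"""


def parse_fi_info(text: str) -> dict:
    """Parse indented 'key: value' lines into nested dicts of strings."""
    root: dict = {}
    stack = [(-1, root)]  # (indentation of the section header, its dict)
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        key, _, value = line.strip().partition(":")
        value = value.strip()
        # Close every section that is not a parent of this line.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if value:
            parent[key] = value
        else:
            parent[key] = {}
            stack.append((indent, parent[key]))
    return root


info = parse_fi_info(SAMPLE)["fi_info"]
link = info["nic"]["fi_link_attr"]
mtu = int(link["mtu"])
speed_gbps = int(link["speed"]) / 1e9

# Lower bound on the number of packets needed for a 1 MiB message.
min_packets = math.ceil((1 << 20) / mtu)

print(info["fi_domain_attr"]["name"], mtu, speed_gbps, min_packets)
```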

Bandwidth Testing

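We can also run the bandwidth test from fabtests. The following commands perform GPUDirect RDMA WRITE operations between two machines: the first command starts the server, and the second starts the client and connects to the server’s IP address. As we can see, throughput peaks at 11843.86 MB/sec (at a message size of 768k), which is 94.751 Gbps, nearly saturating the 100 Gbps link.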

ip-172-19-226-174$ ./build/fabtests/bin/fi_rma_bw -p efa -o write -E -D cuda -S all
ip-172-19-230-131$ ./build/fabtests/bin/fi_rma_bw -p efa -o write -E -D cuda -S all 172.19.226.174
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
1       20k     19k         0.03s      0.75       1.33       0.75
2       20k     39k         0.03s      1.58       1.27       0.79
3       20k     58k         0.03s      2.36       1.27       0.79
4       20k     78k         0.03s      3.18       1.26       0.79
6       20k     117k        0.03s      4.72       1.27       0.79
8       20k     156k        0.03s      6.37       1.26       0.80
12      20k     234k        0.03s      9.48       1.27       0.79
16      20k     312k        0.03s     12.66       1.26       0.79
24      20k     468k        0.03s     19.18       1.25       0.80
32      20k     625k        0.03s     24.78       1.29       0.77
48      20k     937k        0.02s     38.75       1.24       0.81
64      20k     1.2m        0.02s     51.87       1.23       0.81
96      20k     1.8m        0.02s     77.48       1.24       0.81
128     20k     2.4m        0.02s    103.46       1.24       0.81
192     20k     3.6m        0.02s    155.55       1.23       0.81
256     20k     4.8m        0.02s    206.08       1.24       0.80
384     20k     7.3m        0.02s    307.40       1.25       0.80
512     20k     9.7m        0.02s    412.82       1.24       0.81
768     20k     14m         0.03s    614.11       1.25       0.80
1k      20k     19m         0.03s    814.12       1.26       0.80
1.5k    20k     29m         0.03s   1212.41       1.27       0.79
2k      20k     39m         0.03s   1590.49       1.29       0.78
3k      20k     58m         0.03s   2340.84       1.31       0.76
4k      20k     78m         0.03s   3068.05       1.34       0.75
6k      20k     117m        0.03s   4462.85       1.38       0.73
8k      20k     156m        0.03s   5440.12       1.51       0.66
12k     20k     234m        0.04s   6770.81       1.81       0.55
16k     20k     312m        0.04s   7921.86       2.07       0.48
24k     20k     468m        0.06s   8319.43       2.95       0.34
32k     20k     625m        0.07s   9470.25       3.46       0.29
48k     20k     937m        0.10s  10127.33       4.85       0.21
64k     2k      125m        0.01s  10465.67       6.26       0.16
96k     2k      187m        0.02s  10885.78       9.03       0.11
128k    2k      250m        0.03s  10215.66      12.83       0.08
192k    2k      375m        0.03s  11301.58      17.40       0.06
256k    2k      500m        0.05s  11501.58      22.79       0.04
384k    2k      750m        0.08s  10123.87      38.84       0.03
512k    2k      1000m       0.10s  10749.66      48.77       0.02
768k    2k      1.4g        0.13s  11843.86      66.40       0.02
1m      200     200m        0.02s  10968.94      95.60       0.01
1.5m    200     300m        0.03s  10837.99     145.12       0.01
2m      200     400m        0.04s  10204.87     205.51       0.00
3m      200     600m        0.06s  10522.41     298.95       0.00
4m      200     800m        0.08s  10826.94     387.39       0.00
6m      200     1.1g        0.11s  11488.20     547.65       0.00
8m      200     1.5g        0.16s  10504.60     798.57       0.00
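
The conversion from the MB/sec column to Gbps is simple arithmetic, sketched below. It assumes that 1 MB means 10^6 bytes and that a size like 768k means 768 × 1024 bytes; both assumptions are consistent with the table, since bytes divided by usec/xfer reproduces the MB/sec column. The function names are made up for this example.

```python
LINK_GBPS = 100.0  # link speed reported by fi_info (100000000000 bit/s)


def mb_per_sec_to_gbps(mb_per_sec: float) -> float:
    """Convert MB/sec (1 MB = 10**6 bytes) to Gbit/s."""
    return mb_per_sec * 8 / 1000


def mb_per_sec_from_row(size_bytes: int, usec_per_xfer: float) -> float:
    """Bytes per microsecond is the same number as MB/sec (10**6 bytes)."""
    return size_bytes / usec_per_xfer


# Peak row of the table: 768k bytes, 66.40 usec/xfer, 11843.86 MB/sec.
peak_mb_per_sec = 11843.86
peak_gbps = mb_per_sec_to_gbps(peak_mb_per_sec)
utilization = peak_gbps / LINK_GBPS

# Cross-check the MB/sec column against the size and usec/xfer columns.
recomputed = mb_per_sec_from_row(768 * 1024, 66.40)

print(f"{peak_gbps:.3f} Gbps, {utilization:.1%} of the link")
print(f"recomputed from the row: {recomputed:.2f} MB/sec")
```

The small difference between the recomputed value and the table comes from usec/xfer being rounded to two decimal places.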

Now that we’ve verified that both hardware and software are ready, in the next chapter we can start writing code.