Libfabric Plugin

Note

The implementation notes on this page are provided for reference and may be slightly out of date. If you are only looking for user documentation, please refer to the NA plugin documentation instead.

OFI Capabilities

The OFI plugin makes use of the following libfabric capabilities (a sketch of a corresponding fi_getinfo() query follows the list):

  • FI_EP_RDM
  • FI_DIRECTED_RECV
  • FI_READ
  • FI_RECV
  • FI_REMOTE_READ
  • FI_REMOTE_WRITE
  • FI_RMA
  • FI_SEND
  • FI_TAGGED
  • FI_WRITE
  • FI_SOURCE
  • FI_SOURCE_ERR
  • FI_ASYNC_IOV
  • FI_CONTEXT
  • FI_MR_BASIC
  • Scalable endpoints
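
As an illustration of how this capability set maps onto the libfabric API, here is a minimal sketch of an fi_getinfo() query requesting roughly the hints listed above. It is illustrative only, not the plugin's actual initialization code, and error handling is kept minimal:

```c
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints, *info;

    hints = fi_allocinfo();
    if (!hints)
        return 1;

    /* Reliable datagram endpoint with the capability bits listed above */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_TAGGED | FI_RMA | FI_SEND | FI_RECV |
                  FI_READ | FI_WRITE | FI_REMOTE_READ | FI_REMOTE_WRITE |
                  FI_SOURCE | FI_SOURCE_ERR | FI_DIRECTED_RECV;
    hints->mode = FI_CONTEXT | FI_ASYNC_IOV;
    hints->domain_attr->mr_mode = FI_MR_BASIC;

    int rc = fi_getinfo(FI_VERSION(1, 7), NULL, NULL, 0, hints, &info);
    if (rc) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-rc));
        fi_freeinfo(hints);
        return 1;
    }

    /* List the providers that matched the requested capabilities */
    for (struct fi_info *cur = info; cur; cur = cur->next)
        printf("provider: %s, fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```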

Feature Matrix

  • Supported ✔
  • Limited support or emulated ❗ (see footnote)
  • Not supported ❌
| Feature            | tcp | verbs | psm2 | gni |
| ------------------ | --- | ----- | ---- | --- |
| Source Addressing  | ❗1  | ❗1    | ✔    | ✔   |
| Manual Progress    | ❗2  | ✔     | ❗2   | ❗2  |
| FI_WAIT_FD         | ✔   | ✔     | ✔    | ❌  |
| Scalable Endpoints | ❗3  | ❌    | ✔    | ✔   |

1 Emulated: source address is encoded in the message header.

2 Emulated: the provider uses a thread to drive progress in the background.

3 Emulated: provider resources are shared.
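
Where FI_WAIT_FD is supported, a completion queue can expose a file descriptor that an application blocks on instead of busy polling. The sketch below shows one way this could be done; wait_on_cq_fd() is a hypothetical helper, and it assumes a domain already opened on a provider with FI_WAIT_FD support (tcp, verbs, or psm2 in the matrix above):

```c
#include <poll.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Hypothetical helper: open a CQ backed by a file descriptor and block
 * on it with poll() until a completion may be ready. */
static int wait_on_cq_fd(struct fid_domain *domain)
{
    struct fi_cq_attr cq_attr = {
        .format   = FI_CQ_FORMAT_TAGGED,
        .wait_obj = FI_WAIT_FD,   /* request an fd-backed wait object */
    };
    struct fid_cq *cq;
    int fd, rc;

    rc = fi_cq_open(domain, &cq_attr, &cq, NULL);
    if (rc)
        return rc;

    /* Retrieve the file descriptor behind the CQ's wait object */
    rc = fi_control(&cq->fid, FI_GETWAIT, &fd);
    if (rc == 0) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, 1000); /* wake up on activity or after 1 s */
    }

    fi_close(&cq->fid);
    return rc;
}
```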

Performance

Below is a performance comparison of the libfabric plugin across the available libfabric providers, using both the wait and busy-spin progress mechanisms, with Mercury v1.0.0 and libfabric v1.7.0a1.
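
For context, the provider under test is selected through Mercury's NA init string, and the wait vs. busy-spin behavior corresponds to the timeout passed to HG_Progress(). The sketch below illustrates this; the address string and timeout value are placeholders, not the exact benchmark settings:

```c
#include <stdio.h>
#include <mercury.h>

int main(void)
{
    /* The NA init string selects the libfabric provider; "ofi+tcp" is the
     * baseline in the comparisons below. "ofi+verbs", "ofi+psm2" and
     * "ofi+gni" select the other providers. */
    hg_class_t *hg_class = HG_Init("ofi+tcp://localhost:4444", HG_TRUE);
    if (!hg_class) {
        fprintf(stderr, "HG_Init() failed\n");
        return 1;
    }
    hg_context_t *context = HG_Context_create(hg_class);

    /* Busy spin: zero timeout, poll in a tight loop.
     * Wait: non-zero timeout, let the NA layer block on the provider's
     * wait object (e.g. FI_WAIT_FD) until something completes. */
    unsigned int timeout_ms = 0; /* 0 = busy spin, >0 = wait */
    (void) HG_Progress(context, timeout_ms);

    HG_Context_destroy(context);
    HG_Finalize(hg_class);
    return 0;
}
```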

InfiniBand verbs;rxm

Performance is measured between two nodes of ALCF's Cooley cluster using the FDR InfiniBand interface mlx5_0. The following plot shows the average RPC time relative to ofi+tcp with a single RPC in flight:

[Figure: RPC time, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: RPC time, 16 RPCs in flight]

The following plot shows the performance of RPCs with pull bulk transfers, relative to ofi+tcp, at various transfer sizes:

[Figure: Write bandwidth, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: Write bandwidth, 16 RPCs in flight]

The following plot shows the performance of RPCs with push bulk transfers, relative to ofi+tcp, at various transfer sizes:

[Figure: Read bandwidth, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: Read bandwidth, 16 RPCs in flight]

Omni-Path psm2

Performance is measured between two nodes of LCRC's Bebop cluster using the Omni-Path interface with PSM2 v11.2.23. The following plot shows the average RPC time relative to ofi+tcp with a single RPC in flight:

[Figure: RPC time, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: RPC time, 16 RPCs in flight]

The following plot shows the performance of RPCs with pull bulk transfers, relative to ofi+tcp, at various transfer sizes:

[Figure: Write bandwidth, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: Write bandwidth, 16 RPCs in flight]

The following plot shows the performance of RPCs with push bulk transfers, relative to ofi+tcp, at various transfer sizes:

[Figure: Read bandwidth, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: Read bandwidth, 16 RPCs in flight]

Aries gni

Performance is measured between two Haswell nodes (debug queue in exclusive mode) of NERSC's Cori system using the ipogif0 interface with uGNI v6.0.14.0. The following plot shows the average RPC time relative to ofi+tcp with a single RPC in flight:

[Figure: RPC time, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: RPC time, 16 RPCs in flight]

The following plot shows the performance of RPCs with pull bulk transfers, relative to ofi+tcp, at various transfer sizes:

[Figure: Write bandwidth, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: Write bandwidth, 16 RPCs in flight]

The following plot shows the performance of RPCs with push bulk transfers, relative to ofi+tcp, at various transfer sizes:

[Figure: Read bandwidth, 1 RPC in flight]

The same plot with 16 RPCs in flight:

[Figure: Read bandwidth, 16 RPCs in flight]
