# Libfabric Plugin
> **Note:** The implementation notes on this page are provided for reference and may be slightly out of date. If you are only looking for user documentation, please refer to the NA plugin documentation.
## OFI Capabilities

The OFI plugin makes use of the following libfabric capabilities:
- `FI_EP_RDM`
- `FI_DIRECTED_RECV`
- `FI_READ`
- `FI_RECV`
- `FI_REMOTE_READ`
- `FI_REMOTE_WRITE`
- `FI_RMA`
- `FI_SEND`
- `FI_TAGGED`
- `FI_WRITE`
- `FI_SOURCE`
- `FI_SOURCE_ERR`
- `FI_ASYNC_IOV`
- `FI_CONTEXT`
- `FI_MR_BASIC`
- Scalable endpoints
## Feature Matrix

- ✓ — Supported
- ✓ ⁿ — Limited support or emulated (see footnote)
- ✗ — Not supported

| Feature | tcp | verbs | psm2 | gni |
|---|---|---|---|---|
| Source Addressing | ✓ ¹ | ✓ ¹ | ✓ | ✓ |
| Manual Progress | ✓ ² | ✓ ² | ✓ ² | ✓ |
| FI_WAIT_FD | ✓ | ✗ | ✗ | ✗ |
| Scalable Endpoints | ✓ ³ | ✗ | ✗ | ✓ |
¹ Emulated: the source address is encoded in the message header.

² Emulated: the provider uses a thread to drive progress in the background.

³ Emulated: provider resources are shared.
## Performance

Below is a performance comparison of the libfabric plugin across the available libfabric providers, using both the wait and busy-spin mechanisms, with Mercury v1.0.0 and libfabric v1.7.0a1.
### InfiniBand verbs;rxm

Performance is measured between two nodes of ALCF's Cooley cluster using the FDR InfiniBand interface mlx5_0.

The following plot shows the average RPC time compared to ofi+tcp with a single RPC in flight:

Same plot, but with 16 RPCs in flight:

The following plot shows RPC performance with pull bulk transfers, compared to ofi+tcp, at various transfer sizes:

Same plot, but with 16 RPCs in flight:

The following plot shows RPC performance with push bulk transfers, compared to ofi+tcp, at various transfer sizes:

Same plot, but with 16 RPCs in flight:
### Omni-Path psm2

Performance is measured between two nodes of LCRC's Bebop cluster using the Omni-Path interface with PSM2 v11.2.23.

The following plot shows the average RPC time compared to ofi+tcp with a single RPC in flight:

Same plot, but with 16 RPCs in flight:

The following plot shows RPC performance with pull bulk transfers, compared to ofi+tcp, at various transfer sizes:

Same plot, but with 16 RPCs in flight:

The following plot shows RPC performance with push bulk transfers, compared to ofi+tcp, at various transfer sizes:

Same plot, but with 16 RPCs in flight:
### Aries gni

Performance is measured between two Haswell nodes (debug queue in exclusive mode) of NERSC's Cori system using the ipogif0 interface with uGNI v6.0.14.0.

The following plot shows the average RPC time compared to ofi+tcp with a single RPC in flight:

Same plot, but with 16 RPCs in flight:

The following plot shows RPC performance with pull bulk transfers, compared to ofi+tcp, at various transfer sizes:

Same plot, but with 16 RPCs in flight:

The following plot shows RPC performance with push bulk transfers, compared to ofi+tcp, at various transfer sizes:

Same plot, but with 16 RPCs in flight: