
Diagnosing Performance

Measuring performance at both the Network Abstraction (NA) level and the Mercury (RPC) level is a good way to verify that the network is properly configured and that the expected performance can be achieved. For that purpose, Mercury comes with a set of performance measurement utilities.

Note

To ensure that the perf utilities are compiled, BUILD_TESTING and BUILD_TESTING_PERF must be turned ON in the CMake options. Additionally, to enable MPI support in the HG perf tests, MERCURY_TESTING_ENABLE_PARALLEL must also be set to ON. The performance utilities are placed in the bin directory relative to the install prefix.
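
For example, assuming a typical out-of-source build, these options can be passed at configure time:

cmake -DBUILD_TESTING=ON -DBUILD_TESTING_PERF=ON -DMERCURY_TESTING_ENABLE_PARALLEL=ON ..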

NA Perf

NA perf tests are composed of three clients (na_lat, na_bw_get, and na_bw_put) and one server (na_perf_server). For all of the tests, na_perf_server must first be launched:

na_perf_server -c ofi -p tcp -b

This generates a port.cfg file in the current working directory. The client uses this file to contact the server, so it must be present in the working directory from which the client is launched.

To measure latency of the network, run:

na_lat -c ofi -p tcp -b -l 1000

To measure bandwidth, run:

na_bw_get -c ofi -p tcp -b -l 1000
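
To measure bandwidth in the put direction, na_bw_put follows the same usage:

na_bw_put -c ofi -p tcp -b -l 1000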

HG Perf

HG perf tests are composed of three clients (hg_rate, hg_bw_read, and hg_bw_write) and one server (hg_perf_server). For all of the tests, hg_perf_server must first be launched:

hg_perf_server -c ofi -p tcp -b

This generates a port.cfg file in the current working directory. The client uses this file to contact the server, so it must be present in the working directory from which the client is launched.

To measure RPC rate, run:

hg_rate -c ofi -p tcp -b -l 1000

To measure bulk RPC bandwidth, run:

hg_bw_write -c ofi -p tcp -b -l 1000
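
Similarly, hg_bw_read measures bulk read bandwidth:

hg_bw_read -c ofi -p tcp -b -l 1000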

Example: DAOS Use Case

Most often it is desirable to replicate a specific workflow or communication pattern. In the case of a storage system like DAOS, which follows a client-server model, there are two levels of parallelism: servers are both multithreaded and distributed over multiple nodes, while clients may also use threads but are in most cases simply distributed over multiple nodes. The Mercury performance benchmarks (compiled with parallel mode enabled) can be used to replicate this use case. To simulate multithreaded servers running on multiple nodes, hg_perf_server can be launched with the following parameters:

mpirun -np 2 --bind-to numa --map-by numa hg_perf_server -c ofi -p tcp -b -C 16

-C controls the number of Mercury classes/contexts per process (equivalent to the number of DAOS storage targets), while -np controls the number of processes (equivalent to the number of DAOS engines). Since classes/contexts operate independently, there is no restriction on either of these values; as a general rule, however, it is best to restrict -C to the number of available cores per NUMA node (which is also why tasks are mapped by and bound to NUMA nodes in this case).
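
On Linux, a quick way to check how many cores each NUMA node provides (assuming lscpu is available) is:

lscpu | grep -i numa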

Similarly, to launch multiple clients, one would run:

mpirun -hostfile hosts.txt --bind-to hwthread --map-by numa hg_bw_write -c ofi -p tcp -b -l 100
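
hosts.txt is a standard mpirun hostfile; a minimal example (node names and slot counts here are purely illustrative) could look like:

node01 slots=16
node02 slots=16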

The resulting bandwidth in that case is the aggregate bandwidth (or the aggregate RPC rate when using hg_rate). When not using threads, it is best to bind the clients to hardware threads or cores to ensure that tasks are not migrated and that performance remains consistent.

Tip

  • -x can be used to control the number of RPCs in flight
  • -b forces busy spinning and prevents any wait/sleep
  • -w changes the RMA window size (the number of RMAs in flight); increasing it helps saturate bandwidth, but be wary of memory usage
  • -y / -z control the min/max buffer size; setting both to the same value allows testing a specific buffer size

See --help for additional options.
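
For example, to measure bandwidth for a fixed 1 MiB buffer size with 16 RPCs in flight (flag values here are purely illustrative):

hg_bw_write -c ofi -p tcp -b -l 1000 -x 16 -y 1048576 -z 1048576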

