Release Notes v2.2.0
Summary
This version brings bug fixes and updates to our v2.0.0 release.
New features
- [NA OFI]
  - Choose addr format dynamically based on user preferences
  - Add support for IPv6
  - Add support for FI_SOCKADDR_IB
  - Add support for FI_ADDR_STR and shm provider
  - Add support for FI_ADDR_OPX and opx provider
  - Add support for HPE cxi provider, init info format for cxi is: NIC:PID (both or only one may be passed), NIC is cxi[0-9], PID is [0-510]
  - Use hwloc to select interface to use if NIC information is available (only supported by cxi at the moment)
  - Support device memory types and FI_HMEM for verbs and cxi providers
  - Add support for FI_THREAD_DOMAIN
    - Passing NA_THREAD_MODE_SINGLE will relax default FI_THREAD_SAFE thread mode and use FI_THREAD_DOMAIN instead.
  - Update min required version to libfabric 1.9
  - Improve debug output to print verbose FI info of selected provider
- [NA UCX]
  - Use active messaging UCP_FEATURE_AM for unexpected messages (only); this allows for removal of address resolution and retry on first message to exchange connection IDs
  - Turn on mempool by default
  - Support device memory types
  - Bump min required version to 1.10
- [NA PSM]
  - Add mercury NA plugin for the qlogic/intel PSM interface
  - Also support PSM2 (Intel OmniPath) through the PSM NA plugin
- [NA SM]
  - Add support for 0-size messages
- [NA]
  - Add na_addr_format init info
  - Add request_mem_device init info when GPU support is requested
  - Update NA_Mem_register() API call to support memory types (e.g., CUDA, ROCm, ZE) and device IDs
  - Add na_loc module for hwloc detection
  - Remove na_uint, na_int, na_bool_t and na_size_t types
  - Use separate versioning for library and update to v3.0.0
- [NA IP]
  - Refactor na_ip_check_interface() to only use getaddrinfo() and getifaddrs()
  - Add family argument to force detection of IPv4/IPv6 addresses
  - Add ip debug log
- [NA Test]
  - Introduce new perf tests to measure msg latency and put / get bandwidth. These benchmarks produce results that are comparable with OSU benchmarks.
- [HG util]
  - Add mercury_byteswap.h for bswap macros
  - Add mercury_inet.h for htonll and ntohll routines
  - Add mercury_param.h to use sys/param.h or MIN/MAX macros etc
  - Add alternative log names: err, warn, trace, dbg
  - Use separate versioning for library and update to v3.0.0
- [HG bulk]
  - Add support for memory attributes through a new HG_Bulk_create_attr() routine (support CUDA, ROCm, ZE)
- [HG]
  - Remove MERCURY_ENABLE_STATS CMake option and use 'diag' log subsys instead
    - Modify behavior of stats field to turn on diagnostics
    - Refactor existing counters (used only if debug is on)
  - Add checksum levels that can be manually controlled at runtime (disabled by default, HG_CHECKSUM_NONE level)
  - Update to mchecksum v2.0
  - Add HG_Set_log_func() and HG_Set_log_stream() to control log output
- [HG hl]
  - The deprecated mercury high-level library and high-level macros have now been removed.
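
The cxi init info format listed above (NIC:PID, with either part optional) can be illustrated with a small standalone parser. This is a minimal sketch and not Mercury's own parsing code: the `struct cxi_info` type and the `parse_cxi_info()` function are hypothetical names, and writing a lone PID with a leading colon (`":42"`) is an assumption, since the notes only state that both or only one part may be passed.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the parsed string (not a Mercury type). */
struct cxi_info {
    bool has_nic; /* true if a NIC name such as "cxi0" was given */
    int nic;      /* NIC index, 0-9 */
    bool has_pid; /* true if a PID was given */
    int pid;      /* PID, 0-510 */
};

/* Parse "NIC:PID", "NIC", or ":PID" (the last form is an assumption).
 * Returns 0 on success, -1 if the string does not match the format. */
int
parse_cxi_info(const char *str, struct cxi_info *out)
{
    const char *colon, *pid_str = NULL;
    size_t nic_len;

    if (str == NULL || out == NULL)
        return -1;
    memset(out, 0, sizeof(*out));

    /* Split at the first colon, if there is one. */
    colon = strchr(str, ':');
    nic_len = (colon != NULL) ? (size_t) (colon - str) : strlen(str);
    if (colon != NULL)
        pid_str = colon + 1;

    if (nic_len > 0) {
        /* NIC must be exactly "cxi" followed by a single digit. */
        if (nic_len != 4 || strncmp(str, "cxi", 3) != 0 ||
            !isdigit((unsigned char) str[3]))
            return -1;
        out->has_nic = true;
        out->nic = str[3] - '0';
    }

    if (pid_str != NULL && *pid_str != '\0') {
        char *end;
        long pid;

        /* Reject signs and leading whitespace that strtol() would accept. */
        if (!isdigit((unsigned char) *pid_str))
            return -1;
        pid = strtol(pid_str, &end, 10);
        if (*end != '\0' || pid < 0 || pid > 510)
            return -1;
        out->has_pid = true;
        out->pid = (int) pid;
    }

    /* At least one of the two parts must be present. */
    return (out->has_nic || out->has_pid) ? 0 : -1;
}
```

With this sketch, `"cxi0:510"` yields NIC 0 and PID 510, `"cxi3"` yields only a NIC, and `"cxi0:511"` or `"eth0:1"` are rejected.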
Bug fixes
- [NA OFI]
  - Switch tcp provider to FI_PROGRESS_MANUAL
  - Prevent empty authorization keys from being passed
  - Check max MR key used when FI_MR_PROV_KEY is not set
  - New implementation of address management
  - Fix duplicate addresses on multithreaded lookups
  - Redefine address keys and raw addresses to prevent allocations
  - Use FI addr map to lookup by FI addr
  - Improve serialization and deserialization of addresses
  - Fix provider table and use EP proto
  - Refactor and clean up plugin initialization
  - Clean up ip and domain checking
  - Ensure interface name is not used as domain name for verbs etc
  - Use NA IP module and add missing NA_OFI_VERIFY_PROV_DOM for tcp provider
  - Rework handling of fi_info to open fabric/domain/endpoint
  - Separate fabric from domain and keep single domain per NA class
  - Refactor handling of scalable vs standard endpoints
  - Improve handling of retries after FI_EAGAIN return code
    - Abort retried ops after default 90s timeout
    - Abort ops to a target being retried after first NA_HOSTUNREACH error in CQ
- [NA UCX]
  - Fix potential error not returned correctly on conn_insert()
  - Fix potential double free of worker_addr
  - Remove use of unified mode
  - Ensure address key is correctly reset
  - Fix hostname / net device parsing to allow for multiple net devices
- [HG util]
  - Round up ms time conversion so that small timeouts do not result in a busy spin
  - Use sched_yield() instead of deprecated pthread_yield()
  - Fix 'none' log level not recognized
  - Fix external logging facility
  - Let mercury log print counters on exit when debug outlet is on
- [HG proc]
  - Prevent call to save_ptr()/restore_ptr() during HG_FREE
- [HG Bulk]
  - Remove some NA_CANCELED event warnings.
- [HG]
  - Properly handle error when overflow bulk transfer is interrupted. Previously the RPC callback was triggered regardless, potentially causing issues.
- [CMake]
  - Correctly set INSTALL_RPATH for target libraries
  - Split mercury.pc pkg config file into multiple .pc files for mercury_util and na to prevent overlinking against those libraries when using pkg config.
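
The ms rounding fix in the HG util entry above can be illustrated as follows. This is a minimal sketch and not Mercury's code: the function names are hypothetical, and a timeout held in microseconds is an assumption made for the example. With truncating division, any timeout below 1 ms becomes 0 ms, so a polling call that receives it returns immediately and the caller spins; rounding up keeps a non-zero timeout non-zero.

```c
#include <stdint.h>

/* Truncating conversion: 500 us becomes 0 ms, i.e. "do not wait". */
uint64_t
us_to_ms_truncate(uint64_t us)
{
    return us / 1000;
}

/* Rounding-up conversion: any non-zero remainder adds one millisecond.
 * Written without "us + 999" so that it cannot overflow. */
uint64_t
us_to_ms_round_up(uint64_t us)
{
    return us / 1000 + (us % 1000 != 0);
}
```

A zero timeout still maps to zero, so callers that ask for a non-blocking poll keep that behavior.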
Known Issues
- [NA OFI]
  - [tcp/verbs;ofi_rxm] Using more than 256 peers requires FI_UNIVERSE_SIZE to be set.
- [NA UCX]
  - NA_Addr_to_string() cannot be used on non-listening processes to convert a self-address to a string.
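
For the FI_UNIVERSE_SIZE known issue above, the variable is normally exported in the job environment. When that is not practical, it can also be set from within the process, as in the sketch below. The helper name `set_fi_universe_size()` is hypothetical, and the sketch assumes the call is made before Mercury (and therefore libfabric) is initialized, so that libfabric sees the value when it reads its environment.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: export FI_UNIVERSE_SIZE for the current process.
 * Assumed to be called before any Mercury / libfabric initialization.
 * Returns 0 on success, -1 on failure. */
int
set_fi_universe_size(unsigned int expected_peers)
{
    char value[16];
    int n = snprintf(value, sizeof(value), "%u", expected_peers);

    if (n < 0 || (size_t) n >= sizeof(value))
        return -1;

    /* overwrite = 0: a value already exported by the user takes precedence. */
    return setenv("FI_UNIVERSE_SIZE", value, 0);
}
```

Passing 0 as the last argument of `setenv()` is a design choice for this sketch: it leaves any value set by the job launcher untouched.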