-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-14532 gurt: Replace environment APIs hook #13483
Conversation
Add thread safe environment variables managment functions. Remove and replace non thread safe custom environment management function. Replace setenv() function with d_setenv() Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Bug-tracker data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
src/gurt/misc.c
Outdated
if (!dis_unsigned_str(env)) { | ||
rc = -DER_INVAL; | ||
goto out; | ||
} | ||
|
||
env_val = getenv(env); | ||
if (!env_val) | ||
return; | ||
errno = 0; | ||
val = strtoul(env, &endptr, 0); | ||
if (errno != 0 || endptr == env || *endptr != '\0') { | ||
rc = -DER_INVAL; | ||
goto out; | ||
} | ||
|
||
if (!dis_integer_str(env_val)) { | ||
D_ERROR("ENV %s is not integer.\n", env_val); | ||
return; | ||
#if UINT_MAX != ULONG_MAX | ||
assert(sizeof(unsigned) < sizeof(unsigned long)); | ||
if (val > UINT_MAX) { | ||
rc = -DER_INVAL; | ||
goto out; | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[suggestion] i think for somehow dis_unsigned_str() is duplicated as strtoull()?
we could remove dis_unsigned_str() and just use strtoull()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best solution from my point of view should be to remove all the unsigned
and to replace them with uint32_t
or uint64_t
. However, this is beyond the scope of this PR.
For this PR, I would prefer to keep these three functions as the sizeof()
of the unsigned
is platform dependent: I would like to avoid any unexpected overflow. In a follow-up PR, I could replace the d_getenv_uint()
with relevant d_getenv_uintXXt()
and update the code accordingly.
To avoid duplicate code, I could factorize the code into d_getenv_uint64_t()
function and just check the overflow in d_getenv_uint32t()
and d_getenv_uint()
functions.
@wangshilong, is this last solution seems acceptable for you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with this, i thought dis_unsigned_str() is unnecessary and might be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Factorize unsigned fonctions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[suggestion] i think for somehow dis_unsigned_str() is duplicated as strtoull()? we could remove dis_unsigned_str() and just use strtoull()?
My apologies @wangshilong , I did not well understood your question regarding the function dis_unsigned_str()
. At the beginning, I also wanted to remove it. However, strtoull()
is less restrictive and allow to have spaces before the first digit and even to have a negative integer. To not change the behavior of the d_getenv_uint()
functions, I decided to kept the dis_unsigned_str()
function.
Is this OK for you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code of the getenv of unsigned integer function has been factorized in commit 193c24c
src/gurt/misc.c
Outdated
{ | ||
int rc; | ||
|
||
pthread_rwlock_wrlock(&d_env_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate why this is safer than the hook version? I think the issue we have seen is due to threads calling getenv in other libraries, which in this case would not be locked with respect to calls in daos libraries. Whereas the hooked version would ensure the lock is acquired for all calls, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hook version can have some side effect with other third party libraries such as mvapich and eventually lead to segfault. You could find in the following comment, a detailed description of such scenario https://daosio.atlassian.net/browse/DAOS-14532?focusedCommentId=121770
I tried to overcome this issue with updating hooks or playing with library constructors. At the end, I did not find an acceptable solution which could not have some side effects with other third party libraries.
After some discussion with @johannlombardi and @gnailzenh , it appeared that the least bad solution was to have our own thread safe functions for managing environment. From my side, the only disadvantage of this solution is to be much more intrusive in our code. Happily, we already have some dedicated non thread safe functions for managing the environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we see an intermittent issue with fio that manifests as follows
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007d9faabc195d in getenv () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7d9fad4335c0 (LWP 60560))]
Missing separate debuginfos, use: yum debuginfo-install boost-atomic-1.66.0-13.el8.x86_64 boost-chrono-1.66.0-13.el8.x86_64 boost-date-time-1.66.0-13.el8.x86_64 boost-iostreams-1.66.0-13.el8.x86_64 boost-program-options-1.66.0-13.el8.x86_64 boost-random-1.66.0-13.el8.x86_64 boost-regex-1.66.0-13.el8.x86_64 boost-system-1.66.0-13.el8.x86_64 boost-thread-1.66.0-13.el8.x86_64 daxctl-libs-71.1-4.el8.x86_64 glibc-2.28-225.el8.x86_64 libaio-0.3.112-1.el8.x86_64 libblkid-2.32.1-42.el8_8.x86_64 libgcc-8.5.0-18.el8.x86_64 libibverbs-44.0-2.el8.1.x86_64 libicu-60.3-2.el8_1.x86_64 libmount-2.32.1-42.el8_8.x86_64 libnl3-3.7.0-1.el8.x86_64 libpmem-1.6.1-1.el8.x86_64 libpmemblk-1.6.1-1.el8.x86_64 librados2-12.2.7-9.el8.x86_64 librbd1-12.2.7-9.el8.x86_64 librdmacm-44.0-2.el8.1.x86_64 libselinux-2.9-8.el8.x86_64 libstdc++-8.5.0-18.el8.x86_64 libyaml-0.1.7-5.el8.x86_64 lz4-libs-1.8.3-3.el8_4.x86_64 ndctl-libs-71.1-4.el8.x86_64 nspr-4.34.0-3.el8_6.x86_64 nss-util-3.79.0-11.el8_7.x86_64 numactl-libs-2.0.12-13.el8.x86_64 openssl-libs-1.1.1k-9.el8_7.x86_64 pcre2-10.32-3.el8_6.x86_64 protobuf-c-1.3.0-6.el8.x86_64 systemd-libs-239-74.el8_8.x86_64 zlib-1.2.11-21.el8_7.x86_64
(gdb) bt
#0 0x00007d9faabc195d in getenv () from /lib64/libc.so.6
#1 0x00005b1f2183098c in check_status_file () at stat.c:2372
#2 0x00005b1f21839c25 in check_for_running_stats () at stat.c:2411
#3 0x00005b1f2186cf51 in do_usleep (usecs=10000) at backend.c:2460
#4 run_threads (sk_out=sk_out@entry=0x0) at backend.c:2460
#5 0x00005b1f2186d5b2 in fio_backend (sk_out=0x0) at backend.c:2513
#6 0x00005b1f2181c595 in main (argc=16, argv=0x7ffe8dcfb228, envp=) at fio.c:60
It doesn't show dlopen in the callstack. Do you think it's related to this issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a high probably that it could be related: however it is hard for me to be 100% sure without a more deep investigation in the core file and the related code source.
As for @daltonbohning issue with mvapich, you could use the full patch #13250 to check if it also solves your issue with fio.
src/include/gurt/common.h
Outdated
void | ||
d_free_env(char **str_val); | ||
int | ||
d_getenv_bool(bool *bool_val, const char *env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changing the API and ABI isn't it? Existing applications which use this interface will need code changes to work, but if they build with one version and run with another they could also segfault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a similar thought but then kind of assumed that the only user of these APIs are DAOS libraries that will be shipped with the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the danger would be things like mpifileutils and hdf5 etc. At the very least we should bump the SO version with this changes, but that only applies to releases from "before" and "after" and I'd assume that 2.4 and 2.6 will have different SO numbers anyway - perhaps worth checking it though and bumping it with this change if master is still using the same as 2.4.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this time, the only user of this new APIs are indeed the DAOS libraries and thus should not have any side effects for the client applications linked with DAOS. They will just use the standard non-thread safe getenv()
functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the functions you've changed, not the ones you've added.
…/daos-14532-gurt Required-githooks: true
Integrate reviewers comments: - Factorize d_getenv_uint() functions. Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Fix clang format. Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-13483/3/display/redirect |
src/include/gurt/common.h
Outdated
void | ||
d_free_env(char **str_val); | ||
int | ||
d_getenv_bool(bool *bool_val, const char *env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the danger would be things like mpifileutils and hdf5 etc. At the very least we should bump the SO version with this changes, but that only applies to releases from "before" and "after" and I'd assume that 2.4 and 2.6 will have different SO numbers anyway - perhaps worth checking it though and bumping it with this change if master is still using the same as 2.4.
src/include/gurt/common.h
Outdated
void | ||
d_free_env(char **str_val); | ||
int | ||
d_getenv_bool(bool *bool_val, const char *env); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the functions you've changed, not the ones you've added.
src/bio/bio_xstream.c
Outdated
d_getenv_bool("DAOS_NVME_AUTO_FAULTY_ENABLED", &glb_criteria.fc_enabled); | ||
d_getenv_int("DAOS_NVME_AUTO_FAULTY_IO", &glb_criteria.fc_max_io_errs); | ||
d_getenv_int("DAOS_NVME_AUTO_FAULTY_CSUM", &glb_criteria.fc_max_csum_errs); | ||
d_getenv_bool(&glb_criteria.fc_enabled, "DAOS_NVME_AUTO_FAULTY_ENABLED"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't change these APIs. MPI_IO uses d_getenv_bool so we need to maintain the old interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Revert interfaces update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By any chance, do you know if we have some functional tests using MPI_IO ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use Features: mpiio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past where this has come up then what functional testing we have uses a weird system where:
- PRs and landings to downstream components test against master daos.
- PRs and landings to daos test with the latest built downstream component.
The unfortunate implication here is that we have some testing but it's limited and if we change the API or ABI and it's incompatible then we don't notice until there's a new PR to the downstream component or it gets rebuilt for another reason. If we had test coverage here then it would fail at build time, as it is it'll fail (obscurely) at runtime, but only if the code is called.
It's an area we need to work on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Revert interfaces update
Fixed with commit ac764d1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing my +1 since the API change needs to be reverted.
@ashleypittman , my apologies for the misunderstanding. |
…/daos-14532-gurt Required-githooks: true
Integrate reviewers comments: - Revert d_getenv_XXX interfaces update - Rename d_free_env() -> d_free_env_str() Features: mpiio Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Revert useless header file changes. Features: mpiio Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
src/gurt/misc.c
Outdated
{ | ||
char *env; | ||
|
||
pthread_rwlock_rdlock(&d_env_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These will be coverity warnings unless you either check the return code or use our macros.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info.
Sadly, I will not be able to use the D_RWLOCK_RDLOCK macros as it use the DAOS logging facilities which could lead to infinite recursive calls. Thus I will check the return code is a similar way as this last macros.
- Checks return code when using standard function calls such as
pthread_rwlock_rdlock()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did wonder. Does the logging ever call getenv though?
I made the same change to the locking in the logging as that was causing recursion and found out about the coverity issue that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I remember, I experiment infinite recursion with the function d_logfac_is_enabled()
from src/gurt/dlog.c
. There is other functions from this file which could also probably lead to the same kind of issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info. Sadly, I will not be able to use the D_RWLOCK_RDLOCK macros as it use the DAOS logging facilities which could lead to infinite recursive calls. Thus I will check the return code is a similar way as this last macros.
- Checks return code when using standard function calls such as
pthread_rwlock_rdlock()
Fixed with commit 2791b93
Integrate reviwers comments: - Checks return code when using standard function calls such as pthread_rwlock_rdlock() Features: mpiio Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Fix clang format coding standards. Features: mpiio Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13483/8/execution/node/1453/log |
The only CI issue with the functional test erasurecode_online_rebuild_mdtest_py is a known issue DAOS-14845. |
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process. This PR mainly add thread safe environment variables management functions. It also remove and replace old non thread safe custom environment management functions. Finally, it replace the setenv() function with d_setenv(). Required-githooks: true Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process. This PR mainly add thread safe environment variables management functions. It also remove and replace old non thread safe custom environment management functions. Finally, it replace the setenv() function with d_setenv(). Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process. This PR mainly add thread safe environment variables management functions. It also remove and replace old non thread safe custom environment management functions. Finally, it replace the setenv() function with d_setenv(). Required-githooks: true Change-Id: Ife6690e2c63dd6c47279a2ac8c3c5a3da5cf8213 Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process. This PR mainly add thread safe environment variables management functions. It also remove and replace old non thread safe custom environment management functions. Finally, it replace the setenv() function with d_setenv(). Required-githooks: true Change-Id: Ife6690e2c63dd6c47279a2ac8c3c5a3da5cf8213 Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Description
This PR is a subset of the PR #13250 allowing thread safe management of environment variables: it has been split into smaller PRs to facilitate the review process.
This PR mainly add thread safe environment variables management functions.
It also remove and replace old non thread safe custom environment management functions.
Finally, it replace the
setenv()
function withd_setenv()
.Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: