-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial implementation of C api #915
Conversation
bc232be
to
f621b2a
Compare
4dbe2da
to
cf69f20
Compare
3dab797
to
74a5cc0
Compare
} catch (std::exception & e) { \ | ||
std::cerr << "Aborting due to exception: " << e.what() << std::endl; \ | ||
SWSS_LOG_ERROR("Aborting due to exception: %s", e.what()); \ | ||
std::abort(); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this is not possible. It is UB to allow a c++ exception to unwind into rust, so there are two options: 1. catch exceptions and convert into normal data types which can be returned to rust to recover from, or 2. abort on exception. Option 1 was decided against because all thrown exceptions are basically fatal (out of memory, invalid config file, etc). Option 2 also gives us a core dump to debug against. If we were to ignore the exceptions, rust would abort automatically anyway but with a bad core dump.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation!
all thrown exceptions are basically fatal -> I do not agree. Some exceptions are related to runtime env for example a redis server could not respond right now but may respond after retry. ref:
sonic-swss-common/common/dbconnector.cpp
Line 564 in 0044540
throw system_error(make_error_code(errc::address_not_available), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I understand and that is a good point, but now we have to convert errors into an ffi-safe behavior. This leaves us with a couple of non-ideal options.
Option 1:
Enumerate all non-fatal exceptions and code those into the signature of the C apis.
Eg: we would change SWSSDBConnector_new_tcp to something like this
// If this function returns null, db did not respond, but it might if you retry
SWSSDBConnector SWSSDBConnector_new_tcp(int32_t dbId, const char *hostname, uint16_t port, uint32_t timeout) {
SWSSTry({
try {
return (SWSSDBConnector) new DBConnector(dbId, string(hostname), port, timeout);
} catch (system_error &e) {
// we know this error is nonfatal so we hard coded it in here
return nullptr;
}
// any other error will still abort() because any other error must be fatal
});
}
This is not ideal because now the C api becomes logically coupled to the implementation of the underlying function. If another nonfatal exception is added to DBConnector, they also have to change this C function to properly return null. We will have to do this for every possible nonfatal exception that one of the C functions might cause.
This may be OK if there are very few nonfatal exceptions, we expect to practically never add more, and if the C api stays very small. Otherwise, anyone who works on swsscommon now also has to understand these implications on the C api.
Option 2 simple:
Introduce a generic error interface to the C api. This is a very bad way of doing it, but I mention it because this is how I previously implemented the C api. We store errors as a global error string, and SWSSTry + code using it looks something like this:
#include <cstring> // for strdup
#include <cstdlib> // for free
const char *globalErrorString = nullptr;
extern "C" { const char *SWSSGetError(void) { return globalErrorString; } }
#define SWSSTry(failure_value, ...) \
try { \
__VA_ARGS__ \
; \
} catch (std::exception &e) { \
free(globalErrorString); \
globalErrorString = strdup(e.what().c_str()); \
return failure_value; \
}
SWSSDBConnector SWSSDBConnector_new_tcp(int32_t dbId, const char *hostname, uint16_t port, uint32_t timeout) {
SWSSTry(nullptr, {
// upon any exception, returns nullptr and globalErrorString is set to e.what()
return (SWSSDBConnector) new DBConnector(dbId, string(hostname), port, timeout)
});
}
On the rust (or any other language) side:
fn make_a_dbconnector() {
loop {
let dbconnector = unsafe { SWSSDBConnector_new_tcp(...) };
if dbconnector.is_null() {
// If any error occurred fatal or nonfatal, all we get is a descriptive string
let error_str = unsafe { str::from_utf8_unchecked(SWSSGetError()) };
// Magic strings!! yuck!!
if error_str.contains("address not available") {
// nonfatal error - retry later
sleep_10_seconds();
} else {
panic!("fatal error: {error_str}");
}
} else {
return dbconnector;
}
}
}
I originally wrote the api this way, but it was nixed by Riff (#915 (comment)) because we are losing the core dump/stack trace for fatal errors. Another problem is that all we get is a string - if we want to catch specific exceptions, we have to know what e.what()
will return. This is really bad because foreign code is logically coupled to the exact exception messages in swss-common. If somebody fixes a typo in an exception message, that the magic string in foreign code is now wrong!
Option 2 advanced:
Here's another approach to a generic error interface that may be less bad. Introduce a generic try/catch to the C interface, like so:
bool globalDisableAborts = false;
const char *globalErrorString = nullptr;
extern "C" {
void SWSSTry_(void) { free(globalErrorString); globalErrorString = nullptr; globalDisableAborts = true; }
const char *SWSSCatch(void) { globalDisableAborts = false; return globalErrorString; }
}
#define SWSSTry(failure_value, ...) \
try { \
__VA_ARGS__ \
; \
} catch (std::exception &e) { \
if (globalDisableAborts) {
free(globalErrorString); \
globalErrorString = strdup(e.what().c_str()); \
return failure_value; \
} else { \
SWSS_LOG_ERROR(... e.what() ...); \
abort(); \
} \
}
```rs
fn make_a_dbconnector_retrying() {
// We will only try 10 times before performing an aborting call
for tries in 0..10 {
// begin a pseudo try block
unsafe { SWSSTry_() };
let dbconnector = unsafe { SWSSDBConnector_new_tcp(...) };
// end the pseudo try block, get the error if one happened.
let err = unsafe { SWSSCatch() };
if err.is_null() {
// happy path: error is null so there was no error - dbconnector is valid
return dbconnector;
} else {
// This might've been a fatal or nonfatal error but we aren't checking with a magic string.
// We don't need to check because we are limiting our number of tries.
let error_str = unsafe { str::from_utf8_unchecked(SWSSGetError()) };
println!("error: {error_str}, retrying");
sleep_10_seconds();
}
}
// We have tried 10 times - it's probably a fatal error, so let's bite the bullet and do an aborting call
// Since we did not use SWSSTry_, any exception will still abort. Just like current behavior.
// We will get a nice core dump because abort() was called;
let dbconnector = unsafe { SWSSDBConnector_new_tcp(...) };
return dbconnector;
}
This is the best idea IMO because we can choose where to abort and where not to. Using SWSSTry_
and SWSSCatch
we can retry something a few times before electing to lay down our cards and get a proper core dump. In simple cases, where any error is fatal, we can simply ignore SWSSTry_
and SWSSCatch
, and use aborting behavior on the first try.
Final thoughts
If there are very very few nonfatal errors in code that the C api uses, and we can find all of them and hard code their nonfatality into the C api functions, then we should do option 1.
If the C api is going to grow a lot later, or swss-common might change what errors are nonfatal, or we really want the C api to be very simple, or we want to allow people working on swss-common to ignore the C api, or we want to give generic error handling decisions to downstream foreign code, I think we should do option 2 advanced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with this it still creates challenge in the Humberger stack, where c++ -> rust -> c++, because with return code approach, every function in rust need to understand the error code and passing it to the upper layer. This limits the error handling in rust to only Anyhow-ish crates with downcast...
This is a problem that me and Oliver struggled for days on finding the solutions, but so far there is no obvious/quick answer to it. So I have discussed this issue with Qi offline. We will need to fix this issue, but since it is only impacting the services that uses rust in the exception handling path, we are going to create a backlog to get this fixed, and get the current PR merged to unblock our immediate project progress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this issue is fundamentally caused by rust not having try-catch support, and we are trying to recreate the try-catch support for the language, which will take time to get it right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Work Item created here: MSADO:29998324 and assigned to Oliver at this moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this issue is fundamentally caused by rust not having try-catch support, and we are trying to recreate the try-catch support for the language, which will take time to get it right.
Rust panics use the same mechanism as c++ exceptions - it really does have try/catch, but the issue is that no C ABIs permit unwinding. Technically it would even be incorrect if we called these C api functions from within another C++ program, if we crossed a dynamic-linking boundary (i.e. the "hamburger stack" would look like C++ -> C -> C++). This is because internal ABI details are negotiated per linkage unit (unless magic is applied like rust's C-unwind). If we had a C++ function that called a C function, it's valid to optimize out any unwind-catching code because, by spec, no C function can ever unwind. Rust is also making this assumption - that no C function can ever unwind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I think we should either implement the "option 2 advanced" thing I wrote, or just ignore this for now until catching specific errors becomes necessary in hamgrd or other services.
@erer1243 In future could you reduce the force merge during PR iteration? We are using tool to compare between commits to understand what is newly added, and it improve code review efficiency. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the backlog created and sync'ed with Qi, approving this PR, so we can get it merged.
Sorry, forgot to include the c api stuff in the deb packages :) |
Implement rust wrapper of the new C api on libswsscommon Related: sonic-net/sonic-swss-common#915 --------- Co-authored-by: erer1243 <erer1243@users.noreply.github.com>
sync'ed with Qi. all comments are resolved now, hence merging it. |
Implement a C interface to some of libswsscommon in support of sonic-dash-ha.
Related:
sonic-net/sonic-dash-ha#6
#921
Incoming follow up PR:
erer1243#1