Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of C api #915

Merged
merged 9 commits into from
Oct 30, 2024
Merged

Conversation

erer1243
Copy link
Contributor

@erer1243 erer1243 commented Sep 12, 2024

Implement a C interface to some of libswsscommon in support of sonic-dash-ha.

Related:
sonic-net/sonic-dash-ha#6
#921

Incoming follow up PR:
erer1243#1

Copy link

linux-foundation-easycla bot commented Sep 12, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

common/c-api/util.h Show resolved Hide resolved
common/c-api/util.h Outdated Show resolved Hide resolved
common/Makefile.am Outdated Show resolved Hide resolved
@erer1243 erer1243 force-pushed the c-interface branch 3 times, most recently from 3dab797 to 74a5cc0 Compare September 30, 2024 17:23
r12f
r12f previously approved these changes Oct 1, 2024
common/table.h Outdated Show resolved Hide resolved
} catch (std::exception & e) { \
std::cerr << "Aborting due to exception: " << e.what() << std::endl; \
SWSS_LOG_ERROR("Aborting due to exception: %s", e.what()); \
std::abort(); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abort

If I understand correctly, all exception in this repo will abort an rust application. This is behavior change. Could you keep original behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this is not possible. It is UB to allow a c++ exception to unwind into rust, so there are two options: 1. catch exceptions and convert into normal data types which can be returned to rust to recover from, or 2. abort on exception. Option 1 was decided against because all thrown exceptions are basically fatal (out of memory, invalid config file, etc). Option 2 also gives us a core dump to debug against. If we were to ignore the exceptions, rust would abort automatically anyway but with a bad core dump.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation!

all thrown exceptions are basically fatal -> I do not agree. Some exceptions are related to runtime env for example a redis server could not respond right now but may respond after retry. ref:

throw system_error(make_error_code(errc::address_not_available),

Copy link
Contributor Author

@erer1243 erer1243 Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I understand and that is a good point, but now we have to convert errors into an ffi-safe behavior. This leaves us with a couple of non-ideal options.

Option 1:

Enumerate all non-fatal exceptions and code those into the signature of the C apis.
Eg: we would change SWSSDBConnector_new_tcp to something like this

// If this function returns null, db did not respond, but it might if you retry
SWSSDBConnector SWSSDBConnector_new_tcp(int32_t dbId, const char *hostname, uint16_t port, uint32_t timeout) {
    SWSSTry({
        try {
            return (SWSSDBConnector) new DBConnector(dbId, string(hostname), port, timeout);
        } catch (system_error &e) {
            // we know this error is nonfatal so we hard coded it in here
            return nullptr;
        }

        // any other error will still abort() because any other error must be fatal
    });
}

This is not ideal because now the C api becomes logically coupled to the implementation of the underlying function. If another nonfatal exception is added to DBConnector, they also have to change this C function to properly return null. We will have to do this for every possible nonfatal exception that one of the C functions might cause.

This may be OK if there are very few nonfatal exceptions, we expect to practically never add more, and if the C api stays very small. Otherwise, anyone who works on swsscommon now also has to understand these implications on the C api.

Option 2 simple:

Introduce a generic error interface to the C api. This is a very bad way of doing it, but I mention it because this is how I previously implemented the C api. We store errors as a global error string, and SWSSTry + code using it looks something like this:

#include <cstring> // for strdup
#include <cstdlib> // for free

const char *globalErrorString = nullptr;

extern "C" { const char *SWSSGetError(void) { return globalErrorString; } }

#define SWSSTry(failure_value, ...) \
    try { \
        __VA_ARGS__ \
        ; \
    } catch (std::exception &e) { \
        free(globalErrorString); \
        globalErrorString = strdup(e.what().c_str()); \
        return failure_value; \
    }

SWSSDBConnector SWSSDBConnector_new_tcp(int32_t dbId, const char *hostname, uint16_t port, uint32_t timeout) {
    SWSSTry(nullptr, {
            // upon any exception, returns nullptr and globalErrorString is set to e.what()
            return (SWSSDBConnector) new DBConnector(dbId, string(hostname), port, timeout)
    });
}

On the rust (or any other language) side:

fn make_a_dbconnector() {
    loop {
        let dbconnector = unsafe { SWSSDBConnector_new_tcp(...) };
        if dbconnector.is_null() {
            // If any error occurred fatal or nonfatal, all we get is a descriptive string
            let error_str =  unsafe { str::from_utf8_unchecked(SWSSGetError()) };
            
            // Magic strings!! yuck!!
            if error_str.contains("address not available") {
                // nonfatal error - retry later
               sleep_10_seconds();
            } else {
                panic!("fatal error: {error_str}");
            }
        } else {
            return dbconnector;
        }
    }
}

I originally wrote the api this way, but it was nixed by Riff (#915 (comment)) because we are losing the core dump/stack trace for fatal errors. Another problem is that all we get is a string - if we want to catch specific exceptions, we have to know what e.what() will return. This is really bad because foreign code is logically coupled to the exact exception messages in swss-common. If somebody fixes a typo in an exception message, that the magic string in foreign code is now wrong!

Option 2 advanced:

Here's another approach to a generic error interface that may be less bad. Introduce a generic try/catch to the C interface, like so:

bool globalDisableAborts = false;
const char *globalErrorString = nullptr;

extern "C" { 
    void SWSSTry_(void) { free(globalErrorString); globalErrorString = nullptr; globalDisableAborts = true; }
    const char *SWSSCatch(void) { globalDisableAborts = false; return globalErrorString; } 
}

#define SWSSTry(failure_value, ...) \
    try { \
        __VA_ARGS__ \
        ; \
    } catch (std::exception &e) { \
        if (globalDisableAborts) {
            free(globalErrorString); \
            globalErrorString = strdup(e.what().c_str()); \
            return failure_value; \
        } else { \
            SWSS_LOG_ERROR(... e.what() ...); \
            abort(); \
        } \
    }

```rs
fn make_a_dbconnector_retrying() {
    // We will only try 10 times before performing an aborting call
    for tries in 0..10 {
        // begin a pseudo try block
        unsafe { SWSSTry_() }; 
        let dbconnector = unsafe { SWSSDBConnector_new_tcp(...) };
        // end the pseudo try block, get the error if one happened.
        let err = unsafe { SWSSCatch() }; 
        if err.is_null() {
            // happy path: error is null so there was no error - dbconnector is valid
            return dbconnector;
        } else {
            // This might've been a fatal or nonfatal error but we aren't checking with a magic string. 
            // We don't need to check because we are limiting our number of tries.
            let error_str =  unsafe { str::from_utf8_unchecked(SWSSGetError()) };
            println!("error: {error_str}, retrying");
            sleep_10_seconds();
        }
    }

    // We have tried 10 times - it's probably a fatal error, so let's bite the bullet and do an aborting call
    // Since we did not use SWSSTry_, any exception will still abort. Just like current behavior.
    // We will get a nice core dump because abort() was called;
    let dbconnector = unsafe { SWSSDBConnector_new_tcp(...) };
    return dbconnector;
}

This is the best idea IMO because we can choose where to abort and where not to. Using SWSSTry_ and SWSSCatch we can retry something a few times before electing to lay down our cards and get a proper core dump. In simple cases, where any error is fatal, we can simply ignore SWSSTry_ and SWSSCatch, and use aborting behavior on the first try.

Final thoughts

If there are very very few nonfatal errors in code that the C api uses, and we can find all of them and hard code their nonfatality into the C api functions, then we should do option 1.

If the C api is going to grow a lot later, or swss-common might change what errors are nonfatal, or we really want the C api to be very simple, or we want to allow people working on swss-common to ignore the C api, or we want to give generic error handling decisions to downstream foreign code, I think we should do option 2 advanced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with this it still creates challenge in the Humberger stack, where c++ -> rust -> c++, because with return code approach, every function in rust need to understand the error code and passing it to the upper layer. This limits the error handling in rust to only Anyhow-ish crates with downcast...

This is a problem that me and Oliver struggled for days on finding the solutions, but so far there is no obvious/quick answer to it. So I have discussed this issue with Qi offline. We will need to fix this issue, but since it is only impacting the services that uses rust in the exception handling path, we are going to create a backlog to get this fixed, and get the current PR merged to unblock our immediate project progress.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this issue is fundamentally caused by rust not having try-catch support, and we are trying to recreate the try-catch support for the language, which will take time to get it right.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Work Item created here: MSADO:29998324 and assigned to Oliver at this moment.

Copy link
Contributor Author

@erer1243 erer1243 Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this issue is fundamentally caused by rust not having try-catch support, and we are trying to recreate the try-catch support for the language, which will take time to get it right.

Rust panics use the same mechanism as c++ exceptions - it really does have try/catch, but the issue is that no C ABIs permit unwinding. Technically it would even be incorrect if we called these C api functions from within another C++ program, if we crossed a dynamic-linking boundary (i.e. the "hamburger stack" would look like C++ -> C -> C++). This is because internal ABI details are negotiated per linkage unit (unless magic is applied like rust's C-unwind). If we had a C++ function that called a C function, it's valid to optimize out any unwind-catching code because, by spec, no C function can ever unwind. Rust is also making this assumption - that no C function can ever unwind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think we should either implement the "option 2 advanced" thing I wrote, or just ignore this for now until catching specific errors becomes necessary in hamgrd or other services.

liuh-80
liuh-80 previously approved these changes Oct 22, 2024
@qiluo-msft
Copy link
Contributor

@erer1243 In future could you reduce the force merge during PR iteration? We are using tool to compare between commits to understand what is newly added, and it improve code review efficiency.

r12f
r12f previously approved these changes Oct 23, 2024
Copy link
Contributor

@r12f r12f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the backlog created and sync'ed with Qi, approving this PR, so we can get it merged.

@erer1243 erer1243 dismissed stale reviews from r12f and liuh-80 via 32a8131 October 23, 2024 21:48
@erer1243
Copy link
Contributor Author

Sorry, forgot to include the c api stuff in the deb packages :)

r12f pushed a commit to sonic-net/sonic-dash-ha that referenced this pull request Oct 29, 2024
Implement rust wrapper of the new C api on libswsscommon

Related: sonic-net/sonic-swss-common#915

---------

Co-authored-by: erer1243 <erer1243@users.noreply.github.com>
@r12f
Copy link
Contributor

r12f commented Oct 30, 2024

sync'ed with Qi. all comments are resolved now, hence merging it.

@r12f r12f merged commit 45d7cb0 into sonic-net:master Oct 30, 2024
17 checks passed
@erer1243 erer1243 deleted the c-interface branch October 30, 2024 01:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants