Skip to content

Conversation

@ywkaras
Copy link
Contributor

@ywkaras ywkaras commented Apr 12, 2022

This seems to have eliminated some ERROR diags we were seeing in Yahoo Prod when
doing config reloads.

The SSLSecret public functions no longer return pointers into the unorded_map of
secrets, they return a copy of the secret data. This seemed thread unsafe. A
periodic poll running in the background can update the secret data for an entry
for a secret name in the map.

To avoid exporting pointers, I had to change the prototype of TSSslSecretGet().
Hopefully there are no existing plugins that are already using this TS API function,
so breaking this rule will be moot. I added a new TS API handle, TSHeapBuf, in order
to be able to cleanly return a copy of the secret data.

@ywkaras ywkaras self-assigned this Apr 12, 2022
@ywkaras ywkaras added this to the 10.0.0 milestone Apr 12, 2022
@ywkaras ywkaras force-pushed the os_pkey_cnf_reload branch from bf22920 to 4c40fd3 Compare April 12, 2022 16:41
@ywkaras
Copy link
Contributor Author

ywkaras commented Apr 12, 2022

I'll undraft this after letting it soak in Yahoo Prod for a while.

@ywkaras
Copy link
Contributor Author

ywkaras commented May 25, 2022

Sorry, I accidentally made the PR branch in the Apache repo and not my fork. I'll delete the branch once this PR is closed.

@ywkaras ywkaras force-pushed the os_pkey_cnf_reload branch from 559d15b to d7c7d66 Compare May 25, 2022 23:55
@ywkaras
Copy link
Contributor Author

ywkaras commented Jun 7, 2022

I disagree this is incompatible, since, as Masakazu found, it's only incompatible with commits which have not been released yet.

@ywkaras ywkaras force-pushed the os_pkey_cnf_reload branch from 32947d0 to c643ffd Compare June 13, 2022 19:31
@ywkaras
Copy link
Contributor Author

ywkaras commented Jul 27, 2022

This has been running in Yahoo Prod for more than 2 weeks, with no issues/problems. These changes fixed errors that were happening when doing a config reload.

@ywkaras ywkaras marked this pull request as ready for review July 27, 2022 19:02
@bryancall bryancall requested a review from shinrich August 1, 2022 23:42
@ywkaras ywkaras added the Bug label Aug 22, 2022
@maskit maskit self-requested a review August 30, 2022 15:14
Copy link
Contributor

@cmcfarlen cmcfarlen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain more about the blocking_invoke being in this PR? Is that necessary and do we want this so-called hack?

if (stat(name.c_str(), &statdata) < 0) {
return false;
return std::string{};
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we could skip calling stat here since we don't do anything with the result and the load call will fail below if stat does.

const std::string *
SSLSecret::getSecretItem(const std::string &name) const
std::string
SSLSecret::getSecret(const std::string &name) const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With these changes the caller can no longer differentiate between not found and empty secrets. Perhaps this could return std::optional to restore that info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that the calling code treats, or should treat, an empty file versus a non-existent file differently. Are you saying that it should? Why?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying a caller cannot tell anymore if its empty vs non-existent, as well as giving a suggestion as to how that information could be restored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're saying you think this function needs to be future safe in case someday something that calls it needs to know that the secret isn't available because the file is empty rather than not present? We could spend a lot of time and bloat consistently having that level of future safety.

// the calling thread. It should be replaced with a version with a continuation parameter, providing a continuation
// to be called when the hook handling is complete.
//
int blocking_invoke(int event, void *edata) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this hack related to cleanup of SSL secrets? Is this needed for this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The APIHook::invoke() (what was originally used instead of this) will, if the mutex of the continuation is already locked, do an ink_release_assert(). So that effectively means you can't have a continuation mutex. I needed a mutex on a continuation in an internal Yahoo plugin. APIHook::blocking_invoke() will just block on the mutex. If there is time someday, all hooks should be made to work like the global and transaction hooks. Try the mutex, and if you can't lock it, immediate reschedule the continuation. That's the sense in which this is a hack.

// Support code for TSHeapBuf.
namespace
{
class HeapBuf
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HeapBuf seems like its just std::unique_ptr<std::string>, would you consider using that to back TSHeapBuf instead of this custom class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like it would require three memory allocations instead of one. This is meant to be a general purpose type, so pessimizing it could have a significant performance impact, depending how it's used in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should only have to heap allocate the string as you do with _S here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TSHeapBuf would have to be of type std::unique_ptr<std::string> *. The std::uniue_ptr<std::string> instance would be dynamically allocated (allocation 1). It would point to a std::string instance in the heap (allocation 2). The std::string instance would allocate a buffer for the data (allocation 3).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that effectively, HeapBuf could be written as using HeapBuf = std::unique_ptr<std::string> and TSHeapBuf would be (opaquely) *std::string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I see what you mean now. But that will still mean doubling the number of memory allocations. Also mice that HeapBuf encapsulates the reinterpret_casts.

@ywkaras ywkaras force-pushed the os_pkey_cnf_reload branch from c643ffd to 452478a Compare August 30, 2022 17:52
Copy link
Member

@maskit maskit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to have eliminated some ERROR diags we were seeing in Yahoo Prod when
doing config reloads.

This seemed thread unsafe. A periodic poll running in the background can update the secret data for an entry
for a secret name in the map.

I added a new TS API handle, TSHeapBuf, in order to be able to cleanly return a copy of the secret data.

I would not call this general cleanup. The commit log would look like a trivial change but it is not. Who thinks a "cleanup" fixes a runtime issue and introduces new TS APIs?

And I don't see the necessity of introducing TSHeapBuf on this PR. I might be missing something but it seems like we just need to return a copy of secret data to resolve the issue. That is the key, right?

I'm not going to judge whether we want TSHeapBuf in general here, because that should be discussed on dev@ as API review, but I think we should not introduce TSHeapBuf on this PR. If we failed to fix the issue before 9.2.0 release, we would have to introduce the problematic API and need an incompatible change on 10.0.0. I'd suggest fixing the issue without introducing TSHeapBuf for now (if possible) to avoid the unfortunate releases. Since feature freeze for 9.2.0 has not been called yet, you can still propose the interface that uses TSHeapBuf, after we fix the issue. Although that could be postponed to 10.0.0, I think that's much better than introducing the problematic API as it is.

At minimum, the title, which will be a commit message, need to be updated. And since there are new TS APIs proposed, this PR should not be merged until API review completes even if we add them on this PR.

@ywkaras
Copy link
Contributor Author

ywkaras commented Aug 30, 2022

"I would not call this general cleanup."

What title do you prefer then?

@maskit
Copy link
Member

maskit commented Aug 30, 2022

What title do you prefer then?

I don't know what happened exactly, so I can only say "Fix an error on SSL config reload".

@ywkaras ywkaras changed the title General cleanup of the SSLSecret code. Fix an error on SSL config reload (plus some cleanup). Aug 30, 2022
@zwoop
Copy link
Contributor

zwoop commented Sep 1, 2022

As mentioned already, at a minimum this should be split into two PRs, and a discussion should be opened on the mailing list about adding the new set of APIs.

@ywkaras ywkaras marked this pull request as draft September 6, 2022 16:08
@ywkaras ywkaras marked this pull request as ready for review September 20, 2022 00:45
@ywkaras ywkaras closed this Sep 20, 2022
@ywkaras ywkaras reopened this Sep 20, 2022
@ywkaras
Copy link
Contributor Author

ywkaras commented Sep 30, 2022

We're seeing a deadlock with our plugin that uses SSL_SECRET_HOOK, so making this a draft for now.

@ywkaras ywkaras marked this pull request as draft September 30, 2022 21:02
@ywkaras
Copy link
Contributor Author

ywkaras commented Oct 3, 2022

Brian Neradt has a fix for the deadlock problem. We're going to let it ripen internally for a bit, then I'll pull it into this PR.

ywkaras and others added 2 commits October 31, 2022 22:55
This seems to have eliminated some ERROR diags we were seeing in Yahoo Prod when
doing config reloads.

The SSLSecret public functions no longer return pointers into the unorded_map of
secrets, they return a copy of the secret data. This seemed thread unsafe. A
periodic poll running in the background can update the secret data for an entry
for a secret name in the map.

To avoid exporting pointers, I had to change the prototype of TSSslSecretGet().
Hopefully there are no existing plugins that are already using this TS API function,
so breaking this rule will be moot. I added a new TS API type, TSAllocatedVarLenData,
in order to be able to cleanly return a copy of the secret data.
1. getOrLoadSecret grabbed the secret_map_mutex and called loadSecret.
2. loadSecret dispatched to Continations that registered for the
   TS_EVENT_SSL_SECRET event. This would try to grab the Continuation's
   lock.
3. In the meantime, those Continuations could call setSecret which would
   try to grab the secret_map_mutex. If this Continuation did this while
   holding the lock that step 2 is waiting upon, then there will be a
   deadlock between the Continuation lock and the secret_map_mutex
   between the two threads.

This patch avoids the deadlock by releasing the secret_map_mutex lock
before calling loadSecret. It also updates the secret_map when loading
secrets from a file in loadSecret.
@ywkaras ywkaras force-pushed the os_pkey_cnf_reload branch from fcd9137 to 5d25fa2 Compare October 31, 2022 23:00
@ywkaras ywkaras marked this pull request as ready for review November 1, 2022 16:48
@maskit
Copy link
Member

maskit commented Jan 3, 2023

I still think we should fix the API without introducing additional data structure to avoid the unfortunate release that has an unusable API.

Even if we are going to have TSAllocatedVarLenData, using it just for TSSslSecretGet causes inconsistency. TSUrlStringGet could use it too. We can continue the discussion about the data structure on the list and have it later. Let's focus on the issue on this PR.

@zwoop zwoop closed this Jan 20, 2023
@zwoop zwoop deleted the os_pkey_cnf_reload branch January 20, 2023 18:05
@ywkaras
Copy link
Contributor Author

ywkaras commented Jan 20, 2023

We have this in Yahoo internal TS. If anyone else sees errors like:

[Jan 20 00:01:47.669] [ET_TASK 1] ERROR: SSL::139889882617600:error:0909006C:PEM routines:get_name:no start line:crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
[Jan 20 00:01:47.669] [ET_TASK 1] ERROR: SSL::139889882617600:error:0909006C:PEM routines:get_name:no start line:crypto/pem/pem_lib.c:745:Expecting: ANY PRIVATE KEY
[Jan 20 00:01:47.669] [ET_TASK 1] ERROR: failed to load server private key from /home/y/conf/trafficserver/ssl/ycpi.tls.yahooinc.com.ec.key

we should reopen this PR.

@maskit
Copy link
Member

maskit commented Jan 20, 2023

Weren't you going to update this PR with the simplest interface? I think this PR was closed because the branch was on the upstream repo.

@ywkaras
Copy link
Contributor Author

ywkaras commented Jan 20, 2023

I was planning to, but it was delayed by production issues. Leif didn't comment on why he was closing it.

@zwoop zwoop removed this from the 10.0.0 milestone Feb 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants