Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor abort on OOM changes #1506

Merged
merged 5 commits into from
Mar 2, 2023
Merged

Conversation

travisdowns
Copy link
Contributor

At Redpanda we are moving to enable abort-on-OOM by default, and as part of this it is useful to be able to disable as well as enable this setting (current behavior is that it starts out disabled and the reactor does a one-time enable based on the command line flag --abort-on-seastar-bad-alloc.

This series implements the ability to disable it and also to get its status.

A small test is included.

* occurs if abort is globablly enabled on this shard and there
* are no disable_abort_... objets currently alive on this shard.
*/
void enable_abort_on_allocation_failure(bool enabled = true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a boolean parameter with default so the default behavior is unchanged.

seastar_logger.warn("Seastar compiled with default allocator, will not abort on bad_alloc");
void enable_abort_on_allocation_failure(bool enabled) {
if (enabled) {
seastar_logger.warn("Seastar compiled with default allocator, will not abort on bad_alloc");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only warn if we are trying to turn enable abort on OOM since the other way is "OK", I suppose.

}

bool is_abort_on_allocation_failure() {
return false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inside ifdef DEFAULT_ALLOCATOR block.

@travisdowns travisdowns marked this pull request as ready for review February 16, 2023 18:01
* occurs if abort is globablly enabled on this shard and there
* are no disable_abort_... objets currently alive on this shard.
*/
void enable_abort_on_allocation_failure(bool enabled = true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the call allows setting the parameter both ways, it's better to name it set_abort_on_allocation_failure()

Copy link
Contributor Author

@travisdowns travisdowns Feb 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xemul - I agree and considered this, but this would be a breaking change to the API right?

I also considered just adding another method disable_abort_on_allocation_failure, though I find this API a bit worse, it doesn't have the naming problem.

I'll go whatever way you think is best: maybe adding the new set_... method and leaving this one as is (just forwarding to the set_ one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xemul - I agree and considered this, but this would be a breaking change to the API right?

Ah, I wasn't clear enough, sorry :) Let me try to clarify my point.

This comment and next one should have been together -- the old API call should not get removed, but it should be deprecated. Then we wait for ~1 year, then remove it completely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification @xemul, it makes sense. I'll wait for the resolution of the discussion on whether we need this at all with @raphaelsc before proceeding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xemul - I've updated this PR to add a new set_ method as you suggest and deprecate the existing one-way API.

@raphaelsc
Copy link
Member

why not piggyback on existing class disable_abort_on_alloc_failure_temporarily?

seastar allows user to enable and disable abort_on_alloc_failure at runtime. I don't see an actual need to add a third choice.

you can do the following at the application (not even compile tested):

thread_local std::optional<seastar::memory::disable_abort_on_alloc_failure_temporarily> abort_on_alloc_failure_supressed;

void enable_abort_on_alloc_failure() {
    seastar::memory::enable_abort_on_allocation_failure(); // overrides choice at command line arg.
    abort_on_alloc_failure_supressed.reset();
}

void disable_abort_on_alloc_failure() {
    abort_on_alloc_failure_supressed = seastar::memory::disable_abort_on_alloc_failure_temporarily();
}

bool is_abort_on_alloc_failure_supressed() {
    return abort_on_alloc_failure_supressed.has_value();
}

you can even wrap it in a seastar service, do whatever you want.

interfaces should only be added / changed when really needed (meaning seastar application cannot achieve a target without it), otherwise we end up bloating the seastar interface.

@xemul
Copy link
Contributor

xemul commented Feb 17, 2023

thread_local std::optional<seastar::memory::disable_abort_on_alloc_failure_temporarily> abort_on_alloc_failure_supressed;

void disable_abort_on_alloc_failure() {
    abort_on_alloc_failure_supressed = seastar::memory::disable_abort_on_alloc_failure_temporarily();
}

Global variable storing the temporary disabling RAII guard is awesome :D

@travisdowns
Copy link
Contributor Author

travisdowns commented Feb 17, 2023

why not piggyback on existing class disable_abort_on_alloc_failure_temporarily?

Because this is a different behavior: it controls the shard-local temporary disablement behavior.

I want to control the global enablement, which even if I were willing to try to emulate this by changing the thread-local value on every shard still isn't the same (consider e.g., alien threads).

interfaces should only be added / changed when really needed (meaning seastar application cannot achieve a target without it), otherwise we end up bloating the seastar interface.

I really need this, and prefer that this change (or an equivalent one) is upstream as well. Avoiding adding another method is why I went with the boolean argument and default value (though I agree with Pavel that it makes the name "wrong").

@raphaelsc
Copy link
Member

thread_local std::optional<seastar::memory::disable_abort_on_alloc_failure_temporarily> abort_on_alloc_failure_supressed;

void disable_abort_on_alloc_failure() {
    abort_on_alloc_failure_supressed = seastar::memory::disable_abort_on_alloc_failure_temporarily();
}

Global variable storing the temporary disabling RAII guard is awesome :D

That is indeed not the best example to come up with :-)

* default (system) allocator is being used.
*/
bool is_abort_on_allocation_failure();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add a few words on the commit message about the motivation for this one?

I am trying to understand why an application needs it. If it needs to enable, it will enable regardless of the previous value.

Copy link
Contributor Author

@travisdowns travisdowns Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @raphaelsc. I've updated the commit message to the following:

Allow application to query the abort on OOM state

Add an is_abort_on_allocation_failure method which returns
the global state of this reactor option.

This is useful for applications that which to query this state in order
to expose it for debugging or administration purposes. Applications
could try to track this themselves, but this is unreliable since the
reactor itself can change this value based on the command line flags.

Let me know if you think it is sufficient.

@raphaelsc
Copy link
Member

why not piggyback on existing class disable_abort_on_alloc_failure_temporarily?

Because this is a different behavior: it controls the shard-local temporary disablement behavior.

I want to control the global enablement, which even if I were willing to try to emulate this by changing the thread-local value on every shard still isn't the same (consider e.g., alien threads).

interfaces should only be added / changed when really needed (meaning seastar application cannot achieve a target without it), otherwise we end up bloating the seastar interface.

I really need this, and prefer that this change (or an equivalent one) is upstream as well. Avoiding adding another method is why I went with the boolean argument and default value (though I agree with Pavel that it makes the name "wrong").

Thanks for the explanation. I was just making sure that we're not doing changes needlessly. Please address Pavel's suggestions.

@travisdowns travisdowns force-pushed the td-scylla-abort-on-oom branch from cf064fd to 0289f47 Compare February 20, 2023 14:07
* is enabled, or false otherwise. Always returns false if the
* default (system) allocator is being used.
*/
bool is_abort_on_allocation_failure();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this fwd declaration belong to next patch instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, moved in next push.

@@ -139,8 +139,35 @@ static constexpr size_t huge_page_size =
void configure(std::vector<resource::memory> m, bool mbind,
std::optional<std::string> hugetlbfs_path = {});

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't help asking for making this awesome comment into a doxygen-generated documentation (and the one in front of is_abort_on_allocation_failure below too).

To check it you can use ninja -C build/dev docs and then check the build/dev/doc/html/namespaceseastar_1_1memory.html file. Likely it won't render at the first try, because this line sits inside \cond internal block, so its worth moving it outside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely it won't render at the first try, because this line sits inside \cond internal block

Right, doing this.

Question: is whether an API internal or not defined by the this \cond internal annotation or whether it appears in an *::internal namespace? I would assume the latter, though if the former I guess this would change this API from internal to not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The *::internal namespace doesn't affect documentation generation AFAIK. Also, the *_abort_on_allocation_failure() is already out of any C++'s internal namespace, it only comes under the doxygen's radar, so my request is just to pull it out of there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xemul - done and checked with ninja docs.

My question was just whether having a method inside the \cond internal block had any bearing on whether it was considered an "internal" API for the purposes of stability guarantees, etc. I guess the answer is "no, only the C++ namespace matters".

@travisdowns travisdowns force-pushed the td-scylla-abort-on-oom branch from 0289f47 to d17780b Compare February 23, 2023 13:08
Copy link
Member

@raphaelsc raphaelsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good.

There is a global "abort on OOM" setting in the seastar, and
we allow enabling it via a public method in seastar::memory.

There was no corresponding method to disable it, but such
a method is useful in order to give applications full control
over this setting.

Introduce a new method which takes a boolean parameter in order to allow
both enabling and disabling the abort behavior. The old call is
deprecated and simply falls the new method with enabled=true.
Add an is_abort_on_allocation_failure method which returns
the global state of this reactor option.

This is useful for applications that which to query this state in order
to expose it for debugging or administration purposes. Applications
could try to track this themselves, but this is unreliable since the
reactor itself can change this value based on the command line flags.
A basic test which flips the abort on OOM switch up and
down and which checks that the is_enabled function
returns the correct value.
Replace enable_abort_on_allocation_failure with
set_abort_on_allocation_failure(true) as the former is deprecated.
To include details of the set_abort_on_allocation_failure API.
@travisdowns travisdowns force-pushed the td-scylla-abort-on-oom branch from d17780b to 3077ab3 Compare February 23, 2023 13:16
@travisdowns
Copy link
Contributor Author

travisdowns commented Feb 23, 2023

Last pushes enable doxygen generation for the new methods and also switch from /** to /// style comment blocks for consistency.

@travisdowns travisdowns requested a review from xemul February 23, 2023 14:58
@travisdowns
Copy link
Contributor Author

@raphaelsc - thanks for the review, much appreciated!

@xemul - thanks for the review so far: let me know if you were looking for anything else here.

@xemul xemul closed this in eef1d6e Mar 2, 2023
@xemul xemul merged commit eef1d6e into scylladb:master Mar 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants