
regression: unable to detect errors related to messages being too large #2655

Closed
ae-govau opened this issue Sep 20, 2023 · 10 comments
Labels: bug, needs-investigation, stale

Comments

@ae-govau

(creating issue from comments I made on #2628 a number of weeks ago)

The recent change to error handling in #2628 broke some of our running code. We had code that would:

if err == sarama.ErrMessageSizeTooLarge {
    // log a message, bump a metric, but don't retry
}

i.e. for our use case, we want to ignore errors caused by a message being too large; it's OK for us to throw those messages out. We used to check the size before sending into Sarama, but we hit cases where the payload was under the limit yet, once headers etc. were added, it was over, so we switched to checking this error code.

Noting that #543 reports that the previous error code has been returned for at least 8 years, this seems a reasonably significant change... (and a good example of Hyrum's Law).

It's now not clear to me how I should check for size related issues.

Interestingly, the last PR that touched that line of code was 4 years ago, in 2019, when #1262 reverted a change from a few days earlier (#1218) that had renamed a bunch of error variables because:

renaming ErrXxx constants is an API changing change #1261

(that change is arguably less impactful than this one because it could be detected/corrected at compile time whereas we only caught this at runtime)

@ae-govau
Author

We're open to changing our code, but as a result of this change we've had to rewrite it as

if err == sarama.ErrMessageSizeTooLarge || strings.Contains(err.Error(), "MaxMessageBytes") {

and I feel like there should be a better option?
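For illustration only, that workaround could be wrapped in a small helper (isMessageTooLarge is a hypothetical name, not anything exported by the library), using errors.Is so a wrapped broker error still matches:

import (
    "errors"
    "strings"

    "github.com/IBM/sarama"
)

// isMessageTooLarge is an illustrative helper wrapping the workaround above:
// it matches the broker-reported ErrMessageSizeTooLarge as well as the
// client-side error by inspecting its text.
func isMessageTooLarge(err error) bool {
    if errors.Is(err, sarama.ErrMessageSizeTooLarge) {
        return true
    }
    return err != nil && strings.Contains(err.Error(), "MaxMessageBytes")
}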

@dnwe
Collaborator

dnwe commented Sep 20, 2023

@ae-govau would this be sufficient?

var cfgErr sarama.ConfigurationError
if errors.As(err, &cfgErr) || // any client-side configuration error
    errors.Is(err, sarama.ErrMessageSizeTooLarge) {
    // log non-retriable error and don't retry send
}

@hindessm
Collaborator

Apologies that my assumption that code would already be handling the configuration error was incorrect. Hopefully @dnwe's suggestion above helps.

@dnwe Perhaps the API should export a generic function to check for retriable/non-retriable errors for cases like this?
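A purely hypothetical sketch of the kind of helper being suggested (isRetriable is an invented name; sarama does not export such a function, and real retriability rules would be more involved):

// isRetriable illustrates a generic retriability check: client-side
// configuration errors and oversized messages are treated as non-retriable.
func isRetriable(err error) bool {
    var cfgErr sarama.ConfigurationError
    if errors.As(err, &cfgErr) {
        return false // misconfiguration won't fix itself on retry
    }
    return !errors.Is(err, sarama.ErrMessageSizeTooLarge)
}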

@ae-govau
Author

Thanks for the responses. We aren't that keen on checking for a generic configuration error, as we would like to be alerted to any other config errors - we just want to know when the problem is that the message is too big - since we deal with that failure mode quite differently than others.

Noting that the calculated size of the message (as I understand it) is bigger than the payload itself, it's not clear how we can calculate that before trying to send it and thus we need to be able to figure it out from the error itself.

An API for "is this a message too big error" would also be fine.

@shou1dwe

shou1dwe commented Jan 2, 2024

Ran into the exact same problem, and I am really reluctant to put in a string-based error check like, for example, errors.Is(err, sarama.ConfigurationError) && strings.HasPrefix(string(err), "Attempt to produce message larger than configured Producer.MaxMessageBytes").

I am afraid someone will make an "error message readability improvement" or a seemingly harmless casing change and cause a problem for us.

I appreciate the need to distinguish a client-side error from a broker error, but reusing ConfigurationError here doesn't sound right either. Arguably this isn't a ConfigurationError: we configure the threshold correctly, and the messages produced can be sourced from other places, so their size is runtime behavior that is not entirely in the control of the system hosting the producer. It is some flavor of "Bad Request" or "Invalid Argument" imho.

What makes it worse is that this is not a behavior of the mock producer, so I can't just add a unit test to detect the change in behavior here; I have to use a real Kafka client setup to detect it.

In any case, it would be nice to have some error value that we can reference in code, or, similar to what @ae-govau suggested, an API for "is this a message too big" errors.
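For reference, a compilable version of the string-based check described above would look roughly like this (a sketch only; the prefix text is copied from the comment and is exactly the kind of detail that could silently change):

import (
    "errors"
    "strings"

    "github.com/IBM/sarama"
)

// fragileMessageTooLarge is an illustrative version of the string-prefix
// check discussed above; it depends on the exact error text, which is the
// brittleness being objected to.
func fragileMessageTooLarge(err error) bool {
    var cfgErr sarama.ConfigurationError
    return errors.As(err, &cfgErr) &&
        strings.HasPrefix(string(cfgErr), "Attempt to produce message larger than configured Producer.MaxMessageBytes")
}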


github-actions bot commented Apr 1, 2024

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur.
Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions bot added the stale label Apr 1, 2024
@ae-govau
Author

ae-govau commented Apr 1, 2024

This is still an issue for us. I might make a PR to try to address.

ae-govau added a commit to ae-govau/sarama that referenced this issue Apr 1, 2024
This provides a method for producers to work around the regression
introduced by IBM#2628. Fixes IBM#2655.

Signed-off-by: Adam Eijdenberg <adam.eijdenberg@defence.gov.au>
@ae-govau
Author

ae-govau commented Apr 2, 2024

PR #2848 has been opened so that clients can call:

// IsMessageSizeTooLarge returns true if the error relates to the message size
// being either too large as reported by the broker, or too large because it
// exceeds the configured maximum size.
func IsMessageSizeTooLarge(err error) bool {
...
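A minimal usage sketch, assuming the helper lands as proposed (handleSend and the drop-on-oversize policy are illustrative, not part of the PR):

// handleSend shows how a caller might use the proposed helper; producer is
// a sarama.SyncProducer. Messages that are too large are dropped
// deliberately (log/metric would go here) rather than retried.
func handleSend(producer sarama.SyncProducer, msg *sarama.ProducerMessage) error {
    _, _, err := producer.SendMessage(msg)
    if err != nil && sarama.IsMessageSizeTooLarge(err) {
        return nil // drop: too-large messages are not retried
    }
    return err
}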

github-actions bot removed the stale label Apr 2, 2024
ae-govau added a commit to ae-govau/sarama that referenced this issue Apr 3, 2024
For most of this library's existence it has returned ErrMessageSizeTooLarge
when the message exceeded the configured size.

For a short period in 2019 this error was renamed (IBM#1218) but shortly
reverted back (IBM#1262). Later, in 2023, this error was changed to a
ConfigurationError (IBM#2628) to fix IBM#2137; however, this caused issues
for clients who rely on the previous error code being distinct from
other ConfigurationError conditions (IBM#2655).

This commit reverts to the previous behaviour and adds a test to pick up if
this changes again in the future.

Signed-off-by: Adam Eijdenberg <adam.eijdenberg@defence.gov.au>
@Manicben

Manicben commented Apr 4, 2024

Just adding how this has impacted us.

In our applications, the ErrMessageSizeTooLarge error was used to determine retriability. If it was returned by the brokers, we had logic to split up the batch provided to SendMessages to allow messages to be accepted by the brokers. If it was returned by the producer itself, or by the brokers after splitting down to a single message, then it was deemed non-retriable and the message would be handled as such (usually via DLQ).

In this case, checking for any ConfigurationError is not acceptable, and we'd have to inspect the error string to ensure it is due to exceeding Producer.MaxMessageBytes and nothing else. This means the suggestion from @dnwe doesn't work for us, and as @ae-govau said, we cannot determine the message/batch size ourselves before calling SendMessage(s).

For now we will have to settle for checking the error string, but we would also be keen to see either a dedicated error type or some API like @ae-govau has suggested.
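As a rough illustration of the splitting strategy described above (a hypothetical sketch relying on the pre-#2628 behaviour of ErrMessageSizeTooLarge; batchTooLarge, sendWithSplit, and the dlq callback are invented for the example):

import (
    "errors"

    "github.com/IBM/sarama"
)

// batchTooLarge reports whether a SendMessages failure involves a
// "message too large" error, checking the top-level error and the
// per-message errors inside sarama.ProducerErrors.
func batchTooLarge(err error) bool {
    var perrs sarama.ProducerErrors
    if errors.As(err, &perrs) {
        for _, pe := range perrs {
            if errors.Is(pe.Err, sarama.ErrMessageSizeTooLarge) {
                return true
            }
        }
    }
    return errors.Is(err, sarama.ErrMessageSizeTooLarge)
}

// sendWithSplit halves a rejected batch until the broker accepts it; a
// single message that is still too large is treated as non-retriable and
// handed to dlq (for example, a dead-letter queue publisher).
func sendWithSplit(p sarama.SyncProducer, batch []*sarama.ProducerMessage, dlq func(*sarama.ProducerMessage) error) error {
    err := p.SendMessages(batch)
    if err == nil || !batchTooLarge(err) {
        return err
    }
    if len(batch) == 1 {
        return dlq(batch[0])
    }
    mid := len(batch) / 2
    if err := sendWithSplit(p, batch[:mid], dlq); err != nil {
        return err
    }
    return sendWithSplit(p, batch[mid:], dlq)
}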


github-actions bot commented Jul 3, 2024

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur.
Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions bot added the stale label Jul 3, 2024
github-actions bot closed this as not planned Aug 2, 2024