Heroic Bigtable Consumer does not handle failures as expected #724
Comments
Just to avoid any confusion, this problem is not related to distribution implementation :).
Good call @ao2017. It turns out the column family creation works, but we just have to pass the proper config variable and set it. I will repurpose this ticket to address the exception handling.
@samfadrigalan - some evidence supporting this issue:
I have concerns about moving to late acks. Right now unprocessable messages get dropped - if we switch to late-ack, those go back to PubSub and will be continuously retried. A relatively small number of unprocessable messages could plausibly lock up the system. If we know that the messages are bad we could ack them despite not successfully processing them, but if we do that we now have a system where only known failures are safely handled and unknown failures cause queue buildup. I think in general we should let messages blow up when we don't know what to do with them and only retry if we're very certain it's safe to do so.
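For reference, a minimal sketch of the ack policy described above, using the google-cloud-pubsub Java client. The `process` method and `UnprocessableMessageException` are hypothetical stand-ins for Heroic's actual write path, not code that exists in the repo:

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.pubsub.v1.PubsubMessage;

// Hypothetical receiver illustrating the ack policy discussed above:
// ack successful and known-unprocessable messages, nack only when a
// retry is believed to be safe (e.g. a transient Bigtable error).
public class ConsumerAckPolicy implements MessageReceiver {
  @Override
  public void receiveMessage(PubsubMessage message, AckReplyConsumer consumer) {
    try {
      process(message);   // hypothetical decode-and-write path
      consumer.ack();     // processed successfully
    } catch (UnprocessableMessageException e) {
      // Known-bad payload: ack so it is not redelivered forever,
      // but log/trace it so the failure stays visible.
      consumer.ack();
    } catch (Exception e) {
      // Unknown failure: nack only if redelivery is considered safe;
      // otherwise this risks the queue build-up described above.
      consumer.nack();
    }
  }

  private void process(PubsubMessage message) throws UnprocessableMessageException {
    // ... decode and write to Bigtable ...
  }

  // Hypothetical exception type for payloads we know we cannot process.
  static class UnprocessableMessageException extends Exception {}
}
```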
For that concern, can we use dead-letter queues (https://cloud.google.com/pubsub/docs/dead-letter-topics)? This way we will have visibility on the failures and not have any data loss without clogging up the main queues. Because we have copies of the failed messages, we could easily reproduce the failures in a non-prod pipeline. We could then redirect the messages back to the main consumer processing once we've released a fix.
Yeah that sounds like a good plan, the visibility would be great. I wonder how useful redirecting will be in practice though - if we need to make a code change and redeploy, the queue could become way too big to process.
So it turns out PubSub only dead-letters a message after a minimum of 5 delivery attempts, and that can't be configured any lower. If there's an issue like the missing column family that spawned this ticket, the cluster ends up doing 5x as much work before rejecting the message. I'm working on improving the logging and tracing so there is better visibility for write failures, but I'm planning to leave the PubSub acking as-is.
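If we do go the dead-letter route later, attaching the policy would look roughly like the sketch below, using the Pub/Sub admin client. The project, subscription, and topic names are hypothetical, and note that the Pub/Sub service account also needs permission to publish to the dead-letter topic:

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.FieldMask;
import com.google.pubsub.v1.DeadLetterPolicy;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.UpdateSubscriptionRequest;

public class AttachDeadLetterTopic {
  public static void main(String[] args) throws Exception {
    // Hypothetical resource names for illustration only.
    String subscription = "projects/my-project/subscriptions/heroic-consumer";
    String deadLetterTopic = "projects/my-project/topics/heroic-consumer-dead-letter";

    try (SubscriptionAdminClient client = SubscriptionAdminClient.create()) {
      Subscription updated =
          Subscription.newBuilder()
              .setName(subscription)
              .setDeadLetterPolicy(
                  DeadLetterPolicy.newBuilder()
                      .setDeadLetterTopic(deadLetterTopic)
                      // 5 is the minimum Pub/Sub allows, which is the
                      // constraint discussed above.
                      .setMaxDeliveryAttempts(5)
                      .build())
              .build();

      // Only update the dead-letter policy; leave the rest of the
      // subscription configuration untouched.
      client.updateSubscription(
          UpdateSubscriptionRequest.newBuilder()
              .setSubscription(updated)
              .setUpdateMask(FieldMask.newBuilder().addPaths("dead_letter_policy").build())
              .build());
    }
  }
}
```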
I'm having trouble reproducing the lack of logs. I tried running the IT tests without the column family created, and the exception was very visibly logged:
Were you able to reproduce that log in staging or locally? I wonder if there is a difference between the setup in the IT tests and running the consumers as-is in prod or locally (e.g. are the grpc calls in the production consumer flow wrapped in async functions while the IT calls aren't?). I noticed that log does not exactly match the stack trace in the description. I had to set up the consumers locally and evaluate expressions in debug mode to get that stack trace, as there were no error logs in the cloud or in the local setup when the consumer tried to process messages.
It should be the same flow - the writes are in a
This adds error flags and exception messages to the spans for metadata, suggest, and bigtable writes when the chain of futures fails. The same exceptions are also logged. Previously some exceptions, such as grpc errors, could be masked by slf4j settings. The trace for a metric write was cleaned up to remove several intermediary spans. These spans did not have useful information and had no branching paths, so they were just clutter in the overall trace. Closes #724.
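As a rough illustration of the pattern (not the actual Heroic code, which has its own async and tracing wiring), tagging the span and logging when a future chain fails might look like this, assuming an OpenTracing-style tracer, slf4j, and a `CompletableFuture`-based write as stand-ins:

```java
import io.opentracing.Span;
import io.opentracing.Tracer;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class TracedBigtableWrite {
  private static final Logger log = LoggerFactory.getLogger(TracedBigtableWrite.class);

  // Wraps a write future so that failures are both recorded on the span
  // and logged at error level, instead of being silently swallowed.
  static <T> CompletableFuture<T> traced(Tracer tracer, String op, CompletableFuture<T> write) {
    final Span span = tracer.buildSpan(op).start();
    return write.whenComplete((result, error) -> {
      if (error != null) {
        span.setTag("error", true);
        span.log(Map.of("event", "error", "message", String.valueOf(error.getMessage())));
        log.error("{} failed", op, error);
      }
      span.finish();
    });
  }
}
```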
DoD
Heroic should properly address exception handling.
Background
Heroic's Bigtable consumer failed to write to Bigtable when the new column family had not been created yet. No exception was logged, and the consumer acked the message as if the write had been successful. I got the exception below only through hacky debugger evaluations.
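A minimal, hypothetical sketch of the failure mode (not Heroic's actual consumer code): if the outcome of the write future is never inspected, the exception is swallowed and the message is acked regardless.

```java
import java.util.concurrent.CompletableFuture;

public class SilentFailureExample {
  // Illustration only: the write future fails, but because nothing
  // inspects the outcome, no exception is logged and the message is
  // acked regardless.
  static void consume(CompletableFuture<Void> bigtableWrite, Runnable ack) {
    bigtableWrite.whenComplete((ok, error) -> ack.run()); // acks even when error != null

    // What the fix needs to do instead: check the outcome before acking
    // and surface the failure.
    // bigtableWrite.whenComplete((ok, error) -> {
    //   if (error != null) { /* log + tag span, decide ack vs nack */ }
    //   else { ack.run(); }
    // });
  }
}
```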