materialize: extended logging #2019
Conversation
```diff
@@ -10,12 +10,12 @@ import (
 const (
 	// How often various the various async loggers will report their status.
-	loggingFrequency = 5 * time.Second
+	loggingFrequency = 5 * time.Minute
```
Does this mean we will not see any logs more frequently than every 5 minutes? I.e. will there always be a 5-minute delay between two logs from this stream?
I'll update the comment to state this as well:

When "extended" logging is enabled, transitions of interest will always be logged immediately, and only after `loggingFrequency` will status updates be logged if nothing else has happened. When extended logging is not enabled, a more generic status update is logged at this frequency.

So with extended logging for an OLAP materialization, we will for example see a "started reading store requests" log when it starts doing that, and then every 5 minutes until it finishes we'll get a "reading store requests in progress" log. For materializations that don't enable extended logging it would be pretty spammy to log such detailed information for every transaction, so they just use a logger that logs a general progress message at the `loggingFrequency`.
@williamhbaker is there a requirement that makes us want to suppress the instant logs from most connectors? I personally find them very useful in debugging connectors. I understand I can change the log level to debug to achieve this, but for already-running tasks that have had an issue in the past, it is useful to be able to go back in time and see what happened in what sequence 🤔
I was looking at the original issue that this pull-request aims to address, and the description by Phil is:
We've had a number of debugging sessions that suffered from lack of information in the logs. Ideally, we'd have some common logging in materialize-sql to help with this sort of thing.
So I think we want more information in the logs, but the simple logger will give us less information
As a point of clarification, with this change we will indeed be logging more for all materializations, just not quite as much more for some vs. others. The simple logger will give us less information than the extended logger, but the simple logger does give us more information than we currently get from materializations that will use it.

With this PR, OLAP materializations that typically run slow transactions will log a lot more, bringing in nearly all of the debug level logs that were previously in `RunTransactions` and logging those at info level, as well as status updates every `loggingFrequency` when they are in the middle of some long running operation like reading a lot of Store requests or running a long transaction.

Other materializations that may run multiple transactions per second will also log more than they did before, but not as much more by default as OLAP materializations, since that would be pretty spammy in most cases. Materializations that do not set `ExtendedLogging` and are logging at info level will now log the additional status updates every 5 minutes per `loggingFrequency`, which will describe what they are doing to some extent. Admittedly this is not as sophisticated as I could imagine it being, but it will indicate (among other things) when they have been waiting for documents from the runtime for a long time, or waiting on a long-running, possibly stuck commit, which are a couple of things we've previously seen.

The specific scenario that the original issue was opened for was a case of a `materialize-postgres` task that we suspect got hung up for a long time while opening and installing its checkpoint (see thread). Currently, for pretty much all materializations in this case, the last thing they will log is a line like `opening database` and then nothing else - and for most non-OLAP materializations, this is indeed the last thing they will ever log no matter what, including `materialize-postgres`. This PR includes some logging additions directly targeted at this case in this commit, in addition to the other logging it is adding.
Okay, I think it is easier if we see this in action, and then if we find it doesn't quite satisfy troubleshooting needs we will add more to it. It is hard for me to imagine what the logs look like now. But this is significant work that will pave the way for better logging, thanks!
```go
// do a delay, since we want at least the first storedHistorySize
// transactions to complete without a delay when the connector starts
// up.
belowThreshold := len(l.storedHistory) == storedHistorySize
```
The naming of this variable still doesn't make sense to me. It is called `belowThreshold`, but it seems to actually signify *above* or *at* threshold: we are first checking that we have reached the threshold number of stored histories (`==`, meaning "at" to me), then we are checking whether each of these stored histories is at or above the `storeThreshold` (`>=`), and in the end we use it in the if condition as an `aboveOrAtThreshold` by negating it
Thanks... honestly I struggled with this bit of code trying to express what it is really doing.
I think I've got something pretty clear now that I just pushed; it looks like this:
```go
// If there have been at least `storedHistorySize` transactions
// completed and none of the transactions were large enough to suggest
// we are still in the midst of a backfill, the acknowledgement delay
// may be applicable.
delayBasedOnStoredHistory := len(l.storedHistory) >= storedHistorySize && !slices.ContainsFunc(l.storedHistory, func(n int) bool {
	return n >= storeBackfillThreshold
})

if !delayBasedOnStoredHistory {
	ll.Info("not delaying commit acknowledgement based on stored history")
} else if d <= 0 { ...
```
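Pulled out of its receiver into a standalone, runnable form, the pushed check behaves as follows. The constant values and the `delayApplicable` wrapper name are assumptions for illustration; the thread doesn't show them:

```go
package main

import (
	"fmt"
	"slices"
)

const (
	storedHistorySize      = 5   // assumed value for illustration
	storeBackfillThreshold = 100 // assumed value for illustration
)

// delayApplicable mirrors the pushed logic: only consider delaying the
// commit acknowledgement once at least storedHistorySize transactions have
// completed, and none of them stored enough documents to look like a
// backfill still in progress.
func delayApplicable(storedHistory []int) bool {
	return len(storedHistory) >= storedHistorySize &&
		!slices.ContainsFunc(storedHistory, func(n int) bool {
			return n >= storeBackfillThreshold
		})
}

func main() {
	fmt.Println(delayApplicable([]int{1, 2, 3}))         // false: not enough history yet
	fmt.Println(delayApplicable([]int{1, 2, 3, 4, 500})) // false: looks like a backfill
	fmt.Println(delayApplicable([]int{1, 2, 3, 4, 5}))   // true: steady state
}
```

The `>=` on the history length also keeps the condition correct if the history buffer ever grows past its nominal size.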
Thanks Will!
Narrows the materialize "stream" interface to be focused on sending and receiving protocol messages. This is mostly removing baggage from when materializations only used gRPC and used the gRPC server interface more directly. Now they also support a JSON protocol so the prior implementation doesn't make much sense anymore. The lifecycle tests were also almost entirely redundant so rather than refactoring them I have deleted them.
Small refactoring to fix a linter complaint about a context leak of the loadCtx.
This logging will be implemented elsewhere in the coming commits, so the transactor is being cleaned up of most of its logging duties.
Creates a basic logger that will log out materialization status on a set frequency. Generally it will just log the current round and various counters, but if a commit has been running for a long time or the connector has been waiting for documents for a long time it will log information about that.
Adds an "extended" logger that will log materialization transaction progress in more detail, including the start & stop of most events of interest and how long they take to complete. This logging can be enabled via the new `MaterializeOptions` struct returned by `NewTransactor`. It will also be enabled if the log level is set to debug.
…ream Lifts the acknowledgement scheduling mechanism out of `RunTransactions` and into the new transactor message stream. This allows the new transactor message stream layer to provide useful logging information about what the materialization is doing in terms of an artificially applied delay. This also finishes up extracting meta operations out of the `RunTransactions` routine. It will no longer be responsible for computing acknowledgement delays or logging anything - that is all handled by the transactor message stream.
The `MaterializeOptions` configuration struct was added to the boilerplate in a previous commit to allow for materializations to tune specific high-level behaviors of transaction processing. This adapts all the materializations to use it.
The preceding commits added high-level logging capabilities that are applicable to all materializations, enabled via configuration. But there are still some details that only the materialization implementation itself has access to. For these events, a BindingEvents construct is provided. It can provide start/stop and progress timings for evaluating load requests (load queries), and start/stop timings for committing stores to specific bindings. Materializations don't have to use this, but they can opt in if it makes sense for their implementation. I have updated all of the materializations where it makes sense, also removing the now redundant logs that the boilerplate takes care of.
Getting the Open response for a materialization can take a long time, especially if it has to do a lot of work to setup the transactor or if querying a checkpoint from the store is not fast. This adds some progress logging for that case so it's more clear what the materialization is doing. Also a final round of refactoring: Split out the stream/logging logic into separate files, and added a general `repeatAsync` helper that several of these new logs make use of.
The 5 second and 15 second frequencies were just for testing, and the real values should be in minutes.
None of these needed to be exported anymore, so I unexported them. Also renamed some of them to be more clear, for example `ReadFlush` didn't actually read anything, but rather just validated that the last read message by the Load iterator was a "flush".
Description:
Adds more logging and modifies logging for materializations that use the Go materialization boilerplate.
The individual commit messages describe the changes in more detail, but here are the highlights:
- `RunTransactions` is getting a new wrapper to augment its functionality, and it will dispatch notifications for various events of interest that happen during the transactions lifecycle. Implemented here are loggers that will log messages in response to these events, and also a relocated acknowledgement delay mechanism that has been extracted from `RunTransactions`. Further behaviors could be added here, like the dbt job triggering, if they prove useful in the future for general features.
- The `DelayedCommitter` interface has been removed.
- `RunTransactions` has been simplified as much as possible, moving non-essential logging and auxiliary behaviors (like the ack delay) into the new stream wrapper.

One thing that is not being done here that I think should be done is reconciliation of the packages for `go/protocols/materialize` and `materialize-boilerplate`. They should both probably be under a common `go/materialize` package & there should be a similar `go/capture` package, but I'm saving that refactoring for later.

Closes #1872
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)