materialize: extended logging #2019

williamhbaker · 2024-10-04T15:32:07Z

Description:

Adds more logging and modifies logging for materializations that use the Go materialization boilerplate.

The individual commit messages describe the changes in more detail, but here are the highlights:

The message stream used by RunTransactions is getting a new wrapper to augment its functionality, and it will dispatch notifications for various events of interest that happen during the transactions lifecycle. Implemented here are loggers that will log messages in response to these events, and also a relocated acknowledgement delay mechanism that has been extracted from RunTransactions. Further behaviors could be added here, like the dbt job triggering, if they prove useful in the future for general features.
For the above, there's a configuration object that implementations are required to return that will control these broad behaviors. This is IMO a significant improvement over the prior requirement to implement the DelayedCommitter interface, which has been removed.
RunTransactions has been simplified as much as possible, removing non-essential logging and auxilliary behaviors (like the ack delay) into the new stream wrapper.
Lots of small changes to individual materializations to make use of all of the above.

One thing that is not being done here that I think should be done is reconciliation of the packages for the go/protocols/materialize and materialize-boilerplate. They should both probably be under a common go/materialize package & there should be a similar go/capture package, but I'm saving that refactoring for later.

Closes #1872

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)

This change is

go/protocols/materialize/lifecycle.go

go/protocols/materialize/transactor.go

materialize-boilerplate/aux_stream.go

materialize-boilerplate/transactions_stream.go

mdibaiee · 2024-10-07T15:13:53Z

materialize-boilerplate/logging.go

@@ -10,12 +10,12 @@ import (

 const (
 	// How often various the various async loggers will report their status.
-	loggingFrequency = 5 * time.Second
+	loggingFrequency = 5 * time.Minute


Does this mean we will not see any logs more frequently than 5 minutes? i.e. there will always be 5 minutes delay between two logs from this stream?

I'll update the comment to state this as well:

When "extended" logging is enabled, transitions of interest will always be logged immediately, and only after loggingFrequency will status updates be logged if nothing else has happened. When extended logging is not enabled, there will be a more generic status update logged out at this frequency.

So with extended logged for an OLAP materialization, we will for example see a "started reading store requests" logged when it starts doing that, and then every 5 minutes until it finishes we'll get a "reading store requests in progress" log. For materializations that don't enable extended logging it would be pretty spamming to log out such detail information for every transaction, so they just use a logger than logs a general progress message at the loggingFrequency.

@williamhbaker is there a requirement that makes us want to supress the instant logs from most connectors? I personally find them very useful in debugging connectors. I understand I can change the log level to debug to achieve this, but for already-running tasks that have had an issue in the past, it is useful to be able to go back in time and see what happened in what sequence 🤔

I was looking at the original issue that this pull-request aims to address, and the description by Phil is:

We've had a number of debugging sessions that suffered from lack of information in the logs. Ideally, we'd have some common logging in materialize-sql to help with this sort of thing.

So I think we want more information in logs, but I think the simple logger will give us less information

As a point of clarification, with this change we will indeed be logging more for all materializations, just not quite as much more for some vs. others. The simple logger will give us less information than the extended logger, but the simple logger does give us more information than we currently get from materializations that will use it.

With this PR, OLAP materializations that typically run slow transactions will log a lot more, bringing in nearly all of the debug level logs that were previously in RunTransactions and logging those at info level, as well as status updates every loggingFrequency when it they are in the middle of doing some long running operation like reading a lot of Store requests or running a long transaction.

Other materializations that may run multiple transactions per second will also log more than they did before, but not as much more by default as OLAP materializations since that would be pretty spammy in most cases. Materializations that do not set ExtendedLogging and are logging at info level will now log the additional status updates every 5 minutes per loggingFrequency, which will describe what they are doing to some extent. Admittedly this is not as sophisticated as I could imagine it being, but it will indicate (among other things) when they have been waiting for documents from the runtime for a long time, or are waiting for a long-running possibly stuck commit for a long time, which are a couple of things we've previously seen.

The specific scenario that the original issue was opened for was a case of a materialize-postgres task that we suspect got hung up for a long time when opening and installing its checkpoint (see thread). Currently for pretty much all materializations in this case the last thing they will log is a line like opening database and then nothing else - and for most non OLAP materializations, this is indeed the last thing they will ever log no matter what, including materialize-postgres. This PR includes some logging additions directly targeted at this case in this commit, in addition to the other logging it is adding.

Okay I think it is easier if we see this in action and then if we find it doesn't quite satisfy troubleshooting needs we will add more to it. It is hard for me to imagine what the logs look like now. But this is a significant work that will pave the way for better logging, thanks!

mdibaiee · 2024-10-08T08:32:59Z

materialize-boilerplate/transactions_stream.go

+		// do a delay, since we want at least the first storedHistorySize
+		// transactions to complete without a delay when the connector starts
+		// up.
+		belowThreshold := len(l.storedHistory) == storedHistorySize


The naming of this variable still doesn't make sense to me. It is called belowThreshold, but it seems to actually signify above or at threshold: we are first checking we have reached the threshold number of storedHistories (==, meaning "at" to me), and then we are checking whether each of these stored histories is at or above the storeThreshold (>=), and in the end we use it in the if condition as an aboveOrAtThreshold by negating it

Thanks...honestly I struggled with this bit of code to try to express what it is really doing.

I think I've got something pretty clear now that I just pushed, it looks like this:

// If there have been at least `storedHistorySize` transactions // completed and none of the transactions were large enough to suggest // we are still in the midst of a backfill, the acknowledgement delay // may be applicable. delayBasedOnStoredHistory := len(l.storedHistory) >= storedHistorySize && !slices.ContainsFunc(l.storedHistory, func(n int) bool { return n >= storeBackfillThreshold }) if !delayBasedOnStoredHistory { ll.Info("not delaying commit acknowledgement based on stored history") } else if d <= 0 { ...

mdibaiee · 2024-10-08T08:34:41Z

materialize-boilerplate/logging.go

@@ -10,12 +10,12 @@ import (

 const (
 	// How often various the various async loggers will report their status.
-	loggingFrequency = 5 * time.Second
+	loggingFrequency = 5 * time.Minute


@williamhbaker is there a requirement that makes us want to supress the instant logs from most connectors? I personally find them very useful in debugging connectors. I understand I can change the log level to debug to achieve this, but for already-running tasks that have had an issue in the past, it is useful to be able to go back in time and see what happened in what sequence 🤔

mdibaiee

Thanks Will!

Narrows the materialize "stream" interface to be focused on sending and receiving protocol messages. This is mostly removing baggage from when materializations only used gRPC and used the gRPC server interface more directly. Now they also support a JSON protocol so the prior implementation doesn't make much sense anymore. The lifecycle tests were also almost entirely redundant so rather than refactoring them I have deleted them.

Small refactoring to fix a linter complaint about a context leak of the loadCtx.

This logging will be implemented elsewhere in the coming commits, so the transactor is being cleaned up of most of its logging duties.

Creates a basic logger that will log out materialization status on a set frequency. Generally it will just log the current round and various counters, but if a commit has been running for a long time or the connector has been waiting for documents for a long time it will log information about that.

Adds an "extended" logger that will log materialization transaction progress in more detail, including the start & stop of most events of interest and how long they take to complete. This logging can be enabled via the new `MaterializeOptions` struct returned by `NewTransactor`. It will also be enabled if the log level is set to debug.

…ream Lifts the acknowledgement scheduling mechanism out of `RunTransactions` and into the new transactor message stream. This allows the new transactor message stream layer to provide useful logging information about what the materialization is doing in terms of an artificially applied delay. This also finishes up extracting meta operations out of the `RunTransactions` routine. It will no longer be responsible for computing acknowledgement delays or logging anything - that is all handled by the transactor message stream.

The `MaterializeOptions` configuration struct was added to the boilerplate in a previous commit to allow for materializations to tune specific high-level behaviors of transaction processing. This adapts all the materializations to use it.

The preceding commits added high-level logging capabilities that are applicable to all materializations, enabled via on configuration. But there are still some details that only the materialization implementation itself has access to. For these events, a BindingEvents construct is provided. It can provide start/stop and progress timings for evaluating load requests (load queries), and start/stop timings for committing stores to specific bindings. Materializations don't have to use this, but they can opt-in if it makes sense for their implementation. I have updated all of the materializations as it makes sense, also removing the now redundant logs that the boilerplate takes care of.

Getting the Open response for a materialization can take a long time, especially if it has to do a lot of work to setup the transactor or if querying a checkpoint from the store is not fast. This adds some progress logging for that case so it's more clear what the materialization is doing. Also a final round of refactoring: Split out the stream/logging logic into separate files, and added a general `repeatAsync` helper that several of these new logs make use of.

The 5 second and 15 second frequencies were just for testing, and the real values should be in minutes.

None of these needed to be exported anymore, so I unexported them. Also renamed some of them to be more clear, for example `ReadFlush` didn't actually read anything, but rather just validated that the last read message by the Load iterator was a "flush".

williamhbaker force-pushed the wb/materialization-logging branch 2 times, most recently from 57b7a26 to 4a05ddc Compare October 4, 2024 15:57

williamhbaker changed the title ~~Wb/materialization logging~~ materialize: extended logging Oct 4, 2024

williamhbaker force-pushed the wb/materialization-logging branch from 4a05ddc to 2c02325 Compare October 4, 2024 16:46

williamhbaker added the change:planned This is a planned change label Oct 4, 2024

williamhbaker marked this pull request as ready for review October 4, 2024 20:45

williamhbaker requested a review from mdibaiee October 4, 2024 20:45

mdibaiee reviewed Oct 7, 2024

View reviewed changes

williamhbaker force-pushed the wb/materialization-logging branch from a4a5505 to 741d40b Compare October 7, 2024 20:26

mdibaiee reviewed Oct 8, 2024

View reviewed changes

williamhbaker force-pushed the wb/materialization-logging branch from 741d40b to c43ddc2 Compare October 8, 2024 13:05

mdibaiee approved these changes Oct 8, 2024

View reviewed changes

williamhbaker added 13 commits October 8, 2024 10:20

materialize: fix linter complaint about context leak

9702cd9

Small refactoring to fix a linter complaint about a context leak of the loadCtx.

materialize: remove most logging from transactor

2db1e2e

This logging will be implemented elsewhere in the coming commits, so the transactor is being cleaned up of most of its logging duties.

materialize: rename auxiliary stream to "transactions stream"

aff8cf7

materialize: fix delay acknowledgement based on stored history

275dd80

materialize: adjust loggers to use reasonable frequencies

1f7866d

The 5 second and 15 second frequencies were just for testing, and the real values should be in minutes.

williamhbaker force-pushed the wb/materialization-logging branch from c43ddc2 to ede4d19 Compare October 8, 2024 14:22

williamhbaker merged commit 420733a into main Oct 8, 2024
50 of 52 checks passed

williamhbaker deleted the wb/materialization-logging branch October 8, 2024 14:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

materialize: extended logging #2019

materialize: extended logging #2019

williamhbaker commented Oct 4, 2024 •

edited

Loading

mdibaiee Oct 7, 2024

williamhbaker Oct 7, 2024

mdibaiee Oct 8, 2024

mdibaiee Oct 8, 2024 •

edited

Loading

williamhbaker Oct 8, 2024

mdibaiee Oct 8, 2024

mdibaiee Oct 8, 2024

williamhbaker Oct 8, 2024

mdibaiee Oct 8, 2024

mdibaiee left a comment

materialize: extended logging #2019

materialize: extended logging #2019

Conversation

williamhbaker commented Oct 4, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdibaiee Oct 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdibaiee left a comment

Choose a reason for hiding this comment

williamhbaker commented Oct 4, 2024 •

edited

Loading

mdibaiee Oct 8, 2024 •

edited

Loading