Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

materialize: extended logging #2019

Merged
merged 13 commits into from
Oct 8, 2024
Merged

Conversation

williamhbaker
Copy link
Member

@williamhbaker williamhbaker commented Oct 4, 2024

Description:

Adds more logging and modifies logging for materializations that use the Go materialization boilerplate.

The individual commit messages describe the changes in more detail, but here are the highlights:

  • The message stream used by RunTransactions is getting a new wrapper to augment its functionality, and it will dispatch notifications for various events of interest that happen during the transactions lifecycle. Implemented here are loggers that will log messages in response to these events, and also a relocated acknowledgement delay mechanism that has been extracted from RunTransactions. Further behaviors could be added here, like the dbt job triggering, if they prove useful in the future for general features.
  • For the above, there's a configuration object that implementations are required to return that will control these broad behaviors. This is IMO a significant improvement over the prior requirement to implement the DelayedCommitter interface, which has been removed.
  • RunTransactions has been simplified as much as possible, removing non-essential logging and auxilliary behaviors (like the ack delay) into the new stream wrapper.
  • Lots of small changes to individual materializations to make use of all of the above.

One thing that is not being done here that I think should be done is reconciliation of the packages for the go/protocols/materialize and materialize-boilerplate. They should both probably be under a common go/materialize package & there should be a similar go/capture package, but I'm saving that refactoring for later.

Closes #1872

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


This change is Reviewable

@williamhbaker williamhbaker force-pushed the wb/materialization-logging branch 2 times, most recently from 57b7a26 to 4a05ddc Compare October 4, 2024 15:57
@williamhbaker williamhbaker changed the title Wb/materialization logging materialize: extended logging Oct 4, 2024
@williamhbaker williamhbaker added the change:planned This is a planned change label Oct 4, 2024
@williamhbaker williamhbaker marked this pull request as ready for review October 4, 2024 20:45
go/protocols/materialize/lifecycle.go Outdated Show resolved Hide resolved
go/protocols/materialize/transactor.go Outdated Show resolved Hide resolved
materialize-boilerplate/aux_stream.go Outdated Show resolved Hide resolved
materialize-boilerplate/transactions_stream.go Outdated Show resolved Hide resolved
@@ -10,12 +10,12 @@ import (

const (
// How often various the various async loggers will report their status.
loggingFrequency = 5 * time.Second
loggingFrequency = 5 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we will not see any logs more frequently than 5 minutes? i.e. there will always be 5 minutes delay between two logs from this stream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the comment to state this as well:

When "extended" logging is enabled, transitions of interest will always be logged immediately, and only after loggingFrequency will status updates be logged if nothing else has happened. When extended logging is not enabled, there will be a more generic status update logged out at this frequency.

So with extended logged for an OLAP materialization, we will for example see a "started reading store requests" logged when it starts doing that, and then every 5 minutes until it finishes we'll get a "reading store requests in progress" log. For materializations that don't enable extended logging it would be pretty spamming to log out such detail information for every transaction, so they just use a logger than logs a general progress message at the loggingFrequency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@williamhbaker is there a requirement that makes us want to supress the instant logs from most connectors? I personally find them very useful in debugging connectors. I understand I can change the log level to debug to achieve this, but for already-running tasks that have had an issue in the past, it is useful to be able to go back in time and see what happened in what sequence 🤔

Copy link
Member

@mdibaiee mdibaiee Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the original issue that this pull-request aims to address, and the description by Phil is:

We've had a number of debugging sessions that suffered from lack of information in the logs. Ideally, we'd have some common logging in materialize-sql to help with this sort of thing.

So I think we want more information in logs, but I think the simple logger will give us less information

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a point of clarification, with this change we will indeed be logging more for all materializations, just not quite as much more for some vs. others. The simple logger will give us less information than the extended logger, but the simple logger does give us more information than we currently get from materializations that will use it.

With this PR, OLAP materializations that typically run slow transactions will log a lot more, bringing in nearly all of the debug level logs that were previously in RunTransactions and logging those at info level, as well as status updates every loggingFrequency when it they are in the middle of doing some long running operation like reading a lot of Store requests or running a long transaction.

Other materializations that may run multiple transactions per second will also log more than they did before, but not as much more by default as OLAP materializations since that would be pretty spammy in most cases. Materializations that do not set ExtendedLogging and are logging at info level will now log the additional status updates every 5 minutes per loggingFrequency, which will describe what they are doing to some extent. Admittedly this is not as sophisticated as I could imagine it being, but it will indicate (among other things) when they have been waiting for documents from the runtime for a long time, or are waiting for a long-running possibly stuck commit for a long time, which are a couple of things we've previously seen.

The specific scenario that the original issue was opened for was a case of a materialize-postgres task that we suspect got hung up for a long time when opening and installing its checkpoint (see thread). Currently for pretty much all materializations in this case the last thing they will log is a line like opening database and then nothing else - and for most non OLAP materializations, this is indeed the last thing they will ever log no matter what, including materialize-postgres. This PR includes some logging additions directly targeted at this case in this commit, in addition to the other logging it is adding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I think it is easier if we see this in action and then if we find it doesn't quite satisfy troubleshooting needs we will add more to it. It is hard for me to imagine what the logs look like now. But this is a significant work that will pave the way for better logging, thanks!

// do a delay, since we want at least the first storedHistorySize
// transactions to complete without a delay when the connector starts
// up.
belowThreshold := len(l.storedHistory) == storedHistorySize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of this variable still doesn't make sense to me. It is called belowThreshold, but it seems to actually signify above or at threshold: we are first checking we have reached the threshold number of storedHistories (==, meaning "at" to me), and then we are checking whether each of these stored histories is at or above the storeThreshold (>=), and in the end we use it in the if condition as an aboveOrAtThreshold by negating it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks...honestly I struggled with this bit of code to try to express what it is really doing.

I think I've got something pretty clear now that I just pushed, it looks like this:

// If there have been at least `storedHistorySize` transactions
// completed and none of the transactions were large enough to suggest
// we are still in the midst of a backfill, the acknowledgement delay
// may be applicable.
delayBasedOnStoredHistory := len(l.storedHistory) >= storedHistorySize && !slices.ContainsFunc(l.storedHistory, func(n int) bool {
	return n >= storeBackfillThreshold
})

if !delayBasedOnStoredHistory {
	ll.Info("not delaying commit acknowledgement based on stored history")
} else if d <= 0 { ...

@@ -10,12 +10,12 @@ import (

const (
// How often various the various async loggers will report their status.
loggingFrequency = 5 * time.Second
loggingFrequency = 5 * time.Minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@williamhbaker is there a requirement that makes us want to supress the instant logs from most connectors? I personally find them very useful in debugging connectors. I understand I can change the log level to debug to achieve this, but for already-running tasks that have had an issue in the past, it is useful to be able to go back in time and see what happened in what sequence 🤔

Copy link
Member

@mdibaiee mdibaiee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Will!

Narrows the materialize "stream" interface to be focused on sending and
receiving protocol messages. This is mostly removing baggage from when
materializations only used gRPC and used the gRPC server interface more
directly. Now they also support a JSON protocol so the prior implementation
doesn't make much sense anymore.

The lifecycle tests were also almost entirely redundant so rather than
refactoring them I have deleted them.
Small refactoring to fix a linter complaint about a context leak of the loadCtx.
This logging will be implemented elsewhere in the coming commits, so the
transactor is being cleaned up of most of its logging duties.
Creates a basic logger that will log out materialization status on a set
frequency. Generally it will just log the current round and various counters,
but if a commit has been running for a long time or the connector has been
waiting for documents for a long time it will log information about that.
Adds an "extended" logger that will log materialization transaction progress in
more detail, including the start & stop of most events of interest and how long
they take to complete.

This logging can be enabled via the new `MaterializeOptions` struct returned by
`NewTransactor`. It will also be enabled if the log level is set to debug.
…ream

Lifts the acknowledgement scheduling mechanism out of `RunTransactions` and into
the new transactor message stream. This allows the new transactor message stream
layer to provide useful logging information about what the materialization is
doing in terms of an artificially applied delay.

This also finishes up extracting meta operations out of the `RunTransactions`
routine. It will no longer be responsible for computing acknowledgement delays
or logging anything - that is all handled by the transactor message stream.
The `MaterializeOptions` configuration struct was added to the boilerplate in a
previous commit to allow for materializations to tune specific high-level
behaviors of transaction processing. This adapts all the materializations to use
it.
The preceding commits added high-level logging capabilities that are applicable
to all materializations, enabled via on configuration. But there are still some
details that only the materialization implementation itself has access to.

For these events, a BindingEvents construct is provided. It can provide
start/stop and progress timings for evaluating load requests (load queries), and
start/stop timings for committing stores to specific bindings.

Materializations don't have to use this, but they can opt-in if it makes sense
for their implementation.

I have updated all of the materializations as it makes sense, also removing the
now redundant logs that the boilerplate takes care of.
Getting the Open response for a materialization can take a long time, especially
if it has to do a lot of work to setup the transactor or if querying a
checkpoint from the store is not fast. This adds some progress logging for that
case so it's more clear what the materialization is doing.

Also a final round of refactoring: Split out the stream/logging logic into
separate files, and added a general `repeatAsync` helper that several of these
new logs make use of.
The 5 second and 15 second frequencies were just for testing, and the real
values should be in minutes.
None of these needed to be exported anymore, so I unexported them. Also renamed
some of them to be more clear, for example `ReadFlush` didn't actually read
anything, but rather just validated that the last read message by the Load
iterator was a "flush".
@williamhbaker williamhbaker merged commit 420733a into main Oct 8, 2024
50 of 52 checks passed
@williamhbaker williamhbaker deleted the wb/materialization-logging branch October 8, 2024 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:planned This is a planned change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

materializations: standardize and improve logging
2 participants