Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[POC] Using gls library to provide context rather than attaching it to errors #4293

Closed
wants to merge 10 commits into from

Conversation

oxtoacart
Copy link
Contributor

cc: @uaalto

Our new errors package attaches contextual information (e.g. user-agent, proxy ip, origin host, etc.) to errors so that we can report that contextual information in our logs and to Borda. After playing with the library, I found myself missing a few things:

  1. Context only propagates up, not down. For example, if the caller of my function knows something about the user agent and my function knows something about the proxy address, within my function I can only log the proxy address, not the user agent. Only the caller can know both pieces of information.
  2. Context is only available on errors. We have all this useful debug logging already in place, and it would be nice the see the context on there as well.
  3. If multiple errors are raised within the same context, they all have to manually attach that context. For example, inside of our proxy dialer, we not only attempt to open a TLS connection, but we also verify the certificates. The proxy address is the same for both errors, and I have to manually attach it in both places.
  4. Everyone has to wrap! Let's say that I have a call chain A -> B -> C. C raises an error, wrapping it with our errors package to attach some useful metadata. Unfortunately, package B treats it like a regular error and actually just prepends some text to it before passing it to A (or worse just logs the error and doesn't raise it on). A now has no access to the context information from C. If we own package B, this is extra work, but if we don't own package B there's no solution unless we want to fork the code.

I had earlier suggested that we look into the tylerb/gls library for capturing contextual information in the context of our goroutines (kind of like a Java ThreadLocal). I went ahead and prototyped something using that approach, and I have to say that I like it better. It completely solves the first three of the above problems and mostly solves the 4th. It doesn't completely solve the 4th problem because if package B is creating a new goroutine, we have to do some special work in there to keep the context. Especially for things like HTTP request processing, raising new goroutines is rare, so I'm less concerned about this than errors not getting wrapped.

The context-based approach also seems less invasive, since we only need to add code where we have new contextual information that we want to add and all errors logged within those contexts automatically benefit from the information.

Here is an excerpt from my logs that shows:

  1. Both ERROR and DEBUG logging benefit
  2. Even logging from low-level packages like balancer shows contextual information like user_agent and request_id that those packages have no knowledge about. In fact, in this example, balancer doesn't itself know any of the logged contextual information.
May 15 02:40:02.099 - 0m5s ERROR balancer: balancer.go:81 Unable to dial via (trusted) chained proxy at 45.55.130.104:443 [] to connect://www.google-analytics.com:443: Unable to dial server (trusted) chained proxy at 45.55.130.104:443 []: read tcp xxx.xxx.xxx.xxx:54174->xx.xxx.xxx.xxx:443: read: connection reset by peer on pass 2...continuing [op=proxy origin=www.google-analytics.com:443 proxy_host=45.55.130.104 proxy_port=443 proxy_protocol=https request_id=3266353486698850603 user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36]
May 15 02:40:02.100 - 0m5s DEBUG flashlight.client: handler.go:133 Could not dial Still unable to dial connect://www.google-analytics.com:443 after 3 attempts [op=proxy origin=www.google-analytics.com:443 proxy_host=xx.xxx.xxx.xxx proxy_port=443 proxy_protocol=https request_id=3266353486698850603 user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36]

"user_agent": userAgent,
"request_id": rand.Int63(),
"origin": req.Host,
})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this call, anything that we do in processing this request is tagged with these 4 pieces of context information.

@oxtoacart oxtoacart changed the title [POC] Using gls library to provide context rathar than attaching it to errors [POC] Using gls library to provide context rather than attaching it to errors May 15, 2016
@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 72.015% when pulling bc10f0a on contextlog into 3e2f1c5 on devel.

@oxtoacart
Copy link
Contributor Author

Closing in lieu of #4295.

@oxtoacart oxtoacart closed this May 15, 2016
@oxtoacart
Copy link
Contributor Author

For your point 1, context passes information top down, while Error attaches information bottom up, which is complementary to each other. Without the bottom up propagation, we have to log or report immediately when the error occurs, or the details will be lost.

I agree, they are definitely complementary.

As for your point 4, typed error is an recommendation for complex packages. Many 3rd party logging library, like logrus and log15, all use structured logging. We probably can consider evolve in that direction.

Sorry, I don't quite understand. The structured logging here just means that they actually log JSON rather than a string, right? We're essentially talking about doing the same in terms of submitting our errors as measurements, right? My point was about something else, namely that to ensure that the information attached to our error propagates in a typed form, we have to make sure it's handled properly along the entire path between whoever is logging it and whoever raised it.

@oxtoacart
Copy link
Contributor Author

Goroutine - context: Not all Goroutines require the gls data, but we have to establish a convention to always call ctx.Go. With Context, we can pass the context to go func(), or create new context from the one in outer scope when necessary.

I don't understand how passing the context along is any better. Either way, the intermediary function that's spawning the goroutine has to know to pass this state. Simply using context.Go() by convention actually seems easier, because I don't have to modify my method to accept a Context.

For example:

func doStuffInGoroutine(myarg string, myotherarg int) {
    context.Go(func() {
        doStuff(myarg, myotherarg)
    })
}

vs.

func doStuffInGoroutine(myarg string, myotherarg int, ctx *Context) {
    go func() {
        doStuff(myarg, myotherarg, ctx)
    }
}

Either way we're aware of the need to pass context, but the 2nd one has a more verbose syntax.

Context - error: When EOF occurs when dialing proxy, logging or reporting the user agent adds nothing but noise. We should select those context really makes sense. If it becomes tedious to extract same set of information from context, a handy custom helper FromContext(e *Error, ctx Context) *Error would be enough.

I don't think the user-agent is noise. Perhaps some user agents behave in a way that causes the EOF (for example the user agent disconnects early). That's something that we'd want to see in our Influx data. I think it was actually you who convinced me that we should collect as much information as is reasonably possible into the raw Influx data to see what we can learn from it.

@oxtoacart
Copy link
Contributor Author

BTW, getting back to the point about attaching metadata to errors being complementary to passing it down via context, I think what would be really cool is if the errors package integrated with context so that a call to error.New() or error.Wrap automatically attaches the context. That way, we can use the context setting convention to fill our context and our errors, killing two birds with one stone.

@oxtoacart
Copy link
Contributor Author

Sorry for the confusion. I just recommend to use fields instead of formatted strings to create both error and log entry in the future. That would eliminate return fmt.Errorf("Fail to xxx: %v", err) style completely, so code can either: 1. return error as is, or 2. attach fields use errors package. What if there're 3rd party package in between? I don't know but didn't see any occurrence. Anyway we need to propagate errors bottom up.

I agree, the fmt.Errorf() style is something I regret doing. It was convenient at first, but it loses too much information.

I think this points out something important, namely that we have two types of data related to errors and logging.

  1. Data about the error, like what was the underlying error, where in the code did the error occur, what were the data parameters that led to the error, etc.
  2. Data about the context in which the error occurred, such as high-level operation (proxying), user-agent, whether or not we're detouring, the user's locale, etc.

I'm thinking that the best approach is a combination where:

a. We populate context before errors even happen so that we have that available both for errors and for debug logging

b. At the point that an error occurs, we switch to using structured errors that capture information about the error in a machine-readable format (e.g. callsite information, nested error preserved in its original form, etc.) In a way, Java's exceptions are a good example here in the way that they contain typed fields and also nested causes.

@oxtoacart
Copy link
Contributor Author

oxtoacart commented May 15, 2016

Ha comments become cluttered. By saying passing context, I mean passing context when necessary instead of calling context.Go in all occurrences.

My view is that intermediary layers don't know when context is or is not necessary, so they have to assume it always is. For example, something like this:

conn, err := idletiming.Wrap(dialFN)

Taking our real context as an example, with user-agent, request id, origin host, etc., the idletiming.Wrap function may not care about any of this because it doesn't even log anything or return errors, however the dialFN may care because it does. The only solution is to pass the context through all the time, so now we have:

conn, err := idletiming.Wrap(ctx, dialFN)

And dialFN has to take a context as well. And so on. That's a lot of functions that need to change, and it clutters the code. With my context implementation, most of the time nothing is required. If idletiming happens to open a goroutine (which we do relatively rarely on the critical path), then it just switches to context.Go, which is quite easy and an isolated change.

@oxtoacart
Copy link
Contributor Author

BTW, this branch has a bug that causes context to leak, which does pollute the log. That's fixed in #4295.

@oxtoacart oxtoacart deleted the contextlog branch April 28, 2020 21:06
@getlantern getlantern deleted a comment from fffw Mar 3, 2021
@getlantern getlantern deleted a comment from fffw Mar 3, 2021
@getlantern getlantern deleted a comment from fffw Mar 3, 2021
@getlantern getlantern deleted a comment from fffw Mar 3, 2021
@getlantern getlantern deleted a comment from fffw Mar 3, 2021
@getlantern getlantern deleted a comment from fffw Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants