Investigation about tracing, logging, and metrics support. #482

liurenjie1024 · 2024-07-26T09:23:18Z

As a library for production usage, it would be useful to add logging, metrics, and tracing support. We need to investigate how to fit into the Rust ecosystem as much as possible.

Xuanwo · 2024-07-26T09:49:54Z

I suggest the following set:

logging: log
metrics: prometheus_client
tracing: fastrace

Why not `tracing`?

Tracing is widely used, but it often confuses the scope of logging and tracing.

From a tracing perspective, it's 100 times slower than fastrace.

Meanwhile, log offers the most compatible solution for all logging systems, making integration easier for users.

Why not `prometheus`?

The prometheus project maintained by the TiKV community is almost dead.

Why not `metrics`?

It makes sense to introduce an abstraction based on metrics tailored to my own usage. However, its abstraction does not effectively support prometheus, which is the most widely used metrics collector.

I suggest we first integrate directly with Prometheus, and consider metrics later to accommodate requests for support of other metrics tools like tcp.

sdd · 2024-07-26T19:09:13Z

Personally I'd support the tracing crate over logging. Adoption of tracing has supplanted that of logging over the past few years. It is more capable and flexible than logging. I disagree with the sentiment that it confuses things.

Xuanwo · 2024-07-27T05:32:30Z

> Adoption of tracing has supplanted that of logging over the past few years.

Users who rely on tracing can still integrate with our log, as tracing has native integration. However, if we adopt tracing, then log users will not be able to integrate with us. Therefore, I believe libraries like Iceberg should use log instead of tracing. Nonetheless, I'm open to using tracing_subscriber in our tests for a better user experience.

Here is a decision matrix:

| Does it work? | iceberg uses `log` | iceberg uses `tracing` |
| --- | --- | --- |
| user uses `env_logger` (`log` ecosystem) | YES | NO |
| user uses `tracing_subscriber` (`tracing` ecosystem) | YES | YES |

liurenjie1024 · 2024-07-28T15:24:32Z

Hi, @sdd

> It is more capable and flexible than logging.

Could you elaborate on this? Previously I thought the tracing crate was better for logging since it provides structured logging, but log also provides structured logging now.

As for using tracing with logging backends, there is tracing-log, but it seems it's not actively maintained.

I'm trying to summarize the proposed approaches here:

`log` + `fastrace`

Pros:

Each crate focuses on its own functionality only, which means each has the most widely supported ecosystem, and it may be easier to enable logging and tracing separately.
fastrace reports that it's much faster than tracing.

Cons:

tracing seems to have a better ecosystem if you look at their main page.
It seems that fastrace is less popular than tracing, which means it may be more difficult to integrate with other libraries. For example, if someone wants to use iceberg-rust in an application together with h2, which relies on the tracing library, does the user need to set up two collectors/subscribers?

`tracing` only

Pros:

The most widely used tracing library in Rust, with the best integrations.

Cons:

It's mixing logging with tracing. If we use tracing for logging, our logging will not be compatible with logger systems.
It's reported to be slower than fastrace.

So how about we use log + tracing? We would stick to the rule of using the tracing library for tracing only, and use log for logging.

twuebi · 2024-07-29T08:20:13Z

Hi all,

I agree with @sdd and suggest sticking with tracing. It is the default choice for logging in async applications and has the most widespread adoption.

Regarding the performance arguments, I doubt that this matters in any real-world benchmark of this crate: any overhead incurred by tracing will be dwarfed by the network latencies of whatever cloud storage service serves the data files, round trips to a DB catalog, or API calls to a REST catalog. I'm strongly in favor of sticking with a tried and tested implementation and putting some real-world benchmarks in place. Only if those show tracing to be a bottleneck should we consider choosing a less active alternative.

I also do not agree with the sentiment that it is a problem to address tracing and logging in the same crate. Rather, it provides very useful debugging utilities. For example, when used in a server, some middleware can create a span containing a request id. Every (tracing) log statement within that span will then have the request id attached to it, so your favorite log aggregator can filter by this request id to get all logs pertaining to it. This is incredibly helpful when dealing with production issues.
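The mechanism described above can be sketched in plain Rust without any dependencies. This is a toy model and not the tracing crate's API: `enter_span`, `SpanGuard`, and `log_line` are hypothetical names, and the `request_id` value used in the note below is made up. It only shows the idea that a span entered by middleware decorates every log line emitted while it is active.

```rust
use std::cell::RefCell;

thread_local! {
    // Fields of the spans currently entered on this thread, innermost last.
    static ACTIVE_SPANS: RefCell<Vec<(String, String)>> = RefCell::new(Vec::new());
}

/// Guard that removes its span's field again when dropped (hypothetical helper).
pub struct SpanGuard;

impl Drop for SpanGuard {
    fn drop(&mut self) {
        ACTIVE_SPANS.with(|spans| {
            spans.borrow_mut().pop();
        });
    }
}

/// Enter a span carrying one field, e.g. a request id set by middleware.
pub fn enter_span(key: &str, value: &str) -> SpanGuard {
    ACTIVE_SPANS.with(|spans| {
        spans.borrow_mut().push((key.to_string(), value.to_string()));
    });
    SpanGuard
}

/// Format a log line, prefixed with the fields of every active span.
pub fn log_line(message: &str) -> String {
    ACTIVE_SPANS.with(|spans| {
        let context: Vec<String> = spans
            .borrow()
            .iter()
            .map(|(k, v)| format!("{k}={v}"))
            .collect();
        if context.is_empty() {
            message.to_string()
        } else {
            format!("[{}] {}", context.join(" "), message)
        }
    })
}
```

While a guard returned by `enter_span("request_id", "req-42")` is alive, `log_line("loading manifest list")` yields `[request_id=req-42] loading manifest list`, even though the code that logs knows nothing about the request. Once the guard is dropped, lines are emitted without the prefix.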

Regarding metrics, I'm not sure I understand the issue. There is https://crates.io/crates/metrics-prometheus, which provides integration of prometheus with metrics.rs.

c-thiel · 2024-07-30T09:22:22Z

Hey guys, some thoughts from my side:

I don't think performance matters much: we are talking nanoseconds, while round trips to S3 and catalogs take milliseconds. Thus, the wider adoption of tracing is a clear plus for me.
We are an IO-heavy crate, with lots of round trips via OpenDAL to object stores, catalogs and more. IMO we would benefit a lot from the tracing features over plain logging. It would enable us to consistently track operations across all affected systems, including forwarding of trace ids to them. Iceberg-Rust is one component in an inherently distributed system that we call a Lakehouse. Distributed systems can best be monitored as a whole if traces are used everywhere along the way.
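To put the nanoseconds versus milliseconds argument into numbers, here is a back-of-the-envelope sketch. All inputs are illustrative assumptions, not measurements of tracing, fastrace, S3, or any catalog, and `tracing_share` is a hypothetical helper.

```rust
/// Fraction of total request time spent on tracing overhead.
///
/// `span_cost_ns`: assumed cost of creating and closing one span, in nanoseconds.
/// `spans_per_request`: assumed number of spans recorded per request.
/// `io_latency_ms`: assumed network time spent per request, in milliseconds.
pub fn tracing_share(span_cost_ns: f64, spans_per_request: f64, io_latency_ms: f64) -> f64 {
    let overhead_ns = span_cost_ns * spans_per_request;
    let io_ns = io_latency_ms * 1_000_000.0;
    overhead_ns / (overhead_ns + io_ns)
}
```

With an assumed 1,000 ns per span, 100 spans per request, and 20 ms of network time, the overhead is 0.1 ms against 20 ms of I/O, so `tracing_share(1_000.0, 100.0, 20.0)` comes out at roughly 0.5% of the request. Under these assumptions, a tracer that is 100 times cheaper would only move that share from about 0.5% to about 0.005%.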

Xuanwo · 2024-07-30T09:32:02Z

Thank you all for joining the discussion. It seems most people prefer using tracing for trace and log. I'm willing to accept this since it's more important for us to move forward.

The remaining part from my side is metrics: what do you think about prometheus_client and metrics?

Context from opendal.

OpenDAL supports ALL. Regardless of the library chosen, we simply need to enable the corresponding layer for it.

| Name | Depends | Description |
| --- | --- | --- |
| `DtraceLayer` | probe | Support User Statically-Defined Tracing (aka USDT) on Linux |
| `LoggingLayer` | log | Add log for every operation. |
| `MetricsLayer` | metrics | Add metrics for every operation. |
| `FastraceLayer` | fastrace | Add fastrace for every operation. |
| `OtelTraceLayer` | opentelemetry::trace | Add opentelemetry::trace for every operation. |
| `PrometheusClientLayer` | prometheus_client | Add prometheus metrics for every operation. |
| `PrometheusLayer` | prometheus | Add prometheus metrics for every operation. |
| `TracingLayer` | tracing | Add tracing for every operation. |
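As a rough illustration of why a layered design makes the choice of library cheap to change, here is a dependency-free sketch of the decorator pattern that such layers follow. It is not OpenDAL's actual API: the `Access` trait, `InMemory`, and `RecordingLayer` are hypothetical names invented for this example.

```rust
use std::cell::RefCell;

/// Minimal storage interface (hypothetical, far simpler than a real accessor).
pub trait Access {
    fn read(&self, path: &str) -> Option<String>;
}

/// A trivial backend holding a single file.
pub struct InMemory {
    pub path: String,
    pub content: String,
}

impl Access for InMemory {
    fn read(&self, path: &str) -> Option<String> {
        (path == self.path).then(|| self.content.clone())
    }
}

/// A layer wraps any `Access` and observes each operation after delegating.
/// A logging, metrics, or tracing layer would differ only in what it records.
pub struct RecordingLayer<A: Access> {
    inner: A,
    pub events: RefCell<Vec<String>>,
}

impl<A: Access> RecordingLayer<A> {
    pub fn new(inner: A) -> Self {
        Self { inner, events: RefCell::new(Vec::new()) }
    }
}

impl<A: Access> Access for RecordingLayer<A> {
    fn read(&self, path: &str) -> Option<String> {
        let result = self.inner.read(path);
        self.events
            .borrow_mut()
            .push(format!("read path={path} found={}", result.is_some()));
        result
    }
}
```

Because `RecordingLayer` itself implements `Access`, callers cannot tell a wrapped backend from a bare one, which is what lets the observability backend be swapped without touching the code that reads data.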

sdd · 2024-07-30T20:12:20Z

Some additional context. Renjie mentioned:

> Cons:
> It's mixing logging with tracing. If we use tracing for logging, our logging will not be compatible with logger systems.

This is not quite right. You can use the tracing_log crate to redirect any other crate's logs that went to log to tracing:

use tracing_log::LogTracer;

// redirects all output sent to log:* to their tracing::* equivalent
LogTracer::init().expect("Failed to set logger");

Regarding metrics, my preference would be either metrics or opentelemetry. Both are vendor-agnostic with pluggable exporters. My first choice would be opentelemetry, as it seems to have better cross-language recognition and a somewhat more active and vibrant community (I'm basing this on GitHub stars, watchers, number of contributors, and commit / merge frequency).
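"Vendor-agnostic with pluggable exporters" roughly means that library code records against an interface and the application decides where the numbers go. A dependency-free sketch of that shape follows. It is not the API of metrics or opentelemetry: `Recorder`, `InMemoryRecorder`, `NoopRecorder`, `scan_table`, and the counter name are all hypothetical.

```rust
use std::collections::HashMap;

/// What library code records against (hypothetical facade).
pub trait Recorder {
    fn increment_counter(&mut self, name: &str, value: u64);
}

/// One possible backend: keep counters in memory, e.g. for a scrape endpoint.
#[derive(Default)]
pub struct InMemoryRecorder {
    pub counters: HashMap<String, u64>,
}

impl Recorder for InMemoryRecorder {
    fn increment_counter(&mut self, name: &str, value: u64) {
        *self.counters.entry(name.to_string()).or_insert(0) += value;
    }
}

/// Another backend: drop everything, for users who want no metrics.
pub struct NoopRecorder;

impl Recorder for NoopRecorder {
    fn increment_counter(&mut self, _name: &str, _value: u64) {}
}

/// Library code depends only on the trait, never on a specific exporter.
pub fn scan_table(recorder: &mut dyn Recorder, data_files: u64) {
    recorder.increment_counter("scan.data_files_read", data_files);
}
```

The trade-off in this kind of design is that the facade can only expose what every backend can represent, which is one way to read the earlier concern about an abstraction not mapping well onto a particular collector.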

sdd · 2024-07-31T21:41:37Z

It sounds like tracing is the preferred option for tracing and logging. I'm happy to raise a PR to add this if we are all in agreement.

Regarding telemetry, does anyone have objections to the choice of opentelemetry?

Xuanwo · 2024-08-01T05:58:18Z

> It sounds like tracing is the preferred option for tracing and logging. I'm happy to raise a PR to add this if we are all in agreement.

I'm fine with this. Also cc @liurenjie1024 for ideas.

> Regarding telemetry, does anyone have objections to the choice of opentelemetry?

opentelemetry's metrics support is currently in the Alpha stage. I believe most people primarily use it for trace. See: https://opentelemetry.io/docs/languages/rust/

I would support using metrics instead if we don't want to depend on a lib like prometheus_client directly.

liurenjie1024 · 2024-08-02T01:04:53Z

> It sounds like tracing is the preferred option for tracing and logging. I'm happy to raise a PR to add this if we are all in agreement.

+1.

> I would support using metrics instead if we don't want to depend on a lib like prometheus_client directly.

+1. I would also prefer to use metrics, since opentelemetry is at too early a stage.

ZENOTME · 2024-08-04T11:36:00Z

> Some additional context. Renjie mentioned:
>
> > Cons:
> > It's mixing logging with tracing. If we use tracing for logging, our logging will not be compatible with logger systems.
>
> This is not quite right. You can use the tracing_log crate to redirect any other crate's logs that went to log to tracing:
>
> use tracing_log::LogTracer;
>
> // redirects all output sent to log:* to their tracing::* equivalent
> LogTracer::init().expect("Failed to set logger");
>
> Regarding metrics, my preference would be either metrics or opentelemetry. Both are vendor-agnostic with pluggable exporters. My first choice would be opentelemetry, as it seems to have better cross-language recognition and a somewhat more active and vibrant community (I'm basing this on GitHub stars, watchers, number of contributors, and commit / merge frequency).

I have a question: the LogTracer in tracing_log is used to convert a log record into a trace event. Does this mean that if users originally use log, then after they introduce iceberg-rust they have two choices:

maintain trace and log separately
use trace and redirect all original log records into trace events.

For users, would it be more ideal to redirect the trace to the original log instead? 🤔

liurenjie1024 added the enhancement (New feature or request) and discussion (Discussion about idea of this project) labels Jul 26, 2024

liurenjie1024 added this to iceberg-rust Jul 26, 2024

liurenjie1024 mentioned this issue Jul 26, 2024

Incorrect Avro schema generated for Tables with a Transform::Day partition causes manifest file parsing to fail #478

Closed

Xuanwo mentioned this issue Jul 29, 2024

Allow collecting tracing from tracing fast/fastrace#13

Open
