Skip to content

Conversation

@davidhcoe
Copy link
Contributor

  • Adds tracing for the BigQuery driver
  • Modifies the behavior for large result sets:
    • Uses a default dataset ID for large result sets if one is not specified.
    • The caller can specify their own dataset. If it does not exist, the driver will attempt to create it.

Copy link
Contributor

@birschick-bq birschick-bq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally ...

  1. You won't need the "Action" tag - it's automatically added as the activity name with the TraceActivity call.
  2. Consider tag keys carefully to try to keep "semantic convention" recommendations.

catch (Exception ex)
{
activity?.AddBigQueryTag("retry_attempt", retryCount);
activity?.AddException(ex);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly right. A handled exception should still be traced.

@davidhcoe davidhcoe marked this pull request as ready for review July 1, 2025 04:32
@github-actions github-actions bot added this to the ADBC Libraries 19 milestone Jul 1, 2025
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've left some feedback and some questions.


AdbcStatement statement = adbcConnection.CreateStatement();
statement.SqlQuery = environment.Query;
statement.SqlQuery = "select * from mashuptest-154002.NWIND.AdbcAllTypes"; // environment.Query;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be reverted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted in recent PR

// default value per https://pkg.go.dev/cloud.google.com/go/bigquery#section-readme
public const string DetectProjectId = "*detect-project-id*";

// What ODBC uses
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider amending the comment to explain why we would use the same value as ODBC (versus e.g. _bqadbc_temp_tables).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment, but I wonder if using _bqadbc_temp_tables might actually be better? On the one hand, assuming the dataset had already been created, it would just reuse that. On the other hand, having a specific one dedicated to ADBC could be useful in some scenarios if people want to understand the usage patterns.

{
activity?.AddBigQueryTag("update_token.status", "Required");
await tokenProtectedResource.UpdateToken();
activity?.AddBigQueryTag("update_token.status", "Completed");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behavior when adding the same tag twice? It keeps the most recent one or it keeps all of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For AddTag, it will keep all of them. For SetTag, it will override and keep the latest.


internal static string GetAssemblyVersion(Type type) => FileVersionInfo.GetVersionInfo(type.Assembly.Location).ProductVersion ?? string.Empty;

internal static bool TracingToFile() { return false; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making a property instead of a method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a method because I thought I would have to put some logic in there to determine whether a file listener was active, but I could be misunderstanding how that part will work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the name and added a TODO in there for now.

const string infoVendorName = "BigQuery";
const string infoDriverArrowVersion = "19.0.0";
const string publicProjectId = "bigquery-public-data";
readonly string? infoDriverArrowVersion;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an instance field while the next two are static fields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in PR

{
UpdateClientToken();
}
activity?.AddTag(key + ".set", value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a bad idea to log an access token under any circumstances. Are there other sensitive option values we should also avoid logging (or tie to BigQueryUtils.TracingToFile, which looks like a proxy for "is safe to log sensitive information")?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an opt-in list in the BigQueryParameters to indicate which BigQueryParameter is safe to log no matter what the listener is.

readonly string? infoDriverArrowVersion;

private static readonly string s_assemblyName = BigQueryUtils.GetAssemblyName(typeof(BigQueryStatement));
private static readonly string s_assemblyVersion = BigQueryUtils.GetAssemblyVersion(typeof(BigQueryStatement));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of capturing these three times in BigQueryConnection, BigQueryStatement and BigQueryStatement.MultiArrowReader, consider storing them in just BigQueryUtils and referencing them from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in PR

{
// Ideally we wouldn't need to indirect through a stream, but the necessary APIs in Arrow
// are internal. (TODO: consider changing Arrow).
activity?.AddBigQueryTag("read_stream", streamName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the stream name system-generated or can it expose an arbitrary user identifier?

I'll stop asking this question, but might apply to other things like the destination table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's system generated in the form projects/{project_id}/locations/{location}/sessions/{session_id}/streams/{stream_id}. Do you not think that should be logged?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ids are GUIDs then that seems unlikely to be a problem. @birschick-bq, is there an OTel convention for marking possibly-sensitive data?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CurtHagenlocher / @davidhcoe

It seems OTel
has described the need to be careful, but the only concrete action seems to be to have filters/scrubbers on the OTel Collector via a processor.
https://opentelemetry.io/docs/security/handling-sensitive-data/

I believe we'll need to add overloads to scrub/anonymize potential PII data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up making this a conditional output to only use if wring to a file.

{
string destinationTable = keyValuePair.Value;
case BigQueryParameters.AllowLargeResults:
options.AllowLargeResults = true ? keyValuePair.Value.ToLower().Equals("true") : false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two existing and one new instance of s.ToLower().Equals("true") in this file, which is strictly worse than s.Equals("true", StringComparer.OrdinalIgnoreCase).

Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks largely good! Let me know if you want to check-in as-is or adjust the ToLowers first.


public static bool IsSafeToLog(string name)
{
if (safeToLog.Contains(name.ToLower()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's generally better to declare the HashSet as taking a case-insensitive comparer than to do ToLower as the latter may cause an allocation. (In this case it probably won't because the options are generally defined as lower-case; this is more about the pattern.)

Otherwise, this should probably be ToLowerInvariant() as I don't think we'd want a culture-specific conversion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I just noticed that there are at least three other ToLower() in this project that maybe should be ToLowerInvariant() -- though in the case of BigQueryConnection.DescToField I'm 1) not sure why we're changing the case of the column name and 2) we're calling ToLower twice on the same value instead of just once and storing it in a local.)

@CurtHagenlocher CurtHagenlocher merged commit 3e481a6 into apache:main Jul 7, 2025
6 checks passed
@davidhcoe davidhcoe deleted the dev/bigquery-tracing branch October 24, 2025 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants