@toddmeng-db toddmeng-db commented Jul 29, 2025

Motivation

The following cases are not properly stopping or disposing the status poller:

  1. If the DatabricksCompositeReader is explicitly disposed by the user
  2. CloudFetchReader is done returning results
  3. Edge case terminal operation status (timedout_state, unknown_state)

In addition:

  • When DatabricksOperationStatusPoller.Dispose() is called, it may cancel the in-flight GetOperationStatusRequest in the client. If cancellation is triggered while the input buffer still has data, the TCLI client is left with unconsumed/unsent data in the buffer, which breaks subsequent requests (fixed in this PR)

Fixes

The DatabricksOperationStatusPoller logic is now managed by DatabricksCompositeReader (moved out of BaseDatabricksReader), so that it handles every case where null results (indicating completion) are returned.

Disposing DatabricksCompositeReader now disposes the activeReader and statusPoller.

TODO

Follow-up PR: when the statement is disposed, it should also dispose the reader (the poller is currently stopped when the operation handle is set to null, but this should also happen explicitly)

Need to add some unit tests (follow-up PR: #3243)

@toddmeng-db toddmeng-db changed the title Error handling for operation status poller fix(csharp/src/Drivers/Databricks): Error handling for operation status poller Jul 29, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 2 times, most recently from ec41720 to 004a5a7 Compare July 29, 2025 17:50
@jadewang-db
Contributor

Can you confirm that, even without this fix, polling stops once the statement is disposed? If not, we need to fix that as well.

@toddmeng-db toddmeng-db changed the title fix(csharp/src/Drivers/Databricks): Error handling for operation status poller fix(csharp/src/Drivers/Databricks): Tighten OperationStatusPoller Disposal Jul 29, 2025

// Add the end of results guard to the queue
_downloadQueue.Add(EndOfResultsGuard.Instance, cancellationToken);
_isCompleted = true;
Contributor Author

@toddmeng-db toddmeng-db Jul 30, 2025

From testing:
Small nit, but I think we need to avoid this here: if DownloadQueue is full, the exception handling would get stuck. Should I modify the exception handling below, or was there a reason why it is like this? (line 262) @jadewang-db

catch (Exception ex)
{
    try
    {
        _downloadQueue.Add(EndOfResultsGuard.Instance, CancellationToken.None);
    }
}

Alternatively, we can create a new CancellationToken with Timeout for this attempt

                    CancellationToken GetOperationStatusTimeoutToken = ApacheUtility.GetCancellationToken(_requestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);

Contributor

thanks, a cancellation token looks good

Contributor Author

Oh just saw this comment, let me implement

Contributor Author

Actually looks like TryAdd is better suited here
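
A minimal sketch of why TryAdd fits the error path, assuming the download queue is a bounded BlockingCollection<T> (the thread does not show its declaration). `QueueSketch`, `TrySignalEnd`, and the string marker are hypothetical stand-ins, not the driver's code. With CancellationToken.None, Add waits indefinitely while the queue is full; TryAdd with a timeout gives up and returns false instead.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for the driver's download queue; names are illustrative only.
public static class QueueSketch
{
    // Tries to enqueue an end-of-results marker without blocking forever.
    // Returns false if the bounded queue stayed full for the whole timeout.
    public static bool TrySignalEnd(
        BlockingCollection<string> queue,
        string endMarker,
        TimeSpan timeout)
    {
        try
        {
            return queue.TryAdd(endMarker, timeout);
        }
        catch (InvalidOperationException)
        {
            // The queue was already marked complete for adding; nothing left to signal.
            return false;
        }
    }
}
```

The caller can treat a false return as "the consumer is not draining the queue" and continue its cleanup rather than hang inside the exception handler.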

@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from 9caf8db to 9fd9fea Compare July 30, 2025 04:32
@toddmeng-db toddmeng-db changed the title fix(csharp/src/Drivers/Databricks): Tighten OperationStatusPoller Disposal fix(csharp/src/Drivers/Databricks): Tighten Statement Disposal Jul 30, 2025
@toddmeng-db toddmeng-db changed the title fix(csharp/src/Drivers/Databricks): Tighten Statement Disposal fix(csharp/src/Drivers/Databricks): Tighten Statement, Reader, Poller Disposal Jul 30, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 9 times, most recently from d55808c to 74c6ee8 Compare July 30, 2025 22:49
{
var operationHandle = _statement.OperationHandle;
if (operationHandle == null) break;

Contributor Author

@toddmeng-db toddmeng-db Jul 30, 2025

We need to use a timeout token here instead of cancelling when the cancellation token is triggered. If an interrupt is triggered prematurely, the TCLI client may still have unsent/unconsumed data in its buffers, which affects subsequent calls with that client (that is, any future call in the same session).
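
A rough sketch of the idea, not the PR's implementation: the stop token is only observed between requests, while each request gets its own timeout-only token. `PollerSketch` and the `sendRequest` delegate are hypothetical; the delegate stands in for the Thrift GetOperationStatus call, and the timeout token is built with a plain CancellationTokenSource rather than the driver's ApacheUtility helper.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical poller; sendRequest stands in for the Thrift status call.
public sealed class PollerSketch
{
    private readonly Func<CancellationToken, Task> _sendRequest;
    private readonly TimeSpan _interval;
    private readonly TimeSpan _requestTimeout;

    public PollerSketch(
        Func<CancellationToken, Task> sendRequest,
        TimeSpan interval,
        TimeSpan requestTimeout)
    {
        _sendRequest = sendRequest;
        _interval = interval;
        _requestTimeout = requestTimeout;
    }

    public async Task RunAsync(CancellationToken stopToken)
    {
        while (!stopToken.IsCancellationRequested)
        {
            // The request only sees a timeout token, so a stop request
            // cannot interrupt it halfway through a write or read.
            using (var timeout = new CancellationTokenSource(_requestTimeout))
            {
                await _sendRequest(timeout.Token).ConfigureAwait(false);
            }

            try
            {
                // Only the wait between requests observes the stop token.
                await Task.Delay(_interval, stopToken).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
    }
}
```

The trade-off is that stopping can take up to one request timeout, in exchange for never abandoning a request that is partially written to the shared client.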

Contributor

are you able to repro this? should we do this to all the thrift rpc calls in the driver?

Contributor Author

@toddmeng-db toddmeng-db Aug 1, 2025

I think it is because in THttpTransport (used by SparkHttpConnection -> DatabricksHttpconnection), a new Stream is created when the request is flushed. If cancellation happens before this, that stream doesn't get discarded:
https://github.com/apache/thrift/blob/master/lib/netstd/Thrift/Transport/Client/THttpTransport.cs#L281

Yes, I got some errors during testing. In the proxy logs, I remember seeing requests sent out with both GetOperationStatus and CloseOperationStatus (in the same request) while testing another PR

I think we are safe in HiveServer2Statement, but we might need to adjust CancellationToken in DatabricksReader, CloudFetchResultFetcher, and DatabricksCompositeReader

Contributor Author

Actually, I think this depends a bit on how CancellationToken could be used by PBI, too
@CurtHagenlocher will mashup ever trigger cancellationTokens passed into IArrowStreamReader.ReadNextBatchAsync? Do we need to ensure that the connection still remains usable for subsequent statements?

Contributor Author

@toddmeng-db toddmeng-db Aug 2, 2025

At least for now, I think we can operate this way:

  1. If the user cancels the token passed in to ReadNextBatchAsync, we should not break the client
  2. Dispose() should not break the client either

Contributor

@CurtHagenlocher will mashup ever trigger cancellationTokens passed into IArrowStreamReader.ReadNextBatchAsync? Do we need to ensure that the connection still remains usable for subsequent statements?

This is currently unimplemented but we'll need to implement it before GA for parity with the ODBC implementation. What is probably most important for cancellation is query execution, and unless we manage to push forward the proposed ADBC 1.1 API, currently the only way to cancel a running query is to call AdbcStatement.Cancel. There is currently no implementation of this method for any of the C#-implemented drivers :(.

Contributor

From a Power BI perspective, the most important use of cancellation is for Direct Query because users can generate a lot of queries simply by clicking around in a visual and in-progress queries will need to be cancelled if their output is no longer needed. DQ output tends to be relatively small, so being able to cancel in the middle of reading the output is arguably less important than being able to cancel before the results start coming back.

@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 4 times, most recently from ecb0771 to 3263cef Compare July 31, 2025 04:52
@toddmeng-db toddmeng-db changed the title fix(csharp/src/Drivers/Databricks): Tighten Statement, Reader, Poller Disposal fix(csharp/src/Drivers/Databricks): Correct StatusPoller to Stop/Dispose Appropriately Aug 1, 2025
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 4 times, most recently from 8b88019 to 8e54490 Compare August 1, 2025 16:44
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from 579e26d to be06c48 Compare August 2, 2025 01:30
@toddmeng-db toddmeng-db requested a review from jadewang-db August 4, 2025 17:03

// use direct results if available
if (_statement.HasDirectResults && _statement.DirectResults != null && _statement.DirectResults.__isset.resultSet)
if (_statement.HasDirectResults && _statement.DirectResults != null && _statement.DirectResults.__isset.resultSet && statement.DirectResults?.ResultSet != null)
Contributor

Can this be simplified to if (_statement.HasDirectResults)? It looks like that method is performing the same checks

Contributor Author

I think it is a bit helpful for the linter

_operationStatusPollingTask?.Wait();
try
{
if (_operationStatusPollingTask != null)
Contributor

Similarly, this check looks redundant because you are already doing _operationStatusPollingTask?

Contributor Author

ah thanks, good catch

@CurtHagenlocher CurtHagenlocher changed the title fix(csharp/src/Drivers/Databricks): Correct DatabricksCompositeReader and StatusPoller to Stop/Dispose Appropriately fix(csharp/src/Drivers/Databricks): Correct DatabricksCompositeReader and StatusPoller to Stop/Dispose Appropriately Aug 5, 2025
@toddmeng-db toddmeng-db marked this pull request as ready for review August 6, 2025 21:35
@github-actions github-actions bot added this to the ADBC Libraries 20 milestone Aug 6, 2025
request.StartRowOffset = offset;

// Cancelling mid-request breaks the client; Dispose() should not break the underlying client
CancellationToken expiringToken = ApacheUtility.GetCancellationToken(DatabricksConstants.DefaultCloudFetchRequestTimeoutSeconds, ApacheUtility.TimeUnit.Seconds);
Contributor

should you respect the connection parameter DatabricksParameters.CloudFetchTimeoutMinutes instead of the default value?

Contributor Author

Do you mean I shouldn't create a new constant here?

Contributor

no, what I meant is whether you should check the value of the connection parameter CloudFetchTimeoutMinutes (adbc.databricks.cloudfetch.timeout_minutes), which can be set by the client and customer.

Contributor Author

@toddmeng-db toddmeng-db Aug 7, 2025

Oh got it, that makes sense, it should be a configurable parameter. To be consistent with the rest of HiveServer2Statement, I'm just using the QueryTimeout parameter (which is what other FetchResultsRequest uses)

I have some changes in a follow-up PR that will make this change easier to do for DatabricksReader, will leave this as a TODO

if (!statement.DirectResults.ResultSet.HasMoreRows)
{
return;
}
Contributor

nit: returning in the middle of the constructor may leave something only partially initialized. Can we do it the other way around?

Contributor Author

@toddmeng-db toddmeng-db Aug 7, 2025

it's a bit more efficient for linting, otherwise requires a bunch of null checks

Contributor

@jackyhu-db jackyhu-db Aug 7, 2025

what I suggested is below

if (statement.DirectResults.ResultSet.HasMoreRows)
{
  operationStatusPoller = new DatabricksOperationStatusPoller(statement);
  operationStatusPoller.Start();
}

so if in the future, you add some other initialization on other private variables after this, they will not be missed when statement.DirectResults.ResultSet.HasMoreRows is false

@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 2 times, most recently from 9242fd2 to efecc82 Compare August 7, 2025 19:40
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from efecc82 to 65f9d0d Compare August 7, 2025 19:41
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Thanks! The linter error needs to be fixed and I made a few small low-priority suggestions.

}
catch (Exception)
{

Contributor

nit: remove blank line

if (_statement.DirectResults?.ResultSet.HasMoreRows ?? true)
{
operationStatusPoller = new DatabricksOperationStatusPoller(statement);
operationStatusPoller.Start();
Contributor

There are some trailing spaces on this line that the linter doesn't like.


private void StopOperationStatusPoller()
{
operationStatusPoller?.Stop();
Contributor

Consider setting to null here instead of the DisposeOperationStatusPoller method to avoid duplicate calls.

private Task? _operationStatusPollingTask;

public DatabricksOperationStatusPoller(IHiveServer2Statement statement, int heartbeatIntervalSeconds = DatabricksConstants.DefaultOperationStatusPollingIntervalSeconds)
public DatabricksOperationStatusPoller(IHiveServer2Statement statement, int heartbeatIntervalSeconds = DatabricksConstants.DefaultOperationStatusPollingIntervalSeconds, int requestTimeoutSeconds = DatabricksConstants.DefaultOperationStatusRequestTimeoutSeconds)
Contributor

Suggested change
public DatabricksOperationStatusPoller(IHiveServer2Statement statement, int heartbeatIntervalSeconds = DatabricksConstants.DefaultOperationStatusPollingIntervalSeconds, int requestTimeoutSeconds = DatabricksConstants.DefaultOperationStatusRequestTimeoutSeconds)
public DatabricksOperationStatusPoller(
    IHiveServer2Statement statement,
    int heartbeatIntervalSeconds = DatabricksConstants.DefaultOperationStatusPollingIntervalSeconds,
    int requestTimeoutSeconds = DatabricksConstants.DefaultOperationStatusRequestTimeoutSeconds)

i.e. split across multiple lines

public async Task StopStopsPolling()
{
// Arrange
var poller = new DatabricksOperationStatusPoller(_mockStatement.Object, _heartbeatIntervalSeconds);
Contributor

Consider a using block for poller instead of an explicit Dispose. This will ensure that the Dispose happens even if an exception is thrown inside the using block. Applies to many of the tests in this file, and probably not super-important unless the failure to Dispose in one test could cause another test to fail.
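
To illustrate the suggestion, here is a small sketch with a hypothetical `FakePoller` in place of DatabricksOperationStatusPoller. It only shows that a using block calls Dispose even when the body throws, which is what happens when an assertion fails in the middle of a test.

```csharp
using System;

// Hypothetical stand-in for the poller; it records whether Dispose ran.
public sealed class FakePoller : IDisposable
{
    public bool Disposed { get; private set; }

    public void Dispose() => Disposed = true;
}

public static class UsingSketch
{
    // Runs a body that throws, then returns the poller for inspection.
    public static FakePoller RunFailingBody()
    {
        var poller = new FakePoller();
        try
        {
            using (poller)
            {
                throw new InvalidOperationException("simulated assertion failure");
            }
        }
        catch (InvalidOperationException)
        {
            // Swallowed only so the sketch can report the outcome.
        }

        return poller;
    }
}
```

After `UsingSketch.RunFailingBody()` returns, `Disposed` is true. With an explicit `poller.Dispose()` at the end of the body, the throw would have skipped it.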

@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch 4 times, most recently from f559692 to 5a48ef2 Compare August 8, 2025 19:54
@toddmeng-db toddmeng-db force-pushed the toddmeng-db/operation-status-poller-error-handling branch from 5a48ef2 to 4130c83 Compare August 8, 2025 19:55
@CurtHagenlocher CurtHagenlocher merged commit f0f36da into apache:main Aug 8, 2025
7 checks passed